Add model card
- README.md +70 -0
- training.md +146 -0
README.md
ADDED
@@ -0,0 +1,70 @@
---
language:
- de
thumbnail:
tags:
-
-
-
license:
datasets:
- german-nlp-group/german_common_crawl
metrics:
-
-
---

# GPT-2 GERMAN

## Model description

TODO

## Intended uses & limitations

#### How to use
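Something along these lines should work once the model is published on the Hub (a sketch; the repository id below is a placeholder and should be replaced with the actual one):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

model_id = "<org>/<model-name>"  # placeholder: replace with the actual repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
# If the repository only contains Flax weights, add from_flax=True here.
model = AutoModelForCausalLM.from_pretrained(model_id)

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generator("Heute ist ein schöner Tag und", max_length=50)[0]["generated_text"])
```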

#### Limitations and bias

Provide examples of latent issues and potential remediations.

## Training data

https://huggingface.co/datasets/german-nlp-group/german_common_crawl

A sample record from the dataset looks as follows:

```python
{'url': 'http://my-shop.ru/shop/books/545473.html',
 'date_download': '2016-10-20T19:38:58Z',
 'digest': 'sha1:F62EMGYLZDIKF4UL5JZYU47KWGGUBT7T',
 'length': 1155,
 'nlines': 4,
 'source_domain': 'my-shop.ru',
 'title': 'Grammatikalische Liebeslieder. Methodische Vorschläge',
 'raw_content': 'Grammatikalische Liebeslieder. [....]',
 'cc_segment': 'crawl-data/CC-MAIN-2016-44/segments/1476988717783.68/wet/CC-MAIN-20161020183837-00354-ip-10-171-6-4.ec2.internal.warc.wet.gz',
 'original_nlines': 99,
 'original_length': 2672,
 'language': 'de',
 'language_score': 1.0,
 'perplexity': 283.0,
 'bucket': 'head'}
```
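For reference, records like the one above can be inspected with the 🤗 Datasets library (a minimal sketch; the exact configuration and split names may differ, so check the dataset card):

```python
from datasets import load_dataset

# Streaming avoids downloading the full Common Crawl dump at once
# (remove streaming=True if the loading script does not support it).
dataset = load_dataset(
    "german-nlp-group/german_common_crawl",
    split="train",
    streaming=True,
)

# Peek at the first record.
print(next(iter(dataset)))
```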

## Training procedure

TODO (See training.md)

## Eval results

### BibTeX entry and citation info

```bibtex
@inproceedings{...,
  year={2021}
}
```
training.md
ADDED
@@ -0,0 +1,146 @@
<!---
Copyright 2021 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Language model training examples

The following example showcases how to train a language model from scratch
using the JAX/Flax backend.

JAX/Flax allows you to trace pure functions and compile them into efficient, fused accelerator code on both GPU and TPU.
Models written in JAX/Flax are **immutable** and updated in a purely functional
way, which enables simple and efficient model parallelism.
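As a small, self-contained illustration of these two points (not part of the example scripts), the sketch below jit-compiles a pure function and performs a purely functional parameter update:

```python
import jax
import jax.numpy as jnp

# A pure function of (params, inputs): no hidden state, no in-place mutation.
def predict(params, x):
    return jnp.dot(x, params["w"]) + params["b"]

# jax.jit traces the pure function once and compiles it to fused XLA code
# that runs unchanged on CPU, GPU, or TPU.
fast_predict = jax.jit(predict)

params = {"w": jnp.ones((4, 2)), "b": jnp.zeros(2)}
x = jnp.ones((3, 4))

# "Updating" the model returns a *new* params pytree; the old one is untouched.
grads = jax.grad(lambda p, inputs: fast_predict(p, inputs).sum())(params, x)
new_params = jax.tree_util.tree_map(lambda p, g: p - 0.01 * g, params, grads)
```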

## Causal language modeling

In the following, we demonstrate how to train an auto-regressive causal transformer model
in JAX/Flax.
More specifically, we pretrain a randomly initialized 124M-parameter [**`gpt2`**](https://huggingface.co/gpt2) model
in Norwegian on a single TPUv3-8.

The example script uses the 🤗 Datasets library. You can easily customize it to your needs if you need extra processing on your datasets.

Let's start by creating a model repository to save the trained model and logs.
Here we call the model `"norwegian-gpt2"`, but you can change the model name as you like.

You can do this either directly on [huggingface.co](https://huggingface.co/new) (assuming that
you are logged in) or via the command line:

```bash
huggingface-cli repo create norwegian-gpt2
```

Next we clone the model repository to add the tokenizer and model files.

```bash
git clone https://huggingface.co/<your-username>/norwegian-gpt2
```

To ensure that all tensorboard traces will be uploaded correctly, we need to
track them. You can run the following command inside your model repo to do so.

```bash
cd norwegian-gpt2
git lfs track "*tfevents*"
```

Great, we have set up our model repository. During training, we will automatically
push the training logs and model weights to the repo.

Next, let's add a symbolic link to the `run_clm_flax.py` script.

```bash
export MODEL_DIR="./norwegian-gpt2"
ln -s ~/transformers/examples/flax/language-modeling/run_clm_flax.py run_clm_flax.py
```

### Train tokenizer

In the first step, we train a tokenizer to efficiently process the text input for the model. Similar to how it is shown in [How to train a new language model from scratch using Transformers and Tokenizers](https://huggingface.co/blog/how-to-train), we use a **`ByteLevelBPETokenizer`**.
The tokenizer is trained on the complete Norwegian dataset of OSCAR
and subsequently saved in `${MODEL_DIR}`.
This can take up to 10 minutes depending on your hardware ☕.

```python
from datasets import load_dataset
from tokenizers import trainers, Tokenizer, normalizers, ByteLevelBPETokenizer

model_dir = "./norwegian-gpt2"  # ${MODEL_DIR}

# Load the Norwegian portion of the OSCAR dataset
dataset = load_dataset("oscar", "unshuffled_deduplicated_no", split="train")

# Instantiate tokenizer
tokenizer = ByteLevelBPETokenizer()

def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i: i + batch_size]["text"]

# Customized training
tokenizer.train_from_iterator(batch_iterator(), vocab_size=50265, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

# Save files to disk
tokenizer.save(f"{model_dir}/tokenizer.json")
```
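If you want to quickly sanity-check the trained tokenizer (optional, and not part of the original example), the saved `tokenizer.json` can be loaded back as a fast tokenizer:

```python
from transformers import PreTrainedTokenizerFast

model_dir = "./norwegian-gpt2"  # ${MODEL_DIR}

# Wrap the raw tokenizers file in a transformers-compatible fast tokenizer.
tokenizer = PreTrainedTokenizerFast(tokenizer_file=f"{model_dir}/tokenizer.json")
print(tokenizer.tokenize("Dette er en liten norsk setning."))
```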

### Create configuration

Next, we create the model's configuration file. This is as simple
as loading and storing the [**`gpt2`**](https://huggingface.co/gpt2) config
in the local model folder:

```python
from transformers import GPT2Config

model_dir = "./norwegian-gpt2"  # ${MODEL_DIR}

# Reuse gpt2's architecture, but disable dropout for this pre-training setup
config = GPT2Config.from_pretrained("gpt2", resid_pdrop=0.0, embd_pdrop=0.0, attn_pdrop=0.0)
config.save_pretrained(model_dir)
```

### Train model

Next we can run the example script to pretrain the model:

```bash
./run_clm_flax.py \
    --output_dir="${MODEL_DIR}" \
    --model_type="gpt2" \
    --config_name="${MODEL_DIR}" \
    --tokenizer_name="${MODEL_DIR}" \
    --dataset_name="oscar" \
    --dataset_config_name="unshuffled_deduplicated_no" \
    --do_train --do_eval \
    --block_size="512" \
    --per_device_train_batch_size="64" \
    --per_device_eval_batch_size="64" \
    --learning_rate="5e-3" --warmup_steps="1000" \
    --adam_beta1="0.9" --adam_beta2="0.98" --weight_decay="0.01" \
    --overwrite_output_dir \
    --num_train_epochs="20" \
    --push_to_hub
```
+
|
| 143 |
+
Training should converge at a loss and perplexity
|
| 144 |
+
of 3.24 and 25.72 respectively after 20 epochs on a single TPUv3-8.
|
| 145 |
+
This should take less than ~21 hours.
|
| 146 |
+
Training statistics can be accessed on [tfhub.de](https://tensorboard.dev/experiment/2zEhLwJ0Qp2FAkI3WVH9qA).
|