Add model card
- README.md +70 -0
- training.md +146 -0
README.md
ADDED
@@ -0,0 +1,70 @@
---
language:
- de
thumbnail:
tags:
-
-
-
license:
datasets:
- german-nlp-group/german_common_crawl
metrics:
-
-
---

# GPT-2 GERMAN

## Model description

TODO

## Intended uses & limitations

#### How to use
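Something along these lines should work once the model is published on the Hub (a sketch; the repository id below is a placeholder and should be replaced with the actual one):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

model_id = "<org>/<model-name>"  # placeholder: replace with the actual repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
# If the repository only contains Flax weights, add from_flax=True here.
model = AutoModelForCausalLM.from_pretrained(model_id)

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generator("Heute ist ein schöner Tag und", max_length=50)[0]["generated_text"])
```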

#### Limitations and bias

Provide examples of latent issues and potential remediations.

## Training data

https://huggingface.co/datasets/german-nlp-group/german_common_crawl

A sample record from the dataset looks as follows:

```python
{'url': 'http://my-shop.ru/shop/books/545473.html',
 'date_download': '2016-10-20T19:38:58Z',
 'digest': 'sha1:F62EMGYLZDIKF4UL5JZYU47KWGGUBT7T',
 'length': 1155,
 'nlines': 4,
 'source_domain': 'my-shop.ru',
 'title': 'Grammatikalische Liebeslieder. Methodische Vorschläge',
 'raw_content': 'Grammatikalische Liebeslieder. [....]',
 'cc_segment': 'crawl-data/CC-MAIN-2016-44/segments/1476988717783.68/wet/CC-MAIN-20161020183837-00354-ip-10-171-6-4.ec2.internal.warc.wet.gz',
 'original_nlines': 99,
 'original_length': 2672,
 'language': 'de',
 'language_score': 1.0,
 'perplexity': 283.0,
 'bucket': 'head'}
```
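For reference, records like the one above can be inspected with the 🤗 Datasets library (a minimal sketch; the exact configuration and split names may differ, so check the dataset card):

```python
from datasets import load_dataset

# Streaming avoids downloading the full Common Crawl dump at once
# (remove streaming=True if the loading script does not support it).
dataset = load_dataset(
    "german-nlp-group/german_common_crawl",
    split="train",
    streaming=True,
)

# Peek at the first record.
print(next(iter(dataset)))
```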

## Training procedure

TODO (See training.md)

## Eval results

### BibTeX entry and citation info

```bibtex
@inproceedings{...,
  year={2021}
}
```
training.md
ADDED
@@ -0,0 +1,146 @@
<!---
Copyright 2021 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Language model training examples

The following example showcases how to train a language model from scratch
using the JAX/Flax backend.

JAX/Flax allows you to trace pure functions and compile them into efficient, fused accelerator code on both GPU and TPU.
Models written in JAX/Flax are **immutable** and updated in a purely functional
way, which enables simple and efficient model parallelism.
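As a small, self-contained illustration of these two points (not part of the example scripts), the sketch below jit-compiles a pure function and performs a purely functional parameter update:

```python
import jax
import jax.numpy as jnp

# A pure function of (params, inputs): no hidden state, no in-place mutation.
def predict(params, x):
    return jnp.dot(x, params["w"]) + params["b"]

# jax.jit traces the pure function once and compiles it to fused XLA code
# that runs unchanged on CPU, GPU, or TPU.
fast_predict = jax.jit(predict)

params = {"w": jnp.ones((4, 2)), "b": jnp.zeros(2)}
x = jnp.ones((3, 4))

# "Updating" the model returns a *new* params pytree; the old one is untouched.
grads = jax.grad(lambda p, inputs: fast_predict(p, inputs).sum())(params, x)
new_params = jax.tree_util.tree_map(lambda p, g: p - 0.01 * g, params, grads)
```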

## Causal language modeling

In the following, we demonstrate how to train an auto-regressive causal transformer model
in JAX/Flax.
More specifically, we pretrain a randomly initialized 124M-parameter [**`gpt2`**](https://huggingface.co/gpt2) model
in Norwegian on a single TPUv3-8.

The example script uses the 🤗 Datasets library. You can easily customize it to your needs if you need extra processing on your datasets.

Let's start by creating a model repository to save the trained model and logs.
Here we call the model `"norwegian-gpt2"`, but you can change the model name as you like.

You can do this either directly on [huggingface.co](https://huggingface.co/new) (assuming that
you are logged in) or via the command line:

```bash
huggingface-cli repo create norwegian-gpt2
```

Next we clone the model repository to add the tokenizer and model files.

```bash
git clone https://huggingface.co/<your-username>/norwegian-gpt2
```

To ensure that all tensorboard traces will be uploaded correctly, we need to
track them. You can run the following command inside your model repo to do so.

```bash
cd norwegian-gpt2
git lfs track "*tfevents*"
```

Great, we have set up our model repository. During training, we will automatically
push the training logs and model weights to the repo.

Next, let's add a symbolic link to the `run_clm_flax.py` script.

```bash
export MODEL_DIR="./norwegian-gpt2"
ln -s ~/transformers/examples/flax/language-modeling/run_clm_flax.py run_clm_flax.py
```

### Train tokenizer

In the first step, we train a tokenizer to efficiently process the text input for the model. Similar to how it is shown in [How to train a new language model from scratch using Transformers and Tokenizers](https://huggingface.co/blog/how-to-train), we use a **`ByteLevelBPETokenizer`**.
The tokenizer is trained on the complete Norwegian dataset of OSCAR
and subsequently saved in `${MODEL_DIR}`.
This can take up to 10 minutes depending on your hardware ☕.

```python
from datasets import load_dataset
from tokenizers import trainers, Tokenizer, normalizers, ByteLevelBPETokenizer

model_dir = "./norwegian-gpt2"  # ${MODEL_DIR}

# Load the Norwegian portion of the OSCAR dataset
dataset = load_dataset("oscar", "unshuffled_deduplicated_no", split="train")

# Instantiate tokenizer
tokenizer = ByteLevelBPETokenizer()

def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i: i + batch_size]["text"]

# Customized training
tokenizer.train_from_iterator(batch_iterator(), vocab_size=50265, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

# Save files to disk
tokenizer.save(f"{model_dir}/tokenizer.json")
```
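If you want to quickly sanity-check the trained tokenizer (optional, and not part of the original example), the saved `tokenizer.json` can be loaded back as a fast tokenizer:

```python
from transformers import PreTrainedTokenizerFast

model_dir = "./norwegian-gpt2"  # ${MODEL_DIR}

# Wrap the raw tokenizers file in a transformers-compatible fast tokenizer.
tokenizer = PreTrainedTokenizerFast(tokenizer_file=f"{model_dir}/tokenizer.json")
print(tokenizer.tokenize("Dette er en liten norsk setning."))
```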

### Create configuration

Next, we create the model's configuration file. This is as simple
as loading and storing the [**`gpt2`**](https://huggingface.co/gpt2) config
in the local model folder:

```python
from transformers import GPT2Config

model_dir = "./norwegian-gpt2"  # ${MODEL_DIR}

# Reuse gpt2's architecture, but disable dropout for this pre-training setup
config = GPT2Config.from_pretrained("gpt2", resid_pdrop=0.0, embd_pdrop=0.0, attn_pdrop=0.0)
config.save_pretrained(model_dir)
```

### Train model

Next we can run the example script to pretrain the model:

```bash
./run_clm_flax.py \
    --output_dir="${MODEL_DIR}" \
    --model_type="gpt2" \
    --config_name="${MODEL_DIR}" \
    --tokenizer_name="${MODEL_DIR}" \
    --dataset_name="oscar" \
    --dataset_config_name="unshuffled_deduplicated_no" \
    --do_train --do_eval \
    --block_size="512" \
    --per_device_train_batch_size="64" \
    --per_device_eval_batch_size="64" \
    --learning_rate="5e-3" --warmup_steps="1000" \
    --adam_beta1="0.9" --adam_beta2="0.98" --weight_decay="0.01" \
    --overwrite_output_dir \
    --num_train_epochs="20" \
    --push_to_hub
```
+
|
| 143 |
+
Training should converge at a loss and perplexity
|
| 144 |
+
of 3.24 and 25.72 respectively after 20 epochs on a single TPUv3-8.
|
| 145 |
+
This should take less than ~21 hours.
|
| 146 |
+
Training statistics can be accessed on [tfhub.de](https://tensorboard.dev/experiment/2zEhLwJ0Qp2FAkI3WVH9qA).
|