<!---
Copyright 2021 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Language model training examples

The following example showcases how to train a language model from scratch
using the JAX/Flax backend.

JAX/Flax allows you to trace pure functions and compile them into efficient, fused accelerator code on both GPU and TPU.
Models written in JAX/Flax are **immutable** and updated in a purely functional
way which enables simple and efficient model parallelism.
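
To make the point about pure functions and functional updates concrete, here is a minimal, self-contained JAX sketch (not taken from the example scripts; the toy model and names are only illustrative) that jit-compiles a training step and returns new parameters instead of mutating them:

```python
import jax
import jax.numpy as jnp

# A pure function: parameters go in, predictions come out, nothing is mutated.
def predict(params, x):
    return jnp.dot(x, params["w"]) + params["b"]

def loss_fn(params, x, y):
    return jnp.mean((predict(params, x) - y) ** 2)

# jax.jit traces the pure function once and compiles it into fused XLA code.
@jax.jit
def update(params, x, y, learning_rate=0.1):
    grads = jax.grad(loss_fn)(params, x, y)
    # Parameters are never modified in place; a new pytree is returned instead.
    return jax.tree_util.tree_map(lambda p, g: p - learning_rate * g, params, grads)

params = {"w": jnp.zeros((3,)), "b": jnp.zeros(())}
x, y = jnp.ones((8, 3)), jnp.ones((8,))
params = update(params, x, y)  # the updated parameters are a new object
```

The example scripts below follow the same pattern, just with a Flax transformer model and an `optax` optimizer instead of the hand-written gradient step.
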
## Causal language modeling

In the following, we demonstrate how to train an auto-regressive causal transformer model
in JAX/Flax.
More specifically, we pretrain a randomly initialized 124M-parameter [**`gpt2`**](https://huggingface.co/gpt2) model
in Norwegian on a single TPUv3-8 pod.

The example script uses the 🤗 Datasets library. You can easily customize it to your needs if you need extra processing on your datasets.

Let's start by creating a model repository to save the trained model and logs.
Here we call the model `"norwegian-gpt2"`, but you can change the model name as you like.

You can do this either directly on [huggingface.co](https://huggingface.co/new) (assuming that
you are logged in) or via the command line:
```
huggingface-cli repo create norwegian-gpt2
```
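
If you prefer staying in Python, the same repository can be created with the `huggingface_hub` library (a sketch, assuming you are already logged in with a write token):

```python
from huggingface_hub import create_repo

# Equivalent to `huggingface-cli repo create norwegian-gpt2`;
# exist_ok=True avoids an error if the repository already exists.
create_repo("norwegian-gpt2", exist_ok=True)
```
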
Next we clone the model repository to add the tokenizer and model files.

```
git clone https://huggingface.co/<your-username>/norwegian-gpt2
```
To ensure that all tensorboard traces will be uploaded correctly, we need to
track them. You can run the following command inside your model repo to do so.

```
cd norwegian-gpt2
git lfs track "*tfevents*"
```

Great, we have set up our model repository. During training, we will automatically
push the training logs and model weights to the repo.

Next, let's add a symbolic link to the `run_clm_flax.py` script.

```bash
export MODEL_DIR="./norwegian-gpt2"
ln -s ~/transformers/examples/flax/language-modeling/run_clm_flax.py run_clm_flax.py
```
### Train tokenizer

In the first step, we train a tokenizer to efficiently process the text input for the model. Similar to how it is shown in [How to train a new language model from scratch using Transformers and Tokenizers](https://huggingface.co/blog/how-to-train), we use a **`ByteLevelBPETokenizer`**.

The tokenizer is trained on the complete Norwegian dataset of OSCAR
and consequently saved in `${MODEL_DIR}`. This can take up to 10 minutes depending on your hardware ☕.
```python
from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer

model_dir = "./norwegian-gpt2"  # ${MODEL_DIR}

# load dataset
dataset = load_dataset("oscar", "unshuffled_deduplicated_no", split="train")

# Instantiate tokenizer
tokenizer = ByteLevelBPETokenizer()

def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i: i + batch_size]["text"]

# Customized training; the vocabulary size matches gpt2's 50257 tokens
tokenizer.train_from_iterator(batch_iterator(), vocab_size=50257, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

# Save files to disk
tokenizer.save(f"{model_dir}/tokenizer.json")
```
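
As a quick sanity check (not part of the example script; the sample sentence is only illustrative), the saved `tokenizer.json` can be loaded back with the fast tokenizer wrapper from 🤗 Transformers:

```python
from transformers import PreTrainedTokenizerFast

model_dir = "./norwegian-gpt2"  # ${MODEL_DIR}

# Wrap the raw tokenizers file so it behaves like any other 🤗 tokenizer.
tokenizer = PreTrainedTokenizerFast(tokenizer_file=f"{model_dir}/tokenizer.json")

ids = tokenizer("Hei, hvordan har du det?")["input_ids"]
print(ids)
print(tokenizer.decode(ids))
```
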
### Create configuration

Next, we create the model's configuration file. This is as simple
as loading and storing the configuration of [**`gpt2`**](https://huggingface.co/gpt2)
in the local model folder:

```python
from transformers import GPT2Config

model_dir = "./norwegian-gpt2"  # ${MODEL_DIR}

config = GPT2Config.from_pretrained("gpt2", resid_pdrop=0.0, embd_pdrop=0.0, attn_pdrop=0.0)
config.save_pretrained(model_dir)
```
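
Optionally, you can instantiate a randomly initialized Flax model from the stored configuration to verify that everything lines up (a sketch; the parameter count is only printed for illustration):

```python
import jax
import jax.numpy as jnp
from transformers import FlaxGPT2LMHeadModel, GPT2Config

model_dir = "./norwegian-gpt2"  # ${MODEL_DIR}

# Re-load the saved config and build a randomly initialized Flax model from it.
config = GPT2Config.from_pretrained(model_dir)
model = FlaxGPT2LMHeadModel(config, seed=0, dtype=jnp.float32)

# The default gpt2 architecture has roughly 124M parameters.
num_params = sum(p.size for p in jax.tree_util.tree_leaves(model.params))
print(f"{num_params / 1e6:.0f}M parameters")
```
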
### Train model

Next we can run the example script to pretrain the model:

```bash
./run_clm_flax.py \
    --output_dir="${MODEL_DIR}" \
    --model_type="gpt2" \
    --config_name="${MODEL_DIR}" \
    --tokenizer_name="${MODEL_DIR}" \
    --dataset_name="oscar" \
    --dataset_config_name="unshuffled_deduplicated_no" \
    --do_train --do_eval \
    --block_size="512" \
    --per_device_train_batch_size="64" \
    --per_device_eval_batch_size="64" \
    --learning_rate="5e-3" --warmup_steps="1000" \
    --adam_beta1="0.9" --adam_beta2="0.98" --weight_decay="0.01" \
    --overwrite_output_dir \
    --num_train_epochs="20" \
    --push_to_hub
```
Training should converge at a loss and perplexity
of 3.24 and 25.72 respectively after 20 epochs on a single TPUv3-8.
This should take less than 21 hours.
Training statistics can be accessed on [tensorboard.dev](https://tensorboard.dev/experiment/2zEhLwJ0Qp2FAkI3WVH9qA).
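
Once training has finished, you can load the checkpoint from `${MODEL_DIR}` back into Flax for a quick smoke test (a sketch; the prompt and generation settings are only illustrative):

```python
from transformers import FlaxGPT2LMHeadModel, PreTrainedTokenizerFast

model_dir = "./norwegian-gpt2"  # ${MODEL_DIR}

model = FlaxGPT2LMHeadModel.from_pretrained(model_dir)
tokenizer = PreTrainedTokenizerFast(tokenizer_file=f"{model_dir}/tokenizer.json")

# Greedily generate a short continuation for a Norwegian prompt.
inputs = tokenizer("Norge er et land i", return_tensors="np")
outputs = model.generate(inputs["input_ids"], max_length=32)
print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=True))
```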