[Flax] Add T5 pretraining script (#12355)

* fix_torch_device_generate_test * remove @ * add length computatan * finish masking * finish * upload * fix some bugs * finish * fix dependency table * correct tensorboard * Apply suggestions from code review * correct processing * slight change init * correct some more mistakes * apply suggestions * improve readme * fix indent * Apply suggestions from code review Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com> * correct tokenizer * finish * finish * finish * finish Co-authored-by: Patrick von Platen <patrick@huggingface.co> Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
2021-06-28 20:11:29 +01:00
parent e277074889
commit 31c3e7e75b
7 changed files with 1014 additions and 16 deletions
--- a/examples/flax/language-modeling/README.md
+++ b/examples/flax/language-modeling/README.md
@@ -241,6 +241,140 @@ of 3.24 and 25.72 respectively after 20 epochs on a single TPUv3-8.
 This should take less than ~21 hours.
 Training statistics can be accessed on [tfhub.de](https://tensorboard.dev/experiment/2zEhLwJ0Qp2FAkI3WVH9qA).

+## T5-like span-masked language modeling
+
+In the following, we demonstrate how to train a T5 model using the span-masked language model
+objective as proposed in [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683).
+More specifically, we demonstrate how JAX/Flax can be leveraged 
+to pre-train [**`t5-small`**](https://huggingface.co/t5-small)
+in Norwegian on a single TPUv3-8 pod.
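To make the objective concrete, here is a minimal sketch of span corruption on plain word tokens. It follows the illustration in the paper: each masked span is replaced by a sentinel in the inputs, and the targets list each sentinel followed by the tokens it hides, closed by a final sentinel. The function name `span_corrupt` and the explicit `spans` argument are assumptions made for illustration; the example script works on token IDs and samples the spans randomly.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel and build the targets.

    `spans` holds non-overlapping half-open index ranges (hypothetical input;
    a real pipeline would sample them randomly).
    """
    inputs, targets = [], []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{sentinel_id}>"
        # Keep the unmasked tokens before the span, then drop in the sentinel.
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)
        # The target repeats the sentinel, followed by the hidden tokens.
        targets.append(sentinel)
        targets.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])
    # A final sentinel closes the target sequence, as in the paper's figure.
    targets.append(f"<extra_id_{len(spans)}>")
    return inputs, targets


tokens = "Thank you for inviting me to your party last week".split()
inputs, targets = span_corrupt(tokens, [(2, 4), (8, 9)])
print(" ".join(inputs))
print(" ".join(targets))
```

The model then learns to generate the target sequence from the corrupted input.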
+
+The example script uses the 🤗 Datasets library. You can easily customize it to your needs if you need extra processing on your datasets.
+
+Let's start by creating a model repository to save the trained model and logs.
+Here we call the model `"norwegian-t5-small"`, but you can change the model name as you like.
+
+You can do this either directly on [huggingface.co](https://huggingface.co/new) (assuming that
+you are logged in) or via the command line:
+
+```
+huggingface-cli repo create norwegian-t5-small
+```
+
+Next we clone the model repository to add the tokenizer and model files.
+
+```
+git clone https://huggingface.co/<your-username>/norwegian-t5-small
+```
+
+To ensure that all tensorboard traces will be uploaded correctly, we need to 
+track them. You can run the following command inside your model repo to do so.
+
+```
+cd norwegian-t5-small
+git lfs track "*tfevents*"
+```
+
+Great, we have set up our model repository. During training, we will automatically
+push the training logs and model weights to the repo.
+
+Next, let's add symbolic links to the `run_t5_mlm_flax.py` and `t5_tokenizer_model.py` scripts.
+
+```bash
+export MODEL_DIR="./norwegian-t5-small"
+ln -s ~/transformers/examples/flax/language-modeling/run_t5_mlm_flax.py run_t5_mlm_flax.py
+ln -s ~/transformers/examples/flax/language-modeling/t5_tokenizer_model.py t5_tokenizer_model.py
+```
+
+### Train tokenizer
+
+In the first step, we train a tokenizer to efficiently process the text input for the model. 
+We make use of the [tokenizers](https://github.com/huggingface/tokenizers) library to train 
+a sentencepiece unigram tokenizer as shown in [t5_tokenizer_model.py](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling/t5_tokenizer_model.py) 
+which is heavily inspired by [yandex-research/DeDLOC's tokenizer model](https://github.com/yandex-research/DeDLOC/blob/5c994bc64e573702a9a79add3ecd68b38f14b548/sahajbert/tokenizer/tokenizer_model.py).
+
+The tokenizer is trained on the complete Norwegian dataset of OSCAR
+and subsequently saved in `${MODEL_DIR}`.
+This can take up to 120 minutes depending on your hardware ☕☕☕.
+
+```python
+import datasets
+
+from t5_tokenizer_model import SentencePieceUnigramTokenizer
+
+
+vocab_size = 32_000
+input_sentence_size = None
+model_dir = "./norwegian-t5-small"  # ${MODEL_DIR}
+
+# Initialize a dataset
+dataset = datasets.load_dataset("oscar", name="unshuffled_deduplicated_no", split="train")
+
+tokenizer = SentencePieceUnigramTokenizer(unk_token="<unk>", eos_token="</s>", pad_token="<pad>")
+
+
+# Build an iterator over this dataset
+def batch_iterator(input_sentence_size=None):
+    if input_sentence_size is None:
+        input_sentence_size = len(dataset)
+    batch_length = 100
+    for i in range(0, input_sentence_size, batch_length):
+        yield dataset[i: i + batch_length]["text"]
+
+
+# Train tokenizer
+tokenizer.train_from_iterator(
+    iterator=batch_iterator(input_sentence_size=input_sentence_size),
+    vocab_size=vocab_size,
+    show_progress=True,
+)
+
+# Save files to disk
+tokenizer.save(f"{model_dir}/tokenizer.json")
+```
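The batching pattern used by `batch_iterator` can be tried on a small in-memory list before running it on the full dataset. The sketch below is a stand-alone variant that takes the rows as an argument (the `rows` parameter and the toy sentences are assumptions for illustration); it also shows that `input_sentence_size` is only honored up to a multiple of the batch length, since the final slice is not truncated.

```python
def batch_iterator(rows, batch_length=100, input_sentence_size=None):
    """Yield consecutive slices of `rows`, mirroring the iterator above."""
    if input_sentence_size is None:
        input_sentence_size = len(rows)
    for i in range(0, input_sentence_size, batch_length):
        yield rows[i : i + batch_length]


# Toy stand-in for the "text" column of the dataset.
rows = [f"setning {n}" for n in range(250)]

# Without a cap, every row is visited and the last batch is shorter.
batch_sizes = [len(batch) for batch in batch_iterator(rows)]

# With a cap of 150, the second slice still spans rows 100..199,
# so 200 rows are yielded rather than 150.
capped_sizes = [len(batch) for batch in batch_iterator(rows, input_sentence_size=150)]

print(batch_sizes, capped_sizes)
```

If an exact cap matters, choosing `input_sentence_size` as a multiple of the batch length avoids the overshoot.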
+
+### Create configuration
+
+Next, we create the model's configuration file. This is as simple 
+as loading and storing [**`t5-small`**](https://huggingface.co/t5-small)
+in the local model folder:
+
+```python
+from transformers import T5Config
+
+model_dir = "./norwegian-t5-small"  # ${MODEL_DIR}
+
+config = T5Config.from_pretrained("t5-small")
+config.save_pretrained(model_dir)
+```
+
+### Train model
+
+Next we can run the example script to pretrain the model:
+
+```bash
+./run_t5_mlm_flax.py \
+    --output_dir="${MODEL_DIR}" \
+    --model_type="t5" \
+    --config_name="${MODEL_DIR}" \
+    --tokenizer_name="${MODEL_DIR}" \
+    --dataset_name="oscar" \
+    --dataset_config_name="unshuffled_deduplicated_no" \
+    --max_seq_length="512" \
+    --per_device_train_batch_size="16" \
+    --per_device_eval_batch_size="16" \
+    --learning_rate="1e-3" \
+    --weight_decay="0.001" \
+    --warmup_steps="5000" \
+    --overwrite_output_dir \
+    --num_train_epochs="10" \
+    --push_to_hub
+```
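As a rough sanity check on the flags above, the global batch size can be worked out by hand. The sketch assumes that a TPUv3-8 exposes 8 devices to JAX and that the global batch is the per-device batch multiplied by the device count; both are assumptions about the setup rather than values taken from the command itself.

```python
per_device_train_batch_size = 16  # from --per_device_train_batch_size
max_seq_length = 512              # from --max_seq_length
num_devices = 8                   # assumption: a TPUv3-8 exposes 8 devices

# Sequences processed per optimizer step across all devices.
global_batch_size = per_device_train_batch_size * num_devices

# Upper bound on encoder input tokens per step (padding included).
max_input_tokens_per_step = global_batch_size * max_seq_length

print(global_batch_size, max_input_tokens_per_step)
```

On different hardware, adjusting `--per_device_train_batch_size` so that the product stays comparable is one way to keep the run close to the one described here.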
+
+Training should converge at a loss and accuracy 
+of XXX and XXX respectively after 10 epochs on a single TPUv3-8.
+This should take less than 18 hours.
+Training statistics can be accessed directly on the 🤗 [hub (TODO)]()

 ## Runtime evaluation