Update README.md
committed by GitHub
parent d94773e685
commit f4399ec570
@@ -251,32 +251,32 @@ Training statistics can be accessed on [tfhub.de](https://tensorboard.dev/experi
 In the following, we demonstrate how to train a T5 model using the span-masked language model
 objective as proposed in the [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683).
 More specifically, we demonstrate how JAX/Flax can be leveraged
-to pre-train [**`t5-small`**](https://huggingface.co/t5-small)
+to pre-train [**`google/t5-v1_1-base`**](https://huggingface.co/google/t5-v1_1-base)
 in Norwegian on a single TPUv3-8 pod.

 The example script uses the 🤗 Datasets library. You can easily customize them to your needs if you need extra processing on your datasets.

 Let's start by creating a model repository to save the trained model and logs.
-Here we call the model `"norwegian-t5-small"`, but you can change the model name as you like.
+Here we call the model `"norwegian-t5-base"`, but you can change the model name as you like.

 You can do this either directly on [huggingface.co](https://huggingface.co/new) (assuming that
 you are logged in) or via the command line:

 ```
-huggingface-cli repo create norwegian-t5-small
+huggingface-cli repo create norwegian-t5-base
 ```

 Next we clone the model repository to add the tokenizer and model files.

 ```
-git clone https://huggingface.co/<your-username>/norwegian-t5-small
+git clone https://huggingface.co/<your-username>/norwegian-t5-base
 ```

 To ensure that all tensorboard traces will be uploaded correctly, we need to
 track them. You can run the following command inside your model repo to do so.

 ```
-cd norwegian-t5-small
+cd norwegian-t5-base
 git lfs track "*tfevents*"
 ```
@@ -286,7 +286,7 @@ push the training logs and model weights to the repo.
 Next, let's add a symbolic link to the `run_t5_mlm_flax.py` and `t5_tokenizer_model` scripts.

 ```bash
-export MODEL_DIR="./norwegian-t5-small"
+export MODEL_DIR="./norwegian-t5-base"
 ln -s ~/transformers/examples/flax/language-modeling/run_t5_mlm_flax.py run_t5_mlm_flax.py
 ln -s ~/transformers/examples/flax/language-modeling/t5_tokenizer_model.py t5_tokenizer_model.py
 ```
@@ -310,7 +310,7 @@ from t5_tokenizer_model import SentencePieceUnigramTokenizer

 vocab_size = 32_000
 input_sentence_size = None
-model_dir = "./norwegian-t5-small" # ${MODEL_DIR}
+model_dir = "./norwegian-t5-base" # ${MODEL_DIR}

 # Initialize a dataset
 dataset = datasets.load_dataset("oscar", name="unshuffled_deduplicated_no", split="train")
@@ -341,15 +341,15 @@ tokenizer.save(f"{model_dir}/tokenizer.json")
 ### Create configuration

 Next, we create the model's configuration file. This is as simple
-as loading and storing [`**t5-small**`](https://huggingface.co/t5-small)
+as loading and storing [`**google/t5-v1_1-base**`](https://huggingface.co/google/t5-v1_1-base)
 in the local model folder:

 ```python
 from transformers import T5Config

-model_dir = "./norwegian-t5-small" # ${MODEL_DIR}
+model_dir = "./norwegian-t5-base" # ${MODEL_DIR}

-config = T5Config.from_pretrained("t5-small")
+config = T5Config.from_pretrained("google/t5-v1_1-base")
 config.save_pretrained(model_dir)
 ```
@@ -359,30 +359,30 @@ Next we can run the example script to pretrain the model:

 ```bash
 ./run_t5_mlm_flax.py \
---output_dir="${MODEL_DIR}" \
+--output_dir="./" \
 --model_type="t5" \
---config_name="${MODEL_DIR}" \
---tokenizer_name="${MODEL_DIR}" \
+--config_name="./" \
+--tokenizer_name="./" \
 --dataset_name="oscar" \
 --dataset_config_name="unshuffled_deduplicated_no" \
 --max_seq_length="512" \
---per_device_train_batch_size="16" \
---per_device_eval_batch_size="16" \
---learning_rate="1e-3" \
+--per_device_train_batch_size="32" \
+--per_device_eval_batch_size="32" \
+--adafactor \
+--learning_rate="0.005" \
 --weight_decay="0.001" \
---warmup_steps="5000" \
+--warmup_steps="2000" \
 --overwrite_output_dir \
---num_train_epochs="10" \
---logging_steps="500" \
---save_steps="2500" \
---eval_steps="2500" \
+--logging_steps="100" \
+--save_steps="1000" \
+--eval_steps="1000" \
 --push_to_hub
 ```

 Training should converge at a loss and accuracy
-of XXX and XXX respectively after 10 epochs on a single TPUv3-8.
-This should take less than 18 hours.
-Training statistics can be accessed on directly on the 🤗 [hub (TODO)]()
+of 2.2 and 58.0 respectively after 2 epochs on a single TPUv3-8.
+This should take around 24 hours.
+Training statistics can be accessed on directly on the 🤗 [hub](https://huggingface.co/patrickvonplaten/t5-base-norwegian/tensorboard)

 ## Runtime evaluation
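Note on the batch-size change in the training command: the flags are per device, so the effective batch depends on how many devices JAX sees. The sketch below works through that arithmetic for the old and new values. It assumes 8 devices (a TPUv3-8 exposes 8 cores) and treats `--max_seq_length` as an upper bound on encoder input tokens per sequence; the helper names are hypothetical and are not part of `run_t5_mlm_flax.py`.

```python
# Sketch: how the per-device batch size in this diff maps to a global batch.
# NUM_DEVICES = 8 is an assumption (a TPUv3-8 exposes 8 cores to JAX).
NUM_DEVICES = 8
MAX_SEQ_LENGTH = 512  # --max_seq_length, unchanged by this commit


def global_batch(per_device_batch_size: int, num_devices: int = NUM_DEVICES) -> int:
    """Number of sequences consumed per optimizer step across all devices."""
    return per_device_batch_size * num_devices


def tokens_per_step(per_device_batch_size: int, num_devices: int = NUM_DEVICES) -> int:
    """Upper bound on encoder input tokens processed per optimizer step."""
    return global_batch(per_device_batch_size, num_devices) * MAX_SEQ_LENGTH


if __name__ == "__main__":
    for label, per_device in (("old", 16), ("new", 32)):
        print(
            f"{label}: per-device={per_device} "
            f"global={global_batch(per_device)} "
            f"tokens/step<={tokens_per_step(per_device)}"
        )
```

Under these assumptions the commit doubles the global batch from 128 to 256 sequences per step, which is worth keeping in mind when comparing the new learning rate and warmup settings with the old ones.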