[examples/flax] use Repository API for push_to_hub (#13672)
* use Repository for push_to_hub
* update readme
* update other flax scripts
* update readme
* update qa example
* fix push_to_hub call
* fix typo
* fix more typos
* update readme
* use absolute path to get repo name
* fix glue script
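The bullet about deriving the repo name from an absolute path is easier to follow with a small `pathlib` experiment. The sketch below is illustrative only: `default_repo_id` and its `namespace` parameter are hypothetical stand-ins, whereas the scripts in this commit call `get_full_repo_name` from `transformers.file_utils`, which looks up the user name from the Hub token. Passing the namespace explicitly keeps the example offline.

```python
from pathlib import Path
from typing import Optional


def default_repo_id(output_dir: str, namespace: str, hub_model_id: Optional[str] = None) -> str:
    """Hypothetical helper mirroring the branch the scripts add for choosing a repo id."""
    if hub_model_id is not None:
        # An explicit id always wins, as in the scripts' `else` branch.
        return hub_model_id
    # .absolute() prepends the current working directory, so even "./"
    # ends in a real directory name that can serve as the model id.
    return f"{namespace}/{Path(output_dir).absolute().name}"


# A bare relative path such as "./" has no final component of its own.
print(repr(Path("./").name))  # ''

# A named output directory yields its own name regardless of the working directory.
print(default_repo_id("./norwegian-roberta-base", namespace="your-username"))
```

With `--output_dir="./"`, as in the old README commands, the non-absolute form would produce an empty model id, which is presumably why the final name is taken from the absolute path.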
@@ -33,32 +33,10 @@ in Norwegian on a single TPUv3-8 pod.
 
 The example script uses the 🤗 Datasets library. You can easily customize them to your needs if you need extra processing on your datasets.
 
-Let's start by creating a model repository to save the trained model and logs.
-Here we call the model `"norwegian-roberta-base"`, but you can change the model name as you like.
-
-You can do this either directly on [huggingface.co](https://huggingface.co/new) (assuming that
-you are logged in) or via the command line:
-
-```
-huggingface-cli repo create norwegian-roberta-base
-```
-
-Next we clone the model repository to add the tokenizer and model files.
-
-```
-git clone https://huggingface.co/<your-username>/norwegian-roberta-base
-```
-
-To setup all relevant files for training, let's go into the cloned model directory.
+To setup all relevant files for training, let's create a directory.
 
 ```bash
-cd norwegian-roberta-base
-```
-
-Next, let's add a symbolic link to the `run_mlm_flax.py`.
-
-```bash
-ln -s ~/transformers/examples/flax/language-modeling/run_mlm_flax.py run_mlm_flax.py
+mkdir ./norwegian-roberta-base
 ```
 
 ### Train tokenizer
@@ -92,7 +70,7 @@ tokenizer.train_from_iterator(batch_iterator(), vocab_size=50265, min_frequency=
 ])
 
 # Save files to disk
-tokenizer.save("./tokenizer.json")
+tokenizer.save("./norwegian-roberta-base/tokenizer.json")
 ```
 
 ### Create configuration
@@ -105,7 +83,7 @@ in the local model folder:
 from transformers import RobertaConfig
 
 config = RobertaConfig.from_pretrained("roberta-base", vocab_size=50265)
-config.save_pretrained("./")
+config.save_pretrained("./norwegian-roberta-base")
 ```
 
 Great, we have set up our model repository. During training, we will automatically
@@ -116,11 +94,11 @@ push the training logs and model weights to the repo.
 Next we can run the example script to pretrain the model:
 
 ```bash
-./run_mlm_flax.py \
-    --output_dir="./" \
+python run_mlm_flax.py \
+    --output_dir="./norwegian-roberta-base" \
     --model_type="roberta" \
-    --config_name="./" \
-    --tokenizer_name="./" \
+    --config_name="./norwegian-roberta-base" \
+    --tokenizer_name="./norwegian-roberta-base" \
     --dataset_name="oscar" \
     --dataset_config_name="unshuffled_deduplicated_no" \
     --max_seq_length="128" \
@@ -157,32 +135,11 @@ in Norwegian on a single TPUv3-8 pod.
 
 The example script uses the 🤗 Datasets library. You can easily customize them to your needs if you need extra processing on your datasets.
 
-Let's start by creating a model repository to save the trained model and logs.
-Here we call the model `"norwegian-gpt2"`, but you can change the model name as you like.
 
-You can do this either directly on [huggingface.co](https://huggingface.co/new) (assuming that
-you are logged in) or via the command line:
-
-```
-huggingface-cli repo create norwegian-gpt2
-```
-
-Next we clone the model repository to add the tokenizer and model files.
-
-```
-git clone https://huggingface.co/<your-username>/norwegian-gpt2
-```
-
-To setup all relevant files for training, let's go into the cloned model directory.
+To setup all relevant files for training, let's create a directory.
 
 ```bash
-cd norwegian-gpt2
-```
-
-Next, let's add a symbolic link to the training script `run_clm_flax.py`.
-
-```bash
-ln -s ~/transformers/examples/flax/language-modeling/run_clm_flax.py run_clm_flax.py
+mkdir ./norwegian-gpt2
 ```
 
 ### Train tokenizer
@@ -216,7 +173,7 @@ tokenizer.train_from_iterator(batch_iterator(), vocab_size=50257, min_frequency=
 ])
 
 # Save files to disk
-tokenizer.save("./tokenizer.json")
+tokenizer.save("./norwegian-gpt2/tokenizer.json")
 ```
 
 ### Create configuration
@@ -229,7 +186,7 @@ in the local model folder:
 from transformers import GPT2Config
 
 config = GPT2Config.from_pretrained("gpt2", resid_pdrop=0.0, embd_pdrop=0.0, attn_pdrop=0.0, vocab_size=50257)
-config.save_pretrained("./")
+config.save_pretrained("./norwegian-gpt2")
 ```
 
 Great, we have set up our model repository. During training, we will now automatically
@@ -240,11 +197,11 @@ push the training logs and model weights to the repo.
 Finally, we can run the example script to pretrain the model:
 
 ```bash
-./run_clm_flax.py \
-    --output_dir="./" \
+python run_clm_flax.py \
+    --output_dir="./norwegian-gpt2" \
     --model_type="gpt2" \
-    --config_name="./" \
-    --tokenizer_name="./" \
+    --config_name="./norwegian-gpt2" \
+    --tokenizer_name="./norwegian-gpt2" \
     --dataset_name="oscar" \
     --dataset_config_name="unshuffled_deduplicated_no" \
     --do_train --do_eval \
@@ -282,30 +239,10 @@ The example script uses the 🤗 Datasets library. You can easily customize them
 Let's start by creating a model repository to save the trained model and logs.
 Here we call the model `"norwegian-t5-base"`, but you can change the model name as you like.
 
-You can do this either directly on [huggingface.co](https://huggingface.co/new) (assuming that
-you are logged in) or via the command line:
-
-```
-huggingface-cli repo create norwegian-t5-base
-```
-
-Next we clone the model repository to add the tokenizer and model files.
-
-```
-git clone https://huggingface.co/<your-username>/norwegian-t5-base
-```
-
-To setup all relevant files for trairing, let's go into the cloned model directory.
+To setup all relevant files for trairing, let's create a directory.
 
 ```bash
-cd norwegian-t5-base
-```
-
-Next, let's add a symbolic link to the `run_t5_mlm_flax.py` and `t5_tokenizer_model` scripts.
-
-```bash
-ln -s ~/transformers/examples/flax/language-modeling/run_t5_mlm_flax.py run_t5_mlm_flax.py
-ln -s ~/transformers/examples/flax/language-modeling/t5_tokenizer_model.py t5_tokenizer_model.py
+cd ./norwegian-t5-base
 ```
 
 ### Train tokenizer
@@ -351,7 +288,7 @@ tokenizer.train_from_iterator(
 )
 
 # Save files to disk
-tokenizer.save("./tokenizer.json")
+tokenizer.save("./norwegian-t5-base/tokenizer.json")
 ```
 
 ### Create configuration
@@ -364,7 +301,7 @@ in the local model folder:
 from transformers import T5Config
 
 config = T5Config.from_pretrained("google/t5-v1_1-base", vocab_size=tokenizer.get_vocab_size())
-config.save_pretrained("./")
+config.save_pretrained("./norwegian-t5-base")
 ```
 
 Great, we have set up our model repository. During training, we will automatically
@@ -375,11 +312,11 @@ push the training logs and model weights to the repo.
 Next we can run the example script to pretrain the model:
 
 ```bash
-./run_t5_mlm_flax.py \
-    --output_dir="./" \
+python run_t5_mlm_flax.py \
+    --output_dir="./norwegian-t5-base" \
     --model_type="t5" \
-    --config_name="./" \
-    --tokenizer_name="./" \
+    --config_name="./norwegian-t5-base" \
+    --tokenizer_name="./norwegian-t5-base" \
     --dataset_name="oscar" \
     --dataset_config_name="unshuffled_deduplicated_no" \
     --max_seq_length="512" \
@@ -43,6 +43,7 @@ from flax import jax_utils, traverse_util
 from flax.jax_utils import unreplicate
 from flax.training import train_state
 from flax.training.common_utils import get_metrics, onehot, shard, shard_prng_key
+from huggingface_hub import Repository
 from transformers import (
     CONFIG_MAPPING,
     FLAX_MODEL_FOR_CAUSAL_LM_MAPPING,
@@ -54,6 +55,7 @@ from transformers import (
     is_tensorboard_available,
     set_seed,
 )
+from transformers.file_utils import get_full_repo_name
 from transformers.testing_utils import CaptureLogger
 
 
@@ -275,6 +277,16 @@ def main():
     # Set seed before initializing model.
     set_seed(training_args.seed)
 
+    # Handle the repository creation
+    if training_args.push_to_hub:
+        if training_args.hub_model_id is None:
+            repo_name = get_full_repo_name(
+                Path(training_args.output_dir).absolute().name, token=training_args.hub_token
+            )
+        else:
+            repo_name = training_args.hub_model_id
+        repo = Repository(training_args.output_dir, clone_from=repo_name)
+
     # Get the datasets: you can either provide your own CSV/JSON/TXT training and evaluation files (see below)
     # or just provide the name of one of the public datasets available on the hub at https://huggingface.co/datasets/
     # (the dataset will be downloaded automatically from the datasets Hub).
@@ -654,12 +666,10 @@ def main():
         # save checkpoint after each epoch and push checkpoint to the hub
         if jax.process_index() == 0:
             params = jax.device_get(unreplicate(state.params))
-            model.save_pretrained(
-                training_args.output_dir,
-                params=params,
-                push_to_hub=training_args.push_to_hub,
-                commit_message=f"Saving weights and logs of step {cur_step}",
-            )
+            model.save_pretrained(training_args.output_dir, params=params)
+            tokenizer.save_pretrained(training_args.output_dir)
+            if training_args.push_to_hub:
+                repo.push_to_hub(commit_message=f"Saving weights and logs of step {cur_step}", blocking=False)
 
 
 if __name__ == "__main__":
@@ -41,6 +41,7 @@ import optax
 from flax import jax_utils, traverse_util
 from flax.training import train_state
 from flax.training.common_utils import get_metrics, onehot, shard
+from huggingface_hub import Repository
 from transformers import (
     CONFIG_MAPPING,
     FLAX_MODEL_FOR_MASKED_LM_MAPPING,
@@ -54,6 +55,7 @@ from transformers import (
     is_tensorboard_available,
     set_seed,
 )
+from transformers.file_utils import get_full_repo_name
 
 
 MODEL_CONFIG_CLASSES = list(FLAX_MODEL_FOR_MASKED_LM_MAPPING.keys())
@@ -308,6 +310,16 @@ if __name__ == "__main__":
     # Set seed before initializing model.
     set_seed(training_args.seed)
 
+    # Handle the repository creation
+    if training_args.push_to_hub:
+        if training_args.hub_model_id is None:
+            repo_name = get_full_repo_name(
+                Path(training_args.output_dir).absolute().name, token=training_args.hub_token
+            )
+        else:
+            repo_name = training_args.hub_model_id
+        repo = Repository(training_args.output_dir, clone_from=repo_name)
+
     # Get the datasets: you can either provide your own CSV/JSON/TXT training and evaluation files (see below)
     # or just provide the name of one of the public datasets available on the hub at https://huggingface.co/datasets/
     # (the dataset will be downloaded automatically from the datasets Hub).
@@ -683,9 +695,7 @@ if __name__ == "__main__":
                 # save checkpoint after each epoch and push checkpoint to the hub
                 if jax.process_index() == 0:
                     params = jax.device_get(jax.tree_map(lambda x: x[0], state.params))
-                    model.save_pretrained(
-                        training_args.output_dir,
-                        params=params,
-                        push_to_hub=training_args.push_to_hub,
-                        commit_message=f"Saving weights and logs of step {cur_step}",
-                    )
+                    model.save_pretrained(training_args.output_dir, params=params)
+                    tokenizer.save_pretrained(training_args.output_dir)
+                    if training_args.push_to_hub:
+                        repo.push_to_hub(commit_message=f"Saving weights and logs of step {cur_step}", blocking=False)
@@ -39,6 +39,7 @@ import optax
 from flax import jax_utils, traverse_util
 from flax.training import train_state
 from flax.training.common_utils import get_metrics, onehot, shard
+from huggingface_hub import Repository
 from transformers import (
     CONFIG_MAPPING,
     FLAX_MODEL_FOR_MASKED_LM_MAPPING,
@@ -52,6 +53,7 @@ from transformers import (
     is_tensorboard_available,
     set_seed,
 )
+from transformers.file_utils import get_full_repo_name
 from transformers.models.t5.modeling_flax_t5 import shift_tokens_right
 
 
@@ -438,6 +440,16 @@ if __name__ == "__main__":
     # Set seed before initializing model.
     set_seed(training_args.seed)
 
+    # Handle the repository creation
+    if training_args.push_to_hub:
+        if training_args.hub_model_id is None:
+            repo_name = get_full_repo_name(
+                Path(training_args.output_dir).absolute().name, token=training_args.hub_token
+            )
+        else:
+            repo_name = training_args.hub_model_id
+        repo = Repository(training_args.output_dir, clone_from=repo_name)
+
     # Get the datasets: you can either provide your own CSV/JSON/TXT training and evaluation files (see below)
     # or just provide the name of one of the public datasets available on the hub at https://huggingface.co/datasets/
     # (the dataset will be downloaded automatically from the datasets Hub).
@@ -791,9 +803,7 @@ if __name__ == "__main__":
                 # save checkpoint after each epoch and push checkpoint to the hub
                 if jax.process_index() == 0:
                     params = jax.device_get(jax.tree_map(lambda x: x[0], state.params))
-                    model.save_pretrained(
-                        training_args.output_dir,
-                        params=params,
-                        push_to_hub=training_args.push_to_hub,
-                        commit_message=f"Saving weights and logs of step {cur_step}",
-                    )
+                    model.save_pretrained(training_args.output_dir, params=params)
+                    tokenizer.save_pretrained(training_args.output_dir)
+                    if training_args.push_to_hub:
+                        repo.push_to_hub(commit_message=f"Saving weights and logs of step {cur_step}", blocking=False)
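The checkpoint hunks in all three scripts follow the same order: write the model and tokenizer files into the output directory, then ask the repository object for a non-blocking push. The sketch below reproduces only that control flow so it can run offline. `RecordingRepo` and `save_checkpoint` are hypothetical stand-ins, the file names are placeholders, and nothing here talks to the Hub; the real `huggingface_hub.Repository.push_to_hub` commits and pushes the working tree.

```python
import tempfile
from pathlib import Path


class RecordingRepo:
    """Hypothetical stand-in for huggingface_hub.Repository; it only records push requests."""

    def __init__(self, local_dir):
        self.local_dir = Path(local_dir)
        self.pushes = []

    def push_to_hub(self, commit_message, blocking=True):
        # Snapshot which files exist at push time instead of committing them.
        files = sorted(p.name for p in self.local_dir.iterdir())
        self.pushes.append((commit_message, blocking, files))


def save_checkpoint(output_dir, cur_step, push_to_hub, repo=None):
    """Write checkpoint files first, then optionally hand the directory to the repo."""
    out = Path(output_dir)
    # Placeholders for model.save_pretrained(...) and tokenizer.save_pretrained(...).
    (out / "flax_model.msgpack").write_bytes(b"weights-placeholder")
    (out / "tokenizer.json").write_text("{}")
    if push_to_hub:
        # blocking=False mirrors the diff: training continues while the push runs.
        repo.push_to_hub(commit_message=f"Saving weights and logs of step {cur_step}", blocking=False)


with tempfile.TemporaryDirectory() as tmp:
    repo = RecordingRepo(tmp)
    save_checkpoint(tmp, cur_step=500, push_to_hub=True, repo=repo)
    print(repo.pushes)
```

Because the files are written before the push is requested, the tokenizer now travels with every checkpoint commit, which the old `save_pretrained(..., push_to_hub=...)` call did not do for these scripts.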