Update readme with how to train offline and fix BPE command (#15897)

* Update readme with how to train offline and fix BPE command

* Update examples/research_projects/codeparrot/README.md

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Update examples/research_projects/codeparrot/README.md

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Update examples/research_projects/codeparrot/README.md

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Update examples/research_projects/codeparrot/README.md

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
2022-03-24 06:00:46 -04:00
parent 9badcecf69
commit f5e8c9bdea
1 changed files with 29 additions and 3 deletions
--- a/examples/research_projects/codeparrot/README.md
+++ b/examples/research_projects/codeparrot/README.md
@@ -58,8 +58,8 @@ During preprocessing the dataset is downloaded and stored locally as well as cac
 ## Tokenizer
 Before training a new model for code we create a new tokenizer that is efficient at code tokenization. To train the tokenizer you can run the following command: 
 ```bash
-python scripts/bpe_training.py
+python scripts/bpe_training.py \
-    --base_tokenizer gpt2
+    --base_tokenizer gpt2 \
    --dataset_name lvwerra/codeparrot-clean-train
 ```
@@ -113,6 +113,32 @@ Recall that you can see the full set of possible options with descriptions (for
 python scripts/codeparrot_training.py --help
 ```
 Instead of streaming the dataset from the Hub, you can also stream it from disk. This helps on long training runs, where the connection may occasionally be interrupted. To stream locally, clone the datasets and replace each dataset name with its local path. In this example we store the data in a folder called `data`:
 ```bash
 git lfs install
 mkdir data
 git -C "./data" clone https://huggingface.co/datasets/lvwerra/codeparrot-clean-train
 git -C "./data" clone https://huggingface.co/datasets/lvwerra/codeparrot-clean-valid
 ```
 Then pass the dataset paths when running the training script:
 ```bash
 accelerate launch scripts/codeparrot_training.py \
 --model_ckpt lvwerra/codeparrot-small \
 --dataset_name_train ./data/codeparrot-clean-train \
 --dataset_name_valid ./data/codeparrot-clean-valid \
 --train_batch_size 12 \
 --valid_batch_size 12 \
 --learning_rate 5e-4 \
 --num_warmup_steps 2000 \
 --gradient_accumulation 1 \
 --gradient_checkpointing False \
 --max_train_steps 150000 \
 --save_checkpoint_steps 15000
 ```
 ## Evaluation
 For evaluating the language modeling loss on the validation set or any other dataset you can use the following command:
 ```bash
@@ -158,4 +184,4 @@ Give the model a shot yourself! There are two demos to interact with CodeParrot
 ## Further Resources
 A detailed description of the project can be found in the chapter "Training Transformers from Scratch" in the upcoming O'Reilly book [Natural Language Processing with Transformers](https://learning.oreilly.com/library/view/natural-language-processing/9781098103231/).
-This example was provided by [Leandro von Werra](www.github.com/lvwerra).
+This example was provided by [Leandro von Werra](www.github.com/lvwerra).