From f5e8c9bdeab96c3426583cf1aa572ce7ede8a070 Mon Sep 17 00:00:00 2001
From: Nathan Cooper <nacooper01@wm.edu>
Date: Thu, 24 Mar 2022 06:00:46 -0400
Subject: [PATCH] Update readme with how to train offline and fix BPE command
 (#15897)

* Update readme with how to train offline and fix BPE command

* Update examples/research_projects/codeparrot/README.md

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
---
 .../research_projects/codeparrot/README.md    | 32 +++++++++++++++++--
 1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/examples/research_projects/codeparrot/README.md b/examples/research_projects/codeparrot/README.md
index cf0b99345c..8fd4c5c4e7 100644
--- a/examples/research_projects/codeparrot/README.md
+++ b/examples/research_projects/codeparrot/README.md
@@ -58,8 +58,8 @@ During preprocessing the dataset is downloaded and stored locally as well as cac
 ## Tokenizer
 Before training a new model for code we create a new tokenizer that is efficient at code tokenization. To train the tokenizer you can run the following command: 
 ```bash
-python scripts/bpe_training.py
-    --base_tokenizer gpt2
+python scripts/bpe_training.py \
+    --base_tokenizer gpt2 \
     --dataset_name lvwerra/codeparrot-clean-train
 ```
 
@@ -113,6 +113,32 @@ Recall that you can see the full set of possible options with descriptions (for
 python scripts/codeparrot_training.py --help
 ```
 
+Instead of streaming the dataset from the hub you can also stream it from disk. This is helpful for long training runs, where the connection to the hub may be interrupted. To stream locally, clone the datasets and replace the dataset names with their local paths. In this example we store the data in a folder called `data`:
+
+```bash
+git lfs install
+mkdir data
+git -C "./data" clone https://huggingface.co/datasets/lvwerra/codeparrot-clean-train
+git -C "./data" clone https://huggingface.co/datasets/lvwerra/codeparrot-clean-valid
+```
+
+Then pass the local dataset paths when running the training script:
+
+```bash
+accelerate launch scripts/codeparrot_training.py \
+--model_ckpt lvwerra/codeparrot-small \
+--dataset_name_train ./data/codeparrot-clean-train \
+--dataset_name_valid ./data/codeparrot-clean-valid \
+--train_batch_size 12 \
+--valid_batch_size 12 \
+--learning_rate 5e-4 \
+--num_warmup_steps 2000 \
+--gradient_accumulation 1 \
+--gradient_checkpointing False \
+--max_train_steps 150000 \
+--save_checkpoint_steps 15000
+```
+
 ## Evaluation
 For evaluating the language modeling loss on the validation set or any other dataset you can use the following command:
 ```bash
@@ -158,4 +184,4 @@ Give the model a shot yourself! There are two demos to interact with CodeParrot
 ## Further Resources
 A detailed description of the project can be found in the chapter "Training Transformers from Scratch" in the upcoming O'Reilly book [Natural Language Processing with Transformers](https://learning.oreilly.com/library/view/natural-language-processing/9781098103231/).
 
-This example was provided by [Leandro von Werra](www.github.com/lvwerra).
\ No newline at end of file
+This example was provided by [Leandro von Werra](https://github.com/lvwerra).