Update readme with how to train offline and fix BPE command (#15897)
* Update readme with how to train offline and fix BPE command

* Update examples/research_projects/codeparrot/README.md

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Update examples/research_projects/codeparrot/README.md

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Update examples/research_projects/codeparrot/README.md

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Update examples/research_projects/codeparrot/README.md

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
@@ -58,8 +58,8 @@ During preprocessing the dataset is downloaded and stored locally as well as cac
 ## Tokenizer
 Before training a new model for code we create a new tokenizer that is efficient at code tokenization. To train the tokenizer you can run the following command:
 ```bash
-python scripts/bpe_training.py
---base_tokenizer gpt2
+python scripts/bpe_training.py \
+--base_tokenizer gpt2 \
 --dataset_name lvwerra/codeparrot-clean-train
 ```
 
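The trailing backslashes matter because the shell otherwise ends a command at the newline: the script would run with no arguments, and the next line would be parsed as a separate command named `--base_tokenizer`. The sketch below illustrates the difference. `count_args` is a hypothetical stand-in for `scripts/bpe_training.py` that only reports how many arguments it received; it is not part of the repository.

```shell
#!/usr/bin/env bash
# count_args is a hypothetical stand-in for scripts/bpe_training.py:
# it only prints the number of arguments it was called with.
count_args() { echo "$#"; }

# With trailing backslashes the three lines form a single command,
# so all four tokens after the command name arrive as arguments.
with_continuation=$(count_args \
--base_tokenizer gpt2 \
--dataset_name lvwerra/codeparrot-clean-train)

# Without them the command ends at the first newline,
# which is equivalent to calling it with no arguments at all.
without_continuation=$(count_args)

echo "with: $with_continuation, without: $without_continuation"
```

Run as written, the first call receives four arguments and the second receives none.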
@@ -113,6 +113,32 @@ Recall that you can see the full set of possible options with descriptions (for
 python scripts/codeparrot_training.py --help
 ```
 
+Instead of streaming the dataset from the hub you can also stream it from disk. This can be helpful for long training runs where the connection can be interrupted sometimes. To stream locally you simply need to clone the datasets and replace the dataset name with their path. In this example we store the data in a folder called `data`:
+
+```bash
+git lfs install
+mkdir data
+git -C "./data" clone https://huggingface.co/datasets/lvwerra/codeparrot-clean-train
+git -C "./data" clone https://huggingface.co/datasets/lvwerra/codeparrot-clean-valid
+```
+
+And then pass the paths to the datasets when we run the training script:
+
+```bash
+accelerate launch scripts/codeparrot_training.py \
+--model_ckpt lvwerra/codeparrot-small \
+--dataset_name_train ./data/codeparrot-clean-train \
+--dataset_name_valid ./data/codeparrot-clean-valid \
+--train_batch_size 12 \
+--valid_batch_size 12 \
+--learning_rate 5e-4 \
+--num_warmup_steps 2000 \
+--gradient_accumulation 1 \
+--gradient_checkpointing False \
+--max_train_steps 150000 \
+--save_checkpoint_steps 15000
+```
+
 ## Evaluation
 For evaluating the language modeling loss on the validation set or any other dataset you can use the following command:
 ```bash
@@ -158,4 +184,4 @@ Give the model a shot yourself! There are two demos to interact with CodeParrot
 ## Further Resources
 A detailed description of the project can be found in the chapter "Training Transformers from Scratch" in the upcoming O'Reilly book [Natural Language Processing with Transformers](https://learning.oreilly.com/library/view/natural-language-processing/9781098103231/).
 
 This example was provided by [Leandro von Werra](www.github.com/lvwerra).
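For the offline training instructions added in this commit, it may help to confirm that both local dataset folders exist before starting a long run. The following is a minimal sketch, not part of the CodeParrot scripts: `require_dirs` is a hypothetical helper, and the demonstration uses a throwaway temporary directory instead of real clones so it runs anywhere.

```shell
#!/usr/bin/env bash
# require_dirs is a hypothetical helper (not part of the CodeParrot scripts).
# It returns non-zero and names the first path that is not a directory.
require_dirs() {
  local dir
  for dir in "$@"; do
    if [ ! -d "$dir" ]; then
      echo "missing dataset directory: $dir" >&2
      return 1
    fi
  done
}

# Demonstration on a temporary tree: only the train folder is created.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/data/codeparrot-clean-train"

if require_dirs "$demo_root/data/codeparrot-clean-train"; then
  train_status=ok
else
  train_status=missing
fi

if require_dirs "$demo_root/data/codeparrot-clean-valid" 2>/dev/null; then
  valid_status=ok
else
  valid_status=missing
fi

echo "train: $train_status, valid: $valid_status"
rm -rf "$demo_root"
```

In an actual setup one could call `require_dirs ./data/codeparrot-clean-train ./data/codeparrot-clean-valid` and run the `accelerate launch` command only if it succeeds.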