[examples/seq2seq]: add --label_smoothing option (#5919)

Sam Shleifer
2020-07-21 16:51:39 -04:00
committed by GitHub
parent 95d1962b9c
commit 5b193b39b0
7 changed files with 132 additions and 46 deletions


@@ -27,8 +27,18 @@ this should make a directory called `cnn_dm/` with files like `test.source`.
```
WMT16 English-Romanian Translation Data:
This dataset comes in two formats. The "packed" version merges short training examples into examples of <200 tokens to increase GPU utilization (and also improves validation performance).
```bash
cd examples/seq2seq
wget https://s3.amazonaws.com/datasets.huggingface.co/translation/wmt_en_ro_packed_train_200.tgz
tar -xzvf wmt_en_ro_packed_train_200.tgz  # archive name must match the downloaded file
export ENRO_DIR=wmt_en_ro_packed_train_200
```
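To make the idea of "packing" concrete, here is a minimal sketch of how short training pairs could be merged into longer ones. It is not the script that produced the packed dataset: the function name `pack_examples` is hypothetical, tokens are counted by whitespace splitting rather than with the model's tokenizer, and greedy merging of consecutive pairs is an assumption.

```python
def pack_examples(pairs, max_tokens=200, count_tokens=lambda s: len(s.split())):
    """Greedily merge consecutive (source, target) pairs.

    A pair is appended to the current packed example only while BOTH the
    merged source and the merged target stay under `max_tokens`.
    Whitespace token counting is an assumption made for illustration.
    """
    if not pairs:
        return []
    packed = []
    cur_src, cur_tgt = pairs[0]
    for src, tgt in pairs[1:]:
        cand_src = cur_src + " " + src
        cand_tgt = cur_tgt + " " + tgt
        if count_tokens(cand_src) < max_tokens and count_tokens(cand_tgt) < max_tokens:
            # Still short enough: keep growing the current example.
            cur_src, cur_tgt = cand_src, cand_tgt
        else:
            # Too long: emit the current example and start a new one.
            packed.append((cur_src, cur_tgt))
            cur_src, cur_tgt = src, tgt
    packed.append((cur_src, cur_tgt))
    return packed


toy_pairs = [("a b", "x y"), ("c d", "z w"), ("e f g", "u v t")]
toy_packed = pack_examples(toy_pairs, max_tokens=5)
```

Fewer, longer examples mean less padding per batch, which is the GPU-utilization benefit mentioned above.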
The original data can also be downloaded with this command:
```bash
wget https://s3.amazonaws.com/datasets.huggingface.co/translation/wmt_en_ro.tar.gz
tar -xzvf wmt_en_ro.tar.gz
export ENRO_DIR=${PWD}/wmt_en_ro
@@ -84,16 +94,31 @@ The following command should work on a 16GB GPU:
First, follow the wmt_en_ro download instructions.
Then you can finetune mbart_cc25 on English-Romanian with the following command.
**Recommendation:** Read and potentially modify the fairly opinionated defaults in the `train_mbart_cc25_enro.sh` script before running it.
Best performing command:
```bash
export ENRO_DIR=${PWD}/wmt_en_ro # may need to be fixed depending on where you downloaded
export MAX_LEN=128
# optionally
export ENRO_DIR='wmt_en_ro_packed_train_200' # Download instructions above
# export WANDB_PROJECT="MT" # optional
export MAX_LEN=200
export BS=4
export GAS=8
./train_mbart_cc25_enro.sh --output_dir cc25_v1_frozen/
export GAS=8 # gradient accumulation steps
./train_mbart_cc25_enro.sh --output_dir enro_finetune_baseline --label_smoothing 0.1 --fp16_opt_level=O1 --logger_name wandb --sortish_sampler
```
This should take < 2h/epoch on a 16GB V100 and achieve a `val_avg_bleu` score above 25 (you can see it in wandb or `metrics.json`).
To get results in line with fairseq, you need to do some postprocessing.
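The best-performing command above passes `--label_smoothing 0.1`. As a rough illustration of what such an option usually means, the sketch below computes one common formulation of label-smoothed negative log-likelihood for a single token: the target's loss is mixed with the average loss over the whole vocabulary. This is an assumption about the general technique, not a copy of the code in this commit; the function name `label_smoothed_nll` and the plain-list inputs are hypothetical.

```python
import math


def label_smoothed_nll(log_probs, target, epsilon=0.1):
    """Label-smoothed loss for one token.

    log_probs: log-probabilities over the vocabulary (a list of floats).
    target:    index of the correct token.
    epsilon:   smoothing weight; 0.0 recovers the plain NLL loss.
    """
    vocab_size = len(log_probs)
    nll = -log_probs[target]                    # usual cross-entropy term
    uniform = -sum(log_probs) / vocab_size      # loss against a uniform target
    return (1.0 - epsilon) * nll + epsilon * uniform


toy_log_probs = [math.log(0.5), math.log(0.25), math.log(0.25)]
plain_loss = label_smoothed_nll(toy_log_probs, target=0, epsilon=0.0)
smoothed_loss = label_smoothed_nll(toy_log_probs, target=0, epsilon=0.1)
```

With `epsilon=0.0` the function reduces to ordinary cross entropy; a positive `epsilon` penalizes over-confident predictions by rewarding some probability mass on non-target tokens.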
Multi-GPU command (using 8 GPUs as an example):
```bash
export ENRO_DIR='wmt_en_ro_packed_train_200' # Download instructions above
# export WANDB_PROJECT="MT" # optional
export MAX_LEN=200
export BS=4
export GAS=1 # gradient accumulation steps
./train_mbart_cc25_enro.sh --output_dir enro_finetune_baseline --gpus 8 --logger_name wandb
```
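A quick way to compare the single-GPU and multi-GPU settings is to compute the effective batch size per optimizer step. The sketch below assumes that `BS` is a per-GPU batch size and that the GPUs run in data-parallel mode; the helper name `effective_batch_size` is hypothetical.

```python
def effective_batch_size(per_gpu_batch_size, grad_accum_steps, num_gpus=1):
    """Examples contributing to one optimizer step (assumes data parallelism)."""
    return per_gpu_batch_size * grad_accum_steps * num_gpus


# Single GPU: BS=4, GAS=8
single_gpu = effective_batch_size(4, 8)
# 8 GPUs: BS=4, GAS=1
multi_gpu = effective_batch_size(4, 1, num_gpus=8)
print(single_gpu, multi_gpu)  # prints "32 32"
```

Under these assumptions, dropping `GAS` from 8 to 1 when moving to 8 GPUs keeps the effective batch size unchanged.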
### Finetuning Outputs
As you train, `output_dir` will be filled with files that look roughly like this (comments are mine).
Some of them are metrics, some of them are checkpoints, some of them are metadata. Here is a quick tour:
@@ -108,7 +133,7 @@ output_dir
│   ├── tokenizer_config.json
│   └── vocab.json
├── git_log.json # repo, branch, and commit hash
├── val_avg_rouge2=0.1984-step_count=11.ckpt # this is a pytorch lightning checkpoint associated with the best val score.
├── val_avg_rouge2=0.1984-step_count=11.ckpt # this is a pytorch lightning checkpoint associated with the best val score. (it will be called BLEU for MT)
├── metrics.json # new validation metrics will continually be appended to this
├── student # this is a huggingface checkpoint generated by SummarizationDistiller. It is the student before it gets finetuned.
│   ├── config.json