[examples/seq2seq]: add --label_smoothing option (#5919)

2020-07-21 16:51:39 -04:00
parent 95d1962b9c
commit 5b193b39b0
7 changed files with 132 additions and 46 deletions
--- a/examples/seq2seq/README.md
+++ b/examples/seq2seq/README.md
@@ -27,8 +27,18 @@ this should make a directory called `cnn_dm/` with files like `test.source`.
 ```

 WMT16 English-Romanian Translation Data:
+
+This dataset comes in two formats. The "packed" version merges short training examples into examples of fewer than 200 tokens, which increases GPU utilization and also improves validation performance.
+
 ```bash
 cd examples/seq2seq
+wget https://s3.amazonaws.com/datasets.huggingface.co/translation/wmt_en_ro_packed_train_200.tgz
+tar -xzvf wmt_en_ro_packed_train_200.tgz
+export ENRO_DIR=wmt_en_ro_packed_train_200
+```
+ 
+The original data can also be downloaded with this command:
+```bash
 wget https://s3.amazonaws.com/datasets.huggingface.co/translation/wmt_en_ro.tar.gz
 tar -xzvf wmt_en_ro.tar.gz
 export ENRO_DIR=${PWD}/wmt_en_ro
@@ -84,16 +94,31 @@ The following command should work on a 16GB GPU:

 First, follow the wmt_en_ro download instructions.
 Then you can finetune mbart_cc25 on english-romanian with the following command.
-**Recommendation:** Read and potentially modify the fairly opinionated defaults in `train_mbart_cc25_enro.sh` script before running it. 
+**Recommendation:** Read and potentially modify the fairly opinionated defaults in the `train_mbart_cc25_enro.sh` script before running it.
+
+Best performing command:
 ```bash
-export ENRO_DIR=${PWD}/wmt_en_ro   # may need to be fixed depending on where you downloaded
-export MAX_LEN=128
+# optionally
+export ENRO_DIR='wmt_en_ro_packed_train_200' # Download instructions above
+# export WANDB_PROJECT="MT" # optional 
+export MAX_LEN=200
 export BS=4
-export GAS=8
-./train_mbart_cc25_enro.sh --output_dir cc25_v1_frozen/
+export GAS=8 # gradient accumulation steps
+./train_mbart_cc25_enro.sh --output_dir enro_finetune_baseline --label_smoothing 0.1 --fp16_opt_level=O1 --logger_name wandb --sortish_sampler
 ```
+This should take < 2h/epoch on a 16GB V100 and achieve a `val_avg_bleu` score above 25 (you can see it in wandb or `metrics.json`).
+To get results in line with fairseq, you need to do some postprocessing.

-
+Multi-GPU command
+(using 8 GPUs as an example):
+```bash
+export ENRO_DIR='wmt_en_ro_packed_train_200' # Download instructions above
+# export WANDB_PROJECT="MT" # optional
+export MAX_LEN=200
+export BS=4
+export GAS=1  # gradient accumulation steps
+./train_mbart_cc25_enro.sh --output_dir enro_finetune_baseline --gpus 8 --logger_name wandb
+```
 ### Finetuning Outputs 
 As you train, `output_dir` will be filled with files, that look kind of like this (comments are mine). 
 Some of them are metrics, some of them are checkpoints, some of them are metadata. Here is a quick tour:
@@ -108,7 +133,7 @@ output_dir
 │   ├── tokenizer_config.json
 │   └── vocab.json
 ├── git_log.json   # repo, branch, and commit hash
-├── val_avg_rouge2=0.1984-step_count=11.ckpt  # this is a pytorch lightning checkpoint associated with the best val score.
+├── val_avg_rouge2=0.1984-step_count=11.ckpt  # this is a pytorch lightning checkpoint associated with the best val score (for MT, the metric in the filename will be BLEU instead of rouge2)
 ├── metrics.json  # new validation metrics will continually be appended to this
 ├── student  # this is a huggingface checkpoint generated by SummarizationDistiller. It is the student before it gets finetuned.
 │   ├── config.json