[examples/seq2seq/README.md] fix t5 examples (#10734)

* [examples/seq2seq] fix t5 examples

This PR:
* fixes T5 examples to include `--source_prefix` - it's **not** optional. If you give it a try you will see that you get 10x worse BLEU scores w/o it: w/ `27.6849`, w/o `2.374`
* adds a normal translation example w/o the peculiarities of MBart and T5
* reduces the default max samples to 50 so it's much faster to test quickly

Summarization seems to be broken for T5 score-wise: https://github.com/huggingface/transformers/issues/10733

@sgugger

* specify explicitly the t5 models requiring the special handling
* one more
* update the t5 summarization example to use cnn_dailymail
* move max*samples into the top level README.md
* better wording
* better wording
2021-03-18 09:55:39 -07:00
parent 094afa515d
commit 9352b5151a
2 changed files with 66 additions and 34 deletions
--- a/examples/seq2seq/README.md
+++ b/examples/seq2seq/README.md
@@ -24,10 +24,10 @@ For the old `finetune_trainer.py` and related utils, see [`examples/legacy/seq2s
 ### Supported Architectures

 - `BartForConditionalGeneration`
+- `FSMTForConditionalGeneration` (translation only)
+- `MBartForConditionalGeneration`
 - `MarianMTModel`
 - `PegasusForConditionalGeneration`
-- `MBartForConditionalGeneration`
-- `FSMTForConditionalGeneration` (translation only)
 - `T5ForConditionalGeneration`

 `run_summarization.py` and `run_translation.py` are lightweight examples of how to download and preprocess a dataset from the [🤗 Datasets](https://github.com/huggingface/datasets) library or use your own files (jsonlines or csv), then fine-tune one of the architectures above on it.
@@ -43,17 +43,21 @@ python examples/seq2seq/run_summarization.py \
    --model_name_or_path t5-small \
    --do_train \
    --do_eval \
-    --dataset_name xsum \
+    --dataset_name cnn_dailymail \
+    --dataset_config "3.0.0" \
+    --source_prefix "summarize: " \
    --output_dir /tmp/tst-summarization \
    --per_device_train_batch_size=4 \
    --per_device_eval_batch_size=4 \
    --overwrite_output_dir \
-    --predict_with_generate \
-    --max_train_samples 500 \
-    --max_val_samples 500
+    --predict_with_generate
 ```

-CNN/DailyMail dataset is another commonly used dataset for the task of summarization. To use it replace `--dataset_name xsum` with `--dataset_name cnn_dailymail --dataset_config "3.0.0"`.
+Only the T5 models `t5-small`, `t5-base`, `t5-large`, `t5-3b` and `t5-11b` require the additional argument `--source_prefix "summarize: "`.
+
+We use the CNN/DailyMail dataset in this example because `t5-small` was trained on it, so one can get good scores even when fine-tuning on a very small sample.
+
+The Extreme Summarization (XSum) dataset is another commonly used dataset for the task of summarization. To use it, replace `--dataset_name cnn_dailymail --dataset_config "3.0.0"` with `--dataset_name xsum`.

 And here is how you would use it on your own files, after adjusting the values for the arguments
 `--train_file`, `--validation_file`, `--text_column` and `--summary_column` to match your setup:
@@ -65,13 +69,12 @@ python examples/seq2seq/run_summarization.py \
    --do_eval \
    --train_file path_to_csv_or_jsonlines_file \
    --validation_file path_to_csv_or_jsonlines_file \
+    --source_prefix "summarize: " \
    --output_dir /tmp/tst-summarization \
    --overwrite_output_dir \
    --per_device_train_batch_size=4 \
    --per_device_eval_batch_size=4 \
-    --predict_with_generate \
-    --max_train_samples 500 \
-    --max_val_samples 500
+    --predict_with_generate
 ```

 The task of summarization supports custom CSV and JSONLINES formats.
@@ -135,11 +138,11 @@ And as with the CSV files, you can specify which values to select from the file,

 ### Translation

-Here is an example of a translation fine-tuning with T5:
+Here is an example of fine-tuning a MarianMT model for translation:

 ```bash
 python examples/seq2seq/run_translation.py \
-    --model_name_or_path t5-small \
+    --model_name_or_path Helsinki-NLP/opus-mt-en-ro \
    --do_train \
    --do_eval \
    --source_lang en \
@@ -150,12 +153,35 @@ python examples/seq2seq/run_translation.py \
    --per_device_train_batch_size=4 \
    --per_device_eval_batch_size=4 \
    --overwrite_output_dir \
-    --predict_with_generate \
-    --max_train_samples 500 \
-    --max_val_samples 500
+    --predict_with_generate
 ```

-And the same with MBart:
+MBart and some T5 models require special handling.
+
+The T5 models `t5-small`, `t5-base`, `t5-large`, `t5-3b` and `t5-11b` must use an additional argument: `--source_prefix "translate {source_lang} to {target_lang}: "`, with the language names written out in full and the trailing `: ` included. For example:
+
+```bash
+python examples/seq2seq/run_translation.py \
+    --model_name_or_path t5-small \
+    --do_train \
+    --do_eval \
+    --source_lang en \
+    --target_lang ro \
+    --source_prefix "translate English to Romanian: " \
+    --dataset_name wmt16 \
+    --dataset_config_name ro-en \
+    --output_dir /tmp/tst-translation \
+    --per_device_train_batch_size=4 \
+    --per_device_eval_batch_size=4 \
+    --overwrite_output_dir \
+    --predict_with_generate
+```
+
+If you get a terrible BLEU score, make sure that you didn't forget to use the `--source_prefix` argument.
+
+For the aforementioned group of T5 models, remember that if you switch to a different language pair you must adjust the source and target values in all 3 language-specific command line arguments: `--source_lang`, `--target_lang` and `--source_prefix`.
+
+MBart models require a different format for the `--source_lang` and `--target_lang` values, e.g. `en_XX` instead of `en` and `ro_RO` instead of `ro`. The full MBart specification for language codes can be found [here](https://huggingface.co/facebook/mbart-large-cc25). For example:

 ```bash
 python examples/seq2seq/run_translation.py \
@@ -170,18 +196,9 @@ python examples/seq2seq/run_translation.py \
    --per_device_train_batch_size=4 \
    --per_device_eval_batch_size=4 \
    --overwrite_output_dir \
-    --predict_with_generate \
-    --max_train_samples 500 \
-    --max_val_samples 500
+    --predict_with_generate
 ```

-Note, that depending on the used model additional language-specific command-line arguments are sometimes required. Specifically:
-
-* MBart models require different `--{source,target}_lang` values, e.g. in place of `en` it expects `en_XX`, for `ro` it expects `ro_RO`. The full MBart specification for language codes can be looked up [here](https://huggingface.co/facebook/mbart-large-cc25)
-* T5 models can use a `--source_prefix` argument to override the otherwise automated prefix of the form `translate {source_lang} to {target_lang}` for `run_translation.py` and `summarize: ` for `run_summarization.py`
-
-Also, if you switch to a different language pair, make sure to adjust the source and target values in all command line arguments.
-
 And here is how you would use the translation finetuning on your own files, after adjusting the
 values for the arguments `--train_file`, `--validation_file` to match your setup:

@@ -192,6 +209,7 @@ python examples/seq2seq/run_translation.py \
    --do_eval \
    --source_lang en \
    --target_lang ro \
+    --source_prefix "translate English to Romanian: " \
    --dataset_name wmt16 \
    --dataset_config_name ro-en \
    --train_file path_to_jsonlines_file \
@@ -200,9 +218,7 @@ python examples/seq2seq/run_translation.py \
    --per_device_train_batch_size=4 \
    --per_device_eval_batch_size=4 \
    --overwrite_output_dir \
-    --predict_with_generate \
-    --max_train_samples 500 \
-    --max_val_samples 500
+    --predict_with_generate
 ```

 The task of translation supports only custom JSONLINES files, with each line being a dictionary with a key `"translation"` whose value is another dictionary keyed by the two languages of the pair. For example:
@@ -213,7 +229,7 @@ The task of translation supports only custom JSONLINES files, with each line bei
 ```
 Here the languages are Romanian (`ro`) and English (`en`).
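
As a rough sketch of the JSONLINES layout described above, the snippet below writes a two-line file and checks that every line carries the `"translation"` key. The file location, the variable names and the two sentence pairs are made up for illustration and are not part of the example scripts; real data would come from your own corpus.

```bash
# Hypothetical location for a tiny training file (the name is an assumption).
TRAIN_FILE="${TMPDIR:-/tmp}/tst-translation-train.json"

# One JSON object per line; the inner keys are the two language codes.
cat > "$TRAIN_FILE" <<'EOF'
{"translation": {"en": "Good day.", "ro": "Bună ziua."}}
{"translation": {"en": "Thank you.", "ro": "Mulțumesc."}}
EOF

# Count all lines, then the lines that lack the required top-level key.
# grep exits non-zero when it selects nothing, hence the "|| true".
LINE_COUNT=$(wc -l < "$TRAIN_FILE" | tr -d ' ')
BAD_LINES=$(grep -vc '"translation"' "$TRAIN_FILE" || true)
echo "lines: $LINE_COUNT, lines missing the key: $BAD_LINES"
```

A file built this way is the kind of input the `--train_file` and `--validation_file` arguments are meant to receive. The `grep` check is only a cheap sanity test and does not validate the JSON itself.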

-If you want to use a pre-processed dataset that leads to high bleu scores, but for the `en-de` language pair, you can use `--dataset_name wmt14-en-de-pre-processed`, as following:
+If you want to use a pre-processed dataset that leads to high BLEU scores, but for the `en-de` language pair, you can use `--dataset_name stas/wmt14-en-de-pre-processed`, as follows:

 ```bash
 python examples/seq2seq/run_translation.py \
@@ -222,12 +238,11 @@ python examples/seq2seq/run_translation.py \
    --do_eval \
    --source_lang en \
    --target_lang de \
-    --dataset_name wmt14-en-de-pre-processed \
+    --source_prefix "translate English to German: " \
+    --dataset_name stas/wmt14-en-de-pre-processed \
    --output_dir /tmp/tst-translation \
    --per_device_train_batch_size=4 \
    --per_device_eval_batch_size=4 \
    --overwrite_output_dir \
-    --predict_with_generate \
-    --max_train_samples 500 \
-    --max_val_samples 500
+    --predict_with_generate
 ```
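
Because `--source_lang`, `--target_lang` and `--source_prefix` have to agree for the T5 checkpoints named in this change, one option is to derive all three from a single pair of variables. The following is a minimal sketch under stated assumptions, not something the example scripts provide: the `lang_name` helper, the `SRC` and `TGT` variables and the code-to-name table are hypothetical, and the table covers only the languages that appear in this README.

```bash
# Hypothetical helper (not part of run_translation.py): map a language code
# to the English language name used in the T5 prefix.
lang_name() {
    case "$1" in
        en) echo "English" ;;
        ro) echo "Romanian" ;;
        de) echo "German" ;;
        *)  echo "unsupported language code: $1" >&2; return 1 ;;
    esac
}

# Change the language pair here, and only here.
SRC=en
TGT=ro

# Build the prefix, keeping the trailing ": " used by the examples above.
SOURCE_PREFIX="translate $(lang_name "$SRC") to $(lang_name "$TGT"): "

# Show the three language-specific arguments for inspection before training.
LANG_ARGS="--source_lang $SRC --target_lang $TGT --source_prefix \"$SOURCE_PREFIX\""
echo "$LANG_ARGS"
```

To launch training, pass `--source_lang "$SRC" --target_lang "$TGT" --source_prefix "$SOURCE_PREFIX"` to `run_translation.py` in place of the hard-coded values. Quoting `"$SOURCE_PREFIX"` matters, since the prefix contains spaces.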