split seq2seq script into summarization & translation (#10611)

* split seq2seq script, update docs
* needless diff
* fix readme
* remove test diff
* s/summarization/translation
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* cr
* fix arguments & better mbart/t5 refs
* copyright
  Co-authored-by: Suraj Patil <surajp815@gmail.com>
* reword readme
  Co-authored-by: Suraj Patil <surajp815@gmail.com>
* s/summarization/translation
* short script names
* fix tests
* fix isort, include mbart doc
* delete old script, update tests
* automate source prefix
* automate source prefix for translation
* s/translation/trans
  Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
* fix script name (short version)
* typos
  Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
* exact parameter
  Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
* remove superfluous source_prefix calls in docs
* rename scripts & warn for source prefix
* black
* flake8

Co-authored-by: theo <theo@matussie.re>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Suraj Patil <surajp815@gmail.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
2021-03-15 14:11:42 +01:00
parent 505494a86f
commit 6f840990a7
9 changed files with 653 additions and 168 deletions
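
Migration sketch: the documentation changes in this commit replace ``--task translation_en_to_ro --source_prefix "..."`` with ``--source_lang en --target_lang ro`` on the ``run_translation.py`` command line. The helper below is a hypothetical illustration of that mapping and is not part of the repository. It assumes the old task string always has the form ``translation_<src>_to_<tgt>`` with plain lowercase language codes, which is the only form shown in this diff.

```python
import re
from typing import List


def task_to_lang_flags(task: str) -> List[str]:
    """Map an old-style --task value to the new language flags.

    Hypothetical helper for illustration only. Assumes the task string
    looks like "translation_<src>_to_<tgt>" with lowercase codes.
    """
    match = re.fullmatch(r"translation_([a-z]+)_to_([a-z]+)", task)
    if match is None:
        raise ValueError(f"not a translation task string: {task!r}")
    source_lang, target_lang = match.groups()
    return ["--source_lang", source_lang, "--target_lang", target_lang]


# Old docs: --task translation_en_to_ro
# New docs: --source_lang en --target_lang ro
print(" ".join(task_to_lang_flags("translation_en_to_ro")))
# prints: --source_lang en --target_lang ro
```

Language codes containing underscores or uppercase letters would not match this pattern and would need a looser expression.
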
--- a/docs/source/installation.md
+++ b/docs/source/installation.md
@@ -168,13 +168,13 @@ Here is an example of how this can be used on a filesystem that is shared betwee
 On the instance with the normal network run your program which will download and cache models (and optionally datasets if you use 🤗 Datasets). For example:

 ```
-python examples/seq2seq/run_seq2seq.py --model_name_or_path t5-small --dataset_name wmt16 --dataset_config ro-en ...
+python examples/seq2seq/run_translation.py --model_name_or_path t5-small --dataset_name wmt16 --dataset_config ro-en ...
 ```

 and then with the same filesystem you can now run the same program on a firewalled instance:
 ```
 HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 \
-python examples/seq2seq/run_seq2seq.py --model_name_or_path t5-small --dataset_name wmt16 --dataset_config ro-en ...
+python examples/seq2seq/run_translation.py --model_name_or_path t5-small --dataset_name wmt16 --dataset_config ro-en ...
 ```
 and it should succeed without any hanging waiting to timeout.

--- a/docs/source/main_classes/trainer.rst
+++ b/docs/source/main_classes/trainer.rst
@@ -279,16 +279,16 @@ To deploy this feature:
   and make sure you have added the distributed launcher ``-m torch.distributed.launch
   --nproc_per_node=NUMBER_OF_GPUS_YOU_HAVE`` if you haven't been using it already.

-For example here is how you could use it for ``run_seq2seq.py`` with 2 GPUs:
+For example here is how you could use it for ``run_translation.py`` with 2 GPUs:

 .. code-block:: bash

-    python -m torch.distributed.launch --nproc_per_node=2 examples/seq2seq/run_seq2seq.py \
+    python -m torch.distributed.launch --nproc_per_node=2 examples/seq2seq/run_translation.py \
    --model_name_or_path t5-small --per_device_train_batch_size 1   \
    --output_dir output_dir --overwrite_output_dir \
    --do_train --max_train_samples 500 --num_train_epochs 1 \
    --dataset_name wmt16 --dataset_config "ro-en" \
-    --task translation_en_to_ro --source_prefix "translate English to Romanian: " \
+    --source_lang en --target_lang ro \
    --fp16 --sharded_ddp simple

 Notes:
@@ -304,16 +304,16 @@ Notes:
   to the command line arguments, and make sure you have added the distributed launcher ``-m torch.distributed.launch
   --nproc_per_node=NUMBER_OF_GPUS_YOU_HAVE`` if you haven't been using it already.

-For example here is how you could use it for ``run_seq2seq.py`` with 2 GPUs:
+For example here is how you could use it for ``run_translation.py`` with 2 GPUs:

 .. code-block:: bash

-    python -m torch.distributed.launch --nproc_per_node=2 examples/seq2seq/run_seq2seq.py \
+    python -m torch.distributed.launch --nproc_per_node=2 examples/seq2seq/run_translation.py \
    --model_name_or_path t5-small --per_device_train_batch_size 1   \
    --output_dir output_dir --overwrite_output_dir \
    --do_train --max_train_samples 500 --num_train_epochs 1 \
    --dataset_name wmt16 --dataset_config "ro-en" \
-    --task translation_en_to_ro --source_prefix "translate English to Romanian: " \
+    --source_lang en --target_lang ro \
    --fp16 --sharded_ddp zero_dp_2

 :obj:`zero_dp_2` is an optimized version of the simple wrapper, while :obj:`zero_dp_3` fully shards model weights,
@@ -333,7 +333,7 @@ Notes:

 Known caveats:

-- This feature is incompatible with :obj:`--predict_with_generate` in the `run_seq2seq.py` script.
+- This feature is incompatible with :obj:`--predict_with_generate` in the `run_translation.py` script.
 - Using :obj:`--sharded_ddp zero_dp_3` requires wrapping each layer of the model in the special container
  :obj:`FullyShardedDataParallelism` of fairscale. It should be used with the option :obj:`auto_wrap` if you are not
  doing this yourself: :obj:`--sharded_ddp "zero_dp_3 auto_wrap"`.
@@ -402,17 +402,17 @@ In fact, you can continue using ``-m torch.distributed.launch`` with DeepSpeed a
 the ``deepspeed`` launcher. But since in the DeepSpeed documentation it'll be used everywhere, for consistency we will
 use it here as well.

-Here is an example of running ``run_seq2seq.py`` under DeepSpeed deploying all available GPUs:
+Here is an example of running ``run_translation.py`` under DeepSpeed deploying all available GPUs:

 .. code-block:: bash

-    deepspeed examples/seq2seq/run_seq2seq.py \
+    deepspeed examples/seq2seq/run_translation.py \
    --deepspeed examples/tests/deepspeed/ds_config.json \
    --model_name_or_path t5-small --per_device_train_batch_size 1   \
    --output_dir output_dir --overwrite_output_dir --fp16 \
    --do_train --max_train_samples 500 --num_train_epochs 1 \
    --dataset_name wmt16 --dataset_config "ro-en" \
-    --task translation_en_to_ro --source_prefix "translate English to Romanian: "
+    --source_lang en --target_lang ro


 Note that in the DeepSpeed documentation you are likely to see ``--deepspeed --deepspeed_config ds_config.json`` - i.e.
@@ -431,13 +431,13 @@ To deploy DeepSpeed with one GPU adjust the :class:`~transformers.Trainer` comma

 .. code-block:: bash

-    deepspeed --num_gpus=1 examples/seq2seq/run_seq2seq.py \
+    deepspeed --num_gpus=1 examples/seq2seq/run_translation.py \
    --deepspeed examples/tests/deepspeed/ds_config.json \
    --model_name_or_path t5-small --per_device_train_batch_size 1   \
    --output_dir output_dir --overwrite_output_dir --fp16 \
    --do_train --max_train_samples 500 --num_train_epochs 1 \
    --dataset_name wmt16 --dataset_config "ro-en" \
-    --task translation_en_to_ro --source_prefix "translate English to Romanian: "
+    --source_lang en --target_lang ro

 This is almost the same as with multiple-GPUs, but here we tell DeepSpeed explicitly to use just one GPU. By default,
 DeepSpeed deploys all GPUs it can see. If you have only 1 GPU to start with, then you don't need this argument. The
@@ -483,7 +483,7 @@ Notes:

   .. code-block:: bash

-       deepspeed --include localhost:1 examples/seq2seq/run_seq2seq.py ...
+       deepspeed --include localhost:1 examples/seq2seq/run_translation.py ...

   In this example, we tell DeepSpeed to use GPU 1 (second gpu).

@@ -574,7 +574,7 @@ with:

 .. code-block::

-   !deepspeed examples/seq2seq/run_seq2seq.py ...
+   !deepspeed examples/seq2seq/run_translation.py ...

 or with bash magic, where you can write a multi-line code for the shell to run:

@@ -583,7 +583,7 @@ or with bash magic, where you can write a multi-line code for the shell to run:
   %%bash

   cd /somewhere
-   deepspeed examples/seq2seq/run_seq2seq.py ...
+   deepspeed examples/seq2seq/run_translation.py ...



--- a/docs/source/task_summary.rst
+++ b/docs/source/task_summary.rst
@@ -742,8 +742,8 @@ Summarization
 -----------------------------------------------------------------------------------------------------------------------

 Summarization is the task of summarizing a document or an article into a shorter text. If you would like to fine-tune a
-model on a summarization task, you may leverage the `run_seq2seq.py
-<https://github.com/huggingface/transformers/tree/master/examples/seq2seq/run_seq2seq.py>`__ script.
+model on a summarization task, you may leverage the `run_summarization.py
+<https://github.com/huggingface/transformers/tree/master/examples/seq2seq/run_summarization.py>`__ script.

 An example of a summarization dataset is the CNN / Daily Mail dataset, which consists of long news articles and was
 created for the task of summarization. If you would like to fine-tune a model on a summarization task, various
@@ -822,8 +822,8 @@ Translation
 -----------------------------------------------------------------------------------------------------------------------

 Translation is the task of translating a text from one language to another. If you would like to fine-tune a model on a
-translation task, you may leverage the `run_seq2seq.py
-<https://github.com/huggingface/transformers/tree/master/examples/seq2seq/run_seq2seq.py>`__ script.
+translation task, you may leverage the `run_translation.py
+<https://github.com/huggingface/transformers/tree/master/examples/seq2seq/run_translation.py>`__ script.

 An example of a translation dataset is the WMT English to German dataset, which has sentences in English as the input
 data and the corresponding sentences in German as the target data. If you would like to fine-tune a model on a