Add support for ZeRO-2/3 and ZeRO-offload in fairscale (#10354)

* Ass support for ZeRO-2/3 and ZeRO-offload in fairscale * Quality * Rework from review comments * Add doc * Apply suggestions from code review Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> * Address review comments Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
2021-02-25 11:07:53 -05:00
parent 88cc26dcd1
commit 9d14be5c20
5 changed files with 193 additions and 46 deletions
--- a/docs/source/main_classes/trainer.rst
+++ b/docs/source/main_classes/trainer.rst
@@ -241,6 +241,8 @@ provides support for the following features from `the ZeRO paper <https://arxiv.

 1. Optimizer State Sharding
 2. Gradient Sharding
+3. Model Parameters Sharding (new and very experimental)
+4. CPU offload (new and very experimental)

 You will need at least two GPUs to use this feature.

@@ -255,8 +257,9 @@ To deploy this feature:
   or find more details on `the FairScale's GitHub page
   <https://github.com/facebookresearch/fairscale/#installation>`__.

-2. Add ``--sharded_ddp`` to the command line arguments, and make sure you have added the distributed launcher ``-m
-   torch.distributed.launch --nproc_per_node=NUMBER_OF_GPUS_YOU_HAVE`` if you haven't been using it already.
+2. To use the first version of Sharded data-parallelism, add ``--sharded_ddp simple`` to the command line arguments,
+   and make sure you have added the distributed launcher ``-m torch.distributed.launch
+   --nproc_per_node=NUMBER_OF_GPUS_YOU_HAVE`` if you haven't been using it already.

 For example here is how you could use it for ``run_seq2seq.py`` with 2 GPUs:

@@ -268,17 +271,55 @@ For example here is how you could use it for ``run_seq2seq.py`` with 2 GPUs:
    --do_train --max_train_samples 500 --num_train_epochs 1 \
    --dataset_name wmt16 --dataset_config "ro-en" \
    --task translation_en_to_ro --source_prefix "translate English to Romanian: " \
-    --fp16 --sharded_ddp
+    --fp16 --sharded_ddp simple

 Notes:

 - This feature requires distributed training (so multiple GPUs).
 - It is not implemented for TPUs.
 - It works with ``--fp16`` too, to make things even faster.
- One of the main benefits of enabling ``--sharded_ddp`` is that it uses a lot less GPU memory, so you should be able
-  to use significantly larger batch sizes using the same hardware (e.g. 3x and even bigger) which should lead to
+- One of the main benefits of enabling ``--sharded_ddp simple`` is that it uses a lot less GPU memory, so you should be
+  able to use significantly larger batch sizes using the same hardware (e.g. 3x and even bigger) which should lead to
  significantly shorter training time.

+3. To use the second version of Sharded data-parallelism, add ``--sharded_ddp zero_dp_2`` or ``--sharded_ddp zero_dp_3`
+   to the command line arguments, and make sure you have added the distributed launcher ``-m torch.distributed.launch
+   --nproc_per_node=NUMBER_OF_GPUS_YOU_HAVE`` if you haven't been using it already.
+
+For example here is how you could use it for ``run_seq2seq.py`` with 2 GPUs:
+
+.. code-block:: bash
+
+    python -m torch.distributed.launch --nproc_per_node=2 examples/seq2seq/run_seq2seq.py \
+    --model_name_or_path t5-small --per_device_train_batch_size 1   \
+    --output_dir output_dir --overwrite_output_dir \
+    --do_train --max_train_samples 500 --num_train_epochs 1 \
+    --dataset_name wmt16 --dataset_config "ro-en" \
+    --task translation_en_to_ro --source_prefix "translate English to Romanian: " \
+    --fp16 --sharded_ddp zero_dp_2
+
+:obj:`zero_dp_2` is an optimized version of the simple wrapper, while :obj:`zero_dp_3` fully shards model weights,
+gradients and optimizer states.
+
+Both are compatible with adding :obj:`cpu_offload` to enable ZeRO-offload (activate it like this: :obj:`--sharded_ddp
+"zero_dp_2 cpu_offload"`).
+
+Notes:
+
+- This feature requires distributed training (so multiple GPUs).
+- It is not implemented for TPUs.
+- It works with ``--fp16`` too, to make things even faster.
+- The ``cpu_offload`` additional option requires ``--fp16``.
+- This is an area of active development, so make sure you have a source install of fairscale to use this feature as
+  some bugs you encounter may have been fixed there already.
+
+Known caveats:
+
+- This feature is incompatible with :obj:`--predict_with_generate` in the `run_seq2seq.py` script.
+- Using :obj:`--sharded_ddp zero_dp_3` requires wrapping each layer of the model in the special container
+  :obj:`FullyShardedDataParallelism` of fairscale. This is not done automatically by any of the example scripts of the
+  :class:`~transformers.Trainer`.
+

 DeepSpeed
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^