Add support for ZeRO-2/3 and ZeRO-offload in fairscale (#10354)
* Ass support for ZeRO-2/3 and ZeRO-offload in fairscale * Quality * Rework from review comments * Add doc * Apply suggestions from code review Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> * Address review comments Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
This commit is contained in:
@@ -241,6 +241,8 @@ provides support for the following features from `the ZeRO paper <https://arxiv.
|
||||
|
||||
1. Optimizer State Sharding
|
||||
2. Gradient Sharding
|
||||
3. Model Parameters Sharding (new and very experimental)
|
||||
4. CPU offload (new and very experimental)
|
||||
|
||||
You will need at least two GPUs to use this feature.
|
||||
|
||||
@@ -255,8 +257,9 @@ To deploy this feature:
|
||||
or find more details on `the FairScale's GitHub page
|
||||
<https://github.com/facebookresearch/fairscale/#installation>`__.
|
||||
|
||||
2. Add ``--sharded_ddp`` to the command line arguments, and make sure you have added the distributed launcher ``-m
|
||||
torch.distributed.launch --nproc_per_node=NUMBER_OF_GPUS_YOU_HAVE`` if you haven't been using it already.
|
||||
2. To use the first version of Sharded data-parallelism, add ``--sharded_ddp simple`` to the command line arguments,
|
||||
and make sure you have added the distributed launcher ``-m torch.distributed.launch
|
||||
--nproc_per_node=NUMBER_OF_GPUS_YOU_HAVE`` if you haven't been using it already.
|
||||
|
||||
For example here is how you could use it for ``run_seq2seq.py`` with 2 GPUs:
|
||||
|
||||
@@ -268,17 +271,55 @@ For example here is how you could use it for ``run_seq2seq.py`` with 2 GPUs:
|
||||
--do_train --max_train_samples 500 --num_train_epochs 1 \
|
||||
--dataset_name wmt16 --dataset_config "ro-en" \
|
||||
--task translation_en_to_ro --source_prefix "translate English to Romanian: " \
|
||||
--fp16 --sharded_ddp
|
||||
--fp16 --sharded_ddp simple
|
||||
|
||||
Notes:
|
||||
|
||||
- This feature requires distributed training (so multiple GPUs).
|
||||
- It is not implemented for TPUs.
|
||||
- It works with ``--fp16`` too, to make things even faster.
|
||||
- One of the main benefits of enabling ``--sharded_ddp`` is that it uses a lot less GPU memory, so you should be able
|
||||
to use significantly larger batch sizes using the same hardware (e.g. 3x and even bigger) which should lead to
|
||||
- One of the main benefits of enabling ``--sharded_ddp simple`` is that it uses a lot less GPU memory, so you should be
|
||||
able to use significantly larger batch sizes using the same hardware (e.g. 3x and even bigger) which should lead to
|
||||
significantly shorter training time.
|
||||
|
||||
3. To use the second version of Sharded data-parallelism, add ``--sharded_ddp zero_dp_2`` or ``--sharded_ddp zero_dp_3`
|
||||
to the command line arguments, and make sure you have added the distributed launcher ``-m torch.distributed.launch
|
||||
--nproc_per_node=NUMBER_OF_GPUS_YOU_HAVE`` if you haven't been using it already.
|
||||
|
||||
For example here is how you could use it for ``run_seq2seq.py`` with 2 GPUs:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
python -m torch.distributed.launch --nproc_per_node=2 examples/seq2seq/run_seq2seq.py \
|
||||
--model_name_or_path t5-small --per_device_train_batch_size 1 \
|
||||
--output_dir output_dir --overwrite_output_dir \
|
||||
--do_train --max_train_samples 500 --num_train_epochs 1 \
|
||||
--dataset_name wmt16 --dataset_config "ro-en" \
|
||||
--task translation_en_to_ro --source_prefix "translate English to Romanian: " \
|
||||
--fp16 --sharded_ddp zero_dp_2
|
||||
|
||||
:obj:`zero_dp_2` is an optimized version of the simple wrapper, while :obj:`zero_dp_3` fully shards model weights,
|
||||
gradients and optimizer states.
|
||||
|
||||
Both are compatible with adding :obj:`cpu_offload` to enable ZeRO-offload (activate it like this: :obj:`--sharded_ddp
|
||||
"zero_dp_2 cpu_offload"`).
|
||||
|
||||
Notes:
|
||||
|
||||
- This feature requires distributed training (so multiple GPUs).
|
||||
- It is not implemented for TPUs.
|
||||
- It works with ``--fp16`` too, to make things even faster.
|
||||
- The ``cpu_offload`` additional option requires ``--fp16``.
|
||||
- This is an area of active development, so make sure you have a source install of fairscale to use this feature as
|
||||
some bugs you encounter may have been fixed there already.
|
||||
|
||||
Known caveats:
|
||||
|
||||
- This feature is incompatible with :obj:`--predict_with_generate` in the `run_seq2seq.py` script.
|
||||
- Using :obj:`--sharded_ddp zero_dp_3` requires wrapping each layer of the model in the special container
|
||||
:obj:`FullyShardedDataParallelism` of fairscale. This is not done automatically by any of the example scripts of the
|
||||
:class:`~transformers.Trainer`.
|
||||
|
||||
|
||||
DeepSpeed
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Reference in New Issue
Block a user