[docs] outline sharded ddp doc (#9208)

* outline sharded dpp doc
* fix link
* add example
* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* narrow the command and remove non-essentials

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-01-05 17:34:15 -08:00
parent eef66035a2
commit d64372fdfc
1 changed files with 40 additions and 0 deletions
--- a/docs/source/training.rst
+++ b/docs/source/training.rst
@@ -278,6 +278,46 @@ pass it to the trainer.
 Finally, you can view the results, including any calculated metrics, by launching tensorboard in your specified
 ``logging_dir`` directory.
 Trainer Integrations
 -----------------------------------------------------------------------------------------------------------------------
 The trainer is being extended to support experimental libraries that may dramatically reduce your training time and
 let you fit bigger models.
 The main part that is being integrated at the moment is based on the paper `ZeRO: Memory Optimizations Toward Training
 Trillion Parameter Models, by Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He
 <https://arxiv.org/abs/1910.02054>`__.
 The ``--sharded_ddp`` trainer argument already enables the following features from this paper:

 * Optimizer State Sharding
 * Gradient Sharding

 This is implemented via `fairscale <https://github.com/facebookresearch/fairscale/>`__, so you will have to install
 this library.
 This feature requires distributed training (so multiple GPUs) and is not implemented for TPUs.
 For example, here is how you could use it with ``finetune_trainer.py``:

 .. code-block:: bash

    cd examples/seq2seq
    python -m torch.distributed.launch --nproc_per_node=2 ./finetune_trainer.py \
    --model_name_or_path sshleifer/distill-mbart-en-ro-12-4 --data_dir wmt_en_ro \
    --output_dir output_dir --overwrite_output_dir \
    --do_train --n_train 500 --num_train_epochs 1 \
    --per_device_train_batch_size 1  --freeze_embeds \
    --src_lang en_XX --tgt_lang ro_RO --task translation \
    --fp16 --sharded_ddp
 Note that ``--sharded_ddp`` can be combined with ``--fp16``, as in the command above, to make things even faster.
 One of the main benefits of enabling ``--sharded_ddp`` is that it uses a lot less GPU memory, so you should be able to
 use significantly larger batch sizes on the same hardware (e.g. 3x or bigger).
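 The following is a rough sketch of where the memory savings come from, not a description of what fairscale actually
 allocates. It assumes the per-parameter byte counts that the ZeRO paper gives for mixed-precision Adam (2 bytes of
 fp16 parameters, 2 bytes of fp16 gradients, 12 bytes of optimizer state). The function name ``model_state_bytes`` and
 the 1B-parameter model are hypothetical and used only for illustration.

```python
# Per-parameter byte counts follow the ZeRO paper's analysis of
# mixed-precision Adam; they are assumptions, not fairscale measurements.
FP16_PARAM_BYTES = 2
FP16_GRAD_BYTES = 2
OPTIMIZER_BYTES = 12  # fp32 copy of the parameters + Adam momentum + Adam variance


def model_state_bytes(num_params, num_gpus, shard_optimizer=False, shard_gradients=False):
    """Estimate the per-GPU bytes taken by parameters, gradients and optimizer state."""
    params = FP16_PARAM_BYTES * num_params
    grads = FP16_GRAD_BYTES * num_params
    optim = OPTIMIZER_BYTES * num_params
    if shard_optimizer:
        # Each GPU keeps only its own slice of the optimizer state.
        optim /= num_gpus
    if shard_gradients:
        # Each GPU keeps only the gradients for its own slice.
        grads /= num_gpus
    return params + grads + optim


if __name__ == "__main__":
    num_params = 1_000_000_000  # hypothetical 1B-parameter model
    for gpus in (2, 8):
        baseline = model_state_bytes(num_params, gpus)
        sharded = model_state_bytes(
            num_params, gpus, shard_optimizer=True, shard_gradients=True
        )
        print(f"{gpus} GPUs: {baseline / 2**30:.1f} GiB -> {sharded / 2**30:.1f} GiB per GPU")
```

 Activations and temporary buffers are left out of this estimate, so the batch size you can actually reach depends on
 the model and the sequence length.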
 Eventually, more parts of the paper will be supported by integrating `DeepSpeed
 <https://github.com/microsoft/DeepSpeed>`__.
 .. _additional-resources: