From d64372fdfcbb49fb5b2dddd44cecfea76d6c5d2c Mon Sep 17 00:00:00 2001
From: Stas Bekman <stas00@users.noreply.github.com>
Date: Tue, 5 Jan 2021 17:34:15 -0800
Subject: [PATCH] [docs] outline sharded ddp doc (#9208)

* outline sharded ddp doc

* fix link

* add example

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* narrow the command and remove non-essentials

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
---
 docs/source/training.rst | 40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/docs/source/training.rst b/docs/source/training.rst
index 7daaaaa99a..75dcc75cb2 100644
--- a/docs/source/training.rst
+++ b/docs/source/training.rst
@@ -278,6 +278,46 @@ pass it to the trainer.
 Finally, you can view the results, including any calculated metrics, by launching tensorboard in your specified
 ``logging_dir`` directory.
 
+Trainer Integrations
+-----------------------------------------------------------------------------------------------------------------------
+
+The trainer is being extended to support experimental libraries that may dramatically reduce your training time and
+let you fit bigger models.
+
+The main part that is being integrated at the moment is based on the paper `ZeRO: Memory Optimizations Toward Training
+Trillion Parameter Models, by Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He
+<https://arxiv.org/abs/1910.02054>`__.
+
+You can already deploy the following features from this paper:
+
+* Optimizer State Sharding
+* Gradient Sharding
+
+using the ``--sharded_ddp`` trainer argument. This is implemented via `fairscale
+<https://github.com/facebookresearch/fairscale/>`__, so you will have to install this library.
+
+This feature requires distributed training (so multiple GPUs) and is not implemented for TPUs.
+
+For example, here is how you could use it with ``finetune_trainer.py``:
+
+.. code-block:: bash
+
+    cd examples/seq2seq
+    python -m torch.distributed.launch --nproc_per_node=2 ./finetune_trainer.py \
+    --model_name_or_path sshleifer/distill-mbart-en-ro-12-4 --data_dir wmt_en_ro \
+    --output_dir output_dir --overwrite_output_dir \
+    --do_train --n_train 500 --num_train_epochs 1 \
+    --per_device_train_batch_size 1 --freeze_embeds \
+    --src_lang en_XX --tgt_lang ro_RO --task translation \
+    --fp16 --sharded_ddp
+
+Note that it works with ``--fp16`` too, which makes things even faster.
+
+One of the main benefits of enabling ``--sharded_ddp`` is that it uses a lot less GPU memory, so you should be able
+to use significantly larger batch sizes on the same hardware (e.g. 3x or more).
+
+Eventually, more parts will be supported by integrating `DeepSpeed <https://github.com/microsoft/DeepSpeed>`__.
+
 
 .. _additional-resources: