From d64372fdfcbb49fb5b2dddd44cecfea76d6c5d2c Mon Sep 17 00:00:00 2001
From: Stas Bekman
Date: Tue, 5 Jan 2021 17:34:15 -0800
Subject: [PATCH] [docs] outline sharded ddp doc (#9208)

* outline sharded dpp doc

* fix link

* add example

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* narrow the command and remove non-essentials

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
---
 docs/source/training.rst | 40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/docs/source/training.rst b/docs/source/training.rst
index 7daaaaa99a..75dcc75cb2 100644
--- a/docs/source/training.rst
+++ b/docs/source/training.rst
@@ -278,6 +278,46 @@ pass it to the trainer.
 Finally, you can view the results, including any calculated metrics, by launching tensorboard in your specified
 ``logging_dir`` directory.
 
+Trainer Integrations
+-----------------------------------------------------------------------------------------------------------------------
+
+The trainer is being extended to support experimental libraries that may dramatically reduce your training time and
+let you fit bigger models.
+
+The main part that is being integrated at the moment is based on the paper `ZeRO: Memory Optimizations Toward Training
+Trillion Parameter Models, by Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He
+`__.
+
+You can already deploy the following features from this paper:
+
+* Optimizer State Sharding
+* Gradient Sharding
+
+using the ``--sharded_ddp`` trainer argument. This is implemented via `fairscale
+`__, so you will have to install this library.
+
+This feature requires distributed training (so multiple GPUs) and is not implemented for TPUs.
+
+For example, here is how you could use it for ``finetune_trainer.py``:
+
+.. code-block:: bash
+
+    cd examples/seq2seq
+    python -m torch.distributed.launch --nproc_per_node=2 ./finetune_trainer.py \
+    --model_name_or_path sshleifer/distill-mbart-en-ro-12-4 --data_dir wmt_en_ro \
+    --output_dir output_dir --overwrite_output_dir \
+    --do_train --n_train 500 --num_train_epochs 1 \
+    --per_device_train_batch_size 1 --freeze_embeds \
+    --src_lang en_XX --tgt_lang ro_RO --task translation \
+    --fp16 --sharded_ddp
+
+Note that it works with ``--fp16`` too, to make things even faster.
+
+One of the main benefits of enabling ``--sharded_ddp`` is that it uses a lot less GPU memory, so you should be able to
+use significantly larger batch sizes on the same hardware (e.g. 3x or bigger).
+
+Eventually more parts will be supported via integrating `DeepSpeed `__.
+
 
 .. _additional-resources:
 