From 82498cbc37d5c15520c7bddde5d804c804eee498 Mon Sep 17 00:00:00 2001
From: Stas Bekman <stas00@users.noreply.github.com>
Date: Thu, 14 Jan 2021 11:05:04 -0800
Subject: [PATCH] [deepspeed doc] install issues + 1-gpu deployment (#9582)

* [doc] install + 1-gpu deployment

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* improvements

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
---
 docs/source/main_classes/trainer.rst | 283 ++++++++++++++++++++++-----
 1 file changed, 236 insertions(+), 47 deletions(-)

diff --git a/docs/source/main_classes/trainer.rst b/docs/source/main_classes/trainer.rst
index b30efdce4b..c0ce5c4e25 100644
--- a/docs/source/main_classes/trainer.rst
+++ b/docs/source/main_classes/trainer.rst
@@ -113,7 +113,125 @@ Toward Training Trillion Parameter Models, by Samyam Rajbhandari, Jeff Rasley, O
 
 This provided support is new and experimental as of this writing.
 
-You will need at least 2 GPUs to benefit from these features.
+Installation Notes
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+As of this writing, both FairScale and Deepspeed require compilation of CUDA C++ code, before they can be used.
+
+While all installation issues should be dealt with through the corresponding GitHub Issues of `FairScale
+<https://github.com/facebookresearch/fairscale/issues>`__ and `Deepspeed
+<https://github.com/microsoft/DeepSpeed/issues>`__, there are a few common issues that one may encounter while building
+any PyTorch extension that needs to build CUDA extensions.
+
+Therefore, if you encounter a CUDA-related build issue while doing one of the following or both:
+
+.. code-block:: bash
+
+    pip install fairscale
+    pip install deepspeed
+
+please, read the following notes first.
+
+In these notes we give examples for what to do when ``pytorch`` has been built with CUDA ``10.2``. If your situation is
+different remember to adjust the version number to the one you are after.
+
+**Possible problem #1:**
+
+While, Pytorch comes with its own CUDA toolkit, to build these two projects you must have an identical version of CUDA
+installed system-wide.
+
+For example, if you installed ``pytorch`` with ``cudatoolkit==10.2`` in the Python environment, you also need to have
+CUDA ``10.2`` installed system-wide.
+
+The exact location may vary from system to system, but ``/usr/local/cuda-10.2`` is the most common location on many
+Unix systems. When CUDA is correctly set up and added to the ``PATH`` environment variable, one can find the
+installation location by doing:
+
+.. code-block:: bash
+
+    which nvcc
+
+If you don't have CUDA installed system-wide, install it first. You will find the instructions by using your favorite
+search engine. For example, if you're on Ubuntu you may want to search for: `ubuntu cuda 10.2 install
+<https://www.google.com/search?q=ubuntu+cuda+10.2+install>`__.
+
+**Possible problem #2:**
+
+Another possible common problem is that you may have more than one CUDA toolkit installed system-wide. For example you
+may have:
+
+.. code-block:: bash
+
+    /usr/local/cuda-10.2
+    /usr/local/cuda-11.0
+
+Now, in this situation you need to make sure that your ``PATH`` and ``LD_LIBRARY_PATH`` environment variables contain
+the correct paths to the desired CUDA version. Typically, package installers will set these to contain whatever the
+last version was installed. If you encounter the problem, where the package build fails because it can't find the right
+CUDA version despite you having it installed system-wide, it means that you need to adjust the 2 aforementioned
+environment variables.
+
+First, you may look at their contents:
+
+.. code-block:: bash
+
+    echo $PATH
+    echo $LD_LIBRARY_PATH
+
+so you get an idea of what is inside.
+
+It's possible that ``LD_LIBRARY_PATH`` is empty.
+
+``PATH`` lists the locations of where executables can be found and ``LD_LIBRARY_PATH`` is for where shared libraries
+are to looked for. In both cases, earlier entries have priority over the later ones. ``:`` is used to separate multiple
+entries.
+
+Now, to tell the build program where to find the specific CUDA toolkit, insert the desired paths to be listed first by
+doing:
+
+.. code-block:: bash
+
+    export PATH=/usr/local/cuda-10.2/bin:$PATH
+    export LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64:$LD_LIBRARY_PATH
+
+Note that we aren't overwriting the existing values, but prepending instead.
+
+Of course, adjust the version number, the full path if need be. Check that the directories you assign actually do
+exist. ``lib64`` sub-directory is where the various CUDA ``.so`` objects, like ``libcudart.so`` reside, it's unlikely
+that your system will have it named differently, but if it is adjust it to reflect your reality.
+
+
+**Possible problem #3:**
+
+Some older CUDA versions may refuse to build with newer compilers. For example, you my have ``gcc-9`` but it wants
+``gcc-7``.
+
+There are various ways to go about it.
+
+If you can install the latest CUDA toolkit it typically should support the newer compiler.
+
+Alternatively, you could install the lower version of the compiler in addition to the one you already have, or you may
+already have it but it's not the default one, so the build system can't see it. If you have ``gcc-7`` installed but the
+build system complains it can't find it, the following might do the trick:
+
+.. code-block:: bash
+
+    sudo ln -s /usr/bin/gcc-7  /usr/local/cuda-10.2/bin/gcc
+    sudo ln -s /usr/bin/g++-7  /usr/local/cuda-10.2/bin/g++
+
+
+Here, we are making a symlink to ``gcc-7`` from ``/usr/local/cuda-10.2/bin/gcc`` and since
+``/usr/local/cuda-10.2/bin/`` should be in the ``PATH`` environment variable (see the previous problem's solution), it
+should find ``gcc-7`` (and ``g++7``) and then the build will succeed.
+
+As always make sure to edit the paths in the example to match your situation.
+
+**If still unsuccessful:**
+
+If after addressing these you still encounter build issues, please, proceed with the GitHub Issue of `FairScale
+<https://github.com/facebookresearch/fairscale/issues>`__ and `Deepspeed
+<https://github.com/microsoft/DeepSpeed/issues>`__, depending on the project you have the problem with.
+
 
 FairScale
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -124,6 +242,8 @@ provides support for the following features from `the ZeRO paper <https://arxiv.
 1. Optimizer State Sharding
 2. Gradient Sharding
 
+You will need at least two GPUs to use this feature.
+
 To deploy this feature:
 
 1. Install the library via pypi:
@@ -132,7 +252,7 @@ To deploy this feature:
 
        pip install fairscale
 
-   or find more details on `the FairScale's github page
+   or find more details on `the FairScale's GitHub page
    <https://github.com/facebookresearch/fairscale/#installation>`__.
 
 2. Add ``--sharded_ddp`` to the command line arguments, and make sure you have added the distributed launcher ``-m
@@ -164,7 +284,6 @@ Notes:
 DeepSpeed
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-
 `DeepSpeed <https://github.com/microsoft/DeepSpeed>`__ implements everything described in the `ZeRO paper
 <https://arxiv.org/abs/1910.02054>`__, except ZeRO's stage 3. "Parameter Partitioning (Pos+g+p)". Currently it provides
 full support for:
@@ -172,58 +291,119 @@ full support for:
 1. Optimizer State Partitioning (ZeRO stage 1)
 2. Add Gradient Partitioning (ZeRO stage 2)
 
-To deploy this feature:
+Installation
+=======================================================================================================================
 
-1. Install the library via pypi:
+Install the library via pypi:
 
-   .. code-block:: bash
+.. code-block:: bash
 
-       pip install deepspeed
+    pip install deepspeed
 
-   or find more details on `the DeepSpeed's github page <https://github.com/microsoft/deepspeed#installation>`__.
+or find more details on `the DeepSpeed's GitHub page <https://github.com/microsoft/deepspeed#installation>`__.
 
-2. Adjust the :class:`~transformers.Trainer` command line arguments as following:
+Deployment with multiple GPUs
+=======================================================================================================================
 
-   1. replace ``python -m torch.distributed.launch`` with ``deepspeed``.
-   2. add a new argument ``--deepspeed ds_config.json``, where ``ds_config.json`` is the DeepSpeed configuration file
-      as documented `here <https://www.deepspeed.ai/docs/config-json/>`__. The file naming is up to you.
+To deploy this feature with multiple GPUs adjust the :class:`~transformers.Trainer` command line arguments as
+following:
 
-   Therefore, if your original command line looked as following:
+1. replace ``python -m torch.distributed.launch`` with ``deepspeed``.
+2. add a new argument ``--deepspeed ds_config.json``, where ``ds_config.json`` is the DeepSpeed configuration file as
+   documented `here <https://www.deepspeed.ai/docs/config-json/>`__. The file naming is up to you.
 
-   .. code-block:: bash
+Therefore, if your original command line looked as following:
 
-       python -m torch.distributed.launch --nproc_per_node=2 your_program.py <normal cl args>
+.. code-block:: bash
 
-   Now it should be:
+    python -m torch.distributed.launch --nproc_per_node=2 your_program.py <normal cl args>
 
-   .. code-block:: bash
+Now it should be:
 
-       deepspeed --num_gpus=2 your_program.py <normal cl args> --deepspeed ds_config.json
+.. code-block:: bash
 
-   Unlike, ``torch.distributed.launch`` where you have to specify how many GPUs to use with ``--nproc_per_node``, with
-   the ``deepspeed`` launcher you don't have to use the corresponding ``--num_gpus`` if you want all of your GPUs used.
-   The full details on how to configure various nodes and GPUs can be found `here
-   <https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node>`__.
+    deepspeed --num_gpus=2 your_program.py <normal cl args> --deepspeed ds_config.json
 
-   Here is an example of running ``finetune_trainer.py`` under DeepSpeed deploying all available GPUs:
+Unlike, ``torch.distributed.launch`` where you have to specify how many GPUs to use with ``--nproc_per_node``, with the
+``deepspeed`` launcher you don't have to use the corresponding ``--num_gpus`` if you want all of your GPUs used. The
+full details on how to configure various nodes and GPUs can be found `here
+<https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node>`__.
 
-   .. code-block:: bash
+Here is an example of running ``finetune_trainer.py`` under DeepSpeed deploying all available GPUs:
 
-       cd examples/seq2seq
-       deepspeed ./finetune_trainer.py --deepspeed ds_config.json \
-       --model_name_or_path sshleifer/distill-mbart-en-ro-12-4 --data_dir wmt_en_ro \
-       --output_dir output_dir --overwrite_output_dir \
-       --do_train --n_train 500 --num_train_epochs 1 \
-       --per_device_train_batch_size 1  --freeze_embeds \
-       --src_lang en_XX --tgt_lang ro_RO --task translation
+.. code-block:: bash
 
-   Note that in the DeepSpeed documentation you are likely to see ``--deepspeed --deepspeed_config ds_config.json`` -
-   i.e. two DeepSpeed-related arguments, but for the sake of simplicity, and since there are already so many arguments
-   to deal with, we combined the two into a single argument.
+    cd examples/seq2seq
+    deepspeed ./finetune_trainer.py --deepspeed ds_config.json \
+    --model_name_or_path sshleifer/distill-mbart-en-ro-12-4 --data_dir wmt_en_ro \
+    --output_dir output_dir --overwrite_output_dir \
+    --do_train --n_train 500 --num_train_epochs 1 \
+    --per_device_train_batch_size 1  --freeze_embeds \
+    --src_lang en_XX --tgt_lang ro_RO --task translation
 
-Before you can deploy DeepSpeed, let's discuss its configuration.
+Note that in the DeepSpeed documentation you are likely to see ``--deepspeed --deepspeed_config ds_config.json`` - i.e.
+two DeepSpeed-related arguments, but for the sake of simplicity, and since there are already so many arguments to deal
+with, we combined the two into a single argument.
 
-**Configuration:**
+For some practical usage examples, please, see this `post
+<https://github.com/huggingface/transformers/issues/8771#issuecomment-759248400>`__.
+
+
+
+Deployment with one GPU
+=======================================================================================================================
+
+To deploy DeepSpeed with one GPU adjust the :class:`~transformers.Trainer` command line arguments as following:
+
+.. code-block:: bash
+
+    cd examples/seq2seq
+    deepspeed --num_gpus=1 ./finetune_trainer.py --deepspeed ds_config.json \
+    --model_name_or_path sshleifer/distill-mbart-en-ro-12-4 --data_dir wmt_en_ro \
+    --output_dir output_dir --overwrite_output_dir \
+    --do_train --n_train 500 --num_train_epochs 1 \
+    --per_device_train_batch_size 1  --freeze_embeds \
+    --src_lang en_XX --tgt_lang ro_RO --task translation
+
+This is almost the same as with multiple-GPUs, but here we tell DeepSpeed explicitly to use just one GPU. By default,
+DeepSpeed deploys all GPUs it can see. If you have only 1 GPU to start with, then you don't need this argument. The
+following `documentation <https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node>`__ discusses the
+launcher options.
+
+Why would you want to use DeepSpeed with just one GPU?
+
+1. It has a ZeRO-offload feature which can delegate some computations and memory to the host's CPU and RAM, and thus
+   leave more GPU resources for model's needs - e.g. larger batch size, or enabling a fitting of a very big model which
+   normally won't fit.
+2. It provides a smart GPU memory management system, that minimizes memory fragmentation, which again allows you to fit
+   bigger models and data batches.
+
+While we are going to discuss the configuration in details next, the key to getting a huge improvement on a single GPU
+with DeepSpeed is to have at least the following configuration in the configuration file:
+
+.. code-block:: json
+
+  {
+    "zero_optimization": {
+       "stage": 2,
+       "allgather_partitions": true,
+       "allgather_bucket_size": 2e8,
+       "reduce_scatter": true,
+       "reduce_bucket_size": 2e8,
+       "overlap_comm": true,
+       "contiguous_gradients": true,
+       "cpu_offload": true
+    },
+  }
+
+which enables ``cpu_offload`` and some other important features. You may experiment with the buffer sizes, you will
+find more details in the discussion below.
+
+For a practical usage example of this type of deployment, please, see this `post
+<https://github.com/huggingface/transformers/issues/8771#issuecomment-759176685>`__.
+
+Configuration
+=======================================================================================================================
 
 For the complete guide to the DeepSpeed configuration options that can be used in its configuration file please refer
 to the `following documentation <https://www.deepspeed.ai/docs/config-json/>`__.
@@ -314,7 +494,8 @@ to achieve the same configuration as provided by the longer json file in the fir
 When you execute the program, DeepSpeed will log the configuration it received from the :class:`~transformers.Trainer`
 to the console, so you can see exactly what the final configuration was passed to it.
 
-**Shared Configuration:**
+Shared Configuration
+=======================================================================================================================
 
 Some configuration information is required by both the :class:`~transformers.Trainer` and DeepSpeed to function
 correctly, therefore, to prevent conflicting definitions, which could lead to hard to detect errors, we chose to
@@ -338,7 +519,8 @@ Of course, you will need to adjust the values in this example to your situation.
 
 
 
-**ZeRO:**
+ZeRO
+=======================================================================================================================
 
 The ``zero_optimization`` section of the configuration file is the most important part (`docs
 <https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training>`__), since that is where you define
@@ -372,7 +554,8 @@ no equivalent command line arguments.
 
 
 
-**Optimizer:**
+Optimizer
+=======================================================================================================================
 
 
 DeepSpeed's main optimizers are Adam, OneBitAdam, and Lamb. These have been thoroughly tested with ZeRO and are thus
@@ -407,7 +590,8 @@ If you want to use one of the officially supported optimizers, configure them ex
 make sure to adjust the values. e.g. if use Adam you will want ``weight_decay`` around ``0.01``.
 
 
-**Scheduler:**
+Scheduler
+=======================================================================================================================
 
 DeepSpeed supports LRRangeTest, OneCycle, WarmupLR and WarmupDecayLR LR schedulers. The full documentation is `here
 <https://www.deepspeed.ai/docs/config-json/#scheduler-parameters>`__.
@@ -456,7 +640,8 @@ Here is an example of the pre-configured ``scheduler`` entry for WarmupLR (``con
          }
     }
 
-**Automatic Mixed Precision:**
+Automatic Mixed Precision
+=======================================================================================================================
 
 You can work with FP16 in one of the following ways:
 
@@ -464,7 +649,7 @@ You can work with FP16 in one of the following ways:
 2. NVIDIA's apex, as documented `here
    <https://www.deepspeed.ai/docs/config-json/#automatic-mixed-precision-amp-training-options>`__.
 
-If you want to use an equivalent of the pytorch native amp, you can either configure the ``fp16`` entry in the
+If you want to use an equivalent of the Pytorch native amp, you can either configure the ``fp16`` entry in the
 configuration file, or use the following command line arguments: ``--fp16 --fp16_backend amp``.
 
 Here is an example of the ``fp16`` configuration:
@@ -497,7 +682,8 @@ Here is an example of the ``amp`` configuration:
 
 
 
-**Gradient Clipping:**
+Gradient Clipping
+=======================================================================================================================
 
 If you don't configure the ``gradient_clipping`` entry in the configuration file, the :class:`~transformers.Trainer`
 will use the value of the ``--max_grad_norm`` command line argument to set it.
@@ -512,7 +698,8 @@ Here is an example of the ``gradient_clipping`` configuration:
 
 
 
-**Notes:**
+Notes
+=======================================================================================================================
 
 * DeepSpeed works with the PyTorch :class:`~transformers.Trainer` but not TF :class:`~transformers.TFTrainer`.
 * While DeepSpeed has a pip installable PyPI package, it is highly recommended that it gets installed from `source
@@ -522,12 +709,14 @@ Here is an example of the ``gradient_clipping`` configuration:
   use any model with your own trainer, and you will have to adapt the latter according to `the DeepSpeed integration
   instructions <https://www.deepspeed.ai/getting-started/#writing-deepspeed-models>`__.
 
-**Main DeepSpeed Resources:**
+Main DeepSpeed Resources
+=======================================================================================================================
 
-- `github <https://github.com/microsoft/deepspeed>`__
+- `Project's github <https://github.com/microsoft/deepspeed>`__
 - `Usage docs <https://www.deepspeed.ai/getting-started/>`__
 - `API docs <https://deepspeed.readthedocs.io/en/latest/index.html>`__
+- `Blog posts <https://www.microsoft.com/en-us/research/search/?q=deepspeed>`__
 
 Finally, please, remember that, HuggingFace :class:`~transformers.Trainer` only integrates DeepSpeed, therefore if you
-have any problems or questions with regards to DeepSpeed usage, please, file an issue with `DeepSpeed github
+have any problems or questions with regards to DeepSpeed usage, please, file an issue with `DeepSpeed GitHub
 <https://github.com/microsoft/DeepSpeed/issues>`__.