[deepspeed doc] install issues + 1-gpu deployment (#9582)

* [doc] install + 1-gpu deployment * Apply suggestions from code review Co-authored-by: Lysandre Debut <lysandre@huggingface.co> Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * improvements Co-authored-by: Lysandre Debut <lysandre@huggingface.co> Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-01-14 11:05:04 -08:00
parent 329fe2746a
commit 82498cbc37
1 changed files with 236 additions and 47 deletions
--- a/docs/source/main_classes/trainer.rst
+++ b/docs/source/main_classes/trainer.rst
@@ -113,7 +113,125 @@ Toward Training Trillion Parameter Models, by Samyam Rajbhandari, Jeff Rasley, O
 This provided support is new and experimental as of this writing.
-You will need at least 2 GPUs to benefit from these features.
+Installation Notes
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 As of this writing, both FairScale and Deepspeed require compilation of CUDA C++ code before they can be used.
 While all installation issues should be dealt with through the corresponding GitHub Issues of `FairScale
 <https://github.com/facebookresearch/fairscale/issues>`__ and `Deepspeed
 <https://github.com/microsoft/DeepSpeed/issues>`__, there are a few common issues that one may encounter while building
 any PyTorch extension that includes CUDA code.
 Therefore, if you encounter a CUDA-related build issue while doing one of the following or both:
 .. code-block:: bash
    pip install fairscale
    pip install deepspeed
 please, read the following notes first.
 In these notes we give examples for what to do when ``pytorch`` has been built with CUDA ``10.2``. If your situation is
 different remember to adjust the version number to the one you are after.
 **Possible problem #1:**
 While PyTorch comes with its own CUDA toolkit, to build these two projects you must have an identical version of CUDA
 installed system-wide.
 For example, if you installed ``pytorch`` with ``cudatoolkit==10.2`` in the Python environment, you also need to have
 CUDA ``10.2`` installed system-wide.
 The exact location may vary from system to system, but ``/usr/local/cuda-10.2`` is the most common location on many
 Unix systems. When CUDA is correctly set up and added to the ``PATH`` environment variable, one can find the
 installation location by doing:
 .. code-block:: bash
    which nvcc
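 To compare that system-wide version with the one ``pytorch`` was built with, something along these lines may help.
 This is only a sketch: ``nvcc_release`` is a hypothetical helper name, and it assumes that ``python3`` and ``nvcc``
 are found in ``PATH`` and that ``nvcc --version`` prints a line containing ``release X.Y``.

 .. code-block:: bash

    # Hypothetical helper: pull "X.Y" out of the "release X.Y" text read on stdin
    nvcc_release() {
        sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
    }

    # CUDA version pytorch was built with (prints "None" for CPU-only builds)
    torch_cuda=$(python3 -c "import torch; print(torch.version.cuda)" 2>/dev/null || true)
    # CUDA version of the first nvcc found in PATH
    system_cuda=$(nvcc --version 2>/dev/null | nvcc_release || true)

    echo "pytorch CUDA: ${torch_cuda:-not found}"
    echo "system CUDA:  ${system_cuda:-not found}"
    if [ -n "$torch_cuda" ] && [ "$torch_cuda" = "$system_cuda" ]; then
        echo "versions match"
    else
        echo "versions differ or could not be determined"
    fi

 If the two numbers differ, the notes that follow are the place to start.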
 If you don't have CUDA installed system-wide, install it first. You will find the instructions by using your favorite
 search engine. For example, if you're on Ubuntu you may want to search for: `ubuntu cuda 10.2 install
 <https://www.google.com/search?q=ubuntu+cuda+10.2+install>`__.
 **Possible problem #2:**
 Another possible common problem is that you may have more than one CUDA toolkit installed system-wide. For example you
 may have:
 .. code-block:: bash
    /usr/local/cuda-10.2
    /usr/local/cuda-11.0
 Now, in this situation you need to make sure that your ``PATH`` and ``LD_LIBRARY_PATH`` environment variables contain
 the correct paths to the desired CUDA version. Typically, package installers will set these to point to whichever
 version was installed last. If the package build fails because it can't find the right CUDA version, despite you having
 it installed system-wide, it means that you need to adjust the 2 aforementioned environment variables.
 First, you may look at their contents:
 .. code-block:: bash
    echo $PATH
    echo $LD_LIBRARY_PATH
 so you get an idea of what is inside.
 It's possible that ``LD_LIBRARY_PATH`` is empty.
 ``PATH`` lists the locations where executables can be found and ``LD_LIBRARY_PATH`` lists where shared libraries
 are to be looked for. In both cases, earlier entries have priority over the later ones. ``:`` is used to separate multiple
 entries.
 Now, to tell the build program where to find the specific CUDA toolkit, insert the desired paths to be listed first by
 doing:
 .. code-block:: bash
    export PATH=/usr/local/cuda-10.2/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64:$LD_LIBRARY_PATH
 Note that we aren't overwriting the existing values, but prepending instead.
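 The same prepending can be wrapped in a small guard, sketched below. ``prepend_path`` is a hypothetical helper, not
 part of any of the tools discussed here. It refuses to add a directory that doesn't exist, and it avoids leaving a
 trailing ``:`` when the old value is empty, since an empty entry is typically treated as the current directory.

 .. code-block:: bash

    # Hypothetical helper: print VALUE with DIR prepended, or VALUE unchanged if DIR is missing.
    # Usage: prepend_path DIR VALUE
    prepend_path() {
        dir=$1
        value=$2
        if [ -d "$dir" ]; then
            # ${value:+:$value} adds the ':' separator only when the old value is non-empty
            printf '%s\n' "$dir${value:+:$value}"
        else
            echo "warning: $dir does not exist, value left unchanged" >&2
            printf '%s\n' "$value"
        fi
    }

    export PATH="$(prepend_path /usr/local/cuda-10.2/bin "${PATH:-}")"
    export LD_LIBRARY_PATH="$(prepend_path /usr/local/cuda-10.2/lib64 "${LD_LIBRARY_PATH:-}")"

 Running ``which nvcc`` afterwards should then report the ``nvcc`` from the directory you just added.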
 Of course, adjust the version number and the full path if need be. Check that the directories you assign actually do
 exist. The ``lib64`` sub-directory is where the various CUDA ``.so`` objects, like ``libcudart.so``, reside. It's
 unlikely that your system will have it named differently, but if it does, adjust the path to reflect your setup.
 **Possible problem #3:**
 Some older CUDA versions may refuse to build with newer compilers. For example, you may have ``gcc-9`` but it wants
 ``gcc-7``.
 There are various ways to go about it.
 If you can install the latest CUDA toolkit it typically should support the newer compiler.
 Alternatively, you could install the lower version of the compiler in addition to the one you already have, or you may
 already have it but it's not the default one, so the build system can't see it. If you have ``gcc-7`` installed but the
 build system complains it can't find it, the following might do the trick:
 .. code-block:: bash
    sudo ln -s /usr/bin/gcc-7  /usr/local/cuda-10.2/bin/gcc
    sudo ln -s /usr/bin/g++-7  /usr/local/cuda-10.2/bin/g++
 Here, we are making a symlink to ``gcc-7`` from ``/usr/local/cuda-10.2/bin/gcc`` and since
 ``/usr/local/cuda-10.2/bin/`` should be in the ``PATH`` environment variable (see the previous problem's solution), it
 should find ``gcc-7`` (and ``g++-7``) and then the build will succeed.
 As always make sure to edit the paths in the example to match your situation.
 **If still unsuccessful:**
 If after addressing these you still encounter build issues, please file a GitHub Issue with `FairScale
 <https://github.com/facebookresearch/fairscale/issues>`__ or `Deepspeed
 <https://github.com/microsoft/DeepSpeed/issues>`__, depending on the project you have the problem with.
 FairScale
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -124,6 +242,8 @@ provides support for the following features from `the ZeRO paper <https://arxiv.
 1. Optimizer State Sharding
 2. Gradient Sharding
 You will need at least two GPUs to use this feature.
 To deploy this feature:
 1. Install the library via pypi:
@@ -132,7 +252,7 @@ To deploy this feature:
       pip install fairscale
-   or find more details on `the FairScale's github page
+   or find more details on `the FairScale's GitHub page
   <https://github.com/facebookresearch/fairscale/#installation>`__.
 2. Add ``--sharded_ddp`` to the command line arguments, and make sure you have added the distributed launcher ``-m
@@ -164,7 +284,6 @@ Notes:
 DeepSpeed
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 `DeepSpeed <https://github.com/microsoft/DeepSpeed>`__ implements everything described in the `ZeRO paper
 <https://arxiv.org/abs/1910.02054>`__, except ZeRO's stage 3. "Parameter Partitioning (Pos+g+p)". Currently it provides
 full support for:
@@ -172,21 +291,26 @@ full support for:
 1. Optimizer State Partitioning (ZeRO stage 1)
 2. Add Gradient Partitioning (ZeRO stage 2)
-To deploy this feature:
+Installation
 =======================================================================================================================
-1. Install the library via pypi:
+Install the library via pypi:
 .. code-block:: bash
    pip install deepspeed
-   or find more details on `the DeepSpeed's github page <https://github.com/microsoft/deepspeed#installation>`__.
+or find more details on `the DeepSpeed's GitHub page <https://github.com/microsoft/deepspeed#installation>`__.
-2. Adjust the :class:`~transformers.Trainer` command line arguments as following:
+Deployment with multiple GPUs
 =======================================================================================================================
 To deploy this feature with multiple GPUs adjust the :class:`~transformers.Trainer` command line arguments as
 following:
 1. replace ``python -m torch.distributed.launch`` with ``deepspeed``.
-   2. add a new argument ``--deepspeed ds_config.json``, where ``ds_config.json`` is the DeepSpeed configuration file
+2. add a new argument ``--deepspeed ds_config.json``, where ``ds_config.json`` is the DeepSpeed configuration file as
-      as documented `here <https://www.deepspeed.ai/docs/config-json/>`__. The file naming is up to you.
+   documented `here <https://www.deepspeed.ai/docs/config-json/>`__. The file naming is up to you.
 Therefore, if your original command line looked as following:
@@ -200,9 +324,9 @@ To deploy this feature:
    deepspeed --num_gpus=2 your_program.py <normal cl args> --deepspeed ds_config.json
-   Unlike, ``torch.distributed.launch`` where you have to specify how many GPUs to use with ``--nproc_per_node``, with
+Unlike ``torch.distributed.launch``, where you have to specify how many GPUs to use with ``--nproc_per_node``, with the
-   the ``deepspeed`` launcher you don't have to use the corresponding ``--num_gpus`` if you want all of your GPUs used.
+``deepspeed`` launcher you don't have to use the corresponding ``--num_gpus`` if you want all of your GPUs used. The
-   The full details on how to configure various nodes and GPUs can be found `here
+full details on how to configure various nodes and GPUs can be found `here
 <https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node>`__.
 Here is an example of running ``finetune_trainer.py`` under DeepSpeed deploying all available GPUs:
@@ -217,13 +341,69 @@ To deploy this feature:
    --per_device_train_batch_size 1  --freeze_embeds \
    --src_lang en_XX --tgt_lang ro_RO --task translation
-   Note that in the DeepSpeed documentation you are likely to see ``--deepspeed --deepspeed_config ds_config.json`` -
+Note that in the DeepSpeed documentation you are likely to see ``--deepspeed --deepspeed_config ds_config.json`` - i.e.
-   i.e. two DeepSpeed-related arguments, but for the sake of simplicity, and since there are already so many arguments
+two DeepSpeed-related arguments, but for the sake of simplicity, and since there are already so many arguments to deal
-   to deal with, we combined the two into a single argument.
+with, we combined the two into a single argument.
-Before you can deploy DeepSpeed, let's discuss its configuration.
+For some practical usage examples, please, see this `post
 <https://github.com/huggingface/transformers/issues/8771#issuecomment-759248400>`__.
-**Configuration:**
+
 Deployment with one GPU
 =======================================================================================================================
 To deploy DeepSpeed with one GPU adjust the :class:`~transformers.Trainer` command line arguments as following:
 .. code-block:: bash
    cd examples/seq2seq
    deepspeed --num_gpus=1 ./finetune_trainer.py --deepspeed ds_config.json \
    --model_name_or_path sshleifer/distill-mbart-en-ro-12-4 --data_dir wmt_en_ro \
    --output_dir output_dir --overwrite_output_dir \
    --do_train --n_train 500 --num_train_epochs 1 \
    --per_device_train_batch_size 1  --freeze_embeds \
    --src_lang en_XX --tgt_lang ro_RO --task translation
 This is almost the same as with multiple-GPUs, but here we tell DeepSpeed explicitly to use just one GPU. By default,
 DeepSpeed deploys all GPUs it can see. If you have only 1 GPU to start with, then you don't need this argument. The
 following `documentation <https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node>`__ discusses the
 launcher options.
 Why would you want to use DeepSpeed with just one GPU?
 1. It has a ZeRO-offload feature which can delegate some computations and memory to the host's CPU and RAM, and thus
   leave more GPU resources for the model's needs - e.g. a larger batch size, or fitting a very big model which
   normally wouldn't fit.
 2. It provides a smart GPU memory management system that minimizes memory fragmentation, which again allows you to fit
   bigger models and data batches.
 While we are going to discuss the configuration in detail next, the key to getting a huge improvement on a single GPU
 with DeepSpeed is to have at least the following configuration in the configuration file:
 .. code-block:: json
  {
    "zero_optimization": {
       "stage": 2,
       "allgather_partitions": true,
       "allgather_bucket_size": 2e8,
       "reduce_scatter": true,
       "reduce_bucket_size": 2e8,
       "overlap_comm": true,
       "contiguous_gradients": true,
       "cpu_offload": true
    }
  }
 which enables ``cpu_offload`` and some other important features. You may experiment with the buffer sizes; you will
 find more details in the discussion below.
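 Since the configuration file has to be valid JSON, and a stray trailing comma is enough to make it unparsable, it may
 be worth checking the file before launching. The following is a minimal sketch using Python's standard ``json.tool``
 module. ``check_json`` is a hypothetical helper name and ``ds_config.json`` is the example file name used above.

 .. code-block:: bash

    # Hypothetical helper: succeeds only if the given file parses as JSON
    check_json() {
        python3 -m json.tool "$1" > /dev/null
    }

    if check_json ds_config.json; then
        echo "ds_config.json: valid JSON"
    else
        echo "ds_config.json: fix the error reported above before launching"
    fi

 Note that this only checks the syntax. It doesn't know which keys DeepSpeed accepts.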
 For a practical usage example of this type of deployment, please, see this `post
 <https://github.com/huggingface/transformers/issues/8771#issuecomment-759176685>`__.
 Configuration
 =======================================================================================================================
 For the complete guide to the DeepSpeed configuration options that can be used in its configuration file please refer
 to the `following documentation <https://www.deepspeed.ai/docs/config-json/>`__.
@@ -314,7 +494,8 @@ to achieve the same configuration as provided by the longer json file in the fir
 When you execute the program, DeepSpeed will log the configuration it received from the :class:`~transformers.Trainer`
 to the console, so you can see exactly what final configuration was passed to it.
-**Shared Configuration:**
+Shared Configuration
 =======================================================================================================================
 Some configuration information is required by both the :class:`~transformers.Trainer` and DeepSpeed to function
 correctly, therefore, to prevent conflicting definitions, which could lead to hard to detect errors, we chose to
@@ -338,7 +519,8 @@ Of course, you will need to adjust the values in this example to your situation.
-**ZeRO:**
+ZeRO
 =======================================================================================================================
 The ``zero_optimization`` section of the configuration file is the most important part (`docs
 <https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training>`__), since that is where you define
@@ -372,7 +554,8 @@ no equivalent command line arguments.
-**Optimizer:**
+Optimizer
 =======================================================================================================================
 DeepSpeed's main optimizers are Adam, OneBitAdam, and Lamb. These have been thoroughly tested with ZeRO and are thus
@@ -407,7 +590,8 @@ If you want to use one of the officially supported optimizers, configure them ex
 make sure to adjust the values, e.g. if you use Adam you will want ``weight_decay`` around ``0.01``.
-**Scheduler:**
+Scheduler
 =======================================================================================================================
 DeepSpeed supports LRRangeTest, OneCycle, WarmupLR and WarmupDecayLR LR schedulers. The full documentation is `here
 <https://www.deepspeed.ai/docs/config-json/#scheduler-parameters>`__.
@@ -456,7 +640,8 @@ Here is an example of the pre-configured ``scheduler`` entry for WarmupLR (``con
         }
    }
-**Automatic Mixed Precision:**
+Automatic Mixed Precision
 =======================================================================================================================
 You can work with FP16 in one of the following ways:
@@ -464,7 +649,7 @@ You can work with FP16 in one of the following ways:
 2. NVIDIA's apex, as documented `here
   <https://www.deepspeed.ai/docs/config-json/#automatic-mixed-precision-amp-training-options>`__.
-If you want to use an equivalent of the pytorch native amp, you can either configure the ``fp16`` entry in the
+If you want to use an equivalent of the Pytorch native amp, you can either configure the ``fp16`` entry in the
 configuration file, or use the following command line arguments: ``--fp16 --fp16_backend amp``.
 Here is an example of the ``fp16`` configuration:
@@ -497,7 +682,8 @@ Here is an example of the ``amp`` configuration:
-**Gradient Clipping:**
+Gradient Clipping
 =======================================================================================================================
 If you don't configure the ``gradient_clipping`` entry in the configuration file, the :class:`~transformers.Trainer`
 will use the value of the ``--max_grad_norm`` command line argument to set it.
@@ -512,7 +698,8 @@ Here is an example of the ``gradient_clipping`` configuration:
-**Notes:**
+Notes
 =======================================================================================================================
 * DeepSpeed works with the PyTorch :class:`~transformers.Trainer` but not TF :class:`~transformers.TFTrainer`.
 * While DeepSpeed has a pip installable PyPI package, it is highly recommended that it gets installed from `source
@@ -522,12 +709,14 @@ Here is an example of the ``gradient_clipping`` configuration:
  use any model with your own trainer, and you will have to adapt the latter according to `the DeepSpeed integration
  instructions <https://www.deepspeed.ai/getting-started/#writing-deepspeed-models>`__.
-**Main DeepSpeed Resources:**
+Main DeepSpeed Resources
 =======================================================================================================================
- `github <https://github.com/microsoft/deepspeed>`__
+- `Project's github <https://github.com/microsoft/deepspeed>`__
 - `Usage docs <https://www.deepspeed.ai/getting-started/>`__
 - `API docs <https://deepspeed.readthedocs.io/en/latest/index.html>`__
 - `Blog posts <https://www.microsoft.com/en-us/research/search/?q=deepspeed>`__
 Finally, please remember that HuggingFace :class:`~transformers.Trainer` only integrates DeepSpeed, therefore if you
-have any problems or questions with regards to DeepSpeed usage, please, file an issue with `DeepSpeed github
+have any problems or questions with regards to DeepSpeed usage, please, file an issue with `DeepSpeed GitHub
 <https://github.com/microsoft/DeepSpeed/issues>`__.