From 82498cbc37d5c15520c7bddde5d804c804eee498 Mon Sep 17 00:00:00 2001 From: Stas Bekman Date: Thu, 14 Jan 2021 11:05:04 -0800 Subject: [PATCH] [deepspeed doc] install issues + 1-gpu deployment (#9582) * [doc] install + 1-gpu deployment * Apply suggestions from code review Co-authored-by: Lysandre Debut Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * improvements Co-authored-by: Lysandre Debut Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> --- docs/source/main_classes/trainer.rst | 283 ++++++++++++++++++++++----- 1 file changed, 236 insertions(+), 47 deletions(-) diff --git a/docs/source/main_classes/trainer.rst b/docs/source/main_classes/trainer.rst index b30efdce4b..c0ce5c4e25 100644 --- a/docs/source/main_classes/trainer.rst +++ b/docs/source/main_classes/trainer.rst @@ -113,7 +113,125 @@ Toward Training Trillion Parameter Models, by Samyam Rajbhandari, Jeff Rasley, O This provided support is new and experimental as of this writing. -You will need at least 2 GPUs to benefit from these features. +Installation Notes +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +As of this writing, both FairScale and Deepspeed require compilation of CUDA C++ code, before they can be used. + +While all installation issues should be dealt with through the corresponding GitHub Issues of `FairScale +`__ and `Deepspeed +`__, there are a few common issues that one may encounter while building +any PyTorch extension that needs to build CUDA extensions. + +Therefore, if you encounter a CUDA-related build issue while doing one of the following or both: + +.. code-block:: bash + + pip install fairscale + pip install deepspeed + +please, read the following notes first. + +In these notes we give examples for what to do when ``pytorch`` has been built with CUDA ``10.2``. 
If your situation is +different, remember to adjust the version number to the one you are after. + +**Possible problem #1:** + +While PyTorch comes with its own CUDA toolkit, to build these two projects you must have an identical version of CUDA +installed system-wide. + +For example, if you installed ``pytorch`` with ``cudatoolkit==10.2`` in the Python environment, you also need to have +CUDA ``10.2`` installed system-wide. + +The exact location may vary from system to system, but ``/usr/local/cuda-10.2`` is the most common location on many +Unix systems. When CUDA is correctly set up and added to the ``PATH`` environment variable, one can find the +installation location by doing: + +.. code-block:: bash + + which nvcc + +If you don't have CUDA installed system-wide, install it first. You will find the instructions by using your favorite +search engine. For example, if you're on Ubuntu you may want to search for: `ubuntu cuda 10.2 install +`__. + +**Possible problem #2:** + +Another possible common problem is that you may have more than one CUDA toolkit installed system-wide. For example, you +may have: + +.. code-block:: bash + + /usr/local/cuda-10.2 + /usr/local/cuda-11.0 + +Now, in this situation you need to make sure that your ``PATH`` and ``LD_LIBRARY_PATH`` environment variables contain +the correct paths to the desired CUDA version. Typically, package installers will set these to contain whichever +version was installed last. If the package build fails because it can't find the right +CUDA version despite you having it installed system-wide, it means that you need to adjust the 2 aforementioned +environment variables. + +First, you may look at their contents: + +.. code-block:: bash + + echo $PATH + echo $LD_LIBRARY_PATH + +so you get an idea of what is inside. + +It's possible that ``LD_LIBRARY_PATH`` is empty.
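As a rough illustration of the inspection step above, the following Python sketch splits the two variables into their entries and reports the ones that look like CUDA directories, in priority order. The helper name ``cuda_entries`` and the substring match on ``cuda`` are assumptions made for this example; they are not part of FairScale or DeepSpeed.

```python
import os


def cuda_entries(value, sep=os.pathsep):
    """Return the entries of a PATH-style string that mention 'cuda', in order.

    Hypothetical helper for illustration; earlier entries take priority.
    """
    return [entry for entry in value.split(sep) if "cuda" in entry.lower()]


if __name__ == "__main__":
    for name in ("PATH", "LD_LIBRARY_PATH"):
        # An unset variable is treated like an empty one.
        found = cuda_entries(os.environ.get(name, ""))
        print(f"{name}: {found if found else 'no CUDA entries found'}")
```

If more than one CUDA directory is reported for a variable, the first one listed is the one the build is likely to pick up.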
+ +``PATH`` lists the locations of where executables can be found and ``LD_LIBRARY_PATH`` is for where shared libraries +are to be looked for. In both cases, earlier entries have priority over the later ones. ``:`` is used to separate multiple +entries. + +Now, to tell the build program where to find the specific CUDA toolkit, insert the desired paths to be listed first by +doing: + +.. code-block:: bash + + export PATH=/usr/local/cuda-10.2/bin:$PATH + export LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64:$LD_LIBRARY_PATH + +Note that we aren't overwriting the existing values, but prepending instead. + +Of course, adjust the version number and the full path if need be. Check that the directories you assign actually do +exist. The ``lib64`` sub-directory is where the various CUDA ``.so`` objects, like ``libcudart.so``, reside. It's unlikely +that your system will have it named differently, but if it does, adjust the path accordingly. + + +**Possible problem #3:** + +Some older CUDA versions may refuse to build with newer compilers. For example, you may have ``gcc-9`` but it wants +``gcc-7``. + +There are various ways to go about it. + +If you can install the latest CUDA toolkit it typically should support the newer compiler. + +Alternatively, you could install the lower version of the compiler in addition to the one you already have, or you may +already have it but it's not the default one, so the build system can't see it. If you have ``gcc-7`` installed but the +build system complains it can't find it, the following might do the trick: + +.. code-block:: bash + + sudo ln -s /usr/bin/gcc-7 /usr/local/cuda-10.2/bin/gcc + sudo ln -s /usr/bin/g++-7 /usr/local/cuda-10.2/bin/g++ + + +Here, we are making a symlink to ``gcc-7`` from ``/usr/local/cuda-10.2/bin/gcc`` and since +``/usr/local/cuda-10.2/bin/`` should be in the ``PATH`` environment variable (see the previous problem's solution), the +build should find ``gcc-7`` (and ``g++-7``) and then succeed.
+ +As always make sure to edit the paths in the example to match your situation. + +**If still unsuccessful:** + +If after addressing these you still encounter build issues, please, proceed with the GitHub Issue of `FairScale +`__ and `Deepspeed +`__, depending on the project you have the problem with. + FairScale ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ @@ -124,6 +242,8 @@ provides support for the following features from `the ZeRO paper `__. 2. Add ``--sharded_ddp`` to the command line arguments, and make sure you have added the distributed launcher ``-m @@ -164,7 +284,6 @@ Notes: DeepSpeed ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - `DeepSpeed `__ implements everything described in the `ZeRO paper `__, except ZeRO's stage 3. "Parameter Partitioning (Pos+g+p)". Currently it provides full support for: @@ -172,58 +291,119 @@ full support for: 1. Optimizer State Partitioning (ZeRO stage 1) 2. Add Gradient Partitioning (ZeRO stage 2) -To deploy this feature: +Installation +======================================================================================================================= -1. Install the library via pypi: +Install the library via pypi: - .. code-block:: bash +.. code-block:: bash - pip install deepspeed + pip install deepspeed - or find more details on `the DeepSpeed's github page `__. +or find more details on `the DeepSpeed's GitHub page `__. -2. Adjust the :class:`~transformers.Trainer` command line arguments as following: +Deployment with multiple GPUs +======================================================================================================================= - 1. replace ``python -m torch.distributed.launch`` with ``deepspeed``. - 2. add a new argument ``--deepspeed ds_config.json``, where ``ds_config.json`` is the DeepSpeed configuration file - as documented `here `__. 
The file naming is up to you. +To deploy this feature with multiple GPUs adjust the :class:`~transformers.Trainer` command line arguments as +following: - Therefore, if your original command line looked as following: +1. replace ``python -m torch.distributed.launch`` with ``deepspeed``. +2. add a new argument ``--deepspeed ds_config.json``, where ``ds_config.json`` is the DeepSpeed configuration file as + documented `here `__. The file naming is up to you. - .. code-block:: bash +Therefore, if your original command line looked as following: - python -m torch.distributed.launch --nproc_per_node=2 your_program.py +.. code-block:: bash - Now it should be: + python -m torch.distributed.launch --nproc_per_node=2 your_program.py - .. code-block:: bash +Now it should be: - deepspeed --num_gpus=2 your_program.py --deepspeed ds_config.json +.. code-block:: bash - Unlike, ``torch.distributed.launch`` where you have to specify how many GPUs to use with ``--nproc_per_node``, with - the ``deepspeed`` launcher you don't have to use the corresponding ``--num_gpus`` if you want all of your GPUs used. - The full details on how to configure various nodes and GPUs can be found `here - `__. + deepspeed --num_gpus=2 your_program.py --deepspeed ds_config.json - Here is an example of running ``finetune_trainer.py`` under DeepSpeed deploying all available GPUs: +Unlike, ``torch.distributed.launch`` where you have to specify how many GPUs to use with ``--nproc_per_node``, with the +``deepspeed`` launcher you don't have to use the corresponding ``--num_gpus`` if you want all of your GPUs used. The +full details on how to configure various nodes and GPUs can be found `here +`__. - .. 
code-block:: bash +Here is an example of running ``finetune_trainer.py`` under DeepSpeed deploying all available GPUs: - cd examples/seq2seq - deepspeed ./finetune_trainer.py --deepspeed ds_config.json \ - --model_name_or_path sshleifer/distill-mbart-en-ro-12-4 --data_dir wmt_en_ro \ - --output_dir output_dir --overwrite_output_dir \ - --do_train --n_train 500 --num_train_epochs 1 \ - --per_device_train_batch_size 1 --freeze_embeds \ - --src_lang en_XX --tgt_lang ro_RO --task translation +.. code-block:: bash - Note that in the DeepSpeed documentation you are likely to see ``--deepspeed --deepspeed_config ds_config.json`` - - i.e. two DeepSpeed-related arguments, but for the sake of simplicity, and since there are already so many arguments - to deal with, we combined the two into a single argument. + cd examples/seq2seq + deepspeed ./finetune_trainer.py --deepspeed ds_config.json \ + --model_name_or_path sshleifer/distill-mbart-en-ro-12-4 --data_dir wmt_en_ro \ + --output_dir output_dir --overwrite_output_dir \ + --do_train --n_train 500 --num_train_epochs 1 \ + --per_device_train_batch_size 1 --freeze_embeds \ + --src_lang en_XX --tgt_lang ro_RO --task translation -Before you can deploy DeepSpeed, let's discuss its configuration. +Note that in the DeepSpeed documentation you are likely to see ``--deepspeed --deepspeed_config ds_config.json`` - i.e. +two DeepSpeed-related arguments, but for the sake of simplicity, and since there are already so many arguments to deal +with, we combined the two into a single argument. -**Configuration:** +For some practical usage examples, please, see this `post +`__. + + + +Deployment with one GPU +======================================================================================================================= + +To deploy DeepSpeed with one GPU adjust the :class:`~transformers.Trainer` command line arguments as following: + +.. 
code-block:: bash + + cd examples/seq2seq + deepspeed --num_gpus=1 ./finetune_trainer.py --deepspeed ds_config.json \ + --model_name_or_path sshleifer/distill-mbart-en-ro-12-4 --data_dir wmt_en_ro \ + --output_dir output_dir --overwrite_output_dir \ + --do_train --n_train 500 --num_train_epochs 1 \ + --per_device_train_batch_size 1 --freeze_embeds \ + --src_lang en_XX --tgt_lang ro_RO --task translation + +This is almost the same as with multiple GPUs, but here we tell DeepSpeed explicitly to use just one GPU. By default, +DeepSpeed deploys all GPUs it can see. If you have only 1 GPU to start with, then you don't need this argument. The +following `documentation `__ discusses the +launcher options. + +Why would you want to use DeepSpeed with just one GPU? + +1. It has a ZeRO-offload feature which can delegate some computations and memory to the host's CPU and RAM, and thus + leave more GPU resources for the model's needs - e.g. a larger batch size, or fitting a very big model that + normally wouldn't fit. +2. It provides a smart GPU memory management system that minimizes memory fragmentation, which again allows you to fit + bigger models and data batches. + +While we are going to discuss the configuration in detail next, the key to getting a huge improvement on a single GPU +with DeepSpeed is to have at least the following configuration in the configuration file: + +.. code-block:: json + + { + "zero_optimization": { + "stage": 2, + "allgather_partitions": true, + "allgather_bucket_size": 2e8, + "reduce_scatter": true, + "reduce_bucket_size": 2e8, + "overlap_comm": true, + "contiguous_gradients": true, + "cpu_offload": true + } + } + +which enables ``cpu_offload`` and some other important features. You may experiment with the buffer sizes; you will +find more details in the discussion below. + +For a practical usage example of this type of deployment, please, see this `post +`__.
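One way to avoid JSON syntax slips when writing this file by hand is to generate it. The sketch below builds the same ``zero_optimization`` settings as a Python dictionary and serializes them. The function name ``minimal_single_gpu_config`` is a choice made for this example, and the keys are simply copied from the configuration shown above rather than checked against any particular DeepSpeed release.

```python
import json


def minimal_single_gpu_config(bucket_size=2e8):
    """Build the ZeRO stage 2 + CPU offload settings shown above as a dict.

    Hypothetical helper; the keys are copied from the example in this section.
    """
    return {
        "zero_optimization": {
            "stage": 2,
            "allgather_partitions": True,
            "allgather_bucket_size": bucket_size,
            "reduce_scatter": True,
            "reduce_bucket_size": bucket_size,
            "overlap_comm": True,
            "contiguous_gradients": True,
            "cpu_offload": True,
        }
    }


if __name__ == "__main__":
    # Print the JSON; redirect it to whatever file you pass to --deepspeed.
    print(json.dumps(minimal_single_gpu_config(), indent=4))
```

Passing a different ``bucket_size`` changes both bucket sizes together, which may be convenient when experimenting with these values.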
+ +Configuration +======================================================================================================================= For the complete guide to the DeepSpeed configuration options that can be used in its configuration file please refer to the `following documentation `__. @@ -314,7 +494,8 @@ to achieve the same configuration as provided by the longer json file in the fir When you execute the program, DeepSpeed will log the configuration it received from the :class:`~transformers.Trainer` to the console, so you can see exactly what the final configuration was passed to it. -**Shared Configuration:** +Shared Configuration +======================================================================================================================= Some configuration information is required by both the :class:`~transformers.Trainer` and DeepSpeed to function correctly, therefore, to prevent conflicting definitions, which could lead to hard to detect errors, we chose to @@ -338,7 +519,8 @@ Of course, you will need to adjust the values in this example to your situation. -**ZeRO:** +ZeRO +======================================================================================================================= The ``zero_optimization`` section of the configuration file is the most important part (`docs `__), since that is where you define @@ -372,7 +554,8 @@ no equivalent command line arguments. -**Optimizer:** +Optimizer +======================================================================================================================= DeepSpeed's main optimizers are Adam, OneBitAdam, and Lamb. These have been thoroughly tested with ZeRO and are thus @@ -407,7 +590,8 @@ If you want to use one of the officially supported optimizers, configure them ex make sure to adjust the values. e.g. if use Adam you will want ``weight_decay`` around ``0.01``. 
-**Scheduler:** +Scheduler +======================================================================================================================= DeepSpeed supports LRRangeTest, OneCycle, WarmupLR and WarmupDecayLR LR schedulers. The full documentation is `here `__. @@ -456,7 +640,8 @@ Here is an example of the pre-configured ``scheduler`` entry for WarmupLR (``con } } -**Automatic Mixed Precision:** +Automatic Mixed Precision +======================================================================================================================= You can work with FP16 in one of the following ways: @@ -464,7 +649,7 @@ You can work with FP16 in one of the following ways: 2. NVIDIA's apex, as documented `here `__. -If you want to use an equivalent of the pytorch native amp, you can either configure the ``fp16`` entry in the +If you want to use an equivalent of the Pytorch native amp, you can either configure the ``fp16`` entry in the configuration file, or use the following command line arguments: ``--fp16 --fp16_backend amp``. Here is an example of the ``fp16`` configuration: @@ -497,7 +682,8 @@ Here is an example of the ``amp`` configuration: -**Gradient Clipping:** +Gradient Clipping +======================================================================================================================= If you don't configure the ``gradient_clipping`` entry in the configuration file, the :class:`~transformers.Trainer` will use the value of the ``--max_grad_norm`` command line argument to set it. @@ -512,7 +698,8 @@ Here is an example of the ``gradient_clipping`` configuration: -**Notes:** +Notes +======================================================================================================================= * DeepSpeed works with the PyTorch :class:`~transformers.Trainer` but not TF :class:`~transformers.TFTrainer`. 
* While DeepSpeed has a pip installable PyPI package, it is highly recommended that it gets installed from `source @@ -522,12 +709,14 @@ Here is an example of the ``gradient_clipping`` configuration: use any model with your own trainer, and you will have to adapt the latter according to `the DeepSpeed integration instructions `__. -**Main DeepSpeed Resources:** +Main DeepSpeed Resources +======================================================================================================================= -- `github `__ +- `Project's github `__ - `Usage docs `__ - `API docs `__ +- `Blog posts `__ Finally, please, remember that, HuggingFace :class:`~transformers.Trainer` only integrates DeepSpeed, therefore if you -have any problems or questions with regards to DeepSpeed usage, please, file an issue with `DeepSpeed github +have any problems or questions with regards to DeepSpeed usage, please, file an issue with `DeepSpeed GitHub `__.