[deepspeed doc] install issues + 1-gpu deployment (#9582)
* [doc] install + 1-gpu deployment * Apply suggestions from code review Co-authored-by: Lysandre Debut <lysandre@huggingface.co> Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * improvements Co-authored-by: Lysandre Debut <lysandre@huggingface.co> Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@@ -113,7 +113,125 @@ Toward Training Trillion Parameter Models, by Samyam Rajbhandari, Jeff Rasley, O
|
||||
|
||||
This provided support is new and experimental as of this writing.
|
||||
|
||||
You will need at least 2 GPUs to benefit from these features.
|
||||
Installation Notes
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
As of this writing, both FairScale and DeepSpeed require compilation of CUDA C++ code before they can be used.
|
||||
|
||||
While all installation issues should be dealt with through the corresponding GitHub Issues of `FairScale
|
||||
<https://github.com/facebookresearch/fairscale/issues>`__ and `Deepspeed
|
||||
<https://github.com/microsoft/DeepSpeed/issues>`__, there are a few common issues that one may encounter while building
|
||||
any PyTorch extension that needs to build CUDA extensions.
|
||||
|
||||
Therefore, if you encounter a CUDA-related build issue while running one or both of the following:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
pip install fairscale
|
||||
pip install deepspeed
|
||||
|
||||
please read the following notes first.
|
||||
|
||||
In these notes we give examples for what to do when ``pytorch`` has been built with CUDA ``10.2``. If your situation is
|
||||
different, remember to adjust the version number to the one you are after.
|
||||
|
||||
**Possible problem #1:**
|
||||
|
||||
While PyTorch comes with its own CUDA toolkit, to build these two projects you must have an identical version of CUDA
|
||||
installed system-wide.
|
||||
|
||||
For example, if you installed ``pytorch`` with ``cudatoolkit==10.2`` in the Python environment, you also need to have
|
||||
CUDA ``10.2`` installed system-wide.
|
||||
|
||||
The exact location may vary from system to system, but ``/usr/local/cuda-10.2`` is the most common location on many
|
||||
Unix systems. When CUDA is correctly set up and added to the ``PATH`` environment variable, one can find the
|
||||
installation location by doing:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
which nvcc
|
||||
|
||||
If you don't have CUDA installed system-wide, install it first. You will find the instructions by using your favorite
|
||||
search engine. For example, if you're on Ubuntu you may want to search for: `ubuntu cuda 10.2 install
|
||||
<https://www.google.com/search?q=ubuntu+cuda+10.2+install>`__.
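One way to check for the mismatch described in problem #1 is to compare the CUDA version that PyTorch reports (``torch.version.cuda``) with the release printed by ``nvcc --version``. The following is a minimal sketch under stated assumptions, not part of either project: the helper names ``parse_nvcc_release`` and ``versions_match`` are hypothetical, and the sample line only illustrates the usual shape of the ``nvcc`` output.

```python
import re


def parse_nvcc_release(nvcc_output):
    """Extract the 'major.minor' release from the text printed by `nvcc --version`.

    Returns None when no release number can be found.
    """
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def versions_match(torch_cuda, nvcc_output):
    """Return True when PyTorch's CUDA version equals the system-wide nvcc release."""
    return torch_cuda is not None and torch_cuda == parse_nvcc_release(nvcc_output)


# Illustrative sample of the relevant line of `nvcc --version` output.
sample_output = "Cuda compilation tools, release 10.2, V10.2.89"

# In a real environment you would pass `torch.version.cuda` as the first
# argument and the captured output of `nvcc --version` as the second.
print(versions_match("10.2", sample_output))
```

If the two versions differ, the notes on ``PATH`` and ``LD_LIBRARY_PATH`` are the next place to look.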
|
||||
|
||||
**Possible problem #2:**
|
||||
|
||||
Another possible common problem is that you may have more than one CUDA toolkit installed system-wide. For example you
|
||||
may have:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
/usr/local/cuda-10.2
|
||||
/usr/local/cuda-11.0
|
||||
|
||||
Now, in this situation you need to make sure that your ``PATH`` and ``LD_LIBRARY_PATH`` environment variables contain
|
||||
the correct paths to the desired CUDA version. Typically, package installers will set these to contain whichever version
was installed last. If you encounter a problem where the package build fails because it can't find the right
|
||||
CUDA version despite you having it installed system-wide, it means that you need to adjust the 2 aforementioned
|
||||
environment variables.
|
||||
|
||||
First, you may look at their contents:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
echo $PATH
|
||||
echo $LD_LIBRARY_PATH
|
||||
|
||||
so you get an idea of what is inside.
|
||||
|
||||
It's possible that ``LD_LIBRARY_PATH`` is empty.
|
||||
|
||||
``PATH`` lists the locations of where executables can be found and ``LD_LIBRARY_PATH`` is for where shared libraries
|
||||
are to be looked for. In both cases, earlier entries have priority over the later ones. ``:`` is used to separate multiple
|
||||
entries.
|
||||
|
||||
Now, to tell the build program where to find the specific CUDA toolkit, insert the desired paths to be listed first by
|
||||
doing:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
export PATH=/usr/local/cuda-10.2/bin:$PATH
|
||||
export LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64:$LD_LIBRARY_PATH
|
||||
|
||||
Note that we aren't overwriting the existing values, but prepending instead.
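The same prepending logic can be sketched in Python, for example when a build is launched from a script rather than an interactive shell. This is only an illustration: ``prepend_path`` is a hypothetical helper, ``/usr/local/cuda-10.2`` is the example location used in these notes, and the ``:`` separator assumes a Unix system. The sketch also handles an empty variable, so that no stray empty entry is left behind (an empty entry may be interpreted as the current directory).

```python
import os


def prepend_path(new_dir, current, sep=":"):
    """Return a PATH-style string with new_dir listed first.

    When the existing value is empty, only new_dir is returned, so the
    result never ends with a dangling separator.
    """
    if not current:
        return new_dir
    return new_dir + sep + current


cuda_home = "/usr/local/cuda-10.2"  # assumption: adjust to your installation

# Work on a copy of the environment, to be passed to a build subprocess.
env = dict(os.environ)
env["PATH"] = prepend_path(cuda_home + "/bin", env.get("PATH", ""))
env["LD_LIBRARY_PATH"] = prepend_path(cuda_home + "/lib64", env.get("LD_LIBRARY_PATH", ""))
```

The resulting ``env`` dictionary could then be handed to ``subprocess.run(..., env=env)`` so that only the build process sees the adjusted values.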
|
||||
|
||||
Of course, adjust the version number and the full path if need be. Check that the directories you assign actually do
exist. The ``lib64`` sub-directory is where the various CUDA ``.so`` objects, like ``libcudart.so``, reside. It's unlikely
that your system will have it named differently, but if it does, adjust the path to match.
|
||||
|
||||
|
||||
**Possible problem #3:**
|
||||
|
||||
Some older CUDA versions may refuse to build with newer compilers. For example, you may have ``gcc-9`` but it wants
|
||||
``gcc-7``.
|
||||
|
||||
There are various ways to go about it.
|
||||
|
||||
If you can install the latest CUDA toolkit, it typically should support the newer compiler.
|
||||
|
||||
Alternatively, you could install the lower version of the compiler in addition to the one you already have, or you may
|
||||
already have it but it's not the default one, so the build system can't see it. If you have ``gcc-7`` installed but the
|
||||
build system complains it can't find it, the following might do the trick:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
sudo ln -s /usr/bin/gcc-7 /usr/local/cuda-10.2/bin/gcc
|
||||
sudo ln -s /usr/bin/g++-7 /usr/local/cuda-10.2/bin/g++
|
||||
|
||||
|
||||
Here, we are making a symlink to ``gcc-7`` from ``/usr/local/cuda-10.2/bin/gcc`` and since
|
||||
``/usr/local/cuda-10.2/bin/`` should be in the ``PATH`` environment variable (see the previous problem's solution), it
|
||||
should find ``gcc-7`` (and ``g++-7``) and then the build will succeed.
|
||||
|
||||
As always, make sure to edit the paths in the example to match your situation.
|
||||
|
||||
**If still unsuccessful:**
|
||||
|
||||
If you still encounter build issues after addressing these, please open a GitHub Issue with `FairScale
<https://github.com/facebookresearch/fairscale/issues>`__ or `DeepSpeed
<https://github.com/microsoft/DeepSpeed/issues>`__, depending on the project you have the problem with.
|
||||
|
||||
|
||||
FairScale
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
@@ -124,6 +242,8 @@ provides support for the following features from `the ZeRO paper <https://arxiv.
|
||||
1. Optimizer State Sharding
|
||||
2. Gradient Sharding
|
||||
|
||||
You will need at least two GPUs to use this feature.
|
||||
|
||||
To deploy this feature:
|
||||
|
||||
1. Install the library via pypi:
|
||||
@@ -132,7 +252,7 @@ To deploy this feature:
|
||||
|
||||
pip install fairscale
|
||||
|
||||
or find more details on `the FairScale's github page
|
||||
or find more details on `the FairScale's GitHub page
|
||||
<https://github.com/facebookresearch/fairscale/#installation>`__.
|
||||
|
||||
2. Add ``--sharded_ddp`` to the command line arguments, and make sure you have added the distributed launcher ``-m
|
||||
@@ -164,7 +284,6 @@ Notes:
|
||||
DeepSpeed
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
|
||||
`DeepSpeed <https://github.com/microsoft/DeepSpeed>`__ implements everything described in the `ZeRO paper
|
||||
<https://arxiv.org/abs/1910.02054>`__, except ZeRO's stage 3. "Parameter Partitioning (Pos+g+p)". Currently it provides
|
||||
full support for:
|
||||
@@ -172,58 +291,119 @@ full support for:
|
||||
1. Optimizer State Partitioning (ZeRO stage 1)
|
||||
2. Add Gradient Partitioning (ZeRO stage 2)
|
||||
|
||||
To deploy this feature:
|
||||
Installation
|
||||
=======================================================================================================================
|
||||
|
||||
1. Install the library via pypi:
|
||||
Install the library via pypi:
|
||||
|
||||
.. code-block:: bash
|
||||
.. code-block:: bash
|
||||
|
||||
pip install deepspeed
|
||||
pip install deepspeed
|
||||
|
||||
or find more details on `the DeepSpeed's github page <https://github.com/microsoft/deepspeed#installation>`__.
|
||||
or find more details on `the DeepSpeed's GitHub page <https://github.com/microsoft/deepspeed#installation>`__.
|
||||
|
||||
2. Adjust the :class:`~transformers.Trainer` command line arguments as following:
|
||||
Deployment with multiple GPUs
|
||||
=======================================================================================================================
|
||||
|
||||
1. replace ``python -m torch.distributed.launch`` with ``deepspeed``.
|
||||
2. add a new argument ``--deepspeed ds_config.json``, where ``ds_config.json`` is the DeepSpeed configuration file
|
||||
as documented `here <https://www.deepspeed.ai/docs/config-json/>`__. The file naming is up to you.
|
||||
To deploy this feature with multiple GPUs, adjust the :class:`~transformers.Trainer` command line arguments as
follows:
|
||||
|
||||
Therefore, if your original command line looked as following:
|
||||
1. replace ``python -m torch.distributed.launch`` with ``deepspeed``.
|
||||
2. add a new argument ``--deepspeed ds_config.json``, where ``ds_config.json`` is the DeepSpeed configuration file as
|
||||
documented `here <https://www.deepspeed.ai/docs/config-json/>`__. The file naming is up to you.
|
||||
|
||||
.. code-block:: bash
|
||||
Therefore, if your original command line looked as follows:
|
||||
|
||||
python -m torch.distributed.launch --nproc_per_node=2 your_program.py <normal cl args>
|
||||
.. code-block:: bash
|
||||
|
||||
Now it should be:
|
||||
python -m torch.distributed.launch --nproc_per_node=2 your_program.py <normal cl args>
|
||||
|
||||
.. code-block:: bash
|
||||
Now it should be:
|
||||
|
||||
deepspeed --num_gpus=2 your_program.py <normal cl args> --deepspeed ds_config.json
|
||||
.. code-block:: bash
|
||||
|
||||
Unlike, ``torch.distributed.launch`` where you have to specify how many GPUs to use with ``--nproc_per_node``, with
|
||||
the ``deepspeed`` launcher you don't have to use the corresponding ``--num_gpus`` if you want all of your GPUs used.
|
||||
The full details on how to configure various nodes and GPUs can be found `here
|
||||
<https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node>`__.
|
||||
deepspeed --num_gpus=2 your_program.py <normal cl args> --deepspeed ds_config.json
|
||||
|
||||
Here is an example of running ``finetune_trainer.py`` under DeepSpeed deploying all available GPUs:
|
||||
Unlike ``torch.distributed.launch``, where you have to specify how many GPUs to use with ``--nproc_per_node``, with the
|
||||
``deepspeed`` launcher you don't have to use the corresponding ``--num_gpus`` if you want all of your GPUs used. The
|
||||
full details on how to configure various nodes and GPUs can be found `here
|
||||
<https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node>`__.
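The two adjustments above (swap the launcher, append ``--deepspeed ds_config.json``) can be sketched as a small argument-list transformation. This is a hypothetical helper written for illustration only; it is not provided by DeepSpeed or by the :class:`~transformers.Trainer`, and it assumes the original command starts with exactly ``python -m torch.distributed.launch`` and passes the GPU count as ``--nproc_per_node=N``.

```python
def to_deepspeed_cmd(launch_cmd, config_file="ds_config.json"):
    """Rewrite a torch.distributed.launch command list for the deepspeed launcher."""
    prefix = ["python", "-m", "torch.distributed.launch"]
    if launch_cmd[:3] != prefix:
        raise ValueError("expected a 'python -m torch.distributed.launch' command")

    # Step 1: replace the launcher with `deepspeed`.
    new_cmd = ["deepspeed"]
    for arg in launch_cmd[3:]:
        if arg.startswith("--nproc_per_node="):
            # The deepspeed launcher's counterpart of this option is --num_gpus.
            new_cmd.append("--num_gpus=" + arg.split("=", 1)[1])
        else:
            new_cmd.append(arg)

    # Step 2: append the single DeepSpeed argument understood by the Trainer.
    new_cmd += ["--deepspeed", config_file]
    return new_cmd


old = "python -m torch.distributed.launch --nproc_per_node=2 your_program.py --do_train".split()
print(" ".join(to_deepspeed_cmd(old)))
```

Dropping the ``--num_gpus`` element from the result would let the launcher use all visible GPUs, as described above.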
|
||||
|
||||
.. code-block:: bash
|
||||
Here is an example of running ``finetune_trainer.py`` under DeepSpeed deploying all available GPUs:
|
||||
|
||||
cd examples/seq2seq
|
||||
deepspeed ./finetune_trainer.py --deepspeed ds_config.json \
|
||||
--model_name_or_path sshleifer/distill-mbart-en-ro-12-4 --data_dir wmt_en_ro \
|
||||
--output_dir output_dir --overwrite_output_dir \
|
||||
--do_train --n_train 500 --num_train_epochs 1 \
|
||||
--per_device_train_batch_size 1 --freeze_embeds \
|
||||
--src_lang en_XX --tgt_lang ro_RO --task translation
|
||||
.. code-block:: bash
|
||||
|
||||
Note that in the DeepSpeed documentation you are likely to see ``--deepspeed --deepspeed_config ds_config.json`` -
|
||||
i.e. two DeepSpeed-related arguments, but for the sake of simplicity, and since there are already so many arguments
|
||||
to deal with, we combined the two into a single argument.
|
||||
cd examples/seq2seq
|
||||
deepspeed ./finetune_trainer.py --deepspeed ds_config.json \
|
||||
--model_name_or_path sshleifer/distill-mbart-en-ro-12-4 --data_dir wmt_en_ro \
|
||||
--output_dir output_dir --overwrite_output_dir \
|
||||
--do_train --n_train 500 --num_train_epochs 1 \
|
||||
--per_device_train_batch_size 1 --freeze_embeds \
|
||||
--src_lang en_XX --tgt_lang ro_RO --task translation
|
||||
|
||||
Before you can deploy DeepSpeed, let's discuss its configuration.
|
||||
Note that in the DeepSpeed documentation you are likely to see ``--deepspeed --deepspeed_config ds_config.json`` - i.e.
|
||||
two DeepSpeed-related arguments, but for the sake of simplicity, and since there are already so many arguments to deal
|
||||
with, we combined the two into a single argument.
|
||||
|
||||
**Configuration:**
|
||||
For some practical usage examples, please, see this `post
|
||||
<https://github.com/huggingface/transformers/issues/8771#issuecomment-759248400>`__.
|
||||
|
||||
|
||||
|
||||
Deployment with one GPU
|
||||
=======================================================================================================================
|
||||
|
||||
To deploy DeepSpeed with one GPU, adjust the :class:`~transformers.Trainer` command line arguments as follows:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
cd examples/seq2seq
|
||||
deepspeed --num_gpus=1 ./finetune_trainer.py --deepspeed ds_config.json \
|
||||
--model_name_or_path sshleifer/distill-mbart-en-ro-12-4 --data_dir wmt_en_ro \
|
||||
--output_dir output_dir --overwrite_output_dir \
|
||||
--do_train --n_train 500 --num_train_epochs 1 \
|
||||
--per_device_train_batch_size 1 --freeze_embeds \
|
||||
--src_lang en_XX --tgt_lang ro_RO --task translation
|
||||
|
||||
This is almost the same as with multiple GPUs, but here we tell DeepSpeed explicitly to use just one GPU. By default,
|
||||
DeepSpeed deploys all GPUs it can see. If you have only 1 GPU to start with, then you don't need this argument. The
|
||||
following `documentation <https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node>`__ discusses the
|
||||
launcher options.
|
||||
|
||||
Why would you want to use DeepSpeed with just one GPU?
|
||||
|
||||
1. It has a ZeRO-offload feature which can delegate some computations and memory to the host's CPU and RAM, and thus
|
||||
leave more GPU resources for the model's needs, e.g. a larger batch size, or fitting a very big model which
|
||||
normally won't fit.
|
||||
2. It provides a smart GPU memory management system that minimizes memory fragmentation, which again allows you to fit
|
||||
bigger models and data batches.
|
||||
|
||||
While we are going to discuss the configuration in detail next, the key to getting a huge improvement on a single GPU
|
||||
with DeepSpeed is to have at least the following configuration in the configuration file:
|
||||
|
||||
.. code-block:: json
|
||||
|
||||
{
|
||||
"zero_optimization": {
|
||||
"stage": 2,
|
||||
"allgather_partitions": true,
|
||||
"allgather_bucket_size": 2e8,
|
||||
"reduce_scatter": true,
|
||||
"reduce_bucket_size": 2e8,
|
||||
"overlap_comm": true,
|
||||
"contiguous_gradients": true,
|
||||
"cpu_offload": true
|
||||
}
|
||||
}
|
||||
|
||||
which enables ``cpu_offload`` and some other important features. You may experiment with the buffer sizes; you will
|
||||
find more details in the discussion below.
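If you prefer to generate the configuration file rather than write it by hand, the same settings can be expressed as a Python dictionary and serialized with the standard ``json`` module. This is a minimal sketch: the values are copied from the example above, and the variable names are arbitrary. Generating the file this way avoids syntax slips such as a stray trailing comma, which strict JSON parsers reject.

```python
import json

# Same settings as the single-GPU example configuration.
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": True,
        "allgather_bucket_size": 2e8,
        "reduce_scatter": True,
        "reduce_bucket_size": 2e8,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "cpu_offload": True,
    }
}

# Serialize to JSON text; Python's True becomes JSON's true.
config_text = json.dumps(ds_config, indent=4)

# Round-trip check: the text parses back to the same structure.
parsed = json.loads(config_text)
print(parsed["zero_optimization"]["stage"])
```

Writing ``config_text`` to a file such as ``ds_config.json`` gives a file that can be passed via ``--deepspeed``.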
|
||||
|
||||
For a practical usage example of this type of deployment, please, see this `post
|
||||
<https://github.com/huggingface/transformers/issues/8771#issuecomment-759176685>`__.
|
||||
|
||||
Configuration
|
||||
=======================================================================================================================
|
||||
|
||||
For the complete guide to the DeepSpeed configuration options that can be used in its configuration file please refer
|
||||
to the `following documentation <https://www.deepspeed.ai/docs/config-json/>`__.
|
||||
@@ -314,7 +494,8 @@ to achieve the same configuration as provided by the longer json file in the fir
|
||||
When you execute the program, DeepSpeed will log the configuration it received from the :class:`~transformers.Trainer`
|
||||
to the console, so you can see exactly what final configuration was passed to it.
|
||||
|
||||
**Shared Configuration:**
|
||||
Shared Configuration
|
||||
=======================================================================================================================
|
||||
|
||||
Some configuration information is required by both the :class:`~transformers.Trainer` and DeepSpeed to function
|
||||
correctly, therefore, to prevent conflicting definitions, which could lead to hard-to-detect errors, we chose to
|
||||
@@ -338,7 +519,8 @@ Of course, you will need to adjust the values in this example to your situation.
|
||||
|
||||
|
||||
|
||||
**ZeRO:**
|
||||
ZeRO
|
||||
=======================================================================================================================
|
||||
|
||||
The ``zero_optimization`` section of the configuration file is the most important part (`docs
|
||||
<https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training>`__), since that is where you define
|
||||
@@ -372,7 +554,8 @@ no equivalent command line arguments.
|
||||
|
||||
|
||||
|
||||
**Optimizer:**
|
||||
Optimizer
|
||||
=======================================================================================================================
|
||||
|
||||
|
||||
DeepSpeed's main optimizers are Adam, OneBitAdam, and Lamb. These have been thoroughly tested with ZeRO and are thus
|
||||
@@ -407,7 +590,8 @@ If you want to use one of the officially supported optimizers, configure them ex
|
||||
make sure to adjust the values. e.g. if you use Adam you will want ``weight_decay`` around ``0.01``.
|
||||
|
||||
|
||||
**Scheduler:**
|
||||
Scheduler
|
||||
=======================================================================================================================
|
||||
|
||||
DeepSpeed supports LRRangeTest, OneCycle, WarmupLR and WarmupDecayLR LR schedulers. The full documentation is `here
|
||||
<https://www.deepspeed.ai/docs/config-json/#scheduler-parameters>`__.
|
||||
@@ -456,7 +640,8 @@ Here is an example of the pre-configured ``scheduler`` entry for WarmupLR (``con
|
||||
}
|
||||
}
|
||||
|
||||
**Automatic Mixed Precision:**
|
||||
Automatic Mixed Precision
|
||||
=======================================================================================================================
|
||||
|
||||
You can work with FP16 in one of the following ways:
|
||||
|
||||
@@ -464,7 +649,7 @@ You can work with FP16 in one of the following ways:
|
||||
2. NVIDIA's apex, as documented `here
|
||||
<https://www.deepspeed.ai/docs/config-json/#automatic-mixed-precision-amp-training-options>`__.
|
||||
|
||||
If you want to use an equivalent of the pytorch native amp, you can either configure the ``fp16`` entry in the
|
||||
If you want to use an equivalent of the PyTorch native amp, you can either configure the ``fp16`` entry in the
|
||||
configuration file, or use the following command line arguments: ``--fp16 --fp16_backend amp``.
|
||||
|
||||
Here is an example of the ``fp16`` configuration:
|
||||
@@ -497,7 +682,8 @@ Here is an example of the ``amp`` configuration:
|
||||
|
||||
|
||||
|
||||
**Gradient Clipping:**
|
||||
Gradient Clipping
|
||||
=======================================================================================================================
|
||||
|
||||
If you don't configure the ``gradient_clipping`` entry in the configuration file, the :class:`~transformers.Trainer`
|
||||
will use the value of the ``--max_grad_norm`` command line argument to set it.
|
||||
@@ -512,7 +698,8 @@ Here is an example of the ``gradient_clipping`` configuration:
|
||||
|
||||
|
||||
|
||||
**Notes:**
|
||||
Notes
|
||||
=======================================================================================================================
|
||||
|
||||
* DeepSpeed works with the PyTorch :class:`~transformers.Trainer` but not TF :class:`~transformers.TFTrainer`.
|
||||
* While DeepSpeed has a pip installable PyPI package, it is highly recommended that it gets installed from `source
|
||||
@@ -522,12 +709,14 @@ Here is an example of the ``gradient_clipping`` configuration:
|
||||
use any model with your own trainer, and you will have to adapt the latter according to `the DeepSpeed integration
|
||||
instructions <https://www.deepspeed.ai/getting-started/#writing-deepspeed-models>`__.
|
||||
|
||||
**Main DeepSpeed Resources:**
|
||||
Main DeepSpeed Resources
|
||||
=======================================================================================================================
|
||||
|
||||
- `github <https://github.com/microsoft/deepspeed>`__
|
||||
- `Project's GitHub <https://github.com/microsoft/deepspeed>`__
|
||||
- `Usage docs <https://www.deepspeed.ai/getting-started/>`__
|
||||
- `API docs <https://deepspeed.readthedocs.io/en/latest/index.html>`__
|
||||
- `Blog posts <https://www.microsoft.com/en-us/research/search/?q=deepspeed>`__
|
||||
|
||||
Finally, please remember that the HuggingFace :class:`~transformers.Trainer` only integrates DeepSpeed, therefore if you
|
||||
have any problems or questions with regards to DeepSpeed usage, please, file an issue with `DeepSpeed github
|
||||
have any problems or questions with regards to DeepSpeed usage, please, file an issue with `DeepSpeed GitHub
|
||||
<https://github.com/microsoft/DeepSpeed/issues>`__.
|
||||
|
||||