[deepspeed doc] install issues + 1-gpu deployment (#9582)
* [doc] install + 1-gpu deployment * Apply suggestions from code review Co-authored-by: Lysandre Debut <lysandre@huggingface.co> Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * improvements Co-authored-by: Lysandre Debut <lysandre@huggingface.co> Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@@ -113,7 +113,125 @@ Toward Training Trillion Parameter Models, by Samyam Rajbhandari, Jeff Rasley, O
|
|||||||
|
|
||||||
This provided support is new and experimental as of this writing.
|
This provided support is new and experimental as of this writing.
|
||||||
|
|
||||||
You will need at least 2 GPUs to benefit from these features.
|
Installation Notes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

As of this writing, both FairScale and DeepSpeed require compilation of CUDA C++ code before they can be used.

While all installation issues should be dealt with through the corresponding GitHub Issues of `FairScale
<https://github.com/facebookresearch/fairscale/issues>`__ and `DeepSpeed
<https://github.com/microsoft/DeepSpeed/issues>`__, there are a few common issues that one may encounter while building
any PyTorch extension that needs to build CUDA extensions.

Therefore, if you encounter a CUDA-related build issue while doing one of the following or both:

.. code-block:: bash

    pip install fairscale
    pip install deepspeed

please read the following notes first.

In these notes we give examples for what to do when ``pytorch`` has been built with CUDA ``10.2``. If your situation is
different, remember to adjust the version number to the one you are after.

**Possible problem #1:**

While PyTorch comes with its own CUDA toolkit, to build these two projects you must have an identical version of CUDA
installed system-wide.

For example, if you installed ``pytorch`` with ``cudatoolkit==10.2`` in the Python environment, you also need to have
CUDA ``10.2`` installed system-wide.

The exact location may vary from system to system, but ``/usr/local/cuda-10.2`` is the most common location on many
Unix systems. When CUDA is correctly set up and added to the ``PATH`` environment variable, one can find the
installation location by doing:

.. code-block:: bash

    which nvcc

If you don't have CUDA installed system-wide, install it first. You will find the instructions by using your favorite
search engine. For example, if you're on Ubuntu you may want to search for: `ubuntu cuda 10.2 install
<https://www.google.com/search?q=ubuntu+cuda+10.2+install>`__.
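
The version check described above can also be scripted. The following is only a minimal sketch, not part of either
project: the helper names ``parse_nvcc_release`` and ``system_cuda_matches`` are made up for illustration, and it
assumes that ``nvcc --version`` prints a line containing ``release <major>.<minor>``.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_nvcc_release(version_output: str) -> Optional[str]:
    """Extract the 'major.minor' release from the text printed by `nvcc --version`."""
    match = re.search(r"release (\d+\.\d+)", version_output)
    return match.group(1) if match else None


def system_cuda_matches(expected: str) -> bool:
    """Return True if the nvcc found on PATH reports the expected release."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        # No system-wide CUDA toolkit is visible on PATH.
        return False
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout) == expected


if __name__ == "__main__":
    # "10.2" mirrors the example used in these notes (an assumption about your setup).
    print("system CUDA matches 10.2:", system_cuda_matches("10.2"))
```

In a real environment you would probably pass ``torch.version.cuda`` instead of a hard-coded string, so that the
system toolkit is compared against the version PyTorch was built with.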

**Possible problem #2:**

Another common problem is that you may have more than one CUDA toolkit installed system-wide. For example you may
have:

.. code-block:: bash

    /usr/local/cuda-10.2
    /usr/local/cuda-11.0

In this situation you need to make sure that your ``PATH`` and ``LD_LIBRARY_PATH`` environment variables contain the
correct paths to the desired CUDA version. Typically, package installers will set these to contain whichever version
was installed last. If the package build fails because it can't find the right CUDA version despite you having it
installed system-wide, it means that you need to adjust the 2 aforementioned environment variables.

First, you may look at their contents:

.. code-block:: bash

    echo $PATH
    echo $LD_LIBRARY_PATH

so you get an idea of what is inside.

It's possible that ``LD_LIBRARY_PATH`` is empty.

``PATH`` lists the locations where executables can be found and ``LD_LIBRARY_PATH`` lists the locations where shared
libraries are to be looked for. In both cases, earlier entries have priority over the later ones. ``:`` is used to
separate multiple entries.

Now, to tell the build program where to find the specific CUDA toolkit, insert the desired paths to be listed first by
doing:

.. code-block:: bash

    export PATH=/usr/local/cuda-10.2/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64:$LD_LIBRARY_PATH

Note that we aren't overwriting the existing values, but prepending instead.

Of course, adjust the version number and the full path if need be. Check that the directories you assign actually
exist. The ``lib64`` sub-directory is where the various CUDA ``.so`` objects, like ``libcudart.so``, reside. It's
unlikely that your system will have it named differently, but if it does, adjust the path to reflect your reality.
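
To see why prepending works, here is a small self-contained sketch of the "earlier entries win" rule. It is an
illustration only: ``prepend`` and ``resolve_demo`` are hypothetical helpers, and the two toolkit directories are
created in a temporary folder with dummy ``nvcc`` files rather than being real CUDA installs.

```python
import os
import shutil
import stat
import tempfile


def prepend(entry: str, current: str) -> str:
    """Prepend `entry` to a PATH-style list, avoiding a dangling separator if `current` is empty."""
    return f"{entry}{os.pathsep}{current}" if current else entry


def resolve_demo() -> str:
    """Create two fake toolkit dirs and return the `nvcc` a PATH-style lookup would pick."""
    with tempfile.TemporaryDirectory() as root:
        bin_dirs = {}
        for name in ("cuda-10.2", "cuda-11.0"):
            bin_dir = os.path.join(root, name, "bin")
            os.makedirs(bin_dir)
            tool = os.path.join(bin_dir, "nvcc")
            with open(tool, "w") as handle:
                handle.write("#!/bin/sh\n")  # dummy stand-in, not a real compiler
            os.chmod(tool, stat.S_IRWXU)  # must be executable to be found
            bin_dirs[name] = bin_dir
        # Pretend the installer left cuda-11.0 on the search path, then prepend cuda-10.2.
        search_path = prepend(bin_dirs["cuda-10.2"], bin_dirs["cuda-11.0"])
        found = shutil.which("nvcc", path=search_path)
        return os.path.relpath(found, root)


if __name__ == "__main__":
    print(resolve_demo())
```

The lookup returns the ``nvcc`` under ``cuda-10.2`` because that directory is listed first, which is the same effect
the ``export`` lines above rely on.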

**Possible problem #3:**

Some older CUDA versions may refuse to build with newer compilers. For example, you may have ``gcc-9`` but it wants
``gcc-7``.

There are various ways to go about it.

If you can install the latest CUDA toolkit, it typically should support the newer compiler.

Alternatively, you could install the lower version of the compiler in addition to the one you already have, or you may
already have it but it's not the default one, so the build system can't see it. If you have ``gcc-7`` installed but the
build system complains it can't find it, the following might do the trick:

.. code-block:: bash

    sudo ln -s /usr/bin/gcc-7 /usr/local/cuda-10.2/bin/gcc
    sudo ln -s /usr/bin/g++-7 /usr/local/cuda-10.2/bin/g++

Here, we are making a symlink to ``gcc-7`` from ``/usr/local/cuda-10.2/bin/gcc``, and since
``/usr/local/cuda-10.2/bin/`` should be in the ``PATH`` environment variable (see the previous problem's solution), the
build should find ``gcc-7`` (and ``g++-7``) and then succeed.

As always, make sure to edit the paths in the example to match your situation.
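
Before creating symlinks it can help to confirm which compiler the build would pick up by default. The sketch below is
an assumption-laden illustration: ``major_version`` and ``default_gcc_major`` are hypothetical helper names, and it
relies on ``gcc -dumpversion`` printing a version string whose first dot-separated component is the major version.

```python
import shutil
import subprocess
from typing import Optional


def major_version(dumpversion_output: str) -> int:
    """Return the major version from text such as '9.3.0' or '7'."""
    return int(dumpversion_output.strip().split(".")[0])


def default_gcc_major() -> Optional[int]:
    """Return the major version of the `gcc` found first on PATH, or None if there is none."""
    gcc = shutil.which("gcc")
    if gcc is None:
        return None
    result = subprocess.run([gcc, "-dumpversion"], capture_output=True, text=True)
    return major_version(result.stdout)


if __name__ == "__main__":
    # Compare this with the compiler version your CUDA toolkit asks for, e.g. 7.
    print("default gcc major version:", default_gcc_major())
```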

**If still unsuccessful:**

If after addressing these you still encounter build issues, please proceed with the GitHub Issues of `FairScale
<https://github.com/facebookresearch/fairscale/issues>`__ or `DeepSpeed
<https://github.com/microsoft/DeepSpeed/issues>`__, depending on the project you have the problem with.

||||||
FairScale
|
FairScale
|
||||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
@@ -124,6 +242,8 @@ provides support for the following features from `the ZeRO paper <https://arxiv.
|
|||||||
1. Optimizer State Sharding
|
1. Optimizer State Sharding
|
||||||
2. Gradient Sharding
|
2. Gradient Sharding
|
||||||
|
|
||||||
|
You will need at least two GPUs to use this feature.
|
||||||
|
|
||||||
To deploy this feature:
|
To deploy this feature:
|
||||||
|
|
||||||
1. Install the library via pypi:
|
1. Install the library via pypi:
|
||||||
@@ -132,7 +252,7 @@ To deploy this feature:
|
|||||||
|
|
||||||
pip install fairscale
|
pip install fairscale
|
||||||
|
|
||||||
or find more details on `the FairScale's github page
|
or find more details on `the FairScale's GitHub page
|
||||||
<https://github.com/facebookresearch/fairscale/#installation>`__.
|
<https://github.com/facebookresearch/fairscale/#installation>`__.
|
||||||
|
|
||||||
2. Add ``--sharded_ddp`` to the command line arguments, and make sure you have added the distributed launcher ``-m
|
2. Add ``--sharded_ddp`` to the command line arguments, and make sure you have added the distributed launcher ``-m
|
||||||
@@ -164,7 +284,6 @@ Notes:
|
|||||||
DeepSpeed
|
DeepSpeed
|
||||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
|
||||||
`DeepSpeed <https://github.com/microsoft/DeepSpeed>`__ implements everything described in the `ZeRO paper
|
`DeepSpeed <https://github.com/microsoft/DeepSpeed>`__ implements everything described in the `ZeRO paper
|
||||||
<https://arxiv.org/abs/1910.02054>`__, except ZeRO's stage 3. "Parameter Partitioning (Pos+g+p)". Currently it provides
|
<https://arxiv.org/abs/1910.02054>`__, except ZeRO's stage 3. "Parameter Partitioning (Pos+g+p)". Currently it provides
|
||||||
full support for:
|
full support for:
|
||||||
@@ -172,21 +291,26 @@ full support for:
|
|||||||
1. Optimizer State Partitioning (ZeRO stage 1)
|
1. Optimizer State Partitioning (ZeRO stage 1)
|
||||||
2. Add Gradient Partitioning (ZeRO stage 2)
|
2. Add Gradient Partitioning (ZeRO stage 2)
|
||||||
|
|
||||||
To deploy this feature:
|
Installation
|
||||||
|
=======================================================================================================================
|
||||||
|
|
||||||
1. Install the library via pypi:
|
Install the library via pypi:
|
||||||
|
|
||||||
.. code-block:: bash
|
.. code-block:: bash
|
||||||
|
|
||||||
pip install deepspeed
|
pip install deepspeed
|
||||||
|
|
||||||
or find more details on `the DeepSpeed's github page <https://github.com/microsoft/deepspeed#installation>`__.
|
or find more details on `the DeepSpeed's GitHub page <https://github.com/microsoft/deepspeed#installation>`__.
|
||||||
|
|
||||||
2. Adjust the :class:`~transformers.Trainer` command line arguments as following:
|
Deployment with multiple GPUs
|
||||||
|
=======================================================================================================================
|
||||||
|
|
||||||
|
To deploy this feature with multiple GPUs, adjust the :class:`~transformers.Trainer` command line arguments as
follows:
|
||||||
|
|
||||||
1. replace ``python -m torch.distributed.launch`` with ``deepspeed``.
|
1. replace ``python -m torch.distributed.launch`` with ``deepspeed``.
|
||||||
2. add a new argument ``--deepspeed ds_config.json``, where ``ds_config.json`` is the DeepSpeed configuration file
|
2. add a new argument ``--deepspeed ds_config.json``, where ``ds_config.json`` is the DeepSpeed configuration file as
|
||||||
as documented `here <https://www.deepspeed.ai/docs/config-json/>`__. The file naming is up to you.
|
documented `here <https://www.deepspeed.ai/docs/config-json/>`__. The file naming is up to you.
|
||||||
|
|
||||||
Therefore, if your original command line looked as following:
|
Therefore, if your original command line looked as following:
|
||||||
|
|
||||||
@@ -200,9 +324,9 @@ To deploy this feature:
|
|||||||
|
|
||||||
deepspeed --num_gpus=2 your_program.py <normal cl args> --deepspeed ds_config.json
|
deepspeed --num_gpus=2 your_program.py <normal cl args> --deepspeed ds_config.json
|
||||||
|
|
||||||
Unlike, ``torch.distributed.launch`` where you have to specify how many GPUs to use with ``--nproc_per_node``, with
|
Unlike ``torch.distributed.launch``, where you have to specify how many GPUs to use with ``--nproc_per_node``, with the
|
||||||
the ``deepspeed`` launcher you don't have to use the corresponding ``--num_gpus`` if you want all of your GPUs used.
|
``deepspeed`` launcher you don't have to use the corresponding ``--num_gpus`` if you want all of your GPUs used. The
|
||||||
The full details on how to configure various nodes and GPUs can be found `here
|
full details on how to configure various nodes and GPUs can be found `here
|
||||||
<https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node>`__.
|
<https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node>`__.
|
||||||
|
|
||||||
Here is an example of running ``finetune_trainer.py`` under DeepSpeed deploying all available GPUs:
|
Here is an example of running ``finetune_trainer.py`` under DeepSpeed deploying all available GPUs:
|
||||||
@@ -217,13 +341,69 @@ To deploy this feature:
|
|||||||
--per_device_train_batch_size 1 --freeze_embeds \
|
--per_device_train_batch_size 1 --freeze_embeds \
|
||||||
--src_lang en_XX --tgt_lang ro_RO --task translation
|
--src_lang en_XX --tgt_lang ro_RO --task translation
|
||||||
|
|
||||||
Note that in the DeepSpeed documentation you are likely to see ``--deepspeed --deepspeed_config ds_config.json`` -
|
Note that in the DeepSpeed documentation you are likely to see ``--deepspeed --deepspeed_config ds_config.json`` - i.e.
|
||||||
i.e. two DeepSpeed-related arguments, but for the sake of simplicity, and since there are already so many arguments
|
two DeepSpeed-related arguments, but for the sake of simplicity, and since there are already so many arguments to deal
|
||||||
to deal with, we combined the two into a single argument.
|
with, we combined the two into a single argument.
|
||||||
|
|
||||||
Before you can deploy DeepSpeed, let's discuss its configuration.
|
For some practical usage examples, please see this `post
|
||||||
|
<https://github.com/huggingface/transformers/issues/8771#issuecomment-759248400>`__.
|
||||||
|
|
||||||
**Configuration:**
|
|
||||||
|
|
||||||
|
Deployment with one GPU
=======================================================================================================================

To deploy DeepSpeed with one GPU, adjust the :class:`~transformers.Trainer` command line arguments as follows:

.. code-block:: bash

    cd examples/seq2seq
    deepspeed --num_gpus=1 ./finetune_trainer.py --deepspeed ds_config.json \
    --model_name_or_path sshleifer/distill-mbart-en-ro-12-4 --data_dir wmt_en_ro \
    --output_dir output_dir --overwrite_output_dir \
    --do_train --n_train 500 --num_train_epochs 1 \
    --per_device_train_batch_size 1 --freeze_embeds \
    --src_lang en_XX --tgt_lang ro_RO --task translation

This is almost the same as with multiple GPUs, but here we tell DeepSpeed explicitly to use just one GPU. By default,
DeepSpeed deploys all GPUs it can see. If you have only 1 GPU to start with, then you don't need this argument. The
following `documentation <https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node>`__ discusses the
launcher options.
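
If you launch training from a script, the command line can be assembled programmatically. This is a rough sketch under
stated assumptions: ``build_deepspeed_command`` is a hypothetical helper, and it only uses the ``deepspeed`` launcher,
``--num_gpus`` and ``--deepspeed`` flags shown in this document.

```python
import shlex
from typing import List, Optional, Sequence


def build_deepspeed_command(
    script: str,
    script_args: Sequence[str],
    config: str = "ds_config.json",
    num_gpus: Optional[int] = None,
) -> List[str]:
    """Assemble a `deepspeed` launcher command; omit --num_gpus to let it use all visible GPUs."""
    cmd = ["deepspeed"]
    if num_gpus is not None:
        cmd.append(f"--num_gpus={num_gpus}")
    cmd.append(script)
    cmd.extend(script_args)
    cmd.extend(["--deepspeed", config])
    return cmd


if __name__ == "__main__":
    single_gpu = build_deepspeed_command(
        "./finetune_trainer.py",
        ["--do_train", "--per_device_train_batch_size", "1"],
        num_gpus=1,
    )
    # Print a copy-pasteable shell command; pass the list to subprocess.run to actually launch.
    print(shlex.join(single_gpu))
```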

Why would you want to use DeepSpeed with just one GPU?

1. It has a ZeRO-offload feature which can delegate some computations and memory to the host's CPU and RAM, and thus
   leave more GPU resources for the model's needs, e.g. a larger batch size, or fitting a very big model which
   normally won't fit.
2. It provides a smart GPU memory management system that minimizes memory fragmentation, which again allows you to fit
   bigger models and data batches.

While we are going to discuss the configuration in detail next, the key to getting a huge improvement on a single GPU
with DeepSpeed is to have at least the following configuration in the configuration file:

.. code-block:: json

    {
      "zero_optimization": {
         "stage": 2,
         "allgather_partitions": true,
         "allgather_bucket_size": 2e8,
         "reduce_scatter": true,
         "reduce_bucket_size": 2e8,
         "overlap_comm": true,
         "contiguous_gradients": true,
         "cpu_offload": true
      }
    }

which enables ``cpu_offload`` and some other important features. You may experiment with the buffer sizes; you will
find more details in the discussion below.
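
The same minimal configuration can be generated from Python, which makes it easier to experiment with the bucket
sizes. This is only a sketch: ``minimal_zero_config`` and ``write_config`` are hypothetical helper names, the keys and
values are copied from the example above, and the ``bucket_size`` parameter is an illustrative knob rather than a
recommended setting.

```python
import json
from typing import Any, Dict


def minimal_zero_config(bucket_size: float = 2e8) -> Dict[str, Any]:
    """Return the minimal ZeRO stage 2 + cpu_offload configuration from the example above."""
    return {
        "zero_optimization": {
            "stage": 2,
            "allgather_partitions": True,
            "allgather_bucket_size": bucket_size,
            "reduce_scatter": True,
            "reduce_bucket_size": bucket_size,
            "overlap_comm": True,
            "contiguous_gradients": True,
            "cpu_offload": True,
        }
    }


def write_config(path: str, config: Dict[str, Any]) -> None:
    """Write the configuration as JSON; json.dump cannot emit a trailing comma."""
    with open(path, "w") as handle:
        json.dump(config, handle, indent=2)


if __name__ == "__main__":
    # The file name is up to you; it is what gets passed to --deepspeed.
    write_config("ds_config.json", minimal_zero_config())
```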

For a practical usage example of this type of deployment, please see this `post
<https://github.com/huggingface/transformers/issues/8771#issuecomment-759176685>`__.

Configuration
=======================================================================================================================
|
||||||
|
|
||||||
For the complete guide to the DeepSpeed configuration options that can be used in its configuration file please refer
|
For the complete guide to the DeepSpeed configuration options that can be used in its configuration file please refer
|
||||||
to the `following documentation <https://www.deepspeed.ai/docs/config-json/>`__.
|
to the `following documentation <https://www.deepspeed.ai/docs/config-json/>`__.
|
||||||
@@ -314,7 +494,8 @@ to achieve the same configuration as provided by the longer json file in the fir
|
|||||||
When you execute the program, DeepSpeed will log the configuration it received from the :class:`~transformers.Trainer`
|
When you execute the program, DeepSpeed will log the configuration it received from the :class:`~transformers.Trainer`
|
||||||
to the console, so you can see exactly what the final configuration was passed to it.
|
to the console, so you can see exactly what the final configuration was passed to it.
|
||||||
|
|
||||||
**Shared Configuration:**
|
Shared Configuration
|
||||||
|
=======================================================================================================================
|
||||||
|
|
||||||
Some configuration information is required by both the :class:`~transformers.Trainer` and DeepSpeed to function
|
Some configuration information is required by both the :class:`~transformers.Trainer` and DeepSpeed to function
|
||||||
correctly, therefore, to prevent conflicting definitions, which could lead to hard to detect errors, we chose to
|
correctly, therefore, to prevent conflicting definitions, which could lead to hard to detect errors, we chose to
|
||||||
@@ -338,7 +519,8 @@ Of course, you will need to adjust the values in this example to your situation.
|
|||||||
|
|
||||||
|
|
||||||
|
|
||||||
**ZeRO:**
|
ZeRO
|
||||||
|
=======================================================================================================================
|
||||||
|
|
||||||
The ``zero_optimization`` section of the configuration file is the most important part (`docs
|
The ``zero_optimization`` section of the configuration file is the most important part (`docs
|
||||||
<https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training>`__), since that is where you define
|
<https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training>`__), since that is where you define
|
||||||
@@ -372,7 +554,8 @@ no equivalent command line arguments.
|
|||||||
|
|
||||||
|
|
||||||
|
|
||||||
**Optimizer:**
|
Optimizer
|
||||||
|
=======================================================================================================================
|
||||||
|
|
||||||
|
|
||||||
DeepSpeed's main optimizers are Adam, OneBitAdam, and Lamb. These have been thoroughly tested with ZeRO and are thus
|
DeepSpeed's main optimizers are Adam, OneBitAdam, and Lamb. These have been thoroughly tested with ZeRO and are thus
|
||||||
@@ -407,7 +590,8 @@ If you want to use one of the officially supported optimizers, configure them ex
|
|||||||
make sure to adjust the values. e.g. if use Adam you will want ``weight_decay`` around ``0.01``.
|
make sure to adjust the values. e.g. if use Adam you will want ``weight_decay`` around ``0.01``.
|
||||||
|
|
||||||
|
|
||||||
**Scheduler:**
|
Scheduler
|
||||||
|
=======================================================================================================================
|
||||||
|
|
||||||
DeepSpeed supports LRRangeTest, OneCycle, WarmupLR and WarmupDecayLR LR schedulers. The full documentation is `here
|
DeepSpeed supports LRRangeTest, OneCycle, WarmupLR and WarmupDecayLR LR schedulers. The full documentation is `here
|
||||||
<https://www.deepspeed.ai/docs/config-json/#scheduler-parameters>`__.
|
<https://www.deepspeed.ai/docs/config-json/#scheduler-parameters>`__.
|
||||||
@@ -456,7 +640,8 @@ Here is an example of the pre-configured ``scheduler`` entry for WarmupLR (``con
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
**Automatic Mixed Precision:**
|
Automatic Mixed Precision
|
||||||
|
=======================================================================================================================
|
||||||
|
|
||||||
You can work with FP16 in one of the following ways:
|
You can work with FP16 in one of the following ways:
|
||||||
|
|
||||||
@@ -464,7 +649,7 @@ You can work with FP16 in one of the following ways:
|
|||||||
2. NVIDIA's apex, as documented `here
|
2. NVIDIA's apex, as documented `here
|
||||||
<https://www.deepspeed.ai/docs/config-json/#automatic-mixed-precision-amp-training-options>`__.
|
<https://www.deepspeed.ai/docs/config-json/#automatic-mixed-precision-amp-training-options>`__.
|
||||||
|
|
||||||
If you want to use an equivalent of the pytorch native amp, you can either configure the ``fp16`` entry in the
|
If you want to use an equivalent of the Pytorch native amp, you can either configure the ``fp16`` entry in the
|
||||||
configuration file, or use the following command line arguments: ``--fp16 --fp16_backend amp``.
|
configuration file, or use the following command line arguments: ``--fp16 --fp16_backend amp``.
|
||||||
|
|
||||||
Here is an example of the ``fp16`` configuration:
|
Here is an example of the ``fp16`` configuration:
|
||||||
@@ -497,7 +682,8 @@ Here is an example of the ``amp`` configuration:
|
|||||||
|
|
||||||
|
|
||||||
|
|
||||||
**Gradient Clipping:**
|
Gradient Clipping
|
||||||
|
=======================================================================================================================
|
||||||
|
|
||||||
If you don't configure the ``gradient_clipping`` entry in the configuration file, the :class:`~transformers.Trainer`
|
If you don't configure the ``gradient_clipping`` entry in the configuration file, the :class:`~transformers.Trainer`
|
||||||
will use the value of the ``--max_grad_norm`` command line argument to set it.
|
will use the value of the ``--max_grad_norm`` command line argument to set it.
|
||||||
@@ -512,7 +698,8 @@ Here is an example of the ``gradient_clipping`` configuration:
|
|||||||
|
|
||||||
|
|
||||||
|
|
||||||
**Notes:**
|
Notes
|
||||||
|
=======================================================================================================================
|
||||||
|
|
||||||
* DeepSpeed works with the PyTorch :class:`~transformers.Trainer` but not TF :class:`~transformers.TFTrainer`.
|
* DeepSpeed works with the PyTorch :class:`~transformers.Trainer` but not TF :class:`~transformers.TFTrainer`.
|
||||||
* While DeepSpeed has a pip installable PyPI package, it is highly recommended that it gets installed from `source
|
* While DeepSpeed has a pip installable PyPI package, it is highly recommended that it gets installed from `source
|
||||||
@@ -522,12 +709,14 @@ Here is an example of the ``gradient_clipping`` configuration:
|
|||||||
use any model with your own trainer, and you will have to adapt the latter according to `the DeepSpeed integration
|
use any model with your own trainer, and you will have to adapt the latter according to `the DeepSpeed integration
|
||||||
instructions <https://www.deepspeed.ai/getting-started/#writing-deepspeed-models>`__.
|
instructions <https://www.deepspeed.ai/getting-started/#writing-deepspeed-models>`__.
|
||||||
|
|
||||||
**Main DeepSpeed Resources:**
|
Main DeepSpeed Resources
|
||||||
|
=======================================================================================================================
|
||||||
|
|
||||||
- `github <https://github.com/microsoft/deepspeed>`__
|
- `Project's github <https://github.com/microsoft/deepspeed>`__
|
||||||
- `Usage docs <https://www.deepspeed.ai/getting-started/>`__
|
- `Usage docs <https://www.deepspeed.ai/getting-started/>`__
|
||||||
- `API docs <https://deepspeed.readthedocs.io/en/latest/index.html>`__
|
- `API docs <https://deepspeed.readthedocs.io/en/latest/index.html>`__
|
||||||
|
- `Blog posts <https://www.microsoft.com/en-us/research/search/?q=deepspeed>`__
|
||||||
|
|
||||||
Finally, please, remember that, HuggingFace :class:`~transformers.Trainer` only integrates DeepSpeed, therefore if you
|
Finally, please, remember that, HuggingFace :class:`~transformers.Trainer` only integrates DeepSpeed, therefore if you
|
||||||
have any problems or questions with regards to DeepSpeed usage, please, file an issue with `DeepSpeed github
|
have any problems or questions with regards to DeepSpeed usage, please, file an issue with `DeepSpeed GitHub
|
||||||
<https://github.com/microsoft/DeepSpeed/issues>`__.
|
<https://github.com/microsoft/DeepSpeed/issues>`__.
|
||||||
|
|||||||