[DeepSpeed] ZeRO Stage 3 (#10753)
* synced gpus
* fix
* fix
* need to use t5-small for quality tests
* notes
* complete merge
* fix a disappearing std stream problem
* start zero3 tests
* wip
* tune params
* sorting out the pre-trained model loading
* reworking generate loop wip
* wip
* style
* fix tests
* split the tests
* refactor tests
* wip
* parameterized
* fix
* workout the resume from non-ds checkpoint pass + test
* cleanup
* remove no longer needed code
* split getter/setter functions
* complete the docs
* suggestions
* gpus and their compute capabilities link
* Apply suggestions from code review
  Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* style
* remove invalid paramgd
* automatically configure zero3 params that rely on hidden size
* make _get_resized_embeddings zero3-aware
* add test exercising resize_token_embeddings()
* add docstring
  Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
@@ -134,6 +134,8 @@ Toward Training Trillion Parameter Models, by Samyam Rajbhandari, Jeff Rasley, O
|
||||
|
||||
This provided support is new and experimental as of this writing.
|
||||
|
||||
.. _zero-install-notes:
|
||||
|
||||
Installation Notes
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
@@ -156,7 +158,8 @@ please, read the following notes first.
|
||||
In these notes we give examples for what to do when ``pytorch`` has been built with CUDA ``10.2``. If your situation is
|
||||
different remember to adjust the version number to the one you are after.
|
||||
|
||||
**Possible problem #1:**
|
||||
Possible problem #1
|
||||
=======================================================================================================================
|
||||
|
||||
While PyTorch comes with its own CUDA toolkit, to build these two projects you must have an identical version of CUDA
installed system-wide.
|
||||
@@ -176,7 +179,8 @@ If you don't have CUDA installed system-wide, install it first. You will find th
|
||||
search engine. For example, if you're on Ubuntu you may want to search for: `ubuntu cuda 10.2 install
|
||||
<https://www.google.com/search?q=ubuntu+cuda+10.2+install>`__.
|
||||
|
||||
**Possible problem #2:**
|
||||
Possible problem #2
|
||||
=======================================================================================================================
|
||||
|
||||
Another possible common problem is that you may have more than one CUDA toolkit installed system-wide. For example you
|
||||
may have:
|
||||
@@ -222,7 +226,8 @@ exist. ``lib64`` sub-directory is where the various CUDA ``.so`` objects, like `
|
||||
that your system will have it named differently, but if it is adjust it to reflect your reality.
|
||||
|
||||
|
||||
**Possible problem #3:**
|
||||
Possible problem #3
|
||||
=======================================================================================================================
|
||||
|
||||
Some older CUDA versions may refuse to build with newer compilers. For example, you may have ``gcc-9`` but it wants
``gcc-7``.
|
||||
@@ -247,13 +252,6 @@ should find ``gcc-7`` (and ``g++7``) and then the build will succeed.
|
||||
|
||||
As always make sure to edit the paths in the example to match your situation.
|
||||
|
||||
**If still unsuccessful:**
|
||||
|
||||
If after addressing these you still encounter build issues, please, proceed with the GitHub Issue of `FairScale
|
||||
<https://github.com/facebookresearch/fairscale/issues>`__ and `Deepspeed
|
||||
<https://github.com/microsoft/DeepSpeed/issues>`__, depending on the project you have the problem with.
|
||||
|
||||
|
||||
FairScale
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
@@ -267,20 +265,66 @@ provides support for the following features from `the ZeRO paper <https://arxiv.
|
||||
|
||||
You will need at least two GPUs to use this feature.
|
||||
|
||||
To deploy this feature:
|
||||
|
||||
1. Install the library via pypi:
|
||||
**Installation**:
|
||||
|
||||
.. code-block:: bash
|
||||
Install the library via pypi:
|
||||
|
||||
pip install fairscale
|
||||
.. code-block:: bash
|
||||
|
||||
or find more details on `the FairScale's GitHub page
|
||||
<https://github.com/facebookresearch/fairscale/#installation>`__.
|
||||
pip install fairscale
|
||||
|
||||
2. To use the first version of Sharded data-parallelism, add ``--sharded_ddp simple`` to the command line arguments,
|
||||
and make sure you have added the distributed launcher ``-m torch.distributed.launch
|
||||
--nproc_per_node=NUMBER_OF_GPUS_YOU_HAVE`` if you haven't been using it already.
|
||||
or find more details on `the FairScale's GitHub page <https://github.com/facebookresearch/fairscale/#installation>`__.
|
||||
|
||||
If you're still struggling with the build, first make sure to read :ref:`zero-install-notes`.
|
||||
|
||||
If that still doesn't resolve the build issue, here are a few more ideas.
|
||||
|
||||
``fairscale`` seems to have an issue with the build isolation feature recently introduced by pip. If you have a problem
with it, you may want to try one of:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
pip install fairscale --no-build-isolation .
|
||||
|
||||
or:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
git clone https://github.com/facebookresearch/fairscale/
|
||||
cd fairscale
|
||||
rm -r dist build
|
||||
python setup.py bdist_wheel
|
||||
pip uninstall -y fairscale
|
||||
pip install dist/fairscale-*.whl
|
||||
|
||||
``fairscale`` also has issues with building against pytorch-nightly, so if you use it you may have to try one of:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
pip uninstall -y fairscale; pip install fairscale --pre \
|
||||
-f https://download.pytorch.org/whl/nightly/cu110/torch_nightly.html \
|
||||
--no-cache --no-build-isolation
|
||||
|
||||
or:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
pip install -v --disable-pip-version-check . \
|
||||
-f https://download.pytorch.org/whl/nightly/cu110/torch_nightly.html --pre
|
||||
|
||||
Of course, adjust the urls to match the cuda version you use.
|
||||
|
||||
If after trying everything suggested you still encounter build issues, please file a GitHub issue with
`FairScale <https://github.com/facebookresearch/fairscale/issues>`__.
|
||||
|
||||
|
||||
|
||||
**Usage**:
|
||||
|
||||
To use the first version of Sharded data-parallelism, add ``--sharded_ddp simple`` to the command line arguments, and
|
||||
make sure you have added the distributed launcher ``-m torch.distributed.launch
|
||||
--nproc_per_node=NUMBER_OF_GPUS_YOU_HAVE`` if you haven't been using it already.
|
||||
|
||||
For example here is how you could use it for ``run_translation.py`` with 2 GPUs:
|
||||
|
||||
@@ -346,19 +390,23 @@ DeepSpeed
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
`DeepSpeed <https://github.com/microsoft/DeepSpeed>`__ implements everything described in the `ZeRO paper
|
||||
<https://arxiv.org/abs/1910.02054>`__, except ZeRO's stage 3. "Parameter Partitioning (Pos+g+p)". Currently it provides
|
||||
full support for:
|
||||
<https://arxiv.org/abs/1910.02054>`__. Currently it provides full support for:
|
||||
|
||||
1. Optimizer State Partitioning (ZeRO stage 1)
|
||||
2. Add Gradient Partitioning (ZeRO stage 2)
|
||||
3. Custom fp16 handling
|
||||
4. A range of fast Cuda-extension-based Optimizers
|
||||
5. ZeRO-Offload
|
||||
2. Gradient Partitioning (ZeRO stage 2)
|
||||
3. Param Partitioning (ZeRO stage 3)
|
||||
4. Custom mixed precision training handling
|
||||
5. A range of fast CUDA-extension-based Optimizers
|
||||
6. ZeRO-Offload
|
||||
|
||||
ZeRO-Offload has its own dedicated paper: `ZeRO-Offload: Democratizing Billion-Scale Model Training
|
||||
<https://arxiv.org/abs/2101.06840>`__.
|
||||
|
||||
DeepSpeed is currently used only for training, as all the currently available features are of no use to inference.
|
||||
DeepSpeed ZeRO-2 is currently used only for training, as all the currently available features are of no use to
|
||||
inference.
|
||||
|
||||
DeepSpeed ZeRO-3 can be used for inference as well, since it allows huge models to be loaded on multiple GPUs, which
wouldn't be possible on a single GPU.
|
||||
|
||||
|
||||
|
||||
@@ -371,7 +419,74 @@ Install the library via pypi:
|
||||
|
||||
pip install deepspeed
|
||||
|
||||
or find more details on `the DeepSpeed's GitHub page <https://github.com/microsoft/deepspeed#installation>`__.
|
||||
or find more details on `the DeepSpeed's GitHub page <https://github.com/microsoft/deepspeed#installation>`__ and
|
||||
`advanced install <https://www.deepspeed.ai/tutorials/advanced-install/>`__.
|
||||
|
||||
If you're still struggling with the build, first make sure to read :ref:`zero-install-notes`.
|
||||
|
||||
If you don't prebuild the extensions and rely on them to be built at run time and you tried all of the above solutions
|
||||
to no avail, the next thing to try is to pre-build the modules before installing them.
|
||||
|
||||
To make a local build for DeepSpeed:
|
||||
|
||||
.. code-block:: bash

    git clone https://github.com/microsoft/DeepSpeed/
    cd DeepSpeed
    rm -rf build
    TORCH_CUDA_ARCH_LIST="6.1;8.6" DS_BUILD_OPS=1 pip install . \
    --global-option="build_ext" --global-option="-j8" --no-cache -v \
    --disable-pip-version-check 2>&1 | tee build.log
|
||||
|
||||
Edit ``TORCH_CUDA_ARCH_LIST`` to insert the code for the architectures of the GPU cards you intend to use.
|
||||
|
||||
Or if you need to use the same setup on multiple machines, make a binary wheel:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
git clone https://github.com/microsoft/DeepSpeed/
|
||||
cd DeepSpeed
|
||||
rm -rf build
|
||||
TORCH_CUDA_ARCH_LIST="6.1;8.6" DS_BUILD_OPS=1 \
|
||||
python setup.py build_ext -j8 bdist_wheel
|
||||
|
||||
This will generate something like ``dist/deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl``, which you can now
install with ``pip install deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl`` locally or on any other machine.
|
||||
|
||||
Again, remember to adjust ``TORCH_CUDA_ARCH_LIST`` to the target architectures.
|
||||
|
||||
You can find the complete list of NVIDIA GPUs and their corresponding **Compute Capabilities** (same as arch in this
|
||||
context) `here <https://developer.nvidia.com/cuda-gpus>`__.
|
||||
|
||||
You can check the archs pytorch was built with using:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
python -c "import torch; print(torch.cuda.get_arch_list())"
|
||||
|
||||
Here is how to find out the arch for one of the installed GPUs. For example, for GPU 0:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
CUDA_VISIBLE_DEVICES=0 python -c "import torch; \
|
||||
print(torch.cuda.get_device_properties(torch.device('cuda')))"
|
||||
|
||||
If the output is:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
_CudaDeviceProperties(name='GeForce RTX 3090', major=8, minor=6, total_memory=24268MB, multi_processor_count=82)
|
||||
|
||||
then you know that this card's arch is ``8.6``.
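
As a rough sketch of how a ``TORCH_CUDA_ARCH_LIST`` value could be assembled from ``major``/``minor`` pairs like the
ones printed above, the helper below formats compute capabilities into the semicolon-separated form used in the build
commands. The names ``arch_list_string`` and ``detect_local_archs`` are hypothetical and not part of PyTorch or
DeepSpeed; the second helper assumes ``torch`` is installed with CUDA support.

```python
def arch_list_string(capabilities):
    """Format (major, minor) compute capabilities as e.g. "6.1;8.6"."""
    unique = sorted(set(capabilities))  # drop duplicates, lowest arch first
    return ";".join(f"{major}.{minor}" for major, minor in unique)


def detect_local_archs():
    """Query the visible GPUs for their compute capabilities (needs torch + CUDA)."""
    import torch  # imported lazily so the formatter above works without torch

    return [torch.cuda.get_device_capability(i) for i in range(torch.cuda.device_count())]


# Two cards of arch 8.6 and one of arch 6.1, written out by hand:
print(arch_list_string([(8, 6), (6, 1), (8, 6)]))  # 6.1;8.6
```

On a machine with GPUs, ``arch_list_string(detect_local_archs())`` would describe only the local cards, which may
differ from the target machines.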
|
||||
|
||||
You can also leave ``TORCH_CUDA_ARCH_LIST`` out completely and then the build program will automatically query the
architecture of the GPUs the build is made on. This may or may not match the GPUs on the target machines, which is why
it's best to specify the desired archs explicitly.
|
||||
|
||||
If after trying everything suggested you still encounter build issues, please file a GitHub issue with
`DeepSpeed <https://github.com/microsoft/DeepSpeed/issues>`__.
|
||||
|
||||
|
||||
|
||||
Deployment with multiple GPUs
|
||||
=======================================================================================================================
|
||||
@@ -498,7 +613,7 @@ Deployment in Notebooks
|
||||
The problem with running notebook cells as a script is that there is no normal ``deepspeed`` launcher to rely on, so
|
||||
under certain setups we have to emulate it.
|
||||
|
||||
Here is how you'd have to adjust your training code in the notebook to use DeepSpeed.
|
||||
If you're using only 1 GPU, here is how you'd have to adjust your training code in the notebook to use DeepSpeed.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
@@ -516,7 +631,11 @@ Here is how you'd have to adjust your training code in the notebook to use DeepS
|
||||
trainer = Trainer(...)
|
||||
trainer.train()
|
||||
|
||||
Note: `...` stands for the normal arguments that you'd pass to the functions.
|
||||
Note: ``...`` stands for the normal arguments that you'd pass to the functions.
|
||||
|
||||
If you want to use more than 1 GPU, you must use a multi-process environment for DeepSpeed to work. That is, you have
|
||||
to use the launcher for that purpose and this cannot be accomplished by emulating the distributed environment presented
|
||||
at the beginning of this section.
|
||||
|
||||
If you want to create the config file on the fly in the notebook in the current directory, you could have a dedicated
|
||||
cell with:
|
||||
@@ -570,22 +689,30 @@ cell with:
|
||||
EOT
|
||||
|
||||
|
||||
That's said if the script is not in the notebook cells, you can launch ``deepspeed`` normally via shell from a cell
|
||||
with:
|
||||
If the training script is in a normal file and not in the notebook cells, you can launch ``deepspeed`` normally via
|
||||
shell from a cell. For example, to use ``run_translation.py`` you would launch it with:
|
||||
|
||||
.. code-block::
|
||||
|
||||
!deepspeed examples/seq2seq/run_translation.py ...
|
||||
!git clone https://github.com/huggingface/transformers
|
||||
!cd transformers; deepspeed examples/seq2seq/run_translation.py ...
|
||||
|
||||
or with bash magic, where you can write a multi-line code for the shell to run:
|
||||
or with ``%%bash`` magic, where you can write multi-line code for the shell program to run:
|
||||
|
||||
.. code-block::
|
||||
|
||||
%%bash
|
||||
|
||||
cd /somewhere
|
||||
git clone https://github.com/huggingface/transformers
|
||||
cd transformers
|
||||
deepspeed examples/seq2seq/run_translation.py ...
|
||||
|
||||
In such a case you don't need any of the code presented at the beginning of this section.
|
||||
|
||||
Note: ``%%bash`` magic is neat, but currently it buffers the output so you won't see the logs until the process
|
||||
completes.
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
@@ -717,26 +844,45 @@ Of course, you will need to adjust the values in this example to your situation.
|
||||
ZeRO
|
||||
=======================================================================================================================
|
||||
|
||||
`Zero Redundancy Optimizer (ZeRO) <https://www.deepspeed.ai/tutorials/zero/>`__ is the workhorse of DeepSpeed. It
supports 3 different levels (stages) of optimization. The first one is not very interesting for scalability purposes,
therefore this document focuses on stages 2 and 3. You will find more in-depth information in the DeepSpeed
documentation.
|
||||
|
||||
The ``zero_optimization`` section of the configuration file is the most important part (`docs
|
||||
<https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training>`__), since that is where you define
|
||||
which ZeRO stages you want to enable and how to configure them.
|
||||
which ZeRO stages you want to enable and how to configure them. You will find the explanation for each parameter in the
|
||||
DeepSpeed docs.
|
||||
|
||||
This section has to be configured exclusively via DeepSpeed configuration - the :class:`~transformers.Trainer` provides
|
||||
no equivalent command line arguments.
|
||||
|
||||
Note: currently DeepSpeed doesn't validate parameter names, so if you misspell any, it'll use the default setting for
|
||||
the parameter that got misspelled. You can watch the DeepSpeed engine start up log messages to see what values it is
|
||||
going to use.
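
Since misspelled names are silently ignored, a small pre-flight check can help. The following is only a sketch: the
function name ``find_suspect_keys`` is hypothetical, and the set of known keys is an assumption that lists just the
``zero_optimization`` parameters appearing in this document's examples, not DeepSpeed's full list.

```python
import difflib

# Assumption: only the keys used in this document's examples are listed here;
# this is not DeepSpeed's complete set of supported parameters.
KNOWN_ZERO_KEYS = {
    "stage", "allgather_partitions", "allgather_bucket_size", "overlap_comm",
    "reduce_scatter", "reduce_bucket_size", "contiguous_gradients", "cpu_offload",
    "cpu_offload_params", "cpu_offload_use_pin_memory", "sub_group_size",
    "stage3_prefetch_bucket_size", "stage3_param_persistence_threshold",
    "stage3_max_live_parameters", "stage3_max_reuse_distance",
    "stage3_gather_fp16_weights_on_model_save",
}


def find_suspect_keys(ds_config):
    """Map each unrecognized zero_optimization key to its closest known key (or None)."""
    section = ds_config.get("zero_optimization", {})
    suspects = {}
    for key in section:
        if key not in KNOWN_ZERO_KEYS:
            matches = difflib.get_close_matches(key, sorted(KNOWN_ZERO_KEYS), n=1)
            suspects[key] = matches[0] if matches else None
    return suspects


config = {"zero_optimization": {"stage": 2, "overlap_com": True}}
print(find_suspect_keys(config))  # {'overlap_com': 'overlap_comm'}
```

A key flagged here is not necessarily wrong, since the list above is incomplete; the engine start-up log remains the
authoritative place to check which values are in effect.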
|
||||
|
||||
|
||||
ZeRO-2 Config
|
||||
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
|
||||
|
||||
The following is an example configuration for ZeRO stage 2:
|
||||
|
||||
.. code-block:: json
|
||||
|
||||
{
|
||||
"zero_optimization": {
|
||||
"stage": 2,
|
||||
"allgather_partitions": true,
|
||||
"allgather_bucket_size": 5e8,
|
||||
"overlap_comm": true,
|
||||
"reduce_scatter": true,
|
||||
"reduce_bucket_size": 5e8,
|
||||
"contiguous_gradients": true,
|
||||
"cpu_offload": true
|
||||
}
|
||||
"zero_optimization": {
|
||||
"stage": 2,
|
||||
"allgather_partitions": true,
|
||||
"allgather_bucket_size": 5e8,
|
||||
"overlap_comm": true,
|
||||
"reduce_scatter": true,
|
||||
"reduce_bucket_size": 5e8,
|
||||
"contiguous_gradients": true,
|
||||
"cpu_offload": true
|
||||
}
|
||||
}
|
||||
|
||||
Notes:
|
||||
**Performance tuning:**
|
||||
|
||||
- enabling ``cpu_offload`` should reduce GPU RAM usage (it requires ``"stage": 2``)
|
||||
- ``"overlap_comm": true`` trades off increased GPU RAM usage to lower all-reduce latency. ``overlap_comm`` uses 4.5x
|
||||
@@ -748,9 +894,217 @@ Notes:
|
||||
the slower the communication, and the more GPU RAM will be available to other tasks. So if a bigger batch size is
|
||||
important, getting a slightly slower training time could be a good trade.
|
||||
|
||||
This section has to be configured exclusively via DeepSpeed configuration - the :class:`~transformers.Trainer` provides
|
||||
no equivalent command line arguments.
|
||||
|
||||
ZeRO-3 Config
|
||||
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
|
||||
|
||||
The following is an example configuration for ZeRO stage 3:
|
||||
|
||||
|
||||
.. code-block:: json

    {
        "zero_optimization": {
            "stage": 3,
            "cpu_offload": true,
            "cpu_offload_params": true,
            "cpu_offload_use_pin_memory": true,
            "overlap_comm": true,
            "contiguous_gradients": true,
            "sub_group_size": 1e14,
            "reduce_bucket_size": 1e6,
            "stage3_prefetch_bucket_size": 0.94e6,
            "stage3_param_persistence_threshold": 1e4,
            "stage3_max_live_parameters": 1e9,
            "stage3_max_reuse_distance": 1e9,
            "stage3_gather_fp16_weights_on_model_save": true
        }
    }
|
||||
|
||||
Note: if you're migrating from a ZeRO-2 configuration, the ``allgather_partitions``, ``allgather_bucket_size`` and
``reduce_scatter`` configuration parameters are not used in ZeRO-3. If you keep these they will just be ignored.
|
||||
|
||||
**Performance tuning:**
|
||||
|
||||
- ``sub_group_size``: ``1e14``
|
||||
- ``reduce_bucket_size``: ``hidden_size*hidden_size``
|
||||
- ``stage3_prefetch_bucket_size``: ``0.9 * hidden_size * hidden_size``
|
||||
- ``stage3_param_persistence_threshold``: ``10 * hidden_size``
|
||||
- ``stage3_max_live_parameters``: ``1e9``
|
||||
- ``stage3_max_reuse_distance``: ``1e9``
|
||||
|
||||
If hitting OOM, reduce ``stage3_max_live_parameters`` and ``stage3_max_reuse_distance``. They should have minimal
impact on performance unless you are doing activation checkpointing. ``1e9`` would consume ~2GB. The memory is shared
by ``stage3_max_live_parameters`` and ``stage3_max_reuse_distance``, so it's not additive, it's just 2GB total.
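
The ~2GB figure can be reproduced with simple arithmetic, assuming each live parameter is held in fp16 at 2 bytes.
That per-parameter size is an assumption chosen because it matches the number quoted above; the helper name
``live_params_memory_gb`` is hypothetical.

```python
BYTES_PER_FP16_PARAM = 2  # assumption: live parameters are kept as 16-bit floats


def live_params_memory_gb(max_live_parameters):
    """Approximate GPU memory in GB (1e9 bytes) needed for that many fp16 parameters."""
    return max_live_parameters * BYTES_PER_FP16_PARAM / 1e9


print(live_params_memory_gb(1e9))  # 2.0
print(live_params_memory_gb(5e8))  # 1.0
```

Under this assumption, halving the value roughly halves the memory reserved for live parameters.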
|
||||
|
||||
``stage3_max_live_parameters`` is the upper limit on how many full parameters you want to keep on the GPU at any given
time. "reuse distance" is a metric we are using to figure out when a parameter will be used again in the future, and we
use the ``stage3_max_reuse_distance`` to decide whether to throw away the parameter or to keep it. If a parameter is
going to be used again in the near future (less than ``stage3_max_reuse_distance``) then we keep it to reduce
communication overhead. This is super helpful when you have activation checkpointing enabled, where we do a forward
recompute and backward passes at a single layer granularity and want to keep the parameter in the forward recompute
till the backward pass.
|
||||
|
||||
If you set ``reduce_bucket_size``, ``stage3_prefetch_bucket_size`` and ``stage3_param_persistence_threshold`` as
|
||||
recommended above, they will already be fairly small so you won't have to tune those much.
|
||||
|
||||
Since ``hidden_size`` varies from model to model, the ``Trainer`` will automatically set the needed value for the 3
|
||||
config parameters that contain that variable (using ``model.config.hidden_size``). Just set these values to ``0`` as
|
||||
shown below and the right configuration will be passed to DeepSpeed:
|
||||
|
||||
.. code-block:: json

    {
        "zero_optimization": {
            "stage": 3,
            "cpu_offload": true,
            "cpu_offload_params": true,
            "cpu_offload_use_pin_memory": true,
            "overlap_comm": true,
            "contiguous_gradients": true,
            "sub_group_size": 1e14,
            "reduce_bucket_size": 0,
            "stage3_prefetch_bucket_size": 0,
            "stage3_param_persistence_threshold": 0,
            "stage3_max_live_parameters": 1e9,
            "stage3_max_reuse_distance": 1e9,
            "stage3_gather_fp16_weights_on_model_save": true
        }
    }
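
To illustrate what replacing the ``0`` placeholders amounts to, here is a minimal sketch that applies the formulas from
the performance tuning list to a configuration dictionary. It is not the ``Trainer``'s actual implementation: the
function name ``fill_hidden_size_values`` and the rounding of the prefetch size with ``int()`` are assumptions made for
this example.

```python
import copy


def fill_hidden_size_values(ds_config, hidden_size):
    """Return a copy of ds_config with the 0-valued placeholders replaced."""
    recommended = {
        "reduce_bucket_size": hidden_size * hidden_size,
        "stage3_prefetch_bucket_size": int(0.9 * hidden_size * hidden_size),
        "stage3_param_persistence_threshold": 10 * hidden_size,
    }
    config = copy.deepcopy(ds_config)  # leave the caller's dictionary untouched
    zero = config.get("zero_optimization", {})
    for key, value in recommended.items():
        if zero.get(key) == 0:  # only touch values explicitly set to 0
            zero[key] = value
    return config


placeholders = {
    "zero_optimization": {
        "stage": 3,
        "reduce_bucket_size": 0,
        "stage3_prefetch_bucket_size": 0,
        "stage3_param_persistence_threshold": 0,
    }
}
filled = fill_hidden_size_values(placeholders, hidden_size=1024)
print(filled["zero_optimization"]["reduce_bucket_size"])  # 1048576
```

Values that are already non-zero are left alone, so an explicit setting in the configuration file would still win in
this sketch.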
|
||||
|
||||
``stage3_gather_fp16_weights_on_model_save`` enables consolidation of the model's fp16 weights when the model is saved.
With large models and multiple GPUs this is an expensive operation both in terms of memory and speed. It's currently
required if you plan to resume the training. Watch out for future updates that will remove this limitation and make
things more flexible.
|
||||
|
||||
|
||||
ZeRO-2 vs ZeRO-3 Performance
|
||||
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
|
||||
|
||||
ZeRO-3 is likely to be slower than ZeRO-2 if everything else is configured the same because the former has to gather
|
||||
model weights in addition to what ZeRO-2 does. If ZeRO-2 meets your needs and you don't need to scale beyond a few GPUs
|
||||
then you may choose to stick to it. It's important to understand that ZeRO-3 enables a much higher scalability capacity
|
||||
at a cost of speed.
|
||||
|
||||
It's possible to adjust ZeRO-3 configuration to make it perform closer to ZeRO-2:
|
||||
|
||||
- set ``stage3_param_persistence_threshold`` to a very large number - larger than the largest parameter, e.g., ``6 *
|
||||
hidden_size * hidden_size``. This will keep the parameters on the GPUs.
|
||||
- turn off ``cpu_offload_params`` since ZeRO-2 doesn't have that option.
|
||||
|
||||
The performance will likely improve significantly with just ``cpu_offload_params`` turned off, even if you don't change
|
||||
``stage3_param_persistence_threshold``. Of course, these changes will impact the size of the model you can train. So
|
||||
these help you to trade scalability for speed depending on your needs.
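
The two adjustments listed above could be applied to an existing ZeRO-3 configuration dictionary along these lines.
This is a sketch only: ``tune_zero3_for_speed`` is a hypothetical helper, and ``6 * hidden_size * hidden_size`` is
simply the example threshold mentioned in the list, not a value verified for any particular model.

```python
import copy


def tune_zero3_for_speed(ds_config, hidden_size):
    """Return a copy of a ZeRO-3 config adjusted to behave closer to ZeRO-2."""
    config = copy.deepcopy(ds_config)
    zero = config["zero_optimization"]
    # Larger than the largest parameter, so parameters stay on the GPUs.
    zero["stage3_param_persistence_threshold"] = 6 * hidden_size * hidden_size
    # ZeRO-2 has no parameter offload, so turn it off here as well.
    zero["cpu_offload_params"] = False
    return config


zero3 = {"zero_optimization": {"stage": 3, "cpu_offload": True, "cpu_offload_params": True}}
tuned = tune_zero3_for_speed(zero3, hidden_size=512)
print(tuned["zero_optimization"]["stage3_param_persistence_threshold"])  # 1572864
```

Both changes raise GPU memory usage, so the tuned configuration fits smaller models than the original one.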
|
||||
|
||||
|
||||
|
||||
ZeRO-2 Example
|
||||
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
|
||||
|
||||
Here is a full ZeRO-2 all-enabled configuration file ``ds_config_zero2.json``:
|
||||
|
||||
.. code-block:: json
|
||||
|
||||
{
|
||||
"fp16": {
|
||||
"enabled": true,
|
||||
"loss_scale": 0,
|
||||
"loss_scale_window": 1000,
|
||||
"initial_scale_power": 16,
|
||||
"hysteresis": 2,
|
||||
"min_loss_scale": 1
|
||||
},
|
||||
|
||||
"zero_optimization": {
|
||||
"stage": 2,
|
||||
"allgather_partitions": true,
|
||||
"allgather_bucket_size": 2e8,
|
||||
"overlap_comm": true,
|
||||
"reduce_scatter": true,
|
||||
"reduce_bucket_size": 2e8,
|
||||
"contiguous_gradients": true,
|
||||
"cpu_offload": true
|
||||
},
|
||||
|
||||
"optimizer": {
|
||||
"type": "AdamW",
|
||||
"params": {
|
||||
"lr": 3e-5,
|
||||
"betas": [0.8, 0.999],
|
||||
"eps": 1e-8,
|
||||
"weight_decay": 3e-7
|
||||
}
|
||||
},
|
||||
|
||||
"scheduler": {
|
||||
"type": "WarmupLR",
|
||||
"params": {
|
||||
"warmup_min_lr": 0,
|
||||
"warmup_max_lr": 3e-5,
|
||||
"warmup_num_steps": 500
|
||||
}
|
||||
},
|
||||
|
||||
"steps_per_print": 2000,
|
||||
"wall_clock_breakdown": false
|
||||
}
|
||||
|
||||
|
||||
|
||||
ZeRO-3 Example
|
||||
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
|
||||
|
||||
Here is a full ZeRO-3 all-enabled configuration file ``ds_config_zero3.json``:
|
||||
|
||||
.. code-block:: json
|
||||
|
||||
{
|
||||
"fp16": {
|
||||
"enabled": true,
|
||||
"loss_scale": 0,
|
||||
"loss_scale_window": 1000,
|
||||
"initial_scale_power": 16,
|
||||
"hysteresis": 2,
|
||||
"min_loss_scale": 1
|
||||
},
|
||||
|
||||
"zero_optimization": {
|
||||
"stage": 3,
|
||||
"cpu_offload": true,
|
||||
"cpu_offload_params": true,
|
||||
"cpu_offload_use_pin_memory" : true,
|
||||
"overlap_comm": true,
|
||||
"contiguous_gradients": true,
|
||||
"sub_group_size": 1e14,
|
||||
"reduce_bucket_size": 1e6,
|
||||
"stage3_prefetch_bucket_size": 0.94e6,
|
||||
"stage3_param_persistence_threshold": 1e4,
|
||||
"stage3_max_live_parameters": 1e9,
|
||||
"stage3_max_reuse_distance": 1e9,
|
||||
"stage3_gather_fp16_weights_on_model_save": true
|
||||
},
|
||||
|
||||
"optimizer": {
|
||||
"type": "AdamW",
|
||||
"params": {
|
||||
"lr": 3e-5,
|
||||
"betas": [0.8, 0.999],
|
||||
"eps": 1e-8,
|
||||
"weight_decay": 3e-7
|
||||
}
|
||||
},
|
||||
|
||||
"scheduler": {
|
||||
"type": "WarmupLR",
|
||||
"params": {
|
||||
"warmup_min_lr": 0,
|
||||
"warmup_max_lr": 3e-5,
|
||||
"warmup_num_steps": 500
|
||||
}
|
||||
},
|
||||
|
||||
"steps_per_print": 2000,
|
||||
"wall_clock_breakdown": false
|
||||
}
|
||||
|
||||
|
||||
Optimizer and Scheduler
|
||||
@@ -772,7 +1126,7 @@ If ``cpu_offload`` is enabled you must use both DeepSpeed scheduler and DeepSpee
|
||||
|
||||
|
||||
Optimizer
|
||||
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
|
||||
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
|
||||
|
||||
|
||||
DeepSpeed's main optimizers are Adam, AdamW, OneBitAdam, and Lamb. These have been thoroughly tested with ZeRO and are
|
||||
@@ -818,7 +1172,7 @@ make sure to adjust the values. e.g. if use Adam you will want ``weight_decay``
|
||||
|
||||
|
||||
Scheduler
|
||||
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
|
||||
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
|
||||
|
||||
DeepSpeed supports LRRangeTest, OneCycle, WarmupLR and WarmupDecayLR LR schedulers. The full documentation is `here
|
||||
<https://www.deepspeed.ai/docs/config-json/#scheduler-parameters>`__.
|
||||
@@ -886,11 +1240,7 @@ and ``warmup_max_lr``, ``warmup_num_steps`` and ``total_num_steps`` will be corr
|
||||
Automatic Mixed Precision
|
||||
=======================================================================================================================
|
||||
|
||||
You can work with FP16 in one of the following ways:
|
||||
|
||||
1. Pytorch native amp, as documented `here <https://www.deepspeed.ai/docs/config-json/#fp16-training-options>`__.
|
||||
2. NVIDIA's apex, as documented `here
|
||||
<https://www.deepspeed.ai/docs/config-json/#automatic-mixed-precision-amp-training-options>`__.
|
||||
You can use automatic mixed precision in either the PyTorch-like AMP way or the apex-like way:
|
||||
|
||||
If you want to use an equivalent of the Pytorch native amp, you can either configure the ``fp16`` entry in the
|
||||
configuration file, or use the following command line arguments: ``--fp16 --fp16_backend amp``.
|
||||
@@ -909,6 +1259,8 @@ Here is an example of the ``fp16`` configuration:
|
||||
},
|
||||
}
|
||||
|
||||
Here is the `documentation <https://www.deepspeed.ai/docs/config-json/#fp16-training-options>`__.
|
||||
|
||||
If you want to use NVIDIA's apex instead, you can either configure the ``amp`` entry in the configuration file, or
use the following command line arguments: ``--fp16 --fp16_backend apex --fp16_opt_level O1``.
|
||||
|
||||
@@ -923,6 +1275,9 @@ Here is an example of the ``amp`` configuration:
|
||||
}
|
||||
}
|
||||
|
||||
Here is the `documentation
|
||||
<https://www.deepspeed.ai/docs/config-json/#automatic-mixed-precision-amp-training-options>`__.
|
||||
|
||||
|
||||
Gradient Accumulation
|
||||
=======================================================================================================================
|
||||
@@ -935,12 +1290,12 @@ While normally DeepSpeed gets gradient accumulation configured with:
|
||||
"gradient_accumulation_steps": 3,
|
||||
}
|
||||
|
||||
in this case, to enable gradient accumulation, pass the command line `--gradient_accumulation_steps` argument as normal
|
||||
and it will get injected into the DeepSpeed configuration.
|
||||
in this case, to enable gradient accumulation, pass the command line ``--gradient_accumulation_steps 3`` argument as
|
||||
normal and it will get injected into the DeepSpeed configuration.
|
||||
|
||||
If you try to add it directly to the configuration file, you will receive an error from the Trainer - this is because
|
||||
this setting is needed by the Trainer too, and so this approach ensures that there is a single way of setting this
|
||||
value and thus avoid potential subtle errors.
|
||||
If you try to add it directly to the configuration file, you will receive an error from the ``Trainer`` - this is
because this setting is needed by the ``Trainer`` too, and so this approach ensures that there is a single way of
setting this value and thus avoids potential subtle errors.
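
The single-source-of-truth behaviour described above can be pictured with the following sketch. It is an illustration
of the idea rather than the ``Trainer``'s real code: ``inject_gradient_accumulation`` and its error message are
hypothetical.

```python
def inject_gradient_accumulation(ds_config, cli_steps):
    """Copy the command line value into the DeepSpeed config, refusing duplicates."""
    if "gradient_accumulation_steps" in ds_config:
        # The value must come from the command line only.
        raise ValueError(
            "gradient_accumulation_steps must be set via --gradient_accumulation_steps, "
            "not in the DeepSpeed configuration file"
        )
    merged = dict(ds_config)  # shallow copy; the caller's dictionary is not modified
    merged["gradient_accumulation_steps"] = cli_steps
    return merged


merged = inject_gradient_accumulation({"fp16": {"enabled": True}}, cli_steps=3)
print(merged["gradient_accumulation_steps"])  # 3
```

Rejecting the duplicate outright, instead of letting one value quietly override the other, is what prevents the two
settings from drifting apart.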
|
||||
|
||||
|
||||
|
||||
@@ -963,6 +1318,175 @@ Here is an example of the ``gradient_clipping`` configuration:
|
||||
|
||||
|
||||
|
||||
Getting the model weights out
|
||||
=======================================================================================================================
|
||||
|
||||
As long as you continue training and resuming using DeepSpeed you don't need to worry about anything. DeepSpeed stores
fp32 master weights in its custom checkpoint optimizer files, which are ``global_step*/*optim_states.pt`` (this is a
glob pattern), and are saved under the normal checkpoint.
|
||||
|
||||
**FP16 Weights:**
|
||||
|
||||
When a model is saved under ZeRO-2, you end up having the normal ``pytorch_model.bin`` file with the model weights, but
|
||||
they are only the fp16 version of the weights.
|
||||
|
||||
Under ZeRO-3, things are much more complicated, since the model weights are partitioned out over multiple GPUs,
therefore ``"stage3_gather_fp16_weights_on_model_save": true`` is required to get the ``Trainer`` to save the fp16
version of the weights. If this setting is ``false``, ``pytorch_model.bin`` won't be created. This is because by
default DeepSpeed's ``state_dict`` contains a placeholder and not the real weights. If we were to save this
``state_dict`` it wouldn't be possible to load it back.
|
||||
|
||||
**FP32 Weights:**

While the fp16 weights are fine for resuming training, if you finished finetuning your model and want to upload it to
the `models hub <https://huggingface.co/models>`__ or pass it to someone else, you will most likely want to get the
fp32 weights. This cannot be done during training since the process requires a lot of memory, and therefore it is
performed offline.

DeepSpeed creates a special conversion script ``zero_to_fp32.py`` which it places in the top level of the checkpoint
folder. Using this script you can extract the weights at any point. The script is standalone and you no longer need to
have the configuration file or a ``Trainer`` to do the extraction.
|
||||
|
||||
Let's say your checkpoint folder looks like this:

.. code-block:: bash

    $ ls -l output_dir/checkpoint-1/
    -rw-rw-r-- 1 stas stas 1.4K Mar 27 20:42 config.json
    drwxrwxr-x 2 stas stas 4.0K Mar 25 19:52 global_step1/
    -rw-rw-r-- 1 stas stas   12 Mar 27 13:16 latest
    -rw-rw-r-- 1 stas stas 827K Mar 27 20:42 optimizer.pt
    -rw-rw-r-- 1 stas stas 231M Mar 27 20:42 pytorch_model.bin
    -rw-rw-r-- 1 stas stas  623 Mar 27 20:42 scheduler.pt
    -rw-rw-r-- 1 stas stas 1.8K Mar 27 20:42 special_tokens_map.json
    -rw-rw-r-- 1 stas stas 774K Mar 27 20:42 spiece.model
    -rw-rw-r-- 1 stas stas 1.9K Mar 27 20:42 tokenizer_config.json
    -rw-rw-r-- 1 stas stas  339 Mar 27 20:42 trainer_state.json
    -rw-rw-r-- 1 stas stas 2.3K Mar 27 20:42 training_args.bin
    -rwxrw-r-- 1 stas stas 5.5K Mar 27 13:16 zero_to_fp32.py*
|
||||
|
||||
In this example there is just one DeepSpeed checkpoint sub-folder ``global_step1``. Therefore, to reconstruct the fp32
weights just run:

.. code-block:: bash

    python zero_to_fp32.py global_step1 pytorch_model.bin

The script will automatically handle either a ZeRO-2 or a ZeRO-3 checkpoint.

``python zero_to_fp32.py -h`` will give you usage details.

If you have multiple DeepSpeed checkpoint sub-folders, pick the one you know to have the desired weights.
|
||||
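When there are several sub-folders, one possible way to pick one is to read the ``latest`` file shown in the listing
above. The sketch below assumes that this file contains the name of the most recently saved sub-folder (for example
``global_step1``). The helper names are hypothetical and not part of DeepSpeed or ``transformers``, and the command is
only built, not executed.

.. code-block:: python

    from pathlib import Path


    def latest_deepspeed_subfolder(checkpoint_dir):
        """Return the sub-folder named in the ``latest`` file, or None if it cannot be found."""
        checkpoint_dir = Path(checkpoint_dir)
        latest_file = checkpoint_dir / "latest"
        if not latest_file.is_file():
            return None
        # Assumption: the file holds just the sub-folder name, e.g. "global_step1".
        tag = latest_file.read_text().strip()
        subfolder = checkpoint_dir / tag
        return subfolder if subfolder.is_dir() else None


    def conversion_command(checkpoint_dir, output_name="pytorch_model.bin"):
        """Build the zero_to_fp32.py command line, to be run from inside checkpoint_dir."""
        subfolder = latest_deepspeed_subfolder(checkpoint_dir)
        if subfolder is None:
            raise FileNotFoundError(f"no DeepSpeed sub-folder found in {checkpoint_dir}")
        return ["python", "zero_to_fp32.py", subfolder.name, output_name]

If the latest sub-folder is not the one holding the weights you want, pass the desired sub-folder name to the script
directly instead.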
|
||||
That's it. ``pytorch_model.bin`` will now contain the full fp32 model weights consolidated from multiple GPUs.

Note: currently the script requires general RAM of 2x the size of the final fp32 model weights.
|
||||
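To get a rough idea of what that RAM requirement means in practice, here is a back-of-the-envelope sketch. It assumes 4
bytes per fp32 parameter and takes the 2x factor from the note above. The function name is illustrative, and the
result is an estimate rather than a measured figure.

.. code-block:: python

    def fp32_consolidation_ram_bytes(num_params, overhead_factor=2):
        """Estimate the general RAM needed to consolidate fp32 weights."""
        bytes_per_fp32_param = 4  # one 32-bit float per parameter
        return num_params * bytes_per_fp32_param * overhead_factor


    # Hypothetical model with 1 billion parameters: 4 GB of fp32 weights, times 2.
    estimate = fp32_consolidation_ram_bytes(1_000_000_000)
    print(f"{estimate / 1e9:.0f} GB")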
|
||||
ZeRO 3 Nuances
=======================================================================================================================

ZeRO-3 is quite different from ZeRO-2 because of its parameter sharding feature.

While every effort was made for things to just work without needing any special changes to your models, in certain
circumstances you may find the following information useful.
|
||||
|
||||
Registering External Parameters
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

If layer A needs to access weights belonging to layer B, currently layer A needs to tell DeepSpeed about it. This is
done with the help of ``deepspeed.zero.register_external_parameter``, which needs to be called in ``A.__init__``, as
can be seen in the following example:

.. code-block:: python

    import deepspeed
    import torch


    class ModuleZ3(torch.nn.Module):
        def __init__(self):
            super().__init__()
            # SomeLayer and OtherLayer stand in for your own layer classes
            self.layer1 = SomeLayer()
            self.layer2 = OtherLayer()
            # tell DeepSpeed that this module accesses layer1's weight directly
            deepspeed.zero.register_external_parameter(self, self.layer1.weight)

        def forward(self, input):
            x = self.layer1(input)
            # self.layer1.weight is needed in ModuleZ3.forward
            y = self.layer2(x, self.layer1.weight)
            return y

In general, ``transformers`` models don't use this style of referring to other layers' weights, so most likely you
won't need to use it.

For full details on this method please refer to `Registering External Parameters
<https://deepspeed.readthedocs.io/en/latest/zero3.html#registering-external-parameters>`__.
|
||||
|
||||
Constructing Massive Models
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

DeepSpeed/ZeRO-3 can handle models with trillions of parameters which may not fit into the available RAM. In such
cases, but also if you want the initialization to happen much faster, initialize the model using the
``deepspeed.zero.Init()`` context manager (which is also a function decorator), like so:

.. code-block:: python

    from transformers import T5ForConditionalGeneration, T5Config
    import deepspeed

    with deepspeed.zero.Init():
        config = T5Config.from_pretrained("t5-small")
        model = T5ForConditionalGeneration(config)

As you can see this gives you a randomly initialized model.
|
||||
|
||||
If you want to use a pretrained model, ``model_class.from_pretrained`` will activate this feature as long as
``is_deepspeed_zero3_enabled()`` returns ``True``, which can be set manually via ``deepspeed_zero3_enable(True)``.
Therefore, to enable this feature, here is the required sequence:

.. code-block:: python

    from transformers import T5ForConditionalGeneration
    from transformers.integrations import deepspeed_zero3_enable

    deepspeed_zero3_enable(True)
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

If you're using ``Trainer`` command line arguments which include ``--deepspeed ds_config.json`` with ZeRO-3 config
enabled, then you can skip ``deepspeed_zero3_enable(True)`` as it will try to discover whether it'll be run under
ZeRO-3 and ``from_pretrained`` will automatically activate this feature.
|
||||
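If you want to check for yourself whether a given configuration file would count as "ZeRO-3 enabled", a rough sketch
follows. It assumes the stage is stored under ``zero_optimization.stage`` in the JSON file. The helper is hypothetical
and is not the discovery logic the library itself uses.

.. code-block:: python

    import json


    def config_enables_zero3(ds_config_path):
        """Return True if the DeepSpeed config file at ds_config_path selects ZeRO stage 3."""
        with open(ds_config_path) as f:
            config = json.load(f)
        # A missing "zero_optimization" section or "stage" key is treated as "not ZeRO-3".
        return config.get("zero_optimization", {}).get("stage") == 3

For example, ``config_enables_zero3("ds_config.json")`` could be used in your own script to decide whether calling
``deepspeed_zero3_enable(True)`` before ``from_pretrained`` is appropriate.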
|
||||
Note: If the fp16 weights of the model can't fit into the memory of a single GPU, this feature must be used.

For full details on this method and other related features please refer to `Constructing Massive Models
<https://deepspeed.readthedocs.io/en/latest/zero3.html#constructing-massive-models>`__.
|
||||
|
||||
Gathering Parameters
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Under ZeRO-3 on multiple GPUs no single GPU has all the parameters unless it's the parameters for the currently
executing layer. So if you need to access all parameters from all layers at once there is a specific method to do it.
Most likely you won't need it, but if you do please refer to `Gathering Parameters
<https://deepspeed.readthedocs.io/en/latest/zero3.html#manual-parameter-coordination>`__.

We do however use it internally in several places. One such example is when loading pretrained model weights in
``from_pretrained``. We load one layer at a time and immediately partition it to all participating GPUs, as for very
large models it won't be possible to load it on one GPU and then spread it out to multiple GPUs, due to memory
limitations.

Also under ZeRO-3, if you write your own code and run into a model parameter weight that looks like:

.. code-block:: python

    tensor([1.], device='cuda:0', dtype=torch.float16, requires_grad=True)

stress on ``tensor([1.])``, or if you get an error where it says the parameter is of size ``1``, instead of some much
larger multi-dimensional shape, this means that the parameter is partitioned and what you see is a ZeRO-3 placeholder.
|
||||
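The following is a minimal sketch of how one might look at the real shape behind such a placeholder. It relies on two
assumptions: that DeepSpeed attaches a ``ds_id`` attribute to every parameter it manages under ZeRO-3, and that
``deepspeed.zero.GatheredParameters`` is used as the gathering context manager described in the DeepSpeed page linked
above. The helper names are hypothetical.

.. code-block:: python

    def is_zero3_param(param):
        # Assumption: ZeRO-3 marks the parameters it manages with a ``ds_id`` attribute.
        return hasattr(param, "ds_id")


    def full_shape(param):
        """Return the full shape of a parameter, gathering it first if ZeRO-3 manages it."""
        if not is_zero3_param(param):
            return tuple(param.shape)
        import deepspeed  # imported lazily, so regular parameters work without DeepSpeed

        # Gathering is a collective operation: every participating rank has to make this call.
        # modifier_rank=None signals that the parameter is only read, not modified.
        with deepspeed.zero.GatheredParameters(param, modifier_rank=None):
            return tuple(param.shape)

Outside of the ``with`` block the parameter is partitioned again, so anything that needs the full weights has to happen
inside it.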
|
||||
Notes
=======================================================================================================================