[Deepspeed] new docs (#12077)

* document sub_group_size
* style
* install + issues reporting
* style
* style
* Update docs/source/main_classes/deepspeed.rst
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* indent 4
* restore
* style
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-06-23 11:07:37 -07:00
parent 3694484d0a
commit 07ae6103c3
1 changed files with 76 additions and 7 deletions
--- a/docs/source/main_classes/deepspeed.rst
+++ b/docs/source/main_classes/deepspeed.rst
@@ -73,8 +73,6 @@ or via ``transformers``' ``extras``:
    pip install transformers[deepspeed]
 (will become available starting from ``transformers==4.6.0``)
 or find more details on `DeepSpeed's GitHub page <https://github.com/microsoft/deepspeed#installation>`__ and
 `advanced install <https://www.deepspeed.ai/tutorials/advanced-install/>`__.
@@ -90,20 +88,31 @@ To make a local build for DeepSpeed:
    git clone https://github.com/microsoft/DeepSpeed/
    cd DeepSpeed
    rm -rf build
-    TORCH_CUDA_ARCH_LIST="6.1;8.6" DS_BUILD_OPS=1 pip install . \
+    TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 pip install . \
    --global-option="build_ext" --global-option="-j8" --no-cache -v \
    --disable-pip-version-check 2>&1 | tee build.log
-Edit ``TORCH_CUDA_ARCH_LIST`` to insert the code for the architectures of the GPU cards you intend to use.
+If you intend to use NVMe offload, you also need to include ``DS_BUILD_AIO=1`` in the command above (and
 install ``libaio-dev`` system-wide).
-Or if you need to use the same setup on multiple machines, make a binary wheel:
+Edit ``TORCH_CUDA_ARCH_LIST`` to list the architecture codes of the GPU cards you intend to use. Assuming all
 your cards are the same, you can get the arch via:
 .. code-block:: bash
    CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.get_device_capability())"
 So if you get ``(8, 6)``, then use ``TORCH_CUDA_ARCH_LIST="8.6"``. If you have multiple different cards, you can list
 all of them like so: ``TORCH_CUDA_ARCH_LIST="6.1;8.6"``.
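 For a machine with several different cards, the value can also be assembled programmatically. The following is only a
 minimal sketch and not part of DeepSpeed: ``build_arch_list`` and ``detect_capabilities`` are hypothetical helper
 names introduced here for illustration. The only PyTorch calls used are ``torch.cuda.device_count()`` and
 ``torch.cuda.get_device_capability()``.

 .. code-block:: python

     def build_arch_list(capabilities):
         """Turn (major, minor) compute capability tuples into a TORCH_CUDA_ARCH_LIST value."""
         # Drop duplicates (identical cards) and sort so the result is deterministic.
         unique = sorted(set(capabilities))
         return ";".join(f"{major}.{minor}" for major, minor in unique)


     def detect_capabilities():
         """Query every visible GPU; requires a CUDA-enabled PyTorch install."""
         # Imported lazily so that build_arch_list() stays usable without torch.
         import torch

         return [torch.cuda.get_device_capability(i) for i in range(torch.cuda.device_count())]


     # Example with hard-coded capabilities (two identical cards and one older card).
     print(build_arch_list([(8, 6), (6, 1), (8, 6)]))  # prints: 6.1;8.6

 On a machine with GPUs, ``build_arch_list(detect_capabilities())`` should give a string that can be passed as
 ``TORCH_CUDA_ARCH_LIST``.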
 If you need to use the same setup on multiple machines, make a binary wheel:
 .. code-block:: bash
    git clone https://github.com/microsoft/DeepSpeed/
    cd DeepSpeed
    rm -rf build
-    TORCH_CUDA_ARCH_LIST="6.1;8.6" DS_BUILD_OPS=1 \
+    TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 \
    python setup.py build_ext -j8 bdist_wheel
 It will generate something like ``dist/deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl`` which you can now install
@@ -692,7 +701,17 @@ be ignored.
 - ``sub_group_size``: ``1e9``
-This one does impact GPU memory usage. But no docs at the moment on Deepspeed side to explain the tuning.
+``sub_group_size`` controls the granularity at which parameters are updated during optimizer steps. Parameters are
 grouped into buckets of ``sub_group_size`` and each bucket is updated one at a time. When used with NVMe offload in
 ZeRO-Infinity, ``sub_group_size`` therefore controls the granularity at which model states are moved in and out of CPU
 memory from NVMe during the optimizer step. This prevents running out of CPU memory for extremely large models.
 You can leave ``sub_group_size`` at its default value of ``1e9`` when not using NVMe offload. You may want to change
 its default value in the following cases:
 1. Running into OOM during the optimizer step: reduce ``sub_group_size`` to reduce the memory utilization of temporary
    buffers.
 2. The optimizer step is taking a long time: increase ``sub_group_size`` to improve bandwidth utilization as a result
    of the increased data buffers.
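 To get a feel for the trade-off, the following sketch estimates how many buckets an optimizer step would walk through
 for a given ``sub_group_size``. It is a deliberately simplified model: it assumes that parameters pack perfectly into
 buckets and ignores how they are partitioned across GPUs, so the real numbers will differ. ``num_sub_groups`` and the
 3-billion-parameter model size are hypothetical and used for illustration only.

 .. code-block:: python

     import math


     def num_sub_groups(num_params, sub_group_size):
         """Rough estimate of how many buckets are updated one at a time (assumes perfect packing)."""
         return math.ceil(num_params / sub_group_size)


     num_params = 3_000_000_000  # hypothetical model size

     # A smaller sub_group_size means more, smaller buckets: less temporary memory
     # per bucket, but more separate transfers during the optimizer step.
     for sub_group_size in (1e9, 1e8):
         print(f"sub_group_size={sub_group_size:.0e}: ~{num_sub_groups(num_params, sub_group_size)} buckets")

     # The value goes under the ``zero_optimization`` section of the DeepSpeed config.
     ds_config_fragment = {"zero_optimization": {"stage": 3, "sub_group_size": 1e8}}

 Under these assumptions, lowering the value from ``1e9`` to ``1e8`` turns 3 large buckets into 30 smaller ones.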
 .. _deepspeed-nvme:
@@ -1555,6 +1574,56 @@ stress on ``tensor([1.])``, or if you get an error where it says the parameter i
 larger multi-dimensional shape, this means that the parameter is partitioned and what you see is a ZeRO-3 placeholder.
 Filing Issues
 =======================================================================================================================
 Here is how to file an issue so that we can quickly get to the bottom of it and help you unblock your work.
 In your report please always include:
 1. the full DeepSpeed config file
 2. either the command line arguments, if you were using the :class:`~transformers.Trainer`, or the
    :class:`~transformers.TrainingArguments` arguments, if you were scripting the Trainer setup yourself. Please do
    not dump the :class:`~transformers.TrainingArguments` as it has dozens of entries that are irrelevant.
 3. Output of:
 .. code-block:: bash
    python -c 'import torch; print(f"torch: {torch.__version__}")'
    python -c 'import transformers; print(f"transformers: {transformers.__version__}")'
    python -c 'import deepspeed; print(f"deepspeed: {deepspeed.__version__}")'
 4. If possible, include a link to a Google Colab notebook that we can reproduce the problem with. You can use this
    `notebook <https://github.com/stas00/porting/blob/master/transformers/deepspeed/DeepSpeed_on_colab_CLI.ipynb>`__ as
    a starting point.
 5. Unless it's impossible, please always use a standard dataset that we can use and not something custom.
 6. If possible, try to use one of the existing `examples
    <https://github.com/huggingface/transformers/tree/master/examples/pytorch>`__ to reproduce the problem with.
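 If you prefer a single script for the version information in item 3, here is one possible sketch. ``collect_versions``
 is a hypothetical helper name. It reads the installed package metadata through the standard library's
 ``importlib.metadata`` (Python 3.8+) instead of importing each package, so it also reports a package that is missing
 rather than failing. Note that the metadata version may differ from ``__version__`` for unusual local builds.

 .. code-block:: python

     from importlib import metadata


     def collect_versions(packages=("torch", "transformers", "deepspeed")):
         """Return a mapping of package name to installed version, or 'not installed'."""
         versions = {}
         for name in packages:
             try:
                 versions[name] = metadata.version(name)
             except metadata.PackageNotFoundError:
                 versions[name] = "not installed"
         return versions


     # Print one "name: version" line per package, ready to paste into the report.
     for name, version in collect_versions().items():
         print(f"{name}: {version}")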
 Things to consider:
 * DeepSpeed is often not the cause of the problem.
    Some of the filed issues proved to be DeepSpeed-unrelated. That is, once DeepSpeed was removed from the setup, the
    problem was still there.
    Therefore, if it's not absolutely obvious it's a DeepSpeed-related problem, as in you can see that there is an
    exception and you can see that DeepSpeed modules are involved, first re-test your setup without DeepSpeed in it.
    Only if the problem goes away without DeepSpeed should you mention DeepSpeed and supply all the required details.
 * If it's clear to you that the issue is in the DeepSpeed core and not the integration part, please file the Issue
   directly with `DeepSpeed <https://github.com/microsoft/DeepSpeed/>`__. If you aren't sure, please do not worry:
   either Issue tracker will do. We will figure it out once you have posted it and redirect you to another Issue
   tracker if need be.
 Troubleshooting
 =======================================================================================================================