[DeepSpeed] fp32 support (#11499)

* prep for deepspeed==0.3.16 * new version * too soon * support and test fp32 mode * troubleshooting doc start * workaround no longer needed * add fp32 doc * style * cleanup, add tf32 note * clarify * release was made
2021-04-30 12:51:48 -07:00
parent 282f3ac3ef
commit 4e7bf94e72
6 changed files with 139 additions and 70 deletions
--- a/docs/source/main_classes/trainer.rst
+++ b/docs/source/main_classes/trainer.rst
@@ -1507,6 +1507,35 @@ and ``total_num_steps`, ``warmup_max_lr``, ``warmup_num_steps`` and ``total_num_



+
+fp32 Precision
+=======================================================================================================================
+
+Deepspeed supports the full fp32 and the fp16 mixed precision.
+
+Because of the much reduced memory needs and faster speed one gets with the fp16 mixed precision, the only time you
+will want to not use it is when the model you're using doesn't behave well under this training mode. Typically this
+happens when the model wasn't pretrained in the fp16 mixed precision (e.g. often this happens with bf16-pretrained
+models). Such models may overflow or underflow leading to ``NaN`` loss. If this is your case then you will want to use
+the full fp32 mode, by explicitly disabling the otherwise default fp16 mixed precision mode with:
+
+.. code-block:: json
+
+    {
+        "fp16": {
+            "enabled": "false",
+        }
+    }
+
+If you're using the Ampere-architecture based GPU, pytorch version 1.7 and higher will automatically switch to using
+the much more efficient tf32 format for some operations, but the results will still be in fp32. For details and
+benchmarks, please, see `TensorFloat-32(TF32) on Ampere devices
+<https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices>`__. The document includes
+instructions on how to disable this automatic conversion if for some reason you prefer not to use it.
+
+
+
+
 Automatic Mixed Precision
 =======================================================================================================================

@@ -1532,11 +1561,6 @@ and the :class:`~transformers.Trainer` will automatically enable or disable it b

 This mode gets enabled when ``--fp16 --fp16_backend amp`` command line args are passed.

-.. note::
-
-   At the moment DeepSpeed doesn't supported fp32 mode, though it will become available soon. Until then it will be
-   always set to ``true``.
-
 You can also enable/disable this mode explicitly:

 .. code-block:: json
@@ -1790,6 +1814,24 @@ stress on ``tensor([1.])``, or if you get an error where it says the parameter i
 larger multi-dimensional shape, this means that the parameter is partitioned and what you see is a ZeRO-3 placeholder.


+Troubleshooting
+=======================================================================================================================
+
+* ``deepspeed`` process gets killed at startup without a traceback
+
+If the ``deepspeed`` process gets killed at launch time without a traceback, that usually means that the program tried
+to allocate more CPU memory than your system has or your process is allowed to allocate and the OS kernel killed that
+process. This is because your configuration file most likely has either ``offload_optimizer`` or ``offload_param`` or
+both configured to offload to ``cpu`` (or under ZeRO-2 ``cpu_offload`` is enabled). If you have NVMe, experiment with
+offloading to NVMe if you're running under ZeRO-3.
+
+Work is being done to enable estimating how much memory is needed for a specific model: `PR
+<https://github.com/microsoft/DeepSpeed/pull/965>`__.
+
+
+
+
+

 Notes
 =======================================================================================================================