From 185876392c0dcd4c4bb02f2750822144a3bee545 Mon Sep 17 00:00:00 2001 From: Stas Bekman Date: Tue, 21 Dec 2021 09:55:25 -0800 Subject: [PATCH] [doc porting] several docs (#14858) * [doc porting] 2 docs * [doc porting] 2 docs * Apply suggestions from code review Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update docs/source/main_classes/deepspeed.mdx * cleanup Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> --- docs/source/debugging.mdx | 299 ++++ docs/source/debugging.rst | 299 ---- docs/source/main_classes/deepspeed.mdx | 1758 +++++++++++++++++++++++ docs/source/main_classes/deepspeed.rst | 1833 ------------------------ docs/source/testing.mdx | 1189 +++++++++++++++ docs/source/testing.rst | 1252 ---------------- 6 files changed, 3246 insertions(+), 3384 deletions(-) create mode 100644 docs/source/debugging.mdx delete mode 100644 docs/source/debugging.rst create mode 100644 docs/source/main_classes/deepspeed.mdx delete mode 100644 docs/source/main_classes/deepspeed.rst create mode 100644 docs/source/testing.mdx delete mode 100644 docs/source/testing.rst diff --git a/docs/source/debugging.mdx b/docs/source/debugging.mdx new file mode 100644 index 0000000000..a3f05df48e --- /dev/null +++ b/docs/source/debugging.mdx @@ -0,0 +1,299 @@ + + +# Debugging + +## Underflow and Overflow Detection + + + +This feature is currently available for PyTorch-only. + + + + + +For multi-GPU training it requires DDP (`torch.distributed.launch`). + + + + + +This feature can be used with any `nn.Module`-based model. + + + +If you start getting `loss=NaN` or the model exhibits some other abnormal behavior due to `inf` or `nan` in +activations or weights, you need to discover where the first underflow or overflow happens and what led to it. Luckily +you can accomplish that easily by activating a special module that will do the detection automatically. 
+ +If you're using [`Trainer`], you just need to add: + +```bash +--debug underflow_overflow +``` + +to the normal command line arguments, or pass `debug="underflow_overflow"` when creating the +[`TrainingArguments`] object. + +If you're using your own training loop or another Trainer you can accomplish the same with: + +```python +from transformers.debug_utils import DebugUnderflowOverflow +debug_overflow = DebugUnderflowOverflow(model) +``` + +[`~debug_utils.DebugUnderflowOverflow`] inserts hooks into the model that immediately after each +forward call will test input and output variables and also the corresponding module's weights. As soon as `inf` or +`nan` is detected in at least one element of the activations or weights, the program will assert and print a report +like this (this was caught with `google/mt5-small` under fp16 mixed precision): + +``` +Detected inf/nan during batch_number=0 +Last 21 forward frames: +abs min abs max metadata + encoder.block.1.layer.1.DenseReluDense.dropout Dropout +0.00e+00 2.57e+02 input[0] +0.00e+00 2.85e+02 output +[...] 
+ encoder.block.2.layer.0 T5LayerSelfAttention +6.78e-04 3.15e+03 input[0] +2.65e-04 3.42e+03 output[0] + None output[1] +2.25e-01 1.00e+04 output[2] + encoder.block.2.layer.1.layer_norm T5LayerNorm +8.69e-02 4.18e-01 weight +2.65e-04 3.42e+03 input[0] +1.79e-06 4.65e+00 output + encoder.block.2.layer.1.DenseReluDense.wi_0 Linear +2.17e-07 4.50e+00 weight +1.79e-06 4.65e+00 input[0] +2.68e-06 3.70e+01 output + encoder.block.2.layer.1.DenseReluDense.wi_1 Linear +8.08e-07 2.66e+01 weight +1.79e-06 4.65e+00 input[0] +1.27e-04 2.37e+02 output + encoder.block.2.layer.1.DenseReluDense.dropout Dropout +0.00e+00 8.76e+03 input[0] +0.00e+00 9.74e+03 output + encoder.block.2.layer.1.DenseReluDense.wo Linear +1.01e-06 6.44e+00 weight +0.00e+00 9.74e+03 input[0] +3.18e-04 6.27e+04 output + encoder.block.2.layer.1.DenseReluDense T5DenseGatedGeluDense +1.79e-06 4.65e+00 input[0] +3.18e-04 6.27e+04 output + encoder.block.2.layer.1.dropout Dropout +3.18e-04 6.27e+04 input[0] +0.00e+00 inf output +``` + +The example output has been trimmed in the middle for brevity. + +The second column shows the value of the absolute largest element, so if you have a closer look at the last few frames, +the inputs and outputs were in the range of `1e4`. So when this training was done under fp16 mixed precision the very +last step overflowed (since under `fp16` the largest number before `inf` is `64e3`). To avoid overflows under +`fp16` the activations must remain way below `1e4`, because `1e4 * 1e4 = 1e8` so any matrix multiplication with +large activations is going to lead to a numerical overflow condition. + +At the very start of the trace you can discover at which batch number the problem occurred (here `Detected inf/nan during batch_number=0` means the problem occurred on the first batch). + +Each reported frame starts by declaring the fully qualified entry for the corresponding module this frame is reporting +for. 
If we look just at this frame: + +``` + encoder.block.2.layer.1.layer_norm T5LayerNorm +8.69e-02 4.18e-01 weight +2.65e-04 3.42e+03 input[0] +1.79e-06 4.65e+00 output +``` + +Here, `encoder.block.2.layer.1.layer_norm` indicates that it was a layer norm for the first layer of the second +block of the encoder. The module whose `forward` was called is `T5LayerNorm`. + +Let's look at the last few frames of that report: + +``` +Detected inf/nan during batch_number=0 +Last 21 forward frames: +abs min abs max metadata +[...] + encoder.block.2.layer.1.DenseReluDense.wi_0 Linear +2.17e-07 4.50e+00 weight +1.79e-06 4.65e+00 input[0] +2.68e-06 3.70e+01 output + encoder.block.2.layer.1.DenseReluDense.wi_1 Linear +8.08e-07 2.66e+01 weight +1.79e-06 4.65e+00 input[0] +1.27e-04 2.37e+02 output + encoder.block.2.layer.1.DenseReluDense.wo Linear +1.01e-06 6.44e+00 weight +0.00e+00 9.74e+03 input[0] +3.18e-04 6.27e+04 output + encoder.block.2.layer.1.DenseReluDense T5DenseGatedGeluDense +1.79e-06 4.65e+00 input[0] +3.18e-04 6.27e+04 output + encoder.block.2.layer.1.dropout Dropout +3.18e-04 6.27e+04 input[0] +0.00e+00 inf output +``` + +The last frame reports on the `Dropout.forward` function, with the first entry for its only input and the second for +its only output. You can see that it was called from the `dropout` attribute of the `encoder.block.2.layer.1` module, +which sits next to `DenseReluDense` rather than inside it. We can see that it happened during the first layer of the +2nd block, during the very first batch. Finally, the absolute largest input element was `6.27e+04` and the absolute +largest output element was `inf`. + +You can see here that `T5DenseGatedGeluDense.forward` resulted in output activations whose absolute max value was +around 62.7K, which is very close to fp16's top limit of 64K. In the next frame we have `Dropout`, which rescales +the remaining activations after zeroing some of the elements, which pushes the absolute max value to more than 64K, +and we get an overflow (`inf`). 
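
The arithmetic behind that last frame can be checked without a GPU. The following is a minimal sketch, not part of the library: it assumes a dropout probability of `0.1` (the report does not state it, but the jump from `8.76e+03` to `9.74e+03` in the `DenseReluDense.dropout` frame is consistent with a `1 / 0.9` rescale), it assumes the largest element is among those that dropout keeps, and it uses Python's `struct` half-precision format as a stand-in for fp16 storage. `fits_in_fp16` is a hypothetical helper name.

```python
import struct

FP16_MAX = 65504.0  # largest finite value in IEEE 754 half precision


def fits_in_fp16(value: float) -> bool:
    """Return True if `value` can be stored as a finite half-precision float."""
    try:
        struct.pack("<e", value)  # "e" is the IEEE 754 binary16 format
        return True
    except OverflowError:
        return False


# Assumption: dropout probability of 0.1, so kept elements are scaled by 1 / (1 - 0.1).
dropout_p = 0.1
abs_max_before = 6.27e4  # abs max of the input to the last Dropout frame in the report
abs_max_after = abs_max_before / (1 - dropout_p)

print(fits_in_fp16(abs_max_before))  # the input is still representable in fp16
print(fits_in_fp16(abs_max_after))   # the rescaled value is not
print(abs_max_after > FP16_MAX)
```

Under these assumptions the rescaled value lands near `6.97e+04`, past the fp16 limit, which matches the `inf` in the output column.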
+ +As you can see, it's the previous frames that we need to look into, where the numbers start to become very large for +fp16. + +Let's match the report to the code from `models/t5/modeling_t5.py`: + +```python +class T5DenseGatedGeluDense(nn.Module): + def __init__(self, config): + super().__init__() + self.wi_0 = nn.Linear(config.d_model, config.d_ff, bias=False) + self.wi_1 = nn.Linear(config.d_model, config.d_ff, bias=False) + self.wo = nn.Linear(config.d_ff, config.d_model, bias=False) + self.dropout = nn.Dropout(config.dropout_rate) + self.gelu_act = ACT2FN["gelu_new"] + + def forward(self, hidden_states): + hidden_gelu = self.gelu_act(self.wi_0(hidden_states)) + hidden_linear = self.wi_1(hidden_states) + hidden_states = hidden_gelu * hidden_linear + hidden_states = self.dropout(hidden_states) + hidden_states = self.wo(hidden_states) + return hidden_states +``` + +Now it's easy to see the `dropout` call, and all the previous calls as well. + +Since the detection is happening in a forward hook, these reports are printed immediately after each `forward` +returns. + +Going back to the full report, to act on it and to fix the problem, we need to go a few frames up where the numbers +started to go up and most likely switch to the `fp32` mode here, so that the numbers don't overflow when multiplied +or summed up. Of course, there might be other solutions. 
For example, we could turn off `amp` temporarily if it's +enabled, after moving the original `forward` into a helper wrapper, like so: + +```python +def _forward(self, hidden_states): + hidden_gelu = self.gelu_act(self.wi_0(hidden_states)) + hidden_linear = self.wi_1(hidden_states) + hidden_states = hidden_gelu * hidden_linear + hidden_states = self.dropout(hidden_states) + hidden_states = self.wo(hidden_states) + return hidden_states + +import torch +def forward(self, hidden_states): + if torch.is_autocast_enabled(): + with torch.cuda.amp.autocast(enabled=False): + return self._forward(hidden_states) + else: + return self._forward(hidden_states) +``` + +Since the automatic detector only reports on inputs and outputs of full frames, once you know where to look, you may +want to analyse the intermediary stages of any specific `forward` function as well. In such a case you can use the +`detect_overflow` helper function to inject the detector where you want it, for example: + +```python +from transformers.debug_utils import detect_overflow + +class T5LayerFF(nn.Module): + [...] + def forward(self, hidden_states): + forwarded_states = self.layer_norm(hidden_states) + detect_overflow(forwarded_states, "after layer_norm") + forwarded_states = self.DenseReluDense(forwarded_states) + detect_overflow(forwarded_states, "after DenseReluDense") + return hidden_states + self.dropout(forwarded_states) +``` + +You can see that we added 2 of these and now we track if `inf` or `nan` for `forwarded_states` was detected +somewhere in between. + +Actually, the detector already reports these because each of the calls in the example above is an `nn.Module`, but +if you had some local direct calculations, this is how you'd check them. 
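
To make the idea of checking a local calculation concrete, here is a small sketch in plain Python. It is not the implementation of `detect_overflow`; `check_values` is a hypothetical helper that works on ordinary floats rather than tensors, and the message wording is an assumption made for illustration.

```python
import math
from typing import Sequence


def check_values(values: Sequence[float], label: str) -> bool:
    """Hypothetical helper: return True and print a note if any value is inf or nan."""
    found_nan = any(math.isnan(v) for v in values)
    found_inf = any(math.isinf(v) for v in values)
    if found_nan:
        print(f"{label} has nans")
    if found_inf:
        print(f"{label} has infs")
    return found_nan or found_inf


# A local calculation that stays finite.
scores = [0.5, -3.0, 6.27e4]
clean = check_values(scores, "scores")

# A local calculation that overflows: 1e308 * 10 exceeds the float64 range and becomes inf.
scaled = [1e308 * 10, 1.0]
overflowed = check_values(scaled, "scaled scores")
```

The pattern is the same as in the `T5LayerFF` example: call the check right after the intermediate result you suspect, with a label that tells you where in `forward` you are.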
+ +Additionally, if you're instantiating the debugger in your own code, you can adjust the number of frames printed from +its default, e.g.: + +```python +from transformers.debug_utils import DebugUnderflowOverflow +debug_overflow = DebugUnderflowOverflow(model, max_frames_to_save=100) +``` + +### Specific batch absolute min and max value tracing + +The same debugging class can be used for per-batch tracing with the underflow/overflow detection feature turned off. + +Let's say you want to watch the absolute min and max values for all the ingredients of each `forward` call of a given +batch, and only do that for batches 1 and 3. Then you instantiate this class as: + +```python +debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1,3]) +``` + +And now full batches 1 and 3 will be traced using the same format as the underflow/overflow detector does. + +Batches are 0-indexed. + +This is helpful if you know that the program starts misbehaving after a certain batch number, so you can fast-forward +right to that area. Here is a sample truncated output for such configuration: + +``` + *** Starting batch number=1 *** +abs min abs max metadata + shared Embedding +1.01e-06 7.92e+02 weight +0.00e+00 2.47e+04 input[0] +5.36e-05 7.92e+02 output +[...] + decoder.dropout Dropout +1.60e-07 2.27e+01 input[0] +0.00e+00 2.52e+01 output + decoder T5Stack + not a tensor output + lm_head Linear +1.01e-06 7.92e+02 weight +0.00e+00 1.11e+00 input[0] +6.06e-02 8.39e+01 output + T5ForConditionalGeneration + not a tensor output + + *** Starting batch number=3 *** +abs min abs max metadata + shared Embedding +1.01e-06 7.92e+02 weight +0.00e+00 2.78e+04 input[0] +5.36e-05 7.92e+02 output +[...] +``` + +Here you will get a huge number of frames dumped - as many as there were forward calls in your model, so it may or may +not be what you want, but sometimes it can be easier to use for debugging purposes than a normal debugger. For example, if +a problem starts happening at batch number 150. 
So you can dump traces for batches 149 and 150 and compare where +numbers started to diverge. + +You can also specify the batch number after which to stop the training, with: + +```python +debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1,3], abort_after_batch_num=3) +``` diff --git a/docs/source/debugging.rst b/docs/source/debugging.rst deleted file mode 100644 index 235e32b77f..0000000000 --- a/docs/source/debugging.rst +++ /dev/null @@ -1,299 +0,0 @@ -.. - Copyright 2021 The HuggingFace Team. All rights reserved. - - Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with - the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on - an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the - specific language governing permissions and limitations under the License. - - - -Debugging -======================================================================================================================= - -Underflow and Overflow Detection ------------------------------------------------------------------------------------------------------------------------ - -.. note:: - - This feature is currently available for PyTorch-only. - -.. note:: - - For multi-GPU training it requires DDP (``torch.distributed.launch``). - -.. note:: - - This feature can be used with any ``nn.Module``-based model. - -If you start getting ``loss=NaN`` or the model inhibits some other abnormal behavior due to ``inf`` or ``nan`` in -activations or weights one needs to discover where the first underflow or overflow happens and what led to it. Luckily -you can accomplish that easily by activating a special module that will do the detection automatically. 
- -If you're using :class:`~transformers.Trainer`, you just need to add: - -.. code-block:: bash - - --debug underflow_overflow - -to the normal command line arguments, or pass ``debug="underflow_overflow"`` when creating the -:class:`~transformers.TrainingArguments` object. - -If you're using your own training loop or another Trainer you can accomplish the same with: - -.. code-block:: python - - from .debug_utils import DebugUnderflowOverflow - debug_overflow = DebugUnderflowOverflow(model) - -:class:`~transformers.debug_utils.DebugUnderflowOverflow` inserts hooks into the model that immediately after each -forward call will test input and output variables and also the corresponding module's weights. As soon as ``inf`` or -``nan`` is detected in at least one element of the activations or weights, the program will assert and print a report -like this (this was caught with ``google/mt5-small`` under fp16 mixed precision): - -.. code-block:: - - Detected inf/nan during batch_number=0 - Last 21 forward frames: - abs min abs max metadata - encoder.block.1.layer.1.DenseReluDense.dropout Dropout - 0.00e+00 2.57e+02 input[0] - 0.00e+00 2.85e+02 output - [...] 
- encoder.block.2.layer.0 T5LayerSelfAttention - 6.78e-04 3.15e+03 input[0] - 2.65e-04 3.42e+03 output[0] - None output[1] - 2.25e-01 1.00e+04 output[2] - encoder.block.2.layer.1.layer_norm T5LayerNorm - 8.69e-02 4.18e-01 weight - 2.65e-04 3.42e+03 input[0] - 1.79e-06 4.65e+00 output - encoder.block.2.layer.1.DenseReluDense.wi_0 Linear - 2.17e-07 4.50e+00 weight - 1.79e-06 4.65e+00 input[0] - 2.68e-06 3.70e+01 output - encoder.block.2.layer.1.DenseReluDense.wi_1 Linear - 8.08e-07 2.66e+01 weight - 1.79e-06 4.65e+00 input[0] - 1.27e-04 2.37e+02 output - encoder.block.2.layer.1.DenseReluDense.dropout Dropout - 0.00e+00 8.76e+03 input[0] - 0.00e+00 9.74e+03 output - encoder.block.2.layer.1.DenseReluDense.wo Linear - 1.01e-06 6.44e+00 weight - 0.00e+00 9.74e+03 input[0] - 3.18e-04 6.27e+04 output - encoder.block.2.layer.1.DenseReluDense T5DenseGatedGeluDense - 1.79e-06 4.65e+00 input[0] - 3.18e-04 6.27e+04 output - encoder.block.2.layer.1.dropout Dropout - 3.18e-04 6.27e+04 input[0] - 0.00e+00 inf output - -The example output has been trimmed in the middle for brevity. - -The second column shows the value of the absolute largest element, so if you have a closer look at the last few frames, -the inputs and outputs were in the range of ``1e4``. So when this training was done under fp16 mixed precision the very -last step overflowed (since under ``fp16`` the largest number before ``inf`` is ``64e3``). To avoid overflows under -``fp16`` the activations must remain way below ``1e4``, because ``1e4 * 1e4 = 1e8`` so any matrix multiplication with -large activations is going to lead to a numerical overflow condition. - -At the very start of the trace you can discover at which batch number the problem occurred (here ``Detected inf/nan -during batch_number=0`` means the problem occurred on the first batch). - -Each reported frame starts by declaring the fully qualified entry for the corresponding module this frame is reporting -for. If we look just at this frame: - -.. 
code-block:: - - encoder.block.2.layer.1.layer_norm T5LayerNorm - 8.69e-02 4.18e-01 weight - 2.65e-04 3.42e+03 input[0] - 1.79e-06 4.65e+00 output - -Here, ``encoder.block.2.layer.1.layer_norm`` indicates that it was a layer norm for the first layer, of the second -block of the encoder. And the specific calls of the ``forward`` is ``T5LayerNorm``. - -Let's look at the last few frames of that report: - -.. code-block:: - - Detected inf/nan during batch_number=0 - Last 21 forward frames: - abs min abs max metadata - [...] - encoder.block.2.layer.1.DenseReluDense.wi_0 Linear - 2.17e-07 4.50e+00 weight - 1.79e-06 4.65e+00 input[0] - 2.68e-06 3.70e+01 output - encoder.block.2.layer.1.DenseReluDense.wi_1 Linear - 8.08e-07 2.66e+01 weight - 1.79e-06 4.65e+00 input[0] - 1.27e-04 2.37e+02 output - encoder.block.2.layer.1.DenseReluDense.wo Linear - 1.01e-06 6.44e+00 weight - 0.00e+00 9.74e+03 input[0] - 3.18e-04 6.27e+04 output - encoder.block.2.layer.1.DenseReluDense T5DenseGatedGeluDense - 1.79e-06 4.65e+00 input[0] - 3.18e-04 6.27e+04 output - encoder.block.2.layer.1.dropout Dropout - 3.18e-04 6.27e+04 input[0] - 0.00e+00 inf output - -The last frame reports for ``Dropout.forward`` function with the first entry for the only input and the second for the -only output. You can see that it was called from an attribute ``dropout`` inside ``DenseReluDense`` class. We can see -that it happened during the first layer, of the 2nd block, during the very first batch. Finally, the absolute largest -input elements was ``6.27e+04`` and same for the output was ``inf``. - -You can see here, that ``T5DenseGatedGeluDense.forward`` resulted in output activations, whose absolute max value was -around 62.7K, which is very close to fp16's top limit of 64K. In the next frame we have ``Dropout`` which renormalizes -the weights, after it zeroed some of the elements, which pushes the absolute max value to more than 64K, and we get an -overflow (``inf``). 
- -As you can see it's the previous frames that we need to look into when the numbers start going into very large for fp16 -numbers. - -Let's match the report to the code from ``models/t5/modeling_t5.py``: - -.. code-block:: python - - class T5DenseGatedGeluDense(nn.Module): - def __init__(self, config): - super().__init__() - self.wi_0 = nn.Linear(config.d_model, config.d_ff, bias=False) - self.wi_1 = nn.Linear(config.d_model, config.d_ff, bias=False) - self.wo = nn.Linear(config.d_ff, config.d_model, bias=False) - self.dropout = nn.Dropout(config.dropout_rate) - self.gelu_act = ACT2FN["gelu_new"] - - def forward(self, hidden_states): - hidden_gelu = self.gelu_act(self.wi_0(hidden_states)) - hidden_linear = self.wi_1(hidden_states) - hidden_states = hidden_gelu * hidden_linear - hidden_states = self.dropout(hidden_states) - hidden_states = self.wo(hidden_states) - return hidden_states - -Now it's easy to see the ``dropout`` call, and all the previous calls as well. - -Since the detection is happening in a forward hook, these reports are printed immediately after each ``forward`` -returns. - -Going back to the full report, to act on it and to fix the problem, we need to go a few frames up where the numbers -started to go up and most likely switch to the ``fp32`` mode here, so that the numbers don't overflow when multiplied -or summed up. Of course, there might be other solutions. For example, we could turn off ``amp`` temporarily if it's -enabled, after moving the original ``forward`` into a helper wrapper, like so: - -.. 
code-block:: python - - def _forward(self, hidden_states): - hidden_gelu = self.gelu_act(self.wi_0(hidden_states)) - hidden_linear = self.wi_1(hidden_states) - hidden_states = hidden_gelu * hidden_linear - hidden_states = self.dropout(hidden_states) - hidden_states = self.wo(hidden_states) - return hidden_states - - import torch - def forward(self, hidden_states): - if torch.is_autocast_enabled(): - with torch.cuda.amp.autocast(enabled=False): - return self._forward(hidden_states) - else: - return self._forward(hidden_states) - -Since the automatic detector only reports on inputs and outputs of full frames, once you know where to look, you may -want to analyse the intermediary stages of any specific ``forward`` function as well. In such a case you can use the -``detect_overflow`` helper function to inject the detector where you want it, for example: - -.. code-block:: python - - from debug_utils import detect_overflow - - class T5LayerFF(nn.Module): - [...] - def forward(self, hidden_states): - forwarded_states = self.layer_norm(hidden_states) - detect_overflow(forwarded_states, "after layer_norm") - forwarded_states = self.DenseReluDense(forwarded_states) - detect_overflow(forwarded_states, "after DenseReluDense") - return hidden_states + self.dropout(forwarded_states) - -You can see that we added 2 of these and now we track if ``inf`` or ``nan`` for ``forwarded_states`` was detected -somewhere in between. - -Actually, the detector already reports these because each of the calls in the example above is a `nn.Module``, but -let's say if you had some local direct calculations this is how you'd do that. - -Additionally, if you're instantiating the debugger in your own code, you can adjust the number of frames printed from -its default, e.g.: - -.. 
code-block:: python - - from .debug_utils import DebugUnderflowOverflow - debug_overflow = DebugUnderflowOverflow(model, max_frames_to_save=100) - -Specific batch absolute mix and max value tracing -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -The same debugging class can be used for per-batch tracing with the underflow/overflow detection feature turned off. - -Let's say you want to watch the absolute min and max values for all the ingredients of each ``forward`` call of a given -batch, and only do that for batches 1 and 3. Then you instantiate this class as: - -.. code-block:: python - - debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1,3]) - -And now full batches 1 and 3 will be traced using the same format as the underflow/overflow detector does. - -Batches are 0-indexed. - -This is helpful if you know that the program starts misbehaving after a certain batch number, so you can fast-forward -right to that area. Here is a sample truncated output for such configuration: - -.. code-block:: - - *** Starting batch number=1 *** - abs min abs max metadata - shared Embedding - 1.01e-06 7.92e+02 weight - 0.00e+00 2.47e+04 input[0] - 5.36e-05 7.92e+02 output - [...] - decoder.dropout Dropout - 1.60e-07 2.27e+01 input[0] - 0.00e+00 2.52e+01 output - decoder T5Stack - not a tensor output - lm_head Linear - 1.01e-06 7.92e+02 weight - 0.00e+00 1.11e+00 input[0] - 6.06e-02 8.39e+01 output - T5ForConditionalGeneration - not a tensor output - - *** Starting batch number=3 *** - abs min abs max metadata - shared Embedding - 1.01e-06 7.92e+02 weight - 0.00e+00 2.78e+04 input[0] - 5.36e-05 7.92e+02 output - [...] - -Here you will get a huge number of frames dumped - as many as there were forward calls in your model, so it may or may -not what you want, but sometimes it can be easier to use for debugging purposes than a normal debugger. 
For example, if -a problem starts happening at batch number 150. So you can dump traces for batches 149 and 150 and compare where -numbers started to diverge. - -You can also specify the batch number after which to stop the training, with: - -.. code-block:: python - - debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1,3], abort_after_batch_num=3) diff --git a/docs/source/main_classes/deepspeed.mdx b/docs/source/main_classes/deepspeed.mdx new file mode 100644 index 0000000000..c68a15fbc6 --- /dev/null +++ b/docs/source/main_classes/deepspeed.mdx @@ -0,0 +1,1758 @@ + + +# DeepSpeed Integration + +[DeepSpeed](https://github.com/microsoft/DeepSpeed) implements everything described in the [ZeRO paper](https://arxiv.org/abs/1910.02054). Currently it provides full support for: + +1. Optimizer state partitioning (ZeRO stage 1) +2. Gradient partitioning (ZeRO stage 2) +3. Parameter partitioning (ZeRO stage 3) +4. Custom mixed precision training handling +5. A range of fast CUDA-extension-based optimizers +6. ZeRO-Offload to CPU and NVMe + +ZeRO-Offload has its own dedicated paper: [ZeRO-Offload: Democratizing Billion-Scale Model Training](https://arxiv.org/abs/2101.06840). And NVMe-support is described in the paper [ZeRO-Infinity: Breaking the GPU +Memory Wall for Extreme Scale Deep Learning](https://arxiv.org/abs/2104.07857). + +DeepSpeed ZeRO-2 is primarily used only for training, as its features are of no use to inference. + +DeepSpeed ZeRO-3 can be used for inference as well, since it allows huge models to be loaded on multiple GPUs, which +won't be possible on a single GPU. + +🤗 Transformers integrates [DeepSpeed](https://github.com/microsoft/DeepSpeed) via 2 options: + +1. Integration of the core DeepSpeed features via [`Trainer`]. This is everything done for you type + of integration - just supply your custom config file or use our template and you have nothing else to do. Most of + this document is focused on this feature. +2. 
If you don't use [`Trainer`] and want to use your own Trainer where you integrated DeepSpeed + yourself, core functionality functions like `from_pretrained` and `from_config` include integration of essential + parts of DeepSpeed like `zero.Init` for ZeRO stage 3 and higher. To tap into this feature read the docs on + [deepspeed-non-trainer-integration](#deepspeed-non-trainer-integration). + +What is integrated: + +Training: + +1. DeepSpeed ZeRO training supports the full ZeRO stages 1, 2 and 3 with ZeRO-Infinity (CPU and NVME offload). + +Inference: + +1. DeepSpeed ZeRO Inference supports ZeRO stage 3 with ZeRO-Infinity. It uses the same ZeRO protocol as training, but + it doesn't use an optimizer and a lr scheduler and only stage 3 is relevant. For more details see: + [deepspeed-zero-inference](#deepspeed-zero-inference). + +There is also DeepSpeed Inference - this is a totally different technology which uses Tensor Parallelism instead of +ZeRO (coming soon). + + + + + + +## Trainer Deepspeed Integration + + + + +### Installation + +Install the library via pypi: + +```bash +pip install deepspeed +``` + +or via `transformers`' `extras`: + +```bash +pip install transformers[deepspeed] +``` + +or find more details on [the DeepSpeed's GitHub page](https://github.com/microsoft/deepspeed#installation) and +[advanced install](https://www.deepspeed.ai/tutorials/advanced-install/). + +If you're still struggling with the build, first make sure to read [zero-install-notes](#zero-install-notes). + +If you don't prebuild the extensions and rely on them to be built at run time and you tried all of the above solutions +to no avail, the next thing to try is to pre-build the modules before installing them. + +To make a local build for DeepSpeed: + +```bash +git clone https://github.com/microsoft/DeepSpeed/ +cd DeepSpeed +rm -rf build +TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 pip install . 
\ +--global-option="build_ext" --global-option="-j8" --no-cache -v \ +--disable-pip-version-check 2>&1 | tee build.log +``` + +If you intend to use NVMe offload you will need to also include `DS_BUILD_AIO=1` in the instructions above (and also +install *libaio-dev* system-wide). + +Edit `TORCH_CUDA_ARCH_LIST` to insert the code for the architectures of the GPU cards you intend to use. Assuming all +your cards are the same you can get the arch via: + +```bash +CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.get_device_capability())" +``` + +So if you get `8, 6`, then use `TORCH_CUDA_ARCH_LIST="8.6"`. If you have multiple different cards, you can list all +of them like so `TORCH_CUDA_ARCH_LIST="6.1;8.6"` + +If you need to use the same setup on multiple machines, make a binary wheel: + +```bash +git clone https://github.com/microsoft/DeepSpeed/ +cd DeepSpeed +rm -rf build +TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 \ +python setup.py build_ext -j8 bdist_wheel +``` + +It will generate something like `dist/deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl` which you can now install +as `pip install deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl` locally or on any other machine. + +Again, remember to adjust `TORCH_CUDA_ARCH_LIST` to the target architectures. + +You can find the complete list of NVIDIA GPUs and their corresponding **Compute Capabilities** (same as arch in this +context) [here](https://developer.nvidia.com/cuda-gpus). + +You can check the archs pytorch was built with using: + +```bash +python -c "import torch; print(torch.cuda.get_arch_list())" +``` + +Here is how to find out the arch for one of the installed GPUs. 
For example, for GPU 0: + +```bash +CUDA_VISIBLE_DEVICES=0 python -c "import torch; \ +print(torch.cuda.get_device_properties(torch.device('cuda')))" +``` + +If the output is: + +```bash +_CudaDeviceProperties(name='GeForce RTX 3090', major=8, minor=6, total_memory=24268MB, multi_processor_count=82) +``` + +then you know that this card's arch is `8.6`. + +You can also leave `TORCH_CUDA_ARCH_LIST` out completely and then the build program will automatically query the +architecture of the GPUs the build is made on. This may or may not match the GPUs on the target machines, which is why +it's best to specify the desired archs explicitly. + +If after trying everything suggested you still encounter build issues, please open a GitHub issue with +[DeepSpeed](https://github.com/microsoft/DeepSpeed/issues). + + + + + +### Deployment with multiple GPUs + +To deploy this feature with multiple GPUs adjust the [`Trainer`] command line arguments as +follows: + +1. replace `python -m torch.distributed.launch` with `deepspeed`. +2. add a new argument `--deepspeed ds_config.json`, where `ds_config.json` is the DeepSpeed configuration file as + documented [here](https://www.deepspeed.ai/docs/config-json/). The file naming is up to you. + +Therefore, if your original command line looked as follows: + +```bash +python -m torch.distributed.launch --nproc_per_node=2 your_program.py +``` + +Now it should be: + +```bash +deepspeed --num_gpus=2 your_program.py --deepspeed ds_config.json +``` + +Unlike `torch.distributed.launch`, where you have to specify how many GPUs to use with `--nproc_per_node`, with the +`deepspeed` launcher you don't have to use the corresponding `--num_gpus` if you want all of your GPUs used. The +full details on how to configure various nodes and GPUs can be found [here](https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node). 
+ +In fact, you can continue using `-m torch.distributed.launch` with DeepSpeed as long as you don't need to use +`deepspeed` launcher-specific arguments. Typically if you don't need a multi-node setup you're not required to use +the `deepspeed` launcher. But since in the DeepSpeed documentation it'll be used everywhere, for consistency we will +use it here as well. + +Here is an example of running `run_translation.py` under DeepSpeed deploying all available GPUs: + +```bash +deepspeed examples/pytorch/translation/run_translation.py \ +--deepspeed tests/deepspeed/ds_config_zero3.json \ +--model_name_or_path t5-small --per_device_train_batch_size 1 \ +--output_dir output_dir --overwrite_output_dir --fp16 \ +--do_train --max_train_samples 500 --num_train_epochs 1 \ +--dataset_name wmt16 --dataset_config "ro-en" \ +--source_lang en --target_lang ro +``` + +Note that in the DeepSpeed documentation you are likely to see `--deepspeed --deepspeed_config ds_config.json` - i.e. +two DeepSpeed-related arguments, but for the sake of simplicity, and since there are already so many arguments to deal +with, we combined the two into a single argument. + +For some practical usage examples, please see this [post](https://github.com/huggingface/transformers/issues/8771#issuecomment-759248400). + + + + + +### Deployment with one GPU + +To deploy DeepSpeed with one GPU adjust the [`Trainer`] command line arguments as follows: + +```bash +deepspeed --num_gpus=1 examples/pytorch/translation/run_translation.py \ +--deepspeed tests/deepspeed/ds_config_zero2.json \ +--model_name_or_path t5-small --per_device_train_batch_size 1 \ +--output_dir output_dir --overwrite_output_dir --fp16 \ +--do_train --max_train_samples 500 --num_train_epochs 1 \ +--dataset_name wmt16 --dataset_config "ro-en" \ +--source_lang en --target_lang ro +``` + +This is almost the same as with multiple-GPUs, but here we tell DeepSpeed explicitly to use just one GPU via +`--num_gpus=1`. 
By default, DeepSpeed deploys all GPUs it can see on the given node. If you have only 1 GPU to start +with, then you don't need this argument. The following [documentation](https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node) discusses the launcher options. + +Why would you want to use DeepSpeed with just one GPU? + +1. It has a ZeRO-offload feature which can delegate some computations and memory to the host's CPU and RAM, and thus + leave more GPU resources for the model's needs - e.g. a larger batch size, or fitting a very big model which + normally wouldn't fit. +2. It provides a smart GPU memory management system that minimizes memory fragmentation, which again allows you to fit + bigger models and data batches. + +While we are going to discuss the configuration in detail next, the key to getting a huge improvement on a single GPU +with DeepSpeed is to have at least the following configuration in the configuration file: + +```json +{ + "zero_optimization": { + "stage": 2, + "offload_optimizer": { + "device": "cpu", + "pin_memory": true + }, + "allgather_partitions": true, + "allgather_bucket_size": 2e8, + "reduce_scatter": true, + "reduce_bucket_size": 2e8, + "overlap_comm": true, + "contiguous_gradients": true + } +} +``` + +which enables optimizer offload and some other important features. You may experiment with the buffer sizes; you will +find more details in the discussion below. + +For a practical usage example of this type of deployment, please see this [post](https://github.com/huggingface/transformers/issues/8771#issuecomment-759176685). + +You may also try ZeRO-3 with CPU and NVMe offload as explained further in this document. + + + +Notes: + +- if you need to run on a specific GPU, which is different from GPU 0, you can't use `CUDA_VISIBLE_DEVICES` to limit + the visible scope of available GPUs.
Instead, you have to use the following syntax: + + ```bash + deepspeed --include localhost:1 examples/pytorch/translation/run_translation.py ... + ``` + + In this example, we tell DeepSpeed to use GPU 1 (second gpu). + + + + + +### Deployment in Notebooks + +The problem with running notebook cells as a script is that there is no normal `deepspeed` launcher to rely on, so +under certain setups we have to emulate it. + +If you're using only 1 GPU, here is how you'd have to adjust your training code in the notebook to use DeepSpeed. + +```python +# DeepSpeed requires a distributed environment even when only one process is used. +# This emulates a launcher in the notebook +import os +os.environ['MASTER_ADDR'] = 'localhost' +os.environ['MASTER_PORT'] = '9994' # modify if RuntimeError: Address already in use +os.environ['RANK'] = "0" +os.environ['LOCAL_RANK'] = "0" +os.environ['WORLD_SIZE'] = "1" + +# Now proceed as normal, plus pass the deepspeed config file +training_args = TrainingArguments(..., deepspeed="ds_config_zero3.json") +trainer = Trainer(...) +trainer.train() +``` + +Note: `...` stands for the normal arguments that you'd pass to the functions. + +If you want to use more than 1 GPU, you must use a multi-process environment for DeepSpeed to work. That is, you have +to use the launcher for that purpose and this cannot be accomplished by emulating the distributed environment presented +at the beginning of this section. 
+ +If you want to create the config file on the fly in the notebook in the current directory, you could have a dedicated +cell with: + +```python +%%bash +cat <<'EOT' > ds_config_zero3.json +{ + "fp16": { + "enabled": "auto", + "loss_scale": 0, + "loss_scale_window": 1000, + "initial_scale_power": 16, + "hysteresis": 2, + "min_loss_scale": 1 + }, + + "optimizer": { + "type": "AdamW", + "params": { + "lr": "auto", + "betas": "auto", + "eps": "auto", + "weight_decay": "auto" + } + }, + + "scheduler": { + "type": "WarmupLR", + "params": { + "warmup_min_lr": "auto", + "warmup_max_lr": "auto", + "warmup_num_steps": "auto" + } + }, + + "zero_optimization": { + "stage": 3, + "offload_optimizer": { + "device": "cpu", + "pin_memory": true + }, + "offload_param": { + "device": "cpu", + "pin_memory": true + }, + "overlap_comm": true, + "contiguous_gradients": true, + "sub_group_size": 1e9, + "reduce_bucket_size": "auto", + "stage3_prefetch_bucket_size": "auto", + "stage3_param_persistence_threshold": "auto", + "stage3_max_live_parameters": 1e9, + "stage3_max_reuse_distance": 1e9, + "stage3_gather_fp16_weights_on_model_save": true + }, + + "gradient_accumulation_steps": "auto", + "gradient_clipping": "auto", + "steps_per_print": 2000, + "train_batch_size": "auto", + "train_micro_batch_size_per_gpu": "auto", + "wall_clock_breakdown": false +} +EOT +``` + +If the training script is in a normal file and not in the notebook cells, you can launch `deepspeed` normally via +shell from a cell. For example, to use `run_translation.py` you would launch it with: + +```python +!git clone https://github.com/huggingface/transformers +!cd transformers; deepspeed examples/pytorch/translation/run_translation.py ... +``` + +or with `%%bash` magic, where you can write a multi-line code for the shell program to run: + +```python +%%bash + +git clone https://github.com/huggingface/transformers +cd transformers +deepspeed examples/pytorch/translation/run_translation.py ... 
+``` + +In that case you don't need any of the code presented at the beginning of this section. + +Note: while `%%bash` magic is neat, it currently buffers the output so you won't see the logs until the process +completes. + + + + + + +### Configuration + +For the complete guide to the DeepSpeed configuration options that can be used in its configuration file please refer +to the [following documentation](https://www.deepspeed.ai/docs/config-json/). + +You can find dozens of DeepSpeed configuration examples that address various practical needs in [the DeepSpeedExamples +repo](https://github.com/microsoft/DeepSpeedExamples): + +```bash +git clone https://github.com/microsoft/DeepSpeedExamples +cd DeepSpeedExamples +find . -name '*json' +``` + +Continuing the code from above, let's say you're looking to configure the Lamb optimizer. You can search through the +example `.json` files with: + +```bash +grep -i Lamb $(find . -name '*json') +``` + +Some more examples are to be found in the [main repo](https://github.com/microsoft/DeepSpeed) as well. + +When using DeepSpeed you always need to supply a DeepSpeed configuration file, yet some configuration parameters have +to be configured via the command line. You will find the nuances in the rest of this guide.
+ +To get an idea of what a DeepSpeed configuration file looks like, here is one that activates ZeRO stage 2 features, +including optimizer states cpu offload, uses the `AdamW` optimizer and `WarmupLR` scheduler and will enable mixed +precision training if `--fp16` is passed: + +```json +{ + "fp16": { + "enabled": "auto", + "loss_scale": 0, + "loss_scale_window": 1000, + "initial_scale_power": 16, + "hysteresis": 2, + "min_loss_scale": 1 + }, + + "optimizer": { + "type": "AdamW", + "params": { + "lr": "auto", + "betas": "auto", + "eps": "auto", + "weight_decay": "auto" + } + }, + + "scheduler": { + "type": "WarmupLR", + "params": { + "warmup_min_lr": "auto", + "warmup_max_lr": "auto", + "warmup_num_steps": "auto" + } + }, + + "zero_optimization": { + "stage": 2, + "offload_optimizer": { + "device": "cpu", + "pin_memory": true + }, + "allgather_partitions": true, + "allgather_bucket_size": 2e8, + "overlap_comm": true, + "reduce_scatter": true, + "reduce_bucket_size": 2e8, + "contiguous_gradients": true + }, + + "gradient_accumulation_steps": "auto", + "gradient_clipping": "auto", + "train_batch_size": "auto", + "train_micro_batch_size_per_gpu": "auto" +} +``` + +When you execute the program, DeepSpeed will log the configuration it received from the [`Trainer`] +to the console, so you can see exactly what final configuration was passed to it. + + + + + +### Passing Configuration + +As discussed in this document, normally the DeepSpeed configuration is passed as a path to a json file, but if you're +not using the command line interface to configure the training, and instead instantiate the +[`Trainer`] via [`TrainingArguments`], then for the `deepspeed` argument you can +pass a nested `dict`. This allows you to create the configuration on the fly and doesn't require you to write it to +the file system before passing it to [`TrainingArguments`].
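As an illustration of building the nested `dict` on the fly, here is a minimal sketch that reads an existing JSON configuration and applies overrides in Python. The helper names `deep_update` and `load_ds_config` are hypothetical (they are not part of 🤗 Transformers or DeepSpeed), and the throwaway file only stands in for a real `ds_config.json`.

```python
import json
import os
import tempfile


def deep_update(target, updates):
    """Recursively merge `updates` into `target` in place and return `target`."""
    for key, value in updates.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            deep_update(target[key], value)
        else:
            target[key] = value
    return target


def load_ds_config(path, overrides=None):
    """Hypothetical helper: read a DeepSpeed JSON config into a dict, then apply overrides."""
    with open(path) as f:
        config = json.load(f)
    return deep_update(config, overrides or {})


# Demo with a throwaway file standing in for a real ds_config.json
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = os.path.join(tmp_dir, "ds_config.json")
    with open(config_path, "w") as f:
        json.dump({"zero_optimization": {"stage": 2, "overlap_comm": True}}, f)

    # Override only the stage; every other key is kept as it was in the file
    ds_config_dict = load_ds_config(config_path, {"zero_optimization": {"stage": 3}})

print(ds_config_dict)
```

The resulting `ds_config_dict` is a plain nested `dict`, so it can be handed to the `deepspeed` argument of [`TrainingArguments`] in place of a file path.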
+ +To summarize you can do: + +```python +TrainingArguments(..., deepspeed="/path/to/ds_config.json") +``` + +or: + +```python +ds_config_dict = dict(scheduler=scheduler_params, optimizer=optimizer_params) +TrainingArguments(..., deepspeed=ds_config_dict) +``` + + + +### Shared Configuration + + + + +This section is a must-read + + + +Some configuration values are required by both the [`Trainer`] and DeepSpeed to function correctly, +therefore, to prevent conflicting definitions, which could lead to hard to detect errors, we chose to configure those +via the [`Trainer`] command line arguments. + +Additionally, some configuration values are derived automatically based on the model's configuration, so instead of +remembering to manually adjust multiple values, it's best to let the [`Trainer`] do the majority +of configuration for you. + +Therefore, in the rest of this guide you will find a special configuration value: `auto`, which when set will be +automatically replaced with the correct or most efficient value. Please feel free to choose to ignore this +recommendation and set the values explicitly, in which case be very careful that your +[`Trainer`] arguments and DeepSpeed configurations agree. For example, are you using the same +learning rate, or batch size, or gradient accumulation settings? If these mismatch, the training may fail in very +difficult to detect ways. You have been warned. + +There are multiple other values that are specific to DeepSpeed only and those you will have to set manually to suit +your needs. + +In your own programs, you can also use the following approach if you'd like to modify the DeepSpeed config as a master +and configure [`TrainingArguments`] based on that. The steps are: + +1. Create or load the DeepSpeed configuration to be used as a master configuration +2.
Create the [`TrainingArguments`] object based on these values + +Do note that some values, such as `scheduler.params.total_num_steps` are calculated by +[`Trainer`] during `train`, but you can of course do the math yourself. + + + +### ZeRO + +[Zero Redundancy Optimizer (ZeRO)](https://www.deepspeed.ai/tutorials/zero/) is the workhorse of DeepSpeed. It +supports 3 different levels (stages) of optimization. The first one is not very interesting for scalability purposes, +therefore this document focuses on stages 2 and 3. Stage 3 is further improved by the latest addition of ZeRO-Infinity. +You will find more in-depth information in the DeepSpeed documentation. + +The `zero_optimization` section of the configuration file is the most important part ([docs](https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training)), since that is where you define +which ZeRO stages you want to enable and how to configure them. You will find the explanation for each parameter in the +DeepSpeed docs. + +This section has to be configured exclusively via DeepSpeed configuration - the [`Trainer`] provides +no equivalent command line arguments. + +Note: currently DeepSpeed doesn't validate parameter names, so if you misspell any, it'll use the default setting for +the parameter that got misspelled. You can watch the DeepSpeed engine start up log messages to see what values it is +going to use.
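Since misspelled names are silently ignored, a small pre-flight check can help catch typos before launching. The following is only a sketch: `find_unknown_zero_keys` is a hypothetical helper, and `KNOWN_ZERO_KEYS` is an assumed list covering just the keys that appear in the ZeRO-2 and ZeRO-3 examples of this document, not the complete DeepSpeed schema.

```python
import difflib

# Assumption: only the keys used in the ZeRO-2/ZeRO-3 examples of this document.
# This is NOT the full DeepSpeed schema; extend it from the DeepSpeed docs as needed.
KNOWN_ZERO_KEYS = sorted([
    "stage", "offload_optimizer", "offload_param", "allgather_partitions",
    "allgather_bucket_size", "overlap_comm", "reduce_scatter", "reduce_bucket_size",
    "contiguous_gradients", "sub_group_size", "stage3_prefetch_bucket_size",
    "stage3_param_persistence_threshold", "stage3_max_live_parameters",
    "stage3_max_reuse_distance", "stage3_gather_fp16_weights_on_model_save",
])


def find_unknown_zero_keys(ds_config):
    """Map each unrecognized `zero_optimization` key to its closest known key (or None)."""
    section = ds_config.get("zero_optimization", {})
    unknown = {}
    for key in section:
        if key not in KNOWN_ZERO_KEYS:
            matches = difflib.get_close_matches(key, KNOWN_ZERO_KEYS, n=1)
            unknown[key] = matches[0] if matches else None
    return unknown


# "overlap_com" is a deliberate typo for "overlap_comm"
config = {"zero_optimization": {"stage": 2, "overlap_com": True}}
for bad_key, suggestion in find_unknown_zero_keys(config).items():
    print(f"unknown key {bad_key!r}, did you mean {suggestion!r}?")
```

A key flagged here is not necessarily wrong, since the list is incomplete, so treat the output as a hint and confirm against the DeepSpeed start up log.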
+ + + + + +#### ZeRO-2 Config + +The following is an example configuration for ZeRO stage 2: + +```json +{ + "zero_optimization": { + "stage": 2, + "offload_optimizer": { + "device": "cpu", + "pin_memory": true + }, + "allgather_partitions": true, + "allgather_bucket_size": 5e8, + "overlap_comm": true, + "reduce_scatter": true, + "reduce_bucket_size": 5e8, + "contiguous_gradients": true + } +} +``` + +**Performance tuning:** + +- enabling `offload_optimizer` should reduce GPU RAM usage (it requires `"stage": 2`) +- `"overlap_comm": true` trades increased GPU RAM usage for lower all-reduce latency. `overlap_comm` uses 4.5x + the `allgather_bucket_size` and `reduce_bucket_size` values. So if they are set to 5e8, this requires a 9GB + footprint (`5e8 x 2Bytes x 2 x 4.5`). Therefore, if you have a GPU with 8GB or less RAM, to avoid getting + OOM-errors you will need to reduce those parameters to about `2e8`, which would require 3.6GB. You will want to do + the same on larger capacity GPUs as well, if you're starting to hit OOM. +- when reducing these buffers you're trading communication speed for more available GPU RAM. The smaller the buffer size, + the slower the communication, and the more GPU RAM will be available to other tasks. So if a bigger batch size is + important, getting a slightly slower training time could be a good trade.
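The footprint arithmetic above (`5e8 x 2Bytes x 2 x 4.5`) can be wrapped in a small helper to try out other bucket sizes. This is a sketch only: the function and parameter names are hypothetical, and reading the `x 2` factor as two buffers is an assumption, since the text gives the factor without naming it.

```python
def overlap_comm_footprint_gb(bucket_size, bytes_per_element=2, num_buffers=2, overlap_factor=4.5):
    """Estimate the GPU memory (in GB, 1e9 bytes) used by the communication buckets.

    bucket_size:       value of allgather_bucket_size / reduce_bucket_size (elements)
    bytes_per_element: 2 bytes, as in the formula above
    num_buffers:       the "x 2" factor of the formula (assumed meaning: two buffers)
    overlap_factor:    the 4.5x multiplier applied when overlap_comm is enabled
    """
    return bucket_size * bytes_per_element * num_buffers * overlap_factor / 1e9


for size in (5e8, 2e8):
    print(f"bucket size {size:.0e}: ~{overlap_comm_footprint_gb(size):.1f}GB")
```

With the two bucket sizes discussed above this reproduces the 9GB and 3.6GB figures.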
+ + + + + +#### ZeRO-3 Config + +The following is an example configuration for ZeRO stage 3: + +```json +{ + "zero_optimization": { + "stage": 3, + "offload_optimizer": { + "device": "cpu", + "pin_memory": true + }, + "offload_param": { + "device": "cpu", + "pin_memory": true + }, + "overlap_comm": true, + "contiguous_gradients": true, + "sub_group_size": 1e9, + "reduce_bucket_size": "auto", + "stage3_prefetch_bucket_size": "auto", + "stage3_param_persistence_threshold": "auto", + "stage3_max_live_parameters": 1e9, + "stage3_max_reuse_distance": 1e9, + "stage3_gather_fp16_weights_on_model_save": true + } +} +``` + +If you are getting OOMs because your model or activations don't fit into the GPU memory and you have unutilized CPU +memory, offloading the optimizer states and parameters to CPU memory with `"device": "cpu"` may solve this limitation. +If you don't want to offload to CPU memory, use `none` instead of `cpu` for the `device` entry. Offloading to +NVMe is discussed further down. + +Pinned memory is enabled with `pin_memory` set to `true`. This feature can improve the throughput at the cost of +making less memory available to other processes. Pinned memory is set aside for the specific process that requested it +and it's typically accessed much faster than normal CPU memory. + +**Performance tuning:** + +- `stage3_max_live_parameters`: `1e9` +- `stage3_max_reuse_distance`: `1e9` + +If hitting OOM reduce `stage3_max_live_parameters` and `stage3_max_reuse_distance`. They should have minimal impact +on performance unless you are doing activation checkpointing. `1e9` would consume ~2GB. The memory is shared by +`stage3_max_live_parameters` and `stage3_max_reuse_distance`, so it's not additive, it's just 2GB total. + +`stage3_max_live_parameters` is the upper limit on how many full parameters you want to keep on the GPU at any given +time.
"reuse distance" is a metric we are using to figure out when a parameter will be used again in the future, and we +use the `stage3_max_reuse_distance` to decide whether to throw away the parameter or to keep it. If a parameter is +going to be used again in the near future (less than `stage3_max_reuse_distance`) then we keep it to reduce communication +overhead. This is super helpful when you have activation checkpointing enabled, where we do a forward recompute and +backward passes at a single layer granularity and want to keep the parameter in the forward recompute till the backward + +The following configuration values depend on the model's hidden size: + +- `reduce_bucket_size`: `hidden_size*hidden_size` +- `stage3_prefetch_bucket_size`: `0.9 * hidden_size * hidden_size` +- `stage3_param_persistence_threshold`: `10 * hidden_size` + +therefore set these values to `auto` and the [`Trainer`] will automatically assign the recommended +values. But, of course, feel free to set these explicitly as well. + +`stage3_gather_fp16_weights_on_model_save` enables model fp16 weights consolidation when the model gets saved. With large +models and multiple GPUs this is an expensive operation both in terms of memory and speed. It's currently required if +you plan to resume the training. Watch out for future updates that will remove this limitation and make things more +flexible. + +If you're migrating from a ZeRO-2 configuration note that the `allgather_partitions`, `allgather_bucket_size` and +`reduce_scatter` configuration parameters are not used in ZeRO-3. If you keep these in the config file they will just +be ignored. + +- `sub_group_size`: `1e9` + +`sub_group_size` controls the granularity in which parameters are updated during optimizer steps. Parameters are +grouped into buckets of `sub_group_size` and each bucket is updated one at a time.
When used with NVMe offload in +ZeRO-Infinity, `sub_group_size` therefore controls the granularity in which model states are moved in and out of CPU +memory from NVMe during the optimizer step. This prevents running out of CPU memory for extremely large models. + +You can leave `sub_group_size` at its default value of *1e9* when not using NVMe offload. You may want to change its +default value in the following cases: + +1. Running into OOM during the optimizer step: reduce `sub_group_size` to reduce memory utilization of temporary buffers +2. The optimizer step is taking a long time: increase `sub_group_size` to improve bandwidth utilization as a result of + the increased data buffers. + + + + +### NVMe Support + +ZeRO-Infinity allows for training incredibly large models by extending GPU and CPU memory with NVMe memory. Thanks to +smart partitioning and tiling algorithms each GPU needs to send and receive very small amounts of data during +offloading, so a modern NVMe is fast enough to provide an even larger total memory pool to your training +process. ZeRO-Infinity requires ZeRO-3 enabled.
+ +The following configuration example enables NVMe to offload both optimizer states and the params: + +```json +{ + "zero_optimization": { + "stage": 3, + "offload_optimizer": { + "device": "nvme", + "nvme_path": "/local_nvme", + "pin_memory": true, + "buffer_count": 4, + "fast_init": false + }, + "offload_param": { + "device": "nvme", + "nvme_path": "/local_nvme", + "pin_memory": true, + "buffer_count": 5, + "buffer_size": 1e8, + "max_in_cpu": 1e9 + }, + "aio": { + "block_size": 262144, + "queue_depth": 32, + "thread_count": 1, + "single_submit": false, + "overlap_events": true + }, + "overlap_comm": true, + "contiguous_gradients": true, + "sub_group_size": 1e9, + "reduce_bucket_size": "auto", + "stage3_prefetch_bucket_size": "auto", + "stage3_param_persistence_threshold": "auto", + "stage3_max_live_parameters": 1e9, + "stage3_max_reuse_distance": 1e9, + "stage3_gather_fp16_weights_on_model_save": true + } +} +``` + +You can choose to offload both optimizer states and params to NVMe, or just one of them or none. For example, if you +have copious amounts of CPU memory available, by all means offload to CPU memory only as it'd be faster (hint: +*"device": "cpu"*). + +Here is the full documentation for offloading [optimizer states](https://www.deepspeed.ai/docs/config-json/#optimizer-offloading) and [parameters](https://www.deepspeed.ai/docs/config-json/#parameter-offloading). + +Make sure that your `nvme_path` is actually an NVMe, since it will work with a normal hard drive or SSD, but it'll +be much, much slower. The fast scalable training was designed with modern NVMe transfer speeds in mind (as of this +writing one can have ~3.5GB/s read, ~3GB/s write peak speeds). + +In order to figure out the optimal `aio` configuration block you must run a benchmark on your target setup, as +[explained here](https://github.com/microsoft/DeepSpeed/issues/998).
+ + + + + +#### ZeRO-2 vs ZeRO-3 Performance + +ZeRO-3 is likely to be slower than ZeRO-2 if everything else is configured the same because the former has to gather +model weights in addition to what ZeRO-2 does. If ZeRO-2 meets your needs and you don't need to scale beyond a few GPUs +then you may choose to stick to it. It's important to understand that ZeRO-3 enables a much higher scalability capacity +at a cost of speed. + +It's possible to adjust the ZeRO-3 configuration to make it perform closer to ZeRO-2: + +- set `stage3_param_persistence_threshold` to a very large number - larger than the largest parameter, e.g., `6 * hidden_size * hidden_size`. This will keep the parameters on the GPUs. +- turn off `offload_param` since ZeRO-2 doesn't have that option. + +The performance will likely improve significantly with just `offload_param` turned off, even if you don't change +`stage3_param_persistence_threshold`. Of course, these changes will impact the size of the model you can train. So +these help you to trade scalability for speed depending on your needs.
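The two adjustments above can be applied programmatically to a configuration `dict`. This is a minimal sketch under stated assumptions: `tune_zero3_for_speed` is a hypothetical helper, the parameter-offload entry is spelled `offload_param` as in the configuration examples of this document, and "turning off" is done by setting its `device` to `none`, following the ZeRO-3 Config section.

```python
import copy


def tune_zero3_for_speed(ds_config, hidden_size):
    """Return a copy of a ZeRO-3 config adjusted to behave closer to ZeRO-2."""
    tuned = copy.deepcopy(ds_config)  # leave the caller's config untouched
    zero = tuned["zero_optimization"]

    # Larger than the largest parameter, so parameters stay on the GPUs
    zero["stage3_param_persistence_threshold"] = 6 * hidden_size * hidden_size

    # Turn off parameter offload if it was configured
    if "offload_param" in zero:
        zero["offload_param"]["device"] = "none"
    return tuned


zero3_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "stage3_param_persistence_threshold": "auto",
    }
}
fast_config = tune_zero3_for_speed(zero3_config, hidden_size=1024)
print(fast_config["zero_optimization"])
```

Here `hidden_size=1024` is just an illustrative value; use the hidden size of your own model.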
+ + + + + +#### ZeRO-2 Example + +Here is a full ZeRO-2 auto-configuration file `ds_config_zero2.json`: + +```json +{ + "fp16": { + "enabled": "auto", + "loss_scale": 0, + "loss_scale_window": 1000, + "initial_scale_power": 16, + "hysteresis": 2, + "min_loss_scale": 1 + }, + + "optimizer": { + "type": "AdamW", + "params": { + "lr": "auto", + "betas": "auto", + "eps": "auto", + "weight_decay": "auto" + } + }, + + "scheduler": { + "type": "WarmupLR", + "params": { + "warmup_min_lr": "auto", + "warmup_max_lr": "auto", + "warmup_num_steps": "auto" + } + }, + + "zero_optimization": { + "stage": 2, + "offload_optimizer": { + "device": "cpu", + "pin_memory": true + }, + "allgather_partitions": true, + "allgather_bucket_size": 2e8, + "overlap_comm": true, + "reduce_scatter": true, + "reduce_bucket_size": 2e8, + "contiguous_gradients": true + }, + + "gradient_accumulation_steps": "auto", + "gradient_clipping": "auto", + "steps_per_print": 2000, + "train_batch_size": "auto", + "train_micro_batch_size_per_gpu": "auto", + "wall_clock_breakdown": false +} +``` + +Here is a full ZeRO-2 all-enabled manually set configuration file. It is here mainly for you to see what the typical +values look like, but we highly recommend using the one with multiple `auto` settings in it. 
+ +```json +{ + "fp16": { + "enabled": true, + "loss_scale": 0, + "loss_scale_window": 1000, + "initial_scale_power": 16, + "hysteresis": 2, + "min_loss_scale": 1 + }, + + "optimizer": { + "type": "AdamW", + "params": { + "lr": 3e-5, + "betas": [0.8, 0.999], + "eps": 1e-8, + "weight_decay": 3e-7 + } + }, + + "scheduler": { + "type": "WarmupLR", + "params": { + "warmup_min_lr": 0, + "warmup_max_lr": 3e-5, + "warmup_num_steps": 500 + } + }, + + "zero_optimization": { + "stage": 2, + "offload_optimizer": { + "device": "cpu", + "pin_memory": true + }, + "allgather_partitions": true, + "allgather_bucket_size": 2e8, + "overlap_comm": true, + "reduce_scatter": true, + "reduce_bucket_size": 2e8, + "contiguous_gradients": true + }, + + "steps_per_print": 2000, + "wall_clock_breakdown": false +} +``` + + + +#### ZeRO-3 Example + +Here is a full ZeRO-3 auto-configuration file `ds_config_zero3.json`: + + +```json +{ + "fp16": { + "enabled": "auto", + "loss_scale": 0, + "loss_scale_window": 1000, + "initial_scale_power": 16, + "hysteresis": 2, + "min_loss_scale": 1 + }, + + "optimizer": { + "type": "AdamW", + "params": { + "lr": "auto", + "betas": "auto", + "eps": "auto", + "weight_decay": "auto" + } + }, + + "scheduler": { + "type": "WarmupLR", + "params": { + "warmup_min_lr": "auto", + "warmup_max_lr": "auto", + "warmup_num_steps": "auto" + } + }, + + "zero_optimization": { + "stage": 3, + "offload_optimizer": { + "device": "cpu", + "pin_memory": true + }, + "offload_param": { + "device": "cpu", + "pin_memory": true + }, + "overlap_comm": true, + "contiguous_gradients": true, + "sub_group_size": 1e9, + "reduce_bucket_size": "auto", + "stage3_prefetch_bucket_size": "auto", + "stage3_param_persistence_threshold": "auto", + "stage3_max_live_parameters": 1e9, + "stage3_max_reuse_distance": 1e9, + "stage3_gather_fp16_weights_on_model_save": true + }, + + "gradient_accumulation_steps": "auto", + "gradient_clipping": "auto", + "steps_per_print": 2000, + "train_batch_size": "auto", + 
"train_micro_batch_size_per_gpu": "auto", + "wall_clock_breakdown": false +} +``` + +Here is a full ZeRO-3 all-enabled manually set configuration file. It is here mainly for you to see what the typical +values look like, but we highly recommend using the one with multiple `auto` settings in it. + +```json +{ + "fp16": { + "enabled": true, + "loss_scale": 0, + "loss_scale_window": 1000, + "initial_scale_power": 16, + "hysteresis": 2, + "min_loss_scale": 1 + }, + + "optimizer": { + "type": "AdamW", + "params": { + "lr": 3e-5, + "betas": [0.8, 0.999], + "eps": 1e-8, + "weight_decay": 3e-7 + } + }, + + "scheduler": { + "type": "WarmupLR", + "params": { + "warmup_min_lr": 0, + "warmup_max_lr": 3e-5, + "warmup_num_steps": 500 + } + }, + + "zero_optimization": { + "stage": 3, + "offload_optimizer": { + "device": "cpu", + "pin_memory": true + }, + "offload_param": { + "device": "cpu", + "pin_memory": true + }, + "overlap_comm": true, + "contiguous_gradients": true, + "sub_group_size": 1e9, + "reduce_bucket_size": 1e6, + "stage3_prefetch_bucket_size": 0.94e6, + "stage3_param_persistence_threshold": 1e4, + "stage3_max_live_parameters": 1e9, + "stage3_max_reuse_distance": 1e9, + "stage3_gather_fp16_weights_on_model_save": true + }, + + "steps_per_print": 2000, + "wall_clock_breakdown": false +} +``` + +### Optimizer and Scheduler + +As long as you don't enable `offload_optimizer` you can mix and match DeepSpeed and HuggingFace schedulers and +optimizers, with the exception of using the combination of HuggingFace scheduler and DeepSpeed optimizer: + +| Combos | HF Scheduler | DS Scheduler | +| HF Optimizer | Yes | Yes | +| DS Optimizer | No | Yes | + +It is possible to use a non-DeepSpeed optimizer when `offload_optimizer` is enabled, as long as it has both CPU and +GPU implementation (except LAMB). + + + + + + +#### Optimizer + + +DeepSpeed's main optimizers are Adam, AdamW, OneBitAdam, and Lamb. 
These have been thoroughly tested with ZeRO and are +thus recommended to be used. DeepSpeed can, however, import other optimizers from `torch`. The full documentation is [here](https://www.deepspeed.ai/docs/config-json/#optimizer-parameters). + +If you don't configure the `optimizer` entry in the configuration file, the [`Trainer`] will +automatically set it to `AdamW` and will use the supplied values or the defaults for the following command line +arguments: `--learning_rate`, `--adam_beta1`, `--adam_beta2`, `--adam_epsilon` and `--weight_decay`. + +Here is an example of the auto-configured `optimizer` entry for `AdamW`: + +```json +{ + "optimizer": { + "type": "AdamW", + "params": { + "lr": "auto", + "betas": "auto", + "eps": "auto", + "weight_decay": "auto" + } + } +} +``` + +Note that the command line arguments will set the values in the configuration file. This is so that there is one +definitive source of the values and to avoid hard to find errors when, for example, the learning rate is set to +different values in different places. Command line rules. The values that get overridden are: + +- `lr` with the value of `--learning_rate` +- `betas` with the value of `--adam_beta1 --adam_beta2` +- `eps` with the value of `--adam_epsilon` +- `weight_decay` with the value of `--weight_decay` + +Therefore please remember to tune the shared hyperparameters on the command line. + +You can also set the values explicitly: + +```json +{ + "optimizer": { + "type": "AdamW", + "params": { + "lr": 0.001, + "betas": [0.8, 0.999], + "eps": 1e-8, + "weight_decay": 3e-7 + } + } +} +``` + +But then you're on your own synchronizing the [`Trainer`] command line arguments and the DeepSpeed +configuration. + +If you want to use another optimizer which is not listed above, you will have to add the following to the top level configuration: + +```json +{ + "zero_allow_untested_optimizer": true +} +``` + +Similarly to `AdamW`, you can configure other officially supported optimizers.
Just remember that these may have different +config values, e.g. for Adam you will want `weight_decay` around `0.01`. + + + + + +#### Scheduler + +DeepSpeed supports `LRRangeTest`, `OneCycle`, `WarmupLR` and `WarmupDecayLR` learning rate schedulers. The full +documentation is [here](https://www.deepspeed.ai/docs/config-json/#scheduler-parameters). + +Here is where the schedulers overlap between 🤗 Transformers and DeepSpeed: + +- `WarmupLR` via `--lr_scheduler_type constant_with_warmup` +- `WarmupDecayLR` via `--lr_scheduler_type linear`. This is also the default value for `--lr_scheduler_type`, + therefore, if you don't configure the scheduler this is the scheduler that will get configured by default. + +If you don't configure the `scheduler` entry in the configuration file, the [`Trainer`] will use +the values of `--lr_scheduler_type`, `--learning_rate` and `--warmup_steps` or `--warmup_ratio` to configure a +🤗 Transformers version of it. + +Here is an example of the auto-configured `scheduler` entry for `WarmupLR`: + +```json +{ + "scheduler": { + "type": "WarmupLR", + "params": { + "warmup_min_lr": "auto", + "warmup_max_lr": "auto", + "warmup_num_steps": "auto" + } + } +} +``` + +Since *"auto"* is used the [`Trainer`] arguments will set the correct values in the configuration +file. This is so that there is one definitive source of the values and to avoid hard to find errors when, for example, +the learning rate is set to different values in different places. Command line rules. The values that get set are: + +- `warmup_min_lr` with the value of `0`. +- `warmup_max_lr` with the value of `--learning_rate`. +- `warmup_num_steps` with the value of `--warmup_steps` if provided. Otherwise will use `--warmup_ratio` + multiplied by the number of training steps and rounded up.
+- `total_num_steps` with either the value of `--max_steps` or if it is not provided, derived automatically at run + time based on the environment and the size of the dataset and other command line arguments (needed for + `WarmupDecayLR`). + +You can, of course, take over any or all of the configuration values and set those yourself: + +```json +{ + "scheduler": { + "type": "WarmupLR", + "params": { + "warmup_min_lr": 0, + "warmup_max_lr": 0.001, + "warmup_num_steps": 1000 + } + } +} +``` + +But then you're on your own synchronizing the [`Trainer`] command line arguments and the DeepSpeed +configuration. + +For example, for `WarmupDecayLR`, you can use the following entry: + +```json +{ + "scheduler": { + "type": "WarmupDecayLR", + "params": { + "last_batch_iteration": -1, + "total_num_steps": "auto", + "warmup_min_lr": "auto", + "warmup_max_lr": "auto", + "warmup_num_steps": "auto" + } + } +} +``` + +and `total_num_steps`, `warmup_min_lr`, `warmup_max_lr` and `warmup_num_steps` will be set at loading time. + + + + + + +### fp32 Precision + +DeepSpeed supports the full fp32 and the fp16 mixed precision. + +Because of the much reduced memory needs and faster speed one gets with the fp16 mixed precision, the only time you +will want to not use it is when the model you're using doesn't behave well under this training mode. Typically this +happens when the model wasn't pretrained in the fp16 mixed precision (e.g. often this happens with bf16-pretrained +models). Such models may overflow or underflow leading to `NaN` loss. If this is your case then you will want to use +the full fp32 mode, by explicitly disabling the otherwise default fp16 mixed precision mode with: + +```json +{ + "fp16": { + "enabled": false + } +} +``` + +If you're using an Ampere-architecture based GPU, pytorch version 1.7 and higher will automatically switch to using +the much more efficient tf32 format for some operations, but the results will still be in fp32.
For details and +benchmarks, please see [TensorFloat-32(TF32) on Ampere devices](https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices). The document includes +instructions on how to disable this automatic conversion if for some reason you prefer not to use it. + + + + + + +### Automatic Mixed Precision + +You can use automatic mixed precision with either a pytorch-like AMP way or the apex-like way: + +To configure pytorch AMP-like mode set: + +```json +{ + "fp16": { + "enabled": "auto", + "loss_scale": 0, + "loss_scale_window": 1000, + "initial_scale_power": 16, + "hysteresis": 2, + "min_loss_scale": 1 + } +} +``` + +and the [`Trainer`] will automatically enable or disable it based on the value of +`args.fp16_backend`. The rest of the config values are up to you. + +This mode gets enabled when `--fp16 --fp16_backend amp` command line args are passed. + +You can also enable/disable this mode explicitly: + +```json +{ + "fp16": { + "enabled": true, + "loss_scale": 0, + "loss_scale_window": 1000, + "initial_scale_power": 16, + "hysteresis": 2, + "min_loss_scale": 1 + } +} +``` + +But then you're on your own synchronizing the [`Trainer`] command line arguments and the DeepSpeed +configuration. + +Here is the [documentation](https://www.deepspeed.ai/docs/config-json/#fp16-training-options). + +To configure apex AMP-like mode set: + +```json +{ + "amp": { + "enabled": "auto", + "opt_level": "auto" + } +} +``` + +and the [`Trainer`] will automatically configure it based on the values of `args.fp16_backend` and +`args.fp16_opt_level`. + +This mode gets enabled when `--fp16 --fp16_backend apex --fp16_opt_level O1` command line args are passed. + +You can also configure this mode explicitly: + +```json +{ + "amp": { + "enabled": true, + "opt_level": "O1" + } +} +``` + +But then you're on your own synchronizing the [`Trainer`] command line arguments and the DeepSpeed +configuration.
+ +Here is the [documentation](https://www.deepspeed.ai/docs/config-json/#automatic-mixed-precision-amp-training-options). + + + + + +### Batch Size + +To configure batch size, use: + +```json +{ + "train_batch_size": "auto", + "train_micro_batch_size_per_gpu": "auto" +} +``` + +and the [`Trainer`] will automatically set `train_micro_batch_size_per_gpu` to the value of +`args.per_device_train_batch_size` and `train_batch_size` to `args.world_size * args.per_device_train_batch_size * args.gradient_accumulation_steps`. + +You can also set the values explicitly: + +```json +{ + "train_batch_size": 12, + "train_micro_batch_size_per_gpu": 4 +} +``` + +But then you're on your own synchronizing the [`Trainer`] command line arguments and the DeepSpeed +configuration. + + + + + +### Gradient Accumulation + +To configure gradient accumulation set: + +```json +{ + "gradient_accumulation_steps": "auto" +} +``` + +and the [`Trainer`] will automatically set it to the value of `args.gradient_accumulation_steps`. + +You can also set the value explicitly: + +```json +{ + "gradient_accumulation_steps": 3 +} +``` + +But then you're on your own synchronizing the [`Trainer`] command line arguments and the DeepSpeed +configuration. + + + + + +### Gradient Clipping + +To configure gradient clipping set: + +```json +{ + "gradient_clipping": "auto" +} +``` + +and the [`Trainer`] will automatically set it to the value of `args.max_grad_norm`. + +You can also set the value explicitly: + +```json +{ + "gradient_clipping": 1.0 +} +``` + +But then you're on your own synchronizing the [`Trainer`] command line arguments and the DeepSpeed +configuration. + + + + + +### Getting The Model Weights Out + +As long as you continue training and resuming using DeepSpeed you don't need to worry about anything. DeepSpeed stores +fp32 master weights in its custom checkpoint optimizer files, which are `global_step*/*optim_states.pt` (this is a glob +pattern), and are saved under the normal checkpoint.
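To see which optimizer state files a given checkpoint folder actually contains, the glob pattern mentioned above can be applied directly. This is a minimal sketch using only the Python standard library; the helper name `find_optimizer_state_files` is hypothetical and not part of DeepSpeed or 🤗 Transformers.

```python
from pathlib import Path


def find_optimizer_state_files(checkpoint_dir):
    """Return the files matching `global_step*/*optim_states.pt`, sorted by path.

    Hypothetical helper for illustration only. `checkpoint_dir` is the normal
    checkpoint folder, e.g. `output_dir/checkpoint-1`.
    """
    return sorted(Path(checkpoint_dir).glob("global_step*/*optim_states.pt"))
```

For example, `find_optimizer_state_files("output_dir/checkpoint-1")` returns an empty list if the folder holds no DeepSpeed optimizer files.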
+ +**FP16 Weights:** + +When a model is saved under ZeRO-2, you end up having the normal `pytorch_model.bin` file with the model weights, but +they are only the fp16 version of the weights. + +Under ZeRO-3, things are much more complicated, since the model weights are partitioned out over multiple GPUs, +therefore `"stage3_gather_fp16_weights_on_model_save": true` is required to get the `Trainer` to save the fp16 +version of the weights. If this setting is `false`, `pytorch_model.bin` won't be created. This is because by default DeepSpeed's `state_dict` contains a placeholder and not the real weights. If we were to save this `state_dict` it +won't be possible to load it back. + + +```json +{ + "zero_optimization": { + "stage3_gather_fp16_weights_on_model_save": true + } +} +``` + +**FP32 Weights:** + +While the fp16 weights are fine for resuming training, if you finished finetuning your model and want to upload it to +the [models hub](https://huggingface.co/models) or pass it to someone else you most likely will want to get the fp32 +weights. This ideally shouldn't be done during training since this is a process that requires a lot of memory, and +therefore best to be performed offline after the training is complete. But if desired and you have plenty of free CPU +memory it can be done in the same training script. The following sections will discuss both approaches. + + +**Live FP32 Weights Recovery:** + +This approach may not work if your model is large and you have little free CPU memory left at the end of the training.
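As a rough back-of-the-envelope check before attempting a live recovery, the size of the fp32 weights alone can be estimated from the parameter count. This sketch assumes 4 bytes per fp32 parameter and ignores any additional working memory, so treat the result as a lower bound; the function name is hypothetical.

```python
def fp32_weights_gib(num_parameters):
    """Approximate size of the fp32 weights in GiB, at 4 bytes per parameter."""
    bytes_per_fp32 = 4
    return num_parameters * bytes_per_fp32 / 2**30


# Example: a model with roughly 3 billion parameters
print(f"{fp32_weights_gib(3_000_000_000):.1f} GiB")
```

If the estimate is close to the free CPU memory you have, the offline approach is likely the safer choice.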
+ +If you have saved at least one checkpoint, and you want to use the latest one, you can do the following: + +```python +from transformers.trainer_utils import get_last_checkpoint +from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint +checkpoint_dir = get_last_checkpoint(trainer.args.output_dir) +fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir) +``` + +If you're using the `--load_best_model_at_end` [`TrainingArguments`] argument (to track the best +checkpoint), then you can finish the training by first saving the final model explicitly and then do the same as above: + +```python +import os +from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint +checkpoint_dir = os.path.join(trainer.args.output_dir, "checkpoint-final") +trainer.deepspeed.save_checkpoint(checkpoint_dir) +fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir) +``` + + + +Note that once `load_state_dict_from_zero_checkpoint` has been run, the `model` will no longer be usable in the +DeepSpeed context of the same application, i.e. you will need to re-initialize the deepspeed engine, since +`model.load_state_dict(state_dict)` will remove all the DeepSpeed magic from it. So do this only at the very end +of the training. + + + +Of course, you don't have to use [`Trainer`] and you can adjust the examples above to your own +trainer.
+ +If for some reason you want more refinement, you can also extract the fp32 `state_dict` of the weights and apply +these yourself as is shown in the following example: + +```python +from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint +state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir) # already on cpu +model = model.cpu() +model.load_state_dict(state_dict) +``` + +**Offline FP32 Weights Recovery:** + +DeepSpeed creates a special conversion script `zero_to_fp32.py` which it places in the top-level of the checkpoint +folder. Using this script you can extract the weights at any point. The script is standalone and you no longer need to +have the configuration file or a `Trainer` to do the extraction. + +Let's say your checkpoint folder looks like this: + +```bash +$ ls -l output_dir/checkpoint-1/ +-rw-rw-r-- 1 stas stas 1.4K Mar 27 20:42 config.json +drwxrwxr-x 2 stas stas 4.0K Mar 25 19:52 global_step1/ +-rw-rw-r-- 1 stas stas 12 Mar 27 13:16 latest +-rw-rw-r-- 1 stas stas 827K Mar 27 20:42 optimizer.pt +-rw-rw-r-- 1 stas stas 231M Mar 27 20:42 pytorch_model.bin +-rw-rw-r-- 1 stas stas 623 Mar 27 20:42 scheduler.pt +-rw-rw-r-- 1 stas stas 1.8K Mar 27 20:42 special_tokens_map.json +-rw-rw-r-- 1 stas stas 774K Mar 27 20:42 spiece.model +-rw-rw-r-- 1 stas stas 1.9K Mar 27 20:42 tokenizer_config.json +-rw-rw-r-- 1 stas stas 339 Mar 27 20:42 trainer_state.json +-rw-rw-r-- 1 stas stas 2.3K Mar 27 20:42 training_args.bin +-rwxrw-r-- 1 stas stas 5.5K Mar 27 13:16 zero_to_fp32.py* +``` + +In this example there is just one DeepSpeed checkpoint sub-folder *global_step1*. Therefore to reconstruct the fp32 +weights just run: + +```bash +python zero_to_fp32.py . pytorch_model.bin +``` + +This is it. `pytorch_model.bin` will now contain the full fp32 model weights consolidated from multiple GPUs. + +The script will automatically be able to handle either a ZeRO-2 or ZeRO-3 checkpoint. 
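The listing above shows a small `latest` file next to the `global_step1/` sub-folder. The following is a minimal sketch of how such a file could be used to locate the DeepSpeed sub-folder, assuming `latest` holds the sub-folder name as plain text. The helper name is hypothetical and this is not the script's actual implementation.

```python
from pathlib import Path


def resolve_deepspeed_subfolder(checkpoint_dir):
    """Return the sub-folder named in `latest`, or None if the file is missing.

    Assumption: `latest` is a plain-text file containing only the sub-folder
    name, e.g. `global_step1`.
    """
    checkpoint_dir = Path(checkpoint_dir)
    latest_file = checkpoint_dir / "latest"
    if not latest_file.is_file():
        return None
    tag = latest_file.read_text().strip()
    return checkpoint_dir / tag
```

For the example folder, `resolve_deepspeed_subfolder("output_dir/checkpoint-1")` would point at `output_dir/checkpoint-1/global_step1`.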
+ +`python zero_to_fp32.py -h` will give you usage details. + +The script will auto-discover the deepspeed sub-folder using the contents of the file `latest`, which in the current +example will contain `global_step1`. + +Note: currently the script requires 2x general RAM of the final fp32 model weights. + + +### ZeRO-3 and Infinity Nuances + +ZeRO-3 is quite different from ZeRO-2 because of its param sharding feature. + +ZeRO-Infinity further extends ZeRO-3 to support NVMe memory and multiple other speed and scalability improvements. + +While all the efforts were made for things to just work without needing any special changes to your models, in certain +circumstances you may find the following information to be needed. + + + +#### Constructing Massive Models + +DeepSpeed/ZeRO-3 can handle models with Trillions of parameters which may not fit onto the existing RAM. In such cases, +but also if you want the initialization to happen much faster, initialize the model using *deepspeed.zero.Init()* +context manager (which is also a function decorator), like so: + +```python +from transformers import T5ForConditionalGeneration, T5Config +import deepspeed +with deepspeed.zero.Init(): + config = T5Config.from_pretrained("t5-small") + model = T5ForConditionalGeneration(config) +``` + +As you can see this gives you a randomly initialized model. + +If you want to use a pretrained model, `model_class.from_pretrained` will activate this feature as long as +`is_deepspeed_zero3_enabled()` returns `True`, which currently is setup by the +[`TrainingArguments`] object if the passed DeepSpeed configuration file contains ZeRO-3 config +section. Thus you must create the [`TrainingArguments`] object **before** calling +`from_pretrained`. 
Here is an example of a possible sequence: + +```python +from transformers import AutoModel, Trainer, TrainingArguments +training_args = TrainingArguments(..., deepspeed=ds_config) +model = AutoModel.from_pretrained("t5-small") +trainer = Trainer(model=model, args=training_args, ...) +``` + +If you're using the official example scripts and your command line arguments include `--deepspeed ds_config.json` +with ZeRO-3 config enabled, then everything is already done for you, since this is how example scripts are written. + +Note: If the fp16 weights of the model can't fit onto the memory of a single GPU this feature must be used. + +For full details on this method and other related features please refer to [Constructing Massive Models](https://deepspeed.readthedocs.io/en/latest/zero3.html#constructing-massive-models). + +Also when loading fp16-pretrained models, you will want to tell `from_pretrained` to use +`torch_dtype=torch.float16`. For details, please, see [from_pretrained-torch-dtype](#from_pretrained-torch-dtype). + + +#### Gathering Parameters + +Under ZeRO-3 on multiple GPUs no single GPU has all the parameters unless it's the parameters for the currently +executing layer. So if you need to access all parameters from all layers at once there is a specific method to do it. +Most likely you won't need it, but if you do please refer to [Gathering Parameters](https://deepspeed.readthedocs.io/en/latest/zero3.html#manual-parameter-coordination) + +We do however use it internally in several places, one such example is when loading pretrained model weights in +`from_pretrained`. We load one layer at a time and immediately partition it to all participating GPUs, as for very +large models it won't be possible to load it on one GPU and then spread it out to multiple GPUs, due to memory +limitations. 
+ +Also under ZeRO-3, if you write your own code and run into a model parameter weight that looks like: + +```python +tensor([1.], device='cuda:0', dtype=torch.float16, requires_grad=True) +``` + +note the `tensor([1.])`, or if you get an error where it says the parameter is of size `1`, instead of some much +larger multi-dimensional shape, this means that the parameter is partitioned and what you see is a ZeRO-3 placeholder. + + + + + + +### ZeRO Inference + +ZeRO Inference uses the same config as ZeRO-3 Training. You just don't need the optimizer and scheduler sections. In +fact you can leave these in the config file if you want to share the same one with the training. They will just be +ignored. + +Otherwise you just need to pass the usual [`TrainingArguments`] arguments. For example: + +```bash +deepspeed --num_gpus=2 your_program.py --do_eval --deepspeed ds_config.json +``` + +The only important thing is that you need to use a ZeRO-3 configuration, since ZeRO-2 provides no benefit whatsoever +for the inference as only ZeRO-3 performs sharding of parameters, whereas ZeRO-2 shards only gradients and optimizer states. + +Here is an example of running `run_translation.py` under DeepSpeed deploying all available GPUs: + +```bash +deepspeed examples/pytorch/translation/run_translation.py \ +--deepspeed tests/deepspeed/ds_config_zero3.json \ +--model_name_or_path t5-small --output_dir output_dir \ +--do_eval --max_eval_samples 50 --warmup_steps 50 \ +--max_source_length 128 --val_max_target_length 128 \ +--overwrite_output_dir --per_device_eval_batch_size 4 \ +--predict_with_generate --dataset_config "ro-en" --fp16 \ +--source_lang en --target_lang ro --dataset_name wmt16 \ +--source_prefix "translate English to Romanian: " +``` + +Since for inference there is no need for additional large memory used by the optimizer states and the gradients you +should be able to fit much larger batches and/or sequence length onto the same hardware.
+ + +Additionally DeepSpeed is currently developing a related product called Deepspeed-Inference which has no relationship +to the ZeRO technology, but instead uses tensor parallelism to scale models that can't fit onto a single GPU. This is a +work in progress and we will provide the integration once that product is complete. + + +### Filing Issues + +Here is how to file an issue so that we could quickly get to the bottom of the issue and help you to unblock your work. + +In your report please always include: + +1. the full Deepspeed config file in the report + +2. either the command line arguments if you were using the [`Trainer`] or + [`TrainingArguments`] arguments if you were scripting the Trainer setup yourself. Please do not + dump the [`TrainingArguments`] as it has dozens of entries that are irrelevant. + +3. Output of: + + ```bash + python -c 'import torch; print(f"torch: {torch.__version__}")' + python -c 'import transformers; print(f"transformers: {transformers.__version__}")' + python -c 'import deepspeed; print(f"deepspeed: {deepspeed.__version__}")' + ``` + +4. If possible include a link to a Google Colab notebook that we can reproduce the problem with. You can use this + [notebook](https://github.com/stas00/porting/blob/master/transformers/deepspeed/DeepSpeed_on_colab_CLI.ipynb) as + a starting point. + +5. Unless it's impossible please always use a standard dataset that we can use and not something custom. + +6. If possible try to use one of the existing [examples](https://github.com/huggingface/transformers/tree/master/examples/pytorch) to reproduce the problem with. + +Things to consider: + +- Deepspeed is often not the cause of the problem. + + Some of the filed issues proved to be Deepspeed-unrelated. That is once Deepspeed was removed from the setup, the + problem was still there. 
+ + Therefore, if it's not absolutely obvious it's a DeepSpeed-related problem, as in you can see that there is an + exception and you can see that DeepSpeed modules are involved, first re-test your setup without DeepSpeed in it. + And only if the problem persists then do mention DeepSpeed and supply all the required details. + +- If it's clear to you that the issue is in the DeepSpeed core and not the integration part, please file the Issue + directly with [Deepspeed](https://github.com/microsoft/DeepSpeed/). If you aren't sure, please do not worry, + either Issue tracker will do, we will figure it out once you have posted it and redirect you to another Issue tracker if + need be. + + + +### Troubleshooting + +- `deepspeed` process gets killed at startup without a traceback + +If the `deepspeed` process gets killed at launch time without a traceback, that usually means that the program tried +to allocate more CPU memory than your system has or your process is allowed to allocate and the OS kernel killed that +process. This is because your configuration file most likely has either `offload_optimizer` or `offload_param` or +both configured to offload to `cpu`. If you have NVMe, experiment with offloading to NVMe if you're running under +ZeRO-3. + +Work is being done to enable estimating how much memory is needed for a specific model: [PR](https://github.com/microsoft/DeepSpeed/pull/965). + + + + + + +### Notes + +- DeepSpeed works with the PyTorch [`Trainer`] but not TF [`TFTrainer`]. +- While DeepSpeed has a pip installable PyPI package, it is highly recommended that it gets installed from [source](https://github.com/microsoft/deepspeed#installation) to best match your hardware and also if you need to enable + certain features, like 1-bit Adam, which aren't available in the pypi distribution.
+- You don't have to use the [`Trainer`] to use DeepSpeed with 🤗 Transformers - you can use any model + with your own trainer, and you will have to adapt the latter according to [the DeepSpeed integration instructions](https://www.deepspeed.ai/getting-started/#writing-deepspeed-models). + + + + + + +## Non-Trainer Deepspeed Integration + +The [`~integrations.HfDeepSpeedConfig`] is used to integrate Deepspeed into the 🤗 Transformers core +functionality, when [`Trainer`] is not used. + +When using [`Trainer`] everything is automatically taken care of. + +When not using [`Trainer`], to efficiently deploy DeepSpeed stage 3, you must instantiate the +[`~integrations.HfDeepSpeedConfig`] object before instantiating the model. + +For example for a pretrained model: + +```python +from transformers.deepspeed import HfDeepSpeedConfig +from transformers import AutoModel +import deepspeed + +ds_config = { ... } # deepspeed config object or path to the file +# must run before instantiating the model +dschf = HfDeepSpeedConfig(ds_config) # keep this object alive +model = AutoModel.from_pretrained("gpt2") +engine = deepspeed.initialize(model=model, config_params=ds_config, ...) +``` + +or for non-pretrained model: + +```python +from transformers.deepspeed import HfDeepSpeedConfig +from transformers import AutoModel, AutoConfig +import deepspeed + +ds_config = { ... } # deepspeed config object or path to the file +# must run before instantiating the model +dschf = HfDeepSpeedConfig(ds_config) # keep this object alive +config = AutoConfig.from_pretrained("gpt2") +model = AutoModel.from_config(config) +engine = deepspeed.initialize(model=model, config_params=ds_config, ...)
+``` + +## HfDeepSpeedConfig + +[[autodoc]] deepspeed.HfDeepSpeedConfig + - all + +## Main DeepSpeed Resources + +- [Project's github](https://github.com/microsoft/deepspeed) +- [Usage docs](https://www.deepspeed.ai/getting-started/) +- [API docs](https://deepspeed.readthedocs.io/en/latest/index.html) +- [Blog posts](https://www.microsoft.com/en-us/research/search/?q=deepspeed) + +Papers: + +- [ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](https://arxiv.org/abs/1910.02054) +- [ZeRO-Offload: Democratizing Billion-Scale Model Training](https://arxiv.org/abs/2101.06840) +- [ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning](https://arxiv.org/abs/2104.07857) + +Finally, please, remember that, HuggingFace [`Trainer`] only integrates DeepSpeed, therefore if you +have any problems or questions with regards to DeepSpeed usage, please, file an issue with [DeepSpeed GitHub](https://github.com/microsoft/DeepSpeed/issues). diff --git a/docs/source/main_classes/deepspeed.rst b/docs/source/main_classes/deepspeed.rst deleted file mode 100644 index 4b0b8c5bdb..0000000000 --- a/docs/source/main_classes/deepspeed.rst +++ /dev/null @@ -1,1833 +0,0 @@ -.. - Copyright 2020 The HuggingFace Team. All rights reserved. - - Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with - the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on - an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the - specific language governing permissions and limitations under the License. 
- - -DeepSpeed Integration ------------------------------------------------------------------------------------------------------------------------ - - -`DeepSpeed `__ implements everything described in the `ZeRO paper -`__. Currently it provides full support for: - -1. Optimizer state partitioning (ZeRO stage 1) -2. Gradient partitioning (ZeRO stage 2) -3. Parameter partitioning (ZeRO stage 3) -4. Custom mixed precision training handling -5. A range of fast CUDA-extension-based optimizers -6. ZeRO-Offload to CPU and NVMe - -ZeRO-Offload has its own dedicated paper: `ZeRO-Offload: Democratizing Billion-Scale Model Training -`__. And NVMe-support is described in the paper `ZeRO-Infinity: Breaking the GPU -Memory Wall for Extreme Scale Deep Learning `__. - -DeepSpeed ZeRO-2 is primarily used only for training, as its features are of no use to inference. - -DeepSpeed ZeRO-3 can be used for inference as well, since it allows huge models to be loaded on multiple GPUs, which -won't be possible on a single GPU. - - - -🤗 Transformers integrates `DeepSpeed `__ via 2 options: - -1. Integration of the core DeepSpeed features via :class:`~transformers.Trainer`. This is everything done for you type - of integration - just supply your custom config file or use our template and you have nothing else to do. Most of - this document is focused on this feature. -2. If you don't use :class:`~transformers.Trainer` and want to use your own Trainer where you integrated DeepSpeed - yourself, core functionality functions like ``from_pretrained`` and ``from_config`` include integration of essential - parts of DeepSpeed like ``zero.Init`` for ZeRO stage 3 and higher. To tap into this feature read the docs on - :ref:`deepspeed-non-trainer-integration`. - -What is integrated: - -Training: - -1. DeepSpeed ZeRO training supports the full ZeRO stages 1, 2 and 3 with ZeRO-Infinity (CPU and NVME offload). - -Inference: - -1. DeepSpeed ZeRO Inference supports ZeRO stage 3 with ZeRO-Infinity. 
It uses the same ZeRO protocol as training, but - it doesn't use an optimizer and a lr scheduler and only stage 3 is relevant. For more details see: - :ref:`deepspeed-zero-inference`. - -There is also DeepSpeed Inference - this is a totally different technology which uses Tensor Parallelism instead of -ZeRO (coming soon). - - - -.. _deepspeed-trainer-integration: - - -Trainer Deepspeed Integration -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - - -.. _deepspeed-installation: - -Installation -======================================================================================================================= - -Install the library via pypi: - -.. code-block:: bash - - pip install deepspeed - -or via ``transformers``' ``extras``: - -.. code-block:: bash - - pip install transformers[deepspeed] - -or find more details on `the DeepSpeed's GitHub page `__ and -`advanced install `__. - -If you're still struggling with the build, first make sure to read :ref:`zero-install-notes`. - -If you don't prebuild the extensions and rely on them to be built at run time and you tried all of the above solutions -to no avail, the next thing to try is to pre-build the modules before installing them. - -To make a local build for DeepSpeed: - -.. code-block:: bash - - git clone https://github.com/microsoft/DeepSpeed/ - cd DeepSpeed - rm -rf build - TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 pip install . \ - --global-option="build_ext" --global-option="-j8" --no-cache -v \ - --disable-pip-version-check 2>&1 | tee build.log - -If you intend to use NVMe offload you will need to also include ``DS_BUILD_AIO=1`` in the instructions above (and also -install `libaio-dev` system-wide). - -Edit ``TORCH_CUDA_ARCH_LIST`` to insert the code for the architectures of the GPU cards you intend to use. Assuming all -your cards are the same you can get the arch via: - -.. 
code-block:: bash - - CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.get_device_capability())" - -So if you get ``8, 6``, then use ``TORCH_CUDA_ARCH_LIST="8.6"``. If you have multiple different cards, you can list all -of them like so ``TORCH_CUDA_ARCH_LIST="6.1;8.6"`` - -If you need to use the same setup on multiple machines, make a binary wheel: - -.. code-block:: bash - - git clone https://github.com/microsoft/DeepSpeed/ - cd DeepSpeed - rm -rf build - TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 \ - python setup.py build_ext -j8 bdist_wheel - -it will generate something like ``dist/deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl`` which now you can install -as ``pip install deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl`` locally or on any other machine. - -Again, remember to ensure to adjust ``TORCH_CUDA_ARCH_LIST`` to the target architectures. - -You can find the complete list of NVIDIA GPUs and their corresponding **Compute Capabilities** (same as arch in this -context) `here `__. - -You can check the archs pytorch was built with using: - -.. code-block:: bash - - python -c "import torch; print(torch.cuda.get_arch_list())" - -Here is how to find out the arch for one of the installed GPU. For example, for GPU 0: - -.. code-block:: bash - - CUDA_VISIBLE_DEVICES=0 python -c "import torch; \ - print(torch.cuda.get_device_properties(torch.device('cuda')))" - -If the output is: - -.. code-block:: bash - - _CudaDeviceProperties(name='GeForce RTX 3090', major=8, minor=6, total_memory=24268MB, multi_processor_count=82) - -then you know that this card's arch is ``8.6``. - -You can also leave ``TORCH_CUDA_ARCH_LIST`` out completely and then the build program will automatically query the -architecture of the GPUs the build is made on. This may or may not match the GPUs on the target machines, that's why -it's best to specify the desired archs explicitly. 
- -If after trying everything suggested you still encounter build issues, please, proceed with the GitHub Issue of -`Deepspeed `__, - - - -.. _deepspeed-multi-gpu: - -Deployment with multiple GPUs -======================================================================================================================= - -To deploy this feature with multiple GPUs adjust the :class:`~transformers.Trainer` command line arguments as -following: - -1. replace ``python -m torch.distributed.launch`` with ``deepspeed``. -2. add a new argument ``--deepspeed ds_config.json``, where ``ds_config.json`` is the DeepSpeed configuration file as - documented `here `__. The file naming is up to you. - -Therefore, if your original command line looked as following: - -.. code-block:: bash - - python -m torch.distributed.launch --nproc_per_node=2 your_program.py - -Now it should be: - -.. code-block:: bash - - deepspeed --num_gpus=2 your_program.py --deepspeed ds_config.json - -Unlike, ``torch.distributed.launch`` where you have to specify how many GPUs to use with ``--nproc_per_node``, with the -``deepspeed`` launcher you don't have to use the corresponding ``--num_gpus`` if you want all of your GPUs used. The -full details on how to configure various nodes and GPUs can be found `here -`__. - -In fact, you can continue using ``-m torch.distributed.launch`` with DeepSpeed as long as you don't need to use -``deepspeed`` launcher-specific arguments. Typically if you don't need a multi-node setup you're not required to use -the ``deepspeed`` launcher. But since in the DeepSpeed documentation it'll be used everywhere, for consistency we will -use it here as well. - -Here is an example of running ``run_translation.py`` under DeepSpeed deploying all available GPUs: - -.. 
code-block:: bash - - deepspeed examples/pytorch/translation/run_translation.py \ - --deepspeed tests/deepspeed/ds_config_zero3.json \ - --model_name_or_path t5-small --per_device_train_batch_size 1 \ - --output_dir output_dir --overwrite_output_dir --fp16 \ - --do_train --max_train_samples 500 --num_train_epochs 1 \ - --dataset_name wmt16 --dataset_config "ro-en" \ - --source_lang en --target_lang ro - - -Note that in the DeepSpeed documentation you are likely to see ``--deepspeed --deepspeed_config ds_config.json`` - i.e. -two DeepSpeed-related arguments, but for the sake of simplicity, and since there are already so many arguments to deal -with, we combined the two into a single argument. - -For some practical usage examples, please, see this `post -`__. - - - -.. _deepspeed-one-gpu: - -Deployment with one GPU -======================================================================================================================= - -To deploy DeepSpeed with one GPU adjust the :class:`~transformers.Trainer` command line arguments as following: - -.. code-block:: bash - - deepspeed --num_gpus=1 examples/pytorch/translation/run_translation.py \ - --deepspeed tests/deepspeed/ds_config_zero2.json \ - --model_name_or_path t5-small --per_device_train_batch_size 1 \ - --output_dir output_dir --overwrite_output_dir --fp16 \ - --do_train --max_train_samples 500 --num_train_epochs 1 \ - --dataset_name wmt16 --dataset_config "ro-en" \ - --source_lang en --target_lang ro - -This is almost the same as with multiple-GPUs, but here we tell DeepSpeed explicitly to use just one GPU via -``--num_gpus=1``. By default, DeepSpeed deploys all GPUs it can see on the given node. If you have only 1 GPU to start -with, then you don't need this argument. The following `documentation -`__ discusses the launcher options. - -Why would you want to use DeepSpeed with just one GPU? - -1. 
It has a ZeRO-offload feature which can delegate some computations and memory to the host's CPU and RAM, and thus - leave more GPU resources for model's needs - e.g. larger batch size, or enabling a fitting of a very big model which - normally won't fit. -2. It provides a smart GPU memory management system, that minimizes memory fragmentation, which again allows you to fit - bigger models and data batches. - -While we are going to discuss the configuration in details next, the key to getting a huge improvement on a single GPU -with DeepSpeed is to have at least the following configuration in the configuration file: - -.. code-block:: json - - { - "zero_optimization": { - "stage": 2, - "offload_optimizer": { - "device": "cpu", - "pin_memory": true - }, - "allgather_partitions": true, - "allgather_bucket_size": 2e8, - "reduce_scatter": true, - "reduce_bucket_size": 2e8, - "overlap_comm": true, - "contiguous_gradients": true - } - } - -which enables optimizer offload and some other important features. You may experiment with the buffer sizes, you will -find more details in the discussion below. - -For a practical usage example of this type of deployment, please, see this `post -`__. - -You may also try the ZeRO-3 with CPU and NVMe offload as explained further in this document. - - - -Notes: - -- if you need to run on a specific GPU, which is different from GPU 0, you can't use ``CUDA_VISIBLE_DEVICES`` to limit - the visible scope of available GPUs. Instead, you have to use the following syntax: - - .. code-block:: bash - - deepspeed --include localhost:1 examples/pytorch/translation/run_translation.py ... - - In this example, we tell DeepSpeed to use GPU 1 (second gpu). - - - -.. 
_deepspeed-notebook: - -Deployment in Notebooks -======================================================================================================================= - -The problem with running notebook cells as a script is that there is no normal ``deepspeed`` launcher to rely on, so -under certain setups we have to emulate it. - -If you're using only 1 GPU, here is how you'd have to adjust your training code in the notebook to use DeepSpeed. - -.. code-block:: python - - # DeepSpeed requires a distributed environment even when only one process is used. - # This emulates a launcher in the notebook - import os - os.environ['MASTER_ADDR'] = 'localhost' - os.environ['MASTER_PORT'] = '9994' # modify if RuntimeError: Address already in use - os.environ['RANK'] = "0" - os.environ['LOCAL_RANK'] = "0" - os.environ['WORLD_SIZE'] = "1" - - # Now proceed as normal, plus pass the deepspeed config file - training_args = TrainingArguments(..., deepspeed="ds_config_zero3.json") - trainer = Trainer(...) - trainer.train() - -Note: ``...`` stands for the normal arguments that you'd pass to the functions. - -If you want to use more than 1 GPU, you must use a multi-process environment for DeepSpeed to work. That is, you have -to use the launcher for that purpose and this cannot be accomplished by emulating the distributed environment presented -at the beginning of this section. - -If you want to create the config file on the fly in the notebook in the current directory, you could have a dedicated -cell with: - -.. 
code-block:: python - - %%bash - cat <<'EOT' > ds_config_zero3.json - { - "fp16": { - "enabled": "auto", - "loss_scale": 0, - "loss_scale_window": 1000, - "initial_scale_power": 16, - "hysteresis": 2, - "min_loss_scale": 1 - }, - - "optimizer": { - "type": "AdamW", - "params": { - "lr": "auto", - "betas": "auto", - "eps": "auto", - "weight_decay": "auto" - } - }, - - "scheduler": { - "type": "WarmupLR", - "params": { - "warmup_min_lr": "auto", - "warmup_max_lr": "auto", - "warmup_num_steps": "auto" - } - }, - - "zero_optimization": { - "stage": 3, - "offload_optimizer": { - "device": "cpu", - "pin_memory": true - }, - "offload_param": { - "device": "cpu", - "pin_memory": true - }, - "overlap_comm": true, - "contiguous_gradients": true, - "sub_group_size": 1e9, - "reduce_bucket_size": "auto", - "stage3_prefetch_bucket_size": "auto", - "stage3_param_persistence_threshold": "auto", - "stage3_max_live_parameters": 1e9, - "stage3_max_reuse_distance": 1e9, - "stage3_gather_fp16_weights_on_model_save": true - }, - - "gradient_accumulation_steps": "auto", - "gradient_clipping": "auto", - "steps_per_print": 2000, - "train_batch_size": "auto", - "train_micro_batch_size_per_gpu": "auto", - "wall_clock_breakdown": false - } - EOT - - -If the training script is in a normal file and not in the notebook cells, you can launch ``deepspeed`` normally via -shell from a cell. For example, to use ``run_translation.py`` you would launch it with: - -.. code-block:: - - !git clone https://github.com/huggingface/transformers - !cd transformers; deepspeed examples/pytorch/translation/run_translation.py ... - -or with ``%%bash`` magic, where you can write a multi-line code for the shell program to run: - -.. code-block:: - - %%bash - - git clone https://github.com/huggingface/transformers - cd transformers - deepspeed examples/pytorch/translation/run_translation.py ... - -In such case you don't need any of the code presented at the beginning of this section. 
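When you do need the single-GPU emulation from the beginning of this section, it can be wrapped in a small helper so that the port is easy to change. This is only a sketch: ``emulate_single_gpu_launcher`` is a hypothetical name, not a function provided by 🤗 Transformers or DeepSpeed, and it sets only the environment variables listed earlier.

```python
import os


def emulate_single_gpu_launcher(port="9994"):
    """Set the env vars that the ``deepspeed`` launcher would normally provide.

    Hypothetical helper for a single-process, single-GPU notebook setup.
    """
    env = {
        "MASTER_ADDR": "localhost",
        "MASTER_PORT": str(port),  # change if "Address already in use"
        "RANK": "0",
        "LOCAL_RANK": "0",
        "WORLD_SIZE": "1",
    }
    os.environ.update(env)
    return env


# Call this once before creating TrainingArguments(..., deepspeed=...)
launcher_env = emulate_single_gpu_launcher(port=9995)
```

Returning the dict makes it easy to print what was set when debugging a port clash.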
- -Note: While the ``%%bash`` magic is neat, it currently buffers the output, so you won't see the logs until the process -completes. - - - - -.. _deepspeed-config: - -Configuration -======================================================================================================================= - -For the complete guide to the DeepSpeed configuration options that can be used in its configuration file please refer -to the `following documentation `__. - -You can find dozens of DeepSpeed configuration examples that address various practical needs in `the DeepSpeedExamples -repo `__: - -.. code-block:: bash - - git clone https://github.com/microsoft/DeepSpeedExamples - cd DeepSpeedExamples - find . -name '*json' - -Continuing the code from above, let's say you're looking to configure the Lamb optimizer. You can search through the -example ``.json`` files with: - -.. code-block:: bash - - grep -i Lamb $(find . -name '*json') - -Some more examples can be found in the `main repo `__ as well. - -When using DeepSpeed you always need to supply a DeepSpeed configuration file, yet some configuration parameters have -to be configured via the command line. You will find the nuances in the rest of this guide. - -To get an idea of what a DeepSpeed configuration file looks like, here is one that activates ZeRO stage 2 features, -including optimizer states CPU offload, uses the ``AdamW`` optimizer and the ``WarmupLR`` scheduler, and will enable mixed -precision training if ``--fp16`` is passed: - -.. 
code-block:: json - - { - "fp16": { - "enabled": "auto", - "loss_scale": 0, - "loss_scale_window": 1000, - "initial_scale_power": 16, - "hysteresis": 2, - "min_loss_scale": 1 - }, - - "optimizer": { - "type": "AdamW", - "params": { - "lr": "auto", - "betas": "auto", - "eps": "auto", - "weight_decay": "auto" - } - }, - - "scheduler": { - "type": "WarmupLR", - "params": { - "warmup_min_lr": "auto", - "warmup_max_lr": "auto", - "warmup_num_steps": "auto" - } - }, - - "zero_optimization": { - "stage": 2, - "offload_optimizer": { - "device": "cpu", - "pin_memory": true - }, - "allgather_partitions": true, - "allgather_bucket_size": 2e8, - "overlap_comm": true, - "reduce_scatter": true, - "reduce_bucket_size": 2e8, - "contiguous_gradients": true - }, - - "gradient_accumulation_steps": "auto", - "gradient_clipping": "auto", - "train_batch_size": "auto", - "train_micro_batch_size_per_gpu": "auto" - } - -When you execute the program, DeepSpeed will log the configuration it received from the :class:`~transformers.Trainer` -to the console, so you can see exactly what final configuration was passed to it. - - - -.. _deepspeed-config-passing: - -Passing Configuration -======================================================================================================================= - -As discussed in this document, the DeepSpeed configuration is normally passed as a path to a json file, but if you're -not using the command line interface to configure the training, and instead instantiate the -:class:`~transformers.Trainer` via :class:`~transformers.TrainingArguments`, then for the ``deepspeed`` argument you can -pass a nested ``dict``. This allows you to create the configuration on the fly and doesn't require you to write it to -the file system before passing it to :class:`~transformers.TrainingArguments`. - -To summarize you can do: - -.. code-block:: python - - TrainingArguments(..., deepspeed="/path/to/ds_config.json") - -or: - -.. 
code-block:: python - - ds_config_dict = dict(scheduler=scheduler_params, optimizer=optimizer_params) - TrainingArguments(..., deepspeed=ds_config_dict) - - - -.. _deepspeed-config-shared: - -Shared Configuration -======================================================================================================================= - - -.. warning:: - - This section is a must-read - -Some configuration values are required by both the :class:`~transformers.Trainer` and DeepSpeed to function correctly, -therefore, to prevent conflicting definitions, which could lead to hard-to-detect errors, we chose to configure those -via the :class:`~transformers.Trainer` command line arguments. - -Additionally, some configuration values are derived automatically based on the model's configuration, so instead of -remembering to manually adjust multiple values, it's best to let the :class:`~transformers.Trainer` do the majority -of configuration for you. - -Therefore, in the rest of this guide you will find a special configuration value: ``auto``, which when set will be -automatically replaced with the correct or most efficient value. Please feel free to ignore this -recommendation and set the values explicitly, in which case be very careful that your -:class:`~transformers.Trainer` arguments and DeepSpeed configurations agree. For example, are you using the same -learning rate, or batch size, or gradient accumulation settings? If these mismatch, the training may fail in very -difficult-to-detect ways. You have been warned. - -There are multiple other values that are specific to DeepSpeed only, and those you will have to set manually to suit -your needs. - -In your own programs, you can also use the following approach if you'd like to modify the DeepSpeed config as a master -and configure :class:`~transformers.TrainingArguments` based on that. The steps are: - -1. Create or load the DeepSpeed configuration to be used as a master configuration -2. 
Create the :class:`~transformers.TrainingArguments` object based on these values - -Do note that some values, such as :obj:`scheduler.params.total_num_steps`, are calculated by -:class:`~transformers.Trainer` during ``train``, but you can of course do the math yourself. - -.. _deepspeed-zero: - -ZeRO -======================================================================================================================= - -`Zero Redundancy Optimizer (ZeRO) `__ is the workhorse of DeepSpeed. It -supports 3 different levels (stages) of optimization. The first one is not particularly interesting for scalability purposes, -therefore this document focuses on stages 2 and 3. Stage 3 is further improved by the latest addition of ZeRO-Infinity. -You will find more in-depth information in the DeepSpeed documentation. - -The ``zero_optimization`` section of the configuration file is the most important part (`docs -`__), since that is where you define -which ZeRO stages you want to enable and how to configure them. You will find the explanation for each parameter in the -DeepSpeed docs. - -This section has to be configured exclusively via DeepSpeed configuration - the :class:`~transformers.Trainer` provides -no equivalent command line arguments. - -Note: currently DeepSpeed doesn't validate parameter names, so if you misspell any, it'll use the default setting for -the parameter that got misspelled. You can watch the DeepSpeed engine start-up log messages to see what values it is -going to use. - - - -.. _deepspeed-zero2-config: - -ZeRO-2 Config -+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ - -The following is an example configuration for ZeRO stage 2: - -.. 
code-block:: json - - { - "zero_optimization": { - "stage": 2, - "offload_optimizer": { - "device": "cpu", - "pin_memory": true - }, - "allgather_partitions": true, - "allgather_bucket_size": 5e8, - "overlap_comm": true, - "reduce_scatter": true, - "reduce_bucket_size": 5e8, - "contiguous_gradients": true - } - } - -**Performance tuning:** - -- enabling ``offload_optimizer`` should reduce GPU RAM usage (it requires ``"stage": 2``) -- ``"overlap_comm": true`` trades off increased GPU RAM usage to lower all-reduce latency. ``overlap_comm`` uses 4.5x - the ``allgather_bucket_size`` and ``reduce_bucket_size`` values. So if they are set to 5e8, this requires a 9GB - footprint (``5e8 x 2Bytes x 2 x 4.5``). Therefore, if you have a GPU with 8GB or less RAM, to avoid getting - OOM-errors you will need to reduce those parameters to about ``2e8``, which would require 3.6GB. You will want to do - the same on larger capacity GPU as well, if you're starting to hit OOM. -- when reducing these buffers you're trading communication speed to avail more GPU RAM. The smaller the buffer size, - the slower the communication, and the more GPU RAM will be available to other tasks. So if a bigger batch size is - important, getting a slightly slower training time could be a good trade. - - - -.. _deepspeed-zero3-config: - -ZeRO-3 Config -+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ - -The following is an example configuration for ZeRO stage 3: - -.. 
code-block:: json - - { - "zero_optimization": { - "stage": 3, - "offload_optimizer": { - "device": "cpu", - "pin_memory": true - }, - "offload_param": { - "device": "cpu", - "pin_memory": true - }, - "overlap_comm": true, - "contiguous_gradients": true, - "sub_group_size": 1e9, - "reduce_bucket_size": "auto", - "stage3_prefetch_bucket_size": "auto", - "stage3_param_persistence_threshold": "auto", - "stage3_max_live_parameters": 1e9, - "stage3_max_reuse_distance": 1e9, - "stage3_gather_fp16_weights_on_model_save": true - } - } - -If you are getting OOMs because your model or activations don't fit into the GPU memory and you have unutilized CPU -memory, offloading the optimizer states and parameters to CPU memory with ``"device": "cpu"`` may solve this limitation. -If you don't want to offload to CPU memory, use ``none`` instead of ``cpu`` for the ``device`` entry. Offloading to -NVMe is discussed further down. - -Pinned memory is enabled with ``pin_memory`` set to ``true``. This feature can improve the throughput at the cost of -making less memory available to other processes. Pinned memory is set aside for the specific process that requested it -and it's typically accessed much faster than normal CPU memory. - -**Performance tuning:** - -- ``stage3_max_live_parameters``: ``1e9`` -- ``stage3_max_reuse_distance``: ``1e9`` - -If hitting OOM, reduce ``stage3_max_live_parameters`` and ``stage3_max_reuse_distance``. They should have minimal impact -on performance unless you are doing activation checkpointing. ``1e9`` would consume ~2GB. The memory is shared by -``stage3_max_live_parameters`` and ``stage3_max_reuse_distance``, so it's not additive, it's just 2GB total. - -``stage3_max_live_parameters`` is the upper limit on how many full parameters you want to keep on the GPU at any given -time. 
"reuse distance" is a metric we are using to figure out when a parameter will be used again in the future, and we -use the ``stage3_max_reuse_distance`` to decide whether to throw away the parameter or to keep it. If a parameter is -going to be used again in the near future (less than ``stage3_max_reuse_distance``) then we keep it to reduce communication -overhead. This is super helpful when you have activation checkpointing enabled, where we do forward recompute and -backward passes at a single layer granularity and want to keep the parameter in the forward recompute until the backward -pass. - -The following configuration values depend on the model's hidden size: - -- ``reduce_bucket_size``: ``hidden_size*hidden_size`` -- ``stage3_prefetch_bucket_size``: ``0.9 * hidden_size * hidden_size`` -- ``stage3_param_persistence_threshold``: ``10 * hidden_size`` - -therefore set these values to ``auto`` and the :class:`~transformers.Trainer` will automatically assign the recommended -values. But, of course, feel free to set these explicitly as well. - -``stage3_gather_fp16_weights_on_model_save`` enables model fp16 weights consolidation when the model gets saved. With large -models and multiple GPUs this is an expensive operation both in terms of memory and speed. It's currently required if -you plan to resume the training. Watch out for future updates that will remove this limitation and make things more -flexible. - -If you're migrating from a ZeRO-2 configuration note that the ``allgather_partitions``, ``allgather_bucket_size`` and -``reduce_scatter`` configuration parameters are not used in ZeRO-3. If you keep these in the config file they will just -be ignored. - -- ``sub_group_size``: ``1e9`` - -``sub_group_size`` controls the granularity in which parameters are updated during optimizer steps. Parameters are -grouped into buckets of ``sub_group_size`` and each bucket is updated one at a time. 
When used with NVMe offload in -ZeRO-Infinity, ``sub_group_size`` therefore controls the granularity in which model states are moved in and out of CPU -memory from NVMe during the optimizer step. This prevents running out of CPU memory for extremely large models. - -You can leave ``sub_group_size`` to its default value of `1e9` when not using NVMe offload. You may want to change its -default value in the following cases: - -1. Running into OOM during optimizer step: Reduce ``sub_group_size`` to reduce memory utilization of temporary buffers -2. Optimizer Step is taking a long time: Increase ``sub_group_size`` to improve bandwidth utilization as a result of - the increased data buffers. - - -.. _deepspeed-nvme: - -NVMe Support -======================================================================================================================= - -ZeRO-Infinity allows for training incredibly large models by extending GPU and CPU memory with NVMe memory. Thanks to -smart partitioning and tiling algorithms each GPU needs to send and receive very small amounts of data during -offloading so modern NVMe proved to be fit to allow for an even larger total memory pool available to your training -process. ZeRO-Infinity requires ZeRO-3 enabled. - -The following configuration example enables NVMe to offload both optimizer states and the params: - -.. 
code-block:: json - - { - "zero_optimization": { - "stage": 3, - "offload_optimizer": { - "device": "nvme", - "nvme_path": "/local_nvme", - "pin_memory": true, - "buffer_count": 4, - "fast_init": false - }, - "offload_param": { - "device": "nvme", - "nvme_path": "/local_nvme", - "pin_memory": true, - "buffer_count": 5, - "buffer_size": 1e8, - "max_in_cpu": 1e9 - }, - "aio": { - "block_size": 262144, - "queue_depth": 32, - "thread_count": 1, - "single_submit": false, - "overlap_events": true - }, - "overlap_comm": true, - "contiguous_gradients": true, - "sub_group_size": 1e9, - "reduce_bucket_size": "auto", - "stage3_prefetch_bucket_size": "auto", - "stage3_param_persistence_threshold": "auto", - "stage3_max_live_parameters": 1e9, - "stage3_max_reuse_distance": 1e9, - "stage3_gather_fp16_weights_on_model_save": true - } - } - -You can choose to offload both optimizer states and params to NVMe, or just one of them or none. For example, if you -have copious amounts of CPU memory available, by all means offload to CPU memory only as it'd be faster (hint: -``"device": "cpu"``). - -Here is the full documentation for offloading `optimizer states -`__ and `parameters -`__. - -Make sure that your ``nvme_path`` is actually an NVMe drive: it will also work with a regular hard drive or SSD, but it'll -be much, much slower. The fast scalable training was designed with modern NVMe transfer speeds in mind (as of this -writing one can have ~3.5GB/s read, ~3GB/s write peak speeds). - -In order to figure out the optimal ``aio`` configuration block you must run a benchmark on your target setup, as -`explained here `__. - - - -.. _deepspeed-zero2-zero3-performance: - -ZeRO-2 vs ZeRO-3 Performance -+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ - -ZeRO-3 is likely to be slower than ZeRO-2 if everything else is configured the same because the former has to gather -model weights in addition to what ZeRO-2 does. 
If ZeRO-2 meets your needs and you don't need to scale beyond a few GPUs -then you may choose to stick to it. It's important to understand that ZeRO-3 enables a much higher scalability capacity -at a cost of speed. - -It's possible to adjust ZeRO-3 configuration to make it perform closer to ZeRO-2: - -- set ``stage3_param_persistence_threshold`` to a very large number - larger than the largest parameter, e.g., ``6 * - hidden_size * hidden_size``. This will keep the parameters on the GPUs. -- turn off ``offload_param`` since ZeRO-2 doesn't have that option. - -The performance will likely improve significantly with just ``offload_param`` turned off, even if you don't change -``stage3_param_persistence_threshold``. Of course, these changes will impact the size of the model you can train. So -these help you to trade scalability for speed depending on your needs. - - - -.. _deepspeed-zero2-example: - -ZeRO-2 Example -+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ - -Here is a full ZeRO-2 auto-configuration file ``ds_config_zero2.json``: - -.. 
code-block:: json - - { - "fp16": { - "enabled": "auto", - "loss_scale": 0, - "loss_scale_window": 1000, - "initial_scale_power": 16, - "hysteresis": 2, - "min_loss_scale": 1 - }, - - "optimizer": { - "type": "AdamW", - "params": { - "lr": "auto", - "betas": "auto", - "eps": "auto", - "weight_decay": "auto" - } - }, - - "scheduler": { - "type": "WarmupLR", - "params": { - "warmup_min_lr": "auto", - "warmup_max_lr": "auto", - "warmup_num_steps": "auto" - } - }, - - "zero_optimization": { - "stage": 2, - "offload_optimizer": { - "device": "cpu", - "pin_memory": true - }, - "allgather_partitions": true, - "allgather_bucket_size": 2e8, - "overlap_comm": true, - "reduce_scatter": true, - "reduce_bucket_size": 2e8, - "contiguous_gradients": true - }, - - "gradient_accumulation_steps": "auto", - "gradient_clipping": "auto", - "steps_per_print": 2000, - "train_batch_size": "auto", - "train_micro_batch_size_per_gpu": "auto", - "wall_clock_breakdown": false - } - - -Here is a full ZeRO-2 all-enabled manually set configuration file. It is here mainly for you to see what the typical -values look like, but we highly recommend using the one with multiple ``auto`` settings in it. - -.. 
code-block:: json - - { - "fp16": { - "enabled": true, - "loss_scale": 0, - "loss_scale_window": 1000, - "initial_scale_power": 16, - "hysteresis": 2, - "min_loss_scale": 1 - }, - - "optimizer": { - "type": "AdamW", - "params": { - "lr": 3e-5, - "betas": [0.8, 0.999], - "eps": 1e-8, - "weight_decay": 3e-7 - } - }, - - "scheduler": { - "type": "WarmupLR", - "params": { - "warmup_min_lr": 0, - "warmup_max_lr": 3e-5, - "warmup_num_steps": 500 - } - }, - - "zero_optimization": { - "stage": 2, - "offload_optimizer": { - "device": "cpu", - "pin_memory": true - }, - "allgather_partitions": true, - "allgather_bucket_size": 2e8, - "overlap_comm": true, - "reduce_scatter": true, - "reduce_bucket_size": 2e8, - "contiguous_gradients": true - }, - - "steps_per_print": 2000, - "wall_clock_breakdown": false - } - - - -.. _deepspeed-zero3-example: - -ZeRO-3 Example -+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ - -Here is a full ZeRO-3 auto-configuration file ``ds_config_zero3.json``: - - -.. 
code-block:: json - - { - "fp16": { - "enabled": "auto", - "loss_scale": 0, - "loss_scale_window": 1000, - "initial_scale_power": 16, - "hysteresis": 2, - "min_loss_scale": 1 - }, - - "optimizer": { - "type": "AdamW", - "params": { - "lr": "auto", - "betas": "auto", - "eps": "auto", - "weight_decay": "auto" - } - }, - - "scheduler": { - "type": "WarmupLR", - "params": { - "warmup_min_lr": "auto", - "warmup_max_lr": "auto", - "warmup_num_steps": "auto" - } - }, - - "zero_optimization": { - "stage": 3, - "offload_optimizer": { - "device": "cpu", - "pin_memory": true - }, - "offload_param": { - "device": "cpu", - "pin_memory": true - }, - "overlap_comm": true, - "contiguous_gradients": true, - "sub_group_size": 1e9, - "reduce_bucket_size": "auto", - "stage3_prefetch_bucket_size": "auto", - "stage3_param_persistence_threshold": "auto", - "stage3_max_live_parameters": 1e9, - "stage3_max_reuse_distance": 1e9, - "stage3_gather_fp16_weights_on_model_save": true - }, - - "gradient_accumulation_steps": "auto", - "gradient_clipping": "auto", - "steps_per_print": 2000, - "train_batch_size": "auto", - "train_micro_batch_size_per_gpu": "auto", - "wall_clock_breakdown": false - } - -Here is a full ZeRO-3 all-enabled manually set configuration file. It is here mainly for you to see what the typical -values look like, but we highly recommend using the one with multiple ``auto`` settings in it. - -.. 
code-block:: json - - { - "fp16": { - "enabled": true, - "loss_scale": 0, - "loss_scale_window": 1000, - "initial_scale_power": 16, - "hysteresis": 2, - "min_loss_scale": 1 - }, - - "optimizer": { - "type": "AdamW", - "params": { - "lr": 3e-5, - "betas": [0.8, 0.999], - "eps": 1e-8, - "weight_decay": 3e-7 - } - }, - - "scheduler": { - "type": "WarmupLR", - "params": { - "warmup_min_lr": 0, - "warmup_max_lr": 3e-5, - "warmup_num_steps": 500 - } - }, - - "zero_optimization": { - "stage": 3, - "offload_optimizer": { - "device": "cpu", - "pin_memory": true - }, - "offload_param": { - "device": "cpu", - "pin_memory": true - }, - "overlap_comm": true, - "contiguous_gradients": true, - "sub_group_size": 1e9, - "reduce_bucket_size": 1e6, - "stage3_prefetch_bucket_size": 0.94e6, - "stage3_param_persistence_threshold": 1e4, - "stage3_max_live_parameters": 1e9, - "stage3_max_reuse_distance": 1e9, - "stage3_gather_fp16_weights_on_model_save": true - }, - - "steps_per_print": 2000, - "wall_clock_breakdown": false - } - - -Optimizer and Scheduler -======================================================================================================================= - -As long as you don't enable ``offload_optimizer`` you can mix and match DeepSpeed and HuggingFace schedulers and -optimizers, with the exception of using the combination of HuggingFace scheduler and DeepSpeed optimizer: - -+--------------+--------------+--------------+ -| Combos | HF Scheduler | DS Scheduler | -+--------------+--------------+--------------+ -| HF Optimizer | Yes | Yes | -+--------------+--------------+--------------+ -| DS Optimizer | No | Yes | -+--------------+--------------+--------------+ - -It is possible to use a non-DeepSpeed optimizer when ``offload_optimizer`` is enabled, as long as it has both CPU and -GPU implementation (except LAMB). - - - - -.. 
_deepspeed-optimizer: - -Optimizer -+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ - - -DeepSpeed's main optimizers are Adam, AdamW, OneBitAdam, and Lamb. These have been thoroughly tested with ZeRO and are -thus recommended. DeepSpeed can, however, import other optimizers from ``torch``. The full documentation is `here -`__. - -If you don't configure the ``optimizer`` entry in the configuration file, the :class:`~transformers.Trainer` will -automatically set it to ``AdamW`` and will use the supplied values or the defaults for the following command line -arguments: ``--learning_rate``, ``--adam_beta1``, ``--adam_beta2``, ``--adam_epsilon`` and ``--weight_decay``. - -Here is an example of the auto-configured ``optimizer`` entry for ``AdamW``: - -.. code-block:: json - - { - "optimizer": { - "type": "AdamW", - "params": { - "lr": "auto", - "betas": "auto", - "eps": "auto", - "weight_decay": "auto" - } - } - } - - -Note that the command line arguments will set the values in the configuration file. This is so that there is one -definitive source of the values and to avoid hard-to-find errors when, for example, the learning rate is set to -different values in different places. Command line rules. The values that get overridden are: - -- ``lr`` with the value of ``--learning_rate`` -- ``betas`` with the value of ``--adam_beta1 --adam_beta2`` -- ``eps`` with the value of ``--adam_epsilon`` -- ``weight_decay`` with the value of ``--weight_decay`` - -Therefore please remember to tune the shared hyperparameters on the command line. - -You can also set the values explicitly: - -.. code-block:: json - - { - "optimizer": { - "type": "AdamW", - "params": { - "lr": 0.001, - "betas": [0.8, 0.999], - "eps": 1e-8, - "weight_decay": 3e-7 - } - } - } - -But then you're on your own synchronizing the :class:`~transformers.Trainer` command line arguments and the DeepSpeed -configuration. 
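If you do set the optimizer values explicitly, a small check before training can catch disagreements between the two sources. The following is only a sketch in plain Python: ``find_optimizer_mismatches`` is a hypothetical helper (it is not part of 🤗 Transformers or DeepSpeed), its arguments mirror the command line arguments listed above, and it deliberately skips entries that are ``"auto"`` or missing from the DeepSpeed config.

```python
def find_optimizer_mismatches(
    ds_config, learning_rate, adam_beta1, adam_beta2, adam_epsilon, weight_decay
):
    """Return {param: (ds_value, trainer_value)} for explicit values that differ.

    Hypothetical helper: entries set to "auto" or absent from the DeepSpeed
    config are not checked.
    """
    params = ds_config.get("optimizer", {}).get("params", {})
    expected = {
        "lr": learning_rate,
        "betas": [adam_beta1, adam_beta2],
        "eps": adam_epsilon,
        "weight_decay": weight_decay,
    }
    mismatches = {}
    for key, trainer_value in expected.items():
        ds_value = params.get(key, "auto")
        if ds_value == "auto":
            continue  # the Trainer fills this in, so nothing can conflict
        if key == "betas":
            same = list(ds_value) == list(trainer_value)
        else:
            same = ds_value == trainer_value
        if not same:
            mismatches[key] = (ds_value, trainer_value)
    return mismatches


# The explicit AdamW entry from above, compared with example Trainer values
explicit_config = {
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 0.001, "betas": [0.8, 0.999], "eps": 1e-8, "weight_decay": 3e-7},
    }
}
mismatches = find_optimizer_mismatches(
    explicit_config,
    learning_rate=5e-5,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    weight_decay=0.0,
)
```

An empty result means the explicitly set values agree with the ones passed to the :class:`~transformers.Trainer`.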
- -If you want to use another optimizer which is not listed above, you will have to add the following to the top-level configuration: - -.. code-block:: json - - { - "zero_allow_untested_optimizer": true - } - -Similarly to ``AdamW``, you can configure other officially supported optimizers. Just remember that these may have different -config values, e.g. for Adam you will want ``weight_decay`` around ``0.01``. - - - -.. _deepspeed-scheduler: - -Scheduler -+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ - -DeepSpeed supports ``LRRangeTest``, ``OneCycle``, ``WarmupLR`` and ``WarmupDecayLR`` learning rate schedulers. The full -documentation is `here `__. - -Here is where the schedulers overlap between 🤗 Transformers and DeepSpeed: - -* ``WarmupLR`` via ``--lr_scheduler_type constant_with_warmup`` -* ``WarmupDecayLR`` via ``--lr_scheduler_type linear``. This is also the default value for ``--lr_scheduler_type``, - therefore, if you don't configure the scheduler this is the scheduler that will get configured by default. - -If you don't configure the ``scheduler`` entry in the configuration file, the :class:`~transformers.Trainer` will use -the values of ``--lr_scheduler_type``, ``--learning_rate`` and ``--warmup_steps`` or ``--warmup_ratio`` to configure a -🤗 Transformers version of it. - -Here is an example of the auto-configured ``scheduler`` entry for ``WarmupLR``: - -.. code-block:: json - - { - "scheduler": { - "type": "WarmupLR", - "params": { - "warmup_min_lr": "auto", - "warmup_max_lr": "auto", - "warmup_num_steps": "auto" - } - } - } - -Since ``"auto"`` is used, the :class:`~transformers.Trainer` arguments will set the correct values in the configuration -file. This is so that there is one definitive source of the values and to avoid hard-to-find errors when, for example, -the learning rate is set to different values in different places. Command line rules. 
The values that get set are: - -- ``warmup_min_lr`` with the value of ``0``. -- ``warmup_max_lr`` with the value of ``--learning_rate``. -- ``warmup_num_steps`` with the value of ``--warmup_steps`` if provided. Otherwise will use ``--warmup_ratio`` - multiplied by the number of training steps and rounded up. -- ``total_num_steps`` with either the value of ``--max_steps`` or if it is not provided, derived automatically at run - time based on the environment and the size of the dataset and other command line arguments (needed for - ``WarmupDecayLR``). - -You can, of course, take over any or all of the configuration values and set those yourself: - -.. code-block:: json - - { - "scheduler": { - "type": "WarmupLR", - "params": { - "warmup_min_lr": 0, - "warmup_max_lr": 0.001, - "warmup_num_steps": 1000 - } - } - } - -But then you're on your own synchronizing the :class:`~transformers.Trainer` command line arguments and the DeepSpeed -configuration. - -For example, for ``WarmupDecayLR``, you can use the following entry: - -.. code-block:: json - - { - "scheduler": { - "type": "WarmupDecayLR", - "params": { - "last_batch_iteration": -1, - "total_num_steps": "auto", - "warmup_min_lr": "auto", - "warmup_max_lr": "auto", - "warmup_num_steps": "auto" - } - } - } - -and ``total_num_steps``, ``warmup_min_lr``, ``warmup_max_lr`` and ``warmup_num_steps`` will be set at loading time. - - - - -.. _deepspeed-fp32: - -fp32 Precision -======================================================================================================================= - -DeepSpeed supports the full fp32 and the fp16 mixed precision. - -Because of the much reduced memory needs and faster speed one gets with the fp16 mixed precision, the only time you -will want to not use it is when the model you're using doesn't behave well under this training mode. Typically this -happens when the model wasn't pretrained in the fp16 mixed precision (e.g. often this happens with bf16-pretrained -models). 
Such models may overflow or underflow leading to ``NaN`` loss. If this is the case for you, then you will want to use -the full fp32 mode, by explicitly disabling the otherwise default fp16 mixed precision mode with: - -.. code-block:: json - - { - "fp16": { - "enabled": false - } - } - -If you're using an Ampere-architecture based GPU, pytorch version 1.7 and higher will automatically switch to using -the much more efficient tf32 format for some operations, but the results will still be in fp32. For details and -benchmarks, please, see `TensorFloat-32(TF32) on Ampere devices -`__. The document includes -instructions on how to disable this automatic conversion if for some reason you prefer not to use it. - - - - -.. _deepspeed-amp: - -Automatic Mixed Precision -======================================================================================================================= - -You can use automatic mixed precision with either a pytorch-like AMP way or the apex-like way: - -To configure pytorch AMP-like mode set: - -.. code-block:: json - - { - "fp16": { - "enabled": "auto", - "loss_scale": 0, - "loss_scale_window": 1000, - "initial_scale_power": 16, - "hysteresis": 2, - "min_loss_scale": 1 - } - } - -and the :class:`~transformers.Trainer` will automatically enable or disable it based on the value of -``args.fp16_backend``. The rest of the config values are up to you. - -This mode gets enabled when ``--fp16 --fp16_backend amp`` command line args are passed. - -You can also enable/disable this mode explicitly: - -.. code-block:: json - - { - "fp16": { - "enabled": true, - "loss_scale": 0, - "loss_scale_window": 1000, - "initial_scale_power": 16, - "hysteresis": 2, - "min_loss_scale": 1 - } - } - -But then you're on your own synchronizing the :class:`~transformers.Trainer` command line arguments and the DeepSpeed -configuration. - -Here is the `documentation `__. - -To configure apex AMP-like mode set: - -.. 
code-block:: json - - { - "amp": { - "enabled": "auto", - "opt_level": "auto" - } - } - -and the :class:`~transformers.Trainer` will automatically configure it based on the values of ``args.fp16_backend`` and -``args.fp16_opt_level``. - -This mode gets enabled when ``--fp16 --fp16_backend apex --fp16_opt_level O1`` command line args are passed. - -You can also configure this mode explicitly: - -.. code-block:: json - - { - "amp": { - "enabled": true, - "opt_level": "O1" - } - } - -But then you're on your own synchronizing the :class:`~transformers.Trainer` command line arguments and the DeepSpeed -configuration. - -Here is the `documentation -`__. - - - -.. _deepspeed-bs: - -Batch Size -======================================================================================================================= - -To configure batch size, use: - -.. code-block:: json - - { - "train_batch_size": "auto", - "train_micro_batch_size_per_gpu": "auto" - } - -and the :class:`~transformers.Trainer` will automatically set ``train_micro_batch_size_per_gpu`` to the value of -``args.per_device_train_batch_size`` and ``train_batch_size`` to ``args.world_size * args.per_device_train_batch_size * -args.gradient_accumulation_steps``. - -You can also set the values explicitly: - -.. code-block:: json - - { - "train_batch_size": 12, - "train_micro_batch_size_per_gpu": 4 - } - -But then you're on your own synchronizing the :class:`~transformers.Trainer` command line arguments and the DeepSpeed -configuration. - - - -.. _deepspeed-grad-acc: - -Gradient Accumulation -======================================================================================================================= - -To configure gradient accumulation set: - -.. code-block:: json - - { - "gradient_accumulation_steps": "auto" - } - -and the :class:`~transformers.Trainer` will automatically set it to the value of ``args.gradient_accumulation_steps``. - -You can also set the value explicitly: - -..
code-block:: json - - { - "gradient_accumulation_steps": 3 - } - -But then you're on your own synchronizing the :class:`~transformers.Trainer` command line arguments and the DeepSpeed -configuration. - - - -.. _deepspeed-grad-clip: - -Gradient Clipping -======================================================================================================================= - -To configure gradient clipping set: - -.. code-block:: json - - { - "gradient_clipping": "auto" - } - -and the :class:`~transformers.Trainer` will automatically set it to the value of ``args.max_grad_norm``. - -You can also set the value explicitly: - -.. code-block:: json - - { - "gradient_clipping": 1.0 - } - -But then you're on your own synchronizing the :class:`~transformers.Trainer` command line arguments and the DeepSpeed -configuration. - - - -.. _deepspeed-weight-extraction: - -Getting The Model Weights Out -======================================================================================================================= - -As long as you continue training and resuming using DeepSpeed you don't need to worry about anything. DeepSpeed stores -fp32 master weights in its custom checkpoint optimizer files, which are ``global_step*/*optim_states.pt`` (this is a glob -pattern), and are saved under the normal checkpoint. - -**FP16 Weights:** - -When a model is saved under ZeRO-2, you end up having the normal ``pytorch_model.bin`` file with the model weights, but -they are only the fp16 version of the weights. - -Under ZeRO-3, things are much more complicated, since the model weights are partitioned out over multiple GPUs, -therefore ``"stage3_gather_fp16_weights_on_model_save": true`` is required to get the ``Trainer`` to save the fp16 -version of the weights. If this setting is ``False`` ``pytorch_model.bin`` won't be created. This is because by default -DeepSpeed's ``state_dict`` contains a placeholder and not the real weights.
If we were to save this ``state_dict`` it -won't be possible to load it back. - - -.. code-block:: json - - { - "zero_optimization": { - "stage3_gather_fp16_weights_on_model_save": true - } - } - - -**FP32 Weights:** - -While the fp16 weights are fine for resuming training, if you finished finetuning your model and want to upload it to -the `models hub `__ or pass it to someone else you most likely will want to get the fp32 -weights. This ideally shouldn't be done during training since this is a process that requires a lot of memory, and -therefore best to be performed offline after the training is complete. But if desired and you have plenty of free CPU -memory it can be done in the same training script. The following sections will discuss both approaches. - - -**Live FP32 Weights Recovery:** - -This approach may not work if your model is large and you have little free CPU memory left at the end of the training. - -If you have saved at least one checkpoint, and you want to use the latest one, you can do the following: - -.. code-block:: python - - from transformers.trainer_utils import get_last_checkpoint - from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint - checkpoint_dir = get_last_checkpoint(trainer.args.output_dir) - fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir) - -If you're using the ``--load_best_model_at_end`` :class:`~transformers.TrainingArguments` argument (to track the best -checkpoint), then you can finish the training by first saving the final model explicitly and then do the same as above: - -.. code-block:: python - - import os - from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint - checkpoint_dir = os.path.join(trainer.args.output_dir, "checkpoint-final") - trainer.deepspeed.save_checkpoint(checkpoint_dir) - fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir) - -..
note:: - - Note, that once ``load_state_dict_from_zero_checkpoint`` was run, the ``model`` will no longer be usable in the - DeepSpeed context of the same application, i.e. you will need to re-initialize the deepspeed engine, since - ``model.load_state_dict(state_dict)`` will remove all the DeepSpeed magic from it. So do this only at the very end - of the training. - -Of course, you don't have to use :class:`~transformers.Trainer` and you can adjust the examples above to your own -trainer. - -If for some reason you want more refinement, you can also extract the fp32 ``state_dict`` of the weights and apply -these yourself as is shown in the following example: - -.. code-block:: python - - from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint - state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir) # already on cpu - model = model.cpu() - model.load_state_dict(state_dict) - - -**Offline FP32 Weights Recovery:** - -DeepSpeed creates a special conversion script ``zero_to_fp32.py`` which it places in the top-level of the checkpoint -folder. Using this script you can extract the weights at any point. The script is standalone and you no longer need to -have the configuration file or a ``Trainer`` to do the extraction. - -Let's say your checkpoint folder looks like this: - -..
code-block:: bash - - $ ls -l output_dir/checkpoint-1/ - -rw-rw-r-- 1 stas stas 1.4K Mar 27 20:42 config.json - drwxrwxr-x 2 stas stas 4.0K Mar 25 19:52 global_step1/ - -rw-rw-r-- 1 stas stas 12 Mar 27 13:16 latest - -rw-rw-r-- 1 stas stas 827K Mar 27 20:42 optimizer.pt - -rw-rw-r-- 1 stas stas 231M Mar 27 20:42 pytorch_model.bin - -rw-rw-r-- 1 stas stas 623 Mar 27 20:42 scheduler.pt - -rw-rw-r-- 1 stas stas 1.8K Mar 27 20:42 special_tokens_map.json - -rw-rw-r-- 1 stas stas 774K Mar 27 20:42 spiece.model - -rw-rw-r-- 1 stas stas 1.9K Mar 27 20:42 tokenizer_config.json - -rw-rw-r-- 1 stas stas 339 Mar 27 20:42 trainer_state.json - -rw-rw-r-- 1 stas stas 2.3K Mar 27 20:42 training_args.bin - -rwxrw-r-- 1 stas stas 5.5K Mar 27 13:16 zero_to_fp32.py* - -In this example there is just one DeepSpeed checkpoint sub-folder `global_step1`. Therefore to reconstruct the fp32 -weights just run: - -.. code-block:: bash - - python zero_to_fp32.py . pytorch_model.bin - -This is it. ``pytorch_model.bin`` will now contain the full fp32 model weights consolidated from multiple GPUs. - -The script will automatically be able to handle either a ZeRO-2 or ZeRO-3 checkpoint. - -``python zero_to_fp32.py -h`` will give you usage details. - -The script will auto-discover the deepspeed sub-folder using the contents of the file ``latest``, which in the current -example will contain ``global_step1``. - -Note: currently the script requires 2x general RAM of the final fp32 model weights. - - -ZeRO-3 and Infinity Nuances -======================================================================================================================= - -ZeRO-3 is quite different from ZeRO-2 because of its param sharding feature. - -ZeRO-Infinity further extends ZeRO-3 to support NVMe memory and multiple other speed and scalability improvements. 
- -While all the efforts were made for things to just work without needing any special changes to your models, in certain -circumstances you may find the following information to be needed. - - - -Constructing Massive Models -+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ - -DeepSpeed/ZeRO-3 can handle models with trillions of parameters which may not fit onto the existing RAM. In such cases, -but also if you want the initialization to happen much faster, initialize the model using the ``deepspeed.zero.Init()`` -context manager (which is also a function decorator), like so: - -.. code-block:: python - - from transformers import T5ForConditionalGeneration, T5Config - import deepspeed - with deepspeed.zero.Init(): - config = T5Config.from_pretrained("t5-small") - model = T5ForConditionalGeneration(config) - -As you can see this gives you a randomly initialized model. - -If you want to use a pretrained model, ``model_class.from_pretrained`` will activate this feature as long as -``is_deepspeed_zero3_enabled()`` returns ``True``, which currently is set up by the -:class:`~transformers.TrainingArguments` object if the passed DeepSpeed configuration file contains ZeRO-3 config -section. Thus you must create the :class:`~transformers.TrainingArguments` object **before** calling -``from_pretrained``. Here is an example of a possible sequence: - -.. code-block:: python - - from transformers import AutoModel, Trainer, TrainingArguments - training_args = TrainingArguments(..., deepspeed=ds_config) - model = AutoModel.from_pretrained("t5-small") - trainer = Trainer(model=model, args=training_args, ...) - -If you're using the official example scripts and your command line arguments include ``--deepspeed ds_config.json`` -with ZeRO-3 config enabled, then everything is already done for you, since this is how example scripts are written.
- -Note: If the fp16 weights of the model can't fit onto the memory of a single GPU this feature must be used. - -For full details on this method and other related features please refer to `Constructing Massive Models -`__. - -Also when loading fp16-pretrained models, you will want to tell ``from_pretrained`` to use -``torch_dtype=torch.float16``. For details, please, see :ref:`from_pretrained-torch-dtype`. - - -Gathering Parameters -+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ - -Under ZeRO-3 on multiple GPUs no single GPU has all the parameters unless it's the parameters for the currently -executing layer. So if you need to access all parameters from all layers at once there is a specific method to do it. -Most likely you won't need it, but if you do please refer to `Gathering Parameters -`__ - -We do however use it internally in several places, one such example is when loading pretrained model weights in -``from_pretrained``. We load one layer at a time and immediately partition it to all participating GPUs, as for very -large models it won't be possible to load it on one GPU and then spread it out to multiple GPUs, due to memory -limitations. - -Also under ZeRO-3, if you write your own code and run into a model parameter weight that looks like: - -.. code-block:: python - - tensor([1.], device='cuda:0', dtype=torch.float16, requires_grad=True) - -stress on ``tensor([1.])``, or if you get an error where it says the parameter is of size ``1``, instead of some much -larger multi-dimensional shape, this means that the parameter is partitioned and what you see is a ZeRO-3 placeholder. - - - -.. _deepspeed-zero-inference: - - -ZeRO Inference -======================================================================================================================= - -ZeRO Inference uses the same config as ZeRO-3 Training. You just don't need the optimizer and scheduler sections. 
In -fact you can leave these in the config file if you want to share the same one with the training. They will just be -ignored. - -Otherwise you just need to pass the usual :class:`~transformers.TrainingArguments` arguments. For example: - -.. code-block:: bash - - deepspeed --num_gpus=2 your_program.py --do_eval --deepspeed ds_config.json - -The only important thing is that you need to use a ZeRO-3 configuration, since ZeRO-2 provides no benefit whatsoever -for the inference as only ZeRO-3 performs sharding of parameters, whereas ZeRO-2 shards gradients and optimizer states. - -Here is an example of running ``run_translation.py`` under DeepSpeed deploying all available GPUs: - -.. code-block:: bash - - deepspeed examples/pytorch/translation/run_translation.py \ - --deepspeed tests/deepspeed/ds_config_zero3.json \ - --model_name_or_path t5-small --output_dir output_dir \ - --do_eval --max_eval_samples 50 --warmup_steps 50 \ - --max_source_length 128 --val_max_target_length 128 \ - --overwrite_output_dir --per_device_eval_batch_size 4 \ - --predict_with_generate --dataset_config "ro-en" --fp16 \ - --source_lang en --target_lang ro --dataset_name wmt16 \ - --source_prefix "translate English to Romanian: " - -Since for inference there is no need for additional large memory used by the optimizer states and the gradients you -should be able to fit much larger batches and/or sequence length onto the same hardware. - - -Additionally DeepSpeed is currently developing a related product called Deepspeed-Inference which has no relationship -to the ZeRO technology, but instead uses tensor parallelism to scale models that can't fit onto a single GPU. This is a -work in progress and we will provide the integration once that product is complete.
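As a rough illustration of the ZeRO Inference requirements above, the following sketch checks that a config uses stage 3 and optionally drops the sections that inference ignores. The helper name ``make_inference_config`` is hypothetical and is not part of DeepSpeed or 🤗 Transformers; it only assumes the config has already been loaded into a plain Python dict.

```python
import copy


def make_inference_config(train_config):
    """Hypothetical helper: derive a ZeRO-3 inference config from a training config dict."""
    stage = train_config.get("zero_optimization", {}).get("stage")
    if stage != 3:
        # Only ZeRO-3 shards the parameters, so other stages bring no benefit for inference.
        raise ValueError(f"ZeRO Inference needs a ZeRO-3 config, got stage={stage!r}")
    config = copy.deepcopy(train_config)
    # Optional tidy-up: these sections are ignored for inference, so they may also be left in place.
    config.pop("optimizer", None)
    config.pop("scheduler", None)
    return config


train_config = {
    "zero_optimization": {"stage": 3},
    "optimizer": {"type": "AdamW", "params": {"lr": "auto"}},
    "scheduler": {"type": "WarmupLR", "params": {"warmup_num_steps": "auto"}},
}
inference_config = make_inference_config(train_config)
```

Since the optimizer and scheduler sections are simply ignored at inference time, removing them is purely cosmetic; sharing one file between training and inference works just as well.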
- - -Filing Issues -======================================================================================================================= - -Here is how to file an issue so that we could quickly get to the bottom of the issue and help you to unblock your work. - -In your report please always include: - -1. the full Deepspeed config file in the report - -2. either the command line arguments if you were using the :class:`~transformers.Trainer` or - :class:`~transformers.TrainingArguments` arguments if you were scripting the Trainer setup yourself. Please do not - dump the :class:`~transformers.TrainingArguments` as it has dozens of entries that are irrelevant. - -3. Output of: - -.. code-block:: bash - - python -c 'import torch; print(f"torch: {torch.__version__}")' - python -c 'import transformers; print(f"transformers: {transformers.__version__}")' - python -c 'import deepspeed; print(f"deepspeed: {deepspeed.__version__}")' - -4. If possible include a link to a Google Colab notebook that we can reproduce the problem with. You can use this - `notebook `__ as - a starting point. - -5. Unless it's impossible please always use a standard dataset that we can use and not something custom. - -6. If possible try to use one of the existing `examples - `__ to reproduce the problem with. - -Things to consider: - -* Deepspeed is often not the cause of the problem. - - Some of the filed issues proved to be Deepspeed-unrelated. That is once Deepspeed was removed from the setup, the - problem was still there. - - Therefore, if it's not absolutely obvious it's a DeepSpeed-related problem, as in you can see that there is an - exception and you can see that DeepSpeed modules are involved, first re-test your setup without DeepSpeed in it. - And only if the problem persists then do mention Deepspeed and supply all the required details.
- -* If it's clear to you that the issue is in the DeepSpeed core and not the integration part, please file the Issue - directly with `Deepspeed `__. If you aren't sure, please do not worry, - either Issue tracker will do, we will figure it out once you posted it and redirect you to another Issue tracker if - need be. - - - -Troubleshooting -======================================================================================================================= - -* ``deepspeed`` process gets killed at startup without a traceback - -If the ``deepspeed`` process gets killed at launch time without a traceback, that usually means that the program tried -to allocate more CPU memory than your system has or your process is allowed to allocate and the OS kernel killed that -process. This is because your configuration file most likely has either ``offload_optimizer`` or ``offload_param`` or -both configured to offload to ``cpu``. If you have NVMe, experiment with offloading to NVMe if you're running under -ZeRO-3. - -Work is being done to enable estimating how much memory is needed for a specific model: `PR -`__. - - - - - - -Notes -======================================================================================================================= - -* DeepSpeed works with the PyTorch :class:`~transformers.Trainer` but not TF :class:`~transformers.TFTrainer`. -* While DeepSpeed has a pip installable PyPI package, it is highly recommended that it gets installed from `source - `__ to best match your hardware and also if you need to enable - certain features, like 1-bit Adam, which aren't available in the pypi distribution. -* You don't have to use the :class:`~transformers.Trainer` to use DeepSpeed with 🤗 Transformers - you can use any model - with your own trainer, and you will have to adapt the latter according to `the DeepSpeed integration instructions - `__. - - - - -.. 
_deepspeed-non-trainer-integration: - -Non-Trainer Deepspeed Integration -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The :class:`~transformers.integrations.HfDeepSpeedConfig` is used to integrate Deepspeed into the 🤗 Transformers core -functionality, when :class:`~transformers.Trainer` is not used. - -When using :class:`~transformers.Trainer` everything is automatically taken care of. - -When not using :class:`~transformers.Trainer`, to efficiently deploy DeepSpeed stage 3, you must instantiate the -:class:`~transformers.integrations.HfDeepSpeedConfig` object before instantiating the model. - -For example for a pretrained model: - -.. code-block:: python - - from transformers.deepspeed import HfDeepSpeedConfig - from transformers import AutoModel - import deepspeed - - ds_config = { ... } # deepspeed config object or path to the file - # must run before instantiating the model - dschf = HfDeepSpeedConfig(ds_config) # keep this object alive - model = AutoModel.from_pretrained("gpt2") - engine = deepspeed.initialize(model=model, config_params=ds_config, ...) - -or for non-pretrained model: - -.. code-block:: python - - from transformers.deepspeed import HfDeepSpeedConfig - from transformers import AutoModel, AutoConfig - import deepspeed - - ds_config = { ... } # deepspeed config object or path to the file - # must run before instantiating the model - dschf = HfDeepSpeedConfig(ds_config) # keep this object alive - config = AutoConfig.from_pretrained("gpt2") - model = AutoModel.from_config(config) - engine = deepspeed.initialize(model=model, config_params=ds_config, ...) - - -HfDeepSpeedConfig -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -..
autoclass:: transformers.deepspeed.HfDeepSpeedConfig - :members: - - - -Main DeepSpeed Resources -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -- `Project's github `__ -- `Usage docs `__ -- `API docs `__ -- `Blog posts `__ - -Papers: - -- `ZeRO: Memory Optimizations Toward Training Trillion Parameter Models `__ -- `ZeRO-Offload: Democratizing Billion-Scale Model Training `__ -- `ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning `__ - -Finally, please, remember that, HuggingFace :class:`~transformers.Trainer` only integrates DeepSpeed, therefore if you -have any problems or questions with regards to DeepSpeed usage, please, file an issue with `DeepSpeed GitHub -`__. diff --git a/docs/source/testing.mdx b/docs/source/testing.mdx new file mode 100644 index 0000000000..6e9afd0087 --- /dev/null +++ b/docs/source/testing.mdx @@ -0,0 +1,1189 @@ + + +# Testing + + +Let's take a look at how 🤗 Transformer models are tested and how you can write new tests and improve the existing ones. + +There are 2 test suites in the repository: + +1. `tests` -- tests for the general API +2. `examples` -- tests primarily for various applications that aren't part of the API + +## How transformers are tested + +1. Once a PR is submitted it gets tested with 9 CircleCi jobs. Every new commit to that PR gets retested. These jobs + are defined in this [config file](https://github.com/huggingface/transformers-doc2mdx/tree/master/.circleci/config.yml), so that if needed you can reproduce the same + environment on your machine. + + These CI jobs don't run `@slow` tests. + +2. There are 3 jobs run by [github actions](https://github.com/huggingface/transformers/actions): + + - [torch hub integration](https://github.com/huggingface/transformers-doc2mdx/tree/master/.github/workflows/github-torch-hub.yml): checks whether torch hub + integration works. 
+ + - [self-hosted (push)](https://github.com/huggingface/transformers-doc2mdx/tree/master/.github/workflows/self-push.yml): runs fast tests on GPU only on commits on + `master`. It only runs if a commit on `master` has updated the code in one of the following folders: `src`, + `tests`, `.github` (to prevent running on added model cards, notebooks, etc.) + + - [self-hosted runner](https://github.com/huggingface/transformers-doc2mdx/tree/master/.github/workflows/self-scheduled.yml): runs normal and slow tests on GPU in + `tests` and `examples`: + +```bash +RUN_SLOW=1 pytest tests/ +RUN_SLOW=1 pytest examples/ +``` + + The results can be observed [here](https://github.com/huggingface/transformers/actions). + + + +## Running tests + + + + + +### Choosing which tests to run + +This document goes into many details of how tests can be run. If after reading everything, you need even more details +you will find them [here](https://docs.pytest.org/en/latest/usage.html). + +Here are some of the most useful ways of running tests. + +Run all: + +```console +pytest +``` + +or: + +```bash +make test +``` + +Note that the latter is defined as: + +```bash +python -m pytest -n auto --dist=loadfile -s -v ./tests/ +``` + +which tells pytest to: + +- run as many test processes as there are CPU cores (which could be too many if you don't have a ton of RAM!) +- ensure that all tests from the same file will be run by the same test process +- do not capture output +- run in verbose mode + + + +### Getting the list of all tests + +All tests of the test suite: + +```bash +pytest --collect-only -q +``` + +All tests of a given test file: + +```bash +pytest tests/test_optimization.py --collect-only -q +``` + +### Run a specific test module + +To run an individual test module: + +```bash +pytest tests/test_logging.py +``` + +### Run specific tests + +Since unittest is used inside most of the tests, to run specific subtests you need to know the name of the unittest +class containing those tests.
For example, it could be: + +```bash +pytest tests/test_optimization.py::OptimizationTest::test_adam_w +``` + +Here: + +- `tests/test_optimization.py` - the file with tests +- `OptimizationTest` - the name of the class +- `test_adam_w` - the name of the specific test function + +If the file contains multiple classes, you can choose to run only tests of a given class. For example: + +```bash +pytest tests/test_optimization.py::OptimizationTest +``` + +will run all the tests inside that class. + +As mentioned earlier you can see what tests are contained inside the `OptimizationTest` class by running: + +```bash +pytest tests/test_optimization.py::OptimizationTest --collect-only -q +``` + +You can run tests by keyword expressions. + +To run only tests whose name contains `adam`: + +```bash +pytest -k adam tests/test_optimization.py +``` + +Logical `and` and `or` can be used to indicate whether all keywords should match or either. `not` can be used to +negate. + +To run all tests except those whose name contains `adam`: + +```bash +pytest -k "not adam" tests/test_optimization.py +``` + +And you can combine the two patterns in one: + +```bash +pytest -k "ada and not adam" tests/test_optimization.py +``` + +For example to run both `test_adafactor` and `test_adam_w` you can use: + +```bash +pytest -k "test_adafactor or test_adam_w" tests/test_optimization.py +``` + +Note that we use `or` here, since we want either of the keywords to match to include both. + +If you want to include only tests that include both patterns, `and` is to be used: + +```bash +pytest -k "test and ada" tests/test_optimization.py +``` + +### Run only modified tests + +You can run the tests related to the unstaged files or the current branch (according to Git) by using [pytest-picked](https://github.com/anapaulagomes/pytest-picked). This is a great way of quickly testing your changes didn't break +anything, since it won't run the tests related to files you didn't touch.
+ +```bash +pip install pytest-picked +``` + +```bash +pytest --picked +``` + +All tests will be run from files and folders which are modified, but not yet committed. + +### Automatically rerun failed tests on source modification + +[pytest-xdist](https://github.com/pytest-dev/pytest-xdist) provides a very useful feature of detecting all failed +tests, and then waiting for you to modify files and continuously rerunning those failing tests until they pass while you +fix them, so that you don't need to restart pytest after you made the fix. This is repeated until all tests pass, after +which a full run is performed again. + +```bash +pip install pytest-xdist +``` + +To enter the mode: `pytest -f` or `pytest --looponfail` + +File changes are detected by looking at `looponfailroots` root directories and all of their contents (recursively). +If the default for this value does not work for you, you can change it in your project by setting a configuration +option in `setup.cfg`: + +```ini +[tool:pytest] +looponfailroots = transformers tests +``` + +or `pytest.ini`/`tox.ini` files: + +```ini +[pytest] +looponfailroots = transformers tests +``` + +This would lead to only looking for file changes in the respective directories, specified relatively to the ini-file’s +directory. + +[pytest-watch](https://github.com/joeyespo/pytest-watch) is an alternative implementation of this functionality. + + +### Skip a test module + +If you want to run all test modules, except a few you can exclude them by giving an explicit list of tests to run. For +example, to run all except `test_modeling_*.py` tests: + +```bash +pytest $(ls -1 tests/*py | grep -v test_modeling) +``` + +### Clearing state + +For CI builds, and when isolation matters more than speed, the cache should be cleared: + +```bash +pytest --cache-clear tests +``` + +### Running tests in parallel + +As mentioned earlier `make test` runs tests in parallel via `pytest-xdist` plugin (`-n X` argument, e.g.
`-n 2` +to run 2 parallel jobs). + +`pytest-xdist`'s `--dist=` option allows one to control how the tests are grouped. `--dist=loadfile` puts the +tests located in one file onto the same process. + +Since the order of executed tests is different and unpredictable, if running the test suite with `pytest-xdist` +produces failures (meaning we have some undetected coupled tests), use [pytest-replay](https://github.com/ESSS/pytest-replay) to replay the tests in the same order, which should help with then somehow +reducing that failing sequence to a minimum. + +### Test order and repetition + +It's good to repeat the tests several times, in sequence, randomly, or in sets, to detect any potential +inter-dependency and state-related bugs (tear down). And the straightforward multiple repetition is just good to detect +some problems that get uncovered by randomness of DL. + + +#### Repeat tests + +- [pytest-flakefinder](https://github.com/dropbox/pytest-flakefinder): + +```bash +pip install pytest-flakefinder +``` + +And then run every test multiple times (50 by default): + +```bash +pytest --flake-finder --flake-runs=5 tests/test_failing_test.py +``` + + + +This plugin doesn't work with `-n` flag from `pytest-xdist`. + + + + + +There is another plugin `pytest-repeat`, but it doesn't work with `unittest`. + + + +#### Run tests in a random order + +```bash +pip install pytest-random-order +``` + +Important: the presence of `pytest-random-order` will automatically randomize tests, no configuration change or +command line options is required. + +As explained earlier this allows detection of coupled tests - where one test's state affects the state of another. When +`pytest-random-order` is installed it will print the random seed it used for that session, e.g: + +```bash +pytest tests +[...] 
+Using --random-order-bucket=module +Using --random-order-seed=573663 +``` + +So that if the given particular sequence fails, you can reproduce it by adding that exact seed, e.g.: + +```bash +pytest --random-order-seed=573663 +[...] +Using --random-order-bucket=module +Using --random-order-seed=573663 +``` + +It will only reproduce the exact order if you use the exact same list of tests (or no list at all). Once you start +manually narrowing down the list you can no longer rely on the seed, but have to list them manually in the exact order +they failed and tell pytest to not randomize them instead using `--random-order-bucket=none`, e.g.: + +```bash +pytest --random-order-bucket=none tests/test_a.py tests/test_c.py tests/test_b.py +``` + +To disable the shuffling for all tests: + +```bash +pytest --random-order-bucket=none +``` + +By default `--random-order-bucket=module` is implied, which will shuffle the files on the module levels. It can also +shuffle on `class`, `package`, `global` and `none` levels. For the complete details please see its +[documentation](https://github.com/jbasko/pytest-random-order). + +Another randomization alternative is: [`pytest-randomly`](https://github.com/pytest-dev/pytest-randomly). This +module has a very similar functionality/interface, but it doesn't have the bucket modes available in +`pytest-random-order`. It has the same problem of imposing itself once installed. + +### Look and feel variations + +#### pytest-sugar + +[pytest-sugar](https://github.com/Frozenball/pytest-sugar) is a plugin that improves the look-n-feel, adds a +progressbar, and shows tests that fail and the assert instantly. It gets activated automatically upon installation. + +```bash +pip install pytest-sugar +``` + +To run tests without it, run: + +```bash +pytest -p no:sugar +``` + +or uninstall it.
+ + + +#### Report each sub-test name and its progress + +For a single or a group of tests via `pytest` (after `pip install pytest-pspec`): + +```bash +pytest --pspec tests/test_optimization.py +``` + +#### Instantly shows failed tests + +[pytest-instafail](https://github.com/pytest-dev/pytest-instafail) shows failures and errors instantly instead of +waiting until the end of test session. + +```bash +pip install pytest-instafail +``` + +```bash +pytest --instafail +``` + +### To GPU or not to GPU + +On a GPU-enabled setup, to test in CPU-only mode add `CUDA_VISIBLE_DEVICES=""`: + +```bash +CUDA_VISIBLE_DEVICES="" pytest tests/test_logging.py +``` + +or if you have multiple gpus, you can specify which one is to be used by `pytest`. For example, to use only the +second gpu if you have gpus `0` and `1`, you can run: + +```bash +CUDA_VISIBLE_DEVICES="1" pytest tests/test_logging.py +``` + +This is handy when you want to run different tasks on different GPUs. + +Some tests must be run on CPU-only, others on either CPU or GPU or TPU, yet others on multiple-GPUs. 
The following skip +decorators are used to set the requirements of tests CPU/GPU/TPU-wise: + +- `require_torch` - this test will run only under torch +- `require_torch_gpu` - as `require_torch` plus requires at least 1 GPU +- `require_torch_multi_gpu` - as `require_torch` plus requires at least 2 GPUs +- `require_torch_non_multi_gpu` - as `require_torch` plus requires 0 or 1 GPUs +- `require_torch_up_to_2_gpus` - as `require_torch` plus requires 0 or 1 or 2 GPUs +- `require_torch_tpu` - as `require_torch` plus requires at least 1 TPU + +Let's depict the GPU requirements in the following table: + + +| n gpus | decorator | +|--------|--------------------------------| +| `>= 0` | `@require_torch` | +| `>= 1` | `@require_torch_gpu` | +| `>= 2` | `@require_torch_multi_gpu` | +| `< 2` | `@require_torch_non_multi_gpu` | +| `< 3` | `@require_torch_up_to_2_gpus` | + + +For example, here is a test that must be run only when there are 2 or more GPUs available and pytorch is installed: + +```python +@require_torch_multi_gpu +def test_example_with_multi_gpu(): +``` + +If a test requires `tensorflow` use the `require_tf` decorator. For example: + +```python +@require_tf +def test_tf_thing_with_tensorflow(): +``` + +These decorators can be stacked. For example, if a test is slow and requires at least one GPU under pytorch, here is +how to set it up: + +```python +@require_torch_gpu +@slow +def test_example_slow_on_gpu(): +``` + +Some decorators like `@parameterized` rewrite test names, therefore `@require_*` skip decorators have to be listed +last for them to work correctly. Here is an example of the correct usage: + +```python +@parameterized.expand(...) +@require_torch_multi_gpu +def test_integration_foo(): +``` + +This order problem doesn't exist with `@pytest.mark.parametrize`: you can put it first or last and it will still +work. But it only works with non-unittests. 
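
The `require_*` decorators themselves live in `transformers.testing_utils` and are not reproduced here. As a rough sketch of how a stackable skip decorator of this kind can be built on top of `unittest.skipUnless`, assuming a hypothetical `require_min_gpus` helper and a stand-in `fake_gpu_count` function (neither is part of the library):

```python
import unittest


def fake_gpu_count():
    # Stand-in for a real device query; hypothetical, always reports 1 GPU.
    return 1


def require_min_gpus(n):
    # Hypothetical helper: skip the decorated test unless at least `n` GPUs are reported.
    return unittest.skipUnless(fake_gpu_count() >= n, f"test requires at least {n} GPU(s)")


class GpuRequirementExample(unittest.TestCase):
    @require_min_gpus(1)
    def test_single_gpu(self):
        self.assertTrue(True)

    @require_min_gpus(2)
    def test_multi_gpu(self):
        self.fail("never runs while fewer than 2 GPUs are reported")


def run_example():
    # Run the example suite without printing and return the unittest result object.
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(GpuRequirementExample)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Because the helper only wraps a standard skip decorator, it stacks with other decorators in the same way the `require_*` family does.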
+ +Inside tests: + +- How many GPUs are available: + +```python +from transformers.testing_utils import get_gpu_count +n_gpu = get_gpu_count() # works with torch and tf +``` + +### Distributed training + +`pytest` can't deal with distributed training directly. If this is attempted - the sub-processes don't do the right +thing and end up thinking they are `pytest` and start running the test suite in loops. It works, however, if one +spawns a normal process that then spawns off multiple workers and manages the IO pipes. + +Here are some tests that use it: + +- [test_trainer_distributed.py](https://github.com/huggingface/transformers-doc2mdx/tree/master/tests/test_trainer_distributed.py) +- [test_deepspeed.py](https://github.com/huggingface/transformers-doc2mdx/tree/master/tests/deepspeed/test_deepspeed.py) + +To jump right into the execution point, search for the `execute_subprocess_async` call in those tests. + +You will need at least 2 GPUs to see these tests in action: + +```bash +CUDA_VISIBLE_DEVICES=0,1 RUN_SLOW=1 pytest -sv tests/test_trainer_distributed.py +``` + +### Output capture + +During test execution any output sent to `stdout` and `stderr` is captured. If a test or a setup method fails, its +according captured output will usually be shown along with the failure traceback. + +To disable output capturing and to get the `stdout` and `stderr` normally, use `-s` or `--capture=no`: + +```bash +pytest -s tests/test_logging.py +``` + +To send test results to JUnit format output: + +```bash +py.test tests --junitxml=result.xml +``` + +### Color control + +To have no color (e.g., yellow on white background is not readable): + +```bash +pytest --color=no tests/test_logging.py +``` + +### Sending test report to online pastebin service + +Creating a URL for each test failure: + +```bash +pytest --pastebin=failed tests/test_logging.py +``` + +This will submit test run information to a remote Paste service and provide a URL for each failure. 
You may select +tests as usual or add, for example, `-x` if you only want to send one particular failure. + +Creating a URL for a whole test session log: + +```bash +pytest --pastebin=all tests/test_logging.py +``` + +## Writing tests + +🤗 transformers tests are based on `unittest`, but run by `pytest`, so most of the time features from both systems +can be used. + +You can read [here](https://docs.pytest.org/en/stable/unittest.html) which features are supported, but the important +thing to remember is that most `pytest` fixtures don't work. Neither does parametrization, but we use the module +`parameterized` that works in a similar way. + + +### Parametrization + +Often, there is a need to run the same test multiple times, but with different arguments. It could be done from within +the test, but then there is no way of running that test for just one set of arguments. + +```python +# test_this1.py +import math +import unittest +from parameterized import parameterized +class TestMathUnitTest(unittest.TestCase): + @parameterized.expand([ + ("negative", -1.5, -2.0), + ("integer", 1, 1.0), + ("large fraction", 1.6, 1), + ]) + def test_floor(self, name, input, expected): + self.assertEqual(math.floor(input), expected) +``` + +Now, by default this test will be run 3 times, each time with the last 3 arguments of `test_floor` being assigned the +corresponding arguments in the parameter list. + +And you could run just the `negative` and `integer` sets of params with: + +```bash +pytest -k "negative or integer" tests/test_mytest.py +``` + +or all but `negative` sub-tests, with: + +```bash +pytest -k "not negative" tests/test_mytest.py +``` + +Besides using the `-k` filter that was just mentioned, you can find out the exact name of each sub-test and run any +or all of them using their exact names. 
+ +```bash +pytest test_this1.py --collect-only -q +``` + +and it will list: + +```bash +test_this1.py::TestMathUnitTest::test_floor_0_negative +test_this1.py::TestMathUnitTest::test_floor_1_integer +test_this1.py::TestMathUnitTest::test_floor_2_large_fraction +``` + +So now you can run just 2 specific sub-tests: + +```bash +pytest test_this1.py::TestMathUnitTest::test_floor_0_negative test_this1.py::TestMathUnitTest::test_floor_1_integer +``` + +The module [parameterized](https://pypi.org/project/parameterized/) which is already in the developer dependencies +of `transformers` works for both: `unittests` and `pytest` tests. + +If, however, the test is not a `unittest`, you may use `pytest.mark.parametrize` (or you may see it being used in +some existing tests, mostly under `examples`). + +Here is the same example, this time using `pytest`'s `parametrize` marker: + +```python +# test_this2.py +import pytest +@pytest.mark.parametrize( + "name, input, expected", + [ + ("negative", -1.5, -2.0), + ("integer", 1, 1.0), + ("large fraction", 1.6, 1), + ], +) +def test_floor(name, input, expected): + assert_equal(math.floor(input), expected) +``` + +Same as with `parameterized`, with `pytest.mark.parametrize` you can have a fine control over which sub-tests are +run, if the `-k` filter doesn't do the job. Except, this parametrization function creates a slightly different set of +names for the sub-tests. Here is what they look like: + +```bash +pytest test_this2.py --collect-only -q +``` + +and it will list: + +```bash +test_this2.py::test_floor[integer-1-1.0] +test_this2.py::test_floor[negative--1.5--2.0] +test_this2.py::test_floor[large fraction-1.6-1] +``` + +So now you can run just the specific test: + +```bash +pytest test_this2.py::test_floor[negative--1.5--2.0] test_this2.py::test_floor[integer-1-1.0] +``` + +as in the previous example. 
+ + +### Files and directories + +In tests we often need to know where things are relative to the current test file, and it's not trivial since the test +could be invoked from more than one directory or could reside in sub-directories with different depths. A helper class +`transformers.testing_utils.TestCasePlus` solves this problem by sorting out all the basic paths and providing easy +accessors to them: + +- `pathlib` objects (all fully resolved): + + - `test_file_path` - the current test file path, i.e. `__file__` + - `test_file_dir` - the directory containing the current test file + - `tests_dir` - the directory of the `tests` test suite + - `examples_dir` - the directory of the `examples` test suite + - `repo_root_dir` - the directory of the repository + - `src_dir` - the directory of `src` (i.e. where the `transformers` sub-dir resides) + +- stringified paths: same as above, but these return paths as strings rather than `pathlib` objects: + + - `test_file_path_str` + - `test_file_dir_str` + - `tests_dir_str` + - `examples_dir_str` + - `repo_root_dir_str` + - `src_dir_str` + +To start using those all you need is to make sure that the test resides in a subclass of +`transformers.testing_utils.TestCasePlus`. For example: + +```python +from transformers.testing_utils import TestCasePlus +class PathExampleTest(TestCasePlus): + def test_something_involving_local_locations(self): + data_dir = self.tests_dir / "fixtures/tests_samples/wmt_en_ro" +``` + +If you don't need to manipulate paths via `pathlib` or you just need a path as a string, you can always invoke +`str()` on the `pathlib` object or use the accessors ending with `_str`. 
For example: + +```python +from transformers.testing_utils import TestCasePlus +class PathExampleTest(TestCasePlus): + def test_something_involving_stringified_locations(self): + examples_dir = self.examples_dir_str +``` + +### Temporary files and directories + +Using unique temporary files and directories is essential for parallel test running, so that the tests won't overwrite +each other's data. Also we want to get the temporary files and directories removed at the end of each test that created +them. Therefore, using packages like `tempfile`, which address these needs, is essential. + +However, when debugging tests, you need to be able to see what goes into the temporary file or directory and you want +to know its exact path and not have it randomized on every test re-run. + +A helper class `transformers.testing_utils.TestCasePlus` is best used for such purposes. It's a sub-class of +`unittest.TestCase`, so we can easily inherit from it in the test modules. + +Here is an example of its usage: + +```python +from transformers.testing_utils import TestCasePlus +class ExamplesTests(TestCasePlus): + def test_whatever(self): + tmp_dir = self.get_auto_remove_tmp_dir() +``` + +This code creates a unique temporary directory, and sets `tmp_dir` to its location. + +- Create a unique temporary dir: + +```python +def test_whatever(self): + tmp_dir = self.get_auto_remove_tmp_dir() +``` + +`tmp_dir` will contain the path to the created temporary dir. It will be automatically removed at the end of the +test. + +- Create a temporary dir of my choice, ensure it's empty before the test starts and don't empty it after the test. + +```python +def test_whatever(self): + tmp_dir = self.get_auto_remove_tmp_dir("./xxx") +``` + +This is useful for debugging when you want to monitor a specific directory and want to make sure the previous tests didn't +leave any data in there. 
+ +- You can override the default behavior by directly overriding the `before` and `after` args, leading to one of the + following behaviors: + + - `before=True`: the temporary dir will always be cleared at the beginning of the test. + - `before=False`: if the temporary dir already existed, any existing files will remain there. + - `after=True`: the temporary dir will always be deleted at the end of the test. + - `after=False`: the temporary dir will always be left intact at the end of the test. + + + +In order to run the equivalent of `rm -r` safely, only subdirs of the project repository checkout are allowed if +an explicit `tmp_dir` is used, so that by mistake no `/tmp` or similar important part of the filesystem will +get nuked. That is, please always pass paths that start with `./`. + + + + + +Each test can register multiple temporary directories and they all will get auto-removed, unless requested +otherwise. + + + +### Temporary sys.path override + +If you need to temporarily override `sys.path`, for example to import from another test, you can use the +`ExtendSysPath` context manager. Example: + + +```python +import os +from transformers.testing_utils import ExtendSysPath +bindir = os.path.abspath(os.path.dirname(__file__)) +with ExtendSysPath(f"{bindir}/.."): + from test_trainer import TrainerIntegrationCommon # noqa +``` + +### Skipping tests + +This is useful when a bug is found and a new test is written, yet the bug is not fixed yet. In order to be able to +commit it to the main repository we need to make sure it's skipped during `make test`. + +Methods: + +- A **skip** means that you expect your test to pass only if some conditions are met, otherwise pytest should skip + running the test altogether. Common examples are skipping windows-only tests on non-windows platforms, or skipping + tests that depend on an external resource which is not available at the moment (for example a database). + +- A **xfail** means that you expect a test to fail for some reason. 
A common example is a test for a feature not yet + implemented, or a bug not yet fixed. When a test passes despite being expected to fail (marked with + pytest.mark.xfail), it’s an xpass and will be reported in the test summary. + +One of the important differences between the two is that `skip` doesn't run the test, and `xfail` does. So if the +code that's buggy causes some bad state that will affect other tests, do not use `xfail`. + +#### Implementation + +- Here is how to skip whole test unconditionally: + +```python +@unittest.skip("this bug needs to be fixed") +def test_feature_x(): +``` + +or via pytest: + +```python +@pytest.mark.skip(reason="this bug needs to be fixed") +``` + +or the `xfail` way: + +```python +@pytest.mark.xfail +def test_feature_x(): +``` + +- Here is how to skip a test based on some internal check inside the test: + +```python +def test_feature_x(): + if not has_something(): + pytest.skip("unsupported configuration") +``` + +or the whole module: + +```python +import pytest +if not pytest.config.getoption("--custom-flag"): + pytest.skip("--custom-flag is missing, skipping tests", allow_module_level=True) +``` + +or the `xfail` way: + +```python +def test_feature_x(): + pytest.xfail("expected to fail until bug XYZ is fixed") +``` + +- Here is how to skip all tests in a module if some import is missing: + +```python +docutils = pytest.importorskip("docutils", minversion="0.3") +``` + +- Skip a test based on a condition: + +```python +@pytest.mark.skipif(sys.version_info < (3,6), reason="requires python3.6 or higher") +def test_feature_x(): +``` + +or: + +```python +@unittest.skipIf(torch_device == "cpu", "Can't do half precision") +def test_feature_x(): +``` + +or skip the whole module: + +```python +@pytest.mark.skipif(sys.platform == 'win32', reason="does not run on windows") +class TestClass(): + def test_feature_x(self): +``` + +More details, example and ways are [here](https://docs.pytest.org/en/latest/skipping.html). 
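
To make the difference between skipping and expecting a failure concrete, here is a minimal sketch using only the standard library. It assumes `unittest.expectedFailure` as the closest stdlib analogue of `pytest.mark.xfail`, and the class and `executed` list are purely illustrative names:

```python
import unittest

executed = []  # records which test bodies actually ran


class SkipVersusExpectedFailureExample(unittest.TestCase):
    @unittest.skip("this bug needs to be fixed")
    def test_skipped(self):
        executed.append("skipped")  # never reached: a skipped test body is not run

    @unittest.expectedFailure
    def test_expected_failure(self):
        executed.append("expected_failure")  # reached: the body runs despite the marker
        self.assertEqual(1, 2)  # stands in for the known bug


def run_example():
    # Run both tests without printing and return the unittest result object.
    executed.clear()
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SkipVersusExpectedFailureExample)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Only the expected-failure test leaves a trace in `executed`, which is why a test whose buggy code corrupts shared state should be skipped rather than marked as an expected failure.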
+ +### Slow tests + +The library of tests is ever-growing, and some of the tests take minutes to run, therefore we can't afford waiting for +an hour for the test suite to complete on CI. Therefore, with some exceptions for essential tests, slow tests should be +marked as in the example below: + +```python +from transformers.testing_utils import slow +@slow +def test_integration_foo(): +``` + +Once a test is marked as `@slow`, to run such tests set `RUN_SLOW=1` env var, e.g.: + +```bash +RUN_SLOW=1 pytest tests +``` + +Some decorators like `@parameterized` rewrite test names, therefore `@slow` and the rest of the skip decorators +`@require_*` have to be listed last for them to work correctly. Here is an example of the correct usage: + +```python +@parameterized.expand(...) +@slow +def test_integration_foo(): +``` + +As explained at the beginning of this document, slow tests get to run on a scheduled basis, rather than in PRs CI +checks. So it's possible that some problems will be missed during a PR submission and get merged. Such problems will +get caught during the next scheduled CI job. But it also means that it's important to run the slow tests on your +machine before submitting the PR. + +Here is a rough decision making mechanism for choosing which tests should be marked as slow: + +If the test is focused on one of the library's internal components (e.g., modeling files, tokenization files, +pipelines), then we should run that test in the non-slow test suite. If it's focused on an other aspect of the library, +such as the documentation or the examples, then we should run these tests in the slow test suite. And then, to refine +this approach we should have exceptions: + +- All tests that need to download a heavy set of weights or a dataset that is larger than ~50MB (e.g., model or + tokenizer integration tests, pipeline integration tests) should be set to slow. 
If you're adding a new model, you + should create and upload to the hub a tiny version of it (with random weights) for integration tests. This is + discussed in the following paragraphs. +- All tests that need to do a training not specifically optimized to be fast should be set to slow. +- We can introduce exceptions if some of these should-be-non-slow tests are excruciatingly slow, and set them to + `@slow`. Auto-modeling tests, which save and load large files to disk, are a good example of tests that are marked + as `@slow`. +- If a test completes under 1 second on CI (including downloads if any) then it should be a normal test regardless. + +Collectively, all the non-slow tests need to fully cover the different internals, while remaining fast. For example, +significant coverage can be achieved by testing with specially created tiny models with random weights. Such models +have a minimal number of layers (e.g., 2), vocab size (e.g., 1000), etc. Then the `@slow` tests can use large +slow models to do qualitative testing. To see the use of these simply look for *tiny* models with: + +```bash +grep -r tiny tests examples +``` + +Here is an example of a [script](https://github.com/huggingface/transformers-doc2mdx/tree/master/scripts/fsmt/fsmt-make-tiny-model.py) that created the tiny model +[stas/tiny-wmt19-en-de](https://huggingface.co/stas/tiny-wmt19-en-de). You can easily adjust it to your specific +model's architecture. + +It's easy to measure the run-time incorrectly if for example there is an overhead of downloading a huge model, but if +you test it locally the downloaded files would be cached and thus the download time not measured. Hence check the +execution speed report in CI logs instead (the output of `pytest --durations=0 tests`). + +That report is also useful to find slow outliers that aren't marked as such, or which need to be re-written to be fast. 
+If you notice that the test suite starts getting slow on CI, the top listing of this report will show the slowest +tests. + + +### Testing the stdout/stderr output + +In order to test functions that write to `stdout` and/or `stderr`, the test can access those streams using the +`pytest`'s [capsys system](https://docs.pytest.org/en/latest/capture.html). Here is how this is accomplished: + +```python +import sys +def print_to_stdout(s): print(s) +def print_to_stderr(s): sys.stderr.write(s) +def test_result_and_stdout(capsys): + msg = "Hello" + print_to_stdout(msg) + print_to_stderr(msg) + out, err = capsys.readouterr() # consume the captured output streams + # optional: if you want to replay the consumed streams: + sys.stdout.write(out) + sys.stderr.write(err) + # test: + assert msg in out + assert msg in err +``` + +And, of course, most of the time, `stderr` will come as a part of an exception, so try/except has to be used in such +a case: + +```python +def raise_exception(msg): raise ValueError(msg) +def test_something_exception(): + msg = "Not a good value" + error = '' + try: + raise_exception(msg) + except Exception as e: + error = str(e) + assert msg in error, f"{msg} is in the exception:\n{error}" +``` + +Another approach to capturing stdout is via `contextlib.redirect_stdout`: + +```python +from io import StringIO +from contextlib import redirect_stdout +def print_to_stdout(s): print(s) +def test_result_and_stdout(): + msg = "Hello" + buffer = StringIO() + with redirect_stdout(buffer): + print_to_stdout(msg) + out = buffer.getvalue() + # optional: if you want to replay the consumed streams: + sys.stdout.write(out) + # test: + assert msg in out +``` + +An important potential issue with capturing stdout is that it may contain `\r` characters that in normal `print` +reset everything that has been printed so far. 
There is no problem with `pytest`, but with `pytest -s` these +characters get included in the buffer, so to be able to have the test run with and without `-s`, you have to make an +extra cleanup to the captured output, using `re.sub(r'^.*\r', '', buf, 0, re.M)`. + +But, then we have a helper context manager wrapper to automatically take care of it all, regardless of whether it has +some `\r`'s in it or not, so it's as simple as: + +```python +from transformers.testing_utils import CaptureStdout +with CaptureStdout() as cs: + function_that_writes_to_stdout() +print(cs.out) +``` + +Here is a full test example: + +```python +from transformers.testing_utils import CaptureStdout +msg = "Secret message\r" +final = "Hello World" +with CaptureStdout() as cs: + print(msg + final) +assert cs.out == final+"\n", f"captured: {cs.out}, expecting {final}" +``` + +If you'd like to capture `stderr` use the `CaptureStderr` class instead: + +```python +from transformers.testing_utils import CaptureStderr +with CaptureStderr() as cs: + function_that_writes_to_stderr() +print(cs.err) +``` + +If you need to capture both streams at once, use the parent `CaptureStd` class: + +```python +from transformers.testing_utils import CaptureStd +with CaptureStd() as cs: + function_that_writes_to_stdout_and_stderr() +print(cs.err, cs.out) +``` + +Also, to aid debugging test issues, by default these context managers automatically replay the captured streams on exit +from the context. 
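
For readers who want to see the carriage-return cleanup in isolation, here is a minimal standard-library sketch. It assumes the intent is to drop everything on a line up to and including the last `\r`, and the helper name `strip_carriage_return_overwrites` is hypothetical, not part of `transformers.testing_utils`:

```python
import re
from contextlib import redirect_stdout
from io import StringIO


def strip_carriage_return_overwrites(buf):
    # On each line, remove everything up to and including the last "\r",
    # keeping only what a terminal would still display after the overwrite.
    return re.sub(r"^.*\r", "", buf, flags=re.M)


buffer = StringIO()
with redirect_stdout(buffer):
    print("Secret message\r" + "Hello World")

raw = buffer.getvalue()  # still contains the "\r" and the overwritten text
cleaned = strip_carriage_return_overwrites(raw)
```

Here `cleaned` keeps only the text after the carriage return, so an assertion written against it behaves the same with and without `-s`.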
+ + +### Capturing logger stream + +If you need to validate the output of a logger, you can use `CaptureLogger`: + +```python +from transformers import logging +from transformers.testing_utils import CaptureLogger + +msg = "Testing 1, 2, 3" +logging.set_verbosity_info() +logger = logging.get_logger("transformers.models.bart.tokenization_bart") +with CaptureLogger(logger) as cl: + logger.info(msg) +assert cl.out == msg + "\n" +``` + +### Testing with environment variables + +If you want to test the impact of environment variables for a specific test you can use a helper decorator +`transformers.testing_utils.mockenv`: + +```python +import os +import unittest +from transformers.testing_utils import mockenv +class HfArgumentParserTest(unittest.TestCase): + @mockenv(TRANSFORMERS_VERBOSITY="error") + def test_env_override(self): + env_level_str = os.getenv("TRANSFORMERS_VERBOSITY", None) +``` + +At times an external program needs to be called, which requires setting `PYTHONPATH` in `os.environ` to include +multiple local paths. A helper class `transformers.testing_utils.TestCasePlus` comes to help: + +```python +from transformers.testing_utils import TestCasePlus +class EnvExampleTest(TestCasePlus): + def test_external_prog(self): + env = self.get_env() + # now call the external program, passing `env` to it +``` + +Depending on whether the test file was under the `tests` test suite or `examples` it'll correctly set up +`env[PYTHONPATH]` to include one of these two directories, and also the `src` directory to ensure the testing is +done against the current repo, and finally with whatever `env[PYTHONPATH]` was already set to before the test was +called if anything. + +This helper method creates a copy of the `os.environ` object, so the original remains intact. + + +### Getting reproducible results + +In some situations you may want to remove randomness for your tests. 
To get identical, reproducible results, you +will need to fix the seed: + +```python +seed = 42 + +# python RNG +import random +random.seed(seed) + +# pytorch RNGs +import torch +torch.manual_seed(seed) +torch.backends.cudnn.deterministic = True +if torch.cuda.is_available(): torch.cuda.manual_seed_all(seed) + +# numpy RNG +import numpy as np +np.random.seed(seed) + +# tf RNG +import tensorflow as tf +tf.random.set_seed(seed) +``` + +### Debugging tests + +To start a debugger at the point of the warning, do this: + +```bash +pytest tests/test_logging.py -W error::UserWarning --pdb +``` + +## Working with github actions workflows + +To trigger a self-push workflow CI job, you must: + +1. Create a new branch on `transformers` origin (not a fork!). +2. The branch name has to start with either `ci_` or `ci-` (`master` triggers it too, but we can't do PRs on + `master`). It also gets triggered only for specific paths - you can find the up-to-date definition in case it + changed since this document has been written [here](https://github.com/huggingface/transformers/blob/master/.github/workflows/self-push.yml) under *push:* +3. Create a PR from this branch. +4. Then you can see the job appear [here](https://github.com/huggingface/transformers/actions/workflows/self-push.yml). It may not run right away if there + is a backlog. + + + + +## Testing Experimental CI Features + +Testing CI features can be potentially problematic as it can interfere with the normal CI functioning. Therefore if a +new CI feature is to be added, it should be done as follows. + +1. Create a new dedicated job that tests what needs to be tested. +2. The new job must always succeed so that it gives us a green ✓ (details below). +3. Let it run for some days to see that a variety of different PR types get to run on it (user fork branches, + non-forked branches, branches originating from github.com UI direct file edit, various forced pushes, etc. 
- there + are so many) while monitoring the experimental job's logs (not the overall job green as it's purposefully always + green) +4. When it's clear that everything is solid, then merge the new changes into existing jobs. + +That way experiments on CI functionality itself won't interfere with the normal workflow. + +Now how can we make the job always succeed while the new CI feature is being developed? + +Some CIs, like TravisCI support ignore-step-failure and will report the overall job as successful, but CircleCI and +Github Actions as of this writing don't support that. + +So the following workaround can be used: + +1. `set +euo pipefail` at the beginning of the run command to suppress most potential failures in the bash script. +2. the last command must be a success: `echo "done"` or just `true` will do + +Here is an example: + +```yaml +- run: + name: run CI experiment + command: | + set +euo pipefail + echo "setting run-all-despite-any-errors-mode" + this_command_will_fail + echo "but bash continues to run" + # emulate another failure + false + # but the last command must be a success + echo "during experiment do not remove: reporting success to CI, even if there were failures" +``` + +For simple commands you could also do: + +```bash +cmd_that_may_fail || true +``` + +Of course, once satisfied with the results, integrate the experimental step or job with the rest of the normal jobs, +while removing `set +euo pipefail` or any other things you may have added to ensure that the experimental job doesn't +interfere with the normal CI functioning. + +This whole process would have been much easier if we only could set something like `allow-failure` for the +experimental step, and let it fail without impacting the overall status of PRs. But as mentioned earlier CircleCI and +Github Actions don't support it at the moment. 
+ +You can vote for this feature and see where it is at at these CI-specific threads: + +- [Github Actions:](https://github.com/actions/toolkit/issues/399) +- [CircleCI:](https://ideas.circleci.com/ideas/CCI-I-344) diff --git a/docs/source/testing.rst b/docs/source/testing.rst deleted file mode 100644 index f057e8bbcf..0000000000 --- a/docs/source/testing.rst +++ /dev/null @@ -1,1252 +0,0 @@ -.. - Copyright 2020 The HuggingFace Team. All rights reserved. - - Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with - the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on - an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the - specific language governing permissions and limitations under the License. - -Testing -======================================================================================================================= - - -Let's take a look at how 🤗 Transformer models are tested and how you can write new tests and improve the existing ones. - -There are 2 test suites in the repository: - -1. ``tests`` -- tests for the general API -2. ``examples`` -- tests primarily for various applications that aren't part of the API - -How transformers are tested ------------------------------------------------------------------------------------------------------------------------ - -1. Once a PR is submitted it gets tested with 9 CircleCi jobs. Every new commit to that PR gets retested. These jobs - are defined in this :prefix_link:`config file <.circleci/config.yml>`, so that if needed you can reproduce the same - environment on your machine. - - These CI jobs don't run ``@slow`` tests. - -2. 
There are 3 jobs run by `github actions `__: - - * :prefix_link:`torch hub integration <.github/workflows/github-torch-hub.yml>`: checks whether torch hub - integration works. - - * :prefix_link:`self-hosted (push) <.github/workflows/self-push.yml>`: runs fast tests on GPU only on commits on - ``master``. It only runs if a commit on ``master`` has updated the code in one of the following folders: ``src``, - ``tests``, ``.github`` (to prevent running on added model cards, notebooks, etc.) - - * :prefix_link:`self-hosted runner <.github/workflows/self-scheduled.yml>`: runs normal and slow tests on GPU in - ``tests`` and ``examples``: - - .. code-block:: bash - - RUN_SLOW=1 pytest tests/ - RUN_SLOW=1 pytest examples/ - - The results can be observed `here `__. - - - -Running tests ------------------------------------------------------------------------------------------------------------------------ - - - - - -Choosing which tests to run -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -This document goes into many details of how tests can be run. If after reading everything, you need even more details -you will find them `here `__. - -Here are some most useful ways of running tests. - -Run all: - -.. code-block:: console - - pytest - -or: - -.. code-block:: bash - - make test - -Note that the latter is defined as: - -.. code-block:: bash - - python -m pytest -n auto --dist=loadfile -s -v ./tests/ - -which tells pytest to: - -* run as many test processes as they are CPU cores (which could be too many if you don't have a ton of RAM!) -* ensure that all tests from the same file will be run by the same test process -* do not capture output -* run in verbose mode - - - -Getting the list of all tests -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -All tests of the test suite: - -.. 
code-block:: bash - - pytest --collect-only -q - -All tests of a given test file: - -.. code-block:: bash - - pytest tests/test_optimization.py --collect-only -q - - - -Run a specific test module -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -To run an individual test module: - -.. code-block:: bash - - pytest tests/test_logging.py - - -Run specific tests -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Since unittest is used inside most of the tests, to run specific subtests you need to know the name of the unittest -class containing those tests. For example, it could be: - -.. code-block:: bash - - pytest tests/test_optimization.py::OptimizationTest::test_adam_w - -Here: - -* ``tests/test_optimization.py`` - the file with tests -* ``OptimizationTest`` - the name of the class -* ``test_adam_w`` - the name of the specific test function - -If the file contains multiple classes, you can choose to run only tests of a given class. For example: - -.. code-block:: bash - - pytest tests/test_optimization.py::OptimizationTest - - -will run all the tests inside that class. - -As mentioned earlier you can see what tests are contained inside the ``OptimizationTest`` class by running: - -.. code-block:: bash - - pytest tests/test_optimization.py::OptimizationTest --collect-only -q - -You can run tests by keyword expressions. - -To run only tests whose name contains ``adam``: - -.. code-block:: bash - - pytest -k adam tests/test_optimization.py - -Logical ``and`` and ``or`` can be used to indicate whether all keywords should match or either. ``not`` can be used to -negate. - -To run all tests except those whose name contains ``adam``: - -.. code-block:: bash - - pytest -k "not adam" tests/test_optimization.py - -And you can combine the two patterns in one: - -.. 
code-block:: bash - - pytest -k "ada and not adam" tests/test_optimization.py - -For example to run both ``test_adafactor`` and ``test_adam_w`` you can use: - -.. code-block:: bash - - pytest -k "test_adafactor or test_adam_w" tests/test_optimization.py - -Note that we use ``or`` here, since a test is selected when either of the keywords matches. - -If you want to include only tests that include both patterns, ``and`` is to be used: - -.. code-block:: bash - - pytest -k "test and ada" tests/test_optimization.py - - - -Run only modified tests -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -You can run the tests related to the unstaged files or the current branch (according to Git) by using `pytest-picked -`__. This is a great way of quickly checking that your changes didn't break -anything, since it won't run the tests related to files you didn't touch. - -.. code-block:: bash - - pip install pytest-picked - -.. code-block:: bash - - pytest --picked - -All tests will be run from files and folders which are modified, but not yet committed. - -Automatically rerun failed tests on source modification -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -`pytest-xdist `__ provides a very useful feature of detecting all failed -tests, and then waiting for you to modify files and continuously re-running those failing tests until they pass while you -fix them, so that you don't need to restart pytest after you have made the fix. This is repeated until all tests pass, after -which a full run is performed again. - -.. code-block:: bash - - pip install pytest-xdist - -To enter the mode: ``pytest -f`` or ``pytest --looponfail`` - -File changes are detected by looking at ``looponfailroots`` root directories and all of their contents (recursively). 
-If the default for this value does not work for you, you can change it in your project by setting a configuration -option in ``setup.cfg``: - -.. code-block:: ini - - [tool:pytest] - looponfailroots = transformers tests - -or ``pytest.ini``/``tox.ini`` files: - -.. code-block:: ini - - [pytest] - looponfailroots = transformers tests - -This would lead to only looking for file changes in the respective directories, specified relatively to the ini-file’s -directory. - -`pytest-watch `__ is an alternative implementation of this functionality. - - -Skip a test module -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -If you want to run all test modules, except a few you can exclude them by giving an explicit list of tests to run. For -example, to run all except ``test_modeling_*.py`` tests: - -.. code-block:: bash - - pytest `ls -1 tests/*py | grep -v test_modeling` - - -Clearing state -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -CI builds and when isolation is important (against speed), cache should be cleared: - -.. code-block:: bash - - pytest --cache-clear tests - -Running tests in parallel -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -As mentioned earlier ``make test`` runs tests in parallel via ``pytest-xdist`` plugin (``-n X`` argument, e.g. ``-n 2`` -to run 2 parallel jobs). - -``pytest-xdist``'s ``--dist=`` option allows one to control how the tests are grouped. ``--dist=loadfile`` puts the -tests located in one file onto the same process. 
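
The grouping behind ``--dist=loadfile`` can be pictured with a small sketch. This is only an illustration of the idea that every test from one file lands on the same worker: the function name ``group_by_file`` and the round-robin assignment are assumptions made for this example, not how ``pytest-xdist`` actually schedules work.

```python
from collections import defaultdict


def group_by_file(node_ids, num_workers):
    """Illustration only: keep all tests of one file on the same worker.

    `node_ids` are pytest-style ids such as "tests/test_a.py::TestA::test_x".
    The round-robin assignment below is a simplification chosen for this
    sketch; the real plugin has its own scheduler.
    """
    by_file = defaultdict(list)
    for node_id in node_ids:
        # Everything before the first "::" is the test file path.
        by_file[node_id.split("::", 1)[0]].append(node_id)

    workers = [[] for _ in range(num_workers)]
    for index, tests in enumerate(by_file.values()):
        workers[index % num_workers].extend(tests)
    return workers


if __name__ == "__main__":
    ids = [
        "tests/test_a.py::A::test_1",
        "tests/test_b.py::B::test_1",
        "tests/test_a.py::A::test_2",
        "tests/test_c.py::test_x",
    ]
    for worker_index, tests in enumerate(group_by_file(ids, 2)):
        print(worker_index, tests)
```

The point of keeping a file's tests together is that module-level setup and shared state are only exercised inside one process.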
- -Since the order of executed tests is different and unpredictable, if running the test suite with ``pytest-xdist`` -produces failures (meaning we have some undetected coupled tests), use `pytest-replay -`__ to replay the tests in the same order, which should help with then somehow -reducing that failing sequence to a minimum. - -Test order and repetition -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -It's good to repeat the tests several times, in sequence, randomly, or in sets, to detect any potential -inter-dependency and state-related bugs (tear down). And the straightforward multiple repetition is just good to detect -some problems that get uncovered by randomness of DL. - - -Repeat tests -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -* `pytest-flakefinder `__: - -.. code-block:: bash - - pip install pytest-flakefinder - -And then run every test multiple times (50 by default): - -.. code-block:: bash - - pytest --flake-finder --flake-runs=5 tests/test_failing_test.py - -.. note:: - This plugin doesn't work with ``-n`` flag from ``pytest-xdist``. - -.. note:: - There is another plugin ``pytest-repeat``, but it doesn't work with ``unittest``. - - -Run tests in a random order -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. code-block:: bash - - pip install pytest-random-order - -Important: the presence of ``pytest-random-order`` will automatically randomize tests, no configuration change or -command line options is required. - -As explained earlier this allows detection of coupled tests - where one test's state affects the state of another. When -``pytest-random-order`` is installed it will print the random seed it used for that session, e.g: - -.. code-block:: bash - - pytest tests - [...] 
- Using --random-order-bucket=module - Using --random-order-seed=573663 - -So that if the given particular sequence fails, you can reproduce it by adding that exact seed, e.g.: - -.. code-block:: bash - - pytest --random-order-seed=573663 - [...] - Using --random-order-bucket=module - Using --random-order-seed=573663 - -It will only reproduce the exact order if you use the exact same list of tests (or no list at all). Once you start to -manually narrowing down the list you can no longer rely on the seed, but have to list them manually in the exact order -they failed and tell pytest to not randomize them instead using ``--random-order-bucket=none``, e.g.: - -.. code-block:: bash - - pytest --random-order-bucket=none tests/test_a.py tests/test_c.py tests/test_b.py - -To disable the shuffling for all tests: - -.. code-block:: bash - - pytest --random-order-bucket=none - -By default ``--random-order-bucket=module`` is implied, which will shuffle the files on the module levels. It can also -shuffle on ``class``, ``package``, ``global`` and ``none`` levels. For the complete details please see its -`documentation `__. - -Another randomization alternative is: ``pytest-randomly`` `__. This -module has a very similar functionality/interface, but it doesn't have the bucket modes available in -``pytest-random-order``. It has the same problem of imposing itself once installed. - -Look and feel variations -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -pytest-sugar -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -`pytest-sugar `__ is a plugin that improves the look-n-feel, adds a -progressbar, and show tests that fail and the assert instantly. It gets activated automatically upon installation. - -.. code-block:: bash - - pip install pytest-sugar - -To run tests without it, run: - -.. 
code-block:: bash - - pytest -p no:sugar - -or uninstall it. - - - -Report each sub-test name and its progress -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -For a single or a group of tests via ``pytest`` (after ``pip install pytest-pspec``): - -.. code-block:: bash - - pytest --pspec tests/test_optimization.py - - - -Instantly shows failed tests -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -`pytest-instafail `__ shows failures and errors instantly instead of -waiting until the end of test session. - -.. code-block:: bash - - pip install pytest-instafail - -.. code-block:: bash - - pytest --instafail - -To GPU or not to GPU -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -On a GPU-enabled setup, to test in CPU-only mode add ``CUDA_VISIBLE_DEVICES=""``: - -.. code-block:: bash - - CUDA_VISIBLE_DEVICES="" pytest tests/test_logging.py - -or if you have multiple gpus, you can specify which one is to be used by ``pytest``. For example, to use only the -second gpu if you have gpus ``0`` and ``1``, you can run: - -.. code-block:: bash - - CUDA_VISIBLE_DEVICES="1" pytest tests/test_logging.py - -This is handy when you want to run different tasks on different GPUs. - -Some tests must be run on CPU-only, others on either CPU or GPU or TPU, yet others on multiple-GPUs. 
The following skip -decorators are used to set the requirements of tests CPU/GPU/TPU-wise: - -* ``require_torch`` - this test will run only under torch -* ``require_torch_gpu`` - as ``require_torch`` plus requires at least 1 GPU -* ``require_torch_multi_gpu`` - as ``require_torch`` plus requires at least 2 GPUs -* ``require_torch_non_multi_gpu`` - as ``require_torch`` plus requires 0 or 1 GPUs -* ``require_torch_up_to_2_gpus`` - as ``require_torch`` plus requires 0 or 1 or 2 GPUs -* ``require_torch_tpu`` - as ``require_torch`` plus requires at least 1 TPU - -Let's depict the GPU requirements in the following table: - - -+----------+----------------------------------+ -| n gpus | decorator | -+==========+==================================+ -| ``>= 0`` | ``@require_torch`` | -+----------+----------------------------------+ -| ``>= 1`` | ``@require_torch_gpu`` | -+----------+----------------------------------+ -| ``>= 2`` | ``@require_torch_multi_gpu`` | -+----------+----------------------------------+ -| ``< 2`` | ``@require_torch_non_multi_gpu`` | -+----------+----------------------------------+ -| ``< 3`` | ``@require_torch_up_to_2_gpus`` | -+----------+----------------------------------+ - - -For example, here is a test that must be run only when there are 2 or more GPUs available and pytorch is installed: - -.. code-block:: python - - @require_torch_multi_gpu - def test_example_with_multi_gpu(): - -If a test requires ``tensorflow`` use the ``require_tf`` decorator. For example: - -.. code-block:: python - - @require_tf - def test_tf_thing_with_tensorflow(): - -These decorators can be stacked. For example, if a test is slow and requires at least one GPU under pytorch, here is -how to set it up: - -.. code-block:: python - - @require_torch_gpu - @slow - def test_example_slow_on_gpu(): - -Some decorators like ``@parameterized`` rewrite test names, therefore ``@require_*`` skip decorators have to be listed -last for them to work correctly. 
Here is an example of the correct usage: - -.. code-block:: python - - @parameterized.expand(...) - @require_torch_multi_gpu - def test_integration_foo(): - -This order problem doesn't exist with ``@pytest.mark.parametrize``; you can put it first or last and it will still -work. But it only works with non-unittests. - -Inside tests: - -* How many GPUs are available: - -.. code-block:: python - - from transformers.testing_utils import get_gpu_count - n_gpu = get_gpu_count() # works with torch and tf - - - -Distributed training -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -``pytest`` can't deal with distributed training directly. If this is attempted, the sub-processes don't do the right -thing and end up thinking they are ``pytest`` and start running the test suite in loops. It works, however, if one -spawns a normal process that then spawns off multiple workers and manages the IO pipes. - -Here are some tests that use it: - -* :prefix_link:`test_trainer_distributed.py ` -* :prefix_link:`test_deepspeed.py ` - -To jump right into the execution point, search for the ``execute_subprocess_async`` call in those tests. - -You will need at least 2 GPUs to see these tests in action: - -.. code-block:: bash - - CUDA_VISIBLE_DEVICES=0,1 RUN_SLOW=1 pytest -sv tests/test_trainer_distributed.py - - -Output capture -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -During test execution any output sent to ``stdout`` and ``stderr`` is captured. If a test or a setup method fails, its -corresponding captured output will usually be shown along with the failure traceback. - -To disable output capturing and to get the ``stdout`` and ``stderr`` normally, use ``-s`` or ``--capture=no``: - -.. code-block:: bash - - pytest -s tests/test_logging.py - -To send test results to JUnit format output: - -.. 
code-block:: bash - - pytest tests --junitxml=result.xml - - -Color control -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -To have no color (e.g., yellow on white background is not readable): - -.. code-block:: bash - - pytest --color=no tests/test_logging.py - - - -Sending test report to online pastebin service -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Creating a URL for each test failure: - -.. code-block:: bash - - pytest --pastebin=failed tests/test_logging.py - -This will submit test run information to a remote Paste service and provide a URL for each failure. You may select -tests as usual or add for example ``-x`` if you only want to send one particular failure. - -Creating a URL for a whole test session log: - -.. code-block:: bash - - pytest --pastebin=all tests/test_logging.py - - - -Writing tests ------------------------------------------------------------------------------------------------------------------------ - -🤗 transformers tests are based on ``unittest``, but run by ``pytest``, so most of the time features from both systems -can be used. - -You can read `here `__ which features are supported, but the important -thing to remember is that most ``pytest`` fixtures don't work. Neither does parametrization, but we use the -``parameterized`` module, which works in a similar way. - - -Parametrization -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Often, there is a need to run the same test multiple times, but with different arguments. It could be done from within -the test, but then there is no way of running that test for just one set of arguments. - -.. 
code-block:: python - - # test_this1.py - import math - import unittest - from parameterized import parameterized - class TestMathUnitTest(unittest.TestCase): - @parameterized.expand([ - ("negative", -1.5, -2.0), - ("integer", 1, 1.0), - ("large fraction", 1.6, 1), - ]) - def test_floor(self, name, input, expected): - self.assertEqual(math.floor(input), expected) - -Now, by default this test will be run 3 times, each time with the last 3 arguments of ``test_floor`` being assigned the -corresponding arguments in the parameter list. - -You could run just the ``negative`` and ``integer`` sets of params with: - -.. code-block:: bash - - pytest -k "negative or integer" test_this1.py - -or all but ``negative`` sub-tests, with: - -.. code-block:: bash - - pytest -k "not negative" test_this1.py - -Besides using the ``-k`` filter that was just mentioned, you can find out the exact name of each sub-test and run any -or all of them using their exact names. - -.. code-block:: bash - - pytest test_this1.py --collect-only -q - -and it will list: - -.. code-block:: bash - - test_this1.py::TestMathUnitTest::test_floor_0_negative - test_this1.py::TestMathUnitTest::test_floor_1_integer - test_this1.py::TestMathUnitTest::test_floor_2_large_fraction - -So now you can run just 2 specific sub-tests: - -.. code-block:: bash - - pytest test_this1.py::TestMathUnitTest::test_floor_0_negative test_this1.py::TestMathUnitTest::test_floor_1_integer - -The module `parameterized `__, which is already in the developer dependencies -of ``transformers``, works for both ``unittests`` and ``pytest`` tests. - -If, however, the test is not a ``unittest``, you may use ``pytest.mark.parametrize`` (or you may see it being used in -some existing tests, mostly under ``examples``). - -Here is the same example, this time using ``pytest``'s ``parametrize`` marker: - -.. 
code-block:: python - - # test_this2.py - import math - import pytest - @pytest.mark.parametrize( - "name, input, expected", - [ - ("negative", -1.5, -2.0), - ("integer", 1, 1.0), - ("large fraction", 1.6, 1), - ], - ) - def test_floor(name, input, expected): - assert math.floor(input) == expected - -Same as with ``parameterized``, with ``pytest.mark.parametrize`` you can have fine control over which sub-tests are -run, if the ``-k`` filter doesn't do the job. Except, this parametrization function creates a slightly different set of -names for the sub-tests. Here is what they look like: - -.. code-block:: bash - - pytest test_this2.py --collect-only -q - -and it will list: - -.. code-block:: bash - - test_this2.py::test_floor[integer-1-1.0] - test_this2.py::test_floor[negative--1.5--2.0] - test_this2.py::test_floor[large fraction-1.6-1] - -So now you can run just the specific test: - -.. code-block:: bash - - pytest test_this2.py::test_floor[negative--1.5--2.0] test_this2.py::test_floor[integer-1-1.0] - -as in the previous example. - - - -Files and directories -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -In tests we often need to know where things are relative to the current test file, and it's not trivial since the test -could be invoked from more than one directory or could reside in sub-directories with different depths. A helper class -:obj:`transformers.testing_utils.TestCasePlus` solves this problem by sorting out all the basic paths and provides easy -accessors to them: - -* ``pathlib`` objects (all fully resolved): - - - ``test_file_path`` - the current test file path, i.e. ``__file__`` - - ``test_file_dir`` - the directory containing the current test file - - ``tests_dir`` - the directory of the ``tests`` test suite - - ``examples_dir`` - the directory of the ``examples`` test suite - - ``repo_root_dir`` - the directory of the repository - - ``src_dir`` - the directory of ``src`` (i.e. 
where the ``transformers`` sub-dir resides) - -* stringified paths---same as above but these return paths as strings, rather than ``pathlib`` objects: - - - ``test_file_path_str`` - - ``test_file_dir_str`` - - ``tests_dir_str`` - - ``examples_dir_str`` - - ``repo_root_dir_str`` - - ``src_dir_str`` - -To start using those all you need is to make sure that the test resides in a subclass of -:obj:`transformers.testing_utils.TestCasePlus`. For example: - -.. code-block:: python - - from transformers.testing_utils import TestCasePlus - class PathExampleTest(TestCasePlus): - def test_something_involving_local_locations(self): - data_dir = self.tests_dir / "fixtures/tests_samples/wmt_en_ro" - -If you don't need to manipulate paths via ``pathlib`` or you just need a path as a string, you can always invoke -``str()`` on the ``pathlib`` object or use the accessors ending with ``_str``. For example: - -.. code-block:: python - - from transformers.testing_utils import TestCasePlus - class PathExampleTest(TestCasePlus): - def test_something_involving_stringified_locations(self): - examples_dir = self.examples_dir_str - - - - -Temporary files and directories -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Using unique temporary files and directories is essential for parallel test running, so that the tests won't overwrite -each other's data. Also we want to get the temporary files and directories removed at the end of each test that created -them. Therefore, using packages like ``tempfile``, which address these needs, is essential. - -However, when debugging tests, you need to be able to see what goes into the temporary file or directory and you want -to know its exact path rather than having it randomized on every test re-run. - -A helper class :obj:`transformers.testing_utils.TestCasePlus` is best used for such purposes. 
It's a sub-class of -:obj:`unittest.TestCase`, so we can easily inherit from it in the test modules. - -Here is an example of its usage: - -.. code-block:: python - - from transformers.testing_utils import TestCasePlus - class ExamplesTests(TestCasePlus): - def test_whatever(self): - tmp_dir = self.get_auto_remove_tmp_dir() - -This code creates a unique temporary directory, and sets :obj:`tmp_dir` to its location. - -* Create a unique temporary dir: - -.. code-block:: python - - def test_whatever(self): - tmp_dir = self.get_auto_remove_tmp_dir() - -``tmp_dir`` will contain the path to the created temporary dir. It will be automatically removed at the end of the -test. - -* Create a temporary dir of my choice, ensure it's empty before the test starts and don't empty it after the test. - -.. code-block:: python - - def test_whatever(self): - tmp_dir = self.get_auto_remove_tmp_dir("./xxx") - -This is useful for debugging when you want to monitor a specific directory and want to make sure the previous tests didn't -leave any data in there. - -* You can override the default behavior by directly overriding the ``before`` and ``after`` args, leading to one of the - following behaviors: - - - ``before=True``: the temporary dir will always be cleared at the beginning of the test. - - ``before=False``: if the temporary dir already existed, any existing files will remain there. - - ``after=True``: the temporary dir will always be deleted at the end of the test. - - ``after=False``: the temporary dir will always be left intact at the end of the test. - -.. note:: - In order to run the equivalent of ``rm -r`` safely, only subdirs of the project repository checkout are allowed if - an explicit :obj:`tmp_dir` is used, so that by mistake no ``/tmp`` or similar important part of the filesystem will - get nuked. I.e., please always pass paths that start with ``./``. - -.. 
note:: - Each test can register multiple temporary directories and they all will get auto-removed, unless requested - otherwise. - - -Temporary sys.path override -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -If you need to temporarily override ``sys.path`` to import from another test for example, you can use the -``ExtendSysPath`` context manager. Example: - - -.. code-block:: python - - import os - from transformers.testing_utils import ExtendSysPath - bindir = os.path.abspath(os.path.dirname(__file__)) - with ExtendSysPath(f"{bindir}/.."): - from test_trainer import TrainerIntegrationCommon # noqa - - - -Skipping tests -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -This is useful when a bug is found and a new test is written, but the bug is not fixed yet. In order to be able to -commit it to the main repository we need to make sure it's skipped during ``make test``. - -Methods: - -- A **skip** means that you expect your test to pass only if some conditions are met, otherwise pytest should skip - running the test altogether. Common examples are skipping windows-only tests on non-windows platforms, or skipping - tests that depend on an external resource which is not available at the moment (for example a database). - -- An **xfail** means that you expect a test to fail for some reason. A common example is a test for a feature not yet - implemented, or a bug not yet fixed. When a test passes despite being expected to fail (marked with - ``pytest.mark.xfail``), it’s an xpass and will be reported in the test summary. - -One of the important differences between the two is that ``skip`` doesn't run the test, and ``xfail`` does. So if the -code that's buggy causes some bad state that will affect other tests, do not use ``xfail``. 
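
The "skip does not run the test, xfail does" distinction can be observed with a minimal sketch. It uses only the standard library, with ``unittest.expectedFailure`` as a stand-in for ``pytest.mark.xfail`` (an assumption made here so the example runs without ``pytest``). The class, the ``executed`` list and ``run_skip_vs_xfail_demo`` are hypothetical names introduced for this illustration.

```python
import unittest

executed = []  # records which test bodies actually ran


class SkipVsExpectedFailure(unittest.TestCase):
    @unittest.skip("this bug needs to be fixed")
    def test_skipped(self):
        # Never reached: a skipped test body is not executed at all.
        executed.append("skipped")

    @unittest.expectedFailure
    def test_expected_failure(self):
        # Reached: the body runs, and its failure is recorded as expected.
        executed.append("expected_failure")
        self.fail("known bug")


def run_skip_vs_xfail_demo():
    """Run both tests and return the unittest result object."""
    executed.clear()
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SkipVsExpectedFailure)
    result = unittest.TestResult()
    suite.run(result)
    return result


if __name__ == "__main__":
    result = run_skip_vs_xfail_demo()
    print("bodies executed:", executed)
    print("skipped:", len(result.skipped))
    print("expected failures:", len(result.expectedFailures))
```

Because the expected-failure body really executes, any side effect it has (here, appending to ``executed``) is visible afterwards, which is exactly why a test that corrupts shared state should be skipped rather than marked as an expected failure.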
- -Implementation -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -- Here is how to skip whole test unconditionally: - -.. code-block:: python - - @unittest.skip("this bug needs to be fixed") - def test_feature_x(): - -or via pytest: - -.. code-block:: python - - @pytest.mark.skip(reason="this bug needs to be fixed") - -or the ``xfail`` way: - -.. code-block:: python - - @pytest.mark.xfail - def test_feature_x(): - -- Here is how to skip a test based on some internal check inside the test: - -.. code-block:: python - - def test_feature_x(): - if not has_something(): - pytest.skip("unsupported configuration") - -or the whole module: - -.. code-block:: python - - import pytest - if not pytest.config.getoption("--custom-flag"): - pytest.skip("--custom-flag is missing, skipping tests", allow_module_level=True) - -or the ``xfail`` way: - -.. code-block:: python - - def test_feature_x(): - pytest.xfail("expected to fail until bug XYZ is fixed") - -- Here is how to skip all tests in a module if some import is missing: - -.. code-block:: python - - docutils = pytest.importorskip("docutils", minversion="0.3") - -- Skip a test based on a condition: - -.. code-block:: python - - @pytest.mark.skipif(sys.version_info < (3,6), reason="requires python3.6 or higher") - def test_feature_x(): - -or: - -.. code-block:: python - - @unittest.skipIf(torch_device == "cpu", "Can't do half precision") - def test_feature_x(): - -or skip the whole module: - -.. code-block:: python - - @pytest.mark.skipif(sys.platform == 'win32', reason="does not run on windows") - class TestClass(): - def test_feature_x(self): - -More details, example and ways are `here `__. 
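
The conditional-skip patterns above can be wrapped into a reusable decorator, in the spirit of the ``require_*`` decorators. The following is a minimal standard-library sketch. ``require_module`` is a hypothetical helper written for this example, not a function from ``transformers.testing_utils``, and it is only meant for top-level module names.

```python
import importlib.util
import unittest


def require_module(name):
    """Hypothetical helper: skip the decorated test unless `name` is importable.

    Intended for top-level module names only (e.g. "torch", not "torch.nn").
    """
    available = importlib.util.find_spec(name) is not None
    return unittest.skipUnless(available, f"test requires {name}")


class OptionalDependencyTest(unittest.TestCase):
    @require_module("json")  # always present in the standard library
    def test_with_available_module(self):
        self.assertTrue(True)

    @require_module("surely_not_an_installed_module_xyz")
    def test_with_missing_module(self):
        self.fail("should have been skipped")


def run_optional_dependency_demo():
    """Run both tests and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(OptionalDependencyTest)
    result = unittest.TestResult()
    suite.run(result)
    return result


if __name__ == "__main__":
    result = run_optional_dependency_demo()
    for test, reason in result.skipped:
        print(f"skipped {test.id()}: {reason}")
```

Evaluating the condition once, when the decorator is applied, keeps the test body free of availability checks and makes the skip reason show up in the test report.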
- -Slow tests -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -The library of tests is ever-growing, and some of the tests take minutes to run, so we can't afford to wait -an hour for the test suite to complete on CI. Therefore, with some exceptions for essential tests, slow tests should be -marked as in the example below: - -.. code-block:: python - - from transformers.testing_utils import slow - @slow - def test_integration_foo(): - -Once a test is marked as ``@slow``, to run such tests set the ``RUN_SLOW=1`` env var, e.g.: - -.. code-block:: bash - - RUN_SLOW=1 pytest tests - -Some decorators like ``@parameterized`` rewrite test names, therefore ``@slow`` and the rest of the skip decorators -``@require_*`` have to be listed last for them to work correctly. Here is an example of the correct usage: - -.. code-block:: python - - @parameterized.expand(...) - @slow - def test_integration_foo(): - -As explained at the beginning of this document, slow tests get to run on a scheduled basis, rather than in PR CI -checks. So it's possible that some problems will be missed during a PR submission and get merged. Such problems will -get caught during the next scheduled CI job. But it also means that it's important to run the slow tests on your -machine before submitting the PR. - -Here is a rough decision-making mechanism for choosing which tests should be marked as slow: - -If the test is focused on one of the library's internal components (e.g., modeling files, tokenization files, -pipelines), then we should run that test in the non-slow test suite. If it's focused on another aspect of the library, -such as the documentation or the examples, then we should run these tests in the slow test suite. 
And then, to refine -this approach we should have exceptions: - -* All tests that need to download a heavy set of weights or a dataset that is larger than ~50MB (e.g., model or - tokenizer integration tests, pipeline integration tests) should be set to slow. If you're adding a new model, you - should create and upload to the hub a tiny version of it (with random weights) for integration tests. This is - discussed in the following paragraphs. -* All tests that need to do a training not specifically optimized to be fast should be set to slow. -* We can introduce exceptions if some of these should-be-non-slow tests are excruciatingly slow, and set them to - ``@slow``. Auto-modeling tests, which save and load large files to disk, are a good example of tests that are marked - as ``@slow``. -* If a test completes under 1 second on CI (including downloads if any) then it should be a normal test regardless. - -Collectively, all the non-slow tests need to fully cover the different internals, while remaining fast. For example, -significant coverage can be achieved by testing with specially created tiny models with random weights. Such models -have a minimal number of layers (e.g., 2), vocab size (e.g., 1000), etc. Then the ``@slow`` tests can use large -slow models to do qualitative testing. To see the use of these simply look for *tiny* models with: - -.. code-block:: bash - - grep tiny tests examples - -Here is an example of a :prefix_link:`script ` that created the tiny model -`stas/tiny-wmt19-en-de `__. You can easily adjust it to your specific -model's architecture. - -It's easy to measure the run-time incorrectly if for example there is an overhead of downloading a huge model, but if -you test it locally the downloaded files would be cached and thus the download time not measured. Hence check the -execution speed report in CI logs instead (the output of ``pytest --durations=0 tests``). 
- -That report is also useful to find slow outliers that aren't marked as such, or which need to be re-written to be fast. -If you notice that the test suite starts getting slow on CI, the top listing of this report will show the slowest -tests. - - -Testing the stdout/stderr output -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -In order to test functions that write to ``stdout`` and/or ``stderr``, the test can access those streams using the -``pytest``'s `capsys system `__. Here is how this is accomplished: - -.. code-block:: python - - import sys - def print_to_stdout(s): print(s) - def print_to_stderr(s): sys.stderr.write(s) - def test_result_and_stdout(capsys): - msg = "Hello" - print_to_stdout(msg) - print_to_stderr(msg) - out, err = capsys.readouterr() # consume the captured output streams - # optional: if you want to replay the consumed streams: - sys.stdout.write(out) - sys.stderr.write(err) - # test: - assert msg in out - assert msg in err - -And, of course, most of the time, ``stderr`` will come as a part of an exception, so try/except has to be used in such -a case: - -.. code-block:: python - - def raise_exception(msg): raise ValueError(msg) - def test_something_exception(): - msg = "Not a good value" - error = '' - try: - raise_exception(msg) - except Exception as e: - error = str(e) - assert msg in error, f"{msg} is in the exception:\n{error}" - -Another approach to capturing stdout is via ``contextlib.redirect_stdout``: - -.. 
code-block:: python - - import sys - from io import StringIO - from contextlib import redirect_stdout - def print_to_stdout(s): print(s) - def test_result_and_stdout(): - msg = "Hello" - buffer = StringIO() - with redirect_stdout(buffer): - print_to_stdout(msg) - out = buffer.getvalue() - # optional: if you want to replay the consumed streams: - sys.stdout.write(out) - # test: - assert msg in out - -An important potential issue with capturing stdout is that it may contain ``\r`` characters that in normal ``print`` -reset everything that has been printed so far. There is no problem with ``pytest``, but with ``pytest -s`` these -characters get included in the buffer, so to be able to have the test run with and without ``-s``, you have to make an -extra cleanup to the captured output, using ``re.sub(r'^.*\r', '', buf, 0, re.M)``. - -We also have a helper context manager wrapper that automatically takes care of it all, regardless of whether the output has -some ``\r``'s in it or not, so it's as simple as: - -.. code-block:: python - - from transformers.testing_utils import CaptureStdout - with CaptureStdout() as cs: - function_that_writes_to_stdout() - print(cs.out) - -Here is a full test example: - -.. code-block:: python - - from transformers.testing_utils import CaptureStdout - msg = "Secret message\r" - final = "Hello World" - with CaptureStdout() as cs: - print(msg + final) - assert cs.out == final+"\n", f"captured: {cs.out}, expecting {final}" - -If you'd like to capture ``stderr`` use the :obj:`CaptureStderr` class instead: - -.. code-block:: python - - from transformers.testing_utils import CaptureStderr - with CaptureStderr() as cs: - function_that_writes_to_stderr() - print(cs.err) - -If you need to capture both streams at once, use the parent :obj:`CaptureStd` class: - -.. 
code-block:: python
-
-    from transformers.testing_utils import CaptureStd
-    with CaptureStd() as cs:
-        function_that_writes_to_stdout_and_stderr()
-    print(cs.err, cs.out)
-
-Also, to aid debugging test issues, by default these context managers automatically replay the captured streams on exit
-from the context.
-
-
-Capturing logger stream
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-If you need to validate the output of a logger, you can use :obj:`CaptureLogger`:
-
-.. code-block:: python
-
-    from transformers import logging
-    from transformers.testing_utils import CaptureLogger
-
-    msg = "Testing 1, 2, 3"
-    logging.set_verbosity_info()
-    logger = logging.get_logger("transformers.models.bart.tokenization_bart")
-    with CaptureLogger(logger) as cl:
-        logger.info(msg)
-    # compare the captured output with the expected message
-    assert cl.out == msg+"\n"
-
-
-Testing with environment variables
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-If you want to test the impact of environment variables for a specific test, you can use the helper decorator
-``transformers.testing_utils.mockenv``:
-
-.. code-block:: python
-
-    import os
-    import unittest
-    from transformers.testing_utils import mockenv
-    class HfArgumentParserTest(unittest.TestCase):
-        @mockenv(TRANSFORMERS_VERBOSITY="error")
-        def test_env_override(self):
-            env_level_str = os.getenv("TRANSFORMERS_VERBOSITY", None)
-
-At times an external program needs to be called, which requires setting ``PYTHONPATH`` in ``os.environ`` to include
-multiple local paths. The helper class :obj:`transformers.testing_utils.TestCasePlus` comes to help:
-
-.. 
code-block:: python
-
-    from transformers.testing_utils import TestCasePlus
-    class EnvExampleTest(TestCasePlus):
-        def test_external_prog(self):
-            env = self.get_env()
-            # now call the external program, passing ``env`` to it
-
-Depending on whether the test file is under the ``tests`` test suite or ``examples``, it'll correctly set up
-``env[PYTHONPATH]`` to include one of these two directories, as well as the ``src`` directory to ensure the testing is
-done against the current repo, followed by whatever ``env[PYTHONPATH]`` was already set to before the test was
-called, if anything.
-
-This helper method creates a copy of the ``os.environ`` object, so the original remains intact.
-
-
-Getting reproducible results
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-In some situations you may want to remove randomness from your tests. To get identical, reproducible results, you
-will need to fix the seed:
-
-.. code-block:: python
-
-    seed = 42
-
-    # python RNG
-    import random
-    random.seed(seed)
-
-    # pytorch RNGs
-    import torch
-    torch.manual_seed(seed)
-    torch.backends.cudnn.deterministic = True
-    if torch.cuda.is_available(): torch.cuda.manual_seed_all(seed)
-
-    # numpy RNG
-    import numpy as np
-    np.random.seed(seed)
-
-    # tf RNG
-    import tensorflow as tf
-    tf.random.set_seed(seed)
-
-Debugging tests
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-To start a debugger at the point of the warning, do this:
-
-.. code-block:: bash
-
-    pytest tests/test_logging.py -W error::UserWarning --pdb
-
-
-Working with github actions workflows
------------------------------------------------------------------------------------------------------------------------
-
-To trigger a self-push workflow CI job, you must:
-
-1. Create a new branch on ``transformers`` origin (not a fork!).
-2. 
The branch name has to start with either ``ci_`` or ``ci-`` (``master`` triggers it too, but we can't do PRs on
-   ``master``). It also gets triggered only for specific paths - you can find the up-to-date definition, in case it
-   has changed since this document was written, `here
-   `__ under ``push:``.
-3. Create a PR from this branch.
-4. Then you can see the job appear `here
-   `__. It may not run right away if there
-   is a backlog.
-
-
-
-
-Testing Experimental CI Features
------------------------------------------------------------------------------------------------------------------------
-
-Testing CI features can be potentially problematic as it can interfere with the normal CI functioning. Therefore, if a
-new CI feature is to be added, it should be done as follows:
-
-1. Create a new dedicated job that tests what needs to be tested.
-2. The new job must always succeed so that it gives us a green ✓ (details below).
-3. Let it run for some days to see that a variety of different PR types get to run on it (user fork branches,
-   non-forked branches, branches originating from github.com UI direct file edit, various forced pushes, etc. - there
-   are so many) while monitoring the experimental job's logs (not the overall job status, as it's purposefully always
-   green).
-4. When it's clear that everything is solid, then merge the new changes into existing jobs.
-
-That way experiments on CI functionality itself won't interfere with the normal workflow.
-
-Now how can we make the job always succeed while the new CI feature is being developed?
-
-Some CIs, like TravisCI, support ignore-step-failure and will report the overall job as successful, but CircleCI and
-Github Actions as of this writing don't support that.
-
-So the following workaround can be used:
-
-1. ``set +euo pipefail`` at the beginning of the run command to suppress most potential failures in the bash script.
-2. 
The last command must be a success: ``echo "done"`` or just ``true`` will do.
-
-Here is an example:
-
-.. code-block:: yaml
-
-    - run:
-        name: run CI experiment
-        command: |
-            set +euo pipefail
-            echo "setting run-all-despite-any-errors-mode"
-            this_command_will_fail
-            echo "but bash continues to run"
-            # emulate another failure
-            false
-            # but the last command must be a success
-            echo "during experiment do not remove: reporting success to CI, even if there were failures"
-
-For simple commands you could also do:
-
-.. code-block:: bash
-
-    cmd_that_may_fail || true
-
-Of course, once satisfied with the results, integrate the experimental step or job with the rest of the normal jobs,
-while removing ``set +euo pipefail`` or any other things you may have added to ensure that the experimental job doesn't
-interfere with the normal CI functioning.
-
-This whole process would have been much easier if we could only set something like ``allow-failure`` for the
-experimental step, and let it fail without impacting the overall status of PRs. But as mentioned earlier, CircleCI and
-Github Actions don't support it at the moment.
-
-You can vote for this feature and see where it stands in these CI-specific threads:
-
-* `Github Actions: `__
-* `CircleCI: `__
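
The always-green workaround described above can be tried out locally before touching any CI config. The following is only a rough sketch: it assumes ``bash`` is available on ``PATH``, and the ``run_bash`` helper and both tiny scripts are made up for this illustration rather than taken from any real CI job. It runs the same failing command in strict mode and in the experimental "ignore failures" mode and compares the exit codes:

```python
import subprocess


def run_bash(script):
    """Hypothetical helper: run `script` with `bash -c` and return the completed process."""
    return subprocess.run(["bash", "-c", script], capture_output=True, text=True)


# strict mode: the first failing command aborts the script with a non-zero exit code
strict = run_bash("set -euo pipefail\nfalse\necho 'not reached'\n")

# experiment mode: failures are ignored and the last command decides the exit code
lenient = run_bash("set +euo pipefail\nfalse\necho 'reporting success'\n")

print("strict exit code:", strict.returncode)
print("lenient exit code:", lenient.returncode)
```

Since the CI only looks at the exit code of the whole run command, the second variant is reported as a success even though ``false`` failed in the middle of it, which is why the experimental job's logs have to be monitored by hand.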