diff --git a/docs/source/debugging.mdx b/docs/source/debugging.mdx
new file mode 100644
index 0000000000..a3f05df48e
--- /dev/null
+++ b/docs/source/debugging.mdx
@@ -0,0 +1,299 @@
+<!--Copyright 2021 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# Debugging
+
+## Underflow and Overflow Detection
+
+<Tip>
+
+This feature is currently available for PyTorch only.
+
+</Tip>
+
+<Tip>
+
+For multi-GPU training it requires DDP (`torch.distributed.launch`).
+
+</Tip>
+
+<Tip>
+
+This feature can be used with any `nn.Module`-based model.
+
+</Tip>
+
+If you start getting `loss=NaN` or the model exhibits some other abnormal behavior due to `inf` or `nan` in
+activations or weights, you need to discover where the first underflow or overflow happens and what led to it. Luckily,
+you can accomplish that easily by activating a special module that will do the detection automatically.
+
+If you're using [`Trainer`], you just need to add:
+
+```bash
+--debug underflow_overflow
+```
+
+to the normal command line arguments, or pass `debug="underflow_overflow"` when creating the
+[`TrainingArguments`] object.
+
+If you're using your own training loop or another Trainer you can accomplish the same with:
+
+```python
+from transformers.debug_utils import DebugUnderflowOverflow
+debug_overflow = DebugUnderflowOverflow(model)
+```
+
+[`~debug_utils.DebugUnderflowOverflow`] inserts hooks into the model that, immediately after each
+forward call, test the input and output variables as well as the corresponding module's weights. As soon as `inf` or
+`nan` is detected in at least one element of the activations or weights, the program will assert and print a report
+like this (this was caught with `google/mt5-small` under fp16 mixed precision):
+
+```
+Detected inf/nan during batch_number=0
+Last 21 forward frames:
+abs min  abs max  metadata
+                  encoder.block.1.layer.1.DenseReluDense.dropout Dropout
+0.00e+00 2.57e+02 input[0]
+0.00e+00 2.85e+02 output
+[...]
+                  encoder.block.2.layer.0 T5LayerSelfAttention
+6.78e-04 3.15e+03 input[0]
+2.65e-04 3.42e+03 output[0]
+             None output[1]
+2.25e-01 1.00e+04 output[2]
+                  encoder.block.2.layer.1.layer_norm T5LayerNorm
+8.69e-02 4.18e-01 weight
+2.65e-04 3.42e+03 input[0]
+1.79e-06 4.65e+00 output
+                  encoder.block.2.layer.1.DenseReluDense.wi_0 Linear
+2.17e-07 4.50e+00 weight
+1.79e-06 4.65e+00 input[0]
+2.68e-06 3.70e+01 output
+                  encoder.block.2.layer.1.DenseReluDense.wi_1 Linear
+8.08e-07 2.66e+01 weight
+1.79e-06 4.65e+00 input[0]
+1.27e-04 2.37e+02 output
+                  encoder.block.2.layer.1.DenseReluDense.dropout Dropout
+0.00e+00 8.76e+03 input[0]
+0.00e+00 9.74e+03 output
+                  encoder.block.2.layer.1.DenseReluDense.wo Linear
+1.01e-06 6.44e+00 weight
+0.00e+00 9.74e+03 input[0]
+3.18e-04 6.27e+04 output
+                  encoder.block.2.layer.1.DenseReluDense T5DenseGatedGeluDense
+1.79e-06 4.65e+00 input[0]
+3.18e-04 6.27e+04 output
+                  encoder.block.2.layer.1.dropout Dropout
+3.18e-04 6.27e+04 input[0]
+0.00e+00      inf output
+```
+
+The example output has been trimmed in the middle for brevity.
+
+The second column shows the value of the absolute largest element, so if you have a closer look at the last few frames,
+the inputs and outputs were in the range of `1e4`. So when this training was done under fp16 mixed precision the very
+last step overflowed (since under `fp16` the largest finite number before `inf` is `65504`, roughly 64K). To avoid
+overflows under `fp16` the activations must remain way below `1e4`, because `1e4 * 1e4 = 1e8`, so any matrix
+multiplication with large activations is going to lead to a numerical overflow condition.
+
+At the very start of the trace you can discover at which batch number the problem occurred (here `Detected inf/nan during batch_number=0` means the problem occurred on the first batch).
+
+Each reported frame starts by declaring the fully qualified name of the module that the frame is reporting on. If we
+look just at this frame:
+
+```
+                  encoder.block.2.layer.1.layer_norm T5LayerNorm
+8.69e-02 4.18e-01 weight
+2.65e-04 3.42e+03 input[0]
+1.79e-06 4.65e+00 output
+```
+
+Here, `encoder.block.2.layer.1.layer_norm` indicates that it was the layer norm of `layer.1` in `block.2` of the
+encoder (both indices are 0-based). The class whose `forward` was called is `T5LayerNorm`.
+
+Let's look at the last few frames of that report:
+
+```
+Detected inf/nan during batch_number=0
+Last 21 forward frames:
+abs min  abs max  metadata
+[...]
+                  encoder.block.2.layer.1.DenseReluDense.wi_0 Linear
+2.17e-07 4.50e+00 weight
+1.79e-06 4.65e+00 input[0]
+2.68e-06 3.70e+01 output
+                  encoder.block.2.layer.1.DenseReluDense.wi_1 Linear
+8.08e-07 2.66e+01 weight
+1.79e-06 4.65e+00 input[0]
+1.27e-04 2.37e+02 output
+                  encoder.block.2.layer.1.DenseReluDense.wo Linear
+1.01e-06 6.44e+00 weight
+0.00e+00 9.74e+03 input[0]
+3.18e-04 6.27e+04 output
+                  encoder.block.2.layer.1.DenseReluDense T5DenseGatedGeluDense
+1.79e-06 4.65e+00 input[0]
+3.18e-04 6.27e+04 output
+                  encoder.block.2.layer.1.dropout Dropout
+3.18e-04 6.27e+04 input[0]
+0.00e+00      inf output
+```
+
+The last frame reports on the `Dropout.forward` function, with the first entry for its only input and the second for
+its only output. The frame name shows that it was called through the `dropout` attribute of `encoder.block.2.layer.1`,
+that is, in `layer.1` of `block.2` of the encoder, during the very first batch. Finally, the largest absolute input
+element was `6.27e+04` and the largest absolute output element was `inf`.
+
+You can see here that `T5DenseGatedGeluDense.forward` resulted in output activations whose absolute max value was
+around 62.7K, which is very close to fp16's top limit of roughly 64K. In the next frame we have `Dropout`, which
+rescales the remaining elements after zeroing some of them, which pushes the absolute max value to more than 64K, and
+we get an overflow (`inf`).
+
+As you can see, it's the previous frames that we need to look into, where the numbers start getting very large for
+fp16.
+
+Let's match the report to the code from `models/t5/modeling_t5.py`:
+
+```python
+class T5DenseGatedGeluDense(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.wi_0 = nn.Linear(config.d_model, config.d_ff, bias=False)
+        self.wi_1 = nn.Linear(config.d_model, config.d_ff, bias=False)
+        self.wo = nn.Linear(config.d_ff, config.d_model, bias=False)
+        self.dropout = nn.Dropout(config.dropout_rate)
+        self.gelu_act = ACT2FN["gelu_new"]
+
+    def forward(self, hidden_states):
+        hidden_gelu = self.gelu_act(self.wi_0(hidden_states))
+        hidden_linear = self.wi_1(hidden_states)
+        hidden_states = hidden_gelu * hidden_linear
+        hidden_states = self.dropout(hidden_states)
+        hidden_states = self.wo(hidden_states)
+        return hidden_states
+```
+
+Now it's easy to see the `dropout` call, and all the previous calls as well.
+
+Since the detection is happening in a forward hook, these reports are printed immediately after each `forward`
+returns.
+
+Going back to the full report: to act on it and fix the problem, we need to go a few frames up to where the numbers
+started to grow, and most likely switch to `fp32` mode there, so that the numbers don't overflow when multiplied
+or summed up. Of course, there might be other solutions. For example, we could turn off `amp` temporarily if it's
+enabled, after moving the original `forward` into a helper wrapper, like so:
+
+```python
+def _forward(self, hidden_states):
+    hidden_gelu = self.gelu_act(self.wi_0(hidden_states))
+    hidden_linear = self.wi_1(hidden_states)
+    hidden_states = hidden_gelu * hidden_linear
+    hidden_states = self.dropout(hidden_states)
+    hidden_states = self.wo(hidden_states)
+    return hidden_states
+
+import torch
+def forward(self, hidden_states):
+    if torch.is_autocast_enabled():
+        with torch.cuda.amp.autocast(enabled=False):
+            return self._forward(hidden_states)
+    else:
+        return self._forward(hidden_states)
+```
+
+Since the automatic detector only reports on inputs and outputs of full frames, once you know where to look, you may
+want to analyse the intermediary stages of any specific `forward` function as well. In such a case you can use the
+`detect_overflow` helper function to inject the detector where you want it, for example:
+
+```python
+from transformers.debug_utils import detect_overflow
+
+class T5LayerFF(nn.Module):
+    [...]
+    def forward(self, hidden_states):
+        forwarded_states = self.layer_norm(hidden_states)
+        detect_overflow(forwarded_states, "after layer_norm")
+        forwarded_states = self.DenseReluDense(forwarded_states)
+        detect_overflow(forwarded_states, "after DenseReluDense")
+        return hidden_states + self.dropout(forwarded_states)
+```
+
+We added two of these calls, so we can now tell whether `inf` or `nan` appeared in `forwarded_states` at either of
+those intermediate points.
+
+Actually, the detector already reports these because each of the calls in the example above is an `nn.Module`, but
+if you had some local direct calculations instead, this is how you would check them.
+
+Additionally, if you're instantiating the debugger in your own code, you can adjust the number of frames printed from
+its default, e.g.:
+
+```python
+from transformers.debug_utils import DebugUnderflowOverflow
+debug_overflow = DebugUnderflowOverflow(model, max_frames_to_save=100)
+```
+
+### Specific batch absolute min and max value tracing
+
+The same debugging class can be used for per-batch tracing with the underflow/overflow detection feature turned off.
+
+Let's say you want to watch the absolute min and max values for all the ingredients of each `forward` call of a given
+batch, and only do that for batches 1 and 3. Then you instantiate this class as:
+
+```python
+debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1, 3])
+```
+
+And now full batches 1 and 3 will be traced using the same format as the underflow/overflow detector does.
+
+Batches are 0-indexed.
+
+This is helpful if you know that the program starts misbehaving after a certain batch number, so you can fast-forward
+right to that area. Here is a sample truncated output for such configuration:
+
+```
+                  *** Starting batch number=1 ***
+abs min  abs max  metadata
+                  shared Embedding
+1.01e-06 7.92e+02 weight
+0.00e+00 2.47e+04 input[0]
+5.36e-05 7.92e+02 output
+[...]
+                  decoder.dropout Dropout
+1.60e-07 2.27e+01 input[0]
+0.00e+00 2.52e+01 output
+                  decoder T5Stack
+     not a tensor output
+                  lm_head Linear
+1.01e-06 7.92e+02 weight
+0.00e+00 1.11e+00 input[0]
+6.06e-02 8.39e+01 output
+                   T5ForConditionalGeneration
+     not a tensor output
+
+                  *** Starting batch number=3 ***
+abs min  abs max  metadata
+                  shared Embedding
+1.01e-06 7.92e+02 weight
+0.00e+00 2.78e+04 input[0]
+5.36e-05 7.92e+02 output
+[...]
+```
+
+Here you will get a huge number of frames dumped, as many as there were forward calls in your model, so it may or may
+not be what you want, but sometimes it can be easier to use for debugging purposes than a normal debugger. For example,
+if a problem starts happening at batch number 150, you can dump traces for batches 149 and 150 and compare where the
+numbers started to diverge.
+
+You can also specify the batch number after which to stop the training, with:
+
+```python
+debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1, 3], abort_after_batch_num=3)
+```
diff --git a/docs/source/debugging.rst b/docs/source/debugging.rst
deleted file mode 100644
index 235e32b77f..0000000000
--- a/docs/source/debugging.rst
+++ /dev/null
@@ -1,299 +0,0 @@
-..
-    Copyright 2021 The HuggingFace Team. All rights reserved.
-
-    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-    the License. You may obtain a copy of the License at
-
-        http://www.apache.org/licenses/LICENSE-2.0
-
-    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-    specific language governing permissions and limitations under the License.
-
-
-
-Debugging
-=======================================================================================================================
-
-Underflow and Overflow Detection
------------------------------------------------------------------------------------------------------------------------
-
-.. note::
-
-   This feature is currently available for PyTorch-only.
-
-.. note::
-
-   For multi-GPU training it requires DDP (``torch.distributed.launch``).
-
-.. note::
-
-   This feature can be used with any ``nn.Module``-based model.
-
-If you start getting ``loss=NaN`` or the model inhibits some other abnormal behavior due to ``inf`` or ``nan`` in
-activations or weights one needs to discover where the first underflow or overflow happens and what led to it. Luckily
-you can accomplish that easily by activating a special module that will do the detection automatically.
-
-If you're using :class:`~transformers.Trainer`, you just need to add:
-
-.. code-block:: bash
-
-    --debug underflow_overflow
-
-to the normal command line arguments, or pass ``debug="underflow_overflow"`` when creating the
-:class:`~transformers.TrainingArguments` object.
-
-If you're using your own training loop or another Trainer you can accomplish the same with:
-
-.. code-block:: python
-
-    from .debug_utils import DebugUnderflowOverflow
-    debug_overflow = DebugUnderflowOverflow(model)
-
-:class:`~transformers.debug_utils.DebugUnderflowOverflow` inserts hooks into the model that immediately after each
-forward call will test input and output variables and also the corresponding module's weights. As soon as ``inf`` or
-``nan`` is detected in at least one element of the activations or weights, the program will assert and print a report
-like this (this was caught with ``google/mt5-small`` under fp16 mixed precision):
-
-.. code-block::
-
-    Detected inf/nan during batch_number=0
-    Last 21 forward frames:
-    abs min  abs max  metadata
-                      encoder.block.1.layer.1.DenseReluDense.dropout Dropout
-    0.00e+00 2.57e+02 input[0]
-    0.00e+00 2.85e+02 output
-    [...]
-                      encoder.block.2.layer.0 T5LayerSelfAttention
-    6.78e-04 3.15e+03 input[0]
-    2.65e-04 3.42e+03 output[0]
-                 None output[1]
-    2.25e-01 1.00e+04 output[2]
-                      encoder.block.2.layer.1.layer_norm T5LayerNorm
-    8.69e-02 4.18e-01 weight
-    2.65e-04 3.42e+03 input[0]
-    1.79e-06 4.65e+00 output
-                      encoder.block.2.layer.1.DenseReluDense.wi_0 Linear
-    2.17e-07 4.50e+00 weight
-    1.79e-06 4.65e+00 input[0]
-    2.68e-06 3.70e+01 output
-                      encoder.block.2.layer.1.DenseReluDense.wi_1 Linear
-    8.08e-07 2.66e+01 weight
-    1.79e-06 4.65e+00 input[0]
-    1.27e-04 2.37e+02 output
-                      encoder.block.2.layer.1.DenseReluDense.dropout Dropout
-    0.00e+00 8.76e+03 input[0]
-    0.00e+00 9.74e+03 output
-                      encoder.block.2.layer.1.DenseReluDense.wo Linear
-    1.01e-06 6.44e+00 weight
-    0.00e+00 9.74e+03 input[0]
-    3.18e-04 6.27e+04 output
-                      encoder.block.2.layer.1.DenseReluDense T5DenseGatedGeluDense
-    1.79e-06 4.65e+00 input[0]
-    3.18e-04 6.27e+04 output
-                      encoder.block.2.layer.1.dropout Dropout
-    3.18e-04 6.27e+04 input[0]
-    0.00e+00      inf output
-
-The example output has been trimmed in the middle for brevity.
-
-The second column shows the value of the absolute largest element, so if you have a closer look at the last few frames,
-the inputs and outputs were in the range of ``1e4``. So when this training was done under fp16 mixed precision the very
-last step overflowed (since under ``fp16`` the largest number before ``inf`` is ``64e3``). To avoid overflows under
-``fp16`` the activations must remain way below ``1e4``, because ``1e4 * 1e4 = 1e8`` so any matrix multiplication with
-large activations is going to lead to a numerical overflow condition.
-
-At the very start of the trace you can discover at which batch number the problem occurred (here ``Detected inf/nan
-during batch_number=0`` means the problem occurred on the first batch).
-
-Each reported frame starts by declaring the fully qualified entry for the corresponding module this frame is reporting
-for. If we look just at this frame:
-
-.. code-block::
-
-                      encoder.block.2.layer.1.layer_norm T5LayerNorm
-    8.69e-02 4.18e-01 weight
-    2.65e-04 3.42e+03 input[0]
-    1.79e-06 4.65e+00 output
-
-Here, ``encoder.block.2.layer.1.layer_norm`` indicates that it was a layer norm for the first layer, of the second
-block of the encoder. And the specific calls of the ``forward`` is ``T5LayerNorm``.
-
-Let's look at the last few frames of that report:
-
-.. code-block::
-
-        Detected inf/nan during batch_number=0
-        Last 21 forward frames:
-        abs min  abs max  metadata
-        [...]
-                          encoder.block.2.layer.1.DenseReluDense.wi_0 Linear
-        2.17e-07 4.50e+00 weight
-        1.79e-06 4.65e+00 input[0]
-        2.68e-06 3.70e+01 output
-                          encoder.block.2.layer.1.DenseReluDense.wi_1 Linear
-        8.08e-07 2.66e+01 weight
-        1.79e-06 4.65e+00 input[0]
-        1.27e-04 2.37e+02 output
-                          encoder.block.2.layer.1.DenseReluDense.wo Linear
-        1.01e-06 6.44e+00 weight
-        0.00e+00 9.74e+03 input[0]
-        3.18e-04 6.27e+04 output
-                          encoder.block.2.layer.1.DenseReluDense T5DenseGatedGeluDense
-        1.79e-06 4.65e+00 input[0]
-        3.18e-04 6.27e+04 output
-                          encoder.block.2.layer.1.dropout Dropout
-        3.18e-04 6.27e+04 input[0]
-        0.00e+00      inf output
-
-The last frame reports for ``Dropout.forward`` function with the first entry for the only input and the second for the
-only output. You can see that it was called from an attribute ``dropout`` inside ``DenseReluDense`` class. We can see
-that it happened during the first layer, of the 2nd block, during the very first batch. Finally, the absolute largest
-input elements was ``6.27e+04`` and same for the output was ``inf``.
-
-You can see here, that ``T5DenseGatedGeluDense.forward`` resulted in output activations, whose absolute max value was
-around 62.7K, which is very close to fp16's top limit of 64K. In the next frame we have ``Dropout`` which renormalizes
-the weights, after it zeroed some of the elements, which pushes the absolute max value to more than 64K, and we get an
-overflow (``inf``).
-
-As you can see it's the previous frames that we need to look into when the numbers start going into very large for fp16
-numbers.
-
-Let's match the report to the code from ``models/t5/modeling_t5.py``:
-
-.. code-block:: python
-
-    class T5DenseGatedGeluDense(nn.Module):
-        def __init__(self, config):
-            super().__init__()
-            self.wi_0 = nn.Linear(config.d_model, config.d_ff, bias=False)
-            self.wi_1 = nn.Linear(config.d_model, config.d_ff, bias=False)
-            self.wo = nn.Linear(config.d_ff, config.d_model, bias=False)
-            self.dropout = nn.Dropout(config.dropout_rate)
-            self.gelu_act = ACT2FN["gelu_new"]
-
-        def forward(self, hidden_states):
-            hidden_gelu = self.gelu_act(self.wi_0(hidden_states))
-            hidden_linear = self.wi_1(hidden_states)
-            hidden_states = hidden_gelu * hidden_linear
-            hidden_states = self.dropout(hidden_states)
-            hidden_states = self.wo(hidden_states)
-            return hidden_states
-
-Now it's easy to see the ``dropout`` call, and all the previous calls as well.
-
-Since the detection is happening in a forward hook, these reports are printed immediately after each ``forward``
-returns.
-
-Going back to the full report, to act on it and to fix the problem, we need to go a few frames up where the numbers
-started to go up and most likely switch to the ``fp32`` mode here, so that the numbers don't overflow when multiplied
-or summed up. Of course, there might be other solutions. For example, we could turn off ``amp`` temporarily if it's
-enabled, after moving the original ``forward`` into a helper wrapper, like so:
-
-.. code-block:: python
-
-    def _forward(self, hidden_states):
-        hidden_gelu = self.gelu_act(self.wi_0(hidden_states))
-        hidden_linear = self.wi_1(hidden_states)
-        hidden_states = hidden_gelu * hidden_linear
-        hidden_states = self.dropout(hidden_states)
-        hidden_states = self.wo(hidden_states)
-        return hidden_states
-
-    import torch
-    def forward(self, hidden_states):
-        if torch.is_autocast_enabled():
-             with torch.cuda.amp.autocast(enabled=False):
-                 return self._forward(hidden_states)
-         else:
-             return self._forward(hidden_states)
-
-Since the automatic detector only reports on inputs and outputs of full frames, once you know where to look, you may
-want to analyse the intermediary stages of any specific ``forward`` function as well. In such a case you can use the
-``detect_overflow`` helper function to inject the detector where you want it, for example:
-
-.. code-block:: python
-
-    from debug_utils import detect_overflow
-
-    class T5LayerFF(nn.Module):
-        [...]
-        def forward(self, hidden_states):
-            forwarded_states = self.layer_norm(hidden_states)
-            detect_overflow(forwarded_states, "after layer_norm")
-            forwarded_states = self.DenseReluDense(forwarded_states)
-            detect_overflow(forwarded_states, "after DenseReluDense")
-            return hidden_states + self.dropout(forwarded_states)
-
-You can see that we added 2 of these and now we track if ``inf`` or ``nan`` for ``forwarded_states`` was detected
-somewhere in between.
-
-Actually, the detector already reports these because each of the calls in the example above is a `nn.Module``, but
-let's say if you had some local direct calculations this is how you'd do that.
-
-Additionally, if you're instantiating the debugger in your own code, you can adjust the number of frames printed from
-its default, e.g.:
-
-.. code-block:: python
-
-    from .debug_utils import DebugUnderflowOverflow
-    debug_overflow = DebugUnderflowOverflow(model, max_frames_to_save=100)
-
-Specific batch absolute mix and max value tracing
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The same debugging class can be used for per-batch tracing with the underflow/overflow detection feature turned off.
-
-Let's say you want to watch the absolute min and max values for all the ingredients of each ``forward`` call of a given
-batch, and only do that for batches 1 and 3. Then you instantiate this class as:
-
-.. code-block:: python
-
-    debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1,3])
-
-And now full batches 1 and 3 will be traced using the same format as the underflow/overflow detector does.
-
-Batches are 0-indexed.
-
-This is helpful if you know that the program starts misbehaving after a certain batch number, so you can fast-forward
-right to that area. Here is a sample truncated output for such configuration:
-
-.. code-block::
-
-                      *** Starting batch number=1 ***
-    abs min  abs max  metadata
-                      shared Embedding
-    1.01e-06 7.92e+02 weight
-    0.00e+00 2.47e+04 input[0]
-    5.36e-05 7.92e+02 output
-    [...]
-                      decoder.dropout Dropout
-    1.60e-07 2.27e+01 input[0]
-    0.00e+00 2.52e+01 output
-                      decoder T5Stack
-         not a tensor output
-                      lm_head Linear
-    1.01e-06 7.92e+02 weight
-    0.00e+00 1.11e+00 input[0]
-    6.06e-02 8.39e+01 output
-                       T5ForConditionalGeneration
-         not a tensor output
-
-                      *** Starting batch number=3 ***
-    abs min  abs max  metadata
-                      shared Embedding
-    1.01e-06 7.92e+02 weight
-    0.00e+00 2.78e+04 input[0]
-    5.36e-05 7.92e+02 output
-    [...]
-
-Here you will get a huge number of frames dumped - as many as there were forward calls in your model, so it may or may
-not what you want, but sometimes it can be easier to use for debugging purposes than a normal debugger. For example, if
-a problem starts happening at batch number 150. So you can dump traces for batches 149 and 150 and compare where
-numbers started to diverge.
-
-You can also specify the batch number after which to stop the training, with:
-
-.. code-block:: python
-
-    debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1,3], abort_after_batch_num=3)
diff --git a/docs/source/main_classes/deepspeed.mdx b/docs/source/main_classes/deepspeed.mdx
new file mode 100644
index 0000000000..c68a15fbc6
--- /dev/null
+++ b/docs/source/main_classes/deepspeed.mdx
@@ -0,0 +1,1758 @@
+<!--Copyright 2020 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# DeepSpeed Integration
+
+[DeepSpeed](https://github.com/microsoft/DeepSpeed) implements everything described in the [ZeRO paper](https://arxiv.org/abs/1910.02054). Currently it provides full support for:
+
+1. Optimizer state partitioning (ZeRO stage 1)
+2. Gradient partitioning (ZeRO stage 2)
+3. Parameter partitioning (ZeRO stage 3)
+4. Custom mixed precision training handling
+5. A range of fast CUDA-extension-based optimizers
+6. ZeRO-Offload to CPU and NVMe
+
+ZeRO-Offload has its own dedicated paper: [ZeRO-Offload: Democratizing Billion-Scale Model Training](https://arxiv.org/abs/2101.06840). And NVMe-support is described in the paper [ZeRO-Infinity: Breaking the GPU
+Memory Wall for Extreme Scale Deep Learning](https://arxiv.org/abs/2104.07857).
+
+DeepSpeed ZeRO-2 is primarily used for training, as its features are of no use to inference.
+
+DeepSpeed ZeRO-3 can be used for inference as well, since it allows huge models to be loaded on multiple GPUs, which
+wouldn't be possible on a single GPU.
+
+🤗 Transformers integrates [DeepSpeed](https://github.com/microsoft/DeepSpeed) via 2 options:
+
+1. Integration of the core DeepSpeed features via [`Trainer`]. This is an everything-done-for-you type
+   of integration: just supply your custom config file or use our template and you have nothing else to do. Most of
+   this document is focused on this feature.
+2. If you don't use [`Trainer`] and want to use your own trainer where you integrated DeepSpeed
+   yourself, core functions like `from_pretrained` and `from_config` include integration of essential
+   parts of DeepSpeed like `zero.Init` for ZeRO stage 3 and higher. To tap into this feature read the docs on
+   [deepspeed-non-trainer-integration](#deepspeed-non-trainer-integration).
+
+What is integrated:
+
+Training:
+
+1. DeepSpeed ZeRO training supports the full ZeRO stages 1, 2 and 3 with ZeRO-Infinity (CPU and NVMe offload).
+
+Inference:
+
+1. DeepSpeed ZeRO Inference supports ZeRO stage 3 with ZeRO-Infinity. It uses the same ZeRO protocol as training, but
+   it doesn't use an optimizer or an lr scheduler, and only stage 3 is relevant. For more details see:
+   [deepspeed-zero-inference](#deepspeed-zero-inference).
+
+There is also DeepSpeed Inference - this is a totally different technology which uses Tensor Parallelism instead of
+ZeRO (coming soon).
+
+
+
+<a id='deepspeed-trainer-integration'></a>
+
+
+## Trainer DeepSpeed Integration
+
+
+<a id='deepspeed-installation'></a>
+
+### Installation
+
+Install the library via PyPI:
+
+```bash
+pip install deepspeed
+```
+
+or via `transformers`' `extras`:
+
+```bash
+pip install transformers[deepspeed]
+```
+
+or find more details on [DeepSpeed's GitHub page](https://github.com/microsoft/deepspeed#installation) and
+[advanced install](https://www.deepspeed.ai/tutorials/advanced-install/).
+
+If you're still struggling with the build, first make sure to read [zero-install-notes](#zero-install-notes).
+
+If you don't prebuild the extensions and rely on them to be built at run time and you tried all of the above solutions
+to no avail, the next thing to try is to pre-build the modules before installing them.
+
+To make a local build for DeepSpeed:
+
+```bash
+git clone https://github.com/microsoft/DeepSpeed/
+cd DeepSpeed
+rm -rf build
+TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 pip install . \
+--global-option="build_ext" --global-option="-j8" --no-cache -v \
+--disable-pip-version-check 2>&1 | tee build.log
+```
+
+If you intend to use NVMe offload you will need to also include `DS_BUILD_AIO=1` in the instructions above (and also
+install *libaio-dev* system-wide).
+
+Edit `TORCH_CUDA_ARCH_LIST` to insert the code for the architectures of the GPU cards you intend to use. Assuming all
+your cards are the same you can get the arch via:
+
+```bash
+CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.get_device_capability())"
+```
+
+So if you get `(8, 6)`, then use `TORCH_CUDA_ARCH_LIST="8.6"`. If you have multiple different cards, you can list all
+of them like so: `TORCH_CUDA_ARCH_LIST="6.1;8.6"`.
+
+If you need to use the same setup on multiple machines, make a binary wheel:
+
+```bash
+git clone https://github.com/microsoft/DeepSpeed/
+cd DeepSpeed
+rm -rf build
+TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 \
+python setup.py build_ext -j8 bdist_wheel
+```
+
+It will generate something like `dist/deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl`, which you can now install
+with `pip install deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl` locally or on any other machine.
+
+Again, remember to adjust `TORCH_CUDA_ARCH_LIST` to the target architectures.
+
+You can find the complete list of NVIDIA GPUs and their corresponding **Compute Capabilities** (same as arch in this
+context) [here](https://developer.nvidia.com/cuda-gpus).
+
+You can check the archs PyTorch was built with using:
+
+```bash
+python -c "import torch; print(torch.cuda.get_arch_list())"
+```
+
+Here is how to find out the arch for one of the installed GPUs. For example, for GPU 0:
+
+```bash
+CUDA_VISIBLE_DEVICES=0 python -c "import torch; \
+print(torch.cuda.get_device_properties(torch.device('cuda')))"
+```
+
+If the output is:
+
+```bash
+_CudaDeviceProperties(name='GeForce RTX 3090', major=8, minor=6, total_memory=24268MB, multi_processor_count=82)
+```
+
+then you know that this card's arch is `8.6`.
+
+You can also leave `TORCH_CUDA_ARCH_LIST` out completely and then the build program will automatically query the
+architecture of the GPUs the build is made on. This may or may not match the GPUs on the target machines, that's why
+it's best to specify the desired archs explicitly.
+
+If after trying everything suggested you still encounter build issues, please open an issue on the
+[DeepSpeed](https://github.com/microsoft/DeepSpeed/issues) GitHub tracker.
+
+
+
+<a id='deepspeed-multi-gpu'></a>
+
+### Deployment with multiple GPUs
+
+To deploy this feature with multiple GPUs, adjust the [`Trainer`] command line arguments as
+follows:
+
+1. replace `python -m torch.distributed.launch` with `deepspeed`.
+2. add a new argument `--deepspeed ds_config.json`, where `ds_config.json` is the DeepSpeed configuration file as
+   documented [here](https://www.deepspeed.ai/docs/config-json/). The file naming is up to you.
+
+Therefore, if your original command line looked as follows:
+
+```bash
+python -m torch.distributed.launch --nproc_per_node=2 your_program.py <normal cl args>
+```
+
+Now it should be:
+
+```bash
+deepspeed --num_gpus=2 your_program.py <normal cl args> --deepspeed ds_config.json
+```
+
+Unlike `torch.distributed.launch`, where you have to specify how many GPUs to use with `--nproc_per_node`, with the
+`deepspeed` launcher you don't have to use the corresponding `--num_gpus` if you want all of your GPUs used. The
+full details on how to configure various nodes and GPUs can be found [here](https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node).
+
+In fact, you can continue using `-m torch.distributed.launch` with DeepSpeed as long as you don't need to use
+`deepspeed` launcher-specific arguments. Typically if you don't need a multi-node setup you're not required to use
+the `deepspeed` launcher. But since in the DeepSpeed documentation it'll be used everywhere, for consistency we will
+use it here as well.
+
+Here is an example of running `run_translation.py` under DeepSpeed deploying all available GPUs:
+
+```bash
+deepspeed examples/pytorch/translation/run_translation.py \
+--deepspeed tests/deepspeed/ds_config_zero3.json \
+--model_name_or_path t5-small --per_device_train_batch_size 1 \
+--output_dir output_dir --overwrite_output_dir --fp16 \
+--do_train --max_train_samples 500 --num_train_epochs 1 \
+--dataset_name wmt16 --dataset_config "ro-en" \
+--source_lang en --target_lang ro
+```
+
+Note that in the DeepSpeed documentation you are likely to see `--deepspeed --deepspeed_config ds_config.json` - i.e.
+two DeepSpeed-related arguments, but for the sake of simplicity, and since there are already so many arguments to deal
+with, we combined the two into a single argument.
+
+For some practical usage examples, please, see this [post](https://github.com/huggingface/transformers/issues/8771#issuecomment-759248400).
+
+
+
+<a id='deepspeed-one-gpu'></a>
+
+### Deployment with one GPU
+
+To deploy DeepSpeed with one GPU, adjust the [`Trainer`] command line arguments as follows:
+
+```bash
+deepspeed --num_gpus=1 examples/pytorch/translation/run_translation.py \
+--deepspeed tests/deepspeed/ds_config_zero2.json \
+--model_name_or_path t5-small --per_device_train_batch_size 1 \
+--output_dir output_dir --overwrite_output_dir --fp16 \
+--do_train --max_train_samples 500 --num_train_epochs 1 \
+--dataset_name wmt16 --dataset_config "ro-en" \
+--source_lang en --target_lang ro
+```
+
+This is almost the same as with multiple-GPUs, but here we tell DeepSpeed explicitly to use just one GPU via
+`--num_gpus=1`. By default, DeepSpeed deploys all GPUs it can see on the given node. If you have only 1 GPU to start
+with, then you don't need this argument. The following [documentation](https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node) discusses the launcher options.
+
+Why would you want to use DeepSpeed with just one GPU?
+
+1. It has a ZeRO-offload feature which can delegate some computations and memory to the host's CPU and RAM, and thus
+   leave more GPU resources for the model's needs - e.g. a larger batch size, or fitting a very big model which
+   normally wouldn't fit.
+2. It provides a smart GPU memory management system that minimizes memory fragmentation, which again allows you to fit
+   bigger models and data batches.
+
+While we are going to discuss the configuration in detail next, the key to getting a huge improvement on a single GPU
+with DeepSpeed is to have at least the following configuration in the configuration file:
+
+```json
+{
+  "zero_optimization": {
+     "stage": 2,
+     "offload_optimizer": {
+         "device": "cpu",
+         "pin_memory": true
+     },
+     "allgather_partitions": true,
+     "allgather_bucket_size": 2e8,
+     "reduce_scatter": true,
+     "reduce_bucket_size": 2e8,
+     "overlap_comm": true,
+     "contiguous_gradients": true
+  }
+}
+```
+
+which enables optimizer offload and some other important features. You may experiment with the buffer sizes; you will
+find more details in the discussion below.
+
+For a practical usage example of this type of deployment, please, see this [post](https://github.com/huggingface/transformers/issues/8771#issuecomment-759176685).
+
+You may also try the ZeRO-3 with CPU and NVMe offload as explained further in this document.
+
+<!--- TODO: Benchmark whether we can get better performance out of ZeRO-3 vs. ZeRO-2 on a single GPU, and then
+recommend ZeRO-3 config as starting one. -->
+
+Notes:
+
+- if you need to run on a specific GPU, which is different from GPU 0, you can't use `CUDA_VISIBLE_DEVICES` to limit
+  the visible scope of available GPUs. Instead, you have to use the following syntax:
+
+  ```bash
+  deepspeed --include localhost:1 examples/pytorch/translation/run_translation.py ...
+  ```
+
+  In this example, we tell DeepSpeed to use GPU 1 (the second GPU).
+
+
+
+<a id='deepspeed-notebook'></a>
+
+### Deployment in Notebooks
+
+The problem with running notebook cells as a script is that there is no normal `deepspeed` launcher to rely on, so
+under certain setups we have to emulate it.
+
+If you're using only 1 GPU, here is how you'd have to adjust your training code in the notebook to use DeepSpeed.
+
+```python
+# DeepSpeed requires a distributed environment even when only one process is used.
+# This emulates a launcher in the notebook
+import os
+
+from transformers import Trainer, TrainingArguments
+
+os.environ['MASTER_ADDR'] = 'localhost'
+os.environ['MASTER_PORT'] = '9994' # modify if RuntimeError: Address already in use
+os.environ['RANK'] = "0"
+os.environ['LOCAL_RANK'] = "0"
+os.environ['WORLD_SIZE'] = "1"
+
+# Now proceed as normal, plus pass the deepspeed config file
+training_args = TrainingArguments(..., deepspeed="ds_config_zero3.json")
+trainer = Trainer(...)
+trainer.train()
+```
+
+Note: `...` stands for the normal arguments that you'd pass to the functions.
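+
+If the hardcoded port is frequently taken, one option is to let the operating system pick a free one. This is a
+sketch using only the standard library; `find_free_port` is a hypothetical helper, and there is a small window in
+which another process could claim the port before DeepSpeed binds to it.
+
+```python
+import os
+import socket
+
+
+def find_free_port():
+    """Ask the OS for a currently unused TCP port on localhost."""
+    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
+        sock.bind(("localhost", 0))  # port 0 lets the OS choose a free port
+        return sock.getsockname()[1]
+
+
+# Same launcher emulation as above, but without a fixed port number
+os.environ["MASTER_ADDR"] = "localhost"
+os.environ["MASTER_PORT"] = str(find_free_port())
+os.environ["RANK"] = "0"
+os.environ["LOCAL_RANK"] = "0"
+os.environ["WORLD_SIZE"] = "1"
+```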
+
+If you want to use more than 1 GPU, you must use a multi-process environment for DeepSpeed to work. That is, you have
+to use the launcher for that purpose and this cannot be accomplished by emulating the distributed environment presented
+at the beginning of this section.
+
+If you want to create the config file on the fly in the notebook in the current directory, you could have a dedicated
+cell with:
+
+```python
+%%bash
+cat <<'EOT' > ds_config_zero3.json
+{
+    "fp16": {
+        "enabled": "auto",
+        "loss_scale": 0,
+        "loss_scale_window": 1000,
+        "initial_scale_power": 16,
+        "hysteresis": 2,
+        "min_loss_scale": 1
+    },
+
+    "optimizer": {
+        "type": "AdamW",
+        "params": {
+            "lr": "auto",
+            "betas": "auto",
+            "eps": "auto",
+            "weight_decay": "auto"
+        }
+    },
+
+    "scheduler": {
+        "type": "WarmupLR",
+        "params": {
+            "warmup_min_lr": "auto",
+            "warmup_max_lr": "auto",
+            "warmup_num_steps": "auto"
+        }
+    },
+
+    "zero_optimization": {
+        "stage": 3,
+        "offload_optimizer": {
+            "device": "cpu",
+            "pin_memory": true
+        },
+        "offload_param": {
+            "device": "cpu",
+            "pin_memory": true
+        },
+        "overlap_comm": true,
+        "contiguous_gradients": true,
+        "sub_group_size": 1e9,
+        "reduce_bucket_size": "auto",
+        "stage3_prefetch_bucket_size": "auto",
+        "stage3_param_persistence_threshold": "auto",
+        "stage3_max_live_parameters": 1e9,
+        "stage3_max_reuse_distance": 1e9,
+        "stage3_gather_fp16_weights_on_model_save": true
+    },
+
+    "gradient_accumulation_steps": "auto",
+    "gradient_clipping": "auto",
+    "steps_per_print": 2000,
+    "train_batch_size": "auto",
+    "train_micro_batch_size_per_gpu": "auto",
+    "wall_clock_breakdown": false
+}
+EOT
+```
+
+If the training script is in a normal file and not in the notebook cells, you can launch `deepspeed` normally via
+shell from a cell. For example, to use `run_translation.py` you would launch it with:
+
+```python
+!git clone https://github.com/huggingface/transformers
+!cd transformers; deepspeed examples/pytorch/translation/run_translation.py ...
+```
+
+or with `%%bash` magic, where you can write a multi-line code for the shell program to run:
+
+```python
+%%bash
+
+git clone https://github.com/huggingface/transformers
+cd transformers
+deepspeed examples/pytorch/translation/run_translation.py ...
+```
+
+In such case you don't need any of the code presented at the beginning of this section.
+
+Note: While the `%%bash` magic is neat, it currently buffers the output, so you won't see the logs until the process
+completes.
+
+
+
+
+<a id='deepspeed-config'></a>
+
+### Configuration
+
+For the complete guide to the DeepSpeed configuration options that can be used in its configuration file please refer
+to the [following documentation](https://www.deepspeed.ai/docs/config-json/).
+
+You can find dozens of DeepSpeed configuration examples that address various practical needs in [the DeepSpeedExamples
+repo](https://github.com/microsoft/DeepSpeedExamples):
+
+```bash
+git clone https://github.com/microsoft/DeepSpeedExamples
+cd DeepSpeedExamples
+find . -name '*json'
+```
+
+Continuing the code from above, let's say you're looking to configure the Lamb optimizer. So you can search through the
+example `.json` files with:
+
+```bash
+grep -i Lamb $(find . -name '*json')
+```
+
+Some more examples are to be found in the [main repo](https://github.com/microsoft/DeepSpeed) as well.
+
+When using DeepSpeed you always need to supply a DeepSpeed configuration file, yet some configuration parameters have
+to be configured via the command line. You will find the nuances in the rest of this guide.
+
+To get an idea of what a DeepSpeed configuration file looks like, here is one that activates ZeRO stage 2 features,
+including optimizer state CPU offload, uses the `AdamW` optimizer and the `WarmupLR` scheduler, and will enable mixed
+precision training if `--fp16` is passed:
+
+```json
+{
+    "fp16": {
+        "enabled": "auto",
+        "loss_scale": 0,
+        "loss_scale_window": 1000,
+        "initial_scale_power": 16,
+        "hysteresis": 2,
+        "min_loss_scale": 1
+    },
+
+    "optimizer": {
+        "type": "AdamW",
+        "params": {
+            "lr": "auto",
+            "betas": "auto",
+            "eps": "auto",
+            "weight_decay": "auto"
+        }
+    },
+
+    "scheduler": {
+        "type": "WarmupLR",
+        "params": {
+            "warmup_min_lr": "auto",
+            "warmup_max_lr": "auto",
+            "warmup_num_steps": "auto"
+        }
+    },
+
+    "zero_optimization": {
+        "stage": 2,
+        "offload_optimizer": {
+            "device": "cpu",
+            "pin_memory": true
+        },
+        "allgather_partitions": true,
+        "allgather_bucket_size": 2e8,
+        "overlap_comm": true,
+        "reduce_scatter": true,
+        "reduce_bucket_size": 2e8,
+        "contiguous_gradients": true
+    },
+
+    "gradient_accumulation_steps": "auto",
+    "gradient_clipping": "auto",
+    "train_batch_size": "auto",
+    "train_micro_batch_size_per_gpu": "auto"
+}
+```
+
+When you execute the program, DeepSpeed will log the configuration it received from the [`Trainer`]
+to the console, so you can see exactly what was the final configuration passed to it.
+
+
+
+<a id='deepspeed-config-passing'></a>
+
+### Passing Configuration
+
+As discussed in this document, normally the DeepSpeed configuration is passed as a path to a json file, but if you're
+not using the command line interface to configure the training, and instead instantiate the
+[`Trainer`] via [`TrainingArguments`], then for the `deepspeed` argument you can
+pass a nested `dict`. This allows you to create the configuration on the fly and doesn't require you to write it to
+the file system before passing it to [`TrainingArguments`].
+
+To summarize you can do:
+
+```python
+TrainingArguments(..., deepspeed="/path/to/ds_config.json")
+```
+
+or:
+
+```python
+ds_config_dict = dict(scheduler=scheduler_params, optimizer=optimizer_params)
+TrainingArguments(..., deepspeed=ds_config_dict)
+```
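+
+As a fuller sketch of the second form, the nested `dict` could be assembled like this. The contents of
+`optimizer_params` and `scheduler_params` are illustrative values borrowed from the configuration examples in this
+document, and the `TrainingArguments` call is left as a comment because its other arguments depend on your setup.
+
+```python
+import json
+
+# Illustrative values; in practice these come from your own setup
+optimizer_params = {
+    "type": "AdamW",
+    "params": {"lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto"},
+}
+scheduler_params = {
+    "type": "WarmupLR",
+    "params": {"warmup_min_lr": "auto", "warmup_max_lr": "auto", "warmup_num_steps": "auto"},
+}
+
+ds_config_dict = dict(scheduler=scheduler_params, optimizer=optimizer_params)
+ds_config_dict["zero_optimization"] = {"stage": 2}
+
+# json.dumps raises TypeError if the dict holds something that cannot be written as JSON
+print(json.dumps(ds_config_dict, indent=4))
+
+# The dict is then passed directly, without writing a file first:
+# training_args = TrainingArguments(..., deepspeed=ds_config_dict)
+```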
+
+<a id='deepspeed-config-shared'></a>
+
+### Shared Configuration
+
+
+<Tip warning={true}>
+
+This section is a must-read
+
+</Tip>
+
+Some configuration values are required by both the [`Trainer`] and DeepSpeed to function correctly,
+therefore, to prevent conflicting definitions, which could lead to hard-to-detect errors, we chose to configure those
+via the [`Trainer`] command line arguments.
+
+Additionally, some configuration values are derived automatically based on the model's configuration, so instead of
+remembering to manually adjust multiple values, it's best to let the [`Trainer`] do the majority
+of the configuration for you.
+
+Therefore, in the rest of this guide you will find a special configuration value: `auto`, which when set will be
+automatically replaced with the correct or most efficient value. Please feel free to choose to ignore this
+recommendation and set the values explicitly, in which case be very careful that your
+[`Trainer`] arguments and DeepSpeed configurations agree. For example, are you using the same
+learning rate, or batch size, or gradient accumulation settings? If these mismatch, the training may fail in very
+difficult-to-detect ways. You have been warned.
+
+There are multiple other values that are specific to DeepSpeed only, and those you will have to set manually to suit
+your needs.
+
+In your own programs, you can also use the following approach if you'd like to modify the DeepSpeed config as a master
+and configure [`TrainingArguments`] based on that. The steps are:
+
+1. Create or load the DeepSpeed configuration to be used as a master configuration
+2. Create the [`TrainingArguments`] object based on these values
+
+Do note that some values, such as `scheduler.params.total_num_steps` are calculated by
+[`Trainer`] during `train`, but you can of course do the math yourself.
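+
+The two steps above can be sketched as follows. This is an illustration under stated assumptions rather than an
+official helper: `training_kwargs_from_ds_config` is a hypothetical function, and the mapping from DeepSpeed keys to
+[`TrainingArguments`] names (for example `gradient_clipping` to `max_grad_norm`) is our assumption about which
+arguments correspond to each other.
+
+```python
+def training_kwargs_from_ds_config(ds_config):
+    """Derive TrainingArguments keyword arguments from a master DeepSpeed config."""
+    opt = ds_config.get("optimizer", {}).get("params", {})
+    sched = ds_config.get("scheduler", {}).get("params", {})
+    betas = opt.get("betas")
+    has_betas = isinstance(betas, (list, tuple)) and len(betas) == 2
+    candidates = {
+        "learning_rate": opt.get("lr"),
+        "adam_beta1": betas[0] if has_betas else None,
+        "adam_beta2": betas[1] if has_betas else None,
+        "adam_epsilon": opt.get("eps"),
+        "weight_decay": opt.get("weight_decay"),
+        "warmup_steps": sched.get("warmup_num_steps"),
+        "fp16": ds_config.get("fp16", {}).get("enabled"),
+        "per_device_train_batch_size": ds_config.get("train_micro_batch_size_per_gpu"),
+        "gradient_accumulation_steps": ds_config.get("gradient_accumulation_steps"),
+        "max_grad_norm": ds_config.get("gradient_clipping"),
+    }
+    # Drop values that are missing or left as "auto"
+    return {key: value for key, value in candidates.items() if value is not None and value != "auto"}
+
+
+# Step 1: the master configuration (values taken from the manual examples in this document)
+master = {
+    "fp16": {"enabled": True},
+    "optimizer": {"type": "AdamW", "params": {"lr": 3e-5, "betas": [0.8, 0.999], "eps": 1e-8, "weight_decay": 3e-7}},
+    "scheduler": {"type": "WarmupLR", "params": {"warmup_min_lr": 0, "warmup_max_lr": 3e-5, "warmup_num_steps": 500}},
+    "train_micro_batch_size_per_gpu": "auto",
+}
+
+# Step 2: derive the matching arguments
+kwargs = training_kwargs_from_ds_config(master)
+print(kwargs)
+# training_args = TrainingArguments(..., deepspeed=master, **kwargs)
+```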
+
+<a id='deepspeed-zero'></a>
+
+### ZeRO
+
+[Zero Redundancy Optimizer (ZeRO)](https://www.deepspeed.ai/tutorials/zero/) is the workhorse of DeepSpeed. It
+supports 3 different levels (stages) of optimization. The first one is not very interesting for scalability purposes,
+therefore this document focuses on stages 2 and 3. Stage 3 is further improved by the latest addition of ZeRO-Infinity.
+You will find more in-depth information in the DeepSpeed documentation.
+
+The `zero_optimization` section of the configuration file is the most important part ([docs](https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training)), since that is where you define
+which ZeRO stages you want to enable and how to configure them. You will find the explanation for each parameter in the
+DeepSpeed docs.
+
+This section has to be configured exclusively via DeepSpeed configuration - the [`Trainer`] provides
+no equivalent command line arguments.
+
+Note: currently DeepSpeed doesn't validate parameter names, so if you misspell any, it'll use the default setting for
+the parameter that got misspelled. You can watch the DeepSpeed engine start up log messages to see what values it is
+going to use.
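+
+Since misspelled names are silently ignored, a small check before launching can help. The sketch below is not a
+DeepSpeed feature: `unknown_zero_keys` is a hypothetical helper, and `KNOWN_ZERO_KEYS` only contains the names that
+appear in the examples of this document, so it is not an exhaustive list.
+
+```python
+import difflib
+
+# Key names collected from the examples in this document; not exhaustive
+KNOWN_ZERO_KEYS = sorted({
+    "stage", "offload_optimizer", "offload_param", "aio", "allgather_partitions",
+    "allgather_bucket_size", "overlap_comm", "reduce_scatter", "reduce_bucket_size",
+    "contiguous_gradients", "sub_group_size", "stage3_prefetch_bucket_size",
+    "stage3_param_persistence_threshold", "stage3_max_live_parameters",
+    "stage3_max_reuse_distance", "stage3_gather_fp16_weights_on_model_save",
+})
+
+
+def unknown_zero_keys(ds_config):
+    """Map each unrecognized zero_optimization key to its closest known name, if any."""
+    zero = ds_config.get("zero_optimization", {})
+    return {
+        key: difflib.get_close_matches(key, KNOWN_ZERO_KEYS, n=1)
+        for key in zero
+        if key not in KNOWN_ZERO_KEYS
+    }
+
+
+print(unknown_zero_keys({"zero_optimization": {"stage": 2, "overlap_com": True}}))
+# {'overlap_com': ['overlap_comm']}
+```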
+
+
+
+<a id='deepspeed-zero2-config'></a>
+
+#### ZeRO-2 Config
+
+The following is an example configuration for ZeRO stage 2:
+
+```json
+{
+    "zero_optimization": {
+        "stage": 2,
+        "offload_optimizer": {
+            "device": "cpu",
+            "pin_memory": true
+        },
+        "allgather_partitions": true,
+        "allgather_bucket_size": 5e8,
+        "overlap_comm": true,
+        "reduce_scatter": true,
+        "reduce_bucket_size": 5e8,
+        "contiguous_gradients": true
+    }
+}
+```
+
+**Performance tuning:**
+
+- enabling `offload_optimizer` should reduce GPU RAM usage (it requires `"stage": 2`)
+- `"overlap_comm": true` trades off increased GPU RAM usage to lower all-reduce latency. `overlap_comm` uses 4.5x
+  the `allgather_bucket_size` and `reduce_bucket_size` values. So if they are set to 5e8, this requires a 9GB
+  footprint (`5e8 x 2Bytes x 2 x 4.5`). Therefore, if you have a GPU with 8GB or less RAM, to avoid getting
+  OOM-errors you will need to reduce those parameters to about `2e8`, which would require 3.6GB. You will want to do
+  the same on larger capacity GPUs as well, if you're starting to hit OOM.
+- when reducing these buffers you're trading communication speed for more available GPU RAM. The smaller the buffer
+  size, the slower the communication, and the more GPU RAM will be available to other tasks. So if a bigger batch size
+  is important, getting a slightly slower training time could be a good trade.
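+
+The footprint arithmetic from the list above can be reproduced with a few lines. This is only a sketch of the quoted
+formula (`bucket_size x 2Bytes x 2 x 4.5`); the function name and its parameter names are hypothetical.
+
+```python
+def overlap_comm_footprint_gb(bucket_size, bytes_per_element=2, factor=2, overlap_multiplier=4.5):
+    """Estimate the GPU memory used by the buckets when overlap_comm is enabled.
+
+    The factors are taken verbatim from the formula quoted above.
+    """
+    total_bytes = bucket_size * bytes_per_element * factor * overlap_multiplier
+    return total_bytes / 1e9
+
+
+for size in (5e8, 2e8):
+    print(f"bucket size {size:g}: about {overlap_comm_footprint_gb(size):.1f}GB")
+```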
+
+
+
+<a id='deepspeed-zero3-config'></a>
+
+#### ZeRO-3 Config
+
+The following is an example configuration for ZeRO stage 3:
+
+```json
+{
+    "zero_optimization": {
+        "stage": 3,
+        "offload_optimizer": {
+            "device": "cpu",
+            "pin_memory": true
+        },
+        "offload_param": {
+            "device": "cpu",
+            "pin_memory": true
+        },
+        "overlap_comm": true,
+        "contiguous_gradients": true,
+        "sub_group_size": 1e9,
+        "reduce_bucket_size": "auto",
+        "stage3_prefetch_bucket_size": "auto",
+        "stage3_param_persistence_threshold": "auto",
+        "stage3_max_live_parameters": 1e9,
+        "stage3_max_reuse_distance": 1e9,
+        "stage3_gather_fp16_weights_on_model_save": true
+    }
+}
+```
+
+If you are getting OOMs because your model or activations don't fit into the GPU memory and you have unutilized CPU
+memory, offloading the optimizer states and parameters to CPU memory with `"device": "cpu"` may solve this limitation.
+If you don't want to offload to CPU memory, use `none` instead of `cpu` for the `device` entry. Offloading to
+NVMe is discussed further down.
+
+Pinned memory is enabled with `pin_memory` set to `true`. This feature can improve the throughput at the cost of
+making less memory available to other processes. Pinned memory is set aside for the specific process that requested it
+and it's typically accessed much faster than normal CPU memory.
+
+**Performance tuning:**
+
+- `stage3_max_live_parameters`: `1e9`
+- `stage3_max_reuse_distance`: `1e9`
+
+If hitting OOM, reduce `stage3_max_live_parameters` and `stage3_max_reuse_distance`. They should have minimal impact
+on performance unless you are doing activation checkpointing. `1e9` would consume ~2GB. The memory is shared by
+`stage3_max_live_parameters` and `stage3_max_reuse_distance`, so it's not additive, it's just 2GB total.
+
+`stage3_max_live_parameters` is the upper limit on how many full parameters you want to keep on the GPU at any given
+time. "Reuse distance" is a metric we are using to figure out when a parameter will be used again in the future, and we
+use the `stage3_max_reuse_distance` to decide whether to throw away the parameter or to keep it. If a parameter is
+going to be used again in the near future (less than `stage3_max_reuse_distance`) then we keep it to reduce
+communication overhead. This is super helpful when you have activation checkpointing enabled, where we do forward
+recompute and backward passes at a single layer granularity and want to keep the parameter in the forward recompute
+until the backward pass.
+
+The following configuration values depend on the model's hidden size:
+
+- `reduce_bucket_size`: `hidden_size*hidden_size`
+- `stage3_prefetch_bucket_size`: `0.9 * hidden_size * hidden_size`
+- `stage3_param_persistence_threshold`: `10 * hidden_size`
+
+Therefore, set these values to `auto` and the [`Trainer`] will automatically assign the recommended
+values. But, of course, feel free to set these explicitly as well.
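+
+For reference, the three formulas listed above can be evaluated by hand as in this sketch. The function name is
+hypothetical and `hidden_size=1024` is only an illustrative value, not a recommendation.
+
+```python
+def zero3_hidden_size_values(hidden_size):
+    """Compute the hidden-size-dependent ZeRO-3 values using the formulas listed above."""
+    return {
+        "reduce_bucket_size": hidden_size * hidden_size,
+        "stage3_prefetch_bucket_size": 0.9 * hidden_size * hidden_size,
+        "stage3_param_persistence_threshold": 10 * hidden_size,
+    }
+
+
+# hidden_size=1024 is an illustrative value
+for name, value in zero3_hidden_size_values(1024).items():
+    print(f"{name}: {value:g}")
+```
+
+With a hidden size of 1024 the results come out close to the `1e6`, `0.94e6` and `1e4` used in the manually set
+ZeRO-3 example further down.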
+
+`stage3_gather_fp16_weights_on_model_save` enables model fp16 weights consolidation when the model gets saved. With
+large models and multiple GPUs this is an expensive operation both in terms of memory and speed. It's currently
+required if you plan to resume the training. Watch out for future updates that will remove this limitation and make
+things more flexible.
+
+If you're migrating from ZeRO-2 configuration note that `allgather_partitions`, `allgather_bucket_size` and
+`reduce_scatter` configuration parameters are not used in ZeRO-3. If you keep these in the config file they will just
+be ignored.
+
+- `sub_group_size`: `1e9`
+
+`sub_group_size` controls the granularity in which parameters are updated during optimizer steps. Parameters are
+grouped into buckets of `sub_group_size` and each bucket is updated one at a time. When used with NVMe offload in
+ZeRO-Infinity, `sub_group_size` therefore controls the granularity in which model states are moved in and out of CPU
+memory from NVMe during the optimizer step. This prevents running out of CPU memory for extremely large models.
+
+You can leave `sub_group_size` to its default value of *1e9* when not using NVMe offload. You may want to change its
+default value in the following cases:
+
+1. Running into OOM during optimizer step: Reduce `sub_group_size` to reduce memory utilization of temporary buffers
+2. Optimizer Step is taking a long time: Increase `sub_group_size` to improve bandwidth utilization as a result of
+   the increased data buffers.
+
+
+<a id='deepspeed-nvme'></a>
+
+### NVMe Support
+
+ZeRO-Infinity allows for training incredibly large models by extending GPU and CPU memory with NVMe memory. Thanks to
+smart partitioning and tiling algorithms, each GPU needs to send and receive very small amounts of data during
+offloading, so modern NVMe proved to be fit to allow for an even larger total memory pool available to your training
+process. ZeRO-Infinity requires ZeRO-3 to be enabled.
+
+The following configuration example enables NVMe to offload both optimizer states and the params:
+
+```json
+{
+    "zero_optimization": {
+        "stage": 3,
+        "offload_optimizer": {
+            "device": "nvme",
+            "nvme_path": "/local_nvme",
+            "pin_memory": true,
+            "buffer_count": 4,
+            "fast_init": false
+        },
+        "offload_param": {
+            "device": "nvme",
+            "nvme_path": "/local_nvme",
+            "pin_memory": true,
+            "buffer_count": 5,
+            "buffer_size": 1e8,
+            "max_in_cpu": 1e9
+        },
+        "aio": {
+            "block_size": 262144,
+            "queue_depth": 32,
+            "thread_count": 1,
+            "single_submit": false,
+            "overlap_events": true
+        },
+        "overlap_comm": true,
+        "contiguous_gradients": true,
+        "sub_group_size": 1e9,
+        "reduce_bucket_size": "auto",
+        "stage3_prefetch_bucket_size": "auto",
+        "stage3_param_persistence_threshold": "auto",
+        "stage3_max_live_parameters": 1e9,
+        "stage3_max_reuse_distance": 1e9,
+        "stage3_gather_fp16_weights_on_model_save": true
+    }
+}
+```
+
+You can choose to offload both optimizer states and params to NVMe, or just one of them or none. For example, if you
+have copious amounts of CPU memory available, by all means offload to CPU memory only as it'd be faster (hint:
+*"device": "cpu"*).
+
+Here is the full documentation for offloading [optimizer states](https://www.deepspeed.ai/docs/config-json/#optimizer-offloading) and [parameters](https://www.deepspeed.ai/docs/config-json/#parameter-offloading).
+
+Make sure that your `nvme_path` is actually an NVMe: it will work with a normal hard drive or SSD, but it'll
+be much, much slower. The fast scalable training was designed with modern NVMe transfer speeds in mind (as of this
+writing one can have ~3.5GB/s read, ~3GB/s write peak speeds).
+
+In order to figure out the optimal `aio` configuration block you must run a benchmark on your target setup, as
+[explained here](https://github.com/microsoft/DeepSpeed/issues/998).
+
+
+
+<a id='deepspeed-zero2-zero3-performance'></a>
+
+#### ZeRO-2 vs ZeRO-3 Performance
+
+ZeRO-3 is likely to be slower than ZeRO-2 if everything else is configured the same because the former has to gather
+model weights in addition to what ZeRO-2 does. If ZeRO-2 meets your needs and you don't need to scale beyond a few GPUs
+then you may choose to stick to it. It's important to understand that ZeRO-3 enables a much higher scalability capacity
+at a cost of speed.
+
+It's possible to adjust ZeRO-3 configuration to make it perform closer to ZeRO-2:
+
+- set `stage3_param_persistence_threshold` to a very large number - larger than the largest parameter, e.g., `6 * hidden_size * hidden_size`. This will keep the parameters on the GPUs.
+- turn off `offload_param` since ZeRO-2 doesn't have that option.
+
+The performance will likely improve significantly with just `offload_param` turned off, even if you don't change
+`stage3_param_persistence_threshold`. Of course, these changes will impact the size of the model you can train. So
+these help you to trade scalability for speed depending on your needs.
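+
+The two adjustments can be applied to an existing ZeRO-3 configuration as in the following sketch. The helper name is
+hypothetical; it assumes the config is available as a Python `dict` and disables parameter offload by setting the
+`device` entry to `none`, as described in the ZeRO-3 config section.
+
+```python
+import copy
+
+
+def zero3_tuned_for_speed(ds_config, hidden_size):
+    """Return a copy of a ZeRO-3 config adjusted to perform closer to ZeRO-2."""
+    tuned = copy.deepcopy(ds_config)
+    zero = tuned["zero_optimization"]
+    # Larger than the largest parameter, so the parameters stay on the GPUs
+    zero["stage3_param_persistence_threshold"] = 6 * hidden_size * hidden_size
+    # Turn off parameter offload if it was configured
+    if "offload_param" in zero:
+        zero["offload_param"]["device"] = "none"
+    return tuned
+
+
+zero3 = {
+    "zero_optimization": {
+        "stage": 3,
+        "offload_param": {"device": "cpu", "pin_memory": True},
+        "stage3_param_persistence_threshold": "auto",
+    }
+}
+tuned = zero3_tuned_for_speed(zero3, hidden_size=1024)  # illustrative hidden size
+print(tuned["zero_optimization"])
+```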
+
+
+
+<a id='deepspeed-zero2-example'></a>
+
+#### ZeRO-2 Example
+
+Here is a full ZeRO-2 auto-configuration file `ds_config_zero2.json`:
+
+```json
+{
+    "fp16": {
+        "enabled": "auto",
+        "loss_scale": 0,
+        "loss_scale_window": 1000,
+        "initial_scale_power": 16,
+        "hysteresis": 2,
+        "min_loss_scale": 1
+    },
+
+    "optimizer": {
+        "type": "AdamW",
+        "params": {
+            "lr": "auto",
+            "betas": "auto",
+            "eps": "auto",
+            "weight_decay": "auto"
+        }
+    },
+
+    "scheduler": {
+        "type": "WarmupLR",
+        "params": {
+            "warmup_min_lr": "auto",
+            "warmup_max_lr": "auto",
+            "warmup_num_steps": "auto"
+        }
+    },
+
+    "zero_optimization": {
+        "stage": 2,
+        "offload_optimizer": {
+            "device": "cpu",
+            "pin_memory": true
+        },
+        "allgather_partitions": true,
+        "allgather_bucket_size": 2e8,
+        "overlap_comm": true,
+        "reduce_scatter": true,
+        "reduce_bucket_size": 2e8,
+        "contiguous_gradients": true
+    },
+
+    "gradient_accumulation_steps": "auto",
+    "gradient_clipping": "auto",
+    "steps_per_print": 2000,
+    "train_batch_size": "auto",
+    "train_micro_batch_size_per_gpu": "auto",
+    "wall_clock_breakdown": false
+}
+```
+
+Here is a full ZeRO-2 all-enabled manually set configuration file. It is here mainly for you to see what the typical
+values look like, but we highly recommend using the one with multiple `auto` settings in it.
+
+```json
+{
+    "fp16": {
+        "enabled": true,
+        "loss_scale": 0,
+        "loss_scale_window": 1000,
+        "initial_scale_power": 16,
+        "hysteresis": 2,
+        "min_loss_scale": 1
+    },
+
+    "optimizer": {
+        "type": "AdamW",
+        "params": {
+            "lr": 3e-5,
+            "betas": [0.8, 0.999],
+            "eps": 1e-8,
+            "weight_decay": 3e-7
+        }
+    },
+
+    "scheduler": {
+        "type": "WarmupLR",
+        "params": {
+            "warmup_min_lr": 0,
+            "warmup_max_lr": 3e-5,
+            "warmup_num_steps": 500
+        }
+    },
+
+    "zero_optimization": {
+        "stage": 2,
+        "offload_optimizer": {
+            "device": "cpu",
+            "pin_memory": true
+        },
+        "allgather_partitions": true,
+        "allgather_bucket_size": 2e8,
+        "overlap_comm": true,
+        "reduce_scatter": true,
+        "reduce_bucket_size": 2e8,
+        "contiguous_gradients": true
+    },
+
+    "steps_per_print": 2000,
+    "wall_clock_breakdown": false
+}
+```
+
+<a id='deepspeed-zero3-example'></a>
+
+#### ZeRO-3 Example
+
+Here is a full ZeRO-3 auto-configuration file `ds_config_zero3.json`:
+
+
+```json
+{
+    "fp16": {
+        "enabled": "auto",
+        "loss_scale": 0,
+        "loss_scale_window": 1000,
+        "initial_scale_power": 16,
+        "hysteresis": 2,
+        "min_loss_scale": 1
+    },
+
+    "optimizer": {
+        "type": "AdamW",
+        "params": {
+            "lr": "auto",
+            "betas": "auto",
+            "eps": "auto",
+            "weight_decay": "auto"
+        }
+    },
+
+    "scheduler": {
+        "type": "WarmupLR",
+        "params": {
+            "warmup_min_lr": "auto",
+            "warmup_max_lr": "auto",
+            "warmup_num_steps": "auto"
+        }
+    },
+
+    "zero_optimization": {
+        "stage": 3,
+        "offload_optimizer": {
+            "device": "cpu",
+            "pin_memory": true
+        },
+        "offload_param": {
+            "device": "cpu",
+            "pin_memory": true
+        },
+        "overlap_comm": true,
+        "contiguous_gradients": true,
+        "sub_group_size": 1e9,
+        "reduce_bucket_size": "auto",
+        "stage3_prefetch_bucket_size": "auto",
+        "stage3_param_persistence_threshold": "auto",
+        "stage3_max_live_parameters": 1e9,
+        "stage3_max_reuse_distance": 1e9,
+        "stage3_gather_fp16_weights_on_model_save": true
+    },
+
+    "gradient_accumulation_steps": "auto",
+    "gradient_clipping": "auto",
+    "steps_per_print": 2000,
+    "train_batch_size": "auto",
+    "train_micro_batch_size_per_gpu": "auto",
+    "wall_clock_breakdown": false
+}
+```
+
+Here is a full ZeRO-3 all-enabled manually set configuration file. It is here mainly for you to see what the typical
+values look like, but we highly recommend using the one with multiple `auto` settings in it.
+
+```json
+{
+    "fp16": {
+        "enabled": true,
+        "loss_scale": 0,
+        "loss_scale_window": 1000,
+        "initial_scale_power": 16,
+        "hysteresis": 2,
+        "min_loss_scale": 1
+    },
+
+    "optimizer": {
+        "type": "AdamW",
+        "params": {
+            "lr": 3e-5,
+            "betas": [0.8, 0.999],
+            "eps": 1e-8,
+            "weight_decay": 3e-7
+        }
+    },
+
+    "scheduler": {
+        "type": "WarmupLR",
+        "params": {
+            "warmup_min_lr": 0,
+            "warmup_max_lr": 3e-5,
+            "warmup_num_steps": 500
+        }
+    },
+
+    "zero_optimization": {
+        "stage": 3,
+        "offload_optimizer": {
+            "device": "cpu",
+            "pin_memory": true
+        },
+        "offload_param": {
+            "device": "cpu",
+            "pin_memory": true
+        },
+        "overlap_comm": true,
+        "contiguous_gradients": true,
+        "sub_group_size": 1e9,
+        "reduce_bucket_size": 1e6,
+        "stage3_prefetch_bucket_size": 0.94e6,
+        "stage3_param_persistence_threshold": 1e4,
+        "stage3_max_live_parameters": 1e9,
+        "stage3_max_reuse_distance": 1e9,
+        "stage3_gather_fp16_weights_on_model_save": true
+    },
+
+    "steps_per_print": 2000,
+    "wall_clock_breakdown": false
+}
+```
+
+### Optimizer and Scheduler
+
+As long as you don't enable `offload_optimizer` you can mix and match DeepSpeed and HuggingFace schedulers and
+optimizers, with the exception of using the combination of HuggingFace scheduler and DeepSpeed optimizer:
+
+| Combos       | HF Scheduler | DS Scheduler |
+|:-------------|:-------------|:-------------|
+| HF Optimizer | Yes          | Yes          |
+| DS Optimizer | No           | Yes          |
+
+It is possible to use a non-DeepSpeed optimizer when `offload_optimizer` is enabled, as long as it has both CPU and
+GPU implementation (except LAMB).
+
+
+
+
+<a id='deepspeed-optimizer'></a>
+
+#### Optimizer
+
+
+DeepSpeed's main optimizers are Adam, AdamW, OneBitAdam, and Lamb. These have been thoroughly tested with ZeRO and are
+thus recommended. DeepSpeed can, however, also import other optimizers from `torch`. The full documentation is [here](https://www.deepspeed.ai/docs/config-json/#optimizer-parameters).
+
+If you don't configure the `optimizer` entry in the configuration file, the [`Trainer`] will
+automatically set it to `AdamW` and will use the supplied values or the defaults for the following command line
+arguments: `--learning_rate`, `--adam_beta1`, `--adam_beta2`, `--adam_epsilon` and `--weight_decay`.
+
+Here is an example of the auto-configured `optimizer` entry for `AdamW`:
+
+```json
+{
+   "optimizer": {
+       "type": "AdamW",
+       "params": {
+         "lr": "auto",
+         "betas": "auto",
+         "eps": "auto",
+         "weight_decay": "auto"
+       }
+   }
+}
+```
+
+Note that the command line arguments will set the values in the configuration file. This is so that there is one
+definitive source of the values and to avoid hard-to-find errors when, for example, the learning rate is set to
+different values in different places. The command line takes precedence. The values that get overridden are:
+
+- `lr` with the value of `--learning_rate`
+- `betas` with the value of `--adam_beta1 --adam_beta2`
+- `eps` with the value of `--adam_epsilon`
+- `weight_decay` with the value of `--weight_decay`
+
+Therefore please remember to tune the shared hyperparameters on the command line.
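+
+As a rough illustration of the override rules above, the sketch below fills in `"auto"` placeholders from command
+line values. The helper name `fill_auto_optimizer_params` and its signature are hypothetical and are not the
+[`Trainer`]'s actual implementation; explicitly set values are simply left untouched here.
+
+```python
+def fill_auto_optimizer_params(ds_config, learning_rate, adam_beta1, adam_beta2, adam_epsilon, weight_decay):
+    """Hypothetical helper: replace "auto" optimizer params with command line values."""
+    cli_values = {
+        "lr": learning_rate,                # --learning_rate
+        "betas": [adam_beta1, adam_beta2],  # --adam_beta1 --adam_beta2
+        "eps": adam_epsilon,                # --adam_epsilon
+        "weight_decay": weight_decay,       # --weight_decay
+    }
+    params = ds_config["optimizer"]["params"]
+    for key, value in cli_values.items():
+        # only placeholders are replaced; explicit values stay as written
+        if params.get(key) == "auto":
+            params[key] = value
+    return ds_config
+```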
+
+You can also set the values explicitly:
+
+```json
+{
+   "optimizer": {
+       "type": "AdamW",
+       "params": {
+         "lr": 0.001,
+         "betas": [0.8, 0.999],
+         "eps": 1e-8,
+         "weight_decay": 3e-7
+       }
+   }
+}
+```
+
+But then you're on your own synchronizing the [`Trainer`] command line arguments and the DeepSpeed
+configuration.
+
+If you want to use another optimizer which is not listed above, you will have to add the following to the top-level
+configuration:
+
+```json
+{
+   "zero_allow_untested_optimizer": true
+}
+```
+
+Similarly to `AdamW`, you can configure other officially supported optimizers. Just remember that those may have
+different config values, e.g. for Adam you will want `weight_decay` around `0.01`.
+
+
+
+<a id='deepspeed-scheduler'></a>
+
+#### Scheduler
+
+DeepSpeed supports `LRRangeTest`, `OneCycle`, `WarmupLR` and `WarmupDecayLR` learning rate schedulers. The full
+documentation is [here](https://www.deepspeed.ai/docs/config-json/#scheduler-parameters).
+
+Here is where the schedulers overlap between 🤗 Transformers and DeepSpeed:
+
+- `WarmupLR` via `--lr_scheduler_type constant_with_warmup`
+- `WarmupDecayLR` via `--lr_scheduler_type linear`. This is also the default value for `--lr_scheduler_type`,
+  therefore, if you don't configure the scheduler, this is the scheduler that will get configured by default.
+
+If you don't configure the `scheduler` entry in the configuration file, the [`Trainer`] will use
+the values of `--lr_scheduler_type`, `--learning_rate` and `--warmup_steps` or `--warmup_ratio` to configure a
+🤗 Transformers version of it.
+
+Here is an example of the auto-configured `scheduler` entry for `WarmupLR`:
+
+```json
+{
+   "scheduler": {
+         "type": "WarmupLR",
+         "params": {
+             "warmup_min_lr": "auto",
+             "warmup_max_lr": "auto",
+             "warmup_num_steps": "auto"
+         }
+     }
+}
+```
+
+Since *"auto"* is used, the [`Trainer`] arguments will set the correct values in the configuration
+file. This is so that there is one definitive source of the values and to avoid hard-to-find errors when, for example,
+the learning rate is set to different values in different places. The command line takes precedence. The values that get set are:
+
+- `warmup_min_lr` with the value of `0`.
+- `warmup_max_lr` with the value of `--learning_rate`.
+- `warmup_num_steps` with the value of `--warmup_steps` if provided. Otherwise it will use `--warmup_ratio`
+  multiplied by the number of training steps and rounded up.
+- `total_num_steps` with either the value of `--max_steps` or if it is not provided, derived automatically at run
+  time based on the environment and the size of the dataset and other command line arguments (needed for
+  `WarmupDecayLR`).
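+
+The `warmup_num_steps` rule in the list above can be sketched as follows. The function name
+`resolve_warmup_num_steps` is hypothetical, and treating a `warmup_steps` of `0` as "not provided" is an assumption
+made for this illustration:
+
+```python
+import math
+
+
+def resolve_warmup_num_steps(warmup_steps, warmup_ratio, num_training_steps):
+    """Hypothetical helper: pick the warmup length from --warmup_steps or --warmup_ratio."""
+    if warmup_steps > 0:
+        # an explicit --warmup_steps wins
+        return warmup_steps
+    # otherwise: ratio multiplied by the number of training steps, rounded up
+    return math.ceil(num_training_steps * warmup_ratio)
+```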
+
+You can, of course, take over any or all of the configuration values and set those yourself:
+
+```json
+{
+   "scheduler": {
+         "type": "WarmupLR",
+         "params": {
+             "warmup_min_lr": 0,
+             "warmup_max_lr": 0.001,
+             "warmup_num_steps": 1000
+         }
+     }
+}
+```
+
+But then you're on your own synchronizing the [`Trainer`] command line arguments and the DeepSpeed
+configuration.
+
+For example, for `WarmupDecayLR`, you can use the following entry:
+
+```json
+{
+   "scheduler": {
+         "type": "WarmupDecayLR",
+         "params": {
+             "last_batch_iteration": -1,
+             "total_num_steps": "auto",
+             "warmup_min_lr": "auto",
+             "warmup_max_lr": "auto",
+             "warmup_num_steps": "auto"
+         }
+     }
+}
+```
+
+and `total_num_steps`, `warmup_min_lr`, `warmup_max_lr` and `warmup_num_steps` will be set at loading time.
+
+
+
+
+<a id='deepspeed-fp32'></a>
+
+### fp32 Precision
+
+DeepSpeed supports both full fp32 and fp16 mixed precision.
+
+Because of the much reduced memory needs and faster speed one gets with the fp16 mixed precision, the only time you
+will want to not use it is when the model you're using doesn't behave well under this training mode. Typically this
+happens when the model wasn't pretrained in the fp16 mixed precision (e.g. often this happens with bf16-pretrained
+models). Such models may overflow or underflow leading to `NaN` loss. If this is your case then you will want to use
+the full fp32 mode, by explicitly disabling the otherwise default fp16 mixed precision mode with:
+
+```json
+{
+    "fp16": {
+        "enabled": false
+    }
+}
+```
+
+If you're using an Ampere-architecture based GPU, PyTorch version 1.7 and higher will automatically switch to using
+the much more efficient tf32 format for some operations, but the results will still be in fp32. For details and
+benchmarks, please see [TensorFloat-32(TF32) on Ampere devices](https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices). The document includes
+instructions on how to disable this automatic conversion if for some reason you prefer not to use it.
+
+
+
+
+<a id='deepspeed-amp'></a>
+
+### Automatic Mixed Precision
+
+You can use automatic mixed precision in either the PyTorch AMP-like way or the apex-like way:
+
+To configure pytorch AMP-like mode set:
+
+```json
+{
+    "fp16": {
+        "enabled": "auto",
+        "loss_scale": 0,
+        "loss_scale_window": 1000,
+        "initial_scale_power": 16,
+        "hysteresis": 2,
+        "min_loss_scale": 1
+    }
+}
+```
+
+and the [`Trainer`] will automatically enable or disable it based on the value of
+`args.fp16_backend`. The rest of config values are up to you.
+
+This mode gets enabled when `--fp16 --fp16_backend amp` command line args are passed.
+
+You can also enable/disable this mode explicitly:
+
+```json
+{
+    "fp16": {
+        "enabled": true,
+        "loss_scale": 0,
+        "loss_scale_window": 1000,
+        "initial_scale_power": 16,
+        "hysteresis": 2,
+        "min_loss_scale": 1
+    }
+}
+```
+
+But then you're on your own synchronizing the [`Trainer`] command line arguments and the DeepSpeed
+configuration.
+
+Here is the [documentation](https://www.deepspeed.ai/docs/config-json/#fp16-training-options).
+
+To configure apex AMP-like mode set:
+
+```json
+{
+    "amp": {
+        "enabled": "auto",
+        "opt_level": "auto"
+    }
+}
+```
+
+and the [`Trainer`] will automatically configure it based on the values of `args.fp16_backend` and
+`args.fp16_opt_level`.
+
+This mode gets enabled when `--fp16 --fp16_backend apex --fp16_opt_level O1` command line args are passed.
+
+You can also configure this mode explicitly:
+
+```json
+{
+    "amp": {
+        "enabled": true,
+        "opt_level": "O1"
+    }
+}
+```
+
+But then you're on your own synchronizing the [`Trainer`] command line arguments and the DeepSpeed
+configuration.
+
+Here is the [documentation](https://www.deepspeed.ai/docs/config-json/#automatic-mixed-precision-amp-training-options).
+
+
+
+<a id='deepspeed-bs'></a>
+
+### Batch Size
+
+To configure batch size, use:
+
+```json
+{
+    "train_batch_size": "auto",
+    "train_micro_batch_size_per_gpu": "auto"
+}
+```
+
+and the [`Trainer`] will automatically set `train_micro_batch_size_per_gpu` to the value of
+`args.per_device_train_batch_size` and `train_batch_size` to `args.world_size * args.per_device_train_batch_size * args.gradient_accumulation_steps`.
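+
+The arithmetic behind `train_batch_size` is simple enough to check by hand. The helper below is a hypothetical
+illustration of the formula above and is not part of the [`Trainer`] API:
+
+```python
+def effective_train_batch_size(world_size, per_device_train_batch_size, gradient_accumulation_steps):
+    """Hypothetical helper: global batch size consumed per optimizer step."""
+    return world_size * per_device_train_batch_size * gradient_accumulation_steps
+
+
+# e.g. 1 GPU, a micro batch size of 4 and 3 gradient accumulation steps
+print(effective_train_batch_size(1, 4, 3))  # prints 12
+```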
+
+You can also set the values explicitly:
+
+```json
+{
+    "train_batch_size": 12,
+    "train_micro_batch_size_per_gpu": 4
+}
+```
+
+But then you're on your own synchronizing the [`Trainer`] command line arguments and the DeepSpeed
+configuration.
+
+
+
+<a id='deepspeed-grad-acc'></a>
+
+### Gradient Accumulation
+
+To configure gradient accumulation set:
+
+```json
+{
+    "gradient_accumulation_steps": "auto"
+}
+```
+
+and the [`Trainer`] will automatically set it to the value of `args.gradient_accumulation_steps`.
+
+You can also set the value explicitly:
+
+```json
+{
+    "gradient_accumulation_steps": 3
+}
+```
+
+But then you're on your own synchronizing the [`Trainer`] command line arguments and the DeepSpeed
+configuration.
+
+
+
+<a id='deepspeed-grad-clip'></a>
+
+### Gradient Clipping
+
+To configure gradient clipping set:
+
+```json
+{
+    "gradient_clipping": "auto"
+}
+```
+
+and the [`Trainer`] will automatically set it to the value of `args.max_grad_norm`.
+
+You can also set the value explicitly:
+
+```json
+{
+    "gradient_clipping": 1.0
+}
+```
+
+But then you're on your own synchronizing the [`Trainer`] command line arguments and the DeepSpeed
+configuration.
+
+
+
+<a id='deepspeed-weight-extraction'></a>
+
+### Getting The Model Weights Out
+
+As long as you continue training and resuming using DeepSpeed you don't need to worry about anything. DeepSpeed stores
+fp32 master weights in its custom checkpoint optimizer files, which are `global_step*/*optim_states.pt` (this is a glob
+pattern), and are saved under the normal checkpoint.
+
+**FP16 Weights:**
+
+When a model is saved under ZeRO-2, you end up having the normal `pytorch_model.bin` file with the model weights, but
+they are only the fp16 version of the weights.
+
+Under ZeRO-3, things are much more complicated, since the model weights are partitioned out over multiple GPUs,
+therefore `"stage3_gather_fp16_weights_on_model_save": true` is required to get the `Trainer` to save the fp16
+version of the weights. If this setting is `false`, `pytorch_model.bin` won't be created. This is because by default
+DeepSpeed's `state_dict` contains a placeholder and not the real weights. If we were to save this `state_dict`, it
+wouldn't be possible to load it back.
+
+
+```json
+{
+    "zero_optimization": {
+        "stage3_gather_fp16_weights_on_model_save": true
+    }
+}
+```
+
+**FP32 Weights:**
+
+While the fp16 weights are fine for resuming training, if you finished finetuning your model and want to upload it to
+the [models hub](https://huggingface.co/models) or pass it to someone else you most likely will want to get the fp32
+weights. This ideally shouldn't be done during training since this is a process that requires a lot of memory, and
+therefore best to be performed offline after the training is complete. But if desired and you have plenty of free CPU
+memory it can be done in the same training script. The following sections will discuss both approaches.
+
+
+**Live FP32 Weights Recovery:**
+
+This approach may not work if your model is large and you have little free CPU memory left at the end of the training.
+
+If you have saved at least one checkpoint, and you want to use the latest one, you can do the following:
+
+```python
+from transformers.trainer_utils import get_last_checkpoint
+from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
+checkpoint_dir = get_last_checkpoint(trainer.args.output_dir)
+fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)
+```
+
+If you're using the `--load_best_model_at_end` [`TrainingArguments`] argument (to track the best
+checkpoint), then you can finish the training by first saving the final model explicitly and then doing the same as above:
+
+```python
+import os
+from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
+
+# save the final model explicitly, then recover the fp32 weights from that checkpoint
+checkpoint_dir = os.path.join(trainer.args.output_dir, "checkpoint-final")
+trainer.deepspeed.save_checkpoint(checkpoint_dir)
+fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)
+```
+
+<Tip>
+
+Note that once `load_state_dict_from_zero_checkpoint` has been run, the `model` will no longer be usable in the
+DeepSpeed context of the same application, i.e. you will need to re-initialize the deepspeed engine, since
+`model.load_state_dict(state_dict)` will remove all the DeepSpeed magic from it. So do this only at the very end
+of the training.
+
+</Tip>
+
+Of course, you don't have to use [`Trainer`] and you can adjust the examples above to your own
+trainer.
+
+If for some reason you want more refinement, you can also extract the fp32 `state_dict` of the weights and apply
+these yourself as is shown in the following example:
+
+```python
+from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
+state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir) # already on cpu
+model = model.cpu()
+model.load_state_dict(state_dict)
+```
+
+**Offline FP32 Weights Recovery:**
+
+DeepSpeed creates a special conversion script `zero_to_fp32.py` which it places in the top-level of the checkpoint
+folder. Using this script you can extract the weights at any point. The script is standalone and you no longer need to
+have the configuration file or a `Trainer` to do the extraction.
+
+Let's say your checkpoint folder looks like this:
+
+```bash
+$ ls -l output_dir/checkpoint-1/
+-rw-rw-r-- 1 stas stas 1.4K Mar 27 20:42 config.json
+drwxrwxr-x 2 stas stas 4.0K Mar 25 19:52 global_step1/
+-rw-rw-r-- 1 stas stas   12 Mar 27 13:16 latest
+-rw-rw-r-- 1 stas stas 827K Mar 27 20:42 optimizer.pt
+-rw-rw-r-- 1 stas stas 231M Mar 27 20:42 pytorch_model.bin
+-rw-rw-r-- 1 stas stas  623 Mar 27 20:42 scheduler.pt
+-rw-rw-r-- 1 stas stas 1.8K Mar 27 20:42 special_tokens_map.json
+-rw-rw-r-- 1 stas stas 774K Mar 27 20:42 spiece.model
+-rw-rw-r-- 1 stas stas 1.9K Mar 27 20:42 tokenizer_config.json
+-rw-rw-r-- 1 stas stas  339 Mar 27 20:42 trainer_state.json
+-rw-rw-r-- 1 stas stas 2.3K Mar 27 20:42 training_args.bin
+-rwxrw-r-- 1 stas stas 5.5K Mar 27 13:16 zero_to_fp32.py*
+```
+
+In this example there is just one DeepSpeed checkpoint sub-folder *global_step1*. Therefore to reconstruct the fp32
+weights just run:
+
+```bash
+python zero_to_fp32.py . pytorch_model.bin
+```
+
+This is it. `pytorch_model.bin` will now contain the full fp32 model weights consolidated from multiple GPUs.
+
+The script will automatically be able to handle either a ZeRO-2 or ZeRO-3 checkpoint.
+
+`python zero_to_fp32.py -h` will give you usage details.
+
+The script will auto-discover the deepspeed sub-folder using the contents of the file `latest`, which in the current
+example will contain `global_step1`.
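+
+The auto-discovery step can be pictured with the short sketch below. It is only an illustration of the idea and not
+the code of `zero_to_fp32.py` itself; the function name `find_deepspeed_subfolder` is hypothetical.
+
+```python
+from pathlib import Path
+
+
+def find_deepspeed_subfolder(checkpoint_dir):
+    """Hypothetical helper: resolve the DeepSpeed sub-folder named in the `latest` file."""
+    checkpoint_dir = Path(checkpoint_dir)
+    # `latest` holds the name of the sub-folder, e.g. "global_step1"
+    tag = (checkpoint_dir / "latest").read_text().strip()
+    return checkpoint_dir / tag
+```
+
+For the folder listing above, `find_deepspeed_subfolder("output_dir/checkpoint-1")` would point at
+`output_dir/checkpoint-1/global_step1`.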
+
+Note: currently the script requires 2x general RAM of the final fp32 model weights.
+
+
+### ZeRO-3 and Infinity Nuances
+
+ZeRO-3 is quite different from ZeRO-2 because of its param sharding feature.
+
+ZeRO-Infinity further extends ZeRO-3 to support NVMe memory and multiple other speed and scalability improvements.
+
+While every effort was made for things to just work without needing any special changes to your models, in certain
+circumstances you may find the following information useful.
+
+
+
+#### Constructing Massive Models
+
+DeepSpeed/ZeRO-3 can handle models with trillions of parameters which may not fit onto the existing RAM. In such cases,
+but also if you want the initialization to happen much faster, initialize the model using the *deepspeed.zero.Init()*
+context manager (which is also a function decorator), like so:
+
+```python
+from transformers import T5ForConditionalGeneration, T5Config
+import deepspeed
+with deepspeed.zero.Init():
+   config = T5Config.from_pretrained("t5-small")
+   model = T5ForConditionalGeneration(config)
+```
+
+As you can see this gives you a randomly initialized model.
+
+If you want to use a pretrained model, `model_class.from_pretrained` will activate this feature as long as
+`is_deepspeed_zero3_enabled()` returns `True`, which currently is set up by the
+[`TrainingArguments`] object if the passed DeepSpeed configuration file contains a ZeRO-3 config
+section. Thus you must create the [`TrainingArguments`] object **before** calling
+`from_pretrained`. Here is an example of a possible sequence:
+
+```python
+from transformers import AutoModel, Trainer, TrainingArguments
+training_args = TrainingArguments(..., deepspeed=ds_config)
+model = AutoModel.from_pretrained("t5-small")
+trainer = Trainer(model=model, args=training_args, ...)
+```
+
+If you're using the official example scripts and your command line arguments include `--deepspeed ds_config.json`
+with ZeRO-3 config enabled, then everything is already done for you, since this is how example scripts are written.
+
+Note: If the fp16 weights of the model can't fit onto the memory of a single GPU this feature must be used.
+
+For full details on this method and other related features please refer to [Constructing Massive Models](https://deepspeed.readthedocs.io/en/latest/zero3.html#constructing-massive-models).
+
+Also when loading fp16-pretrained models, you will want to tell `from_pretrained` to use
+`torch_dtype=torch.float16`. For details, please, see [from_pretrained-torch-dtype](#from_pretrained-torch-dtype).
+
+
+#### Gathering Parameters
+
+Under ZeRO-3 on multiple GPUs no single GPU has all the parameters unless it's the parameters for the currently
+executing layer. So if you need to access all parameters from all layers at once there is a specific method to do it.
+Most likely you won't need it, but if you do, please refer to [Gathering Parameters](https://deepspeed.readthedocs.io/en/latest/zero3.html#manual-parameter-coordination).
+
+We do, however, use it internally in several places. One such example is when loading pretrained model weights in
+`from_pretrained`. We load one layer at a time and immediately partition it to all participating GPUs, as for very
+large models it won't be possible to load it on one GPU and then spread it out to multiple GPUs, due to memory
+limitations.
+
+Also under ZeRO-3, if you write your own code and run into a model parameter weight that looks like:
+
+```python
+tensor([1.], device='cuda:0', dtype=torch.float16, requires_grad=True)
+```
+
+(note the `tensor([1.])`), or if you get an error where it says the parameter is of size `1`, instead of some much
+larger multi-dimensional shape, this means that the parameter is partitioned and what you see is a ZeRO-3 placeholder.
+
+
+
+<a id='deepspeed-zero-inference'></a>
+
+
+### ZeRO Inference
+
+ZeRO Inference uses the same config as ZeRO-3 Training. You just don't need the optimizer and scheduler sections. In
+fact you can leave these in the config file if you want to share the same one with the training. They will just be
+ignored.
+
+Otherwise you just need to pass the usual [`TrainingArguments`] arguments. For example:
+
+```bash
+deepspeed --num_gpus=2 your_program.py <normal cl args> --do_eval --deepspeed ds_config.json
+```
+
+The only important thing is that you need to use a ZeRO-3 configuration, since ZeRO-2 provides no benefit whatsoever
+for inference: only ZeRO-3 performs sharding of parameters, whereas ZeRO-2 shards only gradients and optimizer states.
+
+Here is an example of running `run_translation.py` under DeepSpeed deploying all available GPUs:
+
+```bash
+deepspeed examples/pytorch/translation/run_translation.py \
+--deepspeed tests/deepspeed/ds_config_zero3.json \
+--model_name_or_path t5-small --output_dir output_dir \
+--do_eval --max_eval_samples 50 --warmup_steps 50  \
+--max_source_length 128 --val_max_target_length 128 \
+--overwrite_output_dir --per_device_eval_batch_size 4 \
+--predict_with_generate --dataset_config "ro-en" --fp16 \
+--source_lang en --target_lang ro --dataset_name wmt16 \
+--source_prefix "translate English to Romanian: "
+```
+
+Since for inference there is no need for additional large memory used by the optimizer states and the gradients you
+should be able to fit much larger batches and/or sequence length onto the same hardware.
+
+
+Additionally DeepSpeed is currently developing a related product called Deepspeed-Inference which has no relationship
+to the ZeRO technology, but instead uses tensor parallelism to scale models that can't fit onto a single GPU. This is a
+work in progress and we will provide the integration once that product is complete.
+
+
+### Filing Issues
+
+Here is how to file an issue so that we can quickly get to the bottom of it and help you unblock your work.
+
+In your report please always include:
+
+1. the full Deepspeed config file
+
+2. either the command line arguments if you were using the [`Trainer`] or
+   [`TrainingArguments`] arguments if you were scripting the Trainer setup yourself. Please do not
+   dump the [`TrainingArguments`] as it has dozens of entries that are irrelevant.
+
+3. Output of:
+
+    ```bash
+    python -c 'import torch; print(f"torch: {torch.__version__}")'
+    python -c 'import transformers; print(f"transformers: {transformers.__version__}")'
+    python -c 'import deepspeed; print(f"deepspeed: {deepspeed.__version__}")'
+    ```
+
+4. If possible include a link to a Google Colab notebook that we can reproduce the problem with. You can use this
+   [notebook](https://github.com/stas00/porting/blob/master/transformers/deepspeed/DeepSpeed_on_colab_CLI.ipynb) as
+   a starting point.
+
+5. Unless it's impossible please always use a standard dataset that we can use and not something custom.
+
+6. If possible try to use one of the existing [examples](https://github.com/huggingface/transformers/tree/master/examples/pytorch) to reproduce the problem with.
+
+Things to consider:
+
+- Deepspeed is often not the cause of the problem.
+
+  Some of the filed issues proved to be Deepspeed-unrelated. That is once Deepspeed was removed from the setup, the
+  problem was still there.
+
+  Therefore, if it's not absolutely obvious it's a DeepSpeed-related problem, as in you can see that there is an
+  exception and you can see that DeepSpeed modules are involved, first re-test your setup without DeepSpeed in it.
+  And only if the problem persists then do mention Deepspeed and supply all the required details.
+
+- If it's clear to you that the issue is in the DeepSpeed core and not the integration part, please file the Issue
+  directly with [Deepspeed](https://github.com/microsoft/DeepSpeed/). If you aren't sure, please do not worry,
+  either Issue tracker will do. We will figure it out once you have posted it and redirect you to another Issue
+  tracker if need be.
+
+
+
+### Troubleshooting
+
+- `deepspeed` process gets killed at startup without a traceback
+
+If the `deepspeed` process gets killed at launch time without a traceback, that usually means that the program tried
+to allocate more CPU memory than your system has or your process is allowed to allocate and the OS kernel killed that
+process. This is because your configuration file most likely has either `offload_optimizer` or `offload_param` or
+both configured to offload to `cpu`. If you have NVMe, experiment with offloading to NVMe if you're running under
+ZeRO-3.
+
+Work is being done to enable estimating how much memory is needed for a specific model: [PR](https://github.com/microsoft/DeepSpeed/pull/965).
+
+
+
+
+
+
+### Notes
+
+- DeepSpeed works with the PyTorch [`Trainer`] but not TF [`TFTrainer`].
+- While DeepSpeed has a pip installable PyPI package, it is highly recommended that it gets installed from [source](https://github.com/microsoft/deepspeed#installation) to best match your hardware and also if you need to enable
+  certain features, like 1-bit Adam, which aren't available in the pypi distribution.
+- You don't have to use the [`Trainer`] to use DeepSpeed with 🤗 Transformers - you can use any model
+  with your own trainer, and you will have to adapt the latter according to [the DeepSpeed integration instructions](https://www.deepspeed.ai/getting-started/#writing-deepspeed-models).
+
+
+
+
+<a id='deepspeed-non-trainer-integration'></a>
+
+## Non-Trainer Deepspeed Integration
+
+The [`~integrations.HfDeepSpeedConfig`] is used to integrate Deepspeed into the 🤗 Transformers core
+functionality, when [`Trainer`] is not used.
+
+When using [`Trainer`] everything is automatically taken care of.
+
+When not using [`Trainer`], to efficiently deploy DeepSpeed stage 3, you must instantiate the
+[`~integrations.HfDeepSpeedConfig`] object before instantiating the model.
+
+For example for a pretrained model:
+
+```python
+from transformers.deepspeed import HfDeepSpeedConfig
+from transformers import AutoModel
+import deepspeed  # the DeepSpeed library itself, which provides `initialize`
+
+ds_config = { ... } # deepspeed config object or path to the file
+# must run before instantiating the model
+dschf = HfDeepSpeedConfig(ds_config) # keep this object alive
+model = AutoModel.from_pretrained("gpt2")
+engine = deepspeed.initialize(model=model, config_params=ds_config, ...)
+```
+
+or for a non-pretrained model:
+
+```python
+from transformers.deepspeed import HfDeepSpeedConfig
+from transformers import AutoModel, AutoConfig
+import deepspeed  # the DeepSpeed library itself, which provides `initialize`
+
+ds_config = { ... } # deepspeed config object or path to the file
+# must run before instantiating the model
+dschf = HfDeepSpeedConfig(ds_config) # keep this object alive
+config = AutoConfig.from_pretrained("gpt2")
+model = AutoModel.from_config(config)
+engine = deepspeed.initialize(model=model, config_params=ds_config, ...)
+```
+
+## HfDeepSpeedConfig
+
+[[autodoc]] deepspeed.HfDeepSpeedConfig
+    - all
+
+## Main DeepSpeed Resources
+
+- [Project's github](https://github.com/microsoft/deepspeed)
+- [Usage docs](https://www.deepspeed.ai/getting-started/)
+- [API docs](https://deepspeed.readthedocs.io/en/latest/index.html)
+- [Blog posts](https://www.microsoft.com/en-us/research/search/?q=deepspeed)
+
+Papers:
+
+- [ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](https://arxiv.org/abs/1910.02054)
+- [ZeRO-Offload: Democratizing Billion-Scale Model Training](https://arxiv.org/abs/2101.06840)
+- [ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning](https://arxiv.org/abs/2104.07857)
+
+Finally, please remember that the HuggingFace [`Trainer`] only integrates DeepSpeed; therefore, if you
+have any problems or questions regarding DeepSpeed usage, please file an issue with [DeepSpeed GitHub](https://github.com/microsoft/DeepSpeed/issues).
diff --git a/docs/source/main_classes/deepspeed.rst b/docs/source/main_classes/deepspeed.rst
deleted file mode 100644
index 4b0b8c5bdb..0000000000
--- a/docs/source/main_classes/deepspeed.rst
+++ /dev/null
@@ -1,1833 +0,0 @@
-..
-    Copyright 2020 The HuggingFace Team. All rights reserved.
-
-    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-    the License. You may obtain a copy of the License at
-
-        http://www.apache.org/licenses/LICENSE-2.0
-
-    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-    specific language governing permissions and limitations under the License.
-
-
-DeepSpeed Integration
------------------------------------------------------------------------------------------------------------------------
-
-
-`DeepSpeed <https://github.com/microsoft/DeepSpeed>`__ implements everything described in the `ZeRO paper
-<https://arxiv.org/abs/1910.02054>`__. Currently it provides full support for:
-
-1. Optimizer state partitioning (ZeRO stage 1)
-2. Gradient partitioning (ZeRO stage 2)
-3. Parameter partitioning (ZeRO stage 3)
-4. Custom mixed precision training handling
-5. A range of fast CUDA-extension-based optimizers
-6. ZeRO-Offload to CPU and NVMe
-
-ZeRO-Offload has its own dedicated paper: `ZeRO-Offload: Democratizing Billion-Scale Model Training
-<https://arxiv.org/abs/2101.06840>`__. And NVMe-support is described in the paper `ZeRO-Infinity: Breaking the GPU
-Memory Wall for Extreme Scale Deep Learning <https://arxiv.org/abs/2104.07857>`__.
-
-DeepSpeed ZeRO-2 is primarily used only for training, as its features are of no use to inference.
-
-DeepSpeed ZeRO-3 can be used for inference as well, since it allows huge models to be loaded on multiple GPUs, which
-won't be possible on a single GPU.
-
-
-
-🤗 Transformers integrates `DeepSpeed <https://github.com/microsoft/DeepSpeed>`__ via 2 options:
-
-1. Integration of the core DeepSpeed features via :class:`~transformers.Trainer`. This is everything done for you type
-   of integration - just supply your custom config file or use our template and you have nothing else to do. Most of
-   this document is focused on this feature.
-2. If you don't use :class:`~transformers.Trainer` and want to use your own Trainer where you integrated DeepSpeed
-   yourself, core functionality functions like ``from_pretrained`` and ``from_config`` include integration of essential
-   parts of DeepSpeed like ``zero.Init`` for ZeRO stage 3 and higher. To tap into this feature read the docs on
-   :ref:`deepspeed-non-trainer-integration`.
-
-What is integrated:
-
-Training:
-
-1. DeepSpeed ZeRO training supports the full ZeRO stages 1, 2 and 3 with ZeRO-Infinity (CPU and NVME offload).
-
-Inference:
-
-1. DeepSpeed ZeRO Inference supports ZeRO stage 3 with ZeRO-Infinity. It uses the same ZeRO protocol as training, but
-   it doesn't use an optimizer and a lr scheduler and only stage 3 is relevant. For more details see:
-   :ref:`deepspeed-zero-inference`.
-
-There is also DeepSpeed Inference - this is a totally different technology which uses Tensor Parallelism instead of
-ZeRO (coming soon).
-
-
-
-.. _deepspeed-trainer-integration:
-
-
-Trainer Deepspeed Integration
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-
-.. _deepspeed-installation:
-
-Installation
-=======================================================================================================================
-
-Install the library via pypi:
-
-.. code-block:: bash
-
-    pip install deepspeed
-
-or via ``transformers``' ``extras``:
-
-.. code-block:: bash
-
-    pip install transformers[deepspeed]
-
-or find more details on `DeepSpeed's GitHub page <https://github.com/microsoft/deepspeed#installation>`__ and
-`advanced install <https://www.deepspeed.ai/tutorials/advanced-install/>`__.
-
-If you're still struggling with the build, first make sure to read :ref:`zero-install-notes`.
-
-If you rely on the extensions being built at run time instead of prebuilding them, and you have tried all of the above
-solutions to no avail, the next thing to try is to pre-build the modules before installing them.
-
-To make a local build for DeepSpeed:
-
-.. code-block:: bash
-
-    git clone https://github.com/microsoft/DeepSpeed/
-    cd DeepSpeed
-    rm -rf build
-    TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 pip install . \
-    --global-option="build_ext" --global-option="-j8" --no-cache -v \
-    --disable-pip-version-check 2>&1 | tee build.log
-
-If you intend to use NVMe offload you will need to also include ``DS_BUILD_AIO=1`` in the instructions above (and also
-install ``libaio-dev`` system-wide).
-
-Edit ``TORCH_CUDA_ARCH_LIST`` to insert the code for the architectures of the GPU cards you intend to use. Assuming all
-your cards are the same you can get the arch via:
-
-.. code-block:: bash
-
-    CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.get_device_capability())"
-
-So if you get ``8, 6``, then use ``TORCH_CUDA_ARCH_LIST="8.6"``. If you have multiple different cards, you can list all
-of them like so: ``TORCH_CUDA_ARCH_LIST="6.1;8.6"``.
-
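The mapping from compute capabilities to a ``TORCH_CUDA_ARCH_LIST`` value can be sketched in a few lines of Python. This is only an illustration: ``arch_list`` is a hypothetical helper name, and the capability pairs are assumed to be the ``(major, minor)`` tuples that ``torch.cuda.get_device_capability()`` returns.

```python
def arch_list(capabilities):
    """Format (major, minor) capability pairs as a TORCH_CUDA_ARCH_LIST value.

    Hypothetical helper: duplicates are dropped and the entries are sorted,
    then joined with ';' as in the example above.
    """
    unique = sorted(set(capabilities))
    return ";".join(f"{major}.{minor}" for major, minor in unique)


# Two identical RTX 3090 cards (8.6) plus one older 6.1 card.
print(arch_list([(8, 6), (6, 1), (8, 6)]))
```

On a machine with PyTorch and CUDA available, the input list could be gathered with ``[torch.cuda.get_device_capability(i) for i in range(torch.cuda.device_count())]``.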
-If you need to use the same setup on multiple machines, make a binary wheel:
-
-.. code-block:: bash
-
-    git clone https://github.com/microsoft/DeepSpeed/
-    cd DeepSpeed
-    rm -rf build
-    TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 \
-    python setup.py build_ext -j8 bdist_wheel
-
-It will generate something like ``dist/deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl`` which you can now install
-as ``pip install deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl`` locally or on any other machine.
-
-Again, remember to adjust ``TORCH_CUDA_ARCH_LIST`` to the target architectures.
-
-You can find the complete list of NVIDIA GPUs and their corresponding **Compute Capabilities** (same as arch in this
-context) `here <https://developer.nvidia.com/cuda-gpus>`__.
-
-You can check the archs pytorch was built with using:
-
-.. code-block:: bash
-
-    python -c "import torch; print(torch.cuda.get_arch_list())"
-
-Here is how to find out the arch for one of the installed GPUs. For example, for GPU 0:
-
-.. code-block:: bash
-
-    CUDA_VISIBLE_DEVICES=0 python -c "import torch; \
-    print(torch.cuda.get_device_properties(torch.device('cuda')))"
-
-If the output is:
-
-.. code-block:: bash
-
-    _CudaDeviceProperties(name='GeForce RTX 3090', major=8, minor=6, total_memory=24268MB, multi_processor_count=82)
-
-then you know that this card's arch is ``8.6``.
-
-You can also leave ``TORCH_CUDA_ARCH_LIST`` out completely and then the build program will automatically query the
-architecture of the GPUs the build is made on. This may or may not match the GPUs on the target machines, that's why
-it's best to specify the desired archs explicitly.
-
-If after trying everything suggested you still encounter build issues, please open an issue on the `DeepSpeed
-<https://github.com/microsoft/DeepSpeed/issues>`__ GitHub issue tracker.
-
-
-
-.. _deepspeed-multi-gpu:
-
-Deployment with multiple GPUs
-=======================================================================================================================
-
-To deploy this feature with multiple GPUs adjust the :class:`~transformers.Trainer` command line arguments as
-follows:
-
-1. replace ``python -m torch.distributed.launch`` with ``deepspeed``.
-2. add a new argument ``--deepspeed ds_config.json``, where ``ds_config.json`` is the DeepSpeed configuration file as
-   documented `here <https://www.deepspeed.ai/docs/config-json/>`__. The file naming is up to you.
-
-Therefore, if your original command line looked as follows:
-
-.. code-block:: bash
-
-    python -m torch.distributed.launch --nproc_per_node=2 your_program.py <normal cl args>
-
-Now it should be:
-
-.. code-block:: bash
-
-    deepspeed --num_gpus=2 your_program.py <normal cl args> --deepspeed ds_config.json
-
-Unlike ``torch.distributed.launch``, where you have to specify how many GPUs to use with ``--nproc_per_node``, with the
-``deepspeed`` launcher you don't have to use the corresponding ``--num_gpus`` if you want all of your GPUs used. The
-full details on how to configure various nodes and GPUs can be found `here
-<https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node>`__.
-
-In fact, you can continue using ``-m torch.distributed.launch`` with DeepSpeed as long as you don't need to use
-``deepspeed`` launcher-specific arguments. Typically if you don't need a multi-node setup you're not required to use
-the ``deepspeed`` launcher. But since in the DeepSpeed documentation it'll be used everywhere, for consistency we will
-use it here as well.
-
-Here is an example of running ``run_translation.py`` under DeepSpeed deploying all available GPUs:
-
-.. code-block:: bash
-
-    deepspeed examples/pytorch/translation/run_translation.py \
-    --deepspeed tests/deepspeed/ds_config_zero3.json \
-    --model_name_or_path t5-small --per_device_train_batch_size 1   \
-    --output_dir output_dir --overwrite_output_dir --fp16 \
-    --do_train --max_train_samples 500 --num_train_epochs 1 \
-    --dataset_name wmt16 --dataset_config "ro-en" \
-    --source_lang en --target_lang ro
-
-
-Note that in the DeepSpeed documentation you are likely to see ``--deepspeed --deepspeed_config ds_config.json`` - i.e.
-two DeepSpeed-related arguments, but for the sake of simplicity, and since there are already so many arguments to deal
-with, we combined the two into a single argument.
-
-For some practical usage examples, please, see this `post
-<https://github.com/huggingface/transformers/issues/8771#issuecomment-759248400>`__.
-
-
-
-.. _deepspeed-one-gpu:
-
-Deployment with one GPU
-=======================================================================================================================
-
-To deploy DeepSpeed with one GPU adjust the :class:`~transformers.Trainer` command line arguments as follows:
-
-.. code-block:: bash
-
-    deepspeed --num_gpus=1 examples/pytorch/translation/run_translation.py \
-    --deepspeed tests/deepspeed/ds_config_zero2.json \
-    --model_name_or_path t5-small --per_device_train_batch_size 1   \
-    --output_dir output_dir --overwrite_output_dir --fp16 \
-    --do_train --max_train_samples 500 --num_train_epochs 1 \
-    --dataset_name wmt16 --dataset_config "ro-en" \
-    --source_lang en --target_lang ro
-
-This is almost the same as with multiple GPUs, but here we tell DeepSpeed explicitly to use just one GPU via
-``--num_gpus=1``. By default, DeepSpeed deploys all GPUs it can see on the given node. If you have only 1 GPU to start
-with, then you don't need this argument. The following `documentation
-<https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node>`__ discusses the launcher options.
-
-Why would you want to use DeepSpeed with just one GPU?
-
-1. It has a ZeRO-offload feature which can delegate some computations and memory to the host's CPU and RAM, and thus
-   leave more GPU resources for model's needs - e.g. larger batch size, or enabling a fitting of a very big model which
-   normally won't fit.
-2. It provides a smart GPU memory management system, that minimizes memory fragmentation, which again allows you to fit
-   bigger models and data batches.
-
-While we are going to discuss the configuration in detail next, the key to getting a huge improvement on a single GPU
-with DeepSpeed is to have at least the following configuration in the configuration file:
-
-.. code-block:: json
-
-    {
-      "zero_optimization": {
-         "stage": 2,
-         "offload_optimizer": {
-             "device": "cpu",
-             "pin_memory": true
-         },
-         "allgather_partitions": true,
-         "allgather_bucket_size": 2e8,
-         "reduce_scatter": true,
-         "reduce_bucket_size": 2e8,
-         "overlap_comm": true,
-         "contiguous_gradients": true
-      }
-    }
-
-which enables optimizer offload and some other important features. You may experiment with the buffer sizes; you will
-find more details in the discussion below.
-
-For a practical usage example of this type of deployment, please, see this `post
-<https://github.com/huggingface/transformers/issues/8771#issuecomment-759176685>`__.
-
-You may also try the ZeRO-3 with CPU and NVMe offload as explained further in this document.
-
-<!--- TODO: Benchmark whether we can get better performance out of ZeRO-3 vs. ZeRO-2 on a single GPU, and then
-recommend ZeRO-3 config as starting one. -->
-
-Notes:
-
-- if you need to run on a specific GPU, which is different from GPU 0, you can't use ``CUDA_VISIBLE_DEVICES`` to limit
-  the visible scope of available GPUs. Instead, you have to use the following syntax:
-
-   .. code-block:: bash
-
-       deepspeed --include localhost:1 examples/pytorch/translation/run_translation.py ...
-
-   In this example, we tell DeepSpeed to use GPU 1 (the second GPU).
-
-
-
-.. _deepspeed-notebook:
-
-Deployment in Notebooks
-=======================================================================================================================
-
-The problem with running notebook cells as a script is that there is no normal ``deepspeed`` launcher to rely on, so
-under certain setups we have to emulate it.
-
-If you're using only 1 GPU, here is how you'd have to adjust your training code in the notebook to use DeepSpeed.
-
-.. code-block:: python
-
-    # DeepSpeed requires a distributed environment even when only one process is used.
-    # This emulates a launcher in the notebook
-    import os
-    os.environ['MASTER_ADDR'] = 'localhost'
-    os.environ['MASTER_PORT'] = '9994' # modify if RuntimeError: Address already in use
-    os.environ['RANK'] = "0"
-    os.environ['LOCAL_RANK'] = "0"
-    os.environ['WORLD_SIZE'] = "1"
-
-    # Now proceed as normal, plus pass the deepspeed config file
-    training_args = TrainingArguments(..., deepspeed="ds_config_zero3.json")
-    trainer = Trainer(...)
-    trainer.train()
-
-Note: ``...`` stands for the normal arguments that you'd pass to the functions.
-
-If you want to use more than 1 GPU, you must use a multi-process environment for DeepSpeed to work. That is, you have
-to use the launcher for that purpose and this cannot be accomplished by emulating the distributed environment presented
-at the beginning of this section.
-
-If you want to create the config file on the fly in the notebook in the current directory, you could have a dedicated
-cell with:
-
-.. code-block:: python
-
-    %%bash
-    cat <<'EOT' > ds_config_zero3.json
-    {
-        "fp16": {
-            "enabled": "auto",
-            "loss_scale": 0,
-            "loss_scale_window": 1000,
-            "initial_scale_power": 16,
-            "hysteresis": 2,
-            "min_loss_scale": 1
-        },
-
-        "optimizer": {
-            "type": "AdamW",
-            "params": {
-                "lr": "auto",
-                "betas": "auto",
-                "eps": "auto",
-                "weight_decay": "auto"
-            }
-        },
-
-        "scheduler": {
-            "type": "WarmupLR",
-            "params": {
-                "warmup_min_lr": "auto",
-                "warmup_max_lr": "auto",
-                "warmup_num_steps": "auto"
-            }
-        },
-
-        "zero_optimization": {
-            "stage": 3,
-            "offload_optimizer": {
-                "device": "cpu",
-                "pin_memory": true
-            },
-            "offload_param": {
-                "device": "cpu",
-                "pin_memory": true
-            },
-            "overlap_comm": true,
-            "contiguous_gradients": true,
-            "sub_group_size": 1e9,
-            "reduce_bucket_size": "auto",
-            "stage3_prefetch_bucket_size": "auto",
-            "stage3_param_persistence_threshold": "auto",
-            "stage3_max_live_parameters": 1e9,
-            "stage3_max_reuse_distance": 1e9,
-            "stage3_gather_fp16_weights_on_model_save": true
-        },
-
-        "gradient_accumulation_steps": "auto",
-        "gradient_clipping": "auto",
-        "steps_per_print": 2000,
-        "train_batch_size": "auto",
-        "train_micro_batch_size_per_gpu": "auto",
-        "wall_clock_breakdown": false
-    }
-    EOT
-
-
-If the training script is in a normal file and not in the notebook cells, you can launch ``deepspeed`` normally via
-shell from a cell. For example, to use ``run_translation.py`` you would launch it with:
-
-.. code-block::
-
-    !git clone https://github.com/huggingface/transformers
-    !cd transformers; deepspeed examples/pytorch/translation/run_translation.py ...
-
-or with ``%%bash`` magic, where you can write a multi-line code for the shell program to run:
-
-.. code-block::
-
-    %%bash
-
-    git clone https://github.com/huggingface/transformers
-    cd transformers
-    deepspeed examples/pytorch/translation/run_translation.py ...
-
-In such case you don't need any of the code presented at the beginning of this section.
-
-Note: While the ``%%bash`` magic is neat, it currently buffers the output, so you won't see the logs until the process
-completes.
-
-
-
-
-.. _deepspeed-config:
-
-Configuration
-=======================================================================================================================
-
-For the complete guide to the DeepSpeed configuration options that can be used in its configuration file please refer
-to the `following documentation <https://www.deepspeed.ai/docs/config-json/>`__.
-
-You can find dozens of DeepSpeed configuration examples that address various practical needs in `the DeepSpeedExamples
-repo <https://github.com/microsoft/DeepSpeedExamples>`__:
-
-.. code-block:: bash
-
-    git clone https://github.com/microsoft/DeepSpeedExamples
-    cd DeepSpeedExamples
-    find . -name '*json'
-
-Continuing the code from above, let's say you're looking to configure the Lamb optimizer. So you can search through the
-example ``.json`` files with:
-
-.. code-block:: bash
-
-    grep -i Lamb $(find . -name '*json')
-
-Some more examples are to be found in the `main repo <https://github.com/microsoft/DeepSpeed>`__ as well.
-
-When using DeepSpeed you always need to supply a DeepSpeed configuration file, yet some configuration parameters have
-to be configured via the command line. You will find the nuances in the rest of this guide.
-
-To get an idea of what DeepSpeed configuration file looks like, here is one that activates ZeRO stage 2 features,
-including optimizer states cpu offload, uses ``AdamW`` optimizer and ``WarmupLR`` scheduler and will enable mixed
-precision training if ``--fp16`` is passed:
-
-.. code-block:: json
-
-    {
-        "fp16": {
-            "enabled": "auto",
-            "loss_scale": 0,
-            "loss_scale_window": 1000,
-            "initial_scale_power": 16,
-            "hysteresis": 2,
-            "min_loss_scale": 1
-        },
-
-        "optimizer": {
-            "type": "AdamW",
-            "params": {
-                "lr": "auto",
-                "betas": "auto",
-                "eps": "auto",
-                "weight_decay": "auto"
-            }
-        },
-
-        "scheduler": {
-            "type": "WarmupLR",
-            "params": {
-                "warmup_min_lr": "auto",
-                "warmup_max_lr": "auto",
-                "warmup_num_steps": "auto"
-            }
-        },
-
-        "zero_optimization": {
-            "stage": 2,
-            "offload_optimizer": {
-                "device": "cpu",
-                "pin_memory": true
-            },
-            "allgather_partitions": true,
-            "allgather_bucket_size": 2e8,
-            "overlap_comm": true,
-            "reduce_scatter": true,
-            "reduce_bucket_size": 2e8,
-            "contiguous_gradients": true
-        },
-
-        "gradient_accumulation_steps": "auto",
-        "gradient_clipping": "auto",
-        "train_batch_size": "auto",
-        "train_micro_batch_size_per_gpu": "auto"
-    }
-
-When you execute the program, DeepSpeed will log the configuration it received from the :class:`~transformers.Trainer`
-to the console, so you can see exactly what was the final configuration passed to it.
-
-
-
-.. _deepspeed-config-passing:
-
-Passing Configuration
-=======================================================================================================================
-
-As discussed in this document, the DeepSpeed configuration is normally passed as a path to a json file, but if you're
-not using the command line interface to configure the training, and instead instantiate the
-:class:`~transformers.Trainer` via :class:`~transformers.TrainingArguments` then for the ``deepspeed`` argument you can
-pass a nested ``dict``. This allows you to create the configuration on the fly and doesn't require you to write it to
-the file system before passing it to :class:`~transformers.TrainingArguments`.
-
-To summarize you can do:
-
-.. code-block:: python
-
-    TrainingArguments(..., deepspeed="/path/to/ds_config.json")
-
-or:
-
-.. code-block:: python
-
-    ds_config_dict = dict(scheduler=scheduler_params, optimizer=optimizer_params)
-    TrainingArguments(..., deepspeed=ds_config_dict)
-
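As a rough sketch of the dict-based approach, the configuration can be assembled in plain Python before it is handed to ``TrainingArguments``. The helper name ``make_ds_config`` and its arguments are illustrative assumptions, not part of any library; the keys are taken from the ZeRO examples in this document.

```python
import json


def make_ds_config(stage=2, offload_optimizer_to_cpu=True):
    """Build a minimal DeepSpeed config dict (hypothetical helper)."""
    zero = {"stage": stage, "overlap_comm": True, "contiguous_gradients": True}
    if offload_optimizer_to_cpu:
        zero["offload_optimizer"] = {"device": "cpu", "pin_memory": True}
    return {
        "zero_optimization": zero,
        # "auto" lets the Trainer fill in values shared with its own arguments.
        "gradient_accumulation_steps": "auto",
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
    }


ds_config_dict = make_ds_config()
# The dict can be inspected or dumped before use:
print(json.dumps(ds_config_dict, indent=4))
# and then passed on as: TrainingArguments(..., deepspeed=ds_config_dict)
```

Building the dict in code makes it easy to vary a single setting, such as the ZeRO stage, between experiments without keeping several JSON files in sync.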
-
-
-.. _deepspeed-config-shared:
-
-Shared Configuration
-=======================================================================================================================
-
-
-.. warning::
-
-    This section is a must-read
-
-Some configuration values are required by both the :class:`~transformers.Trainer` and DeepSpeed to function correctly,
-therefore, to prevent conflicting definitions, which could lead to hard to detect errors, we chose to configure those
-via the :class:`~transformers.Trainer` command line arguments.
-
-Additionally, some configuration values are derived automatically based on the model's configuration, so instead of
-remembering to manually adjust multiple values, it's best to let the :class:`~transformers.Trainer` do the majority
-of configuration for you.
-
-Therefore, in the rest of this guide you will find a special configuration value: ``auto``, which when set will be
-automatically replaced with the correct or most efficient value. Please feel free to choose to ignore this
-recommendation and set the values explicitly, in which case be very careful that your
-:class:`~transformers.Trainer` arguments and DeepSpeed configurations agree. For example, are you using the same
-learning rate, or batch size, or gradient accumulation settings? If these mismatch, the training may fail in ways that
-are very difficult to detect. You have been warned.
-
-There are multiple other values that are specific to DeepSpeed-only and those you will have to set manually to suit
-your needs.
-
-In your own programs, you can also use the following approach if you'd like to modify the DeepSpeed config as a master
-and configure :class:`~transformers.TrainingArguments` based on that. The steps are:
-
-1. Create or load the DeepSpeed configuration to be used as a master configuration
-2. Create the :class:`~transformers.TrainingArguments` object based on these values
-
-Do note that some values, such as :obj:`scheduler.params.total_num_steps` are calculated by
-:class:`~transformers.Trainer` during ``train``, but you can of course do the math yourself.
-
-.. _deepspeed-zero:
-
-ZeRO
-=======================================================================================================================
-
-`Zero Redundancy Optimizer (ZeRO) <https://www.deepspeed.ai/tutorials/zero/>`__ is the workhorse of DeepSpeed. It
-supports 3 different levels (stages) of optimization. The first one is not very interesting for scalability purposes,
-therefore this document focuses on stages 2 and 3. Stage 3 is further improved by the latest addition of ZeRO-Infinity.
-You will find more in-depth information in the DeepSpeed documentation.
-
-The ``zero_optimization`` section of the configuration file is the most important part (`docs
-<https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training>`__), since that is where you define
-which ZeRO stages you want to enable and how to configure them. You will find the explanation for each parameter in the
-DeepSpeed docs.
-
-This section has to be configured exclusively via DeepSpeed configuration - the :class:`~transformers.Trainer` provides
-no equivalent command line arguments.
-
-Note: currently DeepSpeed doesn't validate parameter names, so if you misspell any, it'll use the default setting for
-the parameter that got misspelled. You can watch the DeepSpeed engine start up log messages to see what values it is
-going to use.
-
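Since misspelled names are silently ignored, a small pre-flight check can catch typos before launching. The sketch below is an assumption-laden illustration: ``find_unknown_keys`` is a hypothetical helper, and ``KNOWN_ZERO_KEYS`` only lists the ``zero_optimization`` keys that appear in this document, not the full set DeepSpeed accepts.

```python
import difflib

# Keys used in the zero_optimization examples of this document (not exhaustive).
KNOWN_ZERO_KEYS = {
    "stage", "offload_optimizer", "offload_param", "allgather_partitions",
    "allgather_bucket_size", "overlap_comm", "reduce_scatter",
    "reduce_bucket_size", "contiguous_gradients", "sub_group_size",
    "stage3_prefetch_bucket_size", "stage3_param_persistence_threshold",
    "stage3_max_live_parameters", "stage3_max_reuse_distance",
    "stage3_gather_fp16_weights_on_model_save",
}


def find_unknown_keys(zero_section):
    """Map each unrecognized key to its closest known spelling, if any."""
    report = {}
    for key in zero_section:
        if key not in KNOWN_ZERO_KEYS:
            report[key] = difflib.get_close_matches(key, sorted(KNOWN_ZERO_KEYS), n=1)
    return report


print(find_unknown_keys({"stage": 2, "overlap_com": True}))
```

An empty result means every key matched the list above; a non-empty one points at names worth double-checking against the DeepSpeed documentation.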
-
-
-.. _deepspeed-zero2-config:
-
-ZeRO-2 Config
-+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
-The following is an example configuration for ZeRO stage 2:
-
-.. code-block:: json
-
-    {
-        "zero_optimization": {
-            "stage": 2,
-            "offload_optimizer": {
-                "device": "cpu",
-                "pin_memory": true
-            },
-            "allgather_partitions": true,
-            "allgather_bucket_size": 5e8,
-            "overlap_comm": true,
-            "reduce_scatter": true,
-            "reduce_bucket_size": 5e8,
-            "contiguous_gradients": true
-        }
-    }
-
-**Performance tuning:**
-
-- enabling ``offload_optimizer`` should reduce GPU RAM usage (it requires ``"stage": 2``)
-- ``"overlap_comm": true`` trades off increased GPU RAM usage to lower all-reduce latency. ``overlap_comm`` uses 4.5x
-  the ``allgather_bucket_size`` and ``reduce_bucket_size`` values. So if they are set to 5e8, this requires a 9GB
-  footprint (``5e8 x 2Bytes x 2 x 4.5``). Therefore, if you have a GPU with 8GB or less RAM, to avoid getting
-  OOM-errors you will need to reduce those parameters to about ``2e8``, which would require 3.6GB. You will want to do
-  the same on larger capacity GPUs as well, if you're starting to hit OOM.
-- when reducing these buffers you're trading communication speed for more available GPU RAM. The smaller the buffer
-  size, the slower the communication, and the more GPU RAM will be available to other tasks. So if a bigger batch size
-  is important, getting a slightly slower training time could be a good trade.
-
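The footprint arithmetic from the bullets above can be reproduced with a short calculation. This is a sketch of the formula as stated in this section (``bucket_size x 2 bytes x 2 x 4.5``); the function name is a hypothetical choice and the result is an estimate, not a measurement.

```python
def overlap_comm_footprint_bytes(bucket_size, bytes_per_element=2, overlap_factor=4.5):
    """Estimate the GPU memory used by the communication buckets.

    Follows the formula in the text: bucket_size x 2 bytes x 2 x 4.5,
    with the same value assumed for allgather_bucket_size and reduce_bucket_size.
    """
    return bucket_size * bytes_per_element * 2 * overlap_factor


for size in (5e8, 2e8):
    gigabytes = overlap_comm_footprint_bytes(size) / 1e9
    print(f"bucket size {size:.0e}: about {gigabytes:.1f} GB")
```

With ``5e8`` this gives the 9GB mentioned above, and with ``2e8`` the 3.6GB figure.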
-
-
-.. _deepspeed-zero3-config:
-
-ZeRO-3 Config
-+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
-The following is an example configuration for ZeRO stage 3:
-
-.. code-block:: json
-
-    {
-        "zero_optimization": {
-            "stage": 3,
-            "offload_optimizer": {
-                "device": "cpu",
-                "pin_memory": true
-            },
-            "offload_param": {
-                "device": "cpu",
-                "pin_memory": true
-            },
-            "overlap_comm": true,
-            "contiguous_gradients": true,
-            "sub_group_size": 1e9,
-            "reduce_bucket_size": "auto",
-            "stage3_prefetch_bucket_size": "auto",
-            "stage3_param_persistence_threshold": "auto",
-            "stage3_max_live_parameters": 1e9,
-            "stage3_max_reuse_distance": 1e9,
-            "stage3_gather_fp16_weights_on_model_save": true
-        }
-    }
-
-If you are getting OOMs because your model or activations don't fit into the GPU memory and you have unutilized CPU
-memory, offloading the optimizer states and parameters to CPU memory with ``"device": "cpu"`` may solve this limitation.
-If you don't want to offload to CPU memory, use ``none`` instead of ``cpu`` for the ``device`` entry. Offloading to
-NVMe is discussed further down.
-
-Pinned memory is enabled with ``pin_memory`` set to ``true``. This feature can improve the throughput at the cost of
-making less memory available to other processes. Pinned memory is set aside for the specific process that requested it
-and it's typically accessed much faster than normal CPU memory.
-
-**Performance tuning:**
-
-- ``stage3_max_live_parameters``: ``1e9``
-- ``stage3_max_reuse_distance``: ``1e9``
-
-If hitting OOM, reduce ``stage3_max_live_parameters`` and ``stage3_max_reuse_distance``. They should have minimal impact
-on performance unless you are doing activation checkpointing. ``1e9`` would consume ~2GB. The memory is shared by
-``stage3_max_live_parameters`` and ``stage3_max_reuse_distance``, so it's not additive, it's just 2GB total.
-
-``stage3_max_live_parameters`` is the upper limit on how many full parameters you want to keep on the GPU at any given
-time. "reuse distance" is a metric we are using to figure out when a parameter will be used again in the future, and we
-use the ``stage3_max_reuse_distance`` to decide whether to throw away the parameter or to keep it. If a parameter is
-going to be used again in the near future (less than ``stage3_max_reuse_distance``) then we keep it to reduce
-communication overhead. This is super helpful when you have activation checkpointing enabled, where we do a forward
-recompute and backward passes at a single layer granularity and want to keep the parameter in the forward recompute
-till the backward.
-
-The following configuration values depend on the model's hidden size:
-
-- ``reduce_bucket_size``: ``hidden_size*hidden_size``
-- ``stage3_prefetch_bucket_size``: ``0.9 * hidden_size * hidden_size``
-- ``stage3_param_persistence_threshold``: ``10 * hidden_size``
-
-therefore set these values to ``auto`` and the :class:`~transformers.Trainer` will automatically assign the recommended
-values. But, of course, feel free to set these explicitly as well.
-
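For readers who prefer explicit numbers over ``auto``, the three formulas listed above can be evaluated directly. The helper below is a hypothetical sketch that applies them verbatim; whether the resulting values are rounded to integers before being passed to DeepSpeed is not covered here.

```python
def zero3_hidden_size_values(hidden_size):
    """Apply the hidden-size formulas from the text (hypothetical helper)."""
    return {
        "reduce_bucket_size": hidden_size * hidden_size,
        "stage3_prefetch_bucket_size": 0.9 * hidden_size * hidden_size,
        "stage3_param_persistence_threshold": 10 * hidden_size,
    }


# Example with an assumed hidden size of 1024.
for name, value in zero3_hidden_size_values(1024).items():
    print(f"{name}: {value}")
```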
-``stage3_gather_fp16_weights_on_model_save`` enables fp16 weights consolidation when the model gets saved. With large
-models and multiple GPUs this is an expensive operation both in terms of memory and speed. It's currently required if
-you plan to resume the training. Watch out for future updates that will remove this limitation and make things more
-flexible.
-
-If you're migrating from ZeRO-2 configuration note that ``allgather_partitions``, ``allgather_bucket_size`` and
-``reduce_scatter`` configuration parameters are not used in ZeRO-3. If you keep these in the config file they will just
-be ignored.
-
-- ``sub_group_size``: ``1e9``
-
-``sub_group_size`` controls the granularity in which parameters are updated during optimizer steps. Parameters are
-grouped into buckets of ``sub_group_size`` and each bucket is updated one at a time. When used with NVMe offload in
-ZeRO-Infinity, ``sub_group_size`` therefore controls the granularity in which model states are moved in and out of CPU
-memory from NVMe during the optimizer step. This prevents running out of CPU memory for extremely large models.
-
-You can leave ``sub_group_size`` at its default value of ``1e9`` when not using NVMe offload. You may want to change its
-default value in the following cases:
-
-1. Running into OOM during optimizer step: Reduce ``sub_group_size`` to reduce memory utilization of temporary buffers
-2. Optimizer Step is taking a long time: Increase ``sub_group_size`` to improve bandwidth utilization as a result of
-   the increased data buffers.
-
-
-.. _deepspeed-nvme:
-
-NVMe Support
-=======================================================================================================================
-
-ZeRO-Infinity allows for training incredibly large models by extending GPU and CPU memory with NVMe memory. Thanks to
-smart partitioning and tiling algorithms each GPU needs to send and receive very small amounts of data during
-offloading so modern NVMe proved to be fit to allow for an even larger total memory pool available to your training
-process. ZeRO-Infinity requires ZeRO-3 enabled.
-
-The following configuration example enables NVMe to offload both optimizer states and the params:
-
-.. code-block:: json
-
-    {
-        "zero_optimization": {
-            "stage": 3,
-            "offload_optimizer": {
-                "device": "nvme",
-                "nvme_path": "/local_nvme",
-                "pin_memory": true,
-                "buffer_count": 4,
-                "fast_init": false
-            },
-            "offload_param": {
-                "device": "nvme",
-                "nvme_path": "/local_nvme",
-                "pin_memory": true,
-                "buffer_count": 5,
-                "buffer_size": 1e8,
-                "max_in_cpu": 1e9
-            },
-            "aio": {
-                "block_size": 262144,
-                "queue_depth": 32,
-                "thread_count": 1,
-                "single_submit": false,
-                "overlap_events": true
-            },
-            "overlap_comm": true,
-            "contiguous_gradients": true,
-            "sub_group_size": 1e9,
-            "reduce_bucket_size": "auto",
-            "stage3_prefetch_bucket_size": "auto",
-            "stage3_param_persistence_threshold": "auto",
-            "stage3_max_live_parameters": 1e9,
-            "stage3_max_reuse_distance": 1e9,
-            "stage3_gather_fp16_weights_on_model_save": true
-        }
-    }
-
-You can choose to offload both optimizer states and params to NVMe, or just one of them or none. For example, if you
-have copious amounts of CPU memory available, by all means offload to CPU memory only as it'd be faster (hint:
-``"device": "cpu"``).
-
-Here is the full documentation for offloading `optimizer states
-<https://www.deepspeed.ai/docs/config-json/#optimizer-offloading>`__ and `parameters
-<https://www.deepspeed.ai/docs/config-json/#parameter-offloading>`__.
-
-Make sure that your ``nvme_path`` is actually an NVMe. It will work with a normal hard drive or SSD, but it'll be much
-slower. The fast scalable training was designed with modern NVMe transfer speeds in mind (as of this
-writing one can have ~3.5GB/s read, ~3GB/s write peak speeds).
-
-In order to figure out the optimal ``aio`` configuration block you must run a benchmark on your target setup, as
-`explained here <https://github.com/microsoft/DeepSpeed/issues/998>`__.
-
-
-
-.. _deepspeed-zero2-zero3-performance:
-
-ZeRO-2 vs ZeRO-3 Performance
-+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
-ZeRO-3 is likely to be slower than ZeRO-2 if everything else is configured the same because the former has to gather
-model weights in addition to what ZeRO-2 does. If ZeRO-2 meets your needs and you don't need to scale beyond a few GPUs
-then you may choose to stick to it. It's important to understand that ZeRO-3 enables a much higher scalability capacity
-at a cost of speed.
-
-It's possible to adjust ZeRO-3 configuration to make it perform closer to ZeRO-2:
-
-- set ``stage3_param_persistence_threshold`` to a very large number - larger than the largest parameter, e.g., ``6 *
-  hidden_size * hidden_size``. This will keep the parameters on the GPUs.
-- turn off ``offload_param`` since ZeRO-2 doesn't have that option.
-
-The performance will likely improve significantly with just ``offload_param`` turned off, even if you don't change
-``stage3_param_persistence_threshold``. Of course, these changes will impact the size of the model you can train. So
-these help you to trade scalability for speed depending on your needs.
-
-
-
-.. _deepspeed-zero2-example:
-
-ZeRO-2 Example
-+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
-Here is a full ZeRO-2 auto-configuration file ``ds_config_zero2.json``:
-
-.. code-block:: json
-
-    {
-        "fp16": {
-            "enabled": "auto",
-            "loss_scale": 0,
-            "loss_scale_window": 1000,
-            "initial_scale_power": 16,
-            "hysteresis": 2,
-            "min_loss_scale": 1
-        },
-
-        "optimizer": {
-            "type": "AdamW",
-            "params": {
-                "lr": "auto",
-                "betas": "auto",
-                "eps": "auto",
-                "weight_decay": "auto"
-            }
-        },
-
-        "scheduler": {
-            "type": "WarmupLR",
-            "params": {
-                "warmup_min_lr": "auto",
-                "warmup_max_lr": "auto",
-                "warmup_num_steps": "auto"
-            }
-        },
-
-        "zero_optimization": {
-            "stage": 2,
-            "offload_optimizer": {
-                "device": "cpu",
-                "pin_memory": true
-            },
-            "allgather_partitions": true,
-            "allgather_bucket_size": 2e8,
-            "overlap_comm": true,
-            "reduce_scatter": true,
-            "reduce_bucket_size": 2e8,
-            "contiguous_gradients": true
-        },
-
-        "gradient_accumulation_steps": "auto",
-        "gradient_clipping": "auto",
-        "steps_per_print": 2000,
-        "train_batch_size": "auto",
-        "train_micro_batch_size_per_gpu": "auto",
-        "wall_clock_breakdown": false
-    }
-
-
-Here is a full ZeRO-2 all-enabled manually set configuration file. It is here mainly for you to see what the typical
-values look like, but we highly recommend using the one with multiple ``auto`` settings in it.
-
-.. code-block:: json
-
-    {
-        "fp16": {
-            "enabled": true,
-            "loss_scale": 0,
-            "loss_scale_window": 1000,
-            "initial_scale_power": 16,
-            "hysteresis": 2,
-            "min_loss_scale": 1
-        },
-
-        "optimizer": {
-            "type": "AdamW",
-            "params": {
-                "lr": 3e-5,
-                "betas": [0.8, 0.999],
-                "eps": 1e-8,
-                "weight_decay": 3e-7
-            }
-        },
-
-        "scheduler": {
-            "type": "WarmupLR",
-            "params": {
-                "warmup_min_lr": 0,
-                "warmup_max_lr": 3e-5,
-                "warmup_num_steps": 500
-            }
-        },
-
-        "zero_optimization": {
-            "stage": 2,
-            "offload_optimizer": {
-                "device": "cpu",
-                "pin_memory": true
-            },
-            "allgather_partitions": true,
-            "allgather_bucket_size": 2e8,
-            "overlap_comm": true,
-            "reduce_scatter": true,
-            "reduce_bucket_size": 2e8,
-            "contiguous_gradients": true
-        },
-
-        "steps_per_print": 2000,
-        "wall_clock_breakdown": false
-    }
-
-
-
-.. _deepspeed-zero3-example:
-
-ZeRO-3 Example
-+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
-Here is a full ZeRO-3 auto-configuration file ``ds_config_zero3.json``:
-
-
-.. code-block:: json
-
-    {
-        "fp16": {
-            "enabled": "auto",
-            "loss_scale": 0,
-            "loss_scale_window": 1000,
-            "initial_scale_power": 16,
-            "hysteresis": 2,
-            "min_loss_scale": 1
-        },
-
-        "optimizer": {
-            "type": "AdamW",
-            "params": {
-                "lr": "auto",
-                "betas": "auto",
-                "eps": "auto",
-                "weight_decay": "auto"
-            }
-        },
-
-        "scheduler": {
-            "type": "WarmupLR",
-            "params": {
-                "warmup_min_lr": "auto",
-                "warmup_max_lr": "auto",
-                "warmup_num_steps": "auto"
-            }
-        },
-
-        "zero_optimization": {
-            "stage": 3,
-            "offload_optimizer": {
-                "device": "cpu",
-                "pin_memory": true
-            },
-            "offload_param": {
-                "device": "cpu",
-                "pin_memory": true
-            },
-            "overlap_comm": true,
-            "contiguous_gradients": true,
-            "sub_group_size": 1e9,
-            "reduce_bucket_size": "auto",
-            "stage3_prefetch_bucket_size": "auto",
-            "stage3_param_persistence_threshold": "auto",
-            "stage3_max_live_parameters": 1e9,
-            "stage3_max_reuse_distance": 1e9,
-            "stage3_gather_fp16_weights_on_model_save": true
-        },
-
-        "gradient_accumulation_steps": "auto",
-        "gradient_clipping": "auto",
-        "steps_per_print": 2000,
-        "train_batch_size": "auto",
-        "train_micro_batch_size_per_gpu": "auto",
-        "wall_clock_breakdown": false
-    }
-
-Here is a full ZeRO-3 all-enabled manually set configuration file. It is here mainly for you to see what the typical
-values look like, but we highly recommend using the one with multiple ``auto`` settings in it.
-
-.. code-block:: json
-
-    {
-        "fp16": {
-            "enabled": true,
-            "loss_scale": 0,
-            "loss_scale_window": 1000,
-            "initial_scale_power": 16,
-            "hysteresis": 2,
-            "min_loss_scale": 1
-        },
-
-        "optimizer": {
-            "type": "AdamW",
-            "params": {
-                "lr": 3e-5,
-                "betas": [0.8, 0.999],
-                "eps": 1e-8,
-                "weight_decay": 3e-7
-            }
-        },
-
-        "scheduler": {
-            "type": "WarmupLR",
-            "params": {
-                "warmup_min_lr": 0,
-                "warmup_max_lr": 3e-5,
-                "warmup_num_steps": 500
-            }
-        },
-
-        "zero_optimization": {
-            "stage": 3,
-            "offload_optimizer": {
-                "device": "cpu",
-                "pin_memory": true
-            },
-            "offload_param": {
-                "device": "cpu",
-                "pin_memory": true
-            },
-            "overlap_comm": true,
-            "contiguous_gradients": true,
-            "sub_group_size": 1e9,
-            "reduce_bucket_size": 1e6,
-            "stage3_prefetch_bucket_size": 0.94e6,
-            "stage3_param_persistence_threshold": 1e4,
-            "stage3_max_live_parameters": 1e9,
-            "stage3_max_reuse_distance": 1e9,
-            "stage3_gather_fp16_weights_on_model_save": true
-        },
-
-        "steps_per_print": 2000,
-        "wall_clock_breakdown": false
-    }
-
-
-Optimizer and Scheduler
-=======================================================================================================================
-
-As long as you don't enable ``offload_optimizer`` you can mix and match DeepSpeed and HuggingFace schedulers and
-optimizers, with the exception of using the combination of HuggingFace scheduler and DeepSpeed optimizer:
-
-+--------------+--------------+--------------+
-| Combos       | HF Scheduler | DS Scheduler |
-+--------------+--------------+--------------+
-| HF Optimizer | Yes          | Yes          |
-+--------------+--------------+--------------+
-| DS Optimizer | No           | Yes          |
-+--------------+--------------+--------------+
-
-It is possible to use a non-DeepSpeed optimizer when ``offload_optimizer`` is enabled, as long as it has both CPU and
-GPU implementations (except LAMB).
-
-
-
-
-.. _deepspeed-optimizer:
-
-Optimizer
-+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
-
-DeepSpeed's main optimizers are Adam, AdamW, OneBitAdam, and Lamb. These have been thoroughly tested with ZeRO and are
-thus recommended. DeepSpeed can, however, import other optimizers from ``torch``. The full documentation is `here
-<https://www.deepspeed.ai/docs/config-json/#optimizer-parameters>`__.
-
-If you don't configure the ``optimizer`` entry in the configuration file, the :class:`~transformers.Trainer` will
-automatically set it to ``AdamW`` and will use the supplied values or the defaults for the following command line
-arguments: ``--learning_rate``, ``--adam_beta1``, ``--adam_beta2``, ``--adam_epsilon`` and ``--weight_decay``.
-
-Here is an example of the auto-configured ``optimizer`` entry for ``AdamW``:
-
-.. code-block:: json
-
-    {
-       "optimizer": {
-           "type": "AdamW",
-           "params": {
-             "lr": "auto",
-             "betas": "auto",
-             "eps": "auto",
-             "weight_decay": "auto"
-           }
-       }
-    }
-
-
-Note that the command line arguments will set the values in the configuration file. This is so that there is one
-definitive source of the values and to avoid hard-to-find errors when, for example, the learning rate is set to
-different values in different places. The command line takes precedence. The values that get overridden are:
-
-- ``lr`` with the value of ``--learning_rate``
-- ``betas`` with the value of ``--adam_beta1 --adam_beta2``
-- ``eps`` with the value of ``--adam_epsilon``
-- ``weight_decay`` with the value of ``--weight_decay``
-
-Therefore please remember to tune the shared hyperparameters on the command line.
-
-You can also set the values explicitly:
-
-.. code-block:: json
-
-    {
-       "optimizer": {
-           "type": "AdamW",
-           "params": {
-             "lr": 0.001,
-             "betas": [0.8, 0.999],
-             "eps": 1e-8,
-             "weight_decay": 3e-7
-           }
-       }
-    }
-
-But then you're on your own synchronizing the :class:`~transformers.Trainer` command line arguments and the DeepSpeed
-configuration.
-
-If you want to use another optimizer which is not listed above, you will have to add the following to the top-level
-configuration:
-
-.. code-block:: json
-
-    {
-       "zero_allow_untested_optimizer": true
-    }
-
-Similarly to ``AdamW``, you can configure other officially supported optimizers. Just remember that they may have
-different config values, e.g. for Adam you will want ``weight_decay`` around ``0.01``.
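-
-As a sketch of the point above, an ``Adam`` entry could mirror the ``AdamW`` one with only the ``type`` changed. This
-assumes the ``auto`` placeholders are filled in the same way as for ``AdamW``, in which case you would pass
-``--weight_decay 0.01`` on the command line:
-
-.. code-block:: json
-
-    {
-       "optimizer": {
-           "type": "Adam",
-           "params": {
-             "lr": "auto",
-             "betas": "auto",
-             "eps": "auto",
-             "weight_decay": "auto"
-           }
-       }
-    }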
-
-
-
-.. _deepspeed-scheduler:
-
-Scheduler
-+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
-DeepSpeed supports ``LRRangeTest``, ``OneCycle``, ``WarmupLR`` and ``WarmupDecayLR`` learning rate schedulers. The full
-documentation is `here <https://www.deepspeed.ai/docs/config-json/#scheduler-parameters>`__.
-
-Here is where the schedulers overlap between 🤗 Transformers and DeepSpeed:
-
-* ``WarmupLR`` via ``--lr_scheduler_type constant_with_warmup``
-* ``WarmupDecayLR`` via ``--lr_scheduler_type linear``. This is also the default value for ``--lr_scheduler_type``,
-  therefore, if you don't configure the scheduler, this is the scheduler that will get configured by default.
-
-If you don't configure the ``scheduler`` entry in the configuration file, the :class:`~transformers.Trainer` will use
-the values of ``--lr_scheduler_type``, ``--learning_rate`` and ``--warmup_steps`` or ``--warmup_ratio`` to configure a
-🤗 Transformers version of it.
-
-Here is an example of the auto-configured ``scheduler`` entry for ``WarmupLR``:
-
-.. code-block:: json
-
-    {
-       "scheduler": {
-             "type": "WarmupLR",
-             "params": {
-                 "warmup_min_lr": "auto",
-                 "warmup_max_lr": "auto",
-                 "warmup_num_steps": "auto"
-             }
-         }
-    }
-
-Since ``"auto"`` is used, the :class:`~transformers.Trainer` arguments will set the correct values in the
-configuration file. This is so that there is one definitive source of the values and to avoid hard-to-find errors
-when, for example, the learning rate is set to different values in different places. The command line takes
-precedence. The values that get set are:
-
-- ``warmup_min_lr`` with the value of ``0``.
-- ``warmup_max_lr`` with the value of ``--learning_rate``.
-- ``warmup_num_steps`` with the value of ``--warmup_steps`` if provided. Otherwise it will use ``--warmup_ratio``
-  multiplied by the number of training steps and rounded up.
-- ``total_num_steps`` with either the value of ``--max_steps`` or if it is not provided, derived automatically at run
-  time based on the environment and the size of the dataset and other command line arguments (needed for
-  ``WarmupDecayLR``).
-
-You can, of course, take over any or all of the configuration values and set those yourself:
-
-.. code-block:: json
-
-    {
-       "scheduler": {
-             "type": "WarmupLR",
-             "params": {
-                 "warmup_min_lr": 0,
-                 "warmup_max_lr": 0.001,
-                 "warmup_num_steps": 1000
-             }
-         }
-    }
-
-But then you're on your own synchronizing the :class:`~transformers.Trainer` command line arguments and the DeepSpeed
-configuration.
-
-For example, for ``WarmupDecayLR``, you can use the following entry:
-
-.. code-block:: json
-
-    {
-       "scheduler": {
-             "type": "WarmupDecayLR",
-             "params": {
-                 "last_batch_iteration": -1,
-                 "total_num_steps": "auto",
-                 "warmup_min_lr": "auto",
-                 "warmup_max_lr": "auto",
-                 "warmup_num_steps": "auto"
-             }
-         }
-    }
-
-and ``total_num_steps``, ``warmup_min_lr``, ``warmup_max_lr`` and ``warmup_num_steps`` will be set at loading time.
-
-
-
-
-.. _deepspeed-fp32:
-
-fp32 Precision
-=======================================================================================================================
-
-DeepSpeed supports both full fp32 and fp16 mixed precision.
-
-Because of the much reduced memory needs and faster speed one gets with the fp16 mixed precision, the only time you
-will want to not use it is when the model you're using doesn't behave well under this training mode. Typically this
-happens when the model wasn't pretrained in the fp16 mixed precision (e.g. often this happens with bf16-pretrained
-models). Such models may overflow or underflow leading to ``NaN`` loss. If this is your case then you will want to use
-the full fp32 mode, by explicitly disabling the otherwise default fp16 mixed precision mode with:
-
-.. code-block:: json
-
-    {
-        "fp16": {
-            "enabled": false
-        }
-    }
-
-If you're using an Ampere-architecture based GPU, PyTorch version 1.7 and higher will automatically switch to using
-the much more efficient tf32 format for some operations, but the results will still be in fp32. For details and
-benchmarks, please see `TensorFloat-32(TF32) on Ampere devices
-<https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices>`__. The document includes
-instructions on how to disable this automatic conversion if for some reason you prefer not to use it.
-
-
-
-
-.. _deepspeed-amp:
-
-Automatic Mixed Precision
-=======================================================================================================================
-
-You can use automatic mixed precision in either a PyTorch AMP-like way or an apex-like way:
-
-To configure pytorch AMP-like mode set:
-
-.. code-block:: json
-
-    {
-        "fp16": {
-            "enabled": "auto",
-            "loss_scale": 0,
-            "loss_scale_window": 1000,
-            "initial_scale_power": 16,
-            "hysteresis": 2,
-            "min_loss_scale": 1
-        }
-    }
-
-and the :class:`~transformers.Trainer` will automatically enable or disable it based on the value of
-``args.fp16_backend``. The rest of the config values are up to you.
-
-This mode gets enabled when ``--fp16 --fp16_backend amp`` command line args are passed.
-
-You can also enable/disable this mode explicitly:
-
-.. code-block:: json
-
-    {
-        "fp16": {
-            "enabled": true,
-            "loss_scale": 0,
-            "loss_scale_window": 1000,
-            "initial_scale_power": 16,
-            "hysteresis": 2,
-            "min_loss_scale": 1
-        }
-    }
-
-But then you're on your own synchronizing the :class:`~transformers.Trainer` command line arguments and the DeepSpeed
-configuration.
-
-Here is the `documentation <https://www.deepspeed.ai/docs/config-json/#fp16-training-options>`__.
-
-To configure apex AMP-like mode set:
-
-.. code-block:: json
-
-    {
-        "amp": {
-            "enabled": "auto",
-            "opt_level": "auto"
-        }
-    }
-
-and the :class:`~transformers.Trainer` will automatically configure it based on the values of ``args.fp16_backend`` and
-``args.fp16_opt_level``.
-
-This mode gets enabled when ``--fp16 --fp16_backend apex --fp16_opt_level O1`` command line args are passed.
-
-You can also configure this mode explicitly:
-
-.. code-block:: json
-
-    {
-        "amp": {
-            "enabled": true,
-            "opt_level": "O1"
-        }
-    }
-
-But then you're on your own synchronizing the :class:`~transformers.Trainer` command line arguments and the DeepSpeed
-configuration.
-
-Here is the `documentation
-<https://www.deepspeed.ai/docs/config-json/#automatic-mixed-precision-amp-training-options>`__.
-
-
-
-.. _deepspeed-bs:
-
-Batch Size
-=======================================================================================================================
-
-To configure batch size, use:
-
-.. code-block:: json
-
-    {
-        "train_batch_size": "auto",
-        "train_micro_batch_size_per_gpu": "auto"
-    }
-
-and the :class:`~transformers.Trainer` will automatically set ``train_micro_batch_size_per_gpu`` to the value of
-``args.per_device_train_batch_size`` and ``train_batch_size`` to ``args.world_size * args.per_device_train_batch_size *
-args.gradient_accumulation_steps``.
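-
-For example, assuming a hypothetical run on 2 GPUs with ``--per_device_train_batch_size 4`` and
-``--gradient_accumulation_steps 3``, the ``auto`` values would resolve to ``2 * 4 * 3 = 24`` for ``train_batch_size``,
-i.e. the equivalent of:
-
-.. code-block:: json
-
-    {
-        "train_batch_size": 24,
-        "train_micro_batch_size_per_gpu": 4
-    }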
-
-You can also set the values explicitly:
-
-.. code-block:: json
-
-    {
-        "train_batch_size": 12,
-        "train_micro_batch_size_per_gpu": 4
-    }
-
-But then you're on your own synchronizing the :class:`~transformers.Trainer` command line arguments and the DeepSpeed
-configuration.
-
-
-
-.. _deepspeed-grad-acc:
-
-Gradient Accumulation
-=======================================================================================================================
-
-To configure gradient accumulation set:
-
-.. code-block:: json
-
-    {
-        "gradient_accumulation_steps": "auto"
-    }
-
-and the :class:`~transformers.Trainer` will automatically set it to the value of ``args.gradient_accumulation_steps``.
-
-You can also set the value explicitly:
-
-.. code-block:: json
-
-    {
-        "gradient_accumulation_steps": 3
-    }
-
-But then you're on your own synchronizing the :class:`~transformers.Trainer` command line arguments and the DeepSpeed
-configuration.
-
-
-
-.. _deepspeed-grad-clip:
-
-Gradient Clipping
-=======================================================================================================================
-
-To configure gradient clipping set:
-
-.. code-block:: json
-
-    {
-        "gradient_clipping": "auto"
-    }
-
-and the :class:`~transformers.Trainer` will automatically set it to the value of ``args.max_grad_norm``.
-
-You can also set the value explicitly:
-
-.. code-block:: json
-
-    {
-        "gradient_clipping": 1.0
-    }
-
-But then you're on your own synchronizing the :class:`~transformers.Trainer` command line arguments and the DeepSpeed
-configuration.
-
-
-
-.. _deepspeed-weight-extraction:
-
-Getting The Model Weights Out
-=======================================================================================================================
-
-As long as you continue training and resuming using DeepSpeed you don't need to worry about anything. DeepSpeed stores
-fp32 master weights in its custom checkpoint optimizer files, which are ``global_step*/*optim_states.pt`` (this is a
-glob pattern), and are saved under the normal checkpoint.
-
-**FP16 Weights:**
-
-When a model is saved under ZeRO-2, you end up having the normal ``pytorch_model.bin`` file with the model weights, but
-they are only the fp16 version of the weights.
-
-Under ZeRO-3, things are much more complicated, since the model weights are partitioned out over multiple GPUs,
-therefore ``"stage3_gather_fp16_weights_on_model_save": true`` is required to get the ``Trainer`` to save the fp16
-version of the weights. If this setting is ``false``, ``pytorch_model.bin`` won't be created. This is because by
-default DeepSpeed's ``state_dict`` contains a placeholder and not the real weights. If we were to save this
-``state_dict`` it wouldn't be possible to load it back.
-
-
-.. code-block:: json
-
-    {
-        "zero_optimization": {
-            "stage3_gather_fp16_weights_on_model_save": true
-        }
-    }
-
-
-**FP32 Weights:**
-
-While the fp16 weights are fine for resuming training, if you finished finetuning your model and want to upload it to
-the `models hub <https://huggingface.co/models>`__ or pass it to someone else you most likely will want to get the fp32
-weights. This ideally shouldn't be done during training since it is a process that requires a lot of memory, and
-is therefore best performed offline after the training is complete. But if desired and you have plenty of free CPU
-memory it can be done in the same training script. The following sections will discuss both approaches.
-
-
-**Live FP32 Weights Recovery:**
-
-This approach may not work if your model is large and you have little free CPU memory left at the end of the training.
-
-If you have saved at least one checkpoint, and you want to use the latest one, you can do the following:
-
-.. code-block:: python
-
-    from transformers.trainer_utils import get_last_checkpoint
-    from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
-    checkpoint_dir = get_last_checkpoint(trainer.args.output_dir)
-    fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)
-
-If you're using the ``--load_best_model_at_end`` :class:`~transformers.TrainingArguments` argument (to track the best
-checkpoint), then you can finish the training by first saving the final model explicitly and then do the same as above:
-
-.. code-block:: python
-
-    import os
-    from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
-    checkpoint_dir = os.path.join(trainer.args.output_dir, "checkpoint-final")
-    trainer.deepspeed.save_checkpoint(checkpoint_dir)
-    fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)
-
-.. note::
-
-    Note that once ``load_state_dict_from_zero_checkpoint`` has been run, the ``model`` will no longer be usable in
-    the DeepSpeed context of the same application, i.e. you will need to re-initialize the DeepSpeed engine, since
-    ``model.load_state_dict(state_dict)`` will remove all the DeepSpeed magic from it. So do this only at the very end
-    of the training.
-
-Of course, you don't have to use :class:`~transformers.Trainer` and you can adjust the examples above to your own
-trainer.
-
-If for some reason you want more refinement, you can also extract the fp32 ``state_dict`` of the weights and apply
-these yourself as is shown in the following example:
-
-.. code-block:: python
-
-    from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
-    state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir) # already on cpu
-    model = model.cpu()
-    model.load_state_dict(state_dict)
-
-
-**Offline FP32 Weights Recovery:**
-
-DeepSpeed creates a special conversion script ``zero_to_fp32.py`` which it places in the top-level of the checkpoint
-folder. Using this script you can extract the weights at any point. The script is standalone and you no longer need to
-have the configuration file or a ``Trainer`` to do the extraction.
-
-Let's say your checkpoint folder looks like this:
-
-.. code-block:: bash
-
-    $ ls -l output_dir/checkpoint-1/
-    -rw-rw-r-- 1 stas stas 1.4K Mar 27 20:42 config.json
-    drwxrwxr-x 2 stas stas 4.0K Mar 25 19:52 global_step1/
-    -rw-rw-r-- 1 stas stas   12 Mar 27 13:16 latest
-    -rw-rw-r-- 1 stas stas 827K Mar 27 20:42 optimizer.pt
-    -rw-rw-r-- 1 stas stas 231M Mar 27 20:42 pytorch_model.bin
-    -rw-rw-r-- 1 stas stas  623 Mar 27 20:42 scheduler.pt
-    -rw-rw-r-- 1 stas stas 1.8K Mar 27 20:42 special_tokens_map.json
-    -rw-rw-r-- 1 stas stas 774K Mar 27 20:42 spiece.model
-    -rw-rw-r-- 1 stas stas 1.9K Mar 27 20:42 tokenizer_config.json
-    -rw-rw-r-- 1 stas stas  339 Mar 27 20:42 trainer_state.json
-    -rw-rw-r-- 1 stas stas 2.3K Mar 27 20:42 training_args.bin
-    -rwxrw-r-- 1 stas stas 5.5K Mar 27 13:16 zero_to_fp32.py*
-
-In this example there is just one DeepSpeed checkpoint sub-folder ``global_step1``. Therefore to reconstruct the fp32
-weights just run:
-
-.. code-block:: bash
-
-    python zero_to_fp32.py . pytorch_model.bin
-
-That's it. ``pytorch_model.bin`` will now contain the full fp32 model weights consolidated from multiple GPUs.
-
-The script will automatically be able to handle either a ZeRO-2 or ZeRO-3 checkpoint.
-
-``python zero_to_fp32.py -h`` will give you usage details.
-
-The script will auto-discover the deepspeed sub-folder using the contents of the file ``latest``, which in the current
-example will contain ``global_step1``.
-
-Note: currently the script requires general RAM of about 2x the size of the final fp32 model weights.
-
-
-ZeRO-3 and Infinity Nuances
-=======================================================================================================================
-
-ZeRO-3 is quite different from ZeRO-2 because of its param sharding feature.
-
-ZeRO-Infinity further extends ZeRO-3 to support NVMe memory and multiple other speed and scalability improvements.
-
-While every effort was made for things to just work without needing any special changes to your models, in certain
-circumstances you may find the following information to be needed.
-
-
-
-Constructing Massive Models
-+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
-DeepSpeed/ZeRO-3 can handle models with trillions of parameters which may not fit into the available RAM. In such
-cases, but also if you want the initialization to happen much faster, initialize the model using the
-``deepspeed.zero.Init()`` context manager (which is also a function decorator), like so:
-
-.. code-block:: python
-
-    from transformers import T5ForConditionalGeneration, T5Config
-    import deepspeed
-    with deepspeed.zero.Init():
-       config = T5Config.from_pretrained("t5-small")
-       model = T5ForConditionalGeneration(config)
-
-As you can see this gives you a randomly initialized model.
-
-If you want to use a pretrained model, ``model_class.from_pretrained`` will activate this feature as long as
-``is_deepspeed_zero3_enabled()`` returns ``True``, which currently is set up by the
-:class:`~transformers.TrainingArguments` object if the passed DeepSpeed configuration file contains a ZeRO-3 config
-section. Thus you must create the :class:`~transformers.TrainingArguments` object **before** calling
-``from_pretrained``. Here is an example of a possible sequence:
-
-.. code-block:: python
-
-    from transformers import AutoModel, Trainer, TrainingArguments
-    training_args = TrainingArguments(..., deepspeed=ds_config)
-    model = AutoModel.from_pretrained("t5-small")
-    trainer = Trainer(model=model, args=training_args, ...)
-
-If you're using the official example scripts and your command line arguments include ``--deepspeed ds_config.json``
-with ZeRO-3 config enabled, then everything is already done for you, since this is how example scripts are written.
-
-Note: If the fp16 weights of the model can't fit onto the memory of a single GPU this feature must be used.
-
-For full details on this method and other related features please refer to `Constructing Massive Models
-<https://deepspeed.readthedocs.io/en/latest/zero3.html#constructing-massive-models>`__.
-
-Also, when loading fp16-pretrained models, you will want to tell ``from_pretrained`` to use
-``torch_dtype=torch.float16``. For details, please see :ref:`from_pretrained-torch-dtype`.
-
-
-Gathering Parameters
-+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
-Under ZeRO-3 on multiple GPUs no single GPU has all the parameters unless it's the parameters for the currently
-executing layer. So if you need to access all parameters from all layers at once there is a specific method to do it.
-Most likely you won't need it, but if you do please refer to `Gathering Parameters
-<https://deepspeed.readthedocs.io/en/latest/zero3.html#manual-parameter-coordination>`__.
-
-We do, however, use it internally in several places; one such example is when loading pretrained model weights in
-``from_pretrained``. We load one layer at a time and immediately partition it to all participating GPUs, as for very
-large models it won't be possible to load it on one GPU and then spread it out to multiple GPUs, due to memory
-limitations.
-
-Also under ZeRO-3, if you write your own code and run into a model parameter weight that looks like:
-
-.. code-block:: python
-
-    tensor([1.], device='cuda:0', dtype=torch.float16, requires_grad=True)
-
-(note the ``tensor([1.])``), or if you get an error where it says the parameter is of size ``1``, instead of some much
-larger multi-dimensional shape, this means that the parameter is partitioned and what you see is a ZeRO-3 placeholder.
-
-
-
-.. _deepspeed-zero-inference:
-
-
-ZeRO Inference
-=======================================================================================================================
-
-ZeRO Inference uses the same config as ZeRO-3 Training. You just don't need the optimizer and scheduler sections. In
-fact you can leave these in the config file if you want to share the same one with the training. They will just be
-ignored.
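-
-As an illustration, a trimmed-down inference config could look like the sketch below. It simply reuses entries from
-the ``ds_config_zero3.json`` example in this document with the ``optimizer`` and ``scheduler`` sections (and the
-optimizer offload) dropped; treat it as an assumed starting point rather than a tuned configuration:
-
-.. code-block:: json
-
-    {
-        "fp16": {
-            "enabled": "auto"
-        },
-
-        "zero_optimization": {
-            "stage": 3,
-            "offload_param": {
-                "device": "cpu",
-                "pin_memory": true
-            },
-            "overlap_comm": true,
-            "contiguous_gradients": true,
-            "reduce_bucket_size": "auto",
-            "stage3_prefetch_bucket_size": "auto",
-            "stage3_param_persistence_threshold": "auto",
-            "stage3_max_live_parameters": 1e9,
-            "stage3_max_reuse_distance": 1e9
-        },
-
-        "train_batch_size": "auto",
-        "train_micro_batch_size_per_gpu": "auto",
-        "wall_clock_breakdown": false
-    }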
-
-Otherwise you just need to pass the usual :class:`~transformers.TrainingArguments` arguments. For example:
-
-.. code-block:: bash
-
-    deepspeed --num_gpus=2 your_program.py <normal cl args> --do_eval --deepspeed ds_config.json
-
-The only important thing is that you need to use a ZeRO-3 configuration, since ZeRO-2 provides no benefit whatsoever
-for inference: only ZeRO-3 performs sharding of parameters, whereas ZeRO-2 shards gradients and optimizer states.
-
-Here is an example of running ``run_translation.py`` under DeepSpeed deploying all available GPUs:
-
-.. code-block:: bash
-
-    deepspeed examples/pytorch/translation/run_translation.py \
-    --deepspeed tests/deepspeed/ds_config_zero3.json \
-    --model_name_or_path t5-small --output_dir output_dir \
-    --do_eval --max_eval_samples 50 --warmup_steps 50  \
-    --max_source_length 128 --val_max_target_length 128 \
-    --overwrite_output_dir --per_device_eval_batch_size 4 \
-    --predict_with_generate --dataset_config "ro-en" --fp16 \
-    --source_lang en --target_lang ro --dataset_name wmt16 \
-    --source_prefix "translate English to Romanian: "
-
-Since for inference there is no need for additional large memory used by the optimizer states and the gradients you
-should be able to fit much larger batches and/or sequence length onto the same hardware.
-
-
-Additionally, DeepSpeed is currently developing a related product called DeepSpeed-Inference which has no relationship
-to the ZeRO technology, but instead uses tensor parallelism to scale models that can't fit onto a single GPU. This is a
-work in progress and we will provide the integration once that product is complete.
-
-
-Filing Issues
-=======================================================================================================================
-
-Here is how to file an issue so that we could quickly get to the bottom of the issue and help you to unblock your work.
-
-In your report please always include:
-
-1. the full Deepspeed config file in the report
-
-2. either the command line arguments if you were using the :class:`~transformers.Trainer` or
-   :class:`~transformers.TrainingArguments` arguments if you were scripting the Trainer setup yourself. Please do not
-   dump the :class:`~transformers.TrainingArguments` as it has dozens of entries that are irrelevant.
-
-3. Output of:
-
-.. code-block:: bash
-
-    python -c 'import torch; print(f"torch: {torch.__version__}")'
-    python -c 'import transformers; print(f"transformers: {transformers.__version__}")'
-    python -c 'import deepspeed; print(f"deepspeed: {deepspeed.__version__}")'
-
-4. If possible include a link to a Google Colab notebook that we can reproduce the problem with. You can use this
-   `notebook <https://github.com/stas00/porting/blob/master/transformers/deepspeed/DeepSpeed_on_colab_CLI.ipynb>`__ as
-   a starting point.
-
-5. Unless it's impossible please always use a standard dataset that we can use and not something custom.
-
-6. If possible try to use one of the existing `examples
-   <https://github.com/huggingface/transformers/tree/master/examples/pytorch>`__ to reproduce the problem with.
-
-Things to consider:
-
-* Deepspeed is often not the cause of the problem.
-
-    Some of the filed issues proved to be Deepspeed-unrelated. That is once Deepspeed was removed from the setup, the
-    problem was still there.
-
-    Therefore, if it's not absolutely obvious it's a DeepSpeed-related problem, as in you can see that there is an
-    exception and you can see that DeepSpeed modules are involved, first re-test your setup without DeepSpeed in it.
-    And only if the problem persists then do mention DeepSpeed and supply all the required details.
-
-* If it's clear to you that the issue is in the DeepSpeed core and not the integration part, please file the Issue
-  directly with `Deepspeed <https://github.com/microsoft/DeepSpeed/>`__. If you aren't sure, please do not worry,
-  either Issue tracker will do, we will figure it out once you posted it and redirect you to another Issue tracker if
-  need be.
-
-
-
-Troubleshooting
-=======================================================================================================================
-
-* ``deepspeed`` process gets killed at startup without a traceback
-
-If the ``deepspeed`` process gets killed at launch time without a traceback, that usually means that the program tried
-to allocate more CPU memory than your system has or your process is allowed to allocate and the OS kernel killed that
-process. This is because your configuration file most likely has either ``offload_optimizer`` or ``offload_param`` or
-both configured to offload to ``cpu``. If you have NVMe, experiment with offloading to NVMe if you're running under
-ZeRO-3.
-
-Work is being done to enable estimating how much memory is needed for a specific model: `PR
-<https://github.com/microsoft/DeepSpeed/pull/965>`__.
-
-
-
-
-
-
-Notes
-=======================================================================================================================
-
-* DeepSpeed works with the PyTorch :class:`~transformers.Trainer` but not TF :class:`~transformers.TFTrainer`.
-* While DeepSpeed has a pip installable PyPI package, it is highly recommended that it gets installed from `source
-  <https://github.com/microsoft/deepspeed#installation>`__ to best match your hardware and also if you need to enable
-  certain features, like 1-bit Adam, which aren't available in the pypi distribution.
-* You don't have to use the :class:`~transformers.Trainer` to use DeepSpeed with 🤗 Transformers - you can use any model
-  with your own trainer, and you will have to adapt the latter according to `the DeepSpeed integration instructions
-  <https://www.deepspeed.ai/getting-started/#writing-deepspeed-models>`__.
-
-
-
-
-.. _deepspeed-non-trainer-integration:
-
-Non-Trainer Deepspeed Integration
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-The :class:`~transformers.integrations.HfDeepSpeedConfig` is used to integrate Deepspeed into the 🤗 Transformers core
-functionality, when :class:`~transformers.Trainer` is not used.
-
-When using :class:`~transformers.Trainer` everything is automatically taken care of.
-
-When not using :class:`~transformers.Trainer`, to efficiently deploy DeepSpeed stage 3, you must instantiate the
-:class:`~transformers.integrations.HfDeepSpeedConfig` object before instantiating the model.
-
-For example for a pretrained model:
-
-.. code-block:: python
-
-    from transformers.deepspeed import HfDeepSpeedConfig
-    from transformers import AutoModel
-    import deepspeed
-
-    ds_config = { ... } # deepspeed config object or path to the file
-    # must run before instantiating the model
-    dschf = HfDeepSpeedConfig(ds_config) # keep this object alive
-    model = AutoModel.from_pretrained("gpt2")
-    engine = deepspeed.initialize(model=model, config_params=ds_config, ...)
-
-or for non-pretrained model:
-
-.. code-block:: python
-
-    from transformers.deepspeed import HfDeepSpeedConfig
-    from transformers import AutoModel, AutoConfig
-    import deepspeed
-
-    ds_config = { ... } # deepspeed config object or path to the file
-    # must run before instantiating the model
-    dschf = HfDeepSpeedConfig(ds_config) # keep this object alive
-    config = AutoConfig.from_pretrained("gpt2")
-    model = AutoModel.from_config(config)
-    engine = deepspeed.initialize(model=model, config_params=ds_config, ...)
-
-
-HfDeepSpeedConfig
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-.. autoclass:: transformers.deepspeed.HfDeepSpeedConfig
-    :members:
-
-
-
-Main DeepSpeed Resources
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-- `Project's github <https://github.com/microsoft/deepspeed>`__
-- `Usage docs <https://www.deepspeed.ai/getting-started/>`__
-- `API docs <https://deepspeed.readthedocs.io/en/latest/index.html>`__
-- `Blog posts <https://www.microsoft.com/en-us/research/search/?q=deepspeed>`__
-
-Papers:
-
-- `ZeRO: Memory Optimizations Toward Training Trillion Parameter Models <https://arxiv.org/abs/1910.02054>`__
-- `ZeRO-Offload: Democratizing Billion-Scale Model Training <https://arxiv.org/abs/2101.06840>`__
-- `ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning <https://arxiv.org/abs/2104.07857>`__
-
-Finally, please, remember that, HuggingFace :class:`~transformers.Trainer` only integrates DeepSpeed, therefore if you
-have any problems or questions with regards to DeepSpeed usage, please, file an issue with `DeepSpeed GitHub
-<https://github.com/microsoft/DeepSpeed/issues>`__.
diff --git a/docs/source/testing.mdx b/docs/source/testing.mdx
new file mode 100644
index 0000000000..6e9afd0087
--- /dev/null
+++ b/docs/source/testing.mdx
@@ -0,0 +1,1189 @@
+<!--Copyright 2020 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# Testing
+
+
+Let's take a look at how 🤗 Transformers models are tested and how you can write new tests and improve the existing ones.
+
+There are 2 test suites in the repository:
+
+1. `tests` -- tests for the general API
+2. `examples` -- tests primarily for various applications that aren't part of the API
+
+## How transformers are tested
+
+1. Once a PR is submitted it gets tested with 9 CircleCI jobs. Every new commit to that PR gets retested. These jobs
+   are defined in this [config file](https://github.com/huggingface/transformers-doc2mdx/tree/master/.circleci/config.yml), so that if needed you can reproduce the same
+   environment on your machine.
+
+   These CI jobs don't run `@slow` tests.
+
+2. There are 3 jobs run by [github actions](https://github.com/huggingface/transformers/actions):
+
+   - [torch hub integration](https://github.com/huggingface/transformers-doc2mdx/tree/master/.github/workflows/github-torch-hub.yml): checks whether torch hub
+     integration works.
+
+   - [self-hosted (push)](https://github.com/huggingface/transformers-doc2mdx/tree/master/.github/workflows/self-push.yml): runs fast tests on GPU only on commits on
+     `master`. It only runs if a commit on `master` has updated the code in one of the following folders: `src`,
+     `tests`, `.github` (to prevent running on added model cards, notebooks, etc.)
+
+   - [self-hosted runner](https://github.com/huggingface/transformers-doc2mdx/tree/master/.github/workflows/self-scheduled.yml): runs normal and slow tests on GPU in
+     `tests` and `examples`:
+
+```bash
+RUN_SLOW=1 pytest tests/
+RUN_SLOW=1 pytest examples/
+```
+
+   The results can be observed [here](https://github.com/huggingface/transformers/actions).
+
+
+
+## Running tests
+
+
+
+
+
+### Choosing which tests to run
+
+This document goes into many details of how tests can be run. If after reading everything, you need even more details
+you will find them [here](https://docs.pytest.org/en/latest/usage.html).
+
+Here are some of the most useful ways of running tests.
+
+Run all:
+
+```console
+pytest
+```
+
+or:
+
+```bash
+make test
+```
+
+Note that the latter is defined as:
+
+```bash
+python -m pytest -n auto --dist=loadfile -s -v ./tests/
+```
+
+which tells pytest to:
+
+- run as many test processes as there are CPU cores (which could be too many if you don't have a ton of RAM!)
+- ensure that all tests from the same file will be run by the same test process
+- do not capture output
+- run in verbose mode
+
+
+
+### Getting the list of all tests
+
+All tests of the test suite:
+
+```bash
+pytest --collect-only -q
+```
+
+All tests of a given test file:
+
+```bash
+pytest tests/test_optimization.py --collect-only -q
+```
+
+### Run a specific test module
+
+To run an individual test module:
+
+```bash
+pytest tests/test_logging.py
+```
+
+### Run specific tests
+
+Since unittest is used inside most of the tests, to run specific subtests you need to know the name of the unittest
+class containing those tests. For example, it could be:
+
+```bash
+pytest tests/test_optimization.py::OptimizationTest::test_adam_w
+```
+
+Here:
+
+- `tests/test_optimization.py` - the file with tests
+- `OptimizationTest` - the name of the class
+- `test_adam_w` - the name of the specific test function
+
+If the file contains multiple classes, you can choose to run only tests of a given class. For example:
+
+```bash
+pytest tests/test_optimization.py::OptimizationTest
+```
+
+will run all the tests inside that class.
+
+As mentioned earlier you can see what tests are contained inside the `OptimizationTest` class by running:
+
+```bash
+pytest tests/test_optimization.py::OptimizationTest --collect-only -q
+```
+
+You can run tests by keyword expressions.
+
+To run only tests whose name contains `adam`:
+
+```bash
+pytest -k adam tests/test_optimization.py
+```
+
+Logical `and` and `or` can be used to indicate whether all keywords should match or either. `not` can be used to
+negate.
+
+To run all tests except those whose name contains `adam`:
+
+```bash
+pytest -k "not adam" tests/test_optimization.py
+```
+
+And you can combine the two patterns in one:
+
+```bash
+pytest -k "ada and not adam" tests/test_optimization.py
+```
+
+For example to run both `test_adafactor` and `test_adam_w` you can use:
+
+```bash
+pytest -k "test_adafactor or test_adam_w" tests/test_optimization.py
+```
+
+Note that we use `or` here, since we want either of the keywords to match to include both.
+
+If you want to include only tests that include both patterns, `and` is to be used:
+
+```bash
+pytest -k "test and ada" tests/test_optimization.py
+```
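The keyword examples above can be approximated in a few lines of Python. This is only a toy model of `-k`, not pytest's actual matcher: the `matches` helper is hypothetical, and it assumes that every bare word in the expression is a case-insensitive substring test against the test name.

```python
import re


def matches(expr, test_name):
    """Toy model of `pytest -k`: every bare word becomes a substring check."""

    def to_bool(match):
        word = match.group(0)
        if word in ("and", "or", "not"):
            return word  # keep the boolean operators as-is
        return str(word.lower() in test_name.lower())

    # Only words, operators, spaces and parentheses are expected in this sketch.
    return bool(eval(re.sub(r"\w+", to_bool, expr)))


names = ["test_adafactor", "test_adam_w"]
selected = [n for n in names if matches("ada and not adam", n)]
print(selected)  # ['test_adafactor']
```

Both names contain `ada`, but only `test_adam_w` contains `adam`, so the expression keeps `test_adafactor` alone.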
+
+### Run only modified tests
+
+You can run the tests related to the unstaged files or the current branch (according to Git) by using [pytest-picked](https://github.com/anapaulagomes/pytest-picked). This is a great way of quickly checking that your changes didn't break
+anything, since it won't run the tests related to files you didn't touch.
+
+```bash
+pip install pytest-picked
+```
+
+```bash
+pytest --picked
+```
+
+All tests will be run from files and folders which are modified, but not yet committed.
+
+### Automatically rerun failed tests on source modification
+
+[pytest-xdist](https://github.com/pytest-dev/pytest-xdist) provides a very useful feature: it detects all failed
+tests, then waits for you to modify files and continuously re-runs those failing tests until they pass while you
+fix them, so you don't need to restart pytest after each fix. This is repeated until all tests pass, after
+which a full run is performed again.
+
+```bash
+pip install pytest-xdist
+```
+
+To enter the mode: `pytest -f` or `pytest --looponfail`
+
+File changes are detected by looking at `looponfailroots` root directories and all of their contents (recursively).
+If the default for this value does not work for you, you can change it in your project by setting a configuration
+option in `setup.cfg`:
+
+```ini
+[tool:pytest]
+looponfailroots = transformers tests
+```
+
+or `pytest.ini`/`tox.ini` files:
+
+```ini
+[pytest]
+looponfailroots = transformers tests
+```
+
+This would lead to only looking for file changes in the respective directories, specified relatively to the ini-file’s
+directory.
+
+[pytest-watch](https://github.com/joeyespo/pytest-watch) is an alternative implementation of this functionality.
+
+
+### Skip a test module
+
+If you want to run all test modules, except a few you can exclude them by giving an explicit list of tests to run. For
+example, to run all except `test_modeling_*.py` tests:
+
+```bash
+pytest $(ls -1 tests/*py | grep -v test_modeling)
+```
+
+### Clearing state
+
+For CI builds, and whenever isolation matters more than speed, the cache should be cleared:
+
+```bash
+pytest --cache-clear tests
+```
+
+### Running tests in parallel
+
+As mentioned earlier `make test` runs tests in parallel via `pytest-xdist` plugin (`-n X` argument, e.g. `-n 2`
+to run 2 parallel jobs).
+
+`pytest-xdist`'s `--dist=` option allows one to control how the tests are grouped. `--dist=loadfile` puts the
+tests located in one file onto the same process.
+
+Since the order of executed tests is different and unpredictable, if running the test suite with `pytest-xdist`
+produces failures (meaning we have some undetected coupled tests), use [pytest-replay](https://github.com/ESSS/pytest-replay) to replay the tests in the same order, which should help with then somehow
+reducing that failing sequence to a minimum.
+
+### Test order and repetition
+
+It's good to repeat the tests several times, in sequence, randomly, or in sets, to detect any potential
+inter-dependency and state-related bugs (tear down). Straightforward repetition is also useful for detecting
+problems that only surface because of the randomness inherent in DL.
+
+
+#### Repeat tests
+
+- [pytest-flakefinder](https://github.com/dropbox/pytest-flakefinder):
+
+```bash
+pip install pytest-flakefinder
+```
+
+And then run every test multiple times (50 by default):
+
+```bash
+pytest --flake-finder --flake-runs=5 tests/test_failing_test.py
+```
+
+<Tip>
+
+This plugin doesn't work with `-n` flag from `pytest-xdist`.
+
+</Tip>
+
+<Tip>
+
+There is another plugin `pytest-repeat`, but it doesn't work with `unittest`.
+
+</Tip>
+
+#### Run tests in a random order
+
+```bash
+pip install pytest-random-order
+```
+
+Important: the presence of `pytest-random-order` will automatically randomize tests, no configuration change or
+command line options is required.
+
+As explained earlier this allows detection of coupled tests - where one test's state affects the state of another. When
+`pytest-random-order` is installed it will print the random seed it used for that session, e.g:
+
+```bash
+pytest tests
+[...]
+Using --random-order-bucket=module
+Using --random-order-seed=573663
+```
+
+So that if the given particular sequence fails, you can reproduce it by adding that exact seed, e.g.:
+
+```bash
+pytest --random-order-seed=573663
+[...]
+Using --random-order-bucket=module
+Using --random-order-seed=573663
+```
+
+It will only reproduce the exact order if you use the exact same list of tests (or no list at all). Once you start
+manually narrowing down the list you can no longer rely on the seed, but have to list the tests manually in the exact
+order they failed and tell pytest not to randomize them by using `--random-order-bucket=none`, e.g.:
+
+```bash
+pytest --random-order-bucket=none tests/test_a.py tests/test_c.py tests/test_b.py
+```
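Why the seed only helps while the test list is unchanged can be seen in a small sketch. It assumes nothing about how `pytest-random-order` is implemented; the `shuffled` helper is hypothetical and simply uses a seeded `random.Random` to order a list.

```python
import random


def shuffled(tests, seed):
    """Return a copy of `tests` in a seed-determined order (hypothetical helper)."""
    order = list(tests)
    random.Random(seed).shuffle(order)
    return order


tests = ["test_a", "test_b", "test_c", "test_d"]
first = shuffled(tests, seed=573663)
again = shuffled(tests, seed=573663)
print(first == again)  # True: same list and same seed give the same order
```

Once the input list changes, the same seed is applied to different input, so the relative order of the remaining tests is no longer guaranteed.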
+
+To disable the shuffling for all tests:
+
+```bash
+pytest --random-order-bucket=none
+```
+
+By default `--random-order-bucket=module` is implied, which will shuffle the files on the module levels. It can also
+shuffle on `class`, `package`, `global` and `none` levels. For the complete details please see its
+[documentation](https://github.com/jbasko/pytest-random-order).
+
+Another randomization alternative is: [`pytest-randomly`](https://github.com/pytest-dev/pytest-randomly). This
+module has a very similar functionality/interface, but it doesn't have the bucket modes available in
+`pytest-random-order`. It has the same problem of imposing itself once installed.
+
+### Look and feel variations
+
+#### pytest-sugar
+
+[pytest-sugar](https://github.com/Frozenball/pytest-sugar) is a plugin that improves the look-n-feel, adds a
+progressbar, and shows failing tests and their asserts instantly. It gets activated automatically upon installation.
+
+```bash
+pip install pytest-sugar
+```
+
+To run tests without it, run:
+
+```bash
+pytest -p no:sugar
+```
+
+or uninstall it.
+
+
+
+#### Report each sub-test name and its progress
+
+For a single or a group of tests via `pytest` (after `pip install pytest-pspec`):
+
+```bash
+pytest --pspec tests/test_optimization.py
+```
+
+#### Instantly shows failed tests
+
+[pytest-instafail](https://github.com/pytest-dev/pytest-instafail) shows failures and errors instantly instead of
+waiting until the end of test session.
+
+```bash
+pip install pytest-instafail
+```
+
+```bash
+pytest --instafail
+```
+
+### To GPU or not to GPU
+
+On a GPU-enabled setup, to test in CPU-only mode add `CUDA_VISIBLE_DEVICES=""`:
+
+```bash
+CUDA_VISIBLE_DEVICES="" pytest tests/test_logging.py
+```
+
+or if you have multiple gpus, you can specify which one is to be used by `pytest`. For example, to use only the
+second gpu if you have gpus `0` and `1`, you can run:
+
+```bash
+CUDA_VISIBLE_DEVICES="1" pytest tests/test_logging.py
+```
+
+This is handy when you want to run different tasks on different GPUs.
+
+Some tests must be run on CPU-only, others on either CPU or GPU or TPU, yet others on multiple-GPUs. The following skip
+decorators are used to set the requirements of tests CPU/GPU/TPU-wise:
+
+- `require_torch` - this test will run only under torch
+- `require_torch_gpu` - as `require_torch` plus requires at least 1 GPU
+- `require_torch_multi_gpu` - as `require_torch` plus requires at least 2 GPUs
+- `require_torch_non_multi_gpu` - as `require_torch` plus requires 0 or 1 GPUs
+- `require_torch_up_to_2_gpus` - as `require_torch` plus requires 0 or 1 or 2 GPUs
+- `require_torch_tpu` - as `require_torch` plus requires at least 1 TPU
+
+Let's depict the GPU requirements in the following table:
+
+
+| n gpus | decorator                      |
+|--------|--------------------------------|
+| `>= 0` | `@require_torch`               |
+| `>= 1` | `@require_torch_gpu`           |
+| `>= 2` | `@require_torch_multi_gpu`     |
+| `< 2`  | `@require_torch_non_multi_gpu` |
+| `< 3`  | `@require_torch_up_to_2_gpus`  |
+
+
+For example, here is a test that must be run only when there are 2 or more GPUs available and pytorch is installed:
+
+```python
+@require_torch_multi_gpu
+def test_example_with_multi_gpu():
+```
+
+If a test requires `tensorflow` use the `require_tf` decorator. For example:
+
+```python
+@require_tf
+def test_tf_thing_with_tensorflow():
+```
+
+These decorators can be stacked. For example, if a test is slow and requires at least one GPU under pytorch, here is
+how to set it up:
+
+```python
+@require_torch_gpu
+@slow
+def test_example_slow_on_gpu():
+```
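Stacking skip decorators can be illustrated with the standard library alone. In this sketch `require_gpu`, `slow_test`, `GPU_COUNT` and `RUN_SLOW` are hypothetical stand-ins built on `unittest.skipUnless`; they are not the decorators from `transformers.testing_utils`.

```python
import unittest

# Hypothetical stand-ins for the real environment checks.
GPU_COUNT = 0
RUN_SLOW = True


def require_gpu(test_case):
    """Skip unless at least one GPU is (pretended to be) available."""
    return unittest.skipUnless(GPU_COUNT >= 1, "test requires a GPU")(test_case)


def slow_test(test_case):
    """Skip unless slow tests were requested."""
    return unittest.skipUnless(RUN_SLOW, "test is slow")(test_case)


class StackedDecoratorsTest(unittest.TestCase):
    @require_gpu
    @slow_test
    def test_example_slow_on_gpu(self):
        self.fail("should never run without a GPU")


result = unittest.TestResult()
unittest.defaultTestLoader.loadTestsFromTestCase(StackedDecoratorsTest).run(result)
print(len(result.skipped), result.wasSuccessful())
```

The test is skipped as soon as any one of the stacked conditions is not met, so the failing body is never executed.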
+
+Some decorators like `@parameterized` rewrite test names, therefore `@require_*` skip decorators have to be listed
+last for them to work correctly. Here is an example of the correct usage:
+
+```python
+@parameterized.expand(...)
+@require_torch_multi_gpu
+def test_integration_foo():
+```
+
+This order problem doesn't exist with `@pytest.mark.parametrize`, you can put it first or last and it will still
+work. But it only works with non-unittests.
+
+Inside tests:
+
+- How many GPUs are available:
+
+```python
+from transformers.testing_utils import get_gpu_count
+n_gpu = get_gpu_count() # works with torch and tf
+```
+
+### Distributed training
+
+`pytest` can't deal with distributed training directly. If this is attempted - the sub-processes don't do the right
+thing and end up thinking they are `pytest` and start running the test suite in loops. It works, however, if one
+spawns a normal process that then spawns off multiple workers and manages the IO pipes.
+
+Here are some tests that use it:
+
+- [test_trainer_distributed.py](https://github.com/huggingface/transformers-doc2mdx/tree/master/tests/test_trainer_distributed.py)
+- [test_deepspeed.py](https://github.com/huggingface/transformers-doc2mdx/tree/master/tests/deepspeed/test_deepspeed.py)
+
+To jump right into the execution point, search for the `execute_subprocess_async` call in those tests.
+
+You will need at least 2 GPUs to see these tests in action:
+
+```bash
+CUDA_VISIBLE_DEVICES=0,1 RUN_SLOW=1 pytest -sv tests/test_trainer_distributed.py
+```
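The idea of a test spawning a normal process, rather than letting `pytest` itself be forked, can be sketched with `subprocess`. This is not the `execute_subprocess_async` helper mentioned above; `run_in_subprocess` is a hypothetical name, and the one-line worker merely stands in for a real launcher command.

```python
import subprocess
import sys


def run_in_subprocess(code):
    """Run `code` in a fresh Python process and return the completed process."""
    return subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,  # collect stdout/stderr instead of inheriting them
        text=True,
        check=False,
    )


proc = run_in_subprocess("print('worker finished')")
print(proc.returncode, proc.stdout.strip())
```

A test would then assert on the return code and the captured output of the child process.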
+
+### Output capture
+
+During test execution any output sent to `stdout` and `stderr` is captured. If a test or a setup method fails, its
+corresponding captured output will usually be shown along with the failure traceback.
+
+To disable output capturing and to get the `stdout` and `stderr` normally, use `-s` or `--capture=no`:
+
+```bash
+pytest -s tests/test_logging.py
+```
+
+To send test results to JUnit format output:
+
+```bash
+pytest tests --junitxml=result.xml
+```
+
+### Color control
+
+To have no color (e.g., yellow on white background is not readable):
+
+```bash
+pytest --color=no tests/test_logging.py
+```
+
+### Sending test report to online pastebin service
+
+Creating a URL for each test failure:
+
+```bash
+pytest --pastebin=failed tests/test_logging.py
+```
+
+This will submit test run information to a remote Paste service and provide a URL for each failure. You may select
+tests as usual or add for example -x if you only want to send one particular failure.
+
+Creating a URL for a whole test session log:
+
+```bash
+pytest --pastebin=all tests/test_logging.py
+```
+
+## Writing tests
+
+🤗 transformers tests are based on `unittest`, but run by `pytest`, so most of the time features from both systems
+can be used.
+
+You can read [here](https://docs.pytest.org/en/stable/unittest.html) which features are supported, but the important
+thing to remember is that most `pytest` fixtures don't work. Neither does parametrization, but we use the module
+`parameterized` that works in a similar way.
+
+
+### Parametrization
+
+Often, there is a need to run the same test multiple times, but with different arguments. It could be done from within
+the test, but then there is no way of running that test for just one set of arguments.
+
+```python
+# test_this1.py
+import math
+import unittest
+
+from parameterized import parameterized
+
+
+class TestMathUnitTest(unittest.TestCase):
+    @parameterized.expand([
+        ("negative", -1.5, -2.0),
+        ("integer", 1, 1.0),
+        ("large fraction", 1.6, 1),
+    ])
+    def test_floor(self, name, input, expected):
+        # `name` only labels the sub-test; `input` and `expected` are the actual data
+        self.assertEqual(math.floor(input), expected)
+```
+
+Now, by default this test will be run 3 times, each time with the last 3 arguments of `test_floor` being assigned the
+corresponding arguments in the parameter list.
+
+You can run just the `negative` and `integer` sets of params with:
+
+```bash
+pytest -k "negative or integer" tests/test_mytest.py
+```
+
+or all but `negative` sub-tests, with:
+
+```bash
+pytest -k "not negative" tests/test_mytest.py
+```
+
+Besides using the `-k` filter that was just mentioned, you can find out the exact name of each sub-test and run any
+or all of them using their exact names.
+
+```bash
+pytest test_this1.py --collect-only -q
+```
+
+and it will list:
+
+```bash
+test_this1.py::TestMathUnitTest::test_floor_0_negative
+test_this1.py::TestMathUnitTest::test_floor_1_integer
+test_this1.py::TestMathUnitTest::test_floor_2_large_fraction
+```
+
+So now you can run just 2 specific sub-tests:
+
+```bash
+pytest test_this1.py::TestMathUnitTest::test_floor_0_negative  test_this1.py::TestMathUnitTest::test_floor_1_integer
+```
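The sub-test names listed above follow a visible pattern: the function name, the index of the parameter set, and a sanitized form of its first argument. A minimal sketch of that pattern, with a hypothetical `sub_test_name` helper that only mirrors the listing and is not taken from the `parameterized` source:

```python
import re


def sub_test_name(func_name, index, first_param):
    """Build a sub-test name shaped like the listing above (hypothetical helper)."""
    # Anything that is not a letter, digit or underscore becomes an underscore.
    safe = re.sub(r"[^a-zA-Z0-9_]+", "_", str(first_param))
    return f"{func_name}_{index}_{safe}"


params = [("negative", -1.5, -2.0), ("integer", 1, 1.0), ("large fraction", 1.6, 1)]
names = [sub_test_name("test_floor", i, p[0]) for i, p in enumerate(params)]
print(names)
```

This is why giving each parameter set a short descriptive first argument pays off: it ends up in the name you select tests by.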
+
+The module [parameterized](https://pypi.org/project/parameterized/) which is already in the developer dependencies
+of `transformers` works for both: `unittests` and `pytest` tests.
+
+If, however, the test is not a `unittest`, you may use `pytest.mark.parametrize` (or you may see it being used in
+some existing tests, mostly under `examples`).
+
+Here is the same example, this time using `pytest`'s `parametrize` marker:
+
+```python
+# test_this2.py
+import math
+
+import pytest
+
+
+@pytest.mark.parametrize(
+    "name, input, expected",
+    [
+        ("negative", -1.5, -2.0),
+        ("integer", 1, 1.0),
+        ("large fraction", 1.6, 1),
+    ],
+)
+def test_floor(name, input, expected):
+    assert math.floor(input) == expected
+```
+
+Same as with `parameterized`, with `pytest.mark.parametrize` you can have a fine control over which sub-tests are
+run, if the `-k` filter doesn't do the job. Except, this parametrization function creates a slightly different set of
+names for the sub-tests. Here is what they look like:
+
+```bash
+pytest test_this2.py --collect-only -q
+```
+
+and it will list:
+
+```bash
+test_this2.py::test_floor[integer-1-1.0]
+test_this2.py::test_floor[negative--1.5--2.0]
+test_this2.py::test_floor[large fraction-1.6-1]
+```
+
+So now you can run just the specific test:
+
+```bash
+pytest test_this2.py::test_floor[negative--1.5--2.0] test_this2.py::test_floor[integer-1-1.0]
+```
+
+as in the previous example.
+
+
+
+### Files and directories
+
+Tests often need to know where things are relative to the current test file, which is not trivial since the test
+could be invoked from more than one directory or could reside in sub-directories of different depths. The helper class
+`transformers.testing_utils.TestCasePlus` solves this problem by sorting out all the basic paths and providing easy
+accessors to them:
+
+- `pathlib` objects (all fully resolved):
+
+  - `test_file_path` - the current test file path, i.e. `__file__`
+  - `test_file_dir` - the directory containing the current test file
+  - `tests_dir` - the directory of the `tests` test suite
+  - `examples_dir` - the directory of the `examples` test suite
+  - `repo_root_dir` - the directory of the repository
+  - `src_dir` - the directory of `src` (i.e. where the `transformers` sub-dir resides)
+
+- stringified paths---same as above but these return paths as strings, rather than `pathlib` objects:
+
+  - `test_file_path_str`
+  - `test_file_dir_str`
+  - `tests_dir_str`
+  - `examples_dir_str`
+  - `repo_root_dir_str`
+  - `src_dir_str`
+
+To start using those all you need is to make sure that the test resides in a subclass of
+`transformers.testing_utils.TestCasePlus`. For example:
+
+```python
+from transformers.testing_utils import TestCasePlus
+class PathExampleTest(TestCasePlus):
+    def test_something_involving_local_locations(self):
+        data_dir = self.tests_dir / "fixtures/tests_samples/wmt_en_ro"
+```
+
+If you don't need to manipulate paths via `pathlib` or you just need a path as a string, you can always invoke
+`str()` on the `pathlib` object or use the accessors ending with `_str`. For example:
+
+```python
+from transformers.testing_utils import TestCasePlus
+class PathExampleTest(TestCasePlus):
+    def test_something_involving_stringified_locations(self):
+        examples_dir = self.examples_dir_str
+```
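How such accessors could be derived from a test file's location is sketched below. It is an illustration under stated assumptions, not the `TestCasePlus` implementation: `describe_paths` is a hypothetical helper, and it assumes the test file lives somewhere under a directory named `tests` that sits directly in the repository root.

```python
from pathlib import Path


def describe_paths(test_file):
    """Derive a few locations from a test file path (hypothetical helper)."""
    test_file_path = Path(test_file).resolve()
    test_file_dir = test_file_path.parent
    # Walk up until a directory named `tests` is found.
    candidates = [test_file_dir, *test_file_dir.parents]
    tests_dir = next(p for p in candidates if p.name == "tests")
    return {
        "test_file_path": test_file_path,
        "test_file_dir": test_file_dir,
        "tests_dir": tests_dir,
        "repo_root_dir": tests_dir.parent,
    }


paths = describe_paths("repo/tests/deepspeed/test_deepspeed.py")
print(paths["tests_dir"].name, paths["repo_root_dir"].name)
```

Because every path is resolved to an absolute one, the result does not depend on how deep the test file is nested.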
+
+### Temporary files and directories
+
+Using unique temporary files and directories is essential for parallel test running, so that the tests won't overwrite
+each other's data. We also want the temporary files and directories to be removed at the end of each test that created
+them. Therefore, using packages like `tempfile`, which address these needs, is essential.
+
+However, when debugging tests, you need to be able to see what goes into the temporary file or directory, and you want
+to know its exact path rather than having it randomized on every test re-run.
+
+The helper class `transformers.testing_utils.TestCasePlus` is best used for such purposes. It's a sub-class of
+`unittest.TestCase`, so we can easily inherit from it in the test modules.
+
+Here is an example of its usage:
+
+```python
+from transformers.testing_utils import TestCasePlus
+class ExamplesTests(TestCasePlus):
+    def test_whatever(self):
+        tmp_dir = self.get_auto_remove_tmp_dir()
+```
+
+This code creates a unique temporary directory, and sets `tmp_dir` to its location.
+
+- Create a unique temporary dir:
+
+```python
+def test_whatever(self):
+    tmp_dir = self.get_auto_remove_tmp_dir()
+```
+
+`tmp_dir` will contain the path to the created temporary dir. It will be automatically removed at the end of the
+test.
+
+- Create a temporary dir of my choice, ensure it's empty before the test starts and don't empty it after the test.
+
+```python
+def test_whatever(self):
+    tmp_dir = self.get_auto_remove_tmp_dir("./xxx")
+```
+
+This is useful for debug when you want to monitor a specific directory and want to make sure the previous tests didn't
+leave any data in there.
+
+- You can override the default behavior by directly overriding the `before` and `after` args, leading to one of the
+  following behaviors:
+
+  - `before=True`: the temporary dir will always be cleared at the beginning of the test.
+  - `before=False`: if the temporary dir already existed, any existing files will remain there.
+  - `after=True`: the temporary dir will always be deleted at the end of the test.
+  - `after=False`: the temporary dir will always be left intact at the end of the test.
+
+<Tip>
+
+In order to run the equivalent of `rm -r` safely, only subdirs of the project repository checkout are allowed if
+an explicit `tmp_dir` is used, so that no `/tmp` or similar important part of the filesystem gets nuked by
+mistake, i.e. please always pass paths that start with `./`.
+
+</Tip>
+
+<Tip>
+
+Each test can register multiple temporary directories and they all will get auto-removed, unless requested
+otherwise.
+
+</Tip>
+
+### Temporary sys.path override
+
+If you need to temporarily override `sys.path` to import from another test for example, you can use the
+`ExtendSysPath` context manager. Example:
+
+
+```python
+import os
+from transformers.testing_utils import ExtendSysPath
+bindir = os.path.abspath(os.path.dirname(__file__))
+with ExtendSysPath(f"{bindir}/.."):
+    from test_trainer import TrainerIntegrationCommon  # noqa
+```
+
+### Skipping tests
+
+This is useful when a bug is found and a new test is written, but the bug is not fixed yet. In order to be able to
+commit it to the main repository we need to make sure it's skipped during `make test`.
+
+Methods:
+
+-  A **skip** means that you expect your test to pass only if some conditions are met, otherwise pytest should skip
+  running the test altogether. Common examples are skipping windows-only tests on non-windows platforms, or skipping
+  tests that depend on an external resource which is not available at the moment (for example a database).
+
+-  An **xfail** means that you expect a test to fail for some reason. A common example is a test for a feature not yet
+  implemented, or a bug not yet fixed. When a test passes despite being expected to fail (marked with
+  `pytest.mark.xfail`), it's an xpass and will be reported in the test summary.
+
+One of the important differences between the two is that `skip` doesn't run the test, and `xfail` does. So if the
+code that's buggy causes some bad state that will affect other tests, do not use `xfail`.
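+
+The difference can be sketched with the standard library alone. This is only an illustration: it uses
+`unittest.expectedFailure` as a stand-in for `pytest.mark.xfail`, and the `executed` list and the class name are made
+up for the example:
+
+```python
+import unittest
+
+executed = []  # records which test bodies actually ran
+
+
+class SkipVersusExpectedFailure(unittest.TestCase):
+    @unittest.skip("bug not fixed yet")
+    def test_skipped(self):
+        executed.append("skipped")  # never reached: the body is not run
+        self.fail("known bug")
+
+    @unittest.expectedFailure
+    def test_expected_failure(self):
+        executed.append("expected_failure")  # runs, and can leave state behind
+        self.fail("known bug")
+
+
+suite = unittest.defaultTestLoader.loadTestsFromTestCase(SkipVersusExpectedFailure)
+result = unittest.TestResult()
+suite.run(result)
+```
+
+After the run, `executed` contains only `"expected_failure"`: the skipped body never ran, while the expected-failure
+body did, so any state it modified is still around.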
+
+#### Implementation
+
+- Here is how to skip a whole test unconditionally:
+
+```python
+@unittest.skip("this bug needs to be fixed")
+def test_feature_x():
+```
+
+or via pytest:
+
+```python
+@pytest.mark.skip(reason="this bug needs to be fixed")
+```
+
+or the `xfail` way:
+
+```python
+@pytest.mark.xfail
+def test_feature_x():
+```
+
+- Here is how to skip a test based on some internal check inside the test:
+
+```python
+def test_feature_x():
+    if not has_something():
+        pytest.skip("unsupported configuration")
+```
+
+or the whole module:
+
+```python
+import pytest
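+# note: the global `pytest.config` was removed in pytest 5.0, so this form requires an older pytest
+if not pytest.config.getoption("--custom-flag"):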
+    pytest.skip("--custom-flag is missing, skipping tests", allow_module_level=True)
+```
+
+or the `xfail` way:
+
+```python
+def test_feature_x():
+    pytest.xfail("expected to fail until bug XYZ is fixed")
+```
+
+- Here is how to skip all tests in a module if some import is missing:
+
+```python
+docutils = pytest.importorskip("docutils", minversion="0.3")
+```
+
+-  Skip a test based on a condition:
+
+```python
+@pytest.mark.skipif(sys.version_info < (3,6), reason="requires python3.6 or higher")
+def test_feature_x():
+```
+
+or:
+
+```python
+@unittest.skipIf(torch_device == "cpu", "Can't do half precision")
+def test_feature_x():
+```
+
+or skip the whole class:
+
+```python
+@pytest.mark.skipif(sys.platform == 'win32', reason="does not run on windows")
+class TestClass():
+    def test_feature_x(self):
+```
+
+More details, examples and approaches are [here](https://docs.pytest.org/en/latest/skipping.html).
+
+### Slow tests
+
+The library of tests is ever-growing, and some of the tests take minutes to run, so we can't afford waiting for
+an hour for the test suite to complete on CI. Therefore, with some exceptions for essential tests, slow tests should be
+marked as in the example below:
+
+```python
+from transformers.testing_utils import slow
+@slow
+def test_integration_foo():
+```
+
+To run tests marked as `@slow`, set the `RUN_SLOW=1` env var, e.g.:
+
+```bash
+RUN_SLOW=1 pytest tests
+```
+
+Some decorators like `@parameterized` rewrite test names, therefore `@slow` and the rest of the skip decorators
+`@require_*` have to be listed last for them to work correctly. Here is an example of the correct usage:
+
+```python
+@parameterized.expand(...)
+@slow
+def test_integration_foo():
+```
+
+As explained at the beginning of this document, slow tests get to run on a scheduled basis, rather than in PR CI
+checks. So it's possible that some problems will be missed during a PR submission and get merged. Such problems will
+get caught during the next scheduled CI job. But it also means that it's important to run the slow tests on your
+machine before submitting the PR.
+
+Here is a rough decision making mechanism for choosing which tests should be marked as slow:
+
+If the test is focused on one of the library's internal components (e.g., modeling files, tokenization files,
+pipelines), then we should run that test in the non-slow test suite. If it's focused on another aspect of the library,
+such as the documentation or the examples, then we should run these tests in the slow test suite. And then, to refine
+this approach we should have exceptions:
+
+- All tests that need to download a heavy set of weights or a dataset that is larger than ~50MB (e.g., model or
+  tokenizer integration tests, pipeline integration tests) should be set to slow. If you're adding a new model, you
+  should create and upload to the hub a tiny version of it (with random weights) for integration tests. This is
+  discussed in the following paragraphs.
+- All tests that need to do a training not specifically optimized to be fast should be set to slow.
+- We can introduce exceptions if some of these should-be-non-slow tests are excruciatingly slow, and set them to
+  `@slow`. Auto-modeling tests, which save and load large files to disk, are a good example of tests that are marked
+  as `@slow`.
+- If a test completes under 1 second on CI (including downloads if any) then it should be a normal test regardless.
+
+Collectively, all the non-slow tests need to cover entirely the different internals, while remaining fast. For example,
+a significant coverage can be achieved by testing with specially created tiny models with random weights. Such models
+have the very minimal number of layers (e.g., 2), vocab size (e.g., 1000), etc. Then the `@slow` tests can use large
+slow models to do qualitative testing. To see the use of these simply look for *tiny* models with:
+
+```bash
+grep tiny tests examples
+```
+
+Here is an example of a [script](https://github.com/huggingface/transformers-doc2mdx/tree/master/scripts/fsmt/fsmt-make-tiny-model.py) that created the tiny model
+[stas/tiny-wmt19-en-de](https://huggingface.co/stas/tiny-wmt19-en-de). You can easily adjust it to your specific
+model's architecture.
+
+It's easy to measure the run-time incorrectly if, for example, there is an overhead of downloading a huge model: when
+you test locally, the downloaded files are cached and thus the download time is not measured. Hence check the
+execution speed report in CI logs instead (the output of `pytest --durations=0 tests`).
+
+That report is also useful to find slow outliers that aren't marked as such, or which need to be re-written to be fast.
+If you notice that the test suite starts getting slow on CI, the top listing of this report will show the slowest
+tests.
+
+
+### Testing the stdout/stderr output
+
+In order to test functions that write to `stdout` and/or `stderr`, the test can access those streams using the
+`pytest`'s [capsys system](https://docs.pytest.org/en/latest/capture.html). Here is how this is accomplished:
+
+```python
+import sys
+def print_to_stdout(s): print(s)
+def print_to_stderr(s): sys.stderr.write(s)
+def test_result_and_stdout(capsys):
+    msg = "Hello"
+    print_to_stdout(msg)
+    print_to_stderr(msg)
+    out, err = capsys.readouterr() # consume the captured output streams
+    # optional: if you want to replay the consumed streams:
+    sys.stdout.write(out)
+    sys.stderr.write(err)
+    # test:
+    assert msg in out
+    assert msg in err
+```
+
+And, of course, most of the time, `stderr` will come as a part of an exception, so try/except has to be used in such
+a case:
+
+```python
+def raise_exception(msg): raise ValueError(msg)
+def test_something_exception():
+    msg = "Not a good value"
+    error = ''
+    try:
+        raise_exception(msg)
+    except Exception as e:
+        error = str(e)
+        assert msg in error, f"{msg} is not in the exception:\n{error}"
+```
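+
+One caveat with a try/except test of this shape is that it passes silently when no exception is raised at all. A
+stricter variant can be sketched with the standard library's `assertRaisesRegex`. This is not part of the original
+example, and the class name is made up:
+
+```python
+import unittest
+
+
+def raise_exception(msg):
+    raise ValueError(msg)
+
+
+class ExceptionMessageTest(unittest.TestCase):
+    def test_something_exception(self):
+        msg = "Not a good value"
+        # fails if nothing is raised, or if the message doesn't match
+        with self.assertRaisesRegex(ValueError, msg):
+            raise_exception(msg)
+
+
+result = unittest.TestResult()
+unittest.defaultTestLoader.loadTestsFromTestCase(ExceptionMessageTest).run(result)
+```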
+
+Another approach to capturing stdout is via `contextlib.redirect_stdout`:
+
+```python
+import sys
+from io import StringIO
+from contextlib import redirect_stdout
+def print_to_stdout(s): print(s)
+def test_result_and_stdout():
+    msg = "Hello"
+    buffer = StringIO()
+    with redirect_stdout(buffer):
+        print_to_stdout(msg)
+    out = buffer.getvalue()
+    # optional: if you want to replay the consumed streams:
+    sys.stdout.write(out)
+    # test:
+    assert msg in out
+```
+
+An important potential issue with capturing stdout is that it may contain `\r` characters that in normal `print`
+reset everything that has been printed so far. There is no problem with `pytest`, but with `pytest -s` these
+characters get included in the buffer, so to be able to have the test run with and without `-s`, you have to make an
+extra cleanup to the captured output, using `re.sub(r'^.*\r', '', buf, 0, re.M)`.
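+
+A minimal sketch of that cleanup, with the pattern anchored at the start of each line (`^` plus `re.M`). The sample
+string and the variable names are made up, and `count` and `flags` are passed by keyword, which is equivalent to the
+positional form:
+
+```python
+import re
+
+# raw captured output in which `\r` was meant to reset the line
+buf = "Secret message\rHello World\n"
+# on each line, drop everything up to and including the last `\r`
+cleaned = re.sub(r"^.*\r", "", buf, count=0, flags=re.M)
+```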
+
+But, then we have a helper context manager wrapper to automatically take care of it all, regardless of whether it has
+some `\r`'s in it or not, so it's a simple:
+
+```python
+from transformers.testing_utils import CaptureStdout
+with CaptureStdout() as cs:
+    function_that_writes_to_stdout()
+print(cs.out)
+```
+
+Here is a full test example:
+
+```python
+from transformers.testing_utils import CaptureStdout
+msg = "Secret message\r"
+final = "Hello World"
+with CaptureStdout() as cs:
+    print(msg + final)
+assert cs.out == final+"\n", f"captured: {cs.out}, expecting {final}"
+```
+
+If you'd like to capture `stderr` use the `CaptureStderr` class instead:
+
+```python
+from transformers.testing_utils import CaptureStderr
+with CaptureStderr() as cs:
+    function_that_writes_to_stderr()
+print(cs.err)
+```
+
+If you need to capture both streams at once, use the parent `CaptureStd` class:
+
+```python
+from transformers.testing_utils import CaptureStd
+with CaptureStd() as cs:
+    function_that_writes_to_stdout_and_stderr()
+print(cs.err, cs.out)
+```
+
+Also, to aid debugging test issues, by default these context managers automatically replay the captured streams on exit
+from the context.
+
+
+### Capturing logger stream
+
+If you need to validate the output of a logger, you can use `CaptureLogger`:
+
+```python
+from transformers import logging
+from transformers.testing_utils import CaptureLogger
+
+msg = "Testing 1, 2, 3"
+logging.set_verbosity_info()
+logger = logging.get_logger("transformers.models.bart.tokenization_bart")
+with CaptureLogger(logger) as cl:
+    logger.info(msg)
+assert cl.out == msg + "\n"
+```
+
+### Testing with environment variables
+
+If you want to test the impact of environment variables for a specific test you can use a helper decorator
+`transformers.testing_utils.mockenv`:
+
+```python
+import os
+import unittest
+from transformers.testing_utils import mockenv
+class HfArgumentParserTest(unittest.TestCase):
+    @mockenv(TRANSFORMERS_VERBOSITY="error")
+    def test_env_override(self):
+        env_level_str = os.getenv("TRANSFORMERS_VERBOSITY", None)
+```
+
+At times an external program needs to be called, which requires setting `PYTHONPATH` in `os.environ` to include
+multiple local paths. A helper class `transformers.testing_utils.TestCasePlus` comes to help:
+
+```python
+from transformers.testing_utils import TestCasePlus
+class EnvExampleTest(TestCasePlus):
+    def test_external_prog(self):
+        env = self.get_env()
+        # now call the external program, passing `env` to it
+```
+
+Depending on whether the test file was under the `tests` test suite or `examples` it'll correctly set up
+`env[PYTHONPATH]` to include one of these two directories, and also the `src` directory to ensure the testing is
+done against the current repo, and finally with whatever `env[PYTHONPATH]` was already set to before the test was
+called if anything.
+
+This helper method creates a copy of the `os.environ` object, so the original remains intact.
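+
+Roughly, the idea can be sketched with the standard library as follows. `build_env` is a hypothetical stand-in, not
+the actual `get_env` implementation, and the `./src` and `./tests` paths are placeholders:
+
+```python
+import os
+import subprocess
+import sys
+
+
+def build_env(extra_paths):
+    """Hypothetical stand-in for `get_env()`: copy os.environ and extend PYTHONPATH."""
+    env = os.environ.copy()  # the original os.environ is left untouched
+    paths = list(extra_paths)
+    if env.get("PYTHONPATH"):
+        paths.append(env["PYTHONPATH"])  # keep whatever was set before
+    env["PYTHONPATH"] = os.pathsep.join(paths)
+    return env
+
+
+env = build_env(["./src", "./tests"])
+# call an external program (here: a child Python process), passing `env` to it
+cmd = [sys.executable, "-c", "import os; print(os.environ['PYTHONPATH'])"]
+child_pythonpath = subprocess.run(cmd, env=env, capture_output=True, text=True, check=True).stdout.strip()
+```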
+
+
+### Getting reproducible results
+
+In some situations you may want to remove randomness for your tests. To get identical, reproducible results, you
+will need to fix the seed:
+
+```python
+seed = 42
+
+# python RNG
+import random
+random.seed(seed)
+
+# pytorch RNGs
+import torch
+torch.manual_seed(seed)
+torch.backends.cudnn.deterministic = True
+if torch.cuda.is_available(): torch.cuda.manual_seed_all(seed)
+
+# numpy RNG
+import numpy as np
+np.random.seed(seed)
+
+# tf RNG
+import tensorflow as tf
+tf.random.set_seed(seed)
+```
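+
+As a quick sanity check of what fixing the seed buys you, here is a small sketch using only Python's own RNG (the
+`draw` helper is made up for the illustration): re-seeding with the same value reproduces the same sequence.
+
+```python
+import random
+
+
+def draw(seed, n=5):
+    """Seed the Python RNG, then draw `n` pseudo-random numbers."""
+    random.seed(seed)
+    return [random.random() for _ in range(n)]
+
+
+first = draw(42)
+second = draw(42)  # same seed: identical sequence
+other = draw(43)  # different seed: different sequence
+```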
+
+### Debugging tests
+
+To start a debugger at the point of the warning, do this:
+
+```bash
+pytest tests/test_logging.py -W error::UserWarning --pdb
+```
+
+## Working with github actions workflows
+
+To trigger a self-push workflow CI job, you must:
+
+1. Create a new branch on `transformers` origin (not a fork!).
+2. The branch name has to start with either `ci_` or `ci-` (`master` triggers it too, but we can't do PRs on
+   `master`). It also gets triggered only for specific paths - you can find the up-to-date definition in case it
+   changed since this document has been written [here](https://github.com/huggingface/transformers/blob/master/.github/workflows/self-push.yml) under *push:*
+3. Create a PR from this branch.
+4. Then you can see the job appear [here](https://github.com/huggingface/transformers/actions/workflows/self-push.yml). It may not run right away if there
+   is a backlog.
+
+
+
+
+## Testing Experimental CI Features
+
+Testing CI features can be potentially problematic as it can interfere with the normal CI functioning. Therefore if a
+new CI feature is to be added, it should be done as follows:
+
+1. Create a new dedicated job that tests what needs to be tested
+2. The new job must always succeed so that it gives us a green ✓ (details below).
+3. Let it run for some days to see that a variety of different PR types get to run on it (user fork branches,
+   non-forked branches, branches originating from github.com UI direct file edit, various forced pushes, etc. - there
+   are so many) while monitoring the experimental job's logs (not the overall job green as it's purposefully always
+   green)
+4. When it's clear that everything is solid, then merge the new changes into existing jobs.
+
+That way experiments on CI functionality itself won't interfere with the normal workflow.
+
+Now how can we make the job always succeed while the new CI feature is being developed?
+
+Some CIs, like TravisCI, support ignore-step-failure and will report the overall job as successful, but CircleCI and
+Github Actions as of this writing don't support that.
+
+So the following workaround can be used:
+
+1. `set +euo pipefail` at the beginning of the run command to suppress most potential failures in the bash script.
+2. the last command must be a success: `echo "done"` or just `true` will do
+
+Here is an example:
+
+```yaml
+- run:
+    name: run CI experiment
+    command: |
+        set +euo pipefail
+        echo "setting run-all-despite-any-errors-mode"
+        this_command_will_fail
+        echo "but bash continues to run"
+        # emulate another failure
+        false
+        # but the last command must be a success
+        echo "during experiment do not remove: reporting success to CI, even if there were failures"
+```
+
+For simple commands you could also do:
+
+```bash
+cmd_that_may_fail || true
+```
+
+Of course, once satisfied with the results, integrate the experimental step or job with the rest of the normal jobs,
+while removing `set +euo pipefail` or any other things you may have added to ensure that the experimental job doesn't
+interfere with the normal CI functioning.
+
+This whole process would have been much easier if we only could set something like `allow-failure` for the
+experimental step, and let it fail without impacting the overall status of PRs. But as mentioned earlier CircleCI and
+Github Actions don't support it at the moment.
+
+You can vote for this feature and see where it stands in these CI-specific threads:
+
+- [Github Actions](https://github.com/actions/toolkit/issues/399)
+- [CircleCI](https://ideas.circleci.com/ideas/CCI-I-344)
diff --git a/docs/source/testing.rst b/docs/source/testing.rst
deleted file mode 100644
index f057e8bbcf..0000000000
--- a/docs/source/testing.rst
+++ /dev/null
@@ -1,1252 +0,0 @@
-..
-    Copyright 2020 The HuggingFace Team. All rights reserved.
-
-    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-    the License. You may obtain a copy of the License at
-
-        http://www.apache.org/licenses/LICENSE-2.0
-
-    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-    specific language governing permissions and limitations under the License.
-
-Testing
-=======================================================================================================================
-
-
-Let's take a look at how 🤗 Transformer models are tested and how you can write new tests and improve the existing ones.
-
-There are 2 test suites in the repository:
-
-1. ``tests`` -- tests for the general API
-2. ``examples`` -- tests primarily for various applications that aren't part of the API
-
-How transformers are tested
------------------------------------------------------------------------------------------------------------------------
-
-1. Once a PR is submitted it gets tested with 9 CircleCi jobs. Every new commit to that PR gets retested. These jobs
-   are defined in this :prefix_link:`config file <.circleci/config.yml>`, so that if needed you can reproduce the same
-   environment on your machine.
-
-   These CI jobs don't run ``@slow`` tests.
-
-2. There are 3 jobs run by `github actions <https://github.com/huggingface/transformers/actions>`__:
-
-   * :prefix_link:`torch hub integration <.github/workflows/github-torch-hub.yml>`: checks whether torch hub
-     integration works.
-
-   * :prefix_link:`self-hosted (push) <.github/workflows/self-push.yml>`: runs fast tests on GPU only on commits on
-     ``master``. It only runs if a commit on ``master`` has updated the code in one of the following folders: ``src``,
-     ``tests``, ``.github`` (to prevent running on added model cards, notebooks, etc.)
-
-   * :prefix_link:`self-hosted runner <.github/workflows/self-scheduled.yml>`: runs normal and slow tests on GPU in
-     ``tests`` and ``examples``:
-
-   .. code-block:: bash
-
-    RUN_SLOW=1 pytest tests/
-    RUN_SLOW=1 pytest examples/
-
-   The results can be observed `here <https://github.com/huggingface/transformers/actions>`__.
-
-
-
-Running tests
------------------------------------------------------------------------------------------------------------------------
-
-
-
-
-
-Choosing which tests to run
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-This document goes into many details of how tests can be run. If after reading everything, you need even more details
-you will find them `here <https://docs.pytest.org/en/latest/usage.html>`__.
-
-Here are some most useful ways of running tests.
-
-Run all:
-
-.. code-block:: console
-
-    pytest
-
-or:
-
-.. code-block:: bash
-
-    make test
-
-Note that the latter is defined as:
-
-.. code-block:: bash
-
-    python -m pytest -n auto --dist=loadfile -s -v ./tests/
-
-which tells pytest to:
-
-* run as many test processes as they are CPU cores (which could be too many if you don't have a ton of RAM!)
-* ensure that all tests from the same file will be run by the same test process
-* do not capture output
-* run in verbose mode
-
-
-
-Getting the list of all tests
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-All tests of the test suite:
-
-.. code-block:: bash
-
-    pytest --collect-only -q
-
-All tests of a given test file:
-
-.. code-block:: bash
-
-    pytest tests/test_optimization.py --collect-only -q
-
-
-
-Run a specific test module
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-To run an individual test module:
-
-.. code-block:: bash
-
-    pytest tests/test_logging.py
-
-
-Run specific tests
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Since unittest is used inside most of the tests, to run specific subtests you need to know the name of the unittest
-class containing those tests. For example, it could be:
-
-.. code-block:: bash
-
-    pytest tests/test_optimization.py::OptimizationTest::test_adam_w
-
-Here:
-
-* ``tests/test_optimization.py`` - the file with tests
-* ``OptimizationTest`` - the name of the class
-* ``test_adam_w`` - the name of the specific test function
-
-If the file contains multiple classes, you can choose to run only tests of a given class. For example:
-
-.. code-block:: bash
-
-    pytest tests/test_optimization.py::OptimizationTest
-
-
-will run all the tests inside that class.
-
-As mentioned earlier you can see what tests are contained inside the ``OptimizationTest`` class by running:
-
-.. code-block:: bash
-
-    pytest tests/test_optimization.py::OptimizationTest --collect-only -q
-
-You can run tests by keyword expressions.
-
-To run only tests whose name contains ``adam``:
-
-.. code-block:: bash
-
-    pytest -k adam tests/test_optimization.py
-
-Logical ``and`` and ``or`` can be used to indicate whether all keywords should match or either. ``not`` can be used to
-negate.
-
-To run all tests except those whose name contains ``adam``:
-
-.. code-block:: bash
-
-    pytest -k "not adam" tests/test_optimization.py
-
-And you can combine the two patterns in one:
-
-.. code-block:: bash
-
-    pytest -k "ada and not adam" tests/test_optimization.py
-
-For example to run both ``test_adafactor`` and ``test_adam_w`` you can use:
-
-.. code-block:: bash
-
-    pytest -k "test_adam_w or test_adam_w" tests/test_optimization.py
-
-Note that we use ``or`` here, since we want either of the keywords to match to include both.
-
-If you want to include only tests that include both patterns, ``and`` is to be used:
-
-.. code-block:: bash
-
-    pytest -k "test and ada" tests/test_optimization.py
-
-
-
-Run only modified tests
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-You can run the tests related to the unstaged files or the current branch (according to Git) by using `pytest-picked
-<https://github.com/anapaulagomes/pytest-picked>`__. This is a great way of quickly testing your changes didn't break
-anything, since it won't run the tests related to files you didn't touch.
-
-.. code-block:: bash
-
-    pip install pytest-picked
-
-.. code-block:: bash
-
-    pytest --picked
-
-All tests will be run from files and folders which are modified, but not yet committed.
-
-Automatically rerun failed tests on source modification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-`pytest-xdist <https://github.com/pytest-dev/pytest-xdist>`__ provides a very useful feature of detecting all failed
-tests, and then waiting for you to modify files and continuously re-rerun those failing tests until they pass while you
-fix them. So that you don't need to re start pytest after you made the fix. This is repeated until all tests pass after
-which again a full run is performed.
-
-.. code-block:: bash
-
-    pip install pytest-xdist
-
-To enter the mode: ``pytest -f`` or ``pytest --looponfail``
-
-File changes are detected by looking at ``looponfailroots`` root directories and all of their contents (recursively).
-If the default for this value does not work for you, you can change it in your project by setting a configuration
-option in ``setup.cfg``:
-
-.. code-block:: ini
-
-    [tool:pytest]
-    looponfailroots = transformers tests
-
-or ``pytest.ini``/``tox.ini`` files:
-
-.. code-block:: ini
-
-    [pytest]
-    looponfailroots = transformers tests
-
-This would lead to only looking for file changes in the respective directories, specified relatively to the ini-file’s
-directory.
-
-`pytest-watch <https://github.com/joeyespo/pytest-watch>`__ is an alternative implementation of this functionality.
-
-
-Skip a test module
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-If you want to run all test modules, except a few you can exclude them by giving an explicit list of tests to run. For
-example, to run all except ``test_modeling_*.py`` tests:
-
-.. code-block:: bash
-
-    pytest `ls -1 tests/*py | grep -v test_modeling`
-
-
-Clearing state
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-CI builds and when isolation is important (against speed), cache should be cleared:
-
-.. code-block:: bash
-
-    pytest --cache-clear tests
-
-Running tests in parallel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-As mentioned earlier ``make test`` runs tests in parallel via ``pytest-xdist`` plugin (``-n X`` argument, e.g. ``-n 2``
-to run 2 parallel jobs).
-
-``pytest-xdist``'s ``--dist=`` option allows one to control how the tests are grouped. ``--dist=loadfile`` puts the
-tests located in one file onto the same process.
-
-Since the order of executed tests is different and unpredictable, if running the test suite with ``pytest-xdist``
-produces failures (meaning we have some undetected coupled tests), use `pytest-replay
-<https://github.com/ESSS/pytest-replay>`__ to replay the tests in the same order, which should help with then somehow
-reducing that failing sequence to a minimum.
-
-Test order and repetition
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-It's good to repeat the tests several times, in sequence, randomly, or in sets, to detect any potential
-inter-dependency and state-related bugs (tear down). And the straightforward multiple repetition is just good to detect
-some problems that get uncovered by randomness of DL.
-
-
-Repeat tests
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-* `pytest-flakefinder <https://github.com/dropbox/pytest-flakefinder>`__:
-
-.. code-block:: bash
-
-    pip install pytest-flakefinder
-
-And then run every test multiple times (50 by default):
-
-.. code-block:: bash
-
-    pytest --flake-finder --flake-runs=5 tests/test_failing_test.py
-
-.. note::
-   This plugin doesn't work with ``-n`` flag from ``pytest-xdist``.
-
-.. note::
-   There is another plugin ``pytest-repeat``, but it doesn't work with ``unittest``.
-
-
-Run tests in a random order
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-.. code-block:: bash
-
-    pip install pytest-random-order
-
-Important: the presence of ``pytest-random-order`` will automatically randomize tests, no configuration change or
-command line options is required.
-
-As explained earlier this allows detection of coupled tests - where one test's state affects the state of another. When
-``pytest-random-order`` is installed it will print the random seed it used for that session, e.g:
-
-.. code-block:: bash
-
-    pytest tests
-    [...]
-    Using --random-order-bucket=module
-    Using --random-order-seed=573663
-
-So that if the given particular sequence fails, you can reproduce it by adding that exact seed, e.g.:
-
-.. code-block:: bash
-
-    pytest --random-order-seed=573663
-    [...]
-    Using --random-order-bucket=module
-    Using --random-order-seed=573663
-
-It will only reproduce the exact order if you use the exact same list of tests (or no list at all). Once you start to
-manually narrowing down the list you can no longer rely on the seed, but have to list them manually in the exact order
-they failed and tell pytest to not randomize them instead using ``--random-order-bucket=none``, e.g.:
-
-.. code-block:: bash
-
-    pytest --random-order-bucket=none tests/test_a.py tests/test_c.py tests/test_b.py
-
-To disable the shuffling for all tests:
-
-.. code-block:: bash
-
-    pytest --random-order-bucket=none
-
-By default ``--random-order-bucket=module`` is implied, which will shuffle the files on the module levels. It can also
-shuffle on ``class``, ``package``, ``global`` and ``none`` levels. For the complete details please see its
-`documentation <https://github.com/jbasko/pytest-random-order>`__.
-
-Another randomization alternative is: ``pytest-randomly`` <https://github.com/pytest-dev/pytest-randomly>`__. This
-module has a very similar functionality/interface, but it doesn't have the bucket modes available in
-``pytest-random-order``. It has the same problem of imposing itself once installed.
-
-Look and feel variations
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-pytest-sugar
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-`pytest-sugar <https://github.com/Frozenball/pytest-sugar>`__ is a plugin that improves the look-n-feel, adds a
-progressbar, and show tests that fail and the assert instantly. It gets activated automatically upon installation.
-
-.. code-block:: bash
-
-    pip install pytest-sugar
-
-To run tests without it, run:
-
-.. code-block:: bash
-
-    pytest -p no:sugar
-
-or uninstall it.
-
-
-
-Report each sub-test name and its progress
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-For a single or a group of tests via ``pytest`` (after ``pip install pytest-pspec``):
-
-.. code-block:: bash
-
-    pytest --pspec tests/test_optimization.py
-
-
-
-Instantly shows failed tests
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-`pytest-instafail <https://github.com/pytest-dev/pytest-instafail>`__ shows failures and errors instantly instead of
-waiting until the end of test session.
-
-.. code-block:: bash
-
-    pip install pytest-instafail
-
-.. code-block:: bash
-
-    pytest --instafail
-
-To GPU or not to GPU
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-On a GPU-enabled setup, to test in CPU-only mode add ``CUDA_VISIBLE_DEVICES=""``:
-
-.. code-block:: bash
-
-    CUDA_VISIBLE_DEVICES="" pytest tests/test_logging.py
-
-or if you have multiple gpus, you can specify which one is to be used by ``pytest``. For example, to use only the
-second gpu if you have gpus ``0`` and ``1``, you can run:
-
-.. code-block:: bash
-
-    CUDA_VISIBLE_DEVICES="1" pytest tests/test_logging.py
-
-This is handy when you want to run different tasks on different GPUs.
-
-Some tests must be run on CPU-only, others on either CPU or GPU or TPU, yet others on multiple-GPUs. The following skip
-decorators are used to set the requirements of tests CPU/GPU/TPU-wise:
-
-* ``require_torch`` - this test will run only under torch
-* ``require_torch_gpu`` - as ``require_torch`` plus requires at least 1 GPU
-* ``require_torch_multi_gpu`` - as ``require_torch`` plus requires at least 2 GPUs
-* ``require_torch_non_multi_gpu`` - as ``require_torch`` plus requires 0 or 1 GPUs
-* ``require_torch_up_to_2_gpus`` - as ``require_torch`` plus requires 0 or 1 or 2 GPUs
-* ``require_torch_tpu`` - as ``require_torch`` plus requires at least 1 TPU
-
-Let's depict the GPU requirements in the following table:
-
-
-+----------+----------------------------------+
-| n gpus   |  decorator                       |
-+==========+==================================+
-| ``>= 0`` | ``@require_torch``               |
-+----------+----------------------------------+
-| ``>= 1`` | ``@require_torch_gpu``           |
-+----------+----------------------------------+
-| ``>= 2`` | ``@require_torch_multi_gpu``     |
-+----------+----------------------------------+
-| ``< 2``  | ``@require_torch_non_multi_gpu`` |
-+----------+----------------------------------+
-| ``< 3``  | ``@require_torch_up_to_2_gpus``  |
-+----------+----------------------------------+
-
-
-For example, here is a test that must be run only when there are 2 or more GPUs available and pytorch is installed:
-
-.. code-block:: python
-
-    @require_torch_multi_gpu
-    def test_example_with_multi_gpu():
-
-If a test requires ``tensorflow`` use the ``require_tf`` decorator. For example:
-
-.. code-block:: python
-
-    @require_tf
-    def test_tf_thing_with_tensorflow():
-
-These decorators can be stacked. For example, if a test is slow and requires at least one GPU under pytorch, here is
-how to set it up:
-
-.. code-block:: python
-
-    @require_torch_gpu
-    @slow
-    def test_example_slow_on_gpu():
-
-Some decorators like ``@parameterized`` rewrite test names, therefore ``@require_*`` skip decorators have to be listed
-last for them to work correctly. Here is an example of the correct usage:
-
-.. code-block:: python
-
-    @parameterized.expand(...)
-    @require_torch_multi_gpu
-    def test_integration_foo():
-
-This order problem doesn't exist with ``@pytest.mark.parametrize``: you can put it first or last and it will still
-work. However, it only works with non-unittests.
-
-Inside tests:
-
-* How many GPUs are available:
-
-.. code-block:: python
-
-    from transformers.testing_utils import get_gpu_count
-    n_gpu = get_gpu_count() # works with torch and tf
-
-
-
-Distributed training
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-``pytest`` can't deal with distributed training directly. If this is attempted, the sub-processes don't do the right
-thing and end up thinking they are ``pytest`` and start running the test suite in loops. It works, however, if one
-spawns a normal process that then spawns off multiple workers and manages the IO pipes.
-
-Here are some tests that use it:
-
-* :prefix_link:`test_trainer_distributed.py <tests/test_trainer_distributed.py>`
-* :prefix_link:`test_deepspeed.py <tests/deepspeed/test_deepspeed.py>`
-
-To jump right into the execution point, search for the ``execute_subprocess_async`` call in those tests.
-
-You will need at least 2 GPUs to see these tests in action:
-
-.. code-block:: bash
-
-    CUDA_VISIBLE_DEVICES=0,1 RUN_SLOW=1 pytest -sv tests/test_trainer_distributed.py
-
-
-Output capture
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-During test execution any output sent to ``stdout`` and ``stderr`` is captured. If a test or a setup method fails, its
-corresponding captured output will usually be shown along with the failure traceback.
-
-To disable output capturing and to get the ``stdout`` and ``stderr`` normally, use ``-s`` or ``--capture=no``:
-
-.. code-block:: bash
-
-    pytest -s tests/test_logging.py
-
-To send test results to JUnit format output:
-
-.. code-block:: bash
-
-    pytest tests --junitxml=result.xml
-
-
-Color control
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-To have no color (e.g., yellow on white background is not readable):
-
-.. code-block:: bash
-
-    pytest --color=no tests/test_logging.py
-
-
-
-Sending test report to online pastebin service
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Creating a URL for each test failure:
-
-.. code-block:: bash
-
-    pytest --pastebin=failed tests/test_logging.py
-
-This will submit test run information to a remote Paste service and provide a URL for each failure. You may select
-tests as usual or add, for example, ``-x`` if you only want to send one particular failure.
-
-Creating a URL for a whole test session log:
-
-.. code-block:: bash
-
-    pytest --pastebin=all tests/test_logging.py
-
-
-
-Writing tests
------------------------------------------------------------------------------------------------------------------------
-
-🤗 transformers tests are based on ``unittest``, but run by ``pytest``, so most of the time features from both systems
-can be used.
-
-You can read `here <https://docs.pytest.org/en/stable/unittest.html>`__ which features are supported, but the important
-thing to remember is that most ``pytest`` fixtures don't work. Neither does parametrization, but we use the module
-``parameterized``, which works in a similar way.
-
-
-Parametrization
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Often, there is a need to run the same test multiple times, but with different arguments. It could be done from within
-the test, but then there is no way of running that test for just one set of arguments.
-
-.. code-block:: python
-
-    # test_this1.py
-    import math
-    import unittest
-    from parameterized import parameterized
-    class TestMathUnitTest(unittest.TestCase):
-        @parameterized.expand([
-            ("negative", -1.5, -2.0),
-            ("integer", 1, 1.0),
-            ("large fraction", 1.6, 1),
-        ])
-        def test_floor(self, name, input, expected):
-            self.assertEqual(math.floor(input), expected)
-
-Now, by default this test will be run 3 times, each time with the last 3 arguments of ``test_floor`` being assigned the
-corresponding arguments in the parameter list.
-
-You can run just the ``negative`` and ``integer`` sets of params with:
-
-.. code-block:: bash
-
-    pytest -k "negative or integer" tests/test_mytest.py
-
-or all but ``negative`` sub-tests, with:
-
-.. code-block:: bash
-
-    pytest -k "not negative" tests/test_mytest.py
-
-Besides using the ``-k`` filter that was just mentioned, you can find out the exact name of each sub-test and run any
-or all of them using their exact names.
-
-.. code-block:: bash
-
-    pytest test_this1.py --collect-only -q
-
-and it will list:
-
-.. code-block:: bash
-
-    test_this1.py::TestMathUnitTest::test_floor_0_negative
-    test_this1.py::TestMathUnitTest::test_floor_1_integer
-    test_this1.py::TestMathUnitTest::test_floor_2_large_fraction
-
-So now you can run just 2 specific sub-tests:
-
-.. code-block:: bash
-
-    pytest test_this1.py::TestMathUnitTest::test_floor_0_negative  test_this1.py::TestMathUnitTest::test_floor_1_integer
-
-The module `parameterized <https://pypi.org/project/parameterized/>`__ which is already in the developer dependencies
-of ``transformers`` works for both: ``unittests`` and ``pytest`` tests.
-
-If, however, the test is not a ``unittest``, you may use ``pytest.mark.parametrize`` (or you may see it being used in
-some existing tests, mostly under ``examples``).
-
-Here is the same example, this time using ``pytest``'s ``parametrize`` marker:
-
-.. code-block:: python
-
-    # test_this2.py
-    import math
-    import pytest
-    @pytest.mark.parametrize(
-        "name, input, expected",
-        [
-            ("negative", -1.5, -2.0),
-            ("integer", 1, 1.0),
-            ("large fraction", 1.6, 1),
-        ],
-    )
-    def test_floor(name, input, expected):
-        assert math.floor(input) == expected
-
-Same as with ``parameterized``, with ``pytest.mark.parametrize`` you can have fine control over which sub-tests are
-run, if the ``-k`` filter doesn't do the job. Except, this parametrization function creates a slightly different set of
-names for the sub-tests. Here is what they look like:
-
-.. code-block:: bash
-
-    pytest test_this2.py --collect-only -q
-
-and it will list:
-
-.. code-block:: bash
-
-    test_this2.py::test_floor[integer-1-1.0]
-    test_this2.py::test_floor[negative--1.5--2.0]
-    test_this2.py::test_floor[large fraction-1.6-1]
-
-So now you can run just the specific test:
-
-.. code-block:: bash
-
-    pytest test_this2.py::test_floor[negative--1.5--2.0] test_this2.py::test_floor[integer-1-1.0]
-
-as in the previous example.
-
-
-
-Files and directories
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-In tests we often need to know where things are relative to the current test file, and it's not trivial since the test
-could be invoked from more than one directory or could reside in sub-directories with different depths. A helper class
-:obj:`transformers.testing_utils.TestCasePlus` solves this problem by sorting out all the basic paths and providing easy
-accessors to them:
-
-* ``pathlib`` objects (all fully resolved):
-
-   - ``test_file_path`` - the current test file path, i.e. ``__file__``
-   - ``test_file_dir`` - the directory containing the current test file
-   - ``tests_dir`` - the directory of the ``tests`` test suite
-   - ``examples_dir`` - the directory of the ``examples`` test suite
-   - ``repo_root_dir`` - the directory of the repository
-   - ``src_dir`` - the directory of ``src`` (i.e. where the ``transformers`` sub-dir resides)
-
-* stringified paths---same as above but these return paths as strings, rather than ``pathlib`` objects:
-
-   - ``test_file_path_str``
-   - ``test_file_dir_str``
-   - ``tests_dir_str``
-   - ``examples_dir_str``
-   - ``repo_root_dir_str``
-   - ``src_dir_str``
-
-To start using those, all you need to do is make sure that the test resides in a subclass of
-:obj:`transformers.testing_utils.TestCasePlus`. For example:
-
-.. code-block:: python
-
-    from transformers.testing_utils import TestCasePlus
-    class PathExampleTest(TestCasePlus):
-        def test_something_involving_local_locations(self):
-            data_dir = self.tests_dir / "fixtures/tests_samples/wmt_en_ro"
-
-If you don't need to manipulate paths via ``pathlib`` or you just need a path as a string, you can always invoke
-``str()`` on the ``pathlib`` object or use the accessors ending with ``_str``. For example:
-
-.. code-block:: python
-
-    from transformers.testing_utils import TestCasePlus
-    class PathExampleTest(TestCasePlus):
-        def test_something_involving_stringified_locations(self):
-            examples_dir = self.examples_dir_str
-
-
-
-
-Temporary files and directories
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Using unique temporary files and directories is essential for parallel test running, so that the tests won't overwrite
-each other's data. Also we want to get the temporary files and directories removed at the end of each test that created
-them. Therefore, using packages like ``tempfile``, which address these needs, is essential.
-
-However, when debugging tests, you need to be able to see what goes into the temporary file or directory and you want
-to know its exact path and not have it randomized on every test re-run.
-
-A helper class :obj:`transformers.testing_utils.TestCasePlus` is best used for such purposes. It's a sub-class of
-:obj:`unittest.TestCase`, so we can easily inherit from it in the test modules.
-
-Here is an example of its usage:
-
-.. code-block:: python
-
-    from transformers.testing_utils import TestCasePlus
-    class ExamplesTests(TestCasePlus):
-        def test_whatever(self):
-            tmp_dir = self.get_auto_remove_tmp_dir()
-
-This code creates a unique temporary directory, and sets :obj:`tmp_dir` to its location.
-
-* Create a unique temporary dir:
-
-.. code-block:: python
-
-    def test_whatever(self):
-        tmp_dir = self.get_auto_remove_tmp_dir()
-
-``tmp_dir`` will contain the path to the created temporary dir. It will be automatically removed at the end of the
-test.
-
-* Create a temporary dir of your choice, ensure it's empty before the test starts and don't empty it after the test.
-
-.. code-block:: python
-
-    def test_whatever(self):
-        tmp_dir = self.get_auto_remove_tmp_dir("./xxx")
-
-This is useful for debugging when you want to monitor a specific directory and want to make sure the previous tests didn't
-leave any data in there.
-
-* You can override the default behavior by directly overriding the ``before`` and ``after`` args, leading to one of the
-  following behaviors:
-
-    - ``before=True``: the temporary dir will always be cleared at the beginning of the test.
-    - ``before=False``: if the temporary dir already existed, any existing files will remain there.
-    - ``after=True``: the temporary dir will always be deleted at the end of the test.
-    - ``after=False``: the temporary dir will always be left intact at the end of the test.
-
-.. note::
-   In order to run the equivalent of ``rm -r`` safely, only subdirs of the project repository checkout are allowed if
-   an explicit :obj:`tmp_dir` is used, so that by mistake no ``/tmp`` or similar important part of the filesystem will
-   get nuked, i.e. please always pass paths that start with ``./``.
-
-.. note::
-   Each test can register multiple temporary directories and they all will get auto-removed, unless requested
-   otherwise.
-
-
-Temporary sys.path override
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-If you need to temporarily override ``sys.path`` to import from another test, for example, you can use the
-``ExtendSysPath`` context manager. Example:
-
-
-.. code-block:: python
-
-    import os
-    from transformers.testing_utils import ExtendSysPath
-    bindir = os.path.abspath(os.path.dirname(__file__))
-    with ExtendSysPath(f"{bindir}/.."):
-        from test_trainer import TrainerIntegrationCommon  # noqa
-
-
-
-Skipping tests
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-This is useful when a bug is found and a new test is written, but the bug is not fixed yet. In order to be able to
-commit it to the main repository we need to make sure it's skipped during ``make test``.
-
-Methods:
-
--  A **skip** means that you expect your test to pass only if some conditions are met, otherwise pytest should skip
-   running the test altogether. Common examples are skipping windows-only tests on non-windows platforms, or skipping
-   tests that depend on an external resource which is not available at the moment (for example a database).
-
--  An **xfail** means that you expect a test to fail for some reason. A common example is a test for a feature not yet
-   implemented, or a bug not yet fixed. When a test passes despite being expected to fail (marked with
-   ``pytest.mark.xfail``), it’s an xpass and will be reported in the test summary.
-
-One of the important differences between the two is that ``skip`` doesn't run the test, and ``xfail`` does. So if the
-code that's buggy causes some bad state that will affect other tests, do not use ``xfail``.
-
-Implementation
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-- Here is how to skip a whole test unconditionally:
-
-.. code-block:: python
-
-    @unittest.skip("this bug needs to be fixed")
-    def test_feature_x():
-
-or via pytest:
-
-.. code-block:: python
-
-    @pytest.mark.skip(reason="this bug needs to be fixed")
-
-or the ``xfail`` way:
-
-.. code-block:: python
-
-    @pytest.mark.xfail
-    def test_feature_x():
-
-- Here is how to skip a test based on some internal check inside the test:
-
-.. code-block:: python
-
-    def test_feature_x():
-        if not has_something():
-            pytest.skip("unsupported configuration")
-
-or the whole module:
-
-.. code-block:: python
-
-    import pytest
-    if not pytest.config.getoption("--custom-flag"):
-        pytest.skip("--custom-flag is missing, skipping tests", allow_module_level=True)
-
-or the ``xfail`` way:
-
-.. code-block:: python
-
-    def test_feature_x():
-        pytest.xfail("expected to fail until bug XYZ is fixed")
-
-- Here is how to skip all tests in a module if some import is missing:
-
-.. code-block:: python
-
-    docutils = pytest.importorskip("docutils", minversion="0.3")
-
--  Skip a test based on a condition:
-
-.. code-block:: python
-
-    @pytest.mark.skipif(sys.version_info < (3,6), reason="requires python3.6 or higher")
-    def test_feature_x():
-
-or:
-
-.. code-block:: python
-
-    @unittest.skipIf(torch_device == "cpu", "Can't do half precision")
-    def test_feature_x():
-
-or skip the whole class:
-
-.. code-block:: python
-
-    @pytest.mark.skipif(sys.platform == 'win32', reason="does not run on windows")
-    class TestClass():
-        def test_feature_x(self):
-
-More details, examples, and alternatives are available `here <https://docs.pytest.org/en/latest/skipping.html>`__.
-
-Slow tests
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The library of tests is ever-growing, and some of the tests take minutes to run, so we can't afford to wait an hour
-for the test suite to complete on CI. Therefore, with some exceptions for essential tests, slow tests should be
-marked as in the example below:
-
-.. code-block:: python
-
-    from transformers.testing_utils import slow
-    @slow
-    def test_integration_foo():
-
-Once a test is marked as ``@slow``, to run such tests set ``RUN_SLOW=1`` env var, e.g.:
-
-.. code-block:: bash
-
-    RUN_SLOW=1 pytest tests
-
-Some decorators like ``@parameterized`` rewrite test names, therefore ``@slow`` and the rest of the skip decorators
-``@require_*`` have to be listed last for them to work correctly. Here is an example of the correct usage:
-
-.. code-block:: python
-
-    @parameterized.expand(...)
-    @slow
-    def test_integration_foo():
-
-As explained at the beginning of this document, slow tests get to run on a scheduled basis, rather than in PR CI
-checks. So it's possible that some problems will be missed during a PR submission and get merged. Such problems will
-get caught during the next scheduled CI job. But it also means that it's important to run the slow tests on your
-machine before submitting the PR.
-
-Here is a rough decision making mechanism for choosing which tests should be marked as slow:
-
-If the test is focused on one of the library's internal components (e.g., modeling files, tokenization files,
-pipelines), then we should run that test in the non-slow test suite. If it's focused on another aspect of the library,
-such as the documentation or the examples, then we should run these tests in the slow test suite. And then, to refine
-this approach we should have exceptions:
-
-* All tests that need to download a heavy set of weights or a dataset that is larger than ~50MB (e.g., model or
-  tokenizer integration tests, pipeline integration tests) should be set to slow. If you're adding a new model, you
-  should create and upload to the hub a tiny version of it (with random weights) for integration tests. This is
-  discussed in the following paragraphs.
-* All tests that need to do a training not specifically optimized to be fast should be set to slow.
-* We can introduce exceptions if some of these should-be-non-slow tests are excruciatingly slow, and set them to
-  ``@slow``. Auto-modeling tests, which save and load large files to disk, are a good example of tests that are marked
-  as ``@slow``.
-* If a test completes under 1 second on CI (including downloads if any) then it should be a normal test regardless.
-
-Collectively, all the non-slow tests need to entirely cover the different internals, while remaining fast. For example,
-significant coverage can be achieved by testing with specially created tiny models with random weights. Such models
-have a minimal number of layers (e.g., 2), vocab size (e.g., 1000), etc. Then the ``@slow`` tests can use large
-slow models to do qualitative testing. To see the use of these simply look for *tiny* models with:
-
-.. code-block:: bash
-
-    grep tiny tests examples
-
-Here is an example of a :prefix_link:`script <scripts/fsmt/fsmt-make-tiny-model.py>` that created the tiny model
-`stas/tiny-wmt19-en-de <https://huggingface.co/stas/tiny-wmt19-en-de>`__. You can easily adjust it to your specific
-model's architecture.
-
-It's easy to measure the run-time incorrectly if, for example, there is an overhead of downloading a huge model, but if
-you test it locally the downloaded files would be cached and thus the download time not measured. Hence check the
-execution speed report in CI logs instead (the output of ``pytest --durations=0 tests``).
-
-That report is also useful to find slow outliers that aren't marked as such, or which need to be re-written to be fast.
-If you notice that the test suite starts getting slow on CI, the top listing of this report will show the slowest
-tests.
-
-
-Testing the stdout/stderr output
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-In order to test functions that write to ``stdout`` and/or ``stderr``, the test can access those streams using the
-``pytest`` `capsys system <https://docs.pytest.org/en/latest/capture.html>`__. Here is how this is accomplished:
-
-.. code-block:: python
-
-    import sys
-    def print_to_stdout(s): print(s)
-    def print_to_stderr(s): sys.stderr.write(s)
-    def test_result_and_stdout(capsys):
-        msg = "Hello"
-        print_to_stdout(msg)
-        print_to_stderr(msg)
-        out, err = capsys.readouterr() # consume the captured output streams
-        # optional: if you want to replay the consumed streams:
-        sys.stdout.write(out)
-        sys.stderr.write(err)
-        # test:
-        assert msg in out
-        assert msg in err
-
-And, of course, most of the time, ``stderr`` will come as a part of an exception, so try/except has to be used in such
-a case:
-
-.. code-block:: python
-
-    def raise_exception(msg): raise ValueError(msg)
-    def test_something_exception():
-        msg = "Not a good value"
-        error = ''
-        try:
-            raise_exception(msg)
-        except Exception as e:
-            error = str(e)
-        assert msg in error, f"{msg} is not in the exception:\n{error}"
-
-Another approach to capturing stdout is via ``contextlib.redirect_stdout``:
-
-.. code-block:: python
-
-    import sys
-    from io import StringIO
-    from contextlib import redirect_stdout
-    def print_to_stdout(s): print(s)
-    def test_result_and_stdout():
-        msg = "Hello"
-        buffer = StringIO()
-        with redirect_stdout(buffer):
-            print_to_stdout(msg)
-        out = buffer.getvalue()
-        # optional: if you want to replay the consumed streams:
-        sys.stdout.write(out)
-        # test:
-        assert msg in out
-
-An important potential issue with capturing stdout is that it may contain ``\r`` characters that in normal ``print``
-reset everything that has been printed so far. There is no problem with ``pytest``, but with ``pytest -s`` these
-characters get included in the buffer, so to be able to have the test run with and without ``-s``, you have to make an
-extra cleanup to the captured output, using ``re.sub(r'^.*\r', '', buf, 0, re.M)``.
-
-But we have a helper context manager wrapper to automatically take care of it all, regardless of whether it has
-some ``\r``'s in it or not, so it's a simple:
-
-.. code-block:: python
-
-    from transformers.testing_utils import CaptureStdout
-    with CaptureStdout() as cs:
-        function_that_writes_to_stdout()
-    print(cs.out)
-
-Here is a full test example:
-
-.. code-block:: python
-
-    from transformers.testing_utils import CaptureStdout
-    msg = "Secret message\r"
-    final = "Hello World"
-    with CaptureStdout() as cs:
-        print(msg + final)
-    assert cs.out == final+"\n", f"captured: {cs.out}, expecting {final}"
-
-If you'd like to capture ``stderr`` use the :obj:`CaptureStderr` class instead:
-
-.. code-block:: python
-
-    from transformers.testing_utils import CaptureStderr
-    with CaptureStderr() as cs:
-        function_that_writes_to_stderr()
-    print(cs.err)
-
-If you need to capture both streams at once, use the parent :obj:`CaptureStd` class:
-
-.. code-block:: python
-
-    from transformers.testing_utils import CaptureStd
-    with CaptureStd() as cs:
-        function_that_writes_to_stdout_and_stderr()
-    print(cs.err, cs.out)
-
-Also, to aid debugging test issues, by default these context managers automatically replay the captured streams on exit
-from the context.
-
-
-Capturing logger stream
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-If you need to validate the output of a logger, you can use :obj:`CaptureLogger`:
-
-.. code-block:: python
-
-    from transformers import logging
-    from transformers.testing_utils import CaptureLogger
-
-    msg = "Testing 1, 2, 3"
-    logging.set_verbosity_info()
-    logger = logging.get_logger("transformers.models.bart.tokenization_bart")
-    with CaptureLogger(logger) as cl:
-        logger.info(msg)
-    assert cl.out == msg + "\n"
-
-
-Testing with environment variables
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-If you want to test the impact of environment variables for a specific test you can use a helper decorator
-``transformers.testing_utils.mockenv``:
-
-.. code-block:: python
-
-    import os
-    import unittest
-    from transformers.testing_utils import mockenv
-    class HfArgumentParserTest(unittest.TestCase):
-        @mockenv(TRANSFORMERS_VERBOSITY="error")
-        def test_env_override(self):
-            env_level_str = os.getenv("TRANSFORMERS_VERBOSITY", None)
-
-At times an external program needs to be called, which requires setting ``PYTHONPATH`` in ``os.environ`` to include
-multiple local paths. The helper class :obj:`transformers.testing_utils.TestCasePlus` helps with this:
-
-.. code-block:: python
-
-    from transformers.testing_utils import TestCasePlus
-    class EnvExampleTest(TestCasePlus):
-        def test_external_prog(self):
-            env = self.get_env()
-            # now call the external program, passing ``env`` to it
-
-Depending on whether the test file was under the ``tests`` test suite or ``examples`` it'll correctly set up
-``env[PYTHONPATH]`` to include one of these two directories, and also the ``src`` directory to ensure the testing is
-done against the current repo, and finally with whatever ``env[PYTHONPATH]`` was already set to before the test was
-called if anything.
-
-This helper method creates a copy of the ``os.environ`` object, so the original remains intact.
-
-
-Getting reproducible results
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-In some situations you may want to remove randomness from your tests. To get identical, reproducible results, you
-will need to fix the seed:
-
-.. code-block:: python
-
-    seed = 42
-
-    # python RNG
-    import random
-    random.seed(seed)
-
-    # pytorch RNGs
-    import torch
-    torch.manual_seed(seed)
-    torch.backends.cudnn.deterministic = True
-    if torch.cuda.is_available(): torch.cuda.manual_seed_all(seed)
-
-    # numpy RNG
-    import numpy as np
-    np.random.seed(seed)
-
-    # tf RNG
-    import tensorflow as tf
-    tf.random.set_seed(seed)
-
-Debugging tests
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-To start a debugger at the point of the warning, do this:
-
-.. code-block:: bash
-
-    pytest tests/test_logging.py -W error::UserWarning --pdb
-
-
-Working with GitHub Actions workflows
------------------------------------------------------------------------------------------------------------------------
-
-To trigger a self-push workflow CI job, you must:
-
-1. Create a new branch on ``transformers`` origin (not a fork!).
-2. The branch name has to start with either ``ci_`` or ``ci-`` (``master`` triggers it too, but we can't do PRs on
-   ``master``). It also gets triggered only for specific paths - you can find the up-to-date definition in case it
-   changed since this document was written `here
-   <https://github.com/huggingface/transformers/blob/master/.github/workflows/self-push.yml>`__ under ``push:``.
-3. Create a PR from this branch.
-4. Then you can see the job appear `here
-   <https://github.com/huggingface/transformers/actions/workflows/self-push.yml>`__. It may not run right away if there
-   is a backlog.
-
-
-
-
-Testing Experimental CI Features
------------------------------------------------------------------------------------------------------------------------
-
-Testing CI features can be potentially problematic as it can interfere with the normal CI functioning. Therefore if a
-new CI feature is to be added, it should be done as follows:
-
-1. Create a new dedicated job that tests what needs to be tested.
-2. The new job must always succeed so that it gives us a green ✓ (details below).
-3. Let it run for some days to see that a variety of different PR types get to run on it (user fork branches,
-   non-forked branches, branches originating from github.com UI direct file edit, various forced pushes, etc. - there
-   are so many) while monitoring the experimental job's logs (not the overall job green as it's purposefully always
-   green)
-4. When it's clear that everything is solid, then merge the new changes into existing jobs.
-
-That way experiments on CI functionality itself won't interfere with the normal workflow.
-
-Now how can we make the job always succeed while the new CI feature is being developed?
-
-Some CIs, like TravisCI, support ignore-step-failure and will report the overall job as successful, but CircleCI and
-GitHub Actions as of this writing don't support that.
-
-So the following workaround can be used:
-
-1. Add ``set +euo pipefail`` at the beginning of the run command to suppress most potential failures in the bash script.
-2. The last command must be a success: ``echo "done"`` or just ``true`` will do.
-
-Here is an example:
-
-.. code-block:: yaml
-
-    - run:
-        name: run CI experiment
-        command: |
-            set +euo pipefail
-            echo "setting run-all-despite-any-errors-mode"
-            this_command_will_fail
-            echo "but bash continues to run"
-            # emulate another failure
-            false
-            # but the last command must be a success
-            echo "during experiment do not remove: reporting success to CI, even if there were failures"
-
-For simple commands you could also do:
-
-.. code-block:: bash
-
-    cmd_that_may_fail || true
-
-Of course, once satisfied with the results, integrate the experimental step or job with the rest of the normal jobs,
-while removing ``set +euo pipefail`` or any other things you may have added to ensure that the experimental job doesn't
-interfere with the normal CI functioning.
-
-This whole process would have been much easier if only we could set something like ``allow-failure`` for the
-experimental step, and let it fail without impacting the overall status of PRs. But as mentioned earlier CircleCI and
-GitHub Actions don't support it at the moment.
-
-You can vote for this feature and see where it stands in these CI-specific threads:
-
-* `Github Actions: <https://github.com/actions/toolkit/issues/399>`__
-* `CircleCI: <https://ideas.circleci.com/ideas/CCI-I-344>`__