Reflect RoCm support in the documentation (#27636)

* reflect RoCm support in the documentation

* Update docs/source/en/main_classes/trainer.md

Co-authored-by: Lysandre Debut <hi@lysand.re>

* fix review comments

* use ROCm instead of RoCm

---------

Co-authored-by: Lysandre Debut <hi@lysand.re>
This commit is contained in:
fxmarty
2023-11-24 16:59:17 +01:00
committed by GitHub
parent a6d178e238
commit c13a43aaf2
5 changed files with 17 additions and 11 deletions

View File

@@ -237,7 +237,7 @@ You can speedup the training throughput by using Flash Attention 2 integration i
The most common optimizer used to train transformer models is Adam or AdamW (Adam with weight decay). Adam achieves
good convergence by storing the rolling average of the previous gradients; however, it adds an additional memory
footprint of the order of the number of model parameters. To remedy this, you can use an alternative optimizer.
For example if you have [NVIDIA/apex](https://github.com/NVIDIA/apex) installed, `adamw_apex_fused` will give you the
For example if you have [NVIDIA/apex](https://github.com/NVIDIA/apex) installed for NVIDIA GPUs, or [ROCmSoftwarePlatform/apex](https://github.com/ROCmSoftwarePlatform/apex) for AMD GPUs, `adamw_apex_fused` will give you the
fastest training experience among all supported AdamW optimizers.
[`Trainer`] integrates a variety of optimizers that can be used out of box: `adamw_hf`, `adamw_torch`, `adamw_torch_fused`,
@@ -529,4 +529,4 @@ By default, in training mode, the BetterTransformer integration **drops the mask
</Tip>
Check out this [blogpost](https://pytorch.org/blog/out-of-the-box-acceleration/) to learn more about acceleration and memory-savings with SDPA.
Check out this [blogpost](https://pytorch.org/blog/out-of-the-box-acceleration/) to learn more about acceleration and memory-savings with SDPA.