Improving Training Performance and Scalability Documentation (#28497)
* Improving Training Performance and Scaling documentation by adding PEFT techniques to suggestions to reduce memory requirements for training

* Update docs/source/en/perf_train_gpu_one.md

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
@@ -51,7 +51,8 @@ The methods and tools covered in this guide can be classified based on the effec
| [Data preloading](#data-preloading) | Yes | No |
| [DeepSpeed Zero](#deepspeed-zero) | No | Yes |
| [torch.compile](#using-torchcompile) | Yes | No |
| [Parameter-Efficient Fine Tuning (PEFT)](#peft) | No | Yes |

<Tip>

Note: when using mixed precision with a small model and a large batch size, there will be some memory savings but with a
@@ -400,6 +401,25 @@ Choose which backend to use by specifying it via `torch_compile_backend` in the
For an example of using `torch.compile` with 🤗 Transformers, check out this [blog post on fine-tuning a BERT model for Text Classification using the newest PyTorch 2.0 features](https://www.philschmid.de/getting-started-pytorch-2-0-transformers)
## Using 🤗 PEFT
[Parameter-Efficient Fine Tuning (PEFT)](https://huggingface.co/blog/peft) methods freeze the pretrained model parameters during fine-tuning and add a small number of trainable parameters (the adapters) on top of them.

As a result, the [memory associated with the optimizer states and gradients](https://huggingface.co/docs/transformers/model_memory_anatomy#anatomy-of-models-memory) is greatly reduced.
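
To give a rough sense of how few trainable parameters an adapter can add, here is a minimal sketch for the LoRA case. It assumes a single linear layer whose frozen weight has shape `d_out x d_in`, with a rank-`rank` adapter made of two low-rank factors. The helper name `lora_param_count` and the layer dimensions are hypothetical and chosen only for illustration; they are not taken from this guide or from the PEFT library.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one LoRA adapter on a d_out x d_in weight."""
    # LoRA learns two low-rank factors: A (rank x d_in) and B (d_out x rank).
    return rank * (d_in + d_out)


# Hypothetical layer size and rank, for illustration only.
d_in, d_out, rank = 4096, 4096, 8

frozen = d_in * d_out                            # pretrained weight, kept frozen
trainable = lora_param_count(d_in, d_out, rank)  # adapter weights, trained

print(f"frozen weight parameters:     {frozen:,}")
print(f"trainable adapter parameters: {trainable:,}")
print(f"trainable fraction:           {trainable / frozen:.2%}")
```

Only the adapter parameters need gradients and optimizer states, which is where the memory savings come from.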

For example, with vanilla AdamW, the memory requirement for the optimizer state would be:
* fp32 copy of parameters: 4 bytes/param
* Momentum: 4 bytes/param
* Variance: 4 bytes/param

Consider a model with 7B parameters, plus 200 million additional parameters injected with [Low Rank Adapters](https://huggingface.co/docs/peft/conceptual_guides/lora).

At 12 bytes per trainable parameter (4 + 4 + 4), the optimizer state of the plain model would require 12 * 7 = 84 GB (assuming all 7B parameters are trainable).

Adding LoRA slightly increases the memory associated with the model weights and substantially decreases the memory requirement for the optimizer state, to 12 * 0.2 = 2.4 GB.
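
The arithmetic above can be reproduced with a short sketch. It assumes the AdamW breakdown listed earlier (4 bytes each for the fp32 copy, momentum, and variance) and takes 1 GB to be 10^9 bytes, which matches the figures in the text. The function name `optimizer_state_gb` is hypothetical.

```python
# Bytes of optimizer state per trainable parameter with vanilla AdamW:
# fp32 copy of the parameter + momentum + variance.
BYTES_PER_PARAM_ADAMW = 4 + 4 + 4


def optimizer_state_gb(trainable_params: float) -> float:
    """Optimizer-state memory in GB (1 GB = 1e9 bytes) for a given parameter count."""
    return trainable_params * BYTES_PER_PARAM_ADAMW / 1e9


full_finetune_gb = optimizer_state_gb(7e9)  # all 7B parameters trainable
lora_gb = optimizer_state_gb(200e6)         # only the 200M adapter parameters trainable

print(f"full fine-tuning: {full_finetune_gb:.1f} GB")
print(f"LoRA adapters:    {lora_gb:.1f} GB")
```

This estimate covers only the optimizer state; model weights, gradients, and activations require additional memory.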

Read more about PEFT and its detailed usage in [the PEFT documentation](https://huggingface.co/docs/peft/) or the [PEFT repository](https://github.com/huggingface/peft).
## Using 🤗 Accelerate
With [🤗 Accelerate](https://huggingface.co/docs/accelerate/index) you can use the above methods while gaining full