[Docs / BetterTransformer ] Added more details about flash attention + SDPA (#25265)
* added more details about flash attention * correct and add more details * Apply suggestions from code review Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * few modifs * more details * up * Apply suggestions from code review Co-authored-by: fxmarty <9808326+fxmarty@users.noreply.github.com> * adapt from suggestion * Apply suggestions from code review Co-authored-by: fxmarty <9808326+fxmarty@users.noreply.github.com> * trigger CI * Apply suggestions from code review Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * fix nits and copies * add new section --------- Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> Co-authored-by: fxmarty <9808326+fxmarty@users.noreply.github.com>
This commit is contained in:
@@ -17,7 +17,19 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
In addition to this guide, relevant information can be found as well in [the guide for training on a single GPU](perf_train_gpu_one) and [the guide for inference on CPUs](perf_infer_cpu).
|
||||
|
||||
## Better Transformer: PyTorch-native transformer fastpath
|
||||
## BetterTransformer
|
||||
|
||||
[BetterTransformer](https://huggingface.co/docs/optimum/bettertransformer/overview) converts 🤗 Transformers models to use the PyTorch-native fastpath execution, which calls optimized kernels like Flash Attention under the hood.
|
||||
|
||||
BetterTransformer is also supported for faster inference on single and multi-GPU for text, image, and audio models.
|
||||
|
||||
<Tip>
|
||||
|
||||
Flash Attention can only be used for models using fp16 or bf16 dtype. Make sure to cast your model to the appropriate dtype before using BetterTransformer.
|
||||
|
||||
</Tip>
|
||||
|
||||
### Encoder models
|
||||
|
||||
PyTorch-native [`nn.MultiHeadAttention`](https://pytorch.org/blog/a-better-transformer-for-fast-transformer-encoder-inference/) attention fastpath, called BetterTransformer, can be used with Transformers through the integration in the [🤗 Optimum library](https://huggingface.co/docs/optimum/bettertransformer/overview).
|
||||
|
||||
@@ -36,7 +48,59 @@ model = model.reverse_bettertransformer()
|
||||
model.save_pretrained("saved_model")
|
||||
```
|
||||
|
||||
As of PyTorch 2.0, the attention fastpath is supported for both encoders and decoders. The list of supported architectures can be found [here](https://huggingface.co/docs/optimum/bettertransformer/overview#supported-models).
|
||||
Have a look at this [blog post](https://medium.com/pytorch/bettertransformer-out-of-the-box-performance-for-huggingface-transformers-3fbe27d50ab2) to learn more about what is possible to do with `BetterTransformer` API for encoder models.
|
||||
|
||||
### Decoder models
|
||||
|
||||
For text models, especially decoder-based models (GPT, T5, Llama, etc.), the BetterTransformer API converts all attention operations to use the [`torch.nn.functional.scaled_dot_product_attention` operator](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention) (SDPA) that is only available in PyTorch 2.0 and onwards.
|
||||
|
||||
To convert a model to BetterTransformer:
|
||||
|
||||
```python
|
||||
from transformers import AutoModelForCausalLM
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
|
||||
# convert the model to BetterTransformer
|
||||
model.to_bettertransformer()
|
||||
|
||||
# Use it for training or inference
|
||||
```
|
||||
|
||||
SDPA can also call [Flash Attention](https://arxiv.org/abs/2205.14135) kernels under the hood. To enable Flash Attention or to check that it is available in a given setting (hardware, problem size), use [`torch.backends.cuda.sdp_kernel`](https://pytorch.org/docs/master/backends.html#torch.backends.cuda.sdp_kernel) as a context manager:
|
||||
|
||||
|
||||
```diff
|
||||
import torch
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
|
||||
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m").to("cuda")
|
||||
# convert the model to BetterTransformer
|
||||
model.to_bettertransformer()
|
||||
|
||||
input_text = "Hello my dog is cute and"
|
||||
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
|
||||
|
||||
+ with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
|
||||
outputs = model.generate(**inputs)
|
||||
|
||||
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
If you see a bug with a traceback saying
|
||||
|
||||
```bash
|
||||
RuntimeError: No available kernel. Aborting execution.
|
||||
```
|
||||
|
||||
try using the PyTorch nightly version, which may have a broader coverage for Flash Attention:
|
||||
|
||||
```bash
|
||||
pip3 install -U --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118
|
||||
```
|
||||
|
||||
|
||||
Have a look at [this detailed blogpost](https://pytorch.org/blog/out-of-the-box-acceleration/) to read more about what is possible to do with `BetterTransformer` + SDPA API.
|
||||
|
||||
## `bitsandbytes` integration for FP4 mixed-precision inference
|
||||
|
||||
@@ -182,3 +246,28 @@ Check out the demo for running T5-11b (42GB in fp32)! Using 8-bit quantization o
|
||||
Or this demo for BLOOM-3B:
|
||||
|
||||
[](https://colab.research.google.com/drive/1qOjXfQIAULfKvZqwCen8-MoWKGdSatZ4?usp=sharing)
|
||||
|
||||
## Advanced usage: mixing FP4 (or Int8) and BetterTransformer
|
||||
|
||||
You can combine the different methods described above to get the best performance for your model. For example, you can use BetterTransformer with FP4 mixed-precision inference + flash attention:
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
|
||||
|
||||
quantization_config = BitsAndBytesConfig(
|
||||
load_in_4bit=True,
|
||||
bnb_4bit_compute_dtype=torch.float16
|
||||
)
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
|
||||
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", quantization_config=quantization_config)
|
||||
|
||||
input_text = "Hello my dog is cute and"
|
||||
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
|
||||
|
||||
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
|
||||
outputs = model.generate(**inputs)
|
||||
|
||||
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
|
||||
```
|
||||
Reference in New Issue
Block a user