Fix torch.compile with fullgraph=True when attention_mask input is used (#29211)
* fix torch.export.export for llama
* do not change doc title
* make fix copies
@@ -184,7 +184,7 @@ For now, Transformers supports SDPA inference and training for the following arc

 <Tip>

-FlashAttention can only be used for models with the `fp16` or `bf16` torch type, so make sure to cast your model to the appropriate type first.
+FlashAttention can only be used for models with the `fp16` or `bf16` torch type, so make sure to cast your model to the appropriate type first. The memory-efficient attention backend is able to handle `fp32` models.

 </Tip>

@@ -529,24 +529,6 @@ And for Pytorch DeepSpeed has built one as well: [DeepSpeed-MoE: Advancing Mixtu

 ## Using PyTorch native attention and Flash Attention

-PyTorch 2.0 released a native [`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention.html) (SDPA),
-that allows using fused GPU kernels such as [memory-efficient attention](https://arxiv.org/abs/2112.05682) and [flash attention](https://arxiv.org/abs/2205.14135).
-
-After installing the [`optimum`](https://github.com/huggingface/optimum) package, the relevant internal modules can be
-replaced to use PyTorch's native attention with:
-
-```python
-model = model.to_bettertransformer()
-```
-
-Once converted, train the model as usual.
-
-<Tip warning={true}>
-
-The PyTorch-native `scaled_dot_product_attention` operator can only dispatch to Flash Attention if no `attention_mask` is provided.
-
-By default, in training mode, the BetterTransformer integration **drops the mask support and can only be used for training that does not require a padding mask for batched training**. This is the case, for example, during masked language modeling or causal language modeling. BetterTransformer is not suited for fine-tuning models on tasks that require a padding mask.
-
-</Tip>
+PyTorch's [`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention.html) (SDPA) can also call FlashAttention and memory-efficient attention kernels under the hood. SDPA support is currently being added natively in Transformers and is used by default for `torch>=2.1.1` when an implementation is available. Please refer to [PyTorch scaled dot product attention](https://huggingface.co/docs/transformers/perf_infer_gpu_one#pytorch-scaled-dot-product-attention) for a list of supported models and more details.

 Check out this [blogpost](https://pytorch.org/blog/out-of-the-box-acceleration/) to learn more about acceleration and memory-savings with SDPA.

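To make concrete what an `attention_mask` changes in the SDPA computation discussed in this commit, here is a minimal pure-Python reference sketch of scaled dot-product attention with an optional boolean mask. It is not the PyTorch implementation and not part of this commit: the function name mirrors the PyTorch operator only for readability, and the list-based signature and the `padding_mask` example are assumptions made for illustration.

```python
import math


def scaled_dot_product_attention(query, key, value, mask=None):
    """Reference attention over plain Python lists (illustrative only).

    query: list of L vectors, each of length d
    key:   list of S vectors, each of length d
    value: list of S vectors, each of length d_v
    mask:  optional L x S list of booleans; False means "do not attend".
           Assumes every row keeps at least one position.
    """
    scale = 1.0 / math.sqrt(len(query[0]))
    output = []
    for i, q in enumerate(query):
        # Scaled dot product of this query with every key.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in key]
        if mask is not None:
            # Masked positions get -inf so softmax gives them zero weight.
            scores = [s if keep else -math.inf
                      for s, keep in zip(scores, mask[i])]
        # Numerically stable softmax over the key positions.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors.
        output.append([sum(w * v[j] for w, v in zip(weights, value))
                       for j in range(len(value[0]))])
    return output


# One query, two keys; the second key plays the role of a padding token.
q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[10.0], [20.0]]
padding_mask = [[True, False]]  # hypothetical mask for illustration

print(scaled_dot_product_attention(q, k, v))                # both keys contribute
print(scaled_dot_product_attention(q, k, v, padding_mask))  # only the first key contributes
```

With the mask applied, the padded key receives zero weight, so the output equals the first value vector. Fused kernels compute the same quantity far more efficiently, which is why whether a mask is passed can affect which backend is eligible.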