[Flash Attention 2] Add flash attention 2 for GPT-Neo-X (#26463)
* add flash-attn-2 support for GPT-neo-x * fixup * add comment * revert * fixes * update docs * comment * again * fix copies * add plot + fix copies * Update docs/source/en/model_doc/gpt_neox.md
This commit is contained in:
@@ -61,6 +61,40 @@ The `generate()` method can be used to generate text using GPT Neo model.
|
||||
>>> gen_text = tokenizer.batch_decode(gen_tokens)[0]
|
||||
```
|
||||
|
||||
## Using Flash Attention 2
|
||||
|
||||
Flash Attention 2 is an faster, optimized version of the model.
|
||||
|
||||
### Installation
|
||||
|
||||
First, check whether your hardware is compatible with Flash Attention 2. The latest list of compatible hardware can be found in the [official documentation](https://github.com/Dao-AILab/flash-attention#installation-and-features). If your hardware is not compatible with Flash Attention 2, you can still benefit from attention kernel optimisations through Better Transformer support covered [above](https://huggingface.co/docs/transformers/main/en/model_doc/bark#using-better-transformer).
|
||||
|
||||
Next, [install](https://github.com/Dao-AILab/flash-attention#installation-and-features) the latest version of Flash Attention 2:
|
||||
|
||||
```bash
|
||||
pip install -U flash-attn --no-build-isolation
|
||||
```
|
||||
|
||||
### Usage
|
||||
|
||||
To load a model using Flash Attention 2, we can pass the `use_flash_attention_2` flag to [`.from_pretrained`](https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel.from_pretrained). We'll also load the model in half-precision (e.g. `torch.float16`), since it results in almost no degradation to audio quality but significantly lower memory usage and faster inference:
|
||||
|
||||
```python
|
||||
>>> from transformers import GPTNeoXForCausalLM, GPTNeoXTokenizerFast
|
||||
|
||||
model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/gpt-neox-20b", torch_dtype=torch.float16, use_flash_attention_2=True).to(device)
|
||||
...
|
||||
```
|
||||
|
||||
|
||||
### Expected speedups
|
||||
|
||||
Below is an expected speedup diagram that compares pure inference time between the native implementation in transformers using `stockmark/gpt-neox-japanese-1.4b` checkpoint and the Flash Attention 2 version of the model using a sequence length of 2048.
|
||||
|
||||
<div style="text-align: center">
|
||||
<img src="https://huggingface.co/datasets/ybelkada/documentation-images/resolve/main/gpt-neox-1.8b-speedup.jpg">
|
||||
</div>
|
||||
|
||||
## Resources
|
||||
|
||||
- [Causal language modeling task guide](../tasks/language_modeling)
|
||||
|
||||
Reference in New Issue
Block a user