Flash Attention 2 support for RoCm (#27611)

* support FA2 * fix typo * fix broken tests * fix more test errors * left/right * fix bug * more test * typo * fix layout flash attention falcon * do not support this case * use allclose instead of equal * fix various bugs with flash attention * bump * fix test * fix mistral * use skiptest instead of return that may be misleading * add fix causal arg flash attention * fix copies * more explicit comment * still use self.is_causal * fix causal argument * comment * fixes * update documentation * add link * wrong test * simplify FA2 RoCm requirements * update opt * make flash_attn_uses_top_left_mask attribute private and precise comment * better error handling * fix copy & mistral * Update src/transformers/modeling_utils.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update src/transformers/modeling_utils.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update src/transformers/modeling_utils.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update src/transformers/utils/import_utils.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * use is_flash_attn_greater_or_equal_2_10 instead of is_flash_attn_greater_or_equal_210 * fix merge * simplify * inline args --------- Co-authored-by: Felix Marty <felix@hf.co> Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
2023-12-04 13:52:17 +01:00
parent 4d4febb7aa
commit 1da1302ec8
17 changed files with 253 additions and 51 deletions
--- a/docs/source/en/model_doc/gpt_neo.md
+++ b/docs/source/en/model_doc/gpt_neo.md
@@ -56,13 +56,9 @@ The `generate()` method can be used to generate text using GPT Neo model.

 ## Combining GPT-Neo and Flash Attention 2

-First, make sure to install the latest version of Flash Attention 2 to include the sliding window attention feature.
+First, make sure to install the latest version of Flash Attention 2 to include the sliding window attention feature, and make sure your hardware is compatible with Flash-Attention 2. More details are available [here](https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2) concerning the installation.

-```bash
-pip install -U flash-attn --no-build-isolation
-```
-
-Make also sure that you have a hardware that is compatible with Flash-Attention 2. Read more about it in the official documentation of flash-attn repository. Make also sure to load your model in half-precision (e.g. `torch.float16``)
+Make sure as well to load your model in half-precision (e.g. `torch.float16`).

 To load and run a model using Flash Attention 2, refer to the snippet below: