Fix torch.compile with fullgraph=True when attention_mask input is used (#29211)

* fix torch.export.export for llama
* do not change doc title
* make fix copies
2024-02-22 16:40:06 +01:00
parent dabe855668
commit 2cc8cf6ce7
5 changed files with 39 additions and 33 deletions
--- a/docs/source/en/perf_infer_gpu_one.md
+++ b/docs/source/en/perf_infer_gpu_one.md
@@ -184,7 +184,7 @@ For now, Transformers supports SDPA inference and training for the following arc

 <Tip>

-FlashAttention can only be used for models with the `fp16` or `bf16` torch type, so make sure to cast your model to the appropriate type first.
+FlashAttention can only be used for models with the `fp16` or `bf16` torch type, so make sure to cast your model to the appropriate type first. The memory-efficient attention backend is able to handle `fp32` models.

 </Tip>