@@ -141,7 +141,7 @@ Do note that when training Idefics2 on multi-turn conversations between a user a
 
 ## Model optimizations: Flash Attention
 
-The code snippets above showcase inference without any optimization tricks. However, one can drastically speed up the model by leveraging [Flash Attention](../perf_train_gpu_one.md#flash-attention-2), which is a faster implementation of the attention mechanism used inside the model.
+The code snippets above showcase inference without any optimization tricks. However, one can drastically speed up the model by leveraging [Flash Attention](../perf_train_gpu_one#flash-attention-2), which is a faster implementation of the attention mechanism used inside the model.
 
 First, make sure to install the latest version of Flash Attention 2 to include the sliding window attention feature.
 
@@ -240,7 +240,7 @@ model = LlavaNextVideoForConditionalGeneration.from_pretrained("llava-hf/LLaVA-N
 
 ### Flash-Attention 2 to speed-up generation
 
-Additionally, we can greatly speed-up model inference by using [Flash Attention](../perf_train_gpu_one.md#flash-attention-2), which is a faster implementation of the attention mechanism used inside the model.
+Additionally, we can greatly speed-up model inference by using [Flash Attention](../perf_train_gpu_one#flash-attention-2), which is a faster implementation of the attention mechanism used inside the model.
 
 First, make sure to install the latest version of Flash Attention 2:
 
@@ -91,7 +91,7 @@ As can be seen, the instruction-tuned model requires a [chat template](../chat_t
 
 ## Speeding up Mistral by using Flash Attention
 
-The code snippets above showcase inference without any optimization tricks. However, one can drastically speed up the model by leveraging [Flash Attention](../perf_train_gpu_one.md#flash-attention-2), which is a faster implementation of the attention mechanism used inside the model.
+The code snippets above showcase inference without any optimization tricks. However, one can drastically speed up the model by leveraging [Flash Attention](../perf_train_gpu_one#flash-attention-2), which is a faster implementation of the attention mechanism used inside the model.
 
 First, make sure to install the latest version of Flash Attention 2 to include the sliding window attention feature.
 
@@ -93,7 +93,7 @@ As can be seen, the instruction-tuned model requires a [chat template](../chat_t
 
 ## Speeding up Mixtral by using Flash Attention
 
-The code snippets above showcase inference without any optimization tricks. However, one can drastically speed up the model by leveraging [Flash Attention](../perf_train_gpu_one.md#flash-attention-2), which is a faster implementation of the attention mechanism used inside the model.
+The code snippets above showcase inference without any optimization tricks. However, one can drastically speed up the model by leveraging [Flash Attention](../perf_train_gpu_one#flash-attention-2), which is a faster implementation of the attention mechanism used inside the model.
 
 First, make sure to install the latest version of Flash Attention 2 to include the sliding window attention feature.
 
@@ -174,7 +174,7 @@ model = VideoLlavaForConditionalGeneration.from_pretrained("LanguageBind/Video-L
 
 ### Flash-Attention 2 to speed-up generation
 
-Additionally, we can greatly speed-up model inference by using [Flash Attention](../perf_train_gpu_one.md#flash-attention-2), which is a faster implementation of the attention mechanism used inside the model.
+Additionally, we can greatly speed-up model inference by using [Flash Attention](../perf_train_gpu_one#flash-attention-2), which is a faster implementation of the attention mechanism used inside the model.
 
 First, make sure to install the latest version of Flash Attention 2:
 