update Bark FA2 docs (#27400)
* update Bark FA2 docs
* update benchmark section
* Update bark.md
* Apply suggestions from code review

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>

* rephrase

---------

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
@@ -44,7 +44,19 @@ device = "cuda" if torch.cuda.is_available() else "cpu"
 model = BarkModel.from_pretrained("suno/bark-small", torch_dtype=torch.float16).to(device)
 ```

-#### Using 🤗 Better Transformer
+#### Using CPU offload
+
+As mentioned above, Bark is made up of 4 sub-models, which are called up sequentially during audio generation. In other words, while one sub-model is in use, the other sub-models are idle.
+
+If you're using a CUDA device, a simple solution to benefit from an 80% reduction in memory footprint is to offload the submodels from GPU to CPU when they're idle. This operation is called *CPU offloading*. You can use it with one line of code as follows:
+
+```python
+model.enable_cpu_offload()
+```
+
+Note that 🤗 Accelerate must be installed before using this feature. [Here's how to install it.](https://huggingface.co/docs/accelerate/basic_tutorials/install)
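
To make the memory argument in the CPU offload section concrete, here is a toy sketch of why running the sub-models one at a time lowers peak GPU memory. The sub-model names and sizes are hypothetical placeholders, not measurements of Bark, and `peak_gpu_memory` is an illustrative helper, not part of 🤗 Transformers or 🤗 Accelerate:

```python
def peak_gpu_memory(sizes_mb, offload):
    """Peak GPU memory (MB) needed to run the sub-models one after another."""
    if offload:
        # Only the active sub-model is resident on the GPU at any time.
        return max(sizes_mb.values())
    # Without offloading, every sub-model stays on the GPU for the whole run.
    return sum(sizes_mb.values())


# Hypothetical sizes, chosen only for illustration.
sizes_mb = {"sub_model_1": 400, "sub_model_2": 400, "sub_model_3": 300, "sub_model_4": 100}

resident = peak_gpu_memory(sizes_mb, offload=False)
offloaded = peak_gpu_memory(sizes_mb, offload=True)
saving = 1 - offloaded / resident
print(f"all resident: {resident} MB, with offload: {offloaded} MB, saving: {saving:.0%}")
```

The sketch ignores activations and the cost of moving weights between devices, so it will not reproduce the 80% figure quoted in the docs; it only shows where the saving comes from.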
+
+#### Using Better Transformer

 Better Transformer is an 🤗 Optimum feature that performs kernel fusion under the hood. You can gain 20% to 30% in speed with zero performance degradation. It only requires one line of code to export the model to 🤗 Better Transformer:

@@ -54,21 +66,46 @@ model = model.to_bettertransformer()

 Note that 🤗 Optimum must be installed before using this feature. [Here's how to install it.](https://huggingface.co/docs/optimum/installation)

-#### Using CPU offload
+#### Using Flash Attention 2

-As mentioned above, Bark is made up of 4 sub-models, which are called up sequentially during audio generation. In other words, while one sub-model is in use, the other sub-models are idle.
+Flash Attention 2 is an even faster, optimized version of the previous optimization.

-If you're using a CUDA device, a simple solution to benefit from an 80% reduction in memory footprint is to offload the GPU's submodels when they're idle. This operation is called CPU offloading. You can use it with one line of code.
+##### Installation

-```python
-model.enable_cpu_offload()
+First, check whether your hardware is compatible with Flash Attention 2. The latest list of compatible hardware can be found in the [official documentation](https://github.com/Dao-AILab/flash-attention#installation-and-features). If your hardware is not compatible with Flash Attention 2, you can still benefit from attention kernel optimisations through Better Transformer support covered [above](https://huggingface.co/docs/transformers/main/en/model_doc/bark#using-better-transformer).
+
+Next, [install](https://github.com/Dao-AILab/flash-attention#installation-and-features) the latest version of Flash Attention 2:
+
+```bash
+pip install -U flash-attn --no-build-isolation
 ```

-Note that 🤗 Accelerate must be installed before using this feature. [Here's how to install it.](https://huggingface.co/docs/accelerate/basic_tutorials/install)
+
+##### Usage
+
+To load a model using Flash Attention 2, we can pass the `use_flash_attention_2` flag to [`.from_pretrained`](https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel.from_pretrained). We'll also load the model in half-precision (e.g. `torch.float16`), since it results in almost no degradation to audio quality but significantly lower memory usage and faster inference:
+
+```python
+model = BarkModel.from_pretrained("suno/bark-small", torch_dtype=torch.float16, use_flash_attention_2=True).to(device)
+```
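
Since the flag only makes sense when Flash Attention 2 is actually usable, one option is to decide at runtime whether to pass it. The following is a minimal sketch, assuming the `flash-attn` package is importable as `flash_attn`; `flash_attention_kwargs` is a hypothetical helper, not a 🤗 Transformers API:

```python
import importlib.util


def flash_attention_kwargs(cuda_available, package="flash_attn"):
    """Return extra `from_pretrained` kwargs, enabling Flash Attention 2 only when usable."""
    installed = importlib.util.find_spec(package) is not None
    if cuda_available and installed:
        return {"use_flash_attention_2": True}
    # Otherwise pass nothing and keep the default attention implementation.
    return {}


# Example: on a machine without CUDA, no extra flag is passed.
print(flash_attention_kwargs(cuda_available=False))
```

The result can be unpacked into the call from the diff, for example `BarkModel.from_pretrained("suno/bark-small", torch_dtype=torch.float16, **flash_attention_kwargs(torch.cuda.is_available()))`. This check does not verify that the GPU architecture is supported; that still has to be confirmed against the Flash Attention 2 installation guide.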
+
+##### Performance comparison
+
+
+The following diagram shows the latency for the native attention implementation (no optimisation) against Better Transformer and Flash Attention 2. In all cases, we generate 400 semantic tokens on a 40GB A100 GPU with PyTorch 2.1. Flash Attention 2 is also consistently faster than Better Transformer, and its performance improves even more as batch sizes increase:
+
+<div style="text-align: center">
+<img src="https://huggingface.co/datasets/ylacombe/benchmark-comparison/resolve/main/Bark%20Optimization%20Benchmark.png">
+</div>
+
+To put this into perspective, on an NVIDIA A100 and when generating 400 semantic tokens with a batch size of 16, you can get 17 times the [throughput](https://huggingface.co/blog/optimizing-bark#throughput) and still be 2 seconds faster than generating sentences one by one with the native model implementation. In other words, all the samples will be generated 17 times faster.
+
+At batch size 8, on an NVIDIA A100, Flash Attention 2 is also 10% faster than Better Transformer, and at batch size 16, 25%.
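
The throughput claim can be checked with a little arithmetic. Throughput is batch size divided by latency, so if the native model takes `L` seconds for one sample and the optimized batch of 16 takes `L - 2` seconds, a 17x ratio means `16 * L / (L - 2) = 17`, which gives `L = 34`. The sketch below assumes the two figures in the text are exact; the latencies are implied by those rounded numbers, not measured, and `relative_throughput` is an illustrative helper:

```python
def relative_throughput(batch_a, latency_a, batch_b, latency_b):
    """How many times more samples per second setup A produces than setup B."""
    # throughput = batch_size / latency, so the ratio simplifies to this product form.
    return (batch_a * latency_b) / (batch_b * latency_a)


# Latencies implied by taking both claims at face value (not measured values).
native_latency = 34.0                      # seconds, native model, batch size 1
optimized_latency = native_latency - 2.0   # "2 seconds faster", batch size 16

print(relative_throughput(16, optimized_latency, 1, native_latency))
```

Treat this as a consistency check on the wording only; the measured latencies are in the linked benchmark.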
+


 #### Combining optimization techniques

-You can combine optimization techniques, and use CPU offload, half-precision and 🤗 Better Transformer all at once.
+You can combine optimization techniques, and use CPU offload, half-precision and Flash Attention 2 (or 🤗 Better Transformer) all at once.

 ```python
 from transformers import BarkModel
@@ -76,11 +113,8 @@ import torch

 device = "cuda" if torch.cuda.is_available() else "cpu"

-# load in fp16
-model = BarkModel.from_pretrained("suno/bark-small", torch_dtype=torch.float16).to(device)
+# load in fp16 and use Flash Attention 2
+model = BarkModel.from_pretrained("suno/bark-small", torch_dtype=torch.float16, use_flash_attention_2=True).to(device)

-# convert to bettertransformer
-model = BetterTransformer.transform(model, keep_original_model=False)
-
 # enable CPU offload
 model.enable_cpu_offload()
@@ -36,7 +36,7 @@ FlashAttention-2 is experimental and may change considerably in future versions.
 1. additionally parallelizing the attention computation over sequence length
 2. partitioning the work between GPU threads to reduce communication and shared memory reads/writes between them

-FlashAttention-2 supports inference with Llama, Mistral, and Falcon models. You can request to add FlashAttention-2 support for another model by opening a GitHub Issue or Pull Request.
+FlashAttention-2 supports inference with Llama, Mistral, Falcon and Bark models. You can request to add FlashAttention-2 support for another model by opening a GitHub Issue or Pull Request.

 Before you begin, make sure you have FlashAttention-2 installed (see the [installation](https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#installation-and-features) guide for more details about prerequisites):
