[whisper] static kv cache (#31166)
* make work with cache abstraction * correct for static cache * hacks for compile * make fast * fix * fix pos ids * generate * fix sdpa * fix sdpa cache pos * fix fa2 * clean fa2 * integrate cache into generate * make style * copies * more copies * update eager * update sdpa * update fa2 * simplify * use cache pos * always compute cross-cache for debug * avoid recompiles Co-authored-by: Arthur Zucker <arthur@huggingface.co> * fix fix * fix fix fix * more fix * try encoder-decoder cache (too messy) * revert encoder-decoder cache * check cross-attn cache * use enc-dec dataclass * use richer enc-dec dataclass * clean-up * revert static cache changes * small fixes * revert to cpu flag * fix copies * add static slow test * past k/v docstring * more docstrings * cache_position docstrings * add to docs * add enc-dec cache to docs * make style * fix after rebase * fix beam * style * fix generation strategies * fix most decoder-only tests * style * skip test * more clean up * small docstrings * Apply suggestions from code review Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com> * add todo * only crop self-attn * check cache in mixin * style * fix re-compile after rebase * move `is_updated` logic to enc-dec wrapper * revert back * revert cache back * finalise design * fix * fix fix * style * Update src/transformers/cache_utils.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * deprecate * updates * final updates * style * style --------- Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com> Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
This commit is contained in:
@@ -391,6 +391,12 @@ A [`Constraint`] can be used to force the generation to include specific tokens
|
||||
- get_seq_length
|
||||
- reset
|
||||
|
||||
[[autodoc]] EncoderDecoderCache
|
||||
- get_seq_length
|
||||
- to_legacy_cache
|
||||
- from_legacy_cache
|
||||
- reset
|
||||
- reorder_cache
|
||||
|
||||
## Watermark Utils
|
||||
|
||||
|
||||
@@ -52,8 +52,6 @@ Here is a step-by-step guide to transcribing an audio sample using a pre-trained
|
||||
>>> # Select an audio file and read it:
|
||||
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
|
||||
>>> audio_sample = ds[0]["audio"]
|
||||
>>> waveform = audio_sample["array"]
|
||||
>>> sampling_rate = audio_sample["sampling_rate"]
|
||||
|
||||
>>> # Load the Whisper model in Hugging Face format:
|
||||
>>> processor = WhisperProcessor.from_pretrained("openai/whisper-tiny.en")
|
||||
@@ -61,7 +59,7 @@ Here is a step-by-step guide to transcribing an audio sample using a pre-trained
|
||||
|
||||
>>> # Use the model and processor to transcribe the audio:
|
||||
>>> input_features = processor(
|
||||
... waveform, sampling_rate=sampling_rate, return_tensors="pt"
|
||||
... audio_sample["array"], sampling_rate=audio_sample["sampling_rate"], return_tensors="pt"
|
||||
... ).input_features
|
||||
|
||||
>>> # Generate token ids
|
||||
@@ -74,6 +72,49 @@ Here is a step-by-step guide to transcribing an audio sample using a pre-trained
|
||||
' Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.'
|
||||
```
|
||||
|
||||
Whisper is compatible with the following optimisations:
|
||||
- [PyTorch Scaled Dot Product Attention (SDPA)](../perf_infer_gpu_one#pytorch-scaled-dot-product-attention): flash attention and memory-efficient attention kernels. Enabled by default for `torch>=2.1.1`.
|
||||
- [Flash Attention 2](../perf_infer_gpu_one#flashattention-2): improved implementation of flash attention through better parallelism and work partitioning.
|
||||
- [torch.compile](../llm_optims#static-kv-cache-and-torchcompile): JIT-compile the forward pass to dispatch to efficient fused kernels.
|
||||
|
||||
As an example, the following codesnippet enables SDPA and `torch.compile` for up to 5x faster inference:
|
||||
|
||||
```python
|
||||
>>> from datasets import load_dataset
|
||||
>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
|
||||
|
||||
>>> # Select an audio file and read it:
|
||||
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
|
||||
>>> audio_sample = ds[0]["audio"]
|
||||
|
||||
>>> # Load the Whisper model with SDPA attention
|
||||
>>> processor = WhisperProcessor.from_pretrained("openai/whisper-tiny.en")
|
||||
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en", attn_implementation="sdpa")
|
||||
|
||||
>>> # Enable static cache and compile the forward pass
|
||||
>>> model.generation_config.cache_implementation = "static"
|
||||
>>> model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
|
||||
|
||||
>>> # Use the model and processor to transcribe the audio:
|
||||
>>> input_features = processor(
|
||||
... audio_sample["array"], sampling_rate=audio_sample["sampling_rate"], return_tensors="pt"
|
||||
... ).input_features
|
||||
|
||||
>>> # Compile the forward pass
|
||||
>>> _ = model.generate(input_features)
|
||||
|
||||
>>> # Generate token ids using compiled graph (fast!)
|
||||
>>> predicted_ids = model.generate(input_features)
|
||||
|
||||
>>> # Decode token ids to text
|
||||
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
|
||||
|
||||
>>> transcription[0]
|
||||
' Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.'
|
||||
```
|
||||
|
||||
For more details on each optimisation, refer to the documentation linked above.
|
||||
|
||||
## Resources
|
||||
|
||||
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Whisper. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
|
||||
|
||||
Reference in New Issue
Block a user