Fix Whisper Conversion Script: Correct decoder_attention_heads and _download function (#26834)
* Fix error in convert_openai_to_hf.py: "_download() missing 1 required positional argument: root" * Fix error in convert_openai_to_hf.py: "TypeError: byte indices must be integers or slices, not str" * Fix decoder_attention_heads value in convert_openai_to_hf.py. Correct the assignment for `decoder_attention_heads` in the conversion script for the Whisper model. * Black reformat convert_openai_to_hf.py file. * Fix Whisper model configuration defaults (for Tiny). - Correct encoder/decoder layers and attention heads count. - Update model width (`d_model`) to 384. * Add docstring to the convert_openai_to_hf.py script with a doctest * Add shebang and +x permission to the convert_openai_to_hf.py * convert_openai_to_hf.py: reuse the read model_bytes in the _download() function * Move convert_openai_to_hf.py doctest example to whisper.md * whisper.md: Add an inference example to the Conversion section. * whisper.md: remove `model.config.forced_decoder_ids` from examples (deprecated) * whisper.md: Remove "## Format Conversion" section; not used by users * whisper.md: Use librispeech_asr_dummy dataset and load_dataset()
This commit is contained in:
@@ -34,6 +34,42 @@ The original code can be found [here](https://github.com/openai/whisper).
|
||||
- Inference is currently only implemented for short-form i.e. audio is pre-segmented into <=30s segments. Long-form (including timestamps) will be implemented in a future release.
|
||||
- One can use [`WhisperProcessor`] to prepare audio for the model, and decode the predicted ID's back into text.
|
||||
|
||||
This model was contributed by [Arthur Zucker](https://huggingface.co/ArthurZ). The Tensorflow version of this model was contributed by [amyeroberts](https://huggingface.co/amyeroberts).
|
||||
The original code can be found [here](https://github.com/openai/whisper).
|
||||
|
||||
## Inference
|
||||
|
||||
Here is a step-by-step guide to transcribing an audio sample using a pre-trained Whisper model:
|
||||
|
||||
```python
|
||||
>>> from datasets import load_dataset
|
||||
>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
|
||||
|
||||
>>> # Select an audio file and read it:
|
||||
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
|
||||
>>> audio_sample = ds[0]["audio"]
|
||||
>>> waveform = audio_sample["array"]
|
||||
>>> sampling_rate = audio_sample["sampling_rate"]
|
||||
|
||||
>>> # Load the Whisper model in Hugging Face format:
|
||||
>>> processor = WhisperProcessor.from_pretrained("openai/whisper-tiny.en")
|
||||
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")
|
||||
|
||||
>>> # Use the model and processor to transcribe the audio:
|
||||
>>> input_features = processor(
|
||||
... waveform, sampling_rate=sampling_rate, return_tensors="pt"
|
||||
... ).input_features
|
||||
|
||||
>>> # Generate token ids
|
||||
>>> predicted_ids = model.generate(input_features)
|
||||
|
||||
>>> # Decode token ids to text
|
||||
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
|
||||
|
||||
>>> transcription[0]
|
||||
' Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.'
|
||||
```
|
||||
|
||||
## WhisperConfig
|
||||
|
||||
[[autodoc]] WhisperConfig
|
||||
|
||||
Reference in New Issue
Block a user