Add voxtral (#39429)

* draft

* draft update (conversion working)

* mend

* draft update

* draft update: working generate

* refactor

* VoxtralProcessor draft

* processor update

* update convert_tekken_tokenizer

* refactor processor

* update convert

* make style

* better handle prefil

* make style

* add tests

* add mistral_common audio loading

* processor update

* revert changes

* audio utils update

* add audio to apply chat template mistral update

* voxtral processor update

* fix

* udpate converstion script

* make mistral tokenier from pretrain work from local dir

* fix udpates

* add integration tests

* add batched version

* processor docstring

* make style

* revert convert_tekken_tokenizer changes

* revert processing_qwen2.5 changes

* add multi-turn test

* processor improvements

* address review changes

* Update src/transformers/tokenization_mistral_common.py

Co-authored-by: Julien Denize <40604584+juliendenize@users.noreply.github.com>

* update audio utils

* nits

* integration test update

* correct _support

* update tests

* test update

* update integration tests

* fix

* fix

* fix

* add test_apply_chat_template_with_audio

* add model doc

* model doc

* nit

* doc uptade

* nit

* processor improvement

* ensure default is 3B

* nits

* make

* make

* convert modular

* update checkpoint

* fix test

* make

* make

* autos

* make

* make

* nit

* nit

* nit

---------

Co-authored-by: Julien Denize <40604584+juliendenize@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
This commit is contained in:
eustlb
2025-07-18 02:02:04 +02:00
committed by GitHub
parent 73869f2e81
commit 967045082f
17 changed files with 2806 additions and 7 deletions

View File

@@ -1095,6 +1095,8 @@
title: Vision Text Dual Encoder
- local: model_doc/visual_bert
title: VisualBERT
- local: model_doc/voxtral
title: Voxtral
- local: model_doc/xclip
title: X-CLIP
title: Multimodal models

View File

@@ -0,0 +1,300 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Voxtral
Voxtral is an upgrade of [Ministral 3B and Mistral Small 3B](https://mistral.ai/news/ministraux), extending its language capabilities with audio input support. It is designed to handle tasks such as speech transcription, translation, and audio understanding.
You can read more in Mistral's [realease blog post](https://mistral.ai/news/voxtral).
The model is available in two checkpoints:
- 3B: [mistralai/Voxtral-Mini-3B-2507](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507)
- 24B: [mistralai/Voxtral-Small-24B-2507](https://huggingface.co/mistralai/Voxtral-Small-24B-2507)
## Key Features
Voxtral builds on Ministral-3B by adding audio processing capabilities:
- **Transcription mode**: Includes a dedicated mode for speech transcription. By default, Voxtral detects the spoken language and transcribes it accordingly.
- **Long-form context**: With a 32k token context window, Voxtral can process up to 30 minutes of audio for transcription or 40 minutes for broader audio understanding.
- **Integrated Q&A and summarization**: Supports querying audio directly and producing structured summaries without relying on separate ASR and language models.
- **Multilingual support**: Automatically detects language and performs well across several widely spoken languages, including English, Spanish, French, Portuguese, Hindi, German, Dutch, and Italian.
- **Function calling via voice**: Can trigger functions or workflows directly from spoken input based on detected user intent.
- **Text capabilities**: Maintains the strong text processing performance of its Ministral-3B foundation.
## Usage
Let's first load the model!
```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
repo_id = "mistralai/Voxtral-Mini-3B-2507"
processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)
```
### Audio Instruct Mode
The model supports audio-text instructions, including multi-turn and multi-audio interactions, all processed in batches.
➡️ audio + text instruction
```python
conversation = [
{
"role": "user",
"content": [
{
"type": "audio",
"url": "https://huggingface.co/datasets/eustlb/audio-samples/resolve/main/dude_where_is_my_car.wav",
},
{"type": "text", "text": "What can you tell me about this audio?"},
],
}
]
inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)
outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
```
➡️ multi-audio + text instruction
```python
conversation = [
{
"role": "user",
"content": [
{
"type": "audio",
"path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/mary_had_lamb.mp3",
},
{
"type": "audio",
"path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
},
{"type": "text", "text": "What sport and what nursery rhyme are referenced?"},
],
}
]
inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)
outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
```
➡️ multi-turn:
```python
conversation = [
{
"role": "user",
"content": [
{
"type": "audio",
"path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
},
{
"type": "audio",
"path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
},
{"type": "text", "text": "Describe briefly what you can hear."},
],
},
{
"role": "assistant",
"content": "The audio begins with the speaker delivering a farewell address in Chicago, reflecting on his eight years as president and expressing gratitude to the American people. The audio then transitions to a weather report, stating that it was 35 degrees in Barcelona the previous day, but the temperature would drop to minus 20 degrees the following day.",
},
{
"role": "user",
"content": [
{
"type": "audio",
"path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/dude_where_is_my_car.wav",
},
{"type": "text", "text": "Ok, now compare this new audio with the previous one."},
],
},
]
inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)
outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
```
➡️ text only:
```python
conversation = [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What if a cyber brain could possibly generate its own ghost, and create a soul all by itself?",
},
],
}
]
inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)
outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
```
➡️ audio only:
```python
conversation = [
{
"role": "user",
"content": [
{
"type": "audio",
"path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/dude_where_is_my_car.wav",
},
],
}
]
inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)
outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
```
➡️ batched inference!
```python
conversations = [
[
{
"role": "user",
"content": [
{
"type": "audio",
"path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
},
{
"type": "audio",
"path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
},
{
"type": "text",
"text": "Who's speaking in the speach and what city's weather is being discussed?",
},
],
}
],
[
{
"role": "user",
"content": [
{
"type": "audio",
"path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
},
{"type": "text", "text": "What can you tell me about this audio?"},
],
}
],
]
inputs = processor.apply_chat_template(conversations)
inputs = inputs.to(device, dtype=torch.bfloat16)
outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print("\nGenerated responses:")
print("=" * 80)
for decoded_output in decoded_outputs:
print(decoded_output)
print("=" * 80)
```
### Transcription Mode
Use the model to transcribe audio (supports English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian)!
```python
inputs = processor.apply_transcrition_request(language="en", audio="https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3")
inputs = inputs.to(device, dtype=torch.bfloat16)
outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print("\nGenerated responses:")
print("=" * 80)
for decoded_output in decoded_outputs:
print(decoded_output)
print("=" * 80)
```
This model was contributed by [Eustache Le Bihan](https://huggingface.co/eustlb).
## VoxtralConfig
[[autodoc]] VoxtralConfig
## VoxtralEncoderConfig
[[autodoc]] VoxtralEncoderConfig
## VoxtralProcessor
[[autodoc]] VoxtralProcessor
## VoxtralEncoder
[[autodoc]] VoxtralEncoder
- forward
## VoxtralForConditionalGeneration
[[autodoc]] VoxtralForConditionalGeneration
- forward