Add voxtral (#39429)

* draft

* draft update (conversion working)

* mend

* draft update

* draft update: working generate

* refactor

* VoxtralProcessor draft

* processor update

* update convert_tekken_tokenizer

* refactor processor

* update convert

* make style

* better handle prefil

* make style

* add tests

* add mistral_common audio loading

* processor update

* revert changes

* audio utils update

* add audio to apply chat template mistral update

* voxtral processor update

* fix

* udpate converstion script

* make mistral tokenier from pretrain work from local dir

* fix udpates

* add integration tests

* add batched version

* processor docstring

* make style

* revert convert_tekken_tokenizer changes

* revert processing_qwen2.5 changes

* add multi-turn test

* processor improvements

* address review changes

* Update src/transformers/tokenization_mistral_common.py

Co-authored-by: Julien Denize <40604584+juliendenize@users.noreply.github.com>

* update audio utils

* nits

* integration test update

* correct _support

* update tests

* test update

* update integration tests

* fix

* fix

* fix

* add test_apply_chat_template_with_audio

* add model doc

* model doc

* nit

* doc uptade

* nit

* processor improvement

* ensure default is 3B

* nits

* make

* make

* convert modular

* update checkpoint

* fix test

* make

* make

* autos

* make

* make

* nit

* nit

* nit

---------

Co-authored-by: Julien Denize <40604584+juliendenize@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
eustlb
2025-07-18 02:02:04 +02:00
committed by GitHub
parent 73869f2e81
commit 967045082f
17 changed files with 2806 additions and 7 deletions


@@ -1095,6 +1095,8 @@
title: Vision Text Dual Encoder
- local: model_doc/visual_bert
title: VisualBERT
- local: model_doc/voxtral
title: Voxtral
- local: model_doc/xclip
title: X-CLIP
title: Multimodal models


@@ -0,0 +1,300 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Voxtral
Voxtral is an upgrade of [Ministral 3B and Mistral Small 3B](https://mistral.ai/news/ministraux), extending their language capabilities with audio input support. It is designed to handle tasks such as speech transcription, translation, and audio understanding.
You can read more in Mistral's [release blog post](https://mistral.ai/news/voxtral).
The model is available in two checkpoints:
- 3B: [mistralai/Voxtral-Mini-3B-2507](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507)
- 24B: [mistralai/Voxtral-Small-24B-2507](https://huggingface.co/mistralai/Voxtral-Small-24B-2507)
## Key Features
Voxtral builds on Ministral-3B by adding audio processing capabilities:
- **Transcription mode**: Includes a dedicated mode for speech transcription. By default, Voxtral detects the spoken language and transcribes it accordingly.
- **Long-form context**: With a 32k token context window, Voxtral can process up to 30 minutes of audio for transcription or 40 minutes for broader audio understanding.
- **Integrated Q&A and summarization**: Supports querying audio directly and producing structured summaries without relying on separate ASR and language models.
- **Multilingual support**: Automatically detects language and performs well across several widely spoken languages, including English, Spanish, French, Portuguese, Hindi, German, Dutch, and Italian.
- **Function calling via voice**: Can trigger functions or workflows directly from spoken input based on detected user intent.
- **Text capabilities**: Maintains the strong text processing performance of its Ministral-3B foundation.
## Usage
Let's first load the model!
```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
repo_id = "mistralai/Voxtral-Mini-3B-2507"
processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)
```
### Audio Instruct Mode
The model supports audio-text instructions, including multi-turn and multi-audio interactions, all processed in batches.
➡️ audio + text instruction
```python
conversation = [
{
"role": "user",
"content": [
{
"type": "audio",
"url": "https://huggingface.co/datasets/eustlb/audio-samples/resolve/main/dude_where_is_my_car.wav",
},
{"type": "text", "text": "What can you tell me about this audio?"},
],
}
]
inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)
outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
```
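In each example, `model.generate` returns the prompt tokens followed by the newly generated ones, so the slice `outputs[:, inputs.input_ids.shape[1]:]` keeps only the response before decoding. Here is a minimal sketch of that slicing, with plain Python lists standing in for tensors. The token ids and the helper name `strip_prompt` are made up for illustration, and it assumes every prompt in the batch has the same (padded) length:

```python
def strip_prompt(sequences, prompt_length):
    """Drop the first `prompt_length` tokens of every row (hypothetical helper)."""
    return [row[prompt_length:] for row in sequences]


# Two batched sequences: 3 prompt tokens each, followed by 2 generated tokens.
prompt_ids = [[1, 11, 12], [1, 21, 22]]
generated = [[1, 11, 12, 101, 102], [1, 21, 22, 201, 202]]

# Equivalent in spirit to `outputs[:, inputs.input_ids.shape[1]:]`.
new_tokens = strip_prompt(generated, prompt_length=len(prompt_ids[0]))
print(new_tokens)
```

Without the slice, the decoded text would repeat the prompt in front of the answer.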
➡️ multi-audio + text instruction
```python
conversation = [
{
"role": "user",
"content": [
{
"type": "audio",
"path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/mary_had_lamb.mp3",
},
{
"type": "audio",
"path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
},
{"type": "text", "text": "What sport and what nursery rhyme are referenced?"},
],
}
]
inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)
outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
```
➡️ multi-turn:
```python
conversation = [
{
"role": "user",
"content": [
{
"type": "audio",
"path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
},
{
"type": "audio",
"path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
},
{"type": "text", "text": "Describe briefly what you can hear."},
],
},
{
"role": "assistant",
"content": "The audio begins with the speaker delivering a farewell address in Chicago, reflecting on his eight years as president and expressing gratitude to the American people. The audio then transitions to a weather report, stating that it was 35 degrees in Barcelona the previous day, but the temperature would drop to minus 20 degrees the following day.",
},
{
"role": "user",
"content": [
{
"type": "audio",
"path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/dude_where_is_my_car.wav",
},
{"type": "text", "text": "Ok, now compare this new audio with the previous one."},
],
},
]
inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)
outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
```
➡️ text only:
```python
conversation = [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What if a cyber brain could possibly generate its own ghost, and create a soul all by itself?",
},
],
}
]
inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)
outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
```
➡️ audio only:
```python
conversation = [
{
"role": "user",
"content": [
{
"type": "audio",
"path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/dude_where_is_my_car.wav",
},
],
}
]
inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)
outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
```
➡️ batched inference!
```python
conversations = [
[
{
"role": "user",
"content": [
{
"type": "audio",
"path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
},
{
"type": "audio",
"path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
},
{
"type": "text",
"text": "Who's speaking in the speech and what city's weather is being discussed?",
},
],
}
],
[
{
"role": "user",
"content": [
{
"type": "audio",
"path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
},
{"type": "text", "text": "What can you tell me about this audio?"},
],
}
],
]
inputs = processor.apply_chat_template(conversations)
inputs = inputs.to(device, dtype=torch.bfloat16)
outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print("\nGenerated responses:")
print("=" * 80)
for decoded_output in decoded_outputs:
    print(decoded_output)
print("=" * 80)
```
### Transcription Mode
Use the model to transcribe audio (supports English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian)!
```python
inputs = processor.apply_transcrition_request(language="en", audio="https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3")
inputs = inputs.to(device, dtype=torch.bfloat16)
outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print("\nGenerated responses:")
print("=" * 80)
for decoded_output in decoded_outputs:
    print(decoded_output)
print("=" * 80)
```
This model was contributed by [Eustache Le Bihan](https://huggingface.co/eustlb).
## VoxtralConfig
[[autodoc]] VoxtralConfig
## VoxtralEncoderConfig
[[autodoc]] VoxtralEncoderConfig
## VoxtralProcessor
[[autodoc]] VoxtralProcessor
## VoxtralEncoder
[[autodoc]] VoxtralEncoder
- forward
## VoxtralForConditionalGeneration
[[autodoc]] VoxtralForConditionalGeneration
- forward


@@ -16,10 +16,12 @@ Audio processing functions to extract features from audio waveforms. This code i
and remove unnecessary dependencies.
"""
import base64
import io
import os
import warnings
from io import BytesIO
from typing import Optional, Union
from typing import Any, Optional, Union
import numpy as np
import requests
@@ -27,14 +29,21 @@ import requests
from .utils import (
is_librosa_available,
is_numpy_array,
is_soundfile_available,
is_torch_tensor,
requires_backends,
)
if is_soundfile_available():
import soundfile as sf
if is_librosa_available():
import librosa
# TODO: @eustlb, we actually don't need librosa but soxr is installed with librosa
import soxr
def load_audio(audio: Union[str, np.ndarray], sampling_rate=16000, timeout=None) -> np.ndarray:
"""
@@ -69,6 +78,85 @@ def load_audio(audio: Union[str, np.ndarray], sampling_rate=16000, timeout=None)
return audio
def load_audio_as(
audio: str,
return_format: str,
timeout: Optional[int] = None,
force_mono: bool = False,
sampling_rate: Optional[int] = None,
) -> Union[str, dict[str, Any], io.BytesIO, None]:
"""
Load audio from either a local file path or URL and return in specified format.
Args:
audio (`str`): Either a local file path or a URL to an audio file
return_format (`str`): Format to return the audio in:
- "base64": Base64 encoded string
- "dict": Dictionary with data and format
- "buffer": BytesIO object
timeout (`int`, *optional*): Timeout for URL requests in seconds
force_mono (`bool`): Whether to convert stereo audio to mono
sampling_rate (`int`, *optional*): If provided, the audio will be resampled to the specified sampling rate.
Returns:
`Union[str, dict[str, Any], io.BytesIO, None]`:
- `str`: Base64 encoded audio data (if return_format="base64")
- `dict`: Dictionary with 'data' (base64 encoded audio data) and 'format' keys (if return_format="dict")
- `io.BytesIO`: BytesIO object containing audio data (if return_format="buffer")
"""
# TODO: @eustlb, we actually don't need librosa but soxr is installed with librosa
requires_backends(load_audio_as, ["librosa"])
if return_format not in ["base64", "dict", "buffer"]:
raise ValueError(f"Invalid return_format: {return_format}. Must be 'base64', 'dict', or 'buffer'")
try:
# Load audio bytes from URL or file
audio_bytes = None
if audio.startswith(("http://", "https://")):
response = requests.get(audio, timeout=timeout)
response.raise_for_status()
audio_bytes = response.content
elif os.path.isfile(audio):
with open(audio, "rb") as audio_file:
audio_bytes = audio_file.read()
else:
raise ValueError(f"File not found: {audio}")
# Process audio data
with io.BytesIO(audio_bytes) as audio_file:
with sf.SoundFile(audio_file) as f:
audio_array = f.read(dtype="float32")
original_sr = f.samplerate
audio_format = f.format
if sampling_rate is not None and sampling_rate != original_sr:
# Resample audio to target sampling rate
audio_array = soxr.resample(audio_array, original_sr, sampling_rate, quality="HQ")
else:
sampling_rate = original_sr
# Convert to mono if needed
if force_mono and audio_array.ndim != 1:
audio_array = audio_array.mean(axis=1)
buffer = io.BytesIO()
sf.write(buffer, audio_array, sampling_rate, format=audio_format.upper())
buffer.seek(0)
if return_format == "buffer":
return buffer
elif return_format == "base64":
return base64.b64encode(buffer.read()).decode("utf-8")
elif return_format == "dict":
return {
"data": base64.b64encode(buffer.read()).decode("utf-8"),
"format": audio_format.lower(),
}
except Exception as e:
raise ValueError(f"Error loading audio: {e}")
AudioInput = Union[
np.ndarray, "torch.Tensor", list[np.ndarray], tuple[np.ndarray], list["torch.Tensor"], tuple["torch.Tensor"] # noqa: F821
]
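The three `return_format` options of `load_audio_as` differ only in how the re-encoded bytes are packaged. The sketch below isolates that last step using the standard library only. `package_audio_bytes` is a hypothetical helper, not part of `audio_utils`, and the input bytes are a placeholder rather than a real audio file:

```python
import base64
import io


def package_audio_bytes(audio_bytes: bytes, audio_format: str, return_format: str):
    """Hypothetical helper mirroring the three return formats of `load_audio_as`."""
    if return_format not in ("base64", "dict", "buffer"):
        raise ValueError(f"Invalid return_format: {return_format}. Must be 'base64', 'dict', or 'buffer'")
    if return_format == "buffer":
        # File-like object positioned at the start of the data.
        return io.BytesIO(audio_bytes)
    encoded = base64.b64encode(audio_bytes).decode("utf-8")
    if return_format == "base64":
        return encoded
    # "dict": base64 payload plus the lower-cased container format.
    return {"data": encoded, "format": audio_format.lower()}


fake_wav = b"RIFF....WAVE"  # placeholder bytes, not a decodable audio file
as_dict = package_audio_bytes(fake_wav, "WAV", "dict")
as_buffer = package_audio_bytes(fake_wav, "WAV", "buffer")
print(as_dict["format"], len(as_dict["data"]))
```

The `"dict"` form carries the format alongside the payload, which helps when the consumer cannot infer the container type from the bytes alone.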


@@ -335,6 +335,7 @@ if TYPE_CHECKING:
from .vits import *
from .vivit import *
from .vjepa2 import *
from .voxtral import *
from .wav2vec2 import *
from .wav2vec2_bert import *
from .wav2vec2_conformer import *


@@ -389,6 +389,8 @@ CONFIG_MAPPING_NAMES = OrderedDict[str, str](
("vits", "VitsConfig"),
("vivit", "VivitConfig"),
("vjepa2", "VJEPA2Config"),
("voxtral", "VoxtralConfig"),
("voxtral_encoder", "VoxtralEncoderConfig"),
("wav2vec2", "Wav2Vec2Config"),
("wav2vec2-bert", "Wav2Vec2BertConfig"),
("wav2vec2-conformer", "Wav2Vec2ConformerConfig"),
@@ -798,6 +800,8 @@ MODEL_NAMES_MAPPING = OrderedDict[str, str](
("vits", "VITS"),
("vivit", "ViViT"),
("vjepa2", "VJEPA2Model"),
("voxtral", "Voxtral"),
("voxtral_encoder", "Voxtral Encoder"),
("wav2vec2", "Wav2Vec2"),
("wav2vec2-bert", "Wav2Vec2-BERT"),
("wav2vec2-conformer", "Wav2Vec2-Conformer"),
@@ -864,6 +868,7 @@ SPECIAL_MODEL_TYPE_TO_MODULE_NAME = OrderedDict[str, str](
("xclip", "x_clip"),
("clip_vision_model", "clip"),
("qwen2_audio_encoder", "qwen2_audio"),
("voxtral_encoder", "voxtral"),
("clip_text_model", "clip"),
("aria_text", "aria"),
("gemma3_text", "gemma3"),


@@ -359,6 +359,8 @@ MODEL_MAPPING_NAMES = OrderedDict(
("vits", "VitsModel"),
("vivit", "VivitModel"),
("vjepa2", "VJEPA2Model"),
("voxtral", "VoxtralForConditionalGeneration"),
("voxtral_encoder", "VoxtralEncoder"),
("wav2vec2", "Wav2Vec2Model"),
("wav2vec2-bert", "Wav2Vec2BertModel"),
("wav2vec2-conformer", "Wav2Vec2ConformerModel"),
@@ -458,6 +460,7 @@ MODEL_FOR_PRETRAINING_MAPPING_NAMES = OrderedDict(
("vipllava", "VipLlavaForConditionalGeneration"),
("visual_bert", "VisualBertForPreTraining"),
("vit_mae", "ViTMAEForPreTraining"),
("voxtral", "VoxtralForConditionalGeneration"),
("wav2vec2", "Wav2Vec2ForPreTraining"),
("wav2vec2-conformer", "Wav2Vec2ConformerForPreTraining"),
("xlm", "XLMWithLMHeadModel"),
@@ -1078,6 +1081,7 @@ MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING_NAMES = OrderedDict(
("t5", "T5ForConditionalGeneration"),
("t5gemma", "T5GemmaForConditionalGeneration"),
("umt5", "UMT5ForConditionalGeneration"),
("voxtral", "VoxtralForConditionalGeneration"),
("xlm-prophetnet", "XLMProphetNetForConditionalGeneration"),
]
)


@@ -132,6 +132,7 @@ PROCESSOR_MAPPING_NAMES = OrderedDict(
("vilt", "ViltProcessor"),
("vipllava", "LlavaProcessor"),
("vision-text-dual-encoder", "VisionTextDualEncoderProcessor"),
("voxtral", "VoxtralProcessor"),
("wav2vec2", "Wav2Vec2Processor"),
("wav2vec2-bert", "Wav2Vec2Processor"),
("wav2vec2-conformer", "Wav2Vec2Processor"),


@@ -0,0 +1,29 @@
# coding=utf-8
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure
if TYPE_CHECKING:
from .configuration_voxtral import *
from .modeling_voxtral import *
from .processing_voxtral import *
else:
import sys
_file = globals()["__file__"]
sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)


@@ -0,0 +1,203 @@
# coding=utf-8
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from ...configuration_utils import PretrainedConfig
from ..auto import CONFIG_MAPPING, AutoConfig
class VoxtralEncoderConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`VoxtralEncoder`]. It is used to instantiate a
Voxtral audio encoder according to the specified arguments, defining the model architecture. Instantiating a
configuration with the defaults will yield a similar configuration to that of the audio encoder of the Voxtral
architecture.
e.g. [mistralai/Voxtral-Mini-3B-2507](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507)
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
vocab_size (`int`, *optional*, defaults to 51866):
Vocabulary size of the model.
hidden_size (`int`, *optional*, defaults to 1280):
Dimensionality of the hidden representations.
intermediate_size (`int`, *optional*, defaults to 5120):
Dimension of the MLP representations.
num_hidden_layers (`int`, *optional*, defaults to 32):
Number of hidden layers in the Transformer encoder.
num_attention_heads (`int`, *optional*, defaults to 20):
Number of attention heads for each attention layer in the Transformer encoder.
scale_embedding (`bool`, *optional*, defaults to `False`):
Scale embeddings by dividing by sqrt(hidden_size) if True.
activation_function (`str`, *optional*, defaults to `"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, `"relu"`, `"silu"` and `"gelu_new"` are supported.
num_mel_bins (`int`, *optional*, defaults to 128):
Number of mel features used per input features. Should correspond to the value used in the
`VoxtralProcessor` class.
max_source_positions (`int`, *optional*, defaults to 1500):
The maximum sequence length of log-mel filter-bank features that this model might ever be used with.
initializer_range (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
attention_dropout (`float`, *optional*, defaults to 0.0):
The dropout ratio for the attention probabilities.
```python
>>> from transformers import VoxtralEncoderConfig, VoxtralEncoder
>>> # Initializing a VoxtralEncoderConfig
>>> configuration = VoxtralEncoderConfig()
>>> # Initializing a VoxtralEncoder (with random weights)
>>> model = VoxtralEncoder(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = "voxtral_encoder"
attribute_map = {
"d_model": "hidden_size",
"encoder_layers": "num_hidden_layers",
"encoder_attention_heads": "num_attention_heads",
"encoder_ffn_dim": "intermediate_size",
"encoder_layerdrop": "layerdrop",
}
def __init__(
self,
vocab_size=51866,
hidden_size=1280,
intermediate_size=5120,
num_hidden_layers=32,
num_attention_heads=20,
scale_embedding=False,
activation_function="gelu",
num_mel_bins=128,
max_source_positions=1500,
initializer_range=0.02,
attention_dropout=0.0,
**kwargs,
):
super().__init__(**kwargs)
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.intermediate_size = intermediate_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.scale_embedding = scale_embedding # scale factor will be sqrt(hidden_size) if True
self.activation_function = activation_function
self.num_mel_bins = num_mel_bins
self.max_source_positions = max_source_positions
self.initializer_range = initializer_range
# TODO: @eustlb, we do not use dropout and layerdrop, yet we need to hardcode them
# to be able to use Whisper with modular (here actually from Qwen2-Audio and copied from).
# After a future Whisper refactor, we should remove this.
self.dropout = 0.0
self.layerdrop = 0.0
self.activation_dropout = 0.0
self.attention_dropout = attention_dropout
class VoxtralConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`VoxtralForConditionalGeneration`]. It is used to instantiate a
Voxtral model according to the specified arguments, defining the model architecture. Instantiating a configuration
with the defaults will yield a similar configuration to that of the Voxtral-Mini-3B.
e.g. [mistralai/Voxtral-Mini-3B-2507](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507)
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
audio_config (`Union[AutoConfig, dict]`, *optional*):
The config object or dictionary of the audio encoder.
text_config (`Union[AutoConfig, dict]`, *optional*):
The config object or dictionary of the text model.
audio_token_id (`int`, *optional*):
The audio token index to encode the audio prompt.
projector_hidden_act (`str`, *optional*, defaults to `"gelu"`):
The activation function (function or string) in the multi-modal projector.
```python
>>> from transformers import VoxtralForConditionalGeneration, VoxtralConfig
>>> # Initializing a Voxtral configuration
>>> configuration = VoxtralConfig(audio_token_id=24, projector_hidden_act="gelu")
>>> # Initializing a 3B model with random weights
>>> model = VoxtralForConditionalGeneration(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = "voxtral"
sub_configs = {"text_config": AutoConfig, "audio_config": AutoConfig}
_default_text_config_kwargs = {
"vocab_size": 131072,
"hidden_size": 3072,
"intermediate_size": 8192,
"num_hidden_layers": 30,
"num_key_value_heads": 8,
"max_position_embeddings": 131072,
"rms_norm_eps": 1e-05,
"use_cache": True,
"rope_theta": 100000000.0,
"head_dim": 128,
}
def __init__(
self,
audio_config=None,
text_config=None,
audio_token_id=None,
projector_hidden_act="gelu",
**kwargs,
):
if isinstance(audio_config, dict):
audio_config["model_type"] = (
audio_config["model_type"] if "model_type" in audio_config else "voxtral_encoder"
)
audio_config = CONFIG_MAPPING[audio_config["model_type"]](**audio_config)
elif audio_config is None:
audio_config = CONFIG_MAPPING["voxtral_encoder"]()
self.audio_config = audio_config
if isinstance(text_config, dict):
text_config["model_type"] = text_config["model_type"] if "model_type" in text_config else "llama"
text_config = CONFIG_MAPPING[text_config["model_type"]](
**{**self._default_text_config_kwargs, **text_config}
)
elif text_config is None:
text_config = CONFIG_MAPPING["llama"](**self._default_text_config_kwargs)
self.text_config = text_config
self.vocab_size = text_config.vocab_size
self.hidden_size = text_config.hidden_size
self.audio_token_id = audio_token_id
self.projector_hidden_act = projector_hidden_act
super().__init__(**kwargs)
__all__ = ["VoxtralEncoderConfig", "VoxtralConfig"]
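When `text_config` is passed as a dict, `VoxtralConfig.__init__` merges it over `_default_text_config_kwargs` with `{**defaults, **text_config}`. Here is a small sketch of that merge on its own, using a subset of the defaults above and made-up user overrides:

```python
# Subset of the defaults listed in `_default_text_config_kwargs`.
default_text_config_kwargs = {
    "vocab_size": 131072,
    "hidden_size": 3072,
    "num_hidden_layers": 30,
}

# Hypothetical user-supplied overrides.
user_text_config = {"num_hidden_layers": 2, "model_type": "llama"}

# Later entries win: user values override defaults, missing keys fall back to them.
merged = {**default_text_config_kwargs, **user_text_config}
print(merged)
```

A partial `text_config` therefore still yields the 3B-style defaults for every key the caller leaves out.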


@@ -0,0 +1,302 @@
# coding=utf-8
# Copyright 2025 HuggingFace Inc. team. All rights reserved.
#
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import gc
import json
import os
import re
import torch
from safetensors.torch import load_file
from transformers import (
MistralCommonTokenizer,
VoxtralConfig,
VoxtralForConditionalGeneration,
VoxtralProcessor,
WhisperFeatureExtractor,
)
from transformers.models.whisper.modeling_whisper import sinusoids
from transformers.utils.hub import cached_file
# fmt: off
STATE_DICT_MAPPING = {
# Text model keys
r"^output.weight": r"language_model.lm_head.weight",
r"^norm.weight": r"language_model.model.norm.weight",
r"^tok_embeddings.weight": r"language_model.model.embed_tokens.weight",
r"^layers.(\d+).attention_norm.weight": r"language_model.model.layers.\1.input_layernorm.weight",
r"^layers.(\d+).ffn_norm.weight": r"language_model.model.layers.\1.post_attention_layernorm.weight",
r"^layers.(\d+).attention.w(q|k|v|o).weight": r"language_model.model.layers.\1.self_attn.\2_proj.weight",
r"^layers.(\d+).feed_forward.w1.weight": r"language_model.model.layers.\1.mlp.gate_proj.weight",
r"^layers.(\d+).feed_forward.w2.weight": r"language_model.model.layers.\1.mlp.down_proj.weight",
r"^layers.(\d+).feed_forward.w3.weight": r"language_model.model.layers.\1.mlp.up_proj.weight",
r"mm_whisper_embeddings.tok_embeddings.weight": r"language_model.model.embed_tokens.weight",
# audio model keys
r"mm_whisper_embeddings.whisper_encoder\.conv_layers\.0\.(weight|bias)": r"audio_tower.conv1.\1",
r"mm_whisper_embeddings.whisper_encoder\.conv_layers\.1\.(weight|bias)": r"audio_tower.conv2.\1",
r"mm_whisper_embeddings.whisper_encoder\.transformer\.norm\.(weight|bias)": r"audio_tower.layer_norm.\1",
r"mm_whisper_embeddings.whisper_encoder\.transformer\.layers\.(\d+)\.attention\.w([qkv])\.(weight|bias)": r"audio_tower.layers.\1.self_attn.\2_proj.\3",
r"mm_whisper_embeddings.whisper_encoder\.transformer\.layers\.(\d+)\.attention\.wo\.(weight|bias)": r"audio_tower.layers.\1.self_attn.out_proj.\2",
r"mm_whisper_embeddings.whisper_encoder\.transformer\.layers\.(\d+)\.attention_norm\.(weight|bias)": r"audio_tower.layers.\1.self_attn_layer_norm.\2",
r"mm_whisper_embeddings.whisper_encoder\.transformer\.layers\.(\d+)\.feed_forward\.w1\.(weight|bias)": r"audio_tower.layers.\1.fc1.\2",
r"mm_whisper_embeddings.whisper_encoder\.transformer\.layers\.(\d+)\.feed_forward\.w2\.(weight|bias)": r"audio_tower.layers.\1.fc2.\2",
r"mm_whisper_embeddings.whisper_encoder\.transformer\.layers\.(\d+)\.ffn_norm\.(weight|bias)": r"audio_tower.layers.\1.final_layer_norm.\2",
r"mm_whisper_embeddings.audio_language_projection\.0\.weight": r"multi_modal_projector.linear_1.weight",
r"mm_whisper_embeddings.audio_language_projection\.2\.weight": r"multi_modal_projector.linear_2.weight",
}
# fmt: on
def convert_config(original_config: dict, max_position_embeddings: int = 131072):
original_audio_config = original_config.pop("multimodal")
original_audio_config = original_audio_config["whisper_model_args"]["encoder_args"]
original_text_config = original_config
# Text config
text_key_mapping = {
"hidden_size": "dim",
"num_hidden_layers": "n_layers",
"intermediate_size": "hidden_dim",
"num_attention_heads": "n_heads",
"num_key_value_heads": "n_kv_heads",
"rms_norm_eps": "norm_eps",
}
similar_text_keys_to_keep = [
"head_dim",
"vocab_size",
"rope_theta",
]
new_text_config_kwargs = {k: original_text_config[v] for k, v in text_key_mapping.items()}
new_text_config_kwargs.update({k: v for k, v in original_text_config.items() if k in similar_text_keys_to_keep})
# These are not always defined depending on `params.json`
new_text_config_kwargs["sliding_window"] = original_text_config.get("sliding_window", None)
new_text_config_kwargs["max_position_embeddings"] = original_text_config.get(
"max_seq_len", max_position_embeddings
)
# This may sometimes be a string in `params.json`
if new_text_config_kwargs["sliding_window"] is not None:
new_text_config_kwargs["sliding_window"] = int(new_text_config_kwargs["sliding_window"])
# Audio config
audio_key_mapping = {
"hidden_size": "dim",
"num_hidden_layers": "n_layers",
"intermediate_size": "hidden_dim",
"num_attention_heads": "n_heads",
"num_key_value_heads": "n_heads",
}
similar_audio_keys_to_keep = [
"head_dim",
"vocab_size",
]
new_audio_config_kwargs = {k: original_audio_config[v] for k, v in audio_key_mapping.items()}
new_audio_config_kwargs.update({k: v for k, v in original_audio_config.items() if k in similar_audio_keys_to_keep})
new_config = VoxtralConfig(
audio_config=new_audio_config_kwargs,
text_config=new_text_config_kwargs,
audio_token_id=24,
projector_hidden_act="gelu",
)
return new_config
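The text-config part of `convert_config` renames keys through a mapping, copies a few keys unchanged, and normalizes optional entries. The sketch below replays those steps on a made-up dict standing in for an original `params.json`; the values are illustrative only:

```python
text_key_mapping = {
    "hidden_size": "dim",
    "num_hidden_layers": "n_layers",
    "rms_norm_eps": "norm_eps",
}
similar_text_keys_to_keep = ["head_dim", "vocab_size", "rope_theta"]

# Made-up values; note `sliding_window` arriving as a string and no `rope_theta`.
original_text_config = {
    "dim": 3072,
    "n_layers": 30,
    "norm_eps": 1e-5,
    "head_dim": 128,
    "vocab_size": 131072,
    "sliding_window": "4096",
}

# Rename keys (HF name -> original name), then copy the keys that share a name.
new_kwargs = {k: original_text_config[v] for k, v in text_key_mapping.items()}
new_kwargs.update({k: v for k, v in original_text_config.items() if k in similar_text_keys_to_keep})

# Optional entries: cast `sliding_window` to int, fall back for `max_seq_len`.
sliding_window = original_text_config.get("sliding_window", None)
new_kwargs["sliding_window"] = int(sliding_window) if sliding_window is not None else None
new_kwargs["max_position_embeddings"] = original_text_config.get("max_seq_len", 131072)
print(new_kwargs)
```

Keys absent from the source dict, such as `rope_theta` here, are simply not copied.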
def map_old_key_to_new(old_key):
    """Map a key of the original state dict to the equivalent key in HF format."""
    for pattern, replacement in STATE_DICT_MAPPING.items():
        new_key, n_replace = re.subn(pattern, replacement, old_key)
        # Early exit of the loop
        if n_replace > 0:
            return new_key
    raise ValueError(f"Key: {old_key} could not be mapped (check the mapping).")
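To see how the first-match `re.subn` loop behaves, here is a self-contained sketch using three patterns copied from `STATE_DICT_MAPPING`. The dict name `DEMO_MAPPING`, the function name `map_key`, and the example keys are chosen for illustration:

```python
import re

# Three entries copied from STATE_DICT_MAPPING above.
DEMO_MAPPING = {
    r"^layers.(\d+).attention.w(q|k|v|o).weight": r"language_model.model.layers.\1.self_attn.\2_proj.weight",
    r"mm_whisper_embeddings.whisper_encoder\.conv_layers\.0\.(weight|bias)": r"audio_tower.conv1.\1",
    r"mm_whisper_embeddings.whisper_encoder\.transformer\.layers\.(\d+)\.attention\.w([qkv])\.(weight|bias)": r"audio_tower.layers.\1.self_attn.\2_proj.\3",
}


def map_key(old_key: str) -> str:
    """Return the renamed key from the first pattern that matches."""
    for pattern, replacement in DEMO_MAPPING.items():
        new_key, n_replace = re.subn(pattern, replacement, old_key)
        if n_replace > 0:
            return new_key
    raise ValueError(f"Key: {old_key} could not be mapped (check the mapping).")


text_key = map_key("layers.3.attention.wq.weight")
audio_key = map_key("mm_whisper_embeddings.whisper_encoder.transformer.layers.5.attention.wk.bias")
print(text_key)
print(audio_key)
```

The `^` anchor on the text-model pattern stops it from matching the `layers.5.attention` substring inside the audio key, so the audio pattern handles that key.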
def permute_for_rope(tensor, n_heads, dim1, dim2):
    """Permute the weights for the ROPE formulation."""
    tensor = tensor.view(n_heads, dim1 // n_heads // 2, 2, dim2)
    tensor = tensor.transpose(1, 2)
    tensor = tensor.reshape(dim1, dim2)
    return tensor
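The view/transpose/reshape in `permute_for_rope` only reorders rows of the weight matrix. Within each head, the even-indexed rows come first and the odd-indexed rows follow. The sketch below computes that row order in pure Python, without torch. `rope_row_order` is a hypothetical helper derived from the index arithmetic of the reshape, not part of the script:

```python
def rope_row_order(n_heads: int, dim1: int) -> list[int]:
    """Row order after view(n_heads, dim1 // n_heads // 2, 2, -1).transpose(1, 2).reshape(dim1, -1)."""
    head_dim = dim1 // n_heads
    half = head_dim // 2
    # Original row index of element [h, i, j] is h * head_dim + 2 * i + j;
    # after the transpose, rows are flattened in (h, j, i) order.
    return [h * head_dim + 2 * i + j for h in range(n_heads) for j in range(2) for i in range(half)]


order = rope_row_order(n_heads=2, dim1=8)
print(order)
```

New row `r` of the permuted tensor is original row `order[r]`, so the result is always a permutation of `range(dim1)`.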
def convert_state_dict(original_state_dict, config):
    """Convert a state dict file, when a single `nn.Module` is never sharded in different files (usual case)."""
    new_dict = {}
    num_attention_heads = config.num_attention_heads
    hidden_size = config.hidden_size
    head_dim = config.head_dim
    num_key_value_heads = config.num_key_value_heads
    key_value_dim = head_dim * num_key_value_heads
    query_dim = head_dim * num_attention_heads
    for old_key, tensor in original_state_dict.items():
        new_key = map_old_key_to_new(old_key)
        if "audio_tower" not in new_key:
            if "q_proj" in new_key:
                tensor = tensor.view(num_attention_heads, head_dim, hidden_size).reshape(query_dim, hidden_size)
                tensor = permute_for_rope(tensor, num_attention_heads, query_dim, hidden_size)
            elif "k_proj" in new_key:
                tensor = tensor.view(num_key_value_heads, head_dim, hidden_size).reshape(key_value_dim, hidden_size)
                tensor = permute_for_rope(tensor, num_key_value_heads, key_value_dim, hidden_size)
            elif "v_proj" in new_key:
                tensor = tensor.view(num_key_value_heads, head_dim, hidden_size).reshape(key_value_dim, hidden_size)
        new_dict[new_key] = tensor
    return new_dict
def write_model(
input_path_or_repo,
model_name,
config_name,
output_dir,
safe_serialization=True,
):
print("Converting the model.")
os.makedirs(output_dir, exist_ok=True)
# --------------
# convert config
# --------------
config_path = cached_file(input_path_or_repo, config_name)
with open(config_path, "r") as f:
original_config = json.load(f)
config = convert_config(original_config)
model = VoxtralForConditionalGeneration(config)
# ---------------
# convert weights
# ---------------
model_path = cached_file(input_path_or_repo, model_name)
print(f"Fetching all parameters from the checkpoint at {model_path}...")
state_dict = load_file(model_path)
print("Converting model...")
converted_state_dict = convert_state_dict(state_dict, config.text_config)
# we need to add embed positions as they are not in the state dict
with torch.no_grad(), torch.device("cuda"):
# TODO: @eustlb, the position embeddings are created on GPU here:
# vLLM initializes them on device, whereas we save them in the state dict
embed_positions_weight = sinusoids(config.audio_config.max_source_positions, config.audio_config.hidden_size)
converted_state_dict["audio_tower.embed_positions.weight"] = embed_positions_weight.cpu()
# -------------------------
# load the weights and save
# -------------------------
print("Loading the checkpoint in a Voxtral model.")
with torch.device("meta"):
model = VoxtralForConditionalGeneration(config)
model.load_state_dict(converted_state_dict, strict=True, assign=True)
print("Checkpoint loaded successfully.")
del model.config._name_or_path
del model.generation_config._from_model_config
model.generation_config.pad_token_id = 11
print("Saving the model.")
model.save_pretrained(output_dir, safe_serialization=safe_serialization)
del state_dict, model
# Safety check: reload the converted model
gc.collect()
print("Reloading the model to check if it's saved correctly.")
VoxtralForConditionalGeneration.from_pretrained(output_dir, torch_dtype=torch.bfloat16, device_map="auto")
print("Model reloaded successfully.")
def write_processor(input_path_or_repo: str, feature_extractor_path_or_repo: str, output_dir: str):
tokenizer = MistralCommonTokenizer.from_pretrained(input_path_or_repo)
feature_extractor = WhisperFeatureExtractor.from_pretrained(feature_extractor_path_or_repo)
print("Creating the processor...")
# Create the processor and save it
processor = VoxtralProcessor(
feature_extractor=feature_extractor,
tokenizer=tokenizer,
)
processor.save_pretrained(output_dir)
print("Processor saved successfully.")
def main():
parser = argparse.ArgumentParser(description="Convert Voxtral weights to Hugging Face format")
parser.add_argument(
"--input_path_or_repo",
type=str,
required=True,
help="Path or repo containing Voxtral weights",
)
parser.add_argument(
"--model_name",
type=str,
required=True,
help="Name of the model in input_path_or_repo",
)
parser.add_argument(
"--config_name",
type=str,
required=True,
help="Name of the config in input_path_or_repo",
)
parser.add_argument(
"--feature_extractor_path_or_repo",
type=str,
required=True,
help="Path or repo containing the feature extractor",
)
parser.add_argument(
"--output_dir",
help="Location to write HF model and tokenizer",
)
parser.add_argument(
"--safe_serialization", action=argparse.BooleanOptionalAction, default=True, help="Whether or not to save using `safetensors`."
)
args = parser.parse_args()
write_model(
args.input_path_or_repo,
args.model_name,
args.config_name,
args.output_dir,
safe_serialization=args.safe_serialization,
)
write_processor(
args.input_path_or_repo,
args.feature_extractor_path_or_repo,
args.output_dir,
)
if __name__ == "__main__":
main()
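
As a rough illustration of how `map_old_key_to_new` resolves checkpoint keys, the sketch below runs the same first-match `re.subn` loop over a small mapping. The two patterns in `EXAMPLE_MAPPING`, the sample key names, and the helper name `map_key` are hypothetical stand-ins; the real `STATE_DICT_MAPPING` is defined earlier in the conversion script and is not reproduced here.

```python
import re

# Hypothetical stand-in for STATE_DICT_MAPPING: regex pattern -> replacement.
EXAMPLE_MAPPING = {
    r"^layers\.(\d+)\.attention\.wq\.weight$": r"language_model.model.layers.\1.self_attn.q_proj.weight",
    r"^norm\.weight$": r"language_model.model.norm.weight",
}


def map_key(old_key, mapping=EXAMPLE_MAPPING):
    """Return the rewritten key for the first pattern that matches, as in the script above."""
    for pattern, replacement in mapping.items():
        new_key, n_replace = re.subn(pattern, replacement, old_key)
        # Stop at the first pattern that actually rewrote something
        if n_replace > 0:
            return new_key
    raise ValueError(f"Key: {old_key} could not be mapped (check the mapping).")


mapped = map_key("layers.3.attention.wq.weight")
print(mapped)
```

Because the loop returns on the first match, the order of the mapping matters whenever two patterns could match the same key; anchoring patterns with `^` and `$`, as assumed here, is one way to keep them from overlapping.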


@@ -0,0 +1,542 @@
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
# This file was automatically generated from src/transformers/models/voxtral/modular_voxtral.py.
# Do NOT edit this file manually as any edits will be overwritten by the generation of
# the file from the modular. If any change should be done, please apply the change to the
# modular_voxtral.py file directly. One of our CI enforces this.
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
# coding=utf-8
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
from typing import Callable, Optional, Union
import torch
from torch import nn
from ...activations import ACT2FN
from ...cache_utils import Cache
from ...generation import GenerationMixin
from ...modeling_layers import GradientCheckpointingLayer
from ...modeling_outputs import BaseModelOutput, BaseModelOutputWithPast, CausalLMOutputWithPast
from ...modeling_utils import ALL_ATTENTION_FUNCTIONS, PreTrainedModel
from ...processing_utils import Unpack
from ...utils import TransformersKwargs, auto_docstring, can_return_tuple, logging
from ...utils.generic import check_model_inputs
from ..auto import AutoModel, AutoModelForCausalLM
from .configuration_voxtral import VoxtralConfig, VoxtralEncoderConfig
logger = logging.get_logger(__name__)
def eager_attention_forward(
module: nn.Module,
query: torch.Tensor,
key: torch.Tensor,
value: torch.Tensor,
attention_mask: Optional[torch.Tensor],
scaling: Optional[float] = None,
dropout: float = 0.0,
head_mask: Optional[torch.Tensor] = None,
**kwargs,
):
if scaling is None:
scaling = query.size(-1) ** -0.5
attn_weights = torch.matmul(query, key.transpose(2, 3)) * scaling
if attention_mask is not None and attention_mask.ndim == 4:
attn_weights = attn_weights + attention_mask[:, :, :, : key.shape[-2]]
attn_weights = nn.functional.softmax(attn_weights, dim=-1)
if head_mask is not None:
attn_weights = attn_weights * head_mask.view(1, -1, 1, 1)
attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
attn_output = torch.matmul(attn_weights, value)
attn_output = attn_output.transpose(1, 2).contiguous()
return attn_output, attn_weights
class VoxtralAttention(nn.Module):
"""Multi-headed attention from 'Attention Is All You Need' paper"""
def __init__(
self,
embed_dim: int,
num_heads: int,
dropout: float = 0.0,
is_decoder: bool = False,
bias: bool = True,
is_causal: bool = False,
layer_idx: Optional[int] = None,
config: Optional[VoxtralConfig] = None,
):
super().__init__()
self.embed_dim = embed_dim
self.num_heads = num_heads
self.dropout = dropout
self.head_dim = embed_dim // num_heads
self.config = config
if (self.head_dim * num_heads) != self.embed_dim:
raise ValueError(
f"embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim}"
f" and `num_heads`: {num_heads})."
)
self.scaling = self.head_dim**-0.5
self.is_decoder = is_decoder
self.is_causal = is_causal
if layer_idx is None and is_decoder:
logger.warning_once(
f"Instantiating a decoder {self.__class__.__name__} without passing `layer_idx` is not recommended and "
"will lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` "
"when creating this class."
)
self.layer_idx = layer_idx
self.k_proj = nn.Linear(embed_dim, embed_dim, bias=False)
self.v_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
self.q_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
self.out_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
layer_head_mask: Optional[torch.Tensor] = None,
output_attentions: bool = False,
**kwargs,
) -> tuple[torch.Tensor, Optional[torch.Tensor], Optional[tuple[torch.Tensor]]]:
"""Input shape: Batch x Time x Channel"""
bsz, tgt_len, _ = hidden_states.size()
# Scaling is susceptible to floating point arithmetic imprecisions
# which can lead to different results (this is dependent from model
# to model, e.g. whisper is one such case). We therefore keep the
# original order of scaling to follow the original implementation
# and enforce no scaling (1.0) in the attention call below.
query_states = self._shape(self.q_proj(hidden_states) * self.scaling, tgt_len, bsz)
key_states = self._shape(self.k_proj(hidden_states), -1, bsz)
value_states = self._shape(self.v_proj(hidden_states), -1, bsz)
attention_interface: Callable = eager_attention_forward
if self.config._attn_implementation != "eager":
attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
attn_output, attn_weights = attention_interface(
self,
query_states,
key_states,
value_states,
attention_mask,
dropout=0.0 if not self.training else self.dropout,
scaling=1.0,
output_attentions=output_attentions,
head_mask=layer_head_mask,
**kwargs,
)
attn_output = attn_output.reshape(bsz, tgt_len, -1).contiguous()
attn_output = self.out_proj(attn_output)
return attn_output, attn_weights
class VoxtralEncoderLayer(GradientCheckpointingLayer):
def __init__(self, config: VoxtralConfig):
super().__init__()
self.embed_dim = config.d_model
self.self_attn = VoxtralAttention(
embed_dim=self.embed_dim,
num_heads=config.encoder_attention_heads,
dropout=config.attention_dropout,
config=config,
)
self.self_attn_layer_norm = nn.LayerNorm(self.embed_dim)
self.dropout = config.dropout
self.activation_fn = ACT2FN[config.activation_function]
self.activation_dropout = config.activation_dropout
self.fc1 = nn.Linear(self.embed_dim, config.encoder_ffn_dim)
self.fc2 = nn.Linear(config.encoder_ffn_dim, self.embed_dim)
self.final_layer_norm = nn.LayerNorm(self.embed_dim)
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: torch.Tensor,
layer_head_mask: torch.Tensor,
output_attentions: bool = False,
) -> torch.Tensor:
"""
Args:
hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
attention_mask (`torch.FloatTensor`): attention mask of size
`(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
layer_head_mask (`torch.FloatTensor`): mask for attention heads in a given layer of size
`(encoder_attention_heads,)`.
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under
returned tensors for more detail.
"""
residual = hidden_states
hidden_states = self.self_attn_layer_norm(hidden_states)
hidden_states, attn_weights = self.self_attn(
hidden_states=hidden_states,
attention_mask=attention_mask,
layer_head_mask=layer_head_mask,
output_attentions=output_attentions,
)
hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)
hidden_states = residual + hidden_states
residual = hidden_states
hidden_states = self.final_layer_norm(hidden_states)
hidden_states = self.activation_fn(self.fc1(hidden_states))
hidden_states = nn.functional.dropout(hidden_states, p=self.activation_dropout, training=self.training)
hidden_states = self.fc2(hidden_states)
hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)
hidden_states = residual + hidden_states
if hidden_states.dtype == torch.float16:
clamp_value = torch.finfo(hidden_states.dtype).max - 1000
hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value)
return hidden_states, attn_weights
@auto_docstring
class VoxtralPreTrainedModel(PreTrainedModel):
config: VoxtralConfig
base_model_prefix = "model"
supports_gradient_checkpointing = True
_no_split_modules = ["VoxtralAttention"]
_skip_keys_device_placement = "past_key_values"
_supports_flash_attn = True
_supports_sdpa = True
_supports_flex_attn = True
_supports_cache_class = True
_supports_attention_backend = True
_supports_static_cache = True
def _init_weights(self, module):
# important: this ported version of Voxtral isn't meant for training from scratch - only
# inference and fine-tuning - so the proper init weights code has been removed
std = (
self.config.initializer_range
if hasattr(self.config, "initializer_range")
else self.config.audio_config.initializer_range
)
if isinstance(module, (nn.Linear, nn.Conv1d)):
module.weight.data.normal_(mean=0.0, std=std)
if module.bias is not None:
module.bias.data.zero_()
elif isinstance(module, nn.LayerNorm):
module.weight.data.fill_(1.0)
module.bias.data.zero_()
elif isinstance(module, nn.Embedding):
module.weight.data.normal_(mean=0.0, std=std)
if module.padding_idx is not None:
module.weight.data[module.padding_idx].zero_()
@auto_docstring(
custom_intro="""
The Voxtral encoder, which is a Whisper encoder.
"""
)
class VoxtralEncoder(VoxtralPreTrainedModel):
"""
Transformer encoder consisting of *config.encoder_layers* self attention layers. Each layer is a
[`VoxtralEncoderLayer`].
Args:
config: VoxtralEncoderConfig
"""
# Ignore copy
config: VoxtralEncoderConfig
main_input_name = "input_features"
_no_split_modules = ["VoxtralEncoderLayer"]
_can_record_outputs = {
"attentions": VoxtralAttention,
"hidden_states": VoxtralEncoderLayer,
}
def __init__(self, config: VoxtralEncoderConfig):
super().__init__(config)
self.dropout = config.dropout
self.layerdrop = config.encoder_layerdrop
embed_dim = config.d_model
self.num_mel_bins = config.num_mel_bins
self.padding_idx = config.pad_token_id
self.max_source_positions = config.max_source_positions
self.embed_scale = math.sqrt(embed_dim) if config.scale_embedding else 1.0
self.conv1 = nn.Conv1d(self.num_mel_bins, embed_dim, kernel_size=3, padding=1)
self.conv2 = nn.Conv1d(embed_dim, embed_dim, kernel_size=3, stride=2, padding=1)
self.embed_positions = nn.Embedding(self.max_source_positions, embed_dim)
self.embed_positions.requires_grad_(False)
self.layers = nn.ModuleList([VoxtralEncoderLayer(config) for _ in range(config.encoder_layers)])
self.layer_norm = nn.LayerNorm(config.d_model)
# Ignore copy
self.avg_pooler = nn.AvgPool1d(2, stride=2)
self.gradient_checkpointing = False
# Initialize weights and apply final processing
self.post_init()
def _freeze_parameters(self):
for param in self.parameters():
param.requires_grad = False
self._requires_grad = False
def get_input_embeddings(self) -> nn.Module:
return self.conv1
def set_input_embeddings(self, value: nn.Module):
self.conv1 = value
@check_model_inputs
def forward(
self,
input_features,
attention_mask=None,
**kwargs: Unpack[TransformersKwargs],
):
r"""
Args:
input_features (`torch.FloatTensor` of shape `(batch_size, feature_size, sequence_length)`):
Float values of mel features extracted from the raw speech waveform. Raw speech waveform can be
obtained by loading a `.flac` or `.wav` audio file into an array of type `list[float]` or a
`numpy.ndarray`, *e.g.* via the soundfile library (`pip install soundfile`). To prepare the array into
`input_features`, the [`AutoFeatureExtractor`] should be used for extracting the mel features, padding
and conversion into a tensor of type `torch.FloatTensor`. See [`~WhisperFeatureExtractor.__call__`]
attention_mask (`torch.Tensor`, *optional*):
Voxtral does not support masking of the `input_features`; this argument is preserved for compatibility
but is not used. By default, silence in the input log mel spectrogram is ignored.
"""
expected_seq_length = self.config.max_source_positions * self.conv1.stride[0] * self.conv2.stride[0]
if input_features.shape[-1] != expected_seq_length:
raise ValueError(
f"Voxtral expects the mel input features to be of length {expected_seq_length}, but found {input_features.shape[-1]}. Make sure to pad the input mel features to {expected_seq_length}."
)
input_features = input_features.to(dtype=self.conv1.weight.dtype, device=self.conv1.weight.device)
inputs_embeds = nn.functional.gelu(self.conv1(input_features))
inputs_embeds = nn.functional.gelu(self.conv2(inputs_embeds))
inputs_embeds = inputs_embeds.permute(0, 2, 1)
embed_pos = self.embed_positions.weight
hidden_states = (inputs_embeds + embed_pos).to(inputs_embeds.dtype)
hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)
for idx, encoder_layer in enumerate(self.layers):
layer_outputs = encoder_layer(
hidden_states,
attention_mask=attention_mask,
layer_head_mask=None,
)
hidden_states = layer_outputs[0]
hidden_states = self.layer_norm(hidden_states)
return BaseModelOutput(
last_hidden_state=hidden_states,
)
# Ignore copy
def _get_feat_extract_output_lengths(self, input_lengths: torch.LongTensor):
"""
Computes the output length of the convolutional layers and the output length of the audio encoder
"""
input_lengths = (input_lengths - 1) // 2 + 1
output_lengths = (input_lengths - 2) // 2 + 1
return input_lengths, output_lengths
class VoxtralMultiModalProjector(nn.Module):
def __init__(self, config: VoxtralConfig):
super().__init__()
self.linear_1 = nn.Linear(config.audio_config.intermediate_size, config.text_config.hidden_size, bias=False)
self.act = ACT2FN[config.projector_hidden_act]
self.linear_2 = nn.Linear(config.text_config.hidden_size, config.text_config.hidden_size, bias=False)
def forward(self, audio_features):
hidden_states = self.linear_1(audio_features)
hidden_states = self.act(hidden_states)
hidden_states = self.linear_2(hidden_states)
return hidden_states
@auto_docstring(
custom_intro="""
The Voxtral model, which consists of a Whisper encoder, a multi-modal projector and a Llama language model.
"""
)
class VoxtralForConditionalGeneration(VoxtralPreTrainedModel, GenerationMixin):
_tied_weights_keys = ["lm_head.weight"]
_tp_plan = {"lm_head": "colwise_rep"}
_pp_plan = {"lm_head": (["hidden_states"], ["logits"])}
_keep_in_fp32_modules_strict = ["embed_positions"]
def __init__(self, config):
super().__init__(config)
self.vocab_size = config.text_config.vocab_size
self.audio_tower = AutoModel.from_config(config.audio_config)
self.language_model = AutoModelForCausalLM.from_config(config.text_config)
self.multi_modal_projector = VoxtralMultiModalProjector(config)
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self):
return self.language_model.get_input_embeddings()
def set_input_embeddings(self, value):
self.language_model.set_input_embeddings(value)
def get_output_embeddings(self):
return self.language_model.get_output_embeddings()
def set_output_embeddings(self, new_embeddings):
self.language_model.set_output_embeddings(new_embeddings)
def set_decoder(self, decoder):
self.language_model.set_decoder(decoder)
def get_decoder(self):
return self.language_model.get_decoder()
def get_audio_embeds(self, input_features: torch.FloatTensor):
"""
This method is used to get the audio embeddings from input features (a log mel spectrogram), i.e. by running the audio encoder followed by the multi-modal projector.
Args:
input_features (`torch.FloatTensor`):
Float values of mel features extracted from the raw speech waveform. Raw speech waveform can be
obtained by loading a `.flac` or `.wav` audio file into an array of type `list[float]` or a
`numpy.ndarray`, *e.g.* via the soundfile library (`pip install soundfile`). To prepare the array into
`input_features`, the [`AutoFeatureExtractor`] should be used for extracting the mel features, padding
and conversion into a tensor of type `torch.FloatTensor`. See [`~WhisperFeatureExtractor.__call__`]
Returns:
`torch.FloatTensor`:
The audio embeddings.
"""
audio_outputs = self.audio_tower(input_features)
audio_hidden_states = audio_outputs.last_hidden_state
audio_hidden_states = audio_hidden_states.reshape(-1, self.config.audio_config.intermediate_size)
audio_embeds = self.multi_modal_projector(audio_hidden_states)
return audio_embeds
@can_return_tuple
@auto_docstring
def forward(
self,
input_ids: Optional[torch.LongTensor] = None,
input_features: Optional[torch.FloatTensor] = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[Cache] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
cache_position: Optional[torch.LongTensor] = None,
logits_to_keep: Union[int, torch.Tensor] = 0,
**kwargs: Unpack[TransformersKwargs],
) -> CausalLMOutputWithPast:
r"""
Example:
```python
>>> from transformers import VoxtralForConditionalGeneration, AutoProcessor
>>> import torch
>>> device = "cuda" if torch.cuda.is_available() else "cpu"
>>> repo_id = "mistralai/Voxtral-Mini-3B-2507"
>>> processor = AutoProcessor.from_pretrained(repo_id)
>>> model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)
>>> conversation = [
{
"role": "user",
"content": [
{
"type": "audio",
"url": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/dude_where_is_my_car.wav",
},
{"type": "text", "text": "What can you tell me about this audio?"},
],
}
]
>>> inputs = processor.apply_chat_template(conversation)
>>> inputs = inputs.to(device, dtype=torch.bfloat16)
>>> outputs = model.generate(**inputs, max_new_tokens=30)
>>> processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
["This audio is a humorous conversation between two friends, likely in English, where one of them is trying to figure out what the other's tattoo says."]
```"""
if inputs_embeds is None:
inputs_embeds = self.get_input_embeddings()(input_ids)
if input_features is not None:
audio_embeds = self.get_audio_embeds(input_features)
# replace text-audio token placeholders with audio embeddings
audio_token_mask = input_ids == self.config.audio_token_id
inputs_embeds[audio_token_mask] = audio_embeds
outputs: BaseModelOutputWithPast = self.language_model(
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
labels=labels,
use_cache=use_cache,
cache_position=cache_position,
logits_to_keep=logits_to_keep,
**kwargs,
)
return outputs
def prepare_inputs_for_generation(self, *args, **kwargs):
# Overwritten -- we should not pass input_features when we are in cached decoding stage
input_features = kwargs.pop("input_features", None)
cache_position = kwargs.get("cache_position")
model_inputs = super().prepare_inputs_for_generation(*args, **kwargs)
if cache_position is not None and cache_position[0] == 0:
# input_features should only be passed when we are not in cached decoding stage
model_inputs["input_features"] = input_features
return model_inputs
__all__ = ["VoxtralPreTrainedModel", "VoxtralEncoder", "VoxtralForConditionalGeneration"]
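
The length bookkeeping in `VoxtralEncoder` can be checked with plain integer arithmetic. The sketch below mirrors `_get_feat_extract_output_lengths` for a single length and recomputes `expected_seq_length` the way `forward` does. The value `MAX_SOURCE_POSITIONS = 1500` is an assumption (Whisper's usual setting); the real value comes from the audio config. The function name `feat_extract_output_lengths` is likewise only illustrative.

```python
def feat_extract_output_lengths(input_length: int) -> tuple[int, int]:
    """Integer version of `_get_feat_extract_output_lengths` for a single length."""
    after_conv = (input_length - 1) // 2 + 1  # conv2: kernel_size=3, stride=2, padding=1
    after_pool = (after_conv - 2) // 2 + 1  # pooling step: window 2, stride 2
    return after_conv, after_pool


# Assumed value; in the model it is read from config.max_source_positions.
MAX_SOURCE_POSITIONS = 1500
# Strides as declared in the encoder: conv1 uses the default stride of 1, conv2 uses 2.
CONV1_STRIDE, CONV2_STRIDE = 1, 2

expected_seq_length = MAX_SOURCE_POSITIONS * CONV1_STRIDE * CONV2_STRIDE
print(expected_seq_length)
print(feat_extract_output_lengths(expected_seq_length))
```

Under that assumption the encoder requires mel inputs padded to 3000 frames, which the strided convolution reduces to 1500 positions, matching the size of the position embedding table added in `forward`.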


@@ -0,0 +1,276 @@
# coding=utf-8
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import Optional, Union
import torch
from torch import nn
from ...activations import ACT2FN
from ...cache_utils import Cache
from ...generation import GenerationMixin
from ...modeling_outputs import BaseModelOutput, BaseModelOutputWithPast, CausalLMOutputWithPast
from ...processing_utils import Unpack
from ...utils import TransformersKwargs, auto_docstring, can_return_tuple
from ...utils.generic import check_model_inputs
from ..auto import AutoModel, AutoModelForCausalLM
from ..qwen2_audio.modeling_qwen2_audio import (
Qwen2AudioAttention,
Qwen2AudioEncoder,
Qwen2AudioEncoderLayer,
Qwen2AudioPreTrainedModel,
)
from .configuration_voxtral import VoxtralConfig
class VoxtralAttention(Qwen2AudioAttention):
pass
class VoxtralEncoderLayer(Qwen2AudioEncoderLayer):
pass
class VoxtralPreTrainedModel(Qwen2AudioPreTrainedModel):
_supports_flex_attn = True
_supports_cache_class = True
_supports_attention_backend = True
_supports_static_cache = True
# TODO: @eustlb, I would really prefer to use WhisperEncoder but it's messing with modular
@auto_docstring(
custom_intro="""
The Voxtral encoder, which is a Whisper encoder.
"""
)
class VoxtralEncoder(Qwen2AudioEncoder):
_can_record_outputs = {
"attentions": VoxtralAttention,
"hidden_states": VoxtralEncoderLayer,
}
@check_model_inputs
def forward(
self,
input_features,
attention_mask=None,
**kwargs: Unpack[TransformersKwargs],
):
r"""
Args:
input_features (`torch.FloatTensor` of shape `(batch_size, feature_size, sequence_length)`):
Float values of mel features extracted from the raw speech waveform. Raw speech waveform can be
obtained by loading a `.flac` or `.wav` audio file into an array of type `list[float]` or a
`numpy.ndarray`, *e.g.* via the soundfile library (`pip install soundfile`). To prepare the array into
`input_features`, the [`AutoFeatureExtractor`] should be used for extracting the mel features, padding
and conversion into a tensor of type `torch.FloatTensor`. See [`~WhisperFeatureExtractor.__call__`]
attention_mask (`torch.Tensor`, *optional*):
Voxtral does not support masking of the `input_features`; this argument is preserved for compatibility
but is not used. By default, silence in the input log mel spectrogram is ignored.
"""
expected_seq_length = self.config.max_source_positions * self.conv1.stride[0] * self.conv2.stride[0]
if input_features.shape[-1] != expected_seq_length:
raise ValueError(
f"Voxtral expects the mel input features to be of length {expected_seq_length}, but found {input_features.shape[-1]}. Make sure to pad the input mel features to {expected_seq_length}."
)
input_features = input_features.to(dtype=self.conv1.weight.dtype, device=self.conv1.weight.device)
inputs_embeds = nn.functional.gelu(self.conv1(input_features))
inputs_embeds = nn.functional.gelu(self.conv2(inputs_embeds))
inputs_embeds = inputs_embeds.permute(0, 2, 1)
embed_pos = self.embed_positions.weight
hidden_states = (inputs_embeds + embed_pos).to(inputs_embeds.dtype)
hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)
for idx, encoder_layer in enumerate(self.layers):
layer_outputs = encoder_layer(
hidden_states,
attention_mask=attention_mask,
layer_head_mask=None,
)
hidden_states = layer_outputs[0]
hidden_states = self.layer_norm(hidden_states)
return BaseModelOutput(
last_hidden_state=hidden_states,
)
class VoxtralMultiModalProjector(nn.Module):
def __init__(self, config: VoxtralConfig):
super().__init__()
self.linear_1 = nn.Linear(config.audio_config.intermediate_size, config.text_config.hidden_size, bias=False)
self.act = ACT2FN[config.projector_hidden_act]
self.linear_2 = nn.Linear(config.text_config.hidden_size, config.text_config.hidden_size, bias=False)
def forward(self, audio_features):
hidden_states = self.linear_1(audio_features)
hidden_states = self.act(hidden_states)
hidden_states = self.linear_2(hidden_states)
return hidden_states
@auto_docstring(
custom_intro="""
The Voxtral model, which consists of a Whisper encoder, a multi-modal projector and a Llama language model.
"""
)
class VoxtralForConditionalGeneration(VoxtralPreTrainedModel, GenerationMixin):
_tied_weights_keys = ["lm_head.weight"]
_tp_plan = {"lm_head": "colwise_rep"}
_pp_plan = {"lm_head": (["hidden_states"], ["logits"])}
_keep_in_fp32_modules_strict = ["embed_positions"]
def __init__(self, config):
super().__init__(config)
self.vocab_size = config.text_config.vocab_size
self.audio_tower = AutoModel.from_config(config.audio_config)
self.language_model = AutoModelForCausalLM.from_config(config.text_config)
self.multi_modal_projector = VoxtralMultiModalProjector(config)
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self):
return self.language_model.get_input_embeddings()
def set_input_embeddings(self, value):
self.language_model.set_input_embeddings(value)
def get_output_embeddings(self):
return self.language_model.get_output_embeddings()
def set_output_embeddings(self, new_embeddings):
self.language_model.set_output_embeddings(new_embeddings)
def set_decoder(self, decoder):
self.language_model.set_decoder(decoder)
def get_decoder(self):
return self.language_model.get_decoder()
def get_audio_embeds(self, input_features: torch.FloatTensor):
"""
This method is used to get the audio embeddings from input features (a log mel spectrogram), i.e. by running the audio encoder followed by the multi-modal projector.
Args:
input_features (`torch.FloatTensor`):
Float values of mel features extracted from the raw speech waveform. Raw speech waveform can be
obtained by loading a `.flac` or `.wav` audio file into an array of type `list[float]` or a
`numpy.ndarray`, *e.g.* via the soundfile library (`pip install soundfile`). To prepare the array into
`input_features`, the [`AutoFeatureExtractor`] should be used for extracting the mel features, padding
and conversion into a tensor of type `torch.FloatTensor`. See [`~WhisperFeatureExtractor.__call__`]
Returns:
`torch.FloatTensor`:
The audio embeddings.
"""
audio_outputs = self.audio_tower(input_features)
audio_hidden_states = audio_outputs.last_hidden_state
audio_hidden_states = audio_hidden_states.reshape(-1, self.config.audio_config.intermediate_size)
audio_embeds = self.multi_modal_projector(audio_hidden_states)
return audio_embeds
@can_return_tuple
@auto_docstring
def forward(
self,
input_ids: Optional[torch.LongTensor] = None,
input_features: Optional[torch.FloatTensor] = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[Cache] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
cache_position: Optional[torch.LongTensor] = None,
logits_to_keep: Union[int, torch.Tensor] = 0,
**kwargs: Unpack[TransformersKwargs],
) -> CausalLMOutputWithPast:
r"""
Example:
```python
>>> from transformers import VoxtralForConditionalGeneration, AutoProcessor
>>> import torch
>>> device = "cuda" if torch.cuda.is_available() else "cpu"
>>> repo_id = "mistralai/Voxtral-Mini-3B-2507"
>>> processor = AutoProcessor.from_pretrained(repo_id)
>>> model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)
>>> conversation = [
{
"role": "user",
"content": [
{
"type": "audio",
"url": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/dude_where_is_my_car.wav",
},
{"type": "text", "text": "What can you tell me about this audio?"},
],
}
]
>>> inputs = processor.apply_chat_template(conversation)
>>> inputs = inputs.to(device, dtype=torch.bfloat16)
>>> outputs = model.generate(**inputs, max_new_tokens=30)
>>> processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
["This audio is a humorous conversation between two friends, likely in English, where one of them is trying to figure out what the other's tattoo says."]
```"""
if inputs_embeds is None:
inputs_embeds = self.get_input_embeddings()(input_ids)
if input_features is not None:
audio_embeds = self.get_audio_embeds(input_features)
# replace text-audio token placeholders with audio embeddings
audio_token_mask = input_ids == self.config.audio_token_id
inputs_embeds[audio_token_mask] = audio_embeds
outputs: BaseModelOutputWithPast = self.language_model(
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
labels=labels,
use_cache=use_cache,
cache_position=cache_position,
logits_to_keep=logits_to_keep,
**kwargs,
)
return outputs
def prepare_inputs_for_generation(self, *args, **kwargs):
# Overwritten -- we should not pass input_features when we are in cached decoding stage
input_features = kwargs.pop("input_features", None)
cache_position = kwargs.get("cache_position")
model_inputs = super().prepare_inputs_for_generation(*args, **kwargs)
if cache_position is not None and cache_position[0] == 0:
# input_features should only be passed when we are not in cached decoding stage
model_inputs["input_features"] = input_features
return model_inputs
__all__ = ["VoxtralPreTrainedModel", "VoxtralEncoder", "VoxtralForConditionalGeneration"]
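
The placeholder substitution in `forward` above (`inputs_embeds[audio_token_mask] = audio_embeds`) can be illustrated without PyTorch. The following is a minimal sketch for a single sequence using plain Python lists. `merge_audio_embeddings` is a hypothetical helper name that does not exist in the Voxtral code, and the explicit count check is an assumption added for illustration; the model itself relies on the tensor assignment and performs no such check.

```python
def merge_audio_embeddings(input_ids, text_embeds, audio_embeds, audio_token_id):
    """Replace the embedding at every audio placeholder position, in order.

    Hypothetical helper: a list-based stand-in for the boolean-mask assignment
    used in the model's forward pass.
    """
    n_placeholders = sum(1 for tok in input_ids if tok == audio_token_id)
    if n_placeholders != len(audio_embeds):
        raise ValueError(
            f"Found {n_placeholders} audio placeholders but {len(audio_embeds)} audio embeddings."
        )
    audio_iter = iter(audio_embeds)
    # Keep text embeddings as-is; consume one audio embedding per placeholder.
    return [
        next(audio_iter) if tok == audio_token_id else emb
        for tok, emb in zip(input_ids, text_embeds)
    ]


# Token id 24 is the audio placeholder id hard-coded in VoxtralProcessor.__init__.
merged = merge_audio_embeddings(
    input_ids=[1, 24, 24, 5],
    text_embeds=[[0.1], [0.0], [0.0], [0.5]],
    audio_embeds=[[9.0], [8.0]],
    audio_token_id=24,
)
print(merged)  # [[0.1], [9.0], [8.0], [0.5]]
```

The sketch makes the ordering assumption explicit: the i-th audio embedding row lands on the i-th placeholder token, which is why the processor must emit exactly one placeholder per projected audio frame.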

View File

@@ -0,0 +1,449 @@
# coding=utf-8
# Copyright 2025 Sesame and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import io
from typing import Optional, Union
from ...utils import is_mistral_common_available, is_soundfile_available, is_torch_available, logging
if is_torch_available():
import torch
if is_soundfile_available():
import soundfile as sf
if is_mistral_common_available():
from mistral_common.protocol.transcription.request import TranscriptionRequest
from ...audio_utils import AudioInput, load_audio_as, make_list_of_audio
from ...feature_extraction_utils import BatchFeature
from ...processing_utils import AllKwargsForChatTemplate, AudioKwargs, ProcessingKwargs, ProcessorMixin, Unpack
from ...tokenization_utils_base import PreTokenizedInput, TextInput
logger = logging.get_logger(__name__)
class VoxtralAudioKwargs(AudioKwargs, total=False):
max_source_positions: Optional[int]
class VoxtralProcessorKwargs(ProcessingKwargs, total=False):
_defaults = {
"text_kwargs": {
"padding": True,
},
"audio_kwargs": {
"sampling_rate": 16000,
"padding": True,
"truncation": False,
"pad_to_multiple_of": 480000,
"max_source_positions": 3000,
},
"common_kwargs": {
"return_tensors": "pt",
"return_dict": True,
"tokenize": True,
},
}
class VoxtralProcessor(ProcessorMixin):
r"""
Constructs a Voxtral processor which wraps [`WhisperFeatureExtractor`] and
[`MistralCommonTokenizer`] into a single processor that inherits both the audio feature extraction and
tokenizer functionalities.
Args:
feature_extractor ([`WhisperFeatureExtractor`]):
The feature extractor is a required input.
tokenizer ([`MistralCommonTokenizer`]):
The tokenizer is a required input.
"""
attributes = ["feature_extractor", "tokenizer"]
feature_extractor_class = "WhisperFeatureExtractor"
tokenizer_class = "MistralCommonTokenizer"
def __init__(
self,
feature_extractor,
tokenizer,
):
self.audio_token_id = 24
self.audio_token = tokenizer.convert_ids_to_tokens(self.audio_token_id)
super().__init__(feature_extractor, tokenizer)
def _retreive_input_features(self, audio, max_source_positions, **kwargs):
"""
Handles the Voxtral-specific preparation of input features: audio arrays are padded to the next multiple of 480000 samples (so the duration is a multiple of 30s), see the default audio_kwargs in VoxtralProcessorKwargs.
Mel input features are then extracted, split into chunks of max_source_positions frames, and stacked along the batch dimension.
"""
input_features_list = []
for audio_array in audio:
audio_inputs = self.feature_extractor(audio_array, **kwargs)
# let's split into chunks of max_source_positions, and then stack them along batch dimension
input_features = audio_inputs["input_features"].reshape(
self.feature_extractor.feature_size, -1, max_source_positions
)
input_features_list.append(input_features.transpose(0, 1))
return torch.cat(input_features_list)
def apply_chat_template(
self,
conversation: Union[list[dict[str, str]], list[list[dict[str, str]]]],
**kwargs: Unpack[AllKwargsForChatTemplate],
) -> str:
"""
This method applies the model's chat completion template given a conversation. It relies on MistralCommonTokenizer's
[`~MistralCommonTokenizer.apply_chat_template`] to prepare input ids to the model and on WhisperFeatureExtractor's
[`~WhisperFeatureExtractor.__call__`] to prepare input features to the model.
Note that audio is padded up to the next 30-second multiple prior to mel feature extraction.
A `conversation` is a list of messages, where each message is a dictionary with a `role` and a `content` field.
For Voxtral, `role` can be `"user"` or `"assistant"`.
The `content` field can be a string or a list of dictionaries with a `type` field. See example below.
```python
from huggingface_hub import hf_hub_download
from transformers.audio_utils import load_audio_as
audio_url = "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3"
audio_path = hf_hub_download(repo_id="hf-internal-testing/dummy-audio-samples", filename="bcn_weather.mp3", repo_type="dataset")
audio_base64 = load_audio_as(audio_path, return_format="base64", force_mono=True)
# audio + text
conversation = [
{
"role": "user",
"content": [
{"type": "audio", "url": audio_url},
{"type": "audio", "path": audio_path},
{"type": "audio", "base64": audio_base64},
{"type": "text", "text": "How many audio do you hear?"},
],
},
]
processor = VoxtralProcessor.from_pretrained("mistralai/Voxtral-Mini-3B-2507")
inputs = processor.apply_chat_template(conversation)
```
Args:
conversation (`Union[list[dict[str, str]], list[list[dict[str, str]]]]`):
The conversation to format.
"""
if kwargs.get("continue_final_message", False):
if kwargs.get("add_generation_prompt", False):
raise ValueError(
"continue_final_message and add_generation_prompt are not compatible. Use continue_final_message when you want the model to continue the final message, and add_generation_prompt when you want to add a header that will prompt it to start a new assistant message instead."
)
if kwargs.get("return_assistant_tokens_mask", False):
raise ValueError("continue_final_message is not compatible with return_assistant_tokens_mask.")
# Fill sets of kwargs that should be used by different parts of template
processed_kwargs = {
"mm_load_kwargs": {},
"template_kwargs": {},
}
for kwarg_type in processed_kwargs:
for key in AllKwargsForChatTemplate.__annotations__[kwarg_type].__annotations__.keys():
kwarg_type_defaults = AllKwargsForChatTemplate.__annotations__[kwarg_type]
default_value = getattr(kwarg_type_defaults, key, None)
value = kwargs.pop(key, default_value)
if value is not None and not isinstance(value, dict):
processed_kwargs[kwarg_type][key] = value
# Pass unprocessed custom kwargs
processed_kwargs["template_kwargs"].update(kwargs)
if isinstance(conversation, (list, tuple)) and (
isinstance(conversation[0], (list, tuple)) or hasattr(conversation[0], "content")
):
is_batched = True
conversations = conversation
else:
is_batched = False
conversations = [conversation]
# Check for any overlapping keys between mm_load_kwargs and kwargs
mm_load_kwargs = processed_kwargs["mm_load_kwargs"]
if any(key in kwargs for key in mm_load_kwargs):
overlapping_keys = [key for key in mm_load_kwargs if key in kwargs]
logger.warning(
f"{overlapping_keys[0] if len(overlapping_keys) == 1 else ', '.join(overlapping_keys)} load multimodal data kwarg{'s' if len(overlapping_keys) > 1 else ''} {'have' if len(overlapping_keys) > 1 else 'has'} been passed to the processor, but {'they are' if len(overlapping_keys) > 1 else 'it is'} not supported for VoxtralProcessor since it relies on mistral_common directly. {'They' if len(overlapping_keys) > 1 else 'It'} will be ignored."
)
output_kwargs = self._merge_kwargs(
VoxtralProcessorKwargs,
**kwargs,
)
text_kwargs = output_kwargs["text_kwargs"]
audio_kwargs = output_kwargs["audio_kwargs"]
common_kwargs = output_kwargs["common_kwargs"]
return_tensors = common_kwargs.pop("return_tensors", None)
if return_tensors != "pt":
raise ValueError(f"{self.__class__.__name__} only supports `return_tensors='pt'`.")
tokenizer_kwargs = {**processed_kwargs["template_kwargs"], **text_kwargs}
tokenizer_kwargs["return_tensors"] = None # let's not return tensors here
tokenize = tokenizer_kwargs.pop("tokenize", False)
return_dict = tokenizer_kwargs.pop("return_dict", False)
encoded_instruct_inputs = self.tokenizer.apply_chat_template(
conversations,
tokenize=tokenize,
return_dict=return_dict,
**tokenizer_kwargs,
)
if tokenize:
if return_dict:
audio = encoded_instruct_inputs.pop("audio", None)
data = dict(encoded_instruct_inputs)
if audio is not None:
max_source_positions = audio_kwargs.pop("max_source_positions")
data["input_features"] = self._retreive_input_features(audio, max_source_positions, **audio_kwargs)
return BatchFeature(data=data, tensor_type=return_tensors)
if not is_batched:
return encoded_instruct_inputs[0]
return encoded_instruct_inputs
def __call__(
self,
text: Optional[Union[TextInput, PreTokenizedInput, list[TextInput], list[PreTokenizedInput]]],
**kwargs: Unpack[VoxtralProcessorKwargs],
):
r"""
Method to prepare text to be fed as input to the model. This method forwards the `text`
arguments to MistralCommonTokenizer's [`~MistralCommonTokenizer.__call__`] to encode
the text. Please refer to the docstring of the above methods for more information.
This method does not support audio. To prepare the audio, please use:
1. `apply_chat_template` [`~VoxtralProcessor.apply_chat_template`] method.
2. `apply_transcrition_request` [`~VoxtralProcessor.apply_transcrition_request`] method.
Args:
text (`str`, `list[str]`, `list[list[str]]`):
The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
(pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
`is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
return_tensors (`str` or [`~utils.TensorType`], *optional*):
If set, will return tensors of a particular framework. Acceptable values are:
- `'tf'`: Return TensorFlow `tf.constant` objects.
- `'pt'`: Return PyTorch `torch.Tensor` objects.
- `'np'`: Return NumPy `np.ndarray` objects.
- `'jax'`: Return JAX `jnp.ndarray` objects.
Returns:
[`BatchFeature`]: A [`BatchFeature`] with the following fields:
- **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
- **input_features** -- List of audio values to be fed to a model. Returned when `audio` is not `None`.
- **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
`return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
`None`).
"""
if isinstance(text, str):
text = [text]
if any(self.audio_token in t for t in text):
raise ValueError(
f"{self.audio_token} is present in the provided text which is not supported by VoxtralProcessor. Please use the `apply_chat_template` method instead."
)
output_kwargs = self._merge_kwargs(
VoxtralProcessorKwargs,
**kwargs,
)
text_kwargs = output_kwargs["text_kwargs"]
common_kwargs = output_kwargs["common_kwargs"]
out = self.tokenizer(text, **text_kwargs)
return BatchFeature(data=out, tensor_type=common_kwargs.pop("return_tensors", None))
# TODO: @eustlb, this should be moved to mistral_common + testing
def apply_transcrition_request(
self,
language: Union[str, list[str]],
audio: Union[str, list[str], AudioInput],
model_id: str,
sampling_rate: Optional[int] = None,
format: Optional[Union[str, list[str]]] = None,
**kwargs: Unpack[VoxtralProcessorKwargs],
):
"""
This method applies the model's transcription request template given a language and audio.
It relies on MistralCommonTokenizer and WhisperFeatureExtractor to prepare input ids and input features to the model.
```python
from transformers import VoxtralProcessor
model_id = "mistralai/Voxtral-Mini-3B-2507"
processor = VoxtralProcessor.from_pretrained(model_id)
language = "en"
audio = "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3"
inputs = processor.apply_transcrition_request(language=language, audio=audio, model_id=model_id)
```
Args:
language (`str`, `list[str]`):
The language or languages of the audio. If provided as a string, will be applied uniformly to all audio.
If provided as a list, will be applied to each audio individually with a one-to-one mapping.
audio (`str`, `list[str]`, `np.ndarray`, `torch.Tensor`, `list[np.ndarray]`, `list[torch.Tensor]`):
The audio or batch of audio to be prepared. If provided as a string, it should correspond to the path or url of the audio file.
model_id (`str`):
The hub model id of the model to use for transcription.
sampling_rate (`int`, *optional*):
The sampling rate of the audio. Necessary if it is provided as `np.ndarray`, `torch.Tensor`, `list[np.ndarray]`, `list[torch.Tensor]`.
Used to avoid silent errors when passing audio that is not in the expected sampling rate.
format (`str`, `list[str]`, *optional*):
The format of the audio. Necessary if it is provided as `np.ndarray`, `torch.Tensor`, `list[np.ndarray]`, `list[torch.Tensor]`.
"""
output_kwargs = self._merge_kwargs(
VoxtralProcessorKwargs,
**kwargs,
)
text_kwargs = output_kwargs["text_kwargs"]
audio_kwargs = output_kwargs["audio_kwargs"]
common_kwargs = output_kwargs["common_kwargs"]
is_str = isinstance(audio, str)
is_list_of_str = all(isinstance(el, str) for el in audio)
is_list_of_audio = not (is_str or is_list_of_str)
if is_list_of_audio:
if sampling_rate is None:
logger.warning_once(
f"You've provided audio without specifying the sampling rate. It will be assumed to be {audio_kwargs['sampling_rate']}, which can result in silent errors."
)
elif sampling_rate != audio_kwargs["sampling_rate"]:
raise ValueError(
f"The sampling rate of the audio ({sampling_rate}) does not match the sampling rate of the processor ({audio_kwargs['sampling_rate']}). Please resample the audio to the expected sampling rate."
)
sampling_rate = audio_kwargs["sampling_rate"]
return_dict = common_kwargs.pop("return_dict", False)
tokenize = common_kwargs.pop("tokenize", False)
# make sure to remove from text_kwargs and audio_kwargs
for k in ("return_dict", "tokenize"):
text_kwargs.pop(k, None)
audio_kwargs.pop(k, None)
return_tensors = common_kwargs.pop("return_tensors", None)
if return_tensors != "pt":
raise ValueError(f"{self.__class__.__name__} only supports `return_tensors='pt'`.")
# validate audio input
if is_str:
audio = [load_audio_as(audio, return_format="buffer", force_mono=True, sampling_rate=sampling_rate)]
elif is_list_of_str:
audio = [
load_audio_as(el, return_format="buffer", force_mono=True, sampling_rate=sampling_rate) for el in audio
]
else:
audio = make_list_of_audio(audio)
if len(audio) != len(format):
raise ValueError(
f"When passed as a list of audio, the length ({len(audio)}) must match the number of formats ({len(format)})"
)
audio_buffers = []
for array, f in zip(audio, format):
# Create new BytesIO object and write audio data to it
buffer = io.BytesIO()
# Convert to mono if needed
if array.ndim == 2:
array = array.mean(axis=1)
# Write to buffer with default format and sampling rate
sf.write(buffer, array, samplerate=audio_kwargs["sampling_rate"], format=f)
buffer.seek(0)
audio_buffers.append(buffer)
audio = audio_buffers
# validate language input
n_audio = len(audio)
if isinstance(language, str):
language = [language] * n_audio
if len(language) != n_audio:
raise ValueError(
f"When passed as a list of languages, the length ({len(language)}) must match the number of audio ({n_audio})"
)
input_ids = []
texts = []
audio_arrays = []
for audio_el, language_el in zip(audio, language):
openai_transcription_request = {
"model": model_id,
"file": audio_el,
"language": language_el,
}
transcription_request = TranscriptionRequest.from_openai(openai_transcription_request)
tokenized_transcription_request = self.tokenizer.tokenizer.encode_transcription(transcription_request)
input_ids.append(tokenized_transcription_request.tokens)
texts.append(tokenized_transcription_request.text)
audio_arrays.extend([el.audio_array for el in tokenized_transcription_request.audios])
if tokenize:
if return_dict:
# the text is already tokenized, but it still needs to be padded
encoding = self.tokenizer(
input_ids,
add_special_tokens=False,
**text_kwargs,
)
data = dict(encoding)
# extract the input features
max_source_positions = audio_kwargs.pop("max_source_positions")
data["input_features"] = self._retreive_input_features(
audio_arrays, max_source_positions, **audio_kwargs
)
return BatchFeature(data=data, tensor_type=return_tensors)
return texts
def batch_decode(self, *args, **kwargs):
"""
This method forwards all its arguments to MistralCommonTokenizer's [`~MistralCommonTokenizer.batch_decode`]. Please
refer to the docstring of this method for more information.
"""
return self.tokenizer.batch_decode(*args, **kwargs)
def decode(self, *args, **kwargs):
"""
This method forwards all its arguments to MistralCommonTokenizer's [`~MistralCommonTokenizer.decode`]. Please refer to
the docstring of this method for more information.
"""
return self.tokenizer.decode(*args, **kwargs)
__all__ = ["VoxtralProcessor"]
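
The padding and chunking performed by `_retreive_input_features` above can be checked with simple arithmetic. The sketch below uses only the standard library and nested lists instead of tensors. `num_chunks` and `split_into_chunks` are hypothetical helper names, and the hop length of 160 samples is an assumption based on the default of `WhisperFeatureExtractor`; the other two constants come from the `VoxtralProcessorKwargs` defaults.

```python
import math

PAD_MULTIPLE = 480_000        # samples; the "pad_to_multiple_of" default above
MAX_SOURCE_POSITIONS = 3_000  # frames per chunk; the "max_source_positions" default above
HOP_LENGTH = 160              # assumption: default hop length of WhisperFeatureExtractor


def num_chunks(num_samples):
    """Number of 30-second chunks produced for an audio of `num_samples` samples."""
    padded = math.ceil(num_samples / PAD_MULTIPLE) * PAD_MULTIPLE
    return (padded // HOP_LENGTH) // MAX_SOURCE_POSITIONS


def split_into_chunks(features, chunk_len):
    """Split a [feature_size][n_frames] nested list into [n_chunks][feature_size][chunk_len].

    Mirrors reshape(feature_size, -1, chunk_len) followed by transpose(0, 1).
    """
    n_frames = len(features[0])
    if n_frames % chunk_len != 0:
        raise ValueError("n_frames must be a multiple of chunk_len")
    return [
        [row[start : start + chunk_len] for row in features]
        for start in range(0, n_frames, chunk_len)
    ]


print(num_chunks(10 * 16_000))  # 10 s of audio -> 1
print(num_chunks(480_001))      # just over 30 s -> 2
print(split_into_chunks([[0, 1, 2, 3], [10, 11, 12, 13]], chunk_len=2))
# [[[0, 1], [10, 11]], [[2, 3], [12, 13]]]
```

Because every audio is padded to a multiple of 30 seconds before feature extraction, the frame count is always divisible by `max_source_positions`, which is what makes the reshape in the processor well-defined.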

View File

@@ -22,6 +22,7 @@ from typing import Any, Callable, Optional, Union, overload
import numpy as np
from transformers.audio_utils import load_audio_as
from transformers.tokenization_utils_base import (
LARGE_INTEGER,
VERY_LARGE_INTEGER,
@@ -41,11 +42,13 @@ from transformers.utils.import_utils import is_mistral_common_available, is_torc
if is_mistral_common_available():
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.protocol.instruct.validator import ValidationMode
from mistral_common.tokens.tokenizers.base import SpecialTokenPolicy
from mistral_common.tokens.tokenizers.base import SpecialTokenPolicy, TokenizerVersion
from mistral_common.tokens.tokenizers.image import MultiModalVersion
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.tokens.tokenizers.tekken import Tekkenizer
from mistral_common.tokens.tokenizers.utils import download_tokenizer_from_hf_hub
if is_torch_available():
import torch
@@ -1473,12 +1476,24 @@ class MistralCommonTokenizer(PushToHubMixin):
else:
raise ValueError("Image content must be specified.")
normalized_content.append({"type": "image_url", "image_url": {"url": image_content}})
elif content_type == "audio":
maybe_url: Optional[str] = content.get("url")
maybe_path: Optional[str] = content.get("path")
maybe_base64: Optional[str] = content.get("base64")
if maybe_url or maybe_path:
audio_data = load_audio_as(maybe_url or maybe_path, return_format="dict", force_mono=True)
normalized_content.append({"type": "input_audio", "input_audio": audio_data})
continue
if not maybe_base64:
raise ValueError("Audio content must be specified.")
normalized_content.append({"type": "audio_url", "audio_url": {"url": maybe_base64}})
else:
normalized_content.append(content)
message["content"] = normalized_content
outputs = []
images: list[np.ndarray] = []
audios: list[np.ndarray] = []
for conversation in conversations:
messages: list[dict[str, Union[str, list[dict[str, Union[str, dict[str, Any]]]]]]] = []
@@ -1498,6 +1513,7 @@ class MistralCommonTokenizer(PushToHubMixin):
else:
outputs.append(tokenized_request.text)
images.extend(tokenized_request.images)
audios.extend([el.audio_array for el in tokenized_request.audios])
if not is_batched:
outputs = outputs[0]
@@ -1528,6 +1544,13 @@ class MistralCommonTokenizer(PushToHubMixin):
else:
raise ValueError(f"Unsupported return_tensors type: {return_tensors}")
out.data["pixel_values"] = pixel_values
if audios:
if return_tensors is not None:
raise NotImplementedError(
"When passing audio content in apply_chat_template, `return_tensors` must be None since we cannot batch the audio inputs. The returned audio will be a list of numpy arrays."
)
# Transformers convention is audio for plural audio (audio does not take an "s")
out.data["audio"] = audios
return out
else:
return out["input_ids"]
@@ -1735,12 +1758,12 @@ class MistralCommonTokenizer(PushToHubMixin):
raise ValueError("`init_inputs` are not supported by `MistralCommonTokenizer.from_pretrained`.")
# Handle kwargs and AutoTokenizer case
if kwargs and not kwargs.keys() == {"_from_auto"}:
if kwargs and not set(kwargs.keys()).issubset({"_from_auto", "trust_remote_code"}):
raise ValueError(
f"Kwargs {list(kwargs.keys())} are not supported by `MistralCommonTokenizer.from_pretrained`."
)
if not os.path.isfile(pretrained_model_name_or_path):
if not os.path.isdir(pretrained_model_name_or_path):
tokenizer_path = download_tokenizer_from_hf_hub(
repo_id=pretrained_model_name_or_path,
cache_dir=cache_dir,
@@ -1750,7 +1773,37 @@ class MistralCommonTokenizer(PushToHubMixin):
local_files_only=local_files_only,
)
else:
tokenizer_path = pretrained_model_name_or_path
valid_tokenizer_files = []
tokenizer_file: str
instruct_versions = list(TokenizerVersion.__members__)
mm_versions = list(MultiModalVersion.__members__) + [""] # allow no mm version
sentencepiece_suffixes = [f".model.{v}{m}" for v in instruct_versions for m in mm_versions] + [".model"]
for path in os.listdir(pretrained_model_name_or_path):
pathlib_repo_file = Path(path)
file_name = pathlib_repo_file.name
suffix = "".join(pathlib_repo_file.suffixes)
if file_name == "tekken.json":
valid_tokenizer_files.append(file_name)
elif suffix in sentencepiece_suffixes:
valid_tokenizer_files.append(file_name)
if len(valid_tokenizer_files) == 0:
raise ValueError(f"No tokenizer file found in directory: {pretrained_model_name_or_path}")
# If there are multiple tokenizer files, we use tekken.json if it exists, otherwise the versioned one.
if len(valid_tokenizer_files) > 1:
if "tekken.json" in valid_tokenizer_files:
tokenizer_file = "tekken.json"
else:
tokenizer_file = sorted(valid_tokenizer_files)[-1]
logger.warning(
f"Multiple tokenizer files found in directory: {pretrained_model_name_or_path}. Using {tokenizer_file}."
)
else:
tokenizer_file = valid_tokenizer_files[0]
tokenizer_path = os.path.join(pretrained_model_name_or_path, tokenizer_file)
return cls(
tokenizer_path=tokenizer_path,
@@ -1802,6 +1855,8 @@ class MistralCommonTokenizer(PushToHubMixin):
Returns:
A tuple of `str`: The files saved.
"""
# `save_jinja_files` must be skipped to be able to save from a processor
kwargs.pop("save_jinja_files", None)
if kwargs:
raise ValueError(
f"Kwargs {list(kwargs.keys())} are not supported by `MistralCommonTokenizer.save_pretrained`."
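
The local-directory branch added to `MistralCommonTokenizer.from_pretrained` above chooses one tokenizer file among the candidates found in the directory. The selection rule can be sketched as a pure function. `pick_tokenizer_file` is a hypothetical name, and the suffix list passed in the usage example is illustrative only; the real code derives it from `TokenizerVersion` and `MultiModalVersion`.

```python
from pathlib import Path


def pick_tokenizer_file(file_names, sentencepiece_suffixes):
    """Return the tokenizer file to load, preferring tekken.json.

    Hypothetical helper that restates the selection rule from the diff.
    """
    valid = []
    for name in file_names:
        # "tokenizer.model.v3" -> ".model.v3"
        suffix = "".join(Path(name).suffixes)
        if name == "tekken.json" or suffix in sentencepiece_suffixes:
            valid.append(name)
    if not valid:
        raise ValueError("No tokenizer file found")
    if "tekken.json" in valid:
        return "tekken.json"
    # Otherwise take the last candidate in sorted (lexicographic) order.
    return sorted(valid)[-1]


# Illustrative suffix list (assumption, not the real enum contents).
suffixes = [".model", ".model.v1", ".model.v3", ".model.v11"]

print(pick_tokenizer_file(["config.json", "tekken.json", "tokenizer.model.v3"], suffixes))
# tekken.json
print(pick_tokenizer_file(["tokenizer.model.v11", "tokenizer.model.v3"], suffixes))
# tokenizer.model.v3
```

Note that `sorted` compares strings lexicographically, so in this sketch `tokenizer.model.v11` sorts before `tokenizer.model.v3`. Whether that matters in practice depends on which files a directory actually contains.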

View File

View File

@@ -0,0 +1,472 @@
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Testing suite for the PyTorch Voxtral model."""
import tempfile
import unittest
from transformers import (
AutoProcessor,
VoxtralConfig,
VoxtralForConditionalGeneration,
is_torch_available,
)
from transformers.testing_utils import (
cleanup,
require_torch,
require_torch_sdpa,
slow,
torch_device,
)
from ...generation.test_utils import GenerationTesterMixin
from ...test_configuration_common import ConfigTester
from ...test_modeling_common import ModelTesterMixin, floats_tensor, ids_tensor
if is_torch_available():
import torch
class VoxtralModelTester:
def __init__(
self,
parent,
ignore_index=-100,
audio_token_id=0,
seq_length=35,
feat_seq_length=60,
text_config={
"model_type": "llama",
"intermediate_size": 36,
"initializer_range": 0.02,
"hidden_size": 32,
"max_position_embeddings": 52,
"num_hidden_layers": 2,
"num_attention_heads": 4,
"num_key_value_heads": 2,
"use_labels": True,
"use_mrope": False,
"vocab_size": 99,
"head_dim": 8,
},
is_training=True,
audio_config={
"model_type": "voxtral_encoder",
"hidden_size": 16,
"num_attention_heads": 4,
"intermediate_size": 16,
"num_hidden_layers": 2,
"num_mel_bins": 80,
"max_source_positions": 30,
"initializer_range": 0.02,
},
):
self.parent = parent
self.ignore_index = ignore_index
self.audio_token_id = audio_token_id
self.text_config = text_config
self.audio_config = audio_config
self.seq_length = seq_length
self.feat_seq_length = feat_seq_length
self.num_hidden_layers = text_config["num_hidden_layers"]
self.vocab_size = text_config["vocab_size"]
self.hidden_size = text_config["hidden_size"]
self.num_attention_heads = text_config["num_attention_heads"]
self.is_training = is_training
self.batch_size = 3
self.encoder_seq_length = seq_length
def get_config(self):
return VoxtralConfig(
text_config=self.text_config,
audio_config=self.audio_config,
ignore_index=self.ignore_index,
audio_token_id=self.audio_token_id,
)
def prepare_config_and_inputs(self):
input_features_values = floats_tensor(
[
self.batch_size,
self.audio_config["num_mel_bins"],
self.feat_seq_length,
]
)
config = self.get_config()
return config, input_features_values
def prepare_config_and_inputs_for_common(self):
config_and_inputs = self.prepare_config_and_inputs()
config, input_features_values = config_and_inputs
num_audio_tokens_per_batch_idx = 30
input_ids = ids_tensor([self.batch_size, self.seq_length], config.text_config.vocab_size - 1) + 1
attention_mask = torch.ones(input_ids.shape, dtype=torch.long).to(torch_device)
attention_mask[:, :1] = 0
input_ids[:, 1 : 1 + num_audio_tokens_per_batch_idx] = config.audio_token_id
inputs_dict = {
"input_ids": input_ids,
"attention_mask": attention_mask,
"input_features": input_features_values,
}
return config, inputs_dict
@require_torch
class VoxtralForConditionalGenerationModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase):
"""
Model tester for `VoxtralForConditionalGeneration`.
"""
all_model_classes = (VoxtralForConditionalGeneration,) if is_torch_available() else ()
pipeline_model_mapping = (
{"text-to-speech": VoxtralForConditionalGeneration, "audio-text-to-text": VoxtralForConditionalGeneration}
if is_torch_available()
else {}
)
test_pruning = False
test_head_masking = False
_is_composite = True
def setUp(self):
self.model_tester = VoxtralModelTester(self)
self.config_tester = ConfigTester(self, config_class=VoxtralConfig, has_text_modality=False)
@unittest.skip(
reason="This test does not apply to Voxtral since inputs_embeds corresponding to audio tokens are replaced when input features are provided."
)
def test_inputs_embeds_matches_input_ids(self):
pass
@require_torch_sdpa
def test_sdpa_can_dispatch_composite_models(self):
# overwrite because Voxtral is audio+text model (not vision+text)
if not self.has_attentions:
self.skipTest(reason="Model architecture does not support attentions")
if not self._is_composite:
self.skipTest(f"{self.all_model_classes[0].__name__} does not support SDPA")
for model_class in self.all_model_classes:
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
model = model_class(config)
with tempfile.TemporaryDirectory() as tmpdirname:
model.save_pretrained(tmpdirname)
model_sdpa = model_class.from_pretrained(tmpdirname)
model_sdpa = model_sdpa.eval().to(torch_device)
text_attn = "sdpa" if model.language_model._supports_sdpa else "eager"
audio_attn = "sdpa" if model.audio_tower._supports_sdpa else "eager"
# `None` as it is the requested one which will be assigned to each sub-config
# Sub-model will dispatch to SDPA if it can (checked below that `SDPA` layers are present)
self.assertTrue(model_sdpa.config._attn_implementation == "sdpa")
self.assertTrue(model.language_model.config._attn_implementation == text_attn)
self.assertTrue(model.audio_tower.config._attn_implementation == audio_attn)
model_eager = model_class.from_pretrained(tmpdirname, attn_implementation="eager")
model_eager = model_eager.eval().to(torch_device)
self.assertTrue(model_eager.config._attn_implementation == "eager")
self.assertTrue(model_eager.language_model.config._attn_implementation == "eager")
self.assertTrue(model_eager.audio_tower.config._attn_implementation == "eager")
for name, submodule in model_eager.named_modules():
class_name = submodule.__class__.__name__
if "SdpaAttention" in class_name or "SdpaSelfAttention" in class_name:
raise ValueError("The eager model should not have SDPA attention layers")
@require_torch
class VoxtralForConditionalGenerationIntegrationTest(unittest.TestCase):
def setUp(self):
self.checkpoint_name = "mistralai/Voxtral-Mini-3B-2507"
self.dtype = torch.bfloat16
self.processor = AutoProcessor.from_pretrained(self.checkpoint_name)
def tearDown(self):
cleanup(torch_device, gc_collect=True)
@slow
def test_mini_single_turn_audio_only(self):
"""
reproducer: https://gist.github.com/eustlb/c5e0e0a12e84e3d575151ba63d17e4cf
disclaimer: Perfect token matching cannot be achieved due to floating-point arithmetic differences between vLLM and Transformers implementations.
"""
conversation = [
{
"role": "user",
"content": [
{
"type": "audio",
"path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/dude_where_is_my_car.wav",
},
],
}
]
model = VoxtralForConditionalGeneration.from_pretrained(
self.checkpoint_name, torch_dtype=self.dtype, device_map=torch_device
)
inputs = self.processor.apply_chat_template(conversation)
inputs = inputs.to(torch_device, dtype=self.dtype)
outputs = model.generate(**inputs, do_sample=False, max_new_tokens=500)
decoded_outputs = self.processor.batch_decode(outputs, skip_special_tokens=True)
EXPECTED_OUTPUT = [
'The audio is a humorous exchange between two individuals, likely friends or acquaintances, about tattoos. Here\'s a breakdown:\n\n1. **Initial Reaction**: One person (let\'s call him A) is surprised to see the other person (let\'s call him B) has a tattoo. A asks if B has a tattoo, and B confirms.\n\n2. **Tattoo Interpretation**: A then asks what B\'s tattoo says, and B responds with "sweet." This exchange is repeated multiple times, with A asking what B\'s tattoo says, and B always answering "sweet."\n\n3. **Confusion**: A seems confused and asks what B\'s tattoo says multiple times, each time getting the same response. This leads to a humorous back-and-forth.\n\n4. **Clarification**: Eventually, B clarifies that A\'s tattoo says "dude" and A\'s says "sweet." This is the punchline of the joke, as A had been asking about B\'s tattoo but not his own.\n\n5. **Final Exchange**: B then asks what A\'s tattoo says, and A responds with "sweet," leading to a final round of confusion.\n\nThe humor comes from the repetition of the word "sweet" and the confusion that arises from A\'s lack of self-awareness about his own tattoo.'
]
self.assertEqual(decoded_outputs, EXPECTED_OUTPUT)
@slow
def test_mini_single_turn_text_and_audio(self):
"""
reproducer: https://gist.github.com/eustlb/c5e0e0a12e84e3d575151ba63d17e4cf
disclaimer: Perfect token matching cannot be achieved due to floating-point arithmetic differences between vLLM and Transformers implementations.
"""
conversation = [
{
"role": "user",
"content": [
{
"type": "audio",
"path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
},
{"type": "text", "text": "What can you tell me about this audio?"},
],
}
]
model = VoxtralForConditionalGeneration.from_pretrained(
self.checkpoint_name, torch_dtype=self.dtype, device_map=torch_device
)
inputs = self.processor.apply_chat_template(conversation)
inputs = inputs.to(torch_device, dtype=self.dtype)
outputs = model.generate(**inputs, do_sample=False, max_new_tokens=500)
decoded_outputs = self.processor.batch_decode(outputs, skip_special_tokens=True)
EXPECTED_OUTPUT = [
"What can you tell me about this audio?This audio is a farewell address by President Barack Obama, delivered in Chicago. In the speech, he reflects on his eight years in office, highlighting the resilience, hope, and unity of the American people. He expresses gratitude for the conversations he had with the public, which kept him honest and inspired. The president also emphasizes the importance of self-government and civic engagement, encouraging Americans to participate in their democracy actively. He concludes by expressing optimism about the country's future and his commitment to serving as a citizen."
]
self.assertEqual(decoded_outputs, EXPECTED_OUTPUT)
@slow
def test_mini_single_turn_text_and_multiple_audios(self):
"""
reproducer: https://gist.github.com/eustlb/c5e0e0a12e84e3d575151ba63d17e4cf
disclaimer: Perfect token matching cannot be achieved due to floating-point arithmetic differences between vLLM and Transformers implementations.
"""
conversation = [
{
"role": "user",
"content": [
{
"type": "audio",
"path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/mary_had_lamb.mp3",
},
{
"type": "audio",
"path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
},
{"type": "text", "text": "What sport and what nursery rhyme are referenced?"},
],
}
]
model = VoxtralForConditionalGeneration.from_pretrained(
self.checkpoint_name, torch_dtype=self.dtype, device_map=torch_device
)
inputs = self.processor.apply_chat_template(conversation)
inputs = inputs.to(torch_device, dtype=self.dtype)
outputs = model.generate(**inputs, do_sample=False, max_new_tokens=500)
decoded_outputs = self.processor.batch_decode(outputs, skip_special_tokens=True)
EXPECTED_OUTPUT = [
'What sport and what nursery rhyme are referenced?The audio references both a nursery rhyme and a baseball game. The nursery rhyme is "Mary Had a Little Lamb," and the baseball game is a playoff game between the Baltimore Orioles and the Oakland Athletics.'
]
self.assertEqual(decoded_outputs, EXPECTED_OUTPUT)
@slow
def test_mini_single_turn_text_only(self):
"""
reproducer: https://gist.github.com/eustlb/c5e0e0a12e84e3d575151ba63d17e4cf
disclaimer: Perfect token matching cannot be achieved due to floating-point arithmetic differences between vLLM and Transformers implementations.
"""
conversation = [
{
"role": "user",
"content": [
{"type": "text", "text": "Hello, how are you doing today?"},
],
}
]
model = VoxtralForConditionalGeneration.from_pretrained(
self.checkpoint_name, torch_dtype=self.dtype, device_map=torch_device
)
inputs = self.processor.apply_chat_template(conversation)
inputs = inputs.to(torch_device, dtype=self.dtype)
outputs = model.generate(**inputs, do_sample=False, max_new_tokens=500)
decoded_outputs = self.processor.batch_decode(outputs, skip_special_tokens=True)
EXPECTED_OUTPUT = [
"Hello, how are you doing today?Hello! I'm functioning as intended, thank you. How about you? How's your day going?"
]
self.assertEqual(decoded_outputs, EXPECTED_OUTPUT)
@slow
def test_mini_single_turn_text_and_multiple_audios_batched(self):
"""
reproducer: https://gist.github.com/eustlb/c5e0e0a12e84e3d575151ba63d17e4cf
disclaimer: Perfect token matching cannot be achieved due to floating-point arithmetic differences between vLLM and Transformers implementations.
"""
conversations = [
[
{
"role": "user",
"content": [
{
"type": "audio",
"path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
},
{
"type": "audio",
"path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
},
{
"type": "text",
"text": "Who's speaking in the speach and what city's weather is being discussed?",
},
],
}
],
[
{
"role": "user",
"content": [
{
"type": "audio",
"path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
},
{"type": "text", "text": "What can you tell me about this audio?"},
],
}
],
]
model = VoxtralForConditionalGeneration.from_pretrained(
self.checkpoint_name, torch_dtype=self.dtype, device_map=torch_device
)
inputs = self.processor.apply_chat_template(conversations)
inputs = inputs.to(torch_device, dtype=self.dtype)
outputs = model.generate(**inputs, do_sample=False, max_new_tokens=500)
decoded_outputs = self.processor.batch_decode(outputs, skip_special_tokens=True)
EXPECTED_OUTPUT = [
"Who's speaking in the speach and what city's weather is being discussed?The speaker in the speech is Barack Obama, and the weather being discussed is in Barcelona.",
'What can you tell me about this audio?This audio is a commentary of a baseball game, specifically a home run hit by Edgar Martinez. Here are some key points:\n\n- **Game Context**: The game is likely a playoff or championship game, as the commentator mentions the American League Championship.\n- **Play Description**: Edgar Martinez hits a home run, which is described as a "line drive" and a "base hit."\n- **Team Involvement**: The team is the Mariners, and the commentator is excited about their chances to win the championship.\n- **Emotional Tone**: The commentator expresses disbelief and excitement, using phrases like "I don\'t believe it" and "my, oh my."\n- **Player Involvement**: The commentator mentions Joy and Junior, likely referring to other players or commentators in the broadcast.',
]
self.assertEqual(decoded_outputs, EXPECTED_OUTPUT)
@slow
def test_mini_multi_turn_text_and_audio(self):
"""
reproducer: https://gist.github.com/eustlb/c5e0e0a12e84e3d575151ba63d17e4cf
disclaimer: Perfect token matching cannot be achieved due to floating-point arithmetic differences between vLLM and Transformers implementations.
"""
conversations = [
[
{
"role": "user",
"content": [
{
"type": "audio",
"path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
},
{
"type": "audio",
"path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
},
{"type": "text", "text": "Describe briefly what you can hear."},
],
},
{
"role": "assistant",
"content": "The audio begins with the speaker delivering a farewell address in Chicago, reflecting on his eight years as president and expressing gratitude to the American people. The audio then transitions to a weather report, stating that it was 35 degrees in Barcelona the previous day, but the temperature would drop to minus 20 degrees the following day.",
},
{
"role": "user",
"content": [
{
"type": "audio",
"path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/dude_where_is_my_car.wav",
},
{"type": "text", "text": "Ok, now compare this new audio with the previous one."},
],
},
]
]
model = VoxtralForConditionalGeneration.from_pretrained(
self.checkpoint_name, torch_dtype=self.dtype, device_map=torch_device
)
inputs = self.processor.apply_chat_template(conversations)
inputs = inputs.to(torch_device, dtype=self.dtype)
outputs = model.generate(**inputs, do_sample=False, max_new_tokens=500)
decoded_outputs = self.processor.batch_decode(outputs, skip_special_tokens=True)
EXPECTED_OUTPUT = [
'Describe briefly what you can hear.The audio begins with the speaker delivering a farewell address in Chicago, reflecting on his eight years as president and expressing gratitude to the American people. The audio then transitions to a weather report, stating that it was 35 degrees in Barcelona the previous day, but the temperature would drop to minus 20 degrees the following day.Ok, now compare this new audio with the previous one.The new audio is a humorous conversation between two friends, one of whom has a tattoo. The speaker is excited to see the tattoo and asks what it says. The other friend repeatedly says "sweet" in response, leading to a playful exchange. The speaker then realizes the joke and says "your tattoo says dude, your tattoo says sweet, got it?" The previous audio was a farewell address by a president, reflecting on his time in office and expressing gratitude to the American people. The new audio is a casual, light-hearted conversation in contrast to the serious and reflective tone of the previous audio.'
]
self.assertEqual(decoded_outputs, EXPECTED_OUTPUT)
@slow
def test_transcribe_mode_audio_input(self):
"""
To test the model's transcribe mode, a WER evaluation was run and compared against the model's declared performance;
see the description of PR https://github.com/huggingface/transformers/pull/39429.
disclaimer: Perfect token matching cannot be achieved due to floating-point arithmetic differences between vLLM and Transformers implementations.
"""
model = VoxtralForConditionalGeneration.from_pretrained(
self.checkpoint_name, torch_dtype=self.dtype, device_map=torch_device
)
inputs = self.processor.apply_transcrition_request(
language="en",
audio="https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
model_id=self.checkpoint_name,
)
inputs = inputs.to(torch_device, dtype=self.dtype)
outputs = model.generate(**inputs, do_sample=False, max_new_tokens=500)
decoded_outputs = self.processor.batch_decode(outputs, skip_special_tokens=True)
EXPECTED_OUTPUT = [
"lang:enThis week, I traveled to Chicago to deliver my final farewell address to the nation, following in the tradition of presidents before me. It was an opportunity to say thank you. Whether we've seen eye-to-eye or rarely agreed at all, my conversations with you, the American people, in living rooms and schools, at farms and on factory floors, at diners and on distant military outposts, All these conversations are what have kept me honest, kept me inspired, and kept me going. Every day, I learned from you. You made me a better president, and you made me a better man. Over the course of these eight years, I've seen the goodness, the resilience, and the hope of the American people. I've seen neighbors looking out for each other as we rescued our economy from the worst crisis of our lifetimes. I've hugged cancer survivors who finally know the security of affordable health care. I've seen communities like Joplin rebuild from disaster, and cities like Boston show the world that no terrorist will ever break the American spirit. I've seen the hopeful faces of young graduates and our newest military officers. I've mourned with grieving families searching for answers, and I found grace in a Charleston church. I've seen our scientists help a paralyzed man regain his sense of touch, and our wounded warriors walk again. I've seen our doctors and volunteers rebuild after earthquakes and stop pandemics in their tracks. I've learned from students who are building robots and curing diseases and who will change the world in ways we can't even imagine. I've seen the youngest of children remind us of our obligations to care for our refugees, to work in peace, and above all, to look out for each other. That's what's possible when we come together in the slow, hard, sometimes frustrating, but always vital work of self-government. But we can't take our democracy for granted. All of us, regardless of party, should throw ourselves into the work of citizenship. Not just when there's an election. Not just when our own narrow interest is at stake. But over the full span of a lifetime. If you're tired of arguing with strangers on the Internet, try to talk with one in real life. If something needs fixing, lace up your shoes and do some organizing. If you're disappointed by your elected officials, then grab a clipboard, get some signatures, and run for office yourself. Our success depends on our"
]
self.assertEqual(decoded_outputs, EXPECTED_OUTPUT)
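

# The docstring of test_transcribe_mode_audio_input refers to a WER (word error
# rate) evaluation. The function below is a minimal sketch of how WER can be
# computed, assuming plain whitespace tokenization and no text normalization.
# `word_error_rate` is a hypothetical helper added for illustration only: it is
# not part of Transformers and is not the evaluation code used for the PR.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current

    # WER = (substitutions + deletions + insertions) / reference word count.
    return previous[-1] / len(ref_words)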
