Add voxtral (#39429)

* draft
* draft update (conversion working)
* mend
* draft update
* draft update: working generate
* refactor
* VoxtralProcessor draft
* processor update
* update convert_tekken_tokenizer
* refactor processor
* update convert
* make style
* better handle prefill
* make style
* add tests
* add mistral_common audio loading
* processor update
* revert changes
* audio utils update
* add audio to apply chat template mistral update
* voxtral processor update
* fix
* update conversion script
* make mistral tokenizer from pretrain work from local dir
* fix updates
* add integration tests
* add batched version
* processor docstring
* make style
* revert convert_tekken_tokenizer changes
* revert processing_qwen2.5 changes
* add multi-turn test
* processor improvements
* address review changes
* Update src/transformers/tokenization_mistral_common.py

  Co-authored-by: Julien Denize <40604584+juliendenize@users.noreply.github.com>
* update audio utils
* nits
* integration test update
* correct _support
* update tests
* test update
* update integration tests
* fix
* fix
* fix
* add test_apply_chat_template_with_audio
* add model doc
* model doc
* nit
* doc update
* nit
* processor improvement
* ensure default is 3B
* nits
* make
* make
* convert modular
* update checkpoint
* fix test
* make
* make
* autos
* make
* make
* nit
* nit
* nit

---------

Co-authored-by: Julien Denize <40604584+juliendenize@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
2025-07-18 02:02:04 +02:00
parent 73869f2e81
commit 967045082f
17 changed files with 2806 additions and 7 deletions
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -1095,6 +1095,8 @@
        title: Vision Text Dual Encoder
      - local: model_doc/visual_bert
        title: VisualBERT
+      - local: model_doc/voxtral
+        title: Voxtral
      - local: model_doc/xclip
        title: X-CLIP
      title: Multimodal models
--- a/docs/source/en/model_doc/voxtral.md
+++ b/docs/source/en/model_doc/voxtral.md
@@ -0,0 +1,300 @@
+<!--Copyright 2025 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# Voxtral
+
+Voxtral is an upgrade of [Ministral 3B](https://mistral.ai/news/ministraux) and Mistral Small 3.1 (24B) that extends their language capabilities with audio input support. It is designed to handle tasks such as speech transcription, translation, and audio understanding.
+
+You can read more in Mistral's [release blog post](https://mistral.ai/news/voxtral).
+
+The model is available in two checkpoints:
+- 3B: [mistralai/Voxtral-Mini-3B-2507](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507)
+- 24B: [mistralai/Voxtral-Small-24B-2507](https://huggingface.co/mistralai/Voxtral-Small-24B-2507)
+
+## Key Features
+
+Voxtral builds on Ministral-3B by adding audio processing capabilities:
+
+- **Transcription mode**: Includes a dedicated mode for speech transcription. By default, Voxtral detects the spoken language and transcribes it accordingly.  
+- **Long-form context**: With a 32k token context window, Voxtral can process up to 30 minutes of audio for transcription or 40 minutes for broader audio understanding.  
+- **Integrated Q&A and summarization**: Supports querying audio directly and producing structured summaries without relying on separate ASR and language models.  
+- **Multilingual support**: Automatically detects language and performs well across several widely spoken languages, including English, Spanish, French, Portuguese, Hindi, German, Dutch, and Italian.  
+- **Function calling via voice**: Can trigger functions or workflows directly from spoken input based on detected user intent.  
+- **Text capabilities**: Maintains the strong text processing performance of its Ministral-3B foundation.
+
+## Usage
+
+Let's first load the model!
+```python
+from transformers import VoxtralForConditionalGeneration, AutoProcessor
+import torch
+
+device = "cuda" if torch.cuda.is_available() else "cpu"
+repo_id = "mistralai/Voxtral-Mini-3B-2507"
+
+processor = AutoProcessor.from_pretrained(repo_id)
+model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)
+```
+
+### Audio Instruct Mode
+
+The model supports audio-text instructions, including multi-turn and multi-audio conversations, and can process them in batches.
+
+➡️ audio + text instruction
+```python
+conversation = [
+    {
+        "role": "user",
+        "content": [
+            {
+                "type": "audio",
+                "url": "https://huggingface.co/datasets/eustlb/audio-samples/resolve/main/dude_where_is_my_car.wav",
+            },
+            {"type": "text", "text": "What can you tell me about this audio?"},
+        ],
+    }
+]
+
+inputs = processor.apply_chat_template(conversation)
+inputs = inputs.to(device, dtype=torch.bfloat16)
+
+outputs = model.generate(**inputs, max_new_tokens=500)
+decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
+
+print("\nGenerated response:")
+print("=" * 80)
+print(decoded_outputs[0])
+print("=" * 80)
+```
+
+➡️ multi-audio + text instruction 
+```python
+conversation = [
+    {
+        "role": "user",
+        "content": [
+            {
+                "type": "audio",
+                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/mary_had_lamb.mp3",
+            },
+            {
+                "type": "audio",
+                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
+            },
+            {"type": "text", "text": "What sport and what nursery rhyme are referenced?"},
+        ],
+    }
+]
+
+inputs = processor.apply_chat_template(conversation)
+inputs = inputs.to(device, dtype=torch.bfloat16)
+
+outputs = model.generate(**inputs, max_new_tokens=500)
+decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
+
+print("\nGenerated response:")
+print("=" * 80)
+print(decoded_outputs[0])
+print("=" * 80)
+```
+
+➡️ multi-turn:
+```python
+conversation = [
+    {
+        "role": "user",
+        "content": [
+            {
+                "type": "audio",
+                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
+            },
+            {
+                "type": "audio",
+                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
+            },
+            {"type": "text", "text": "Describe briefly what you can hear."},
+        ],
+    },
+    {
+        "role": "assistant",
+        "content": "The audio begins with the speaker delivering a farewell address in Chicago, reflecting on his eight years as president and expressing gratitude to the American people. The audio then transitions to a weather report, stating that it was 35 degrees in Barcelona the previous day, but the temperature would drop to minus 20 degrees the following day.",
+    },
+    {
+        "role": "user",
+        "content": [
+            {
+                "type": "audio",
+                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/dude_where_is_my_car.wav",
+            },
+            {"type": "text", "text": "Ok, now compare this new audio with the previous one."},
+        ],
+    },
+]
+
+inputs = processor.apply_chat_template(conversation)
+inputs = inputs.to(device, dtype=torch.bfloat16)
+
+outputs = model.generate(**inputs, max_new_tokens=500)
+decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
+
+print("\nGenerated response:")
+print("=" * 80)
+print(decoded_outputs[0])
+print("=" * 80)
+```
+
+➡️ text only:
+```python
+conversation = [
+    {
+        "role": "user",
+        "content": [
+            {
+                "type": "text",
+                "text": "What if a cyber brain could possibly generate its own ghost, and create a soul all by itself?",
+            },
+        ],
+    }
+]
+
+inputs = processor.apply_chat_template(conversation)
+inputs = inputs.to(device, dtype=torch.bfloat16)
+
+outputs = model.generate(**inputs, max_new_tokens=500)
+decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
+
+print("\nGenerated response:")
+print("=" * 80)
+print(decoded_outputs[0])
+print("=" * 80)
+```
+
+➡️ audio only:
+```python
+conversation = [
+    {
+        "role": "user",
+        "content": [
+            {
+                "type": "audio",
+                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/dude_where_is_my_car.wav",
+            },
+        ],
+    }
+]
+
+inputs = processor.apply_chat_template(conversation)
+inputs = inputs.to(device, dtype=torch.bfloat16)
+
+outputs = model.generate(**inputs, max_new_tokens=500)
+decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
+
+print("\nGenerated response:")
+print("=" * 80)
+print(decoded_outputs[0])
+print("=" * 80)
+```
+
+➡️ batched inference!
+```python
+conversations = [
+    [
+        {
+            "role": "user",
+            "content": [
+                {
+                    "type": "audio",
+                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
+                },
+                {
+                    "type": "audio",
+                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
+                },
+                {
+                    "type": "text",
+                    "text": "Who's speaking in the speech and what city's weather is being discussed?",
+                },
+            ],
+        }
+    ],
+    [
+        {
+            "role": "user",
+            "content": [
+                {
+                    "type": "audio",
+                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
+                },
+                {"type": "text", "text": "What can you tell me about this audio?"},
+            ],
+        }
+    ],
+]
+
+inputs = processor.apply_chat_template(conversations)
+inputs = inputs.to(device, dtype=torch.bfloat16)
+
+outputs = model.generate(**inputs, max_new_tokens=500)
+decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
+
+print("\nGenerated responses:")
+print("=" * 80)
+for decoded_output in decoded_outputs:
+    print(decoded_output)
+    print("=" * 80)
+```
+
+### Transcription Mode
+
+Use the model to transcribe audio (supports English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian)!
+
+```python
+inputs = processor.apply_transcrition_request(language="en", audio="https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3")
+inputs = inputs.to(device, dtype=torch.bfloat16)
+
+outputs = model.generate(**inputs, max_new_tokens=500)
+decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
+
+print("\nGenerated responses:")
+print("=" * 80)
+for decoded_output in decoded_outputs:
+    print(decoded_output)
+    print("=" * 80)
+```
+
+This model was contributed by [Eustache Le Bihan](https://huggingface.co/eustlb).
+
+## VoxtralConfig
+
+[[autodoc]] VoxtralConfig
+
+## VoxtralEncoderConfig
+
+[[autodoc]] VoxtralEncoderConfig
+
+## VoxtralProcessor
+
+[[autodoc]] VoxtralProcessor
+
+## VoxtralEncoder
+
+[[autodoc]] VoxtralEncoder
+    - forward
+
+## VoxtralForConditionalGeneration
+
+[[autodoc]] VoxtralForConditionalGeneration
+    - forward
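
Every usage example in `voxtral.md` above decodes `outputs[:, inputs.input_ids.shape[1]:]`, that is, it drops the prompt tokens that `generate` returns in front of the newly generated ones. The sketch below illustrates that slicing with plain Python lists standing in for tensors. The helper name `strip_prompt` and the token ids are made up for illustration and are not part of the library.

```python
def strip_prompt(sequences, prompt_length):
    """Return only the newly generated part of each sequence in the batch."""
    return [seq[prompt_length:] for seq in sequences]


# Two batched sequences; the first 3 ids of each row stand for the prompt.
# (Illustrative ids only, not real Voxtral token ids.)
prompt_length = 3
outputs = [
    [1, 10, 11, 101, 102, 103],
    [0, 1, 20, 201, 202, 2],
]

new_tokens = strip_prompt(outputs, prompt_length)
print(new_tokens)
```

A single offset works for the whole batch because `input_ids` is one rectangular tensor, so every row shares the same prompt length.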
--- a/src/transformers/audio_utils.py
+++ b/src/transformers/audio_utils.py
@@ -16,10 +16,12 @@ Audio processing functions to extract features from audio waveforms. This code i
 and remove unnecessary dependencies.
 """

+import base64
+import io
 import os
 import warnings
 from io import BytesIO
-from typing import Optional, Union
+from typing import Any, Optional, Union

 import numpy as np
 import requests
@@ -27,14 +29,21 @@ import requests
 from .utils import (
    is_librosa_available,
    is_numpy_array,
+    is_soundfile_available,
    is_torch_tensor,
    requires_backends,
 )


+if is_soundfile_available():
+    import soundfile as sf
+
 if is_librosa_available():
    import librosa

+    # TODO: @eustlb, we actually don't need librosa but soxr is installed with librosa
+    import soxr
+

 def load_audio(audio: Union[str, np.ndarray], sampling_rate=16000, timeout=None) -> np.ndarray:
    """
@@ -69,6 +78,85 @@ def load_audio(audio: Union[str, np.ndarray], sampling_rate=16000, timeout=None)
    return audio


+def load_audio_as(
+    audio: str,
+    return_format: str,
+    timeout: Optional[int] = None,
+    force_mono: bool = False,
+    sampling_rate: Optional[int] = None,
+) -> Union[str, dict[str, Any], io.BytesIO, None]:
+    """
+    Load audio from either a local file path or URL and return in specified format.
+
+    Args:
+        audio (`str`): Either a local file path or a URL to an audio file
+        return_format (`str`): Format to return the audio in:
+            - "base64": Base64 encoded string
+            - "dict": Dictionary with data and format
+            - "buffer": BytesIO object
+        timeout (`int`, *optional*): Timeout for URL requests in seconds
+        force_mono (`bool`): Whether to convert stereo audio to mono
+        sampling_rate (`int`, *optional*): If provided, the audio will be resampled to the specified sampling rate.
+
+    Returns:
+        `Union[str, Dict[str, Any], io.BytesIO, None]`:
+            - `str`: Base64 encoded audio data (if return_format="base64")
+            - `dict`: Dictionary with 'data' (base64 encoded audio data) and 'format' keys (if return_format="dict")
+            - `io.BytesIO`: BytesIO object containing audio data (if return_format="buffer")
+    """
+    # TODO: @eustlb, we actually don't need librosa but soxr is installed with librosa
+    requires_backends(load_audio_as, ["librosa"])
+
+    if return_format not in ["base64", "dict", "buffer"]:
+        raise ValueError(f"Invalid return_format: {return_format}. Must be 'base64', 'dict', or 'buffer'")
+
+    try:
+        # Load audio bytes from URL or file
+        audio_bytes = None
+        if audio.startswith(("http://", "https://")):
+            response = requests.get(audio, timeout=timeout)
+            response.raise_for_status()
+            audio_bytes = response.content
+        elif os.path.isfile(audio):
+            with open(audio, "rb") as audio_file:
+                audio_bytes = audio_file.read()
+        else:
+            raise ValueError(f"File not found: {audio}")
+
+        # Process audio data
+        with io.BytesIO(audio_bytes) as audio_file:
+            with sf.SoundFile(audio_file) as f:
+                audio_array = f.read(dtype="float32")
+                original_sr = f.samplerate
+                audio_format = f.format
+                if sampling_rate is not None and sampling_rate != original_sr:
+                    # Resample audio to target sampling rate
+                    audio_array = soxr.resample(audio_array, original_sr, sampling_rate, quality="HQ")
+                else:
+                    sampling_rate = original_sr
+
+        # Convert to mono if needed
+        if force_mono and audio_array.ndim != 1:
+            audio_array = audio_array.mean(axis=1)
+
+        buffer = io.BytesIO()
+        sf.write(buffer, audio_array, sampling_rate, format=audio_format.upper())
+        buffer.seek(0)
+
+        if return_format == "buffer":
+            return buffer
+        elif return_format == "base64":
+            return base64.b64encode(buffer.read()).decode("utf-8")
+        elif return_format == "dict":
+            return {
+                "data": base64.b64encode(buffer.read()).decode("utf-8"),
+                "format": audio_format.lower(),
+            }
+
+    except Exception as e:
+        raise ValueError(f"Error loading audio: {e}")
+
+
 AudioInput = Union[
    np.ndarray, "torch.Tensor", list[np.ndarray], tuple[np.ndarray], list["torch.Tensor"], tuple["torch.Tensor"]  # noqa: F821
 ]
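
The tail end of `load_audio_as` above does two simple things: it optionally averages channels into mono, and it packages the encoded bytes according to `return_format`. The sketch below mirrors only that logic with the standard library, so it runs without `soundfile` or `soxr`. The helper names `package_audio` and `to_mono` and the placeholder bytes are assumptions for illustration, and the decode, resample and re-encode steps of the real function are left out.

```python
import base64
import io


def package_audio(audio_bytes, audio_format, return_format):
    """Package encoded audio bytes as a buffer, a base64 string, or a dict."""
    if return_format not in ("base64", "dict", "buffer"):
        raise ValueError(f"Invalid return_format: {return_format}")

    buffer = io.BytesIO(audio_bytes)
    if return_format == "buffer":
        return buffer
    encoded = base64.b64encode(buffer.read()).decode("utf-8")
    if return_format == "base64":
        return encoded
    return {"data": encoded, "format": audio_format.lower()}


def to_mono(frames):
    """Average the channels of each frame, like `audio_array.mean(axis=1)`."""
    return [sum(frame) / len(frame) for frame in frames]


fake_wav = b"RIFF....WAVEfmt "  # placeholder bytes, not a valid audio file
packed = package_audio(fake_wav, "WAV", "dict")
print(packed["format"])
print(base64.b64decode(packed["data"]) == fake_wav)
print(to_mono([[0.5, -0.5], [1.0, 0.0]]))
```

The `"dict"` form carries both the base64 payload and a lowercase format tag, so a consumer can decode the bytes without guessing the container.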
--- a/src/transformers/models/__init__.py
+++ b/src/transformers/models/__init__.py
@@ -335,6 +335,7 @@ if TYPE_CHECKING:
    from .vits import *
    from .vivit import *
    from .vjepa2 import *
+    from .voxtral import *
    from .wav2vec2 import *
    from .wav2vec2_bert import *
    from .wav2vec2_conformer import *
--- a/src/transformers/models/auto/configuration_auto.py
+++ b/src/transformers/models/auto/configuration_auto.py
@@ -389,6 +389,8 @@ CONFIG_MAPPING_NAMES = OrderedDict[str, str](
        ("vits", "VitsConfig"),
        ("vivit", "VivitConfig"),
        ("vjepa2", "VJEPA2Config"),
+        ("voxtral", "VoxtralConfig"),
+        ("voxtral_encoder", "VoxtralEncoderConfig"),
        ("wav2vec2", "Wav2Vec2Config"),
        ("wav2vec2-bert", "Wav2Vec2BertConfig"),
        ("wav2vec2-conformer", "Wav2Vec2ConformerConfig"),
@@ -798,6 +800,8 @@ MODEL_NAMES_MAPPING = OrderedDict[str, str](
        ("vits", "VITS"),
        ("vivit", "ViViT"),
        ("vjepa2", "VJEPA2Model"),
+        ("voxtral", "Voxtral"),
+        ("voxtral_encoder", "Voxtral Encoder"),
        ("wav2vec2", "Wav2Vec2"),
        ("wav2vec2-bert", "Wav2Vec2-BERT"),
        ("wav2vec2-conformer", "Wav2Vec2-Conformer"),
@@ -864,6 +868,7 @@ SPECIAL_MODEL_TYPE_TO_MODULE_NAME = OrderedDict[str, str](
        ("xclip", "x_clip"),
        ("clip_vision_model", "clip"),
        ("qwen2_audio_encoder", "qwen2_audio"),
+        ("voxtral_encoder", "voxtral"),
        ("clip_text_model", "clip"),
        ("aria_text", "aria"),
        ("gemma3_text", "gemma3"),
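
The `SPECIAL_MODEL_TYPE_TO_MODULE_NAME` entry above is presumably there because `voxtral_encoder` has no package of its own: its config class lives in the `voxtral` package. The sketch below shows how such a lookup table can resolve a model type to a module name. The helper `module_name_for` is a simplified stand-in written for illustration and should not be read as the library's actual lookup code.

```python
# Subset of the special cases shown in the diff above.
SPECIAL_MODEL_TYPE_TO_MODULE_NAME = {
    "qwen2_audio_encoder": "qwen2_audio",
    "voxtral_encoder": "voxtral",
}


def module_name_for(model_type):
    """Hypothetical helper: special-cased types first, else normalise dashes."""
    if model_type in SPECIAL_MODEL_TYPE_TO_MODULE_NAME:
        return SPECIAL_MODEL_TYPE_TO_MODULE_NAME[model_type]
    return model_type.replace("-", "_")


for model_type in ("voxtral", "voxtral_encoder", "wav2vec2-bert"):
    print(model_type, "->", module_name_for(model_type))
```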
--- a/src/transformers/models/auto/modeling_auto.py
+++ b/src/transformers/models/auto/modeling_auto.py
@@ -359,6 +359,8 @@ MODEL_MAPPING_NAMES = OrderedDict(
        ("vits", "VitsModel"),
        ("vivit", "VivitModel"),
        ("vjepa2", "VJEPA2Model"),
+        ("voxtral", "VoxtralForConditionalGeneration"),
+        ("voxtral_encoder", "VoxtralEncoder"),
        ("wav2vec2", "Wav2Vec2Model"),
        ("wav2vec2-bert", "Wav2Vec2BertModel"),
        ("wav2vec2-conformer", "Wav2Vec2ConformerModel"),
@@ -458,6 +460,7 @@ MODEL_FOR_PRETRAINING_MAPPING_NAMES = OrderedDict(
        ("vipllava", "VipLlavaForConditionalGeneration"),
        ("visual_bert", "VisualBertForPreTraining"),
        ("vit_mae", "ViTMAEForPreTraining"),
+        ("voxtral", "VoxtralForConditionalGeneration"),
        ("wav2vec2", "Wav2Vec2ForPreTraining"),
        ("wav2vec2-conformer", "Wav2Vec2ConformerForPreTraining"),
        ("xlm", "XLMWithLMHeadModel"),
@@ -1078,6 +1081,7 @@ MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING_NAMES = OrderedDict(
        ("t5", "T5ForConditionalGeneration"),
        ("t5gemma", "T5GemmaForConditionalGeneration"),
        ("umt5", "UMT5ForConditionalGeneration"),
+        ("voxtral", "VoxtralForConditionalGeneration"),
        ("xlm-prophetnet", "XLMProphetNetForConditionalGeneration"),
    ]
 )
--- a/src/transformers/models/auto/processing_auto.py
+++ b/src/transformers/models/auto/processing_auto.py
@@ -132,6 +132,7 @@ PROCESSOR_MAPPING_NAMES = OrderedDict(
        ("vilt", "ViltProcessor"),
        ("vipllava", "LlavaProcessor"),
        ("vision-text-dual-encoder", "VisionTextDualEncoderProcessor"),
+        ("voxtral", "VoxtralProcessor"),
        ("wav2vec2", "Wav2Vec2Processor"),
        ("wav2vec2-bert", "Wav2Vec2Processor"),
        ("wav2vec2-conformer", "Wav2Vec2Processor"),
--- a/src/transformers/models/voxtral/__init__.py
+++ b/src/transformers/models/voxtral/__init__.py
@@ -0,0 +1,29 @@
+# coding=utf-8
+# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import TYPE_CHECKING
+
+from ...utils import _LazyModule
+from ...utils.import_utils import define_import_structure
+
+
+if TYPE_CHECKING:
+    from .configuration_voxtral import *
+    from .modeling_voxtral import *
+    from .processing_voxtral import *
+else:
+    import sys
+
+    _file = globals()["__file__"]
+    sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
--- a/src/transformers/models/voxtral/configuration_voxtral.py
+++ b/src/transformers/models/voxtral/configuration_voxtral.py
@@ -0,0 +1,203 @@
+# coding=utf-8
+# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from ...configuration_utils import PretrainedConfig
+from ..auto import CONFIG_MAPPING, AutoConfig
+
+
+class VoxtralEncoderConfig(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of a [`VoxtralEncoder`]. It is used to instantiate a
+    Voxtral audio encoder according to the specified arguments, defining the model architecture. Instantiating a
+    configuration with the defaults will yield a similar configuration to that of the audio encoder of the Voxtral
+    architecture.
+
+    e.g. [mistralai/Voxtral-Mini-3B-2507](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507)
+
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+
+    Args:
+        vocab_size (`int`, *optional*, defaults to 51866):
+            Vocabulary size of the model.
+        hidden_size (`int`, *optional*, defaults to 1280):
+            Dimensionality of the hidden representations.
+        intermediate_size (`int`, *optional*, defaults to 5120):
+            Dimension of the MLP representations.
+        num_hidden_layers (`int`, *optional*, defaults to 32):
+            Number of hidden layers in the Transformer encoder.
+        num_attention_heads (`int`, *optional*, defaults to 20):
+            Number of attention heads for each attention layer in the Transformer encoder.
+        scale_embedding (`bool`, *optional*, defaults to `False`):
+            Scale embeddings by a factor of sqrt(hidden_size) if True.
+        activation_function (`str`, *optional*, defaults to `"gelu"`):
+            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
+            `"relu"`, `"silu"` and `"gelu_new"` are supported.
+        num_mel_bins (`int`, *optional*, defaults to 128):
+            Number of mel features used per input features. Should correspond to the value used in the
+            `VoxtralProcessor` class.
+        max_source_positions (`int`, *optional*, defaults to 1500):
+            The maximum sequence length of log-mel filter-bank features that this model might ever be used with.
+        initializer_range (`float`, *optional*, defaults to 0.02):
+            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+        attention_dropout (`float`, *optional*, defaults to 0.0):
+            The dropout ratio for the attention probabilities.
+
+    ```python
+    >>> from transformers import VoxtralEncoderConfig, VoxtralEncoder
+
+    >>> # Initializing a VoxtralEncoderConfig
+    >>> configuration = VoxtralEncoderConfig()
+
+    >>> # Initializing a VoxtralEncoder (with random weights)
+    >>> model = VoxtralEncoder(configuration)
+
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```"""
+
+    model_type = "voxtral_encoder"
+
+    attribute_map = {
+        "d_model": "hidden_size",
+        "encoder_layers": "num_hidden_layers",
+        "encoder_attention_heads": "num_attention_heads",
+        "encoder_ffn_dim": "intermediate_size",
+        "encoder_layerdrop": "layerdrop",
+    }
+
+    def __init__(
+        self,
+        vocab_size=51866,
+        hidden_size=1280,
+        intermediate_size=5120,
+        num_hidden_layers=32,
+        num_attention_heads=20,
+        scale_embedding=False,
+        activation_function="gelu",
+        num_mel_bins=128,
+        max_source_positions=1500,
+        initializer_range=0.02,
+        attention_dropout=0.0,
+        **kwargs,
+    ):
+        super().__init__(**kwargs)
+
+        self.vocab_size = vocab_size
+        self.hidden_size = hidden_size
+        self.intermediate_size = intermediate_size
+        self.num_hidden_layers = num_hidden_layers
+
+        self.num_attention_heads = num_attention_heads
+        self.scale_embedding = scale_embedding  # scale factor will be sqrt(hidden_size) if True
+        self.activation_function = activation_function
+        self.num_mel_bins = num_mel_bins
+        self.max_source_positions = max_source_positions
+        self.initializer_range = initializer_range
+
+        # TODO: @eustlb, we do not use dropout and layerdrop, yet we need to hardcode them
+        # to be able to reuse the Whisper code through modular (in practice through Qwen2-Audio, which copies it).
+        # After a future Whisper refactor, we should remove this.
+        self.dropout = 0.0
+        self.layerdrop = 0.0
+        self.activation_dropout = 0.0
+
+        self.attention_dropout = attention_dropout
+
+
+class VoxtralConfig(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of a [`VoxtralForConditionalGeneration`]. It is used to instantiate a
+    Voxtral model according to the specified arguments, defining the model architecture. Instantiating a configuration
+    with the defaults will yield a similar configuration to that of the Voxtral-Mini-3B.
+
+    e.g. [mistralai/Voxtral-Mini-3B-2507](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507)
+
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+
+    Args:
+        audio_config (`Union[AutoConfig, dict]`, *optional*):
+            The config object or dictionary of the audio encoder.
+        text_config (`Union[AutoConfig, dict]`, *optional*):
+            The config object or dictionary of the text model.
+        audio_token_id (`int`, *optional*):
+            The audio token index to encode the audio prompt.
+        projector_hidden_act (`str`, *optional*, defaults to `"gelu"`):
+            The activation function (function or string) in the multi-modal projector.
+
+    ```python
+    >>> from transformers import VoxtralForConditionalGeneration, VoxtralConfig
+
+    >>> # Initializing a Voxtral configuration
+    >>> configuration = VoxtralConfig(audio_token_id=24, projector_hidden_act="gelu")
+
+    >>> # Initializing a 3B model with random weights
+    >>> model = VoxtralForConditionalGeneration(configuration)
+
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```"""
+
+    model_type = "voxtral"
+    sub_configs = {"text_config": AutoConfig, "audio_config": AutoConfig}
+
+    _default_text_config_kwargs = {
+        "vocab_size": 131072,
+        "hidden_size": 3072,
+        "intermediate_size": 8192,
+        "num_hidden_layers": 30,
+        "num_key_value_heads": 8,
+        "max_position_embeddings": 131072,
+        "rms_norm_eps": 1e-05,
+        "use_cache": True,
+        "rope_theta": 100000000.0,
+        "head_dim": 128,
+    }
+
+    def __init__(
+        self,
+        audio_config=None,
+        text_config=None,
+        audio_token_id=None,
+        projector_hidden_act="gelu",
+        **kwargs,
+    ):
+        if isinstance(audio_config, dict):
+            audio_config["model_type"] = (
+                audio_config["model_type"] if "model_type" in audio_config else "voxtral_encoder"
+            )
+            audio_config = CONFIG_MAPPING[audio_config["model_type"]](**audio_config)
+        elif audio_config is None:
+            audio_config = CONFIG_MAPPING["voxtral_encoder"]()
+        self.audio_config = audio_config
+
+        if isinstance(text_config, dict):
+            text_config["model_type"] = text_config["model_type"] if "model_type" in text_config else "llama"
+            text_config = CONFIG_MAPPING[text_config["model_type"]](
+                **{**self._default_text_config_kwargs, **text_config}
+            )
+        elif text_config is None:
+            text_config = CONFIG_MAPPING["llama"](**self._default_text_config_kwargs)
+        self.text_config = text_config
+
+        self.vocab_size = text_config.vocab_size
+        self.hidden_size = text_config.hidden_size
+        self.audio_token_id = audio_token_id
+        self.projector_hidden_act = projector_hidden_act
+
+        super().__init__(**kwargs)
+
+
+__all__ = ["VoxtralEncoderConfig", "VoxtralConfig"]
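
`VoxtralConfig.__init__` above resolves `text_config` in two steps: a missing `model_type` falls back to `"llama"`, and user-supplied keys are layered over `_default_text_config_kwargs` with `{**defaults, **text_config}`. The sketch below reproduces only that merge on plain dictionaries. The helper `resolve_text_config` is hypothetical, it returns a dict instead of building a config object through `CONFIG_MAPPING`, and unlike the source it does not mutate the dict passed in.

```python
# Subset of the defaults shown in `_default_text_config_kwargs`.
DEFAULT_TEXT_CONFIG_KWARGS = {
    "vocab_size": 131072,
    "hidden_size": 3072,
    "num_hidden_layers": 30,
}


def resolve_text_config(text_config=None):
    """Hypothetical helper: fill missing keys from the defaults, default to llama."""
    # Later keys win, so user-provided values override the defaults.
    merged = {**DEFAULT_TEXT_CONFIG_KWARGS, **(text_config or {})}
    merged.setdefault("model_type", "llama")
    return merged


print(resolve_text_config())
print(resolve_text_config({"hidden_size": 5120}))
```

Because the defaults sit on the left of the merge, a partial `text_config` such as `{"hidden_size": 5120}` changes only that key and keeps the remaining 3B defaults.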
--- a/src/transformers/models/voxtral/convert_voxtral_weights_to_hf.py
+++ b/src/transformers/models/voxtral/convert_voxtral_weights_to_hf.py
@@ -0,0 +1,302 @@
+# coding=utf-8
+# Copyright 2025 HuggingFace Inc. team. All rights reserved.
+#
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import argparse
+import gc
+import json
+import os
+import re
+
+import torch
+from safetensors.torch import load_file
+
+from transformers import (
+    MistralCommonTokenizer,
+    VoxtralConfig,
+    VoxtralForConditionalGeneration,
+    VoxtralProcessor,
+    WhisperFeatureExtractor,
+)
+from transformers.models.whisper.modeling_whisper import sinusoids
+from transformers.utils.hub import cached_file
+
+
+# fmt: off
+STATE_DICT_MAPPING = {
+    # Text model keys
+    r"^output.weight":                                                                  r"language_model.lm_head.weight",
+    r"^norm.weight":                                                                    r"language_model.model.norm.weight",
+    r"^tok_embeddings.weight":                                                          r"language_model.model.embed_tokens.weight",
+    r"^layers.(\d+).attention_norm.weight":                                             r"language_model.model.layers.\1.input_layernorm.weight",
+    r"^layers.(\d+).ffn_norm.weight":                                                   r"language_model.model.layers.\1.post_attention_layernorm.weight",
+    r"^layers.(\d+).attention.w(q|k|v|o).weight":                                       r"language_model.model.layers.\1.self_attn.\2_proj.weight",
+    r"^layers.(\d+).feed_forward.w1.weight":                                            r"language_model.model.layers.\1.mlp.gate_proj.weight",
+    r"^layers.(\d+).feed_forward.w2.weight":                                            r"language_model.model.layers.\1.mlp.down_proj.weight",
+    r"^layers.(\d+).feed_forward.w3.weight":                                            r"language_model.model.layers.\1.mlp.up_proj.weight",
+
+    r"mm_whisper_embeddings.tok_embeddings.weight":                                     r"language_model.model.embed_tokens.weight",
+
+    # audio model keys
+    r"mm_whisper_embeddings.whisper_encoder\.conv_layers\.0\.(weight|bias)": r"audio_tower.conv1.\1",
+    r"mm_whisper_embeddings.whisper_encoder\.conv_layers\.1\.(weight|bias)": r"audio_tower.conv2.\1",
+
+    r"mm_whisper_embeddings.whisper_encoder\.transformer\.norm\.(weight|bias)": r"audio_tower.layer_norm.\1",
+
+    r"mm_whisper_embeddings.whisper_encoder\.transformer\.layers\.(\d+)\.attention\.w([qkv])\.(weight|bias)": r"audio_tower.layers.\1.self_attn.\2_proj.\3",
+    r"mm_whisper_embeddings.whisper_encoder\.transformer\.layers\.(\d+)\.attention\.wo\.(weight|bias)": r"audio_tower.layers.\1.self_attn.out_proj.\2",
+    r"mm_whisper_embeddings.whisper_encoder\.transformer\.layers\.(\d+)\.attention_norm\.(weight|bias)": r"audio_tower.layers.\1.self_attn_layer_norm.\2",
+
+    r"mm_whisper_embeddings.whisper_encoder\.transformer\.layers\.(\d+)\.feed_forward\.w1\.(weight|bias)": r"audio_tower.layers.\1.fc1.\2",
+    r"mm_whisper_embeddings.whisper_encoder\.transformer\.layers\.(\d+)\.feed_forward\.w2\.(weight|bias)": r"audio_tower.layers.\1.fc2.\2",
+
+    r"mm_whisper_embeddings.whisper_encoder\.transformer\.layers\.(\d+)\.ffn_norm\.(weight|bias)": r"audio_tower.layers.\1.final_layer_norm.\2",
+
+    r"mm_whisper_embeddings.audio_language_projection\.0\.weight":               r"multi_modal_projector.linear_1.weight",
+    r"mm_whisper_embeddings.audio_language_projection\.2\.weight":               r"multi_modal_projector.linear_2.weight",
+}
+# fmt: on
+
+
+def convert_config(original_config: dict, max_position_embeddings: int = 131072):
+    original_audio_config = original_config.pop("multimodal")
+    original_audio_config = original_audio_config["whisper_model_args"]["encoder_args"]
+    original_text_config = original_config
+
+    # Text config
+    text_key_mapping = {
+        "hidden_size": "dim",
+        "num_hidden_layers": "n_layers",
+        "intermediate_size": "hidden_dim",
+        "num_attention_heads": "n_heads",
+        "num_key_value_heads": "n_kv_heads",
+        "rms_norm_eps": "norm_eps",
+    }
+    similar_text_keys_to_keep = [
+        "head_dim",
+        "vocab_size",
+        "rope_theta",
+    ]
+    new_text_config_kwargs = {k: original_text_config[v] for k, v in text_key_mapping.items()}
+    new_text_config_kwargs.update({k: v for k, v in original_text_config.items() if k in similar_text_keys_to_keep})
+    # These are not always defined depending on `params.json`
+    new_text_config_kwargs["sliding_window"] = original_text_config.get("sliding_window", None)
+    new_text_config_kwargs["max_position_embeddings"] = original_text_config.get(
+        "max_seq_len", max_position_embeddings
+    )
+    # This may sometimes be a string in `params.json`
+    if new_text_config_kwargs["sliding_window"] is not None:
+        new_text_config_kwargs["sliding_window"] = int(new_text_config_kwargs["sliding_window"])
+
+    # Audio config
+    audio_key_mapping = {
+        "hidden_size": "dim",
+        "num_hidden_layers": "n_layers",
+        "intermediate_size": "hidden_dim",
+        "num_attention_heads": "n_heads",
+        "num_key_value_heads": "n_heads",
+    }
+    similar_audio_keys_to_keep = [
+        "head_dim",
+        "vocab_size",
+    ]
+    new_audio_config_kwargs = {k: original_audio_config[v] for k, v in audio_key_mapping.items()}
+    new_audio_config_kwargs.update({k: v for k, v in original_audio_config.items() if k in similar_audio_keys_to_keep})
+
+    new_config = VoxtralConfig(
+        audio_config=new_audio_config_kwargs,
+        text_config=new_text_config_kwargs,
+        audio_token_id=24,
+        projector_hidden_act="gelu",
+    )
+
+    return new_config
+
+
+def map_old_key_to_new(old_key):
+    """Map a key of the original state dict to the equivalent key in HF format."""
+    for pattern, replacement in STATE_DICT_MAPPING.items():
+        new_key, n_replace = re.subn(pattern, replacement, old_key)
+        # Early exit of the loop
+        if n_replace > 0:
+            return new_key
+
+    raise ValueError(f"Key: {old_key} could not be mapped (check the mapping).")
+
+
+def permute_for_rope(tensor, n_heads, dim1, dim2):
+    """Permute the weights for the RoPE formulation."""
+    tensor = tensor.view(n_heads, dim1 // n_heads // 2, 2, dim2)
+    tensor = tensor.transpose(1, 2)
+    tensor = tensor.reshape(dim1, dim2)
+    return tensor
+
+
+def convert_state_dict(original_state_dict, config):
+    """Convert a state dict file, when a single `nn.Module` is never sharded in different files (usual case)."""
+    new_dict = {}
+
+    num_attention_heads = config.num_attention_heads
+    hidden_size = config.hidden_size
+    head_dim = config.head_dim
+    num_key_value_heads = config.num_key_value_heads
+    key_value_dim = head_dim * num_key_value_heads
+    query_dim = head_dim * num_attention_heads
+
+    for old_key, tensor in original_state_dict.items():
+        new_key = map_old_key_to_new(old_key)
+
+        if "audio_tower" not in new_key:
+            if "q_proj" in new_key:
+                tensor = tensor.view(num_attention_heads, head_dim, hidden_size).reshape(query_dim, hidden_size)
+                tensor = permute_for_rope(tensor, num_attention_heads, query_dim, hidden_size)
+            elif "k_proj" in new_key:
+                tensor = tensor.view(num_key_value_heads, head_dim, hidden_size).reshape(key_value_dim, hidden_size)
+                tensor = permute_for_rope(tensor, num_key_value_heads, key_value_dim, hidden_size)
+            elif "v_proj" in new_key:
+                tensor = tensor.view(num_key_value_heads, head_dim, hidden_size).reshape(key_value_dim, hidden_size)
+
+        new_dict[new_key] = tensor
+    return new_dict
+
+
+def write_model(
+    input_path_or_repo,
+    model_name,
+    config_name,
+    output_dir,
+    safe_serialization=True,
+):
+    print("Converting the model.")
+    os.makedirs(output_dir, exist_ok=True)
+
+    # --------------
+    # convert config
+    # --------------
+
+    config_path = cached_file(input_path_or_repo, config_name)
+    with open(config_path, "r") as f:
+        original_config = json.load(f)
+
+    config = convert_config(original_config)
+    model = VoxtralForConditionalGeneration(config)
+
+    # ---------------
+    # convert weights
+    # ---------------
+
+    model_path = cached_file(input_path_or_repo, model_name)
+    print(f"Fetching all parameters from the checkpoint at {model_path}...")
+    state_dict = load_file(model_path)
+    print("Converting model...")
+    converted_state_dict = convert_state_dict(state_dict, config.text_config)
+
+    # we need to add embed positions as they are not in the state dict
+    with torch.no_grad(), torch.device("cuda"):
+        # TODO: @eustlb, the positional embeddings are created on GPU here:
+        # vLLM initializes them on device, while we save them in the state dict
+        embed_positions_weight = sinusoids(config.audio_config.max_source_positions, config.audio_config.hidden_size)
+    converted_state_dict["audio_tower.embed_positions.weight"] = embed_positions_weight.cpu()
+
+    # -------------------------
+    # load the weights and save
+    # -------------------------
+
+    print("Loading the checkpoint in a Voxtral model.")
+    with torch.device("meta"):
+        model = VoxtralForConditionalGeneration(config)
+    model.load_state_dict(converted_state_dict, strict=True, assign=True)
+    print("Checkpoint loaded successfully.")
+    del model.config._name_or_path
+
+    del model.generation_config._from_model_config
+    model.generation_config.pad_token_id = 11
+
+    print("Saving the model.")
+    model.save_pretrained(output_dir, safe_serialization=safe_serialization)
+    del state_dict, model
+
+    # Safety check: reload the converted model
+    gc.collect()
+    print("Reloading the model to check if it's saved correctly.")
+    VoxtralForConditionalGeneration.from_pretrained(output_dir, torch_dtype=torch.bfloat16, device_map="auto")
+    print("Model reloaded successfully.")
+
+
+def write_processor(input_path_or_repo: str, feature_extractor_path_or_repo: str, output_dir: str):
+    tokenizer = MistralCommonTokenizer.from_pretrained(input_path_or_repo)
+    feature_extractor = WhisperFeatureExtractor.from_pretrained(feature_extractor_path_or_repo)
+
+    print("Creating the processor...")
+    # Create the processor and save it
+    processor = VoxtralProcessor(
+        feature_extractor=feature_extractor,
+        tokenizer=tokenizer,
+    )
+    processor.save_pretrained(output_dir)
+    print("Processor saved successfully.")
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Convert Voxtral weights to Hugging Face format")
+    parser.add_argument(
+        "--input_path_or_repo",
+        type=str,
+        required=True,
+        help="Path or repo containing Voxtral weights",
+    )
+    parser.add_argument(
+        "--model_name",
+        type=str,
+        required=True,
+        help="Name of the model in input_path_or_repo",
+    )
+    parser.add_argument(
+        "--config_name",
+        type=str,
+        required=True,
+        help="Name of the config in input_path_or_repo",
+    )
+    parser.add_argument(
+        "--feature_extractor_path_or_repo",
+        type=str,
+        required=True,
+        help="Path or repo containing the feature extractor",
+    )
+    parser.add_argument(
+        "--output_dir",
+        help="Location to write HF model and tokenizer",
+    )
+    parser.add_argument(
+        "--safe_serialization", action="store_true", default=True, help="Whether or not to save using `safetensors`."
+    )
+    args = parser.parse_args()
+
+    write_model(
+        args.input_path_or_repo,
+        args.model_name,
+        args.config_name,
+        args.output_dir,
+        safe_serialization=args.safe_serialization,
+    )
+
+    write_processor(
+        args.input_path_or_repo,
+        args.feature_extractor_path_or_repo,
+        args.output_dir,
+    )
+
+
+if __name__ == "__main__":
+    main()
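Reviewer note on the conversion script above: the key renaming in `map_old_key_to_new` relies on `re.subn` returning both the rewritten string and the number of substitutions, so the first pattern that matches wins. A minimal standalone sketch of that technique follows. The two-entry `KEY_MAPPING` and the `rename_key` helper are hypothetical names for illustration only, and the patterns here escape their dots, which the mapping in the script does not always do.

```python
import re

# Hypothetical two-entry subset of a checkpoint key mapping (illustration only).
KEY_MAPPING = {
    r"^layers\.(\d+)\.attention\.w(q|k|v|o)\.weight": r"language_model.model.layers.\1.self_attn.\2_proj.weight",
    r"^norm\.weight": r"language_model.model.norm.weight",
}


def rename_key(old_key: str) -> str:
    """Return the renamed key for the first pattern that matches, else raise."""
    for pattern, replacement in KEY_MAPPING.items():
        # subn returns (new_string, number_of_substitutions)
        new_key, n_replaced = re.subn(pattern, replacement, old_key)
        if n_replaced > 0:
            return new_key
    raise ValueError(f"Key: {old_key} could not be mapped (check the mapping).")


print(rename_key("layers.3.attention.wq.weight"))
# language_model.model.layers.3.self_attn.q_proj.weight
```

Raising on an unmapped key, as the script does, means a checkpoint with unexpected tensors fails loudly instead of silently dropping weights.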
--- a/src/transformers/models/voxtral/modeling_voxtral.py
+++ b/src/transformers/models/voxtral/modeling_voxtral.py
@@ -0,0 +1,542 @@
+#                🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
+#           This file was automatically generated from src/transformers/models/voxtral/modular_voxtral.py.
+#               Do NOT edit this file manually as any edits will be overwritten by the generation of
+#             the file from the modular. If any change should be done, please apply the change to the
+#                          modular_voxtral.py file directly. One of our CI enforces this.
+#                🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
+# coding=utf-8
+# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import math
+from typing import Callable, Optional, Union
+
+import torch
+from torch import nn
+
+from ...activations import ACT2FN
+from ...cache_utils import Cache
+from ...generation import GenerationMixin
+from ...modeling_layers import GradientCheckpointingLayer
+from ...modeling_outputs import BaseModelOutput, BaseModelOutputWithPast, CausalLMOutputWithPast
+from ...modeling_utils import ALL_ATTENTION_FUNCTIONS, PreTrainedModel
+from ...processing_utils import Unpack
+from ...utils import TransformersKwargs, auto_docstring, can_return_tuple, logging
+from ...utils.generic import check_model_inputs
+from ..auto import AutoModel, AutoModelForCausalLM
+from .configuration_voxtral import VoxtralConfig, VoxtralEncoderConfig
+
+
+logger = logging.get_logger(__name__)
+
+
+def eager_attention_forward(
+    module: nn.Module,
+    query: torch.Tensor,
+    key: torch.Tensor,
+    value: torch.Tensor,
+    attention_mask: Optional[torch.Tensor],
+    scaling: Optional[float] = None,
+    dropout: float = 0.0,
+    head_mask: Optional[torch.Tensor] = None,
+    **kwargs,
+):
+    if scaling is None:
+        scaling = query.size(-1) ** -0.5
+
+    attn_weights = torch.matmul(query, key.transpose(2, 3)) * scaling
+    if attention_mask is not None and attention_mask.ndim == 4:
+        attn_weights = attn_weights + attention_mask[:, :, :, : key.shape[-2]]
+
+    attn_weights = nn.functional.softmax(attn_weights, dim=-1)
+
+    if head_mask is not None:
+        attn_weights = attn_weights * head_mask.view(1, -1, 1, 1)
+
+    attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
+    attn_output = torch.matmul(attn_weights, value)
+    attn_output = attn_output.transpose(1, 2).contiguous()
+
+    return attn_output, attn_weights
+
+
+class VoxtralAttention(nn.Module):
+    """Multi-headed attention from 'Attention Is All You Need' paper"""
+
+    def __init__(
+        self,
+        embed_dim: int,
+        num_heads: int,
+        dropout: float = 0.0,
+        is_decoder: bool = False,
+        bias: bool = True,
+        is_causal: bool = False,
+        layer_idx: Optional[int] = None,
+        config: Optional[VoxtralConfig] = None,
+    ):
+        super().__init__()
+        self.embed_dim = embed_dim
+        self.num_heads = num_heads
+        self.dropout = dropout
+        self.head_dim = embed_dim // num_heads
+        self.config = config
+
+        if (self.head_dim * num_heads) != self.embed_dim:
+            raise ValueError(
+                f"embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim}"
+                f" and `num_heads`: {num_heads})."
+            )
+        self.scaling = self.head_dim**-0.5
+        self.is_decoder = is_decoder
+        self.is_causal = is_causal
+
+        if layer_idx is None and is_decoder:
+            logger.warning_once(
+                f"Instantiating a decoder {self.__class__.__name__} without passing `layer_idx` is not recommended and "
+                "will lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` "
+                "when creating this class."
+            )
+        self.layer_idx = layer_idx
+
+        self.k_proj = nn.Linear(embed_dim, embed_dim, bias=False)
+        self.v_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
+        self.q_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
+        self.out_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
+
+    def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
+        return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        layer_head_mask: Optional[torch.Tensor] = None,
+        output_attentions: bool = False,
+        **kwargs,
+    ) -> tuple[torch.Tensor, Optional[torch.Tensor], Optional[tuple[torch.Tensor]]]:
+        """Input shape: Batch x Time x Channel"""
+
+        bsz, tgt_len, _ = hidden_states.size()
+
+        # Scaling is susceptible to floating-point imprecision,
+        # which can lead to different results (this varies from model
+        # to model, e.g. whisper is one such case). We therefore keep the
+        # original order of scaling to follow the original implementation
+        # and enforce no scaling (1.0) in the attention call below.
+        query_states = self._shape(self.q_proj(hidden_states) * self.scaling, tgt_len, bsz)
+        key_states = self._shape(self.k_proj(hidden_states), -1, bsz)
+        value_states = self._shape(self.v_proj(hidden_states), -1, bsz)
+
+        attention_interface: Callable = eager_attention_forward
+        if self.config._attn_implementation != "eager":
+            attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
+
+        attn_output, attn_weights = attention_interface(
+            self,
+            query_states,
+            key_states,
+            value_states,
+            attention_mask,
+            dropout=0.0 if not self.training else self.dropout,
+            scaling=1.0,
+            output_attentions=output_attentions,
+            head_mask=layer_head_mask,
+            **kwargs,
+        )
+
+        attn_output = attn_output.reshape(bsz, tgt_len, -1).contiguous()
+        attn_output = self.out_proj(attn_output)
+
+        return attn_output, attn_weights
+
+
+class VoxtralEncoderLayer(GradientCheckpointingLayer):
+    def __init__(self, config: VoxtralConfig):
+        super().__init__()
+        self.embed_dim = config.d_model
+
+        self.self_attn = VoxtralAttention(
+            embed_dim=self.embed_dim,
+            num_heads=config.encoder_attention_heads,
+            dropout=config.attention_dropout,
+            config=config,
+        )
+        self.self_attn_layer_norm = nn.LayerNorm(self.embed_dim)
+        self.dropout = config.dropout
+        self.activation_fn = ACT2FN[config.activation_function]
+        self.activation_dropout = config.activation_dropout
+        self.fc1 = nn.Linear(self.embed_dim, config.encoder_ffn_dim)
+        self.fc2 = nn.Linear(config.encoder_ffn_dim, self.embed_dim)
+        self.final_layer_norm = nn.LayerNorm(self.embed_dim)
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: torch.Tensor,
+        layer_head_mask: torch.Tensor,
+        output_attentions: bool = False,
+    ) -> torch.Tensor:
+        """
+        Args:
+            hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
+            attention_mask (`torch.FloatTensor`): attention mask of size
+                `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
+            layer_head_mask (`torch.FloatTensor`): mask for attention heads in a given layer of size
+                `(encoder_attention_heads,)`.
+            output_attentions (`bool`, *optional*):
+                Whether or not to return the attentions tensors of all attention layers. See `attentions` under
+                returned tensors for more detail.
+        """
+        residual = hidden_states
+        hidden_states = self.self_attn_layer_norm(hidden_states)
+        hidden_states, attn_weights = self.self_attn(
+            hidden_states=hidden_states,
+            attention_mask=attention_mask,
+            layer_head_mask=layer_head_mask,
+            output_attentions=output_attentions,
+        )
+        hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)
+        hidden_states = residual + hidden_states
+
+        residual = hidden_states
+        hidden_states = self.final_layer_norm(hidden_states)
+        hidden_states = self.activation_fn(self.fc1(hidden_states))
+        hidden_states = nn.functional.dropout(hidden_states, p=self.activation_dropout, training=self.training)
+        hidden_states = self.fc2(hidden_states)
+        hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)
+        hidden_states = residual + hidden_states
+
+        if hidden_states.dtype == torch.float16:
+            clamp_value = torch.finfo(hidden_states.dtype).max - 1000
+            hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value)
+
+        return hidden_states, attn_weights
+
+
+@auto_docstring
+class VoxtralPreTrainedModel(PreTrainedModel):
+    config: VoxtralConfig
+    base_model_prefix = "model"
+    supports_gradient_checkpointing = True
+    _no_split_modules = ["VoxtralAttention"]
+    _skip_keys_device_placement = "past_key_values"
+    _supports_flash_attn = True
+    _supports_sdpa = True
+    _supports_flex_attn = True
+    _supports_cache_class = True
+    _supports_attention_backend = True
+    _supports_static_cache = True
+
+    def _init_weights(self, module):
+        # important: this ported version of Voxtral isn't meant for training from scratch - only
+        # inference and fine-tuning - so the proper init weights code has been removed
+        std = (
+            self.config.initializer_range
+            if hasattr(self.config, "initializer_range")
+            else self.config.audio_config.initializer_range
+        )
+
+        if isinstance(module, (nn.Linear, nn.Conv1d)):
+            module.weight.data.normal_(mean=0.0, std=std)
+            if module.bias is not None:
+                module.bias.data.zero_()
+        elif isinstance(module, nn.LayerNorm):
+            module.weight.data.fill_(1.0)
+            module.bias.data.zero_()
+        elif isinstance(module, nn.Embedding):
+            module.weight.data.normal_(mean=0.0, std=std)
+            if module.padding_idx is not None:
+                module.weight.data[module.padding_idx].zero_()
+
+
+@auto_docstring(
+    custom_intro="""
+    The Voxtral encoder, which is a Whisper encoder.
+    """
+)
+class VoxtralEncoder(VoxtralPreTrainedModel):
+    """
+    Transformer encoder consisting of *config.encoder_layers* self attention layers. Each layer is a
+    [`VoxtralEncoderLayer`].
+
+    Args:
+        config: VoxtralEncoderConfig
+    """
+
+    # Ignore copy
+    config: VoxtralEncoderConfig
+    main_input_name = "input_features"
+    _no_split_modules = ["VoxtralEncoderLayer"]
+    _can_record_outputs = {
+        "attentions": VoxtralAttention,
+        "hidden_states": VoxtralEncoderLayer,
+    }
+
+    def __init__(self, config: VoxtralEncoderConfig):
+        super().__init__(config)
+        self.dropout = config.dropout
+        self.layerdrop = config.encoder_layerdrop
+
+        embed_dim = config.d_model
+        self.num_mel_bins = config.num_mel_bins
+        self.padding_idx = config.pad_token_id
+        self.max_source_positions = config.max_source_positions
+        self.embed_scale = math.sqrt(embed_dim) if config.scale_embedding else 1.0
+
+        self.conv1 = nn.Conv1d(self.num_mel_bins, embed_dim, kernel_size=3, padding=1)
+        self.conv2 = nn.Conv1d(embed_dim, embed_dim, kernel_size=3, stride=2, padding=1)
+
+        self.embed_positions = nn.Embedding(self.max_source_positions, embed_dim)
+        self.embed_positions.requires_grad_(False)
+
+        self.layers = nn.ModuleList([VoxtralEncoderLayer(config) for _ in range(config.encoder_layers)])
+        self.layer_norm = nn.LayerNorm(config.d_model)
+        # Ignore copy
+        self.avg_pooler = nn.AvgPool1d(2, stride=2)
+
+        self.gradient_checkpointing = False
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def _freeze_parameters(self):
+        for param in self.parameters():
+            param.requires_grad = False
+        self._requires_grad = False
+
+    def get_input_embeddings(self) -> nn.Module:
+        return self.conv1
+
+    def set_input_embeddings(self, value: nn.Module):
+        self.conv1 = value
+
+    @check_model_inputs
+    def forward(
+        self,
+        input_features,
+        attention_mask=None,
+        **kwargs: Unpack[TransformersKwargs],
+    ):
+        r"""
+        Args:
+            input_features (`torch.FloatTensor` of shape `(batch_size, feature_size, sequence_length)`):
+                Float values of mel features extracted from the raw speech waveform. Raw speech waveform can be
+                obtained by loading a `.flac` or `.wav` audio file into an array of type `list[float]` or a
+                `numpy.ndarray`, *e.g.* via the soundfile library (`pip install soundfile`). To prepare the array into
+                `input_features`, the [`AutoFeatureExtractor`] should be used for extracting the mel features, padding
+                and conversion into a tensor of type `torch.FloatTensor`. See [`~WhisperFeatureExtractor.__call__`]
+            attention_mask (`torch.Tensor`, *optional*):
+                Voxtral does not support masking of the `input_features`; this argument is preserved for compatibility
+                but is not used. By default, silence in the input log-mel spectrogram is ignored.
+        """
+        expected_seq_length = self.config.max_source_positions * self.conv1.stride[0] * self.conv2.stride[0]
+        if input_features.shape[-1] != expected_seq_length:
+            raise ValueError(
+                f"Voxtral expects the mel input features to be of length {expected_seq_length}, but found {input_features.shape[-1]}. Make sure to pad the input mel features to {expected_seq_length}."
+            )
+
+        input_features = input_features.to(dtype=self.conv1.weight.dtype, device=self.conv1.weight.device)
+        inputs_embeds = nn.functional.gelu(self.conv1(input_features))
+        inputs_embeds = nn.functional.gelu(self.conv2(inputs_embeds))
+        inputs_embeds = inputs_embeds.permute(0, 2, 1)
+
+        embed_pos = self.embed_positions.weight
+        hidden_states = (inputs_embeds + embed_pos).to(inputs_embeds.dtype)
+        hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)
+
+        for idx, encoder_layer in enumerate(self.layers):
+            layer_outputs = encoder_layer(
+                hidden_states,
+                attention_mask=attention_mask,
+                layer_head_mask=None,
+            )
+            hidden_states = layer_outputs[0]
+
+        hidden_states = self.layer_norm(hidden_states)
+
+        return BaseModelOutput(
+            last_hidden_state=hidden_states,
+        )
+
+    # Ignore copy
+    def _get_feat_extract_output_lengths(self, input_lengths: torch.LongTensor):
+        """
+        Computes the output length of the convolutional layers and the output length of the audio encoder
+        """
+        input_lengths = (input_lengths - 1) // 2 + 1
+        output_lengths = (input_lengths - 2) // 2 + 1
+        return input_lengths, output_lengths
+
+
+class VoxtralMultiModalProjector(nn.Module):
+    def __init__(self, config: VoxtralConfig):
+        super().__init__()
+        self.linear_1 = nn.Linear(config.audio_config.intermediate_size, config.text_config.hidden_size, bias=False)
+        self.act = ACT2FN[config.projector_hidden_act]
+        self.linear_2 = nn.Linear(config.text_config.hidden_size, config.text_config.hidden_size, bias=False)
+
+    def forward(self, audio_features):
+        hidden_states = self.linear_1(audio_features)
+        hidden_states = self.act(hidden_states)
+        hidden_states = self.linear_2(hidden_states)
+        return hidden_states
+
+
+@auto_docstring(
+    custom_intro="""
+    The Voxtral model, which consists of a Whisper encoder, a multi-modal projector and a Llama language model.
+    """
+)
+class VoxtralForConditionalGeneration(VoxtralPreTrainedModel, GenerationMixin):
+    _tied_weights_keys = ["lm_head.weight"]
+    _tp_plan = {"lm_head": "colwise_rep"}
+    _pp_plan = {"lm_head": (["hidden_states"], ["logits"])}
+    _keep_in_fp32_modules_strict = ["embed_positions"]
+
+    def __init__(self, config):
+        super().__init__(config)
+        self.vocab_size = config.text_config.vocab_size
+        self.audio_tower = AutoModel.from_config(config.audio_config)
+        self.language_model = AutoModelForCausalLM.from_config(config.text_config)
+        self.multi_modal_projector = VoxtralMultiModalProjector(config)
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def get_input_embeddings(self):
+        return self.language_model.get_input_embeddings()
+
+    def set_input_embeddings(self, value):
+        self.language_model.set_input_embeddings(value)
+
+    def get_output_embeddings(self):
+        return self.language_model.get_output_embeddings()
+
+    def set_output_embeddings(self, new_embeddings):
+        self.language_model.set_output_embeddings(new_embeddings)
+
+    def set_decoder(self, decoder):
+        self.language_model.set_decoder(decoder)
+
+    def get_decoder(self):
+        return self.language_model.get_decoder()
+
+    def get_audio_embeds(self, input_features: torch.FloatTensor):
+        """
+        Computes the audio embeddings from input features (a log-mel spectrogram) by running the audio encoder followed by the multi-modal projector.
+        Args:
+            input_features (`torch.FloatTensor`):
+                Float values of mel features extracted from the raw speech waveform. Raw speech waveform can be
+                obtained by loading a `.flac` or `.wav` audio file into an array of type `list[float]` or a
+                `numpy.ndarray`, *e.g.* via the soundfile library (`pip install soundfile`). To prepare the array into
+                `input_features`, the [`AutoFeatureExtractor`] should be used for extracting the mel features, padding
+                and conversion into a tensor of type `torch.FloatTensor`. See [`~WhisperFeatureExtractor.__call__`]
+
+        Returns:
+            `torch.FloatTensor`:
+                The audio embeddings.
+        """
+        audio_outputs = self.audio_tower(input_features)
+        audio_hidden_states = audio_outputs.last_hidden_state
+        audio_hidden_states = audio_hidden_states.reshape(-1, self.config.audio_config.intermediate_size)
+        audio_embeds = self.multi_modal_projector(audio_hidden_states)
+        return audio_embeds
+
+    @can_return_tuple
+    @auto_docstring
+    def forward(
+        self,
+        input_ids: Optional[torch.LongTensor] = None,
+        input_features: Optional[torch.FloatTensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[Cache] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        labels: Optional[torch.LongTensor] = None,
+        use_cache: Optional[bool] = None,
+        cache_position: Optional[torch.LongTensor] = None,
+        logits_to_keep: Union[int, torch.Tensor] = 0,
+        **kwargs: Unpack[TransformersKwargs],
+    ) -> CausalLMOutputWithPast:
+        r"""
+        Example:
+
+        ```python
+        >>> from transformers import VoxtralForConditionalGeneration, AutoProcessor
+        >>> import torch
+
+        >>> device = "cuda" if torch.cuda.is_available() else "cpu"
+        >>> repo_id = "mistralai/Voxtral-Mini-3B-2507"
+
+        >>> processor = AutoProcessor.from_pretrained(repo_id)
+        >>> model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)
+
+        >>> conversation = [
+        ...     {
+        ...         "role": "user",
+        ...         "content": [
+        ...             {
+        ...                 "type": "audio",
+        ...                 "url": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/dude_where_is_my_car.wav",
+        ...             },
+        ...             {"type": "text", "text": "What can you tell me about this audio?"},
+        ...         ],
+        ...     }
+        ... ]
+
+        >>> inputs = processor.apply_chat_template(conversation)
+        >>> inputs = inputs.to(device, dtype=torch.bfloat16)
+
+        >>> outputs = model.generate(**inputs, max_new_tokens=30)
+        >>> processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
+        ["This audio is a humorous conversation between two friends, likely in English, where one of them is trying to figure out what the other's tattoo says."]
+        ```"""
+        if inputs_embeds is None:
+            inputs_embeds = self.get_input_embeddings()(input_ids)
+
+        if input_features is not None:
+            audio_embeds = self.get_audio_embeds(input_features)
+
+            # replace text-audio token placeholders with audio embeddings
+            audio_token_mask = input_ids == self.config.audio_token_id
+            inputs_embeds[audio_token_mask] = audio_embeds
+
+        outputs: BaseModelOutputWithPast = self.language_model(
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_values=past_key_values,
+            inputs_embeds=inputs_embeds,
+            labels=labels,
+            use_cache=use_cache,
+            cache_position=cache_position,
+            logits_to_keep=logits_to_keep,
+            **kwargs,
+        )
+        return outputs
+
+    def prepare_inputs_for_generation(self, *args, **kwargs):
+        # Overwritten -- we should not pass input_features when we are in cached decoding stage
+
+        input_features = kwargs.pop("input_features", None)
+        cache_position = kwargs.get("cache_position")
+
+        model_inputs = super().prepare_inputs_for_generation(*args, **kwargs)
+
+        if cache_position is not None and cache_position[0] == 0:
+            # input_features should only be passed when we are not in cached decoding stage
+            model_inputs["input_features"] = input_features
+
+        return model_inputs
+
+
+__all__ = ["VoxtralPreTrainedModel", "VoxtralEncoder", "VoxtralForConditionalGeneration"]
--- a/src/transformers/models/voxtral/modular_voxtral.py
+++ b/src/transformers/models/voxtral/modular_voxtral.py
@@ -0,0 +1,276 @@
+# coding=utf-8
+# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from typing import Optional, Union
+
+import torch
+from torch import nn
+
+from ...activations import ACT2FN
+from ...cache_utils import Cache
+from ...generation import GenerationMixin
+from ...modeling_outputs import BaseModelOutput, BaseModelOutputWithPast, CausalLMOutputWithPast
+from ...processing_utils import Unpack
+from ...utils import TransformersKwargs, auto_docstring, can_return_tuple
+from ...utils.generic import check_model_inputs
+from ..auto import AutoModel, AutoModelForCausalLM
+from ..qwen2_audio.modeling_qwen2_audio import (
+    Qwen2AudioAttention,
+    Qwen2AudioEncoder,
+    Qwen2AudioEncoderLayer,
+    Qwen2AudioPreTrainedModel,
+)
+from .configuration_voxtral import VoxtralConfig
+
+
+class VoxtralAttention(Qwen2AudioAttention):
+    pass
+
+
+class VoxtralEncoderLayer(Qwen2AudioEncoderLayer):
+    pass
+
+
+class VoxtralPreTrainedModel(Qwen2AudioPreTrainedModel):
+    _supports_flex_attn = True
+    _supports_cache_class = True
+    _supports_attention_backend = True
+    _supports_static_cache = True
+
+
+# TODO: @eustlb, I would really prefer to use WhisperEncoder but it's messing with modular
+@auto_docstring(
+    custom_intro="""
+    The Voxtral encoder, which is a Whisper encoder.
+    """
+)
+class VoxtralEncoder(Qwen2AudioEncoder):
+    _can_record_outputs = {
+        "attentions": VoxtralAttention,
+        "hidden_states": VoxtralEncoderLayer,
+    }
+
+    @check_model_inputs
+    def forward(
+        self,
+        input_features,
+        attention_mask=None,
+        **kwargs: Unpack[TransformersKwargs],
+    ):
+        r"""
+        Args:
+            input_features (`torch.FloatTensor` of shape `(batch_size, feature_size, sequence_length)`):
+                Float values of mel features extracted from the raw speech waveform. Raw speech waveform can be
+                obtained by loading a `.flac` or `.wav` audio file into an array of type `list[float]` or a
+                `numpy.ndarray`, *e.g.* via the soundfile library (`pip install soundfile`). To prepare the array into
+                `input_features`, the [`AutoFeatureExtractor`] should be used for extracting the mel features, padding
+                and conversion into a tensor of type `torch.FloatTensor`. See [`~WhisperFeatureExtractor.__call__`].
+            attention_mask (`torch.Tensor`, *optional*):
+                Voxtral does not support masking of the `input_features`. This argument is preserved for compatibility
+                but is not used. By default, the silence in the input log-mel spectrogram is ignored.
+        """
+        expected_seq_length = self.config.max_source_positions * self.conv1.stride[0] * self.conv2.stride[0]
+        if input_features.shape[-1] != expected_seq_length:
+            raise ValueError(
+                f"Voxtral expects the mel input features to be of length {expected_seq_length}, but found {input_features.shape[-1]}. Make sure to pad the input mel features to {expected_seq_length}."
+            )
+
+        input_features = input_features.to(dtype=self.conv1.weight.dtype, device=self.conv1.weight.device)
+        inputs_embeds = nn.functional.gelu(self.conv1(input_features))
+        inputs_embeds = nn.functional.gelu(self.conv2(inputs_embeds))
+        inputs_embeds = inputs_embeds.permute(0, 2, 1)
+
+        embed_pos = self.embed_positions.weight
+        hidden_states = (inputs_embeds + embed_pos).to(inputs_embeds.dtype)
+        hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)
+
+        for idx, encoder_layer in enumerate(self.layers):
+            layer_outputs = encoder_layer(
+                hidden_states,
+                attention_mask=attention_mask,
+                layer_head_mask=None,
+            )
+            hidden_states = layer_outputs[0]
+
+        hidden_states = self.layer_norm(hidden_states)
+
+        return BaseModelOutput(
+            last_hidden_state=hidden_states,
+        )
+
+
+class VoxtralMultiModalProjector(nn.Module):
+    def __init__(self, config: VoxtralConfig):
+        super().__init__()
+        self.linear_1 = nn.Linear(config.audio_config.intermediate_size, config.text_config.hidden_size, bias=False)
+        self.act = ACT2FN[config.projector_hidden_act]
+        self.linear_2 = nn.Linear(config.text_config.hidden_size, config.text_config.hidden_size, bias=False)
+
+    def forward(self, audio_features):
+        hidden_states = self.linear_1(audio_features)
+        hidden_states = self.act(hidden_states)
+        hidden_states = self.linear_2(hidden_states)
+        return hidden_states
+
+
+@auto_docstring(
+    custom_intro="""
+    The Voxtral model, which consists of a Whisper encoder, a multi-modal projector and a Llama language model.
+    """
+)
+class VoxtralForConditionalGeneration(VoxtralPreTrainedModel, GenerationMixin):
+    _tied_weights_keys = ["lm_head.weight"]
+    _tp_plan = {"lm_head": "colwise_rep"}
+    _pp_plan = {"lm_head": (["hidden_states"], ["logits"])}
+    _keep_in_fp32_modules_strict = ["embed_positions"]
+
+    def __init__(self, config):
+        super().__init__(config)
+        self.vocab_size = config.text_config.vocab_size
+        self.audio_tower = AutoModel.from_config(config.audio_config)
+        self.language_model = AutoModelForCausalLM.from_config(config.text_config)
+        self.multi_modal_projector = VoxtralMultiModalProjector(config)
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def get_input_embeddings(self):
+        return self.language_model.get_input_embeddings()
+
+    def set_input_embeddings(self, value):
+        self.language_model.set_input_embeddings(value)
+
+    def get_output_embeddings(self):
+        return self.language_model.get_output_embeddings()
+
+    def set_output_embeddings(self, new_embeddings):
+        self.language_model.set_output_embeddings(new_embeddings)
+
+    def set_decoder(self, decoder):
+        self.language_model.set_decoder(decoder)
+
+    def get_decoder(self):
+        return self.language_model.get_decoder()
+
+    def get_audio_embeds(self, input_features: torch.FloatTensor):
+        """
+        Computes the audio embeddings from input features (a log-mel spectrogram) by running the audio encoder followed by the multi-modal projector.
+        Args:
+            input_features (`torch.FloatTensor`):
+                Float values of mel features extracted from the raw speech waveform. Raw speech waveform can be
+                obtained by loading a `.flac` or `.wav` audio file into an array of type `list[float]` or a
+                `numpy.ndarray`, *e.g.* via the soundfile library (`pip install soundfile`). To prepare the array into
+                `input_features`, the [`AutoFeatureExtractor`] should be used for extracting the mel features, padding
+                and conversion into a tensor of type `torch.FloatTensor`. See [`~WhisperFeatureExtractor.__call__`].
+
+        Returns:
+            `torch.FloatTensor`:
+                The audio embeddings.
+        """
+        audio_outputs = self.audio_tower(input_features)
+        audio_hidden_states = audio_outputs.last_hidden_state
+        audio_hidden_states = audio_hidden_states.reshape(-1, self.config.audio_config.intermediate_size)
+        audio_embeds = self.multi_modal_projector(audio_hidden_states)
+        return audio_embeds
+
+    @can_return_tuple
+    @auto_docstring
+    def forward(
+        self,
+        input_ids: Optional[torch.LongTensor] = None,
+        input_features: Optional[torch.FloatTensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[Cache] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        labels: Optional[torch.LongTensor] = None,
+        use_cache: Optional[bool] = None,
+        cache_position: Optional[torch.LongTensor] = None,
+        logits_to_keep: Union[int, torch.Tensor] = 0,
+        **kwargs: Unpack[TransformersKwargs],
+    ) -> CausalLMOutputWithPast:
+        r"""
+        Example:
+
+        ```python
+        >>> from transformers import VoxtralForConditionalGeneration, AutoProcessor
+        >>> import torch
+
+        >>> device = "cuda" if torch.cuda.is_available() else "cpu"
+        >>> repo_id = "mistralai/Voxtral-Mini-3B-2507"
+
+        >>> processor = AutoProcessor.from_pretrained(repo_id)
+        >>> model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)
+
+        >>> conversation = [
+        ...     {
+        ...         "role": "user",
+        ...         "content": [
+        ...             {
+        ...                 "type": "audio",
+        ...                 "url": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/dude_where_is_my_car.wav",
+        ...             },
+        ...             {"type": "text", "text": "What can you tell me about this audio?"},
+        ...         ],
+        ...     }
+        ... ]
+
+        >>> inputs = processor.apply_chat_template(conversation)
+        >>> inputs = inputs.to(device, dtype=torch.bfloat16)
+
+        >>> outputs = model.generate(**inputs, max_new_tokens=30)
+        >>> processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
+        ["This audio is a humorous conversation between two friends, likely in English, where one of them is trying to figure out what the other's tattoo says."]
+        ```"""
+        if inputs_embeds is None:
+            inputs_embeds = self.get_input_embeddings()(input_ids)
+
+        if input_features is not None:
+            audio_embeds = self.get_audio_embeds(input_features)
+
+            # replace text-audio token placeholders with audio embeddings
+            audio_token_mask = input_ids == self.config.audio_token_id
+            inputs_embeds[audio_token_mask] = audio_embeds
+
+        outputs: BaseModelOutputWithPast = self.language_model(
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_values=past_key_values,
+            inputs_embeds=inputs_embeds,
+            labels=labels,
+            use_cache=use_cache,
+            cache_position=cache_position,
+            logits_to_keep=logits_to_keep,
+            **kwargs,
+        )
+        return outputs
+
+    def prepare_inputs_for_generation(self, *args, **kwargs):
+        # Overwritten -- we should not pass input_features when we are in cached decoding stage
+
+        input_features = kwargs.pop("input_features", None)
+        cache_position = kwargs.get("cache_position")
+
+        model_inputs = super().prepare_inputs_for_generation(*args, **kwargs)
+
+        if cache_position is not None and cache_position[0] == 0:
+            # input_features should only be passed when we are not in cached decoding stage
+            model_inputs["input_features"] = input_features
+
+        return model_inputs
+
+
+__all__ = ["VoxtralPreTrainedModel", "VoxtralEncoder", "VoxtralForConditionalGeneration"]
--- a/src/transformers/models/voxtral/processing_voxtral.py
+++ b/src/transformers/models/voxtral/processing_voxtral.py
@@ -0,0 +1,449 @@
+# coding=utf-8
+# Copyright 2025 Sesame and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import io
+from typing import Optional, Union
+
+from ...utils import is_mistral_common_available, is_soundfile_available, is_torch_available, logging
+
+
+if is_torch_available():
+    import torch
+
+if is_soundfile_available():
+    import soundfile as sf
+
+if is_mistral_common_available():
+    from mistral_common.protocol.transcription.request import TranscriptionRequest
+
+from ...audio_utils import AudioInput, load_audio_as, make_list_of_audio
+from ...feature_extraction_utils import BatchFeature
+from ...processing_utils import AllKwargsForChatTemplate, AudioKwargs, ProcessingKwargs, ProcessorMixin, Unpack
+from ...tokenization_utils_base import PreTokenizedInput, TextInput
+
+
+logger = logging.get_logger(__name__)
+
+
+class VoxtralAudioKwargs(AudioKwargs, total=False):
+    max_source_positions: Optional[int]
+
+
+class VoxtralProcessorKwargs(ProcessingKwargs, total=False):
+    _defaults = {
+        "text_kwargs": {
+            "padding": True,
+        },
+        "audio_kwargs": {
+            "sampling_rate": 16000,
+            "padding": True,
+            "truncation": False,
+            "pad_to_multiple_of": 480000,
+            "max_source_positions": 3000,
+        },
+        "common_kwargs": {
+            "return_tensors": "pt",
+            "return_dict": True,
+            "tokenize": True,
+        },
+    }
+
+
+class VoxtralProcessor(ProcessorMixin):
+    r"""
+    Constructs a Voxtral processor which wraps [`WhisperFeatureExtractor`] and
+    [`MistralCommonTokenizer`] into a single processor that inherits both the audio feature extraction and
+    tokenizer functionalities.
+
+    Args:
+        feature_extractor ([`WhisperFeatureExtractor`]):
+            The feature extractor is a required input.
+        tokenizer ([`MistralCommonTokenizer`]):
+            The tokenizer is a required input.
+    """
+
+    attributes = ["feature_extractor", "tokenizer"]
+    feature_extractor_class = "WhisperFeatureExtractor"
+    tokenizer_class = "MistralCommonTokenizer"
+
+    def __init__(
+        self,
+        feature_extractor,
+        tokenizer,
+    ):
+        self.audio_token_id = 24
+        self.audio_token = tokenizer.convert_ids_to_tokens(self.audio_token_id)
+
+        super().__init__(feature_extractor, tokenizer)
+
+    def _retreive_input_features(self, audio, max_source_positions, **kwargs):
+        """
+        Handles the Voxtral-specific preparation of input features: audio arrays should be padded to the next multiple of 480000 samples (a multiple of 30 s at 16 kHz), see the default `audio_kwargs` in `VoxtralProcessorKwargs`.
+        The mel input features are then extracted, split into chunks of `max_source_positions` frames, and stacked along the batch dimension.
+        """
+        input_features_list = []
+        for audio_array in audio:
+            audio_inputs = self.feature_extractor(audio_array, **kwargs)
+
+            # let's split into chunks of max_source_positions, and then stack them along batch dimension
+            input_features = audio_inputs["input_features"].reshape(
+                self.feature_extractor.feature_size, -1, max_source_positions
+            )
+            input_features_list.append(input_features.transpose(0, 1))
+
+        return torch.cat(input_features_list)
+
+    def apply_chat_template(
+        self,
+        conversation: Union[list[dict[str, str]], list[list[dict[str, str]]]],
+        **kwargs: Unpack[AllKwargsForChatTemplate],
+    ) -> str:
+        """
+        This method applies the model's chat completion template given a conversation. It relies on MistralCommonTokenizer's
+        [`~MistralCommonTokenizer.apply_chat_template`] to prepare input ids to the model and on WhisperFeatureExtractor's
+        [`~WhisperFeatureExtractor.__call__`] to prepare input features to the model.
+
+        Note that audio is padded up to the next multiple of 30 seconds prior to mel feature extraction.
+
+        A `conversation` is a list of messages, where each message is a dictionary with a `role` and a `content` field.
+        For Voxtral, `role` can be `"user"` or `"assistant"`.
+        The `content` field can be a string or a list of dictionaries with a `type` field. See example below.
+
+        ```python
+        from huggingface_hub import hf_hub_download
+        from transformers.audio_utils import load_audio_as
+
+        audio_url = "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3"
+        audio_path = hf_hub_download(repo_id="hf-internal-testing/dummy-audio-samples", filename="bcn_weather.mp3", repo_type="dataset")
+        audio_base64 = load_audio_as(audio_path, return_format="base64", force_mono=True)
+
+        # audio + text
+        conversation = [
+            {
+                "role": "user",
+                "content": [
+                    {"type": "audio", "url": audio_url},
+                    {"type": "audio", "path": audio_path},
+                    {"type": "audio", "base64": audio_base64},
+                    {"type": "text", "text": "How many audio do you hear?"},
+                ],
+            },
+        ]
+
+        processor = VoxtralProcessor.from_pretrained("mistralai/Voxtral-Mini-3B-2507")
+        inputs = processor.apply_chat_template(conversation)
+        ```
+
+        Args:
+            conversation (`Union[list[dict[str, str]], list[list[dict[str, str]]]]`):
+                The conversation to format.
+        """
+        if kwargs.get("continue_final_message", False):
+            if kwargs.get("add_generation_prompt", False):
+                raise ValueError(
+                    "continue_final_message and add_generation_prompt are not compatible. Use continue_final_message when you want the model to continue the final message, and add_generation_prompt when you want to add a header that will prompt it to start a new assistant message instead."
+                )
+            if kwargs.get("return_assistant_tokens_mask", False):
+                raise ValueError("continue_final_message is not compatible with return_assistant_tokens_mask.")
+
+        # Fill sets of kwargs that should be used by different parts of template
+        processed_kwargs = {
+            "mm_load_kwargs": {},
+            "template_kwargs": {},
+        }
+
+        for kwarg_type in processed_kwargs:
+            for key in AllKwargsForChatTemplate.__annotations__[kwarg_type].__annotations__.keys():
+                kwarg_type_defaults = AllKwargsForChatTemplate.__annotations__[kwarg_type]
+                default_value = getattr(kwarg_type_defaults, key, None)
+                value = kwargs.pop(key, default_value)
+                if value is not None and not isinstance(value, dict):
+                    processed_kwargs[kwarg_type][key] = value
+
+        # Pass unprocessed custom kwargs
+        processed_kwargs["template_kwargs"].update(kwargs)
+
+        if isinstance(conversation, (list, tuple)) and (
+            isinstance(conversation[0], (list, tuple)) or hasattr(conversation[0], "content")
+        ):
+            is_batched = True
+            conversations = conversation
+        else:
+            is_batched = False
+            conversations = [conversation]
+
+        # Check for any overlapping keys between mm_load_kwargs and kwargs
+        mm_load_kwargs = processed_kwargs["mm_load_kwargs"]
+        if any(key in kwargs for key in mm_load_kwargs):
+            overlapping_keys = [key for key in mm_load_kwargs if key in kwargs]
+            logger.warning(
+                f"{overlapping_keys[0] if len(overlapping_keys) == 1 else ', '.join(overlapping_keys)} load multimodal data kwarg{'s' if len(overlapping_keys) > 1 else ''} {'have' if len(overlapping_keys) > 1 else 'has'} been passed to the processor, but {'they are' if len(overlapping_keys) > 1 else 'it is'} not supported for VoxtralProcessor since it relies on mistral_common directly. {'They' if len(overlapping_keys) > 1 else 'It'} will be ignored."
+            )
+
+        output_kwargs = self._merge_kwargs(
+            VoxtralProcessorKwargs,
+            **kwargs,
+        )
+        text_kwargs = output_kwargs["text_kwargs"]
+        audio_kwargs = output_kwargs["audio_kwargs"]
+        common_kwargs = output_kwargs["common_kwargs"]
+
+        return_tensors = common_kwargs.pop("return_tensors", None)
+        if return_tensors != "pt":
+            raise ValueError(f"{self.__class__.__name__} only supports `return_tensors='pt'`.")
+
+        tokenizer_kwargs = {**processed_kwargs["template_kwargs"], **text_kwargs}
+        tokenizer_kwargs["return_tensors"] = None  # let's not return tensors here
+        tokenize = tokenizer_kwargs.pop("tokenize", False)
+        return_dict = tokenizer_kwargs.pop("return_dict", False)
+
+        encoded_instruct_inputs = self.tokenizer.apply_chat_template(
+            conversations,
+            tokenize=tokenize,
+            return_dict=return_dict,
+            **tokenizer_kwargs,
+        )
+
+        if tokenize:
+            if return_dict:
+                audio = encoded_instruct_inputs.pop("audio", None)
+                data = dict(encoded_instruct_inputs)
+                if audio is not None:
+                    max_source_positions = audio_kwargs.pop("max_source_positions")
+                    data["input_features"] = self._retreive_input_features(audio, max_source_positions, **audio_kwargs)
+
+                return BatchFeature(data=data, tensor_type=return_tensors)
+
+        if not is_batched:
+            return encoded_instruct_inputs[0]
+
+        return encoded_instruct_inputs
+
+    def __call__(
+        self,
+        text: Optional[Union[TextInput, PreTokenizedInput, list[TextInput], list[PreTokenizedInput]]],
+        **kwargs: Unpack[VoxtralProcessorKwargs],
+    ):
+        r"""
+        Method to prepare text to be fed as input to the model. This method forwards the `text`
+        argument to MistralCommonTokenizer's [`~MistralCommonTokenizer.__call__`] to encode
+        the text. Please refer to the docstring of that method for more information.
+        This method does not support audio. To prepare the audio, please use:
+        1. The [`~VoxtralProcessor.apply_chat_template`] method.
+        2. The [`~VoxtralProcessor.apply_transcrition_request`] method.
+
+        Args:
+            text (`str`, `list[str]`, `list[list[str]]`):
+                The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
+                (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
+                `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
+            return_tensors (`str` or [`~utils.TensorType`], *optional*):
+                If set, will return tensors of a particular framework. Acceptable values are:
+                    - `'tf'`: Return TensorFlow `tf.constant` objects.
+                    - `'pt'`: Return PyTorch `torch.Tensor` objects.
+                    - `'np'`: Return NumPy `np.ndarray` objects.
+                    - `'jax'`: Return JAX `jnp.ndarray` objects.
+        Returns:
+            [`BatchFeature`]: A [`BatchFeature`] with the following fields:
+
+            - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
+            - **input_features** -- List of audio values to be fed to a model. Returned when `audio` is not `None`.
+            - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
+              `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
+              `None`).
+        """
+        if isinstance(text, str):
+            text = [text]
+
+        if any(self.audio_token in t for t in text):
+            raise ValueError(
+                f"{self.audio_token} is present in the provided text which is not supported by VoxtralProcessor. Please use the `apply_chat_template` method instead."
+            )
+
+        output_kwargs = self._merge_kwargs(
+            VoxtralProcessorKwargs,
+            **kwargs,
+        )
+        text_kwargs = output_kwargs["text_kwargs"]
+        common_kwargs = output_kwargs["common_kwargs"]
+
+        out = self.tokenizer(text, **text_kwargs)
+
+        return BatchFeature(data=out, tensor_type=common_kwargs.pop("return_tensors", None))
+
+    # TODO: @eustlb, this should be moved to mistral_common + testing
+    def apply_transcrition_request(
+        self,
+        language: Union[str, list[str]],
+        audio: Union[str, list[str], AudioInput],
+        model_id: str,
+        sampling_rate: Optional[int] = None,
+        format: Optional[Union[str, list[str]]] = None,
+        **kwargs: Unpack[VoxtralProcessorKwargs],
+    ):
+        """
+        This method applies the model's transcription request template given a language and audio.
+        It relies on MistralCommonTokenizer and WhisperFeatureExtractor to prepare input ids and input features to the model.
+
+        ```python
+        from transformers import VoxtralProcessor
+
+        model_id = "mistralai/Voxtral-Mini-3B-2507"
+        processor = VoxtralProcessor.from_pretrained(model_id)
+
+        language = "en"
+        audio = "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3"
+
+        inputs = processor.apply_transcrition_request(language=language, audio=audio, model_id=model_id)
+        ```
+
+        Args:
+            language (`str`, `list[str]`):
+                The language or languages of the audio. If provided as a string, will be applied uniformly to all audio.
+                If provided as a list, will be applied to each audio individually with a one-to-one mapping.
+            audio (`str`, `list[str]`, `np.ndarray`, `torch.Tensor`, `list[np.ndarray]`, `list[torch.Tensor]`):
+                The audio or batch of audio to be prepared. If provided as a string, it should correspond to the path or URL of the audio file.
+            model_id (`str`):
+                The hub model id of the model to use for transcription.
+            sampling_rate (`int`, *optional*):
+                The sampling rate of the audio. Necessary if the audio is provided as `np.ndarray`, `torch.Tensor`, `list[np.ndarray]`, `list[torch.Tensor]`.
+                Used to avoid silent errors when passing audio that is not in the expected sampling rate.
+            format (`str`, `list[str]`, *optional*):
+                The format of the audio. Necessary if the audio is provided as `np.ndarray`, `torch.Tensor`, `list[np.ndarray]`, `list[torch.Tensor]`.
+        """
+        output_kwargs = self._merge_kwargs(
+            VoxtralProcessorKwargs,
+            **kwargs,
+        )
+        text_kwargs = output_kwargs["text_kwargs"]
+        audio_kwargs = output_kwargs["audio_kwargs"]
+        common_kwargs = output_kwargs["common_kwargs"]
+
+        is_str = isinstance(audio, str)
+        is_list_of_str = all(isinstance(el, str) for el in audio)
+        is_list_of_audio = not (is_str or is_list_of_str)
+
+        if is_list_of_audio:
+            if sampling_rate is None:
+                logger.warning_once(
+                    f"You've provided audio without specifying the sampling rate. It will be assumed to be {audio_kwargs['sampling_rate']}, which can result in silent errors."
+                )
+            elif sampling_rate != audio_kwargs["sampling_rate"]:
+                raise ValueError(
+                    f"The sampling rate of the audio ({sampling_rate}) does not match the sampling rate of the processor ({audio_kwargs['sampling_rate']}). Please resample the audio to the expected sampling rate."
+                )
+
+        sampling_rate = audio_kwargs["sampling_rate"]
+        return_dict = common_kwargs.pop("return_dict", False)
+        tokenize = common_kwargs.pop("tokenize", False)
+
+        # make sure to remove from text_kwargs and audio_kwargs
+        for k in ("return_dict", "tokenize"):
+            text_kwargs.pop(k, None)
+            audio_kwargs.pop(k, None)
+
+        return_tensors = common_kwargs.pop("return_tensors", None)
+        if return_tensors != "pt":
+            raise ValueError(f"{self.__class__.__name__} only supports `return_tensors='pt'`.")
+
+        # validate audio input
+        if is_str:
+            audio = [load_audio_as(audio, return_format="buffer", force_mono=True, sampling_rate=sampling_rate)]
+        elif is_list_of_str:
+            audio = [
+                load_audio_as(el, return_format="buffer", force_mono=True, sampling_rate=sampling_rate) for el in audio
+            ]
+        else:
+            audio = make_list_of_audio(audio)
+            if len(audio) != len(format):
+                raise ValueError(
+                    f"When passed as a list of audio, the length ({len(audio)}) must match the number of formats ({len(format)})"
+                )
+            audio_buffers = []
+            for array, f in zip(audio, format):
+                # Create new BytesIO object and write audio data to it
+                buffer = io.BytesIO()
+                # Convert to mono if needed
+                if array.ndim == 2:
+                    array = array.mean(axis=1)
+                # Write to buffer with the requested format and the processor's sampling rate
+                sf.write(buffer, array, samplerate=audio_kwargs["sampling_rate"], format=f)
+                buffer.seek(0)
+                audio_buffers.append(buffer)
+            audio = audio_buffers
+
+        # validate language input
+        n_audio = len(audio)
+        if isinstance(language, str):
+            language = [language] * n_audio
+
+        if len(language) != n_audio:
+            raise ValueError(
+                f"When passed as a list of languages, the length ({len(language)}) must match the number of audio ({n_audio})"
+            )
+
+        input_ids = []
+        texts = []
+        audio_arrays = []
+        for audio_el, language_el in zip(audio, language):
+            openai_transcription_request = {
+                "model": model_id,
+                "file": audio_el,
+                "language": language_el,
+            }
+
+            transcription_request = TranscriptionRequest.from_openai(openai_transcription_request)
+            tokenized_transcription_request = self.tokenizer.tokenizer.encode_transcription(transcription_request)
+
+            input_ids.append(tokenized_transcription_request.tokens)
+            texts.append(tokenized_transcription_request.text)
+            audio_arrays.extend([el.audio_array for el in tokenized_transcription_request.audios])
+
+        if tokenize:
+            if return_dict:
+                # the text is already tokenized, but we still need to pad, truncate, etc.
+                encoding = self.tokenizer(
+                    input_ids,
+                    add_special_tokens=False,
+                    **text_kwargs,
+                )
+                data = dict(encoding)
+
+                # extract the input features
+                max_source_positions = audio_kwargs.pop("max_source_positions")
+                data["input_features"] = self._retreive_input_features(
+                    audio_arrays, max_source_positions, **audio_kwargs
+                )
+
+                return BatchFeature(data=data, tensor_type=return_tensors)
+
+        return texts
+
+    def batch_decode(self, *args, **kwargs):
+        """
+        This method forwards all its arguments to MistralCommonTokenizer's [`~MistralCommonTokenizer.batch_decode`]. Please
+        refer to the docstring of this method for more information.
+        """
+        return self.tokenizer.batch_decode(*args, **kwargs)
+
+    def decode(self, *args, **kwargs):
+        """
+        This method forwards all its arguments to MistralCommonTokenizer's [`~MistralCommonTokenizer.decode`]. Please refer to
+        the docstring of this method for more information.
+        """
+        return self.tokenizer.decode(*args, **kwargs)
+
+
+__all__ = ["VoxtralProcessor"]
--- a/src/transformers/tokenization_mistral_common.py
+++ b/src/transformers/tokenization_mistral_common.py
@@ -22,6 +22,7 @@ from typing import Any, Callable, Optional, Union, overload

 import numpy as np

+from transformers.audio_utils import load_audio_as
 from transformers.tokenization_utils_base import (
    LARGE_INTEGER,
    VERY_LARGE_INTEGER,
@@ -41,11 +42,13 @@ from transformers.utils.import_utils import is_mistral_common_available, is_torc
 if is_mistral_common_available():
    from mistral_common.protocol.instruct.request import ChatCompletionRequest
    from mistral_common.protocol.instruct.validator import ValidationMode
-    from mistral_common.tokens.tokenizers.base import SpecialTokenPolicy
+    from mistral_common.tokens.tokenizers.base import SpecialTokenPolicy, TokenizerVersion
+    from mistral_common.tokens.tokenizers.image import MultiModalVersion
    from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
    from mistral_common.tokens.tokenizers.tekken import Tekkenizer
    from mistral_common.tokens.tokenizers.utils import download_tokenizer_from_hf_hub

+
 if is_torch_available():
    import torch

@@ -1473,12 +1476,24 @@ class MistralCommonTokenizer(PushToHubMixin):
                    else:
                        raise ValueError("Image content must be specified.")
                    normalized_content.append({"type": "image_url", "image_url": {"url": image_content}})
+                elif content_type == "audio":
+                    maybe_url: Optional[str] = content.get("url")
+                    maybe_path: Optional[str] = content.get("path")
+                    maybe_base64: Optional[str] = content.get("base64")
+                    if maybe_url or maybe_path:
+                        audio_data = load_audio_as(maybe_url or maybe_path, return_format="dict", force_mono=True)
+                        normalized_content.append({"type": "input_audio", "input_audio": audio_data})
+                        continue
+                    if not maybe_base64:
+                        raise ValueError("Audio content must be specified.")
+                    normalized_content.append({"type": "audio_url", "audio_url": {"url": maybe_base64}})
                else:
                    normalized_content.append(content)
            message["content"] = normalized_content

        outputs = []
        images: list[np.ndarray] = []
+        audios: list[np.ndarray] = []

        for conversation in conversations:
            messages: list[dict[str, Union[str, list[dict[str, Union[str, dict[str, Any]]]]]]] = []
@@ -1498,6 +1513,7 @@ class MistralCommonTokenizer(PushToHubMixin):
            else:
                outputs.append(tokenized_request.text)
            images.extend(tokenized_request.images)
+            audios.extend([el.audio_array for el in tokenized_request.audios])

        if not is_batched:
            outputs = outputs[0]
@@ -1528,6 +1544,13 @@ class MistralCommonTokenizer(PushToHubMixin):
                    else:
                        raise ValueError(f"Unsupported return_tensors type: {return_tensors}")
                    out.data["pixel_values"] = pixel_values
+                if audios:
+                    if return_tensors is not None:
+                        raise NotImplementedError(
+                            "When passing audio content in apply_chat_template, `return_tensors` must be None since we cannot batch the audio inputs. The returned audio will be a list of numpy arrays."
+                        )
+                    # Transformers convention: the key is "audio" even for multiple inputs (no trailing "s")
+                    out.data["audio"] = audios
                return out
            else:
                return out["input_ids"]
@@ -1735,12 +1758,12 @@ class MistralCommonTokenizer(PushToHubMixin):
            raise ValueError("`init_inputs` are not supported by `MistralCommonTokenizer.from_pretrained`.")

        # Handle kwargs and AutoTokenizer case
-        if kwargs and not kwargs.keys() == {"_from_auto"}:
+        if kwargs and not set(kwargs.keys()).issubset({"_from_auto", "trust_remote_code"}):
            raise ValueError(
                f"Kwargs {list(kwargs.keys())} are not supported by `MistralCommonTokenizer.from_pretrained`."
            )

-        if not os.path.isfile(pretrained_model_name_or_path):
+        if not os.path.isdir(pretrained_model_name_or_path):
            tokenizer_path = download_tokenizer_from_hf_hub(
                repo_id=pretrained_model_name_or_path,
                cache_dir=cache_dir,
@@ -1750,7 +1773,37 @@ class MistralCommonTokenizer(PushToHubMixin):
                local_files_only=local_files_only,
            )
        else:
-            tokenizer_path = pretrained_model_name_or_path
+            valid_tokenizer_files = []
+            tokenizer_file: str
+
+            instruct_versions = list(TokenizerVersion.__members__)
+            mm_versions = list(MultiModalVersion.__members__) + [""]  # allow no mm version
+            sentencepiece_suffixes = [f".model.{v}{m}" for v in instruct_versions for m in mm_versions] + [".model"]
+
+            for path in os.listdir(pretrained_model_name_or_path):
+                pathlib_repo_file = Path(path)
+                file_name = pathlib_repo_file.name
+                suffix = "".join(pathlib_repo_file.suffixes)
+                if file_name == "tekken.json":
+                    valid_tokenizer_files.append(file_name)
+                elif suffix in sentencepiece_suffixes:
+                    valid_tokenizer_files.append(file_name)
+
+            if len(valid_tokenizer_files) == 0:
+                raise ValueError(f"No tokenizer file found in directory: {pretrained_model_name_or_path}")
+            # If there are multiple tokenizer files, we use tekken.json if it exists, otherwise the versioned one.
+            if len(valid_tokenizer_files) > 1:
+                if "tekken.json" in valid_tokenizer_files:
+                    tokenizer_file = "tekken.json"
+                else:
+                    tokenizer_file = sorted(valid_tokenizer_files)[-1]
+                logger.warning(
+                    f"Multiple tokenizer files found in directory: {pretrained_model_name_or_path}. Using {tokenizer_file}."
+                )
+            else:
+                tokenizer_file = valid_tokenizer_files[0]
+
+            tokenizer_path = os.path.join(pretrained_model_name_or_path, tokenizer_file)

        return cls(
            tokenizer_path=tokenizer_path,
@@ -1802,6 +1855,8 @@ class MistralCommonTokenizer(PushToHubMixin):
        Returns:
            A tuple of `str`: The files saved.
        """
+        # `save_jinja_files` must be skipped to be able to save from a processor
+        kwargs.pop("save_jinja_files", None)
        if kwargs:
            raise ValueError(
                f"Kwargs {list(kwargs.keys())} are not supported by `MistralCommonTokenizer.save_pretrained`."
--- a/tests/models/voxtral/__init__.py
+++ b/tests/models/voxtral/__init__.py
--- a/tests/models/voxtral/test_modeling_voxtral.py
+++ b/tests/models/voxtral/test_modeling_voxtral.py
@@ -0,0 +1,472 @@
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Testing suite for the PyTorch Voxtral model."""
+
+import tempfile
+import unittest
+
+from transformers import (
+    AutoProcessor,
+    VoxtralConfig,
+    VoxtralForConditionalGeneration,
+    is_torch_available,
+)
+from transformers.testing_utils import (
+    cleanup,
+    require_torch,
+    require_torch_sdpa,
+    slow,
+    torch_device,
+)
+
+from ...generation.test_utils import GenerationTesterMixin
+from ...test_configuration_common import ConfigTester
+from ...test_modeling_common import ModelTesterMixin, floats_tensor, ids_tensor
+
+
+if is_torch_available():
+    import torch
+
+
+class VoxtralModelTester:
+    def __init__(
+        self,
+        parent,
+        ignore_index=-100,
+        audio_token_id=0,
+        seq_length=35,
+        feat_seq_length=60,
+        text_config={
+            "model_type": "llama",
+            "intermediate_size": 36,
+            "initializer_range": 0.02,
+            "hidden_size": 32,
+            "max_position_embeddings": 52,
+            "num_hidden_layers": 2,
+            "num_attention_heads": 4,
+            "num_key_value_heads": 2,
+            "use_labels": True,
+            "use_mrope": False,
+            "vocab_size": 99,
+            "head_dim": 8,
+        },
+        is_training=True,
+        audio_config={
+            "model_type": "voxtral_encoder",
+            "hidden_size": 16,
+            "num_attention_heads": 4,
+            "intermediate_size": 16,
+            "num_hidden_layers": 2,
+            "num_mel_bins": 80,
+            "max_source_positions": 30,
+            "initializer_range": 0.02,
+        },
+    ):
+        self.parent = parent
+        self.ignore_index = ignore_index
+        self.audio_token_id = audio_token_id
+        self.text_config = text_config
+        self.audio_config = audio_config
+        self.seq_length = seq_length
+        self.feat_seq_length = feat_seq_length
+
+        self.num_hidden_layers = text_config["num_hidden_layers"]
+        self.vocab_size = text_config["vocab_size"]
+        self.hidden_size = text_config["hidden_size"]
+        self.num_attention_heads = text_config["num_attention_heads"]
+        self.is_training = is_training
+
+        self.batch_size = 3
+        self.encoder_seq_length = seq_length
+
+    def get_config(self):
+        return VoxtralConfig(
+            text_config=self.text_config,
+            audio_config=self.audio_config,
+            ignore_index=self.ignore_index,
+            audio_token_id=self.audio_token_id,
+        )
+
+    def prepare_config_and_inputs(self):
+        input_features_values = floats_tensor(
+            [
+                self.batch_size,
+                self.audio_config["num_mel_bins"],
+                self.feat_seq_length,
+            ]
+        )
+        config = self.get_config()
+        return config, input_features_values
+
+    def prepare_config_and_inputs_for_common(self):
+        config_and_inputs = self.prepare_config_and_inputs()
+        config, input_features_values = config_and_inputs
+        num_audio_tokens_per_batch_idx = 30
+
+        input_ids = ids_tensor([self.batch_size, self.seq_length], config.text_config.vocab_size - 1) + 1
+        attention_mask = torch.ones(input_ids.shape, dtype=torch.long).to(torch_device)
+        attention_mask[:, :1] = 0
+
+        input_ids[:, 1 : 1 + num_audio_tokens_per_batch_idx] = config.audio_token_id
+        inputs_dict = {
+            "input_ids": input_ids,
+            "attention_mask": attention_mask,
+            "input_features": input_features_values,
+        }
+        return config, inputs_dict
+
+
+@require_torch
+class VoxtralForConditionalGenerationModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase):
+    """
+    Model tester for `VoxtralForConditionalGeneration`.
+    """
+
+    all_model_classes = (VoxtralForConditionalGeneration,) if is_torch_available() else ()
+    pipeline_model_mapping = (
+        {"text-to-speech": VoxtralForConditionalGeneration, "audio-text-to-text": VoxtralForConditionalGeneration}
+        if is_torch_available()
+        else {}
+    )
+    test_pruning = False
+    test_head_masking = False
+    _is_composite = True
+
+    def setUp(self):
+        self.model_tester = VoxtralModelTester(self)
+        self.config_tester = ConfigTester(self, config_class=VoxtralConfig, has_text_modality=False)
+
+    @unittest.skip(
+        reason="This test does not apply to Voxtral since inputs_embeds corresponding to audio tokens are replaced when input features are provided."
+    )
+    def test_inputs_embeds_matches_input_ids(self):
+        pass
+
+    @require_torch_sdpa
+    def test_sdpa_can_dispatch_composite_models(self):
+        # overwritten because Voxtral is an audio+text model (not vision+text)
+        if not self.has_attentions:
+            self.skipTest(reason="Model architecture does not support attentions")
+
+        if not self._is_composite:
+            self.skipTest(f"{self.all_model_classes[0].__name__} does not support SDPA")
+
+        for model_class in self.all_model_classes:
+            config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
+            model = model_class(config)
+
+            with tempfile.TemporaryDirectory() as tmpdirname:
+                model.save_pretrained(tmpdirname)
+                model_sdpa = model_class.from_pretrained(tmpdirname)
+                model_sdpa = model_sdpa.eval().to(torch_device)
+
+                text_attn = "sdpa" if model.language_model._supports_sdpa else "eager"
+                audio_attn = "sdpa" if model.audio_tower._supports_sdpa else "eager"
+
+                # `None` as it is the requested one which will be assigned to each sub-config
+                # Sub-model will dispatch to SDPA if it can (checked below that `SDPA` layers are present)
+                self.assertTrue(model_sdpa.config._attn_implementation == "sdpa")
+                self.assertTrue(model.language_model.config._attn_implementation == text_attn)
+                self.assertTrue(model.audio_tower.config._attn_implementation == audio_attn)
+
+                model_eager = model_class.from_pretrained(tmpdirname, attn_implementation="eager")
+                model_eager = model_eager.eval().to(torch_device)
+                self.assertTrue(model_eager.config._attn_implementation == "eager")
+                self.assertTrue(model_eager.language_model.config._attn_implementation == "eager")
+                self.assertTrue(model_eager.audio_tower.config._attn_implementation == "eager")
+
+                for name, submodule in model_eager.named_modules():
+                    class_name = submodule.__class__.__name__
+                    if "SdpaAttention" in class_name or "SdpaSelfAttention" in class_name:
+                        raise ValueError("The eager model should not have SDPA attention layers")
+
+
+@require_torch
+class VoxtralForConditionalGenerationIntegrationTest(unittest.TestCase):
+    def setUp(self):
+        self.checkpoint_name = "mistralai/Voxtral-Mini-3B-2507"
+        self.dtype = torch.bfloat16
+        self.processor = AutoProcessor.from_pretrained(self.checkpoint_name)
+
+    def tearDown(self):
+        cleanup(torch_device, gc_collect=True)
+
+    @slow
+    def test_mini_single_turn_audio_only(self):
+        """
+        reproducer: https://gist.github.com/eustlb/c5e0e0a12e84e3d575151ba63d17e4cf
+        disclaimer: Perfect token matching cannot be achieved due to floating-point arithmetic differences between vLLM and Transformers implementations.
+        """
+        conversation = [
+            {
+                "role": "user",
+                "content": [
+                    {
+                        "type": "audio",
+                        "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/dude_where_is_my_car.wav",
+                    },
+                ],
+            }
+        ]
+
+        model = VoxtralForConditionalGeneration.from_pretrained(
+            self.checkpoint_name, torch_dtype=self.dtype, device_map=torch_device
+        )
+
+        inputs = self.processor.apply_chat_template(conversation)
+        inputs = inputs.to(torch_device, dtype=self.dtype)
+
+        outputs = model.generate(**inputs, do_sample=False, max_new_tokens=500)
+        decoded_outputs = self.processor.batch_decode(outputs, skip_special_tokens=True)
+        EXPECTED_OUTPUT = [
+            'The audio is a humorous exchange between two individuals, likely friends or acquaintances, about tattoos. Here\'s a breakdown:\n\n1. **Initial Reaction**: One person (let\'s call him A) is surprised to see the other person (let\'s call him B) has a tattoo. A asks if B has a tattoo, and B confirms.\n\n2. **Tattoo Interpretation**: A then asks what B\'s tattoo says, and B responds with "sweet." This exchange is repeated multiple times, with A asking what B\'s tattoo says, and B always answering "sweet."\n\n3. **Confusion**: A seems confused and asks what B\'s tattoo says multiple times, each time getting the same response. This leads to a humorous back-and-forth.\n\n4. **Clarification**: Eventually, B clarifies that A\'s tattoo says "dude" and A\'s says "sweet." This is the punchline of the joke, as A had been asking about B\'s tattoo but not his own.\n\n5. **Final Exchange**: B then asks what A\'s tattoo says, and A responds with "sweet," leading to a final round of confusion.\n\nThe humor comes from the repetition of the word "sweet" and the confusion that arises from A\'s lack of self-awareness about his own tattoo.'
+        ]
+        self.assertEqual(decoded_outputs, EXPECTED_OUTPUT)
+
+    @slow
+    def test_mini_single_turn_text_and_audio(self):
+        """
+        reproducer: https://gist.github.com/eustlb/c5e0e0a12e84e3d575151ba63d17e4cf
+        disclaimer: Perfect token matching cannot be achieved due to floating-point arithmetic differences between vLLM and Transformers implementations.
+        """
+        conversation = [
+            {
+                "role": "user",
+                "content": [
+                    {
+                        "type": "audio",
+                        "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
+                    },
+                    {"type": "text", "text": "What can you tell me about this audio?"},
+                ],
+            }
+        ]
+
+        model = VoxtralForConditionalGeneration.from_pretrained(
+            self.checkpoint_name, torch_dtype=self.dtype, device_map=torch_device
+        )
+
+        inputs = self.processor.apply_chat_template(conversation)
+        inputs = inputs.to(torch_device, dtype=self.dtype)
+
+        outputs = model.generate(**inputs, do_sample=False, max_new_tokens=500)
+        decoded_outputs = self.processor.batch_decode(outputs, skip_special_tokens=True)
+
+        EXPECTED_OUTPUT = [
+            "What can you tell me about this audio?This audio is a farewell address by President Barack Obama, delivered in Chicago. In the speech, he reflects on his eight years in office, highlighting the resilience, hope, and unity of the American people. He expresses gratitude for the conversations he had with the public, which kept him honest and inspired. The president also emphasizes the importance of self-government and civic engagement, encouraging Americans to participate in their democracy actively. He concludes by expressing optimism about the country's future and his commitment to serving as a citizen."
+        ]
+        self.assertEqual(decoded_outputs, EXPECTED_OUTPUT)
+
+    @slow
+    def test_mini_single_turn_text_and_multiple_audios(self):
+        """
+        reproducer: https://gist.github.com/eustlb/c5e0e0a12e84e3d575151ba63d17e4cf
+        disclaimer: Perfect token matching cannot be achieved due to floating-point arithmetic differences between vLLM and Transformers implementations.
+        """
+        conversation = [
+            {
+                "role": "user",
+                "content": [
+                    {
+                        "type": "audio",
+                        "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/mary_had_lamb.mp3",
+                    },
+                    {
+                        "type": "audio",
+                        "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
+                    },
+                    {"type": "text", "text": "What sport and what nursery rhyme are referenced?"},
+                ],
+            }
+        ]
+
+        model = VoxtralForConditionalGeneration.from_pretrained(
+            self.checkpoint_name, torch_dtype=self.dtype, device_map=torch_device
+        )
+
+        inputs = self.processor.apply_chat_template(conversation)
+        inputs = inputs.to(torch_device, dtype=self.dtype)
+
+        outputs = model.generate(**inputs, do_sample=False, max_new_tokens=500)
+        decoded_outputs = self.processor.batch_decode(outputs, skip_special_tokens=True)
+
+        EXPECTED_OUTPUT = [
+            'What sport and what nursery rhyme are referenced?The audio references both a nursery rhyme and a baseball game. The nursery rhyme is "Mary Had a Little Lamb," and the baseball game is a playoff game between the Baltimore Orioles and the Oakland Athletics.'
+        ]
+        self.assertEqual(decoded_outputs, EXPECTED_OUTPUT)
+
+    @slow
+    def test_mini_single_turn_text_only(self):
+        """
+        reproducer: https://gist.github.com/eustlb/c5e0e0a12e84e3d575151ba63d17e4cf
+        disclaimer: Perfect token matching cannot be achieved due to floating-point arithmetic differences between vLLM and Transformers implementations.
+        """
+        conversation = [
+            {
+                "role": "user",
+                "content": [
+                    {"type": "text", "text": "Hello, how are you doing today?"},
+                ],
+            }
+        ]
+
+        model = VoxtralForConditionalGeneration.from_pretrained(
+            self.checkpoint_name, torch_dtype=self.dtype, device_map=torch_device
+        )
+
+        inputs = self.processor.apply_chat_template(conversation)
+        inputs = inputs.to(torch_device, dtype=self.dtype)
+
+        outputs = model.generate(**inputs, do_sample=False, max_new_tokens=500)
+        decoded_outputs = self.processor.batch_decode(outputs, skip_special_tokens=True)
+
+        EXPECTED_OUTPUT = [
+            "Hello, how are you doing today?Hello! I'm functioning as intended, thank you. How about you? How's your day going?"
+        ]
+        self.assertEqual(decoded_outputs, EXPECTED_OUTPUT)
+
+    @slow
+    def test_mini_single_turn_text_and_multiple_audios_batched(self):
+        """
+        reproducer: https://gist.github.com/eustlb/c5e0e0a12e84e3d575151ba63d17e4cf
+        disclaimer: Perfect token matching cannot be achieved due to floating-point arithmetic differences between vLLM and Transformers implementations.
+        """
+        conversations = [
+            [
+                {
+                    "role": "user",
+                    "content": [
+                        {
+                            "type": "audio",
+                            "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
+                        },
+                        {
+                            "type": "audio",
+                            "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
+                        },
+                        {
+                            "type": "text",
+                            "text": "Who's speaking in the speach and what city's weather is being discussed?",
+                        },
+                    ],
+                }
+            ],
+            [
+                {
+                    "role": "user",
+                    "content": [
+                        {
+                            "type": "audio",
+                            "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
+                        },
+                        {"type": "text", "text": "What can you tell me about this audio?"},
+                    ],
+                }
+            ],
+        ]
+
+        model = VoxtralForConditionalGeneration.from_pretrained(
+            self.checkpoint_name, torch_dtype=self.dtype, device_map=torch_device
+        )
+
+        inputs = self.processor.apply_chat_template(conversations)
+        inputs = inputs.to(torch_device, dtype=self.dtype)
+
+        outputs = model.generate(**inputs, do_sample=False, max_new_tokens=500)
+        decoded_outputs = self.processor.batch_decode(outputs, skip_special_tokens=True)
+
+        EXPECTED_OUTPUT = [
+            "Who's speaking in the speach and what city's weather is being discussed?The speaker in the speech is Barack Obama, and the weather being discussed is in Barcelona.",
+            'What can you tell me about this audio?This audio is a commentary of a baseball game, specifically a home run hit by Edgar Martinez. Here are some key points:\n\n- **Game Context**: The game is likely a playoff or championship game, as the commentator mentions the American League Championship.\n- **Play Description**: Edgar Martinez hits a home run, which is described as a "line drive" and a "base hit."\n- **Team Involvement**: The team is the Mariners, and the commentator is excited about their chances to win the championship.\n- **Emotional Tone**: The commentator expresses disbelief and excitement, using phrases like "I don\'t believe it" and "my, oh my."\n- **Player Involvement**: The commentator mentions Joy and Junior, likely referring to other players or commentators in the broadcast.',
+        ]
+        self.assertEqual(decoded_outputs, EXPECTED_OUTPUT)
+
+    @slow
+    def test_mini_multi_turn_text_and_audio(self):
+        """
+        reproducer: https://gist.github.com/eustlb/c5e0e0a12e84e3d575151ba63d17e4cf
+        disclaimer: Perfect token matching cannot be achieved due to floating-point arithmetic differences between vLLM and Transformers implementations.
+        """
+        conversations = [
+            [
+                {
+                    "role": "user",
+                    "content": [
+                        {
+                            "type": "audio",
+                            "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
+                        },
+                        {
+                            "type": "audio",
+                            "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
+                        },
+                        {"type": "text", "text": "Describe briefly what you can hear."},
+                    ],
+                },
+                {
+                    "role": "assistant",
+                    "content": "The audio begins with the speaker delivering a farewell address in Chicago, reflecting on his eight years as president and expressing gratitude to the American people. The audio then transitions to a weather report, stating that it was 35 degrees in Barcelona the previous day, but the temperature would drop to minus 20 degrees the following day.",
+                },
+                {
+                    "role": "user",
+                    "content": [
+                        {
+                            "type": "audio",
+                            "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/dude_where_is_my_car.wav",
+                        },
+                        {"type": "text", "text": "Ok, now compare this new audio with the previous one."},
+                    ],
+                },
+            ]
+        ]
+
+        model = VoxtralForConditionalGeneration.from_pretrained(
+            self.checkpoint_name, torch_dtype=self.dtype, device_map=torch_device
+        )
+
+        inputs = self.processor.apply_chat_template(conversations)
+        inputs = inputs.to(torch_device, dtype=self.dtype)
+
+        outputs = model.generate(**inputs, do_sample=False, max_new_tokens=500)
+        decoded_outputs = self.processor.batch_decode(outputs, skip_special_tokens=True)
+
+        EXPECTED_OUTPUT = [
+            'Describe briefly what you can hear.The audio begins with the speaker delivering a farewell address in Chicago, reflecting on his eight years as president and expressing gratitude to the American people. The audio then transitions to a weather report, stating that it was 35 degrees in Barcelona the previous day, but the temperature would drop to minus 20 degrees the following day.Ok, now compare this new audio with the previous one.The new audio is a humorous conversation between two friends, one of whom has a tattoo. The speaker is excited to see the tattoo and asks what it says. The other friend repeatedly says "sweet" in response, leading to a playful exchange. The speaker then realizes the joke and says "your tattoo says dude, your tattoo says sweet, got it?" The previous audio was a farewell address by a president, reflecting on his time in office and expressing gratitude to the American people. The new audio is a casual, light-hearted conversation in contrast to the serious and reflective tone of the previous audio.'
+        ]
+        self.assertEqual(decoded_outputs, EXPECTED_OUTPUT)
+
+    @slow
+    def test_transcribe_mode_audio_input(self):
+        """
+        To test the model's transcription mode, a WER evaluation was run and compared against the reported model performance;
+        see the description of PR https://github.com/huggingface/transformers/pull/39429.
+        disclaimer: Perfect token matching cannot be achieved due to floating-point arithmetic differences between vLLM and Transformers implementations.
+        """
+        model = VoxtralForConditionalGeneration.from_pretrained(
+            self.checkpoint_name, torch_dtype=self.dtype, device_map=torch_device
+        )
+        inputs = self.processor.apply_transcrition_request(
+            language="en",
+            audio="https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
+            model_id=self.checkpoint_name,
+        )
+        inputs = inputs.to(torch_device, dtype=self.dtype)
+        outputs = model.generate(**inputs, do_sample=False, max_new_tokens=500)
+
+        decoded_outputs = self.processor.batch_decode(outputs, skip_special_tokens=True)
+
+        EXPECTED_OUTPUT = [
+            "lang:enThis week, I traveled to Chicago to deliver my final farewell address to the nation, following in the tradition of presidents before me. It was an opportunity to say thank you. Whether we've seen eye-to-eye or rarely agreed at all, my conversations with you, the American people, in living rooms and schools, at farms and on factory floors, at diners and on distant military outposts, All these conversations are what have kept me honest, kept me inspired, and kept me going. Every day, I learned from you. You made me a better president, and you made me a better man. Over the course of these eight years, I've seen the goodness, the resilience, and the hope of the American people. I've seen neighbors looking out for each other as we rescued our economy from the worst crisis of our lifetimes. I've hugged cancer survivors who finally know the security of affordable health care. I've seen communities like Joplin rebuild from disaster, and cities like Boston show the world that no terrorist will ever break the American spirit. I've seen the hopeful faces of young graduates and our newest military officers. I've mourned with grieving families searching for answers, and I found grace in a Charleston church. I've seen our scientists help a paralyzed man regain his sense of touch, and our wounded warriors walk again. I've seen our doctors and volunteers rebuild after earthquakes and stop pandemics in their tracks. I've learned from students who are building robots and curing diseases and who will change the world in ways we can't even imagine. I've seen the youngest of children remind us of our obligations to care for our refugees, to work in peace, and above all, to look out for each other. That's what's possible when we come together in the slow, hard, sometimes frustrating, but always vital work of self-government. But we can't take our democracy for granted. All of us, regardless of party, should throw ourselves into the work of citizenship. Not just when there's an election. Not just when our own narrow interest is at stake. But over the full span of a lifetime. If you're tired of arguing with strangers on the Internet, try to talk with one in real life. If something needs fixing, lace up your shoes and do some organizing. If you're disappointed by your elected officials, then grab a clipboard, get some signatures, and run for office yourself. Our success depends on our"
+        ]
+        self.assertEqual(decoded_outputs, EXPECTED_OUTPUT)
--- a/tests/test_tokenization_mistral_common.py
+++ b/tests/test_tokenization_mistral_common.py