Chat template docs (#36163)

* decompose chat template docs
* add docs
* update model docs
* qwen2-5
* pixtral
* remove old chat template
* also video as list frames supported
* Update docs/source/en/chat_template_multimodal.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* remove audio for now

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-02-14 10:32:14 +01:00
parent 3bf02cf440
commit 1931a35140
14 changed files with 1725 additions and 1470 deletions
--- a/docs/source/en/model_doc/llava.md
+++ b/docs/source/en/model_doc/llava.md
@@ -47,9 +47,19 @@ Adding these attributes means that LLaVA will try to infer the number of image t
 The attributes can be obtained from model config, as `model.config.vision_config.patch_size` or `model.config.vision_feature_select_strategy`. The `num_additional_image_tokens` should be `1` if the vision backbone adds a CLS token or `0` if nothing extra is added to the vision patches.


-### Single image inference
+### Formatting Prompts with Chat Templates  
+
+Each **checkpoint** is trained with a specific prompt format, depending on the underlying large language model backbone. To ensure correct formatting, use the processor’s `apply_chat_template` method.  
+
+**Important:**  
+- You must construct a conversation history — passing a plain string won't work.  
+- Each message should be a dictionary with `"role"` and `"content"` keys.  
+- The `"content"` should be a list of dictionaries for different modalities like `"text"` and `"image"`.  
+
+
+Here’s an example of how to structure your input. 
+We will use [llava-hf/llava-1.5-7b-hf](https://huggingface.co/llava-hf/llava-1.5-7b-hf) and a conversation history of text and image. Each content field has to be a list of dicts, as follows:

-For best results, we recommend users to use the processor's `apply_chat_template()` method to format your prompt correctly. For that you need to construct a conversation history, passing in a plain string will not format your prompt. Each message in the conversation history for chat templates is a dictionary with keys "role" and "content". The "content" should be a list of dictionaries, for "text" and "image" modalities, as follows:

 ```python
 from transformers import AutoProcessor
@@ -84,60 +94,6 @@ print(text_prompt)
 >>> "USER: <image>\nWhat’s shown in this image? ASSISTANT: This image shows a red stop sign.</s>USER: Describe the image in more details. ASSISTANT:"
 ```
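
The three rules listed under **Important** can be checked before a conversation reaches the processor. The sketch below is a hypothetical helper (`validate_conversation` is not part of `transformers`) that only inspects the shape of the input; it does not know which modalities a given checkpoint accepts.

```python
def validate_conversation(conversation):
    """Return a list of problems; an empty list means the structure looks right.

    Hypothetical helper for illustration only, not a transformers API.
    """
    if not isinstance(conversation, list):
        return ["conversation must be a list of messages, not a plain string"]

    problems = []
    for i, message in enumerate(conversation):
        # Each message is a dict with "role" and "content" keys.
        if not isinstance(message, dict) or not {"role", "content"} <= message.keys():
            problems.append(f"message {i}: expected a dict with 'role' and 'content' keys")
            continue
        # "content" is a list of dicts, one per modality item.
        content = message["content"]
        if not isinstance(content, list):
            problems.append(f"message {i}: 'content' must be a list of dicts")
            continue
        for j, item in enumerate(content):
            if not isinstance(item, dict) or "type" not in item:
                problems.append(f"message {i}, item {j}: expected a dict with a 'type' key")
    return problems


conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]

print(validate_conversation(conversation))           # []
print(validate_conversation("What is shown here?"))  # one problem: plain string
```

Returning a list of problems instead of raising on the first one makes it easier to report every malformed message in a long history at once.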

-### Batched inference
-
-LLaVa also supports batched inference. Here is how you can do it:
-
-```python
-import requests
-from PIL import Image
-import torch
-from transformers import AutoProcessor, LlavaForConditionalGeneration
-
-# Load the model in half-precision
-model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf", torch_dtype=torch.float16, device_map="auto")
-processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
-
-# Get two different images
-url = "https://www.ilankelman.org/stopsigns/australia.jpg"
-image_stop = Image.open(requests.get(url, stream=True).raw)
-
-url = "http://images.cocodataset.org/val2017/000000039769.jpg"
-image_cats = Image.open(requests.get(url, stream=True).raw)
-
-# Prepare a batch of two prompts
-conversation_1 = [
-    {
-        "role": "user",
-        "content": [
-            {"type": "image"},
-            {"type": "text", "text": "What is shown in this image?"},
-        ],
-    },
-]
-
-conversation_2 = [
-    {
-        "role": "user",
-        "content": [
-            {"type": "image"},
-            {"type": "text", "text": "What is shown in this image?"},
-        ],
-    },
-]
-
-prompt_1 = processor.apply_chat_template(conversation_1, add_generation_prompt=True)
-prompt_2 = processor.apply_chat_template(conversation_2, add_generation_prompt=True)
-prompts = [prompt_1, prompt_2]
-
-# We can simply feed images in the order they have to be used in the text prompt
-inputs = processor(images=[image_stop, image_cats], text=prompts, padding=True, return_tensors="pt").to(model.device, torch.float16)
-
-# Generate
-generate_ids = model.generate(**inputs, max_new_tokens=30)
-processor.batch_decode(generate_ids, skip_special_tokens=True)
-```
-
 - If you want to construct a chat prompt yourself, below is a list of prompt formats accepted by each llava checkpoint:

 [llava-interleave models](https://huggingface.co/collections/llava-hf/llava-interleave-668e19a97da0036aad4a2f19) require the following format:
@@ -162,6 +118,96 @@ For multiple turns conversation:
 "USER: <image>\n<prompt1> ASSISTANT: <answer1></s>USER: <prompt2> ASSISTANT: <answer2></s>USER: <prompt3> ASSISTANT:"
 ```
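
For readers building the prompt by hand, the `USER: ... ASSISTANT: ...` layout shown for llava-1.5 can be reproduced with plain string handling. This is a simplified sketch, not the Jinja chat template shipped with the checkpoint: `render_llava15_prompt` is a hypothetical name, and it assumes every content item is either `{"type": "image"}` or `{"type": "text", "text": ...}`.

```python
def render_llava15_prompt(conversation, add_generation_prompt=True):
    """Render a conversation in the llava-1.5 text format shown above.

    Illustrative stand-in for the checkpoint's chat template; hypothetical helper.
    """
    parts = []
    for message in conversation:
        items = message["content"]
        num_images = sum(1 for item in items if item["type"] == "image")
        text = "".join(item["text"] for item in items if item["type"] == "text")
        if message["role"] == "user":
            # Image placeholders come first, each followed by a newline.
            parts.append("USER: " + "<image>\n" * num_images + text + " ")
        else:
            # Assistant turns are closed with the end-of-sequence token.
            parts.append("ASSISTANT: " + text + "</s>")
    if add_generation_prompt:
        parts.append("ASSISTANT:")
    return "".join(parts)


conversation = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What's shown in this image?"},
    ]},
    {"role": "assistant", "content": [
        {"type": "text", "text": "This image shows a red stop sign."},
    ]},
    {"role": "user", "content": [
        {"type": "text", "text": "Describe the image in more details."},
    ]},
]

print(render_llava15_prompt(conversation))
```

Other checkpoints use different layouts, so in practice `apply_chat_template` remains the safer choice; the sketch only makes the format string concrete.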

+🚀 **Bonus:** If you're using `transformers>=4.49.0`, you can also get a vectorized output from `apply_chat_template`. See the **Usage Examples** below for more details on how to use it.
+
+
+## Usage examples
+
+### Single input inference
+
+
+```python
+import torch
+from transformers import AutoProcessor, LlavaForConditionalGeneration
+
+# Load the model in half-precision
+model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf", torch_dtype=torch.float16, device_map="auto")
+processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
+
+conversation = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
+            {"type": "text", "text": "What is shown in this image?"},
+        ],
+    },
+]
+
+inputs = processor.apply_chat_template(
+    conversation,
+    add_generation_prompt=True,
+    tokenize=True,
+    return_dict=True,
+    return_tensors="pt"
+).to(model.device, torch.float16)
+
+# Generate
+generate_ids = model.generate(**inputs, max_new_tokens=30)
+processor.batch_decode(generate_ids, skip_special_tokens=True)
+```
+
+
+### Batched inference
+
+LLaVa also supports batched inference. Here is how you can do it:
+
+```python
+import torch
+from transformers import AutoProcessor, LlavaForConditionalGeneration
+
+# Load the model in half-precision
+model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf", torch_dtype=torch.float16, device_map="auto")
+processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
+
+
+# Prepare a batch of two prompts
+conversation_1 = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
+            {"type": "text", "text": "What is shown in this image?"},
+        ],
+    },
+]
+
+conversation_2 = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
+            {"type": "text", "text": "What is shown in this image?"},
+        ],
+    },
+]
+
+inputs = processor.apply_chat_template(
+    [conversation_1, conversation_2],
+    add_generation_prompt=True,
+    tokenize=True,
+    return_dict=True,
+    padding=True,
+    return_tensors="pt"
+).to(model.device, torch.float16)
+
+
+# Generate
+generate_ids = model.generate(**inputs, max_new_tokens=30)
+processor.batch_decode(generate_ids, skip_special_tokens=True)
+```
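
With the image location embedded in each message, there is no separate image list to keep aligned with the prompts. To see which media a batch refers to, and in what order, a small walk over the conversations is enough. The helper below is hypothetical and only looks at the `"url"` and `"path"` keys that appear in these docs; it is not how the processor itself resolves media.

```python
def collect_media(conversations):
    """List (conversation_index, type, location) for each image/video item, in prompt order.

    Hypothetical helper for illustration; only the "url" and "path" keys are inspected.
    """
    found = []
    for idx, conversation in enumerate(conversations):
        for message in conversation:
            for item in message["content"]:
                if item["type"] in ("image", "video"):
                    location = item.get("url") or item.get("path")
                    found.append((idx, item["type"], location))
    return found


conversation_1 = [
    {"role": "user", "content": [
        {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
        {"type": "text", "text": "What is shown in this image?"},
    ]},
]
conversation_2 = [
    {"role": "user", "content": [
        {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
        {"type": "text", "text": "What is shown in this image?"},
    ]},
]

for entry in collect_media([conversation_1, conversation_2]):
    print(entry)
```

An item whose location prints as `None` has neither key set, which is a quick way to spot a misspelled key before calling the processor.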
+
+
 ## Note regarding reproducing original implementation

 In order to match the logits of the [original implementation](https://github.com/haotian-liu/LLaVA/tree/main), one needs to additionally specify `do_pad=True` when instantiating `LlavaImageProcessor`:
--- a/docs/source/en/model_doc/llava_next.md
+++ b/docs/source/en/model_doc/llava_next.md
@@ -59,9 +59,17 @@ Adding these attributes means that LLaVA will try to infer the number of image t
 The attributes can be obtained from model config, as `model.config.vision_config.patch_size` or `model.config.vision_feature_select_strategy`. The `num_additional_image_tokens` should be `1` if the vision backbone adds a CLS token or `0` if nothing extra is added to the vision patches.


- Note that each checkpoint has been trained with a specific prompt format, depending on which large language model (LLM) was used. You can use the processor's `apply_chat_template` to format your prompts correctly. For that you have to construct a conversation history, passing a plain string will not format your prompt. Each message in the conversation history for chat templates is a dictionary with keys "role" and "content". The "content" should be a list of dictionaries, for "text" and "image" modalities. Below is an example of how to do that and the list of formats accepted by each checkpoint.
+### Formatting Prompts with Chat Templates  

-We will use [llava-v1.6-mistral-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf) and a conversation history of text and image. Each content field has to be a list of dicts, as follows:
+Each **checkpoint** is trained with a specific prompt format, depending on the underlying large language model backbone. To ensure correct formatting, use the processor’s `apply_chat_template` method.  
+
+**Important:**  
+- You must construct a conversation history — passing a plain string won't work.  
+- Each message should be a dictionary with `"role"` and `"content"` keys.  
+- The `"content"` should be a list of dictionaries for different modalities like `"text"` and `"image"`.  
+
+
+Here’s an example of how to structure your input. We will use [llava-v1.6-mistral-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf) and a conversation history of text and image.

 ```python
 from transformers import LlavaNextProcessor
@@ -125,6 +133,10 @@ print(text_prompt)
 "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<image>\nWhat is shown in this image?<|im_end|>\n<|im_start|>assistant\n"
 ```

+🚀 **Bonus:** If you're using `transformers>=4.49.0`, you can also get a vectorized output from `apply_chat_template`. See the **Usage Examples** below for more details on how to use it.
+
+
+
 ## Usage example

 ### Single image inference
--- a/docs/source/en/model_doc/llava_next_video.md
+++ b/docs/source/en/model_doc/llava_next_video.md
@@ -56,9 +56,17 @@ Adding these attributes means that LLaVA will try to infer the number of image t
 The attributes can be obtained from model config, as `model.config.vision_config.patch_size` or `model.config.vision_feature_select_strategy`. The `num_additional_image_tokens` should be `1` if the vision backbone adds a CLS token or `0` if nothing extra is added to the vision patches.


- Note that each checkpoint has been trained with a specific prompt format, depending on which large language model (LLM) was used. You can use tokenizer's `apply_chat_template` to format your prompts correctly. Below is an example of how to do that.
+### Formatting Prompts with Chat Templates  

-We will use [LLaVA-NeXT-Video-7B-hf](https://huggingface.co/llava-hf/LLaVA-NeXT-Video-7B-hf) and a conversation history of videos and images. Each content field has to be a list of dicts, as follows:
+Each **checkpoint** is trained with a specific prompt format, depending on the underlying large language model backbone. To ensure correct formatting, use the processor’s `apply_chat_template` method.  
+
+**Important:**  
+- You must construct a conversation history — passing a plain string won't work.  
+- Each message should be a dictionary with `"role"` and `"content"` keys.  
+- The `"content"` should be a list of dictionaries for different modalities like `"text"`, `"image"`, and `"video"`.  
+
+
+Here’s an example of how to structure your input. We will use [LLaVA-NeXT-Video-7B-hf](https://huggingface.co/llava-hf/LLaVA-NeXT-Video-7B-hf) and a conversation history of videos and images.

 ```python
 from transformers import LlavaNextVideoProcessor
@@ -99,6 +107,10 @@ text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=
 print(text_prompt)
 ```

+🚀 **Bonus:** If you're using `transformers>=4.49.0`, you can also get a vectorized output from `apply_chat_template`. See the **Usage Examples** below for more details on how to use it.
+
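
The usage examples for this model pass `num_frames=8` to `apply_chat_template` in place of the manual PyAV sampling that this change removes. The removed code picked uniformly spaced frame indices with `np.arange(0, total_frames, total_frames / 8).astype(int)`; the sketch below mirrors that arithmetic in pure Python. How the processor samples frames internally is not shown in this diff, so treat this as a reading of the removed helper code only (`uniform_frame_indices` is a hypothetical name).

```python
def uniform_frame_indices(total_frames, num_frames=8):
    """Return `num_frames` indices spread evenly over a video of `total_frames` frames.

    Mirrors the removed `np.arange(0, total, total / 8).astype(int)` logic;
    hypothetical helper, not a transformers API.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    step = total_frames / num_frames
    # Truncate toward zero, as `.astype(int)` does for non-negative values.
    return [int(i * step) for i in range(num_frames)]


print(uniform_frame_indices(80))   # [0, 10, 20, 30, 40, 50, 60, 70]
print(uniform_frame_indices(100))  # [0, 12, 25, 37, 50, 62, 75, 87]
```

When a clip has fewer frames than `num_frames`, this scheme repeats indices, which is one reason to sample more frames only for longer videos.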
+
+
 ## Usage example

 ### Single Media Mode
@@ -106,41 +118,16 @@ print(text_prompt)
 The model can accept both images and videos as input. Here's an example code for inference in half-precision (`torch.float16`):

 ```python
-import av
+from huggingface_hub import hf_hub_download
 import torch
-import numpy as np
 from transformers import LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor

-def read_video_pyav(container, indices):
-    '''
-    Decode the video with PyAV decoder.
-    Args:
-        container (`av.container.input.InputContainer`): PyAV container.
-        indices (`List[int]`): List of frame indices to decode.
-    Returns:
-        result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
-    '''
-    frames = []
-    container.seek(0)
-    start_index = indices[0]
-    end_index = indices[-1]
-    for i, frame in enumerate(container.decode(video=0)):
-        if i > end_index:
-            break
-        if i >= start_index and i in indices:
-            frames.append(frame)
-    return np.stack([x.to_ndarray(format="rgb24") for x in frames])
-
 # Load the model in half-precision
 model = LlavaNextVideoForConditionalGeneration.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf", torch_dtype=torch.float16, device_map="auto")
 processor = LlavaNextVideoProcessor.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf")

 # Download a sample video; `num_frames=8` below controls how many frames are sampled (can sample more for longer videos)
 video_path = hf_hub_download(repo_id="raushan-testing-hf/videos-test", filename="sample_demo_1.mp4", repo_type="dataset")
-container = av.open(video_path)
-total_frames = container.streams.video[0].frames
-indices = np.arange(0, total_frames, total_frames / 8).astype(int)
-video = read_video_pyav(container, indices)

 conversation = [
    {
@@ -148,13 +135,12 @@ conversation = [
        "role": "user",
        "content": [
            {"type": "text", "text": "Why is this video funny?"},
-            {"type": "video"},
+            {"type": "video", "path": video_path},
            ],
    },
 ]

-prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
-inputs = processor(text=prompt, videos=video, return_tensors="pt")
+inputs = processor.apply_chat_template(conversation, num_frames=8, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt")

 out = model.generate(**inputs, max_new_tokens=60)
 processor.batch_decode(out, skip_special_tokens=True, clean_up_tokenization_spaces=True)
@@ -166,20 +152,15 @@ processor.batch_decode(out, skip_special_tokens=True, clean_up_tokenization_spac
 The model can also generate from interleaved image-video inputs. Note, however, that it was not trained in an interleaved image-video setting, which might affect performance. Below is an example of mixed-media input; add the following lines to the code snippet above:

 ```python
-from PIL import Image
-import requests

 # Generate from image and video mixed inputs
-# Load and image and write a new prompt
-url = "http://images.cocodataset.org/val2017/000000039769.jpg"
-image = Image.open(requests.get(url, stream=True).raw)
 conversation = [
    {

        "role": "user",
        "content": [
            {"type": "text", "text": "How many cats are there in the image?"},
-            {"type": "image"},
+            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            ],
    },
    {
@@ -192,12 +173,11 @@ conversation = [
        "role": "user",
        "content": [
            {"type": "text", "text": "Why is this video funny?"},
-            {"type": "video"},
+            {"type": "video", "path": video_path},
            ],
    },
 ]
-prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
-inputs = processor(text=prompt, images=image, videos=clip, padding=True, return_tensors="pt")
+inputs = processor.apply_chat_template(conversation, num_frames=8, add_generation_prompt=True, tokenize=True, return_dict=True, padding=True, return_tensors="pt")

 # Generate
 generate_ids = model.generate(**inputs, max_length=50)
--- a/docs/source/en/model_doc/llava_onevision.md
+++ b/docs/source/en/model_doc/llava_onevision.md
@@ -47,8 +47,18 @@ Tips:

 </Tip>

- Note that the model should use a specific prompt format, on which the large language model (LLM) was trained. You can use the processor's `apply_chat_template` to format your prompts correctly. For that you have to construct a conversation history, passing a plain string will not format your prompt. Each message in the conversation history for chat templates is a dictionary with keys "role" and "content". The "content" should be a list of dictionaries, for "text" and "image" modalities.

+### Formatting Prompts with Chat Templates  
+
+Each **checkpoint** is trained with a specific prompt format, depending on the underlying large language model backbone. To ensure correct formatting, use the processor’s `apply_chat_template` method.  
+
+**Important:**  
+- You must construct a conversation history — passing a plain string won't work.  
+- Each message should be a dictionary with `"role"` and `"content"` keys.  
+- The `"content"` should be a list of dictionaries for different modalities like `"text"` and `"image"`.  
+
+
+Here’s an example of how to structure your input. 
 We will use [llava-onevision-qwen2-7b-si-hf](https://huggingface.co/llava-hf/llava-onevision-qwen2-7b-si-hf) and a conversation history of text and image. Each content field has to be a list of dicts, as follows:

 ```python
@@ -84,6 +94,9 @@ print(text_prompt)
 '<|im_start|>user\n<image>What is shown in this image?<|im_end|>\n<|im_start|>assistant\nPage showing the list of options.<|im_end|>'
 ```

+🚀 **Bonus:** If you're using `transformers>=4.49.0`, you can also get a vectorized output from `apply_chat_template`. See the **Usage Examples** below for more details on how to use it.
+
+
 This model was contributed by [RaushanTurganbay](https://huggingface.co/RaushanTurganbay).
 The original code can be found [here](https://github.com/LLaVA-VL/LLaVA-NeXT/tree/main).

@@ -97,28 +110,28 @@ Here's how to load the model and perform inference in half-precision (`torch.flo
 ```python
 from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration
 import torch
-from PIL import Image
-import requests

-processor = AutoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf")
-model = LlavaOnevisionForConditionalGeneration.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf", torch_dtype=torch.float16, low_cpu_mem_usage=True)
-model.to("cuda:0")
+processor = AutoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf") 
+model = LlavaOnevisionForConditionalGeneration.from_pretrained(
+    "llava-hf/llava-onevision-qwen2-7b-ov-hf",
+    torch_dtype=torch.float16,
+    low_cpu_mem_usage=True,
+    device_map="cuda:0"
+)

 # prepare image and text prompt, using the appropriate prompt template
 url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
-image = Image.open(requests.get(url, stream=True).raw)
-
 conversation = [
    {
        "role": "user",
        "content": [
-            {"type": "image"},
+            {"type": "image", "url": url},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
 ]
-prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
-inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda:0", torch.float16)
+inputs = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt")
+inputs = inputs.to("cuda:0", torch.float16)

 # autoregressively complete prompt
 output = model.generate(**inputs, max_new_tokens=100)
@@ -140,22 +153,12 @@ from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration
 model = LlavaOnevisionForConditionalGeneration.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf", torch_dtype=torch.float16, device_map="auto")
 processor = AutoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf")

-# Get three different images
-url = "https://www.ilankelman.org/stopsigns/australia.jpg"
-image_stop = Image.open(requests.get(url, stream=True).raw)
-
-url = "http://images.cocodataset.org/val2017/000000039769.jpg"
-image_cats = Image.open(requests.get(url, stream=True).raw)
-
-url = "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg"
-image_snowman = Image.open(requests.get(url, stream=True).raw)
-
 # Prepare a batch of two prompts, where the first one is a multi-turn conversation and the second is not
 conversation_1 = [
    {
        "role": "user",
        "content": [
-            {"type": "image"},
+            {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
            {"type": "text", "text": "What is shown in this image?"},
            ],
    },
@@ -168,7 +171,7 @@ conversation_1 = [
    {
        "role": "user",
        "content": [
-            {"type": "image"},
+            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "What about this image? How many cats do you see?"},
            ],
    },
@@ -178,18 +181,20 @@ conversation_2 = [
    {
        "role": "user",
        "content": [
-            {"type": "image"},
+            {"type": "image", "url": "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg"},
            {"type": "text", "text": "What is shown in this image?"},
            ],
    },
 ]

-prompt_1 = processor.apply_chat_template(conversation_1, add_generation_prompt=True)
-prompt_2 = processor.apply_chat_template(conversation_2, add_generation_prompt=True)
-prompts = [prompt_1, prompt_2]
-
-# We can simply feed images in the order they have to be used in the text prompt
-inputs = processor(images=[image_stop, image_cats, image_snowman], text=prompts, padding=True, return_tensors="pt").to(model.device, torch.float16)
+inputs = processor.apply_chat_template(
+    [conversation_1, conversation_2],
+    add_generation_prompt=True,
+    tokenize=True,
+    return_dict=True,
+    padding=True,
+    return_tensors="pt"
+).to(model.device, torch.float16)

 # Generate
 generate_ids = model.generate(**inputs, max_new_tokens=30)
@@ -202,10 +207,7 @@ processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokeniza
 LLaVa-OneVision can also perform inference with videos as input, where video frames are treated as multiple images. Here is how you can do it:

 ```python
-import av
-import numpy as np
 from huggingface_hub import hf_hub_download
-
 import torch
 from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

@@ -213,48 +215,26 @@ from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration
 model = LlavaOnevisionForConditionalGeneration.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf", torch_dtype=torch.float16, device_map="auto")
 processor = AutoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf")

-
-def read_video_pyav(container, indices):
-    '''
-    Decode the video with PyAV decoder.
-    Args:
-        container (`av.container.input.InputContainer`): PyAV container.
-        indices (`List[int]`): List of frame indices to decode.
-    Returns:
-        result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
-    '''
-    frames = []
-    container.seek(0)
-    start_index = indices[0]
-    end_index = indices[-1]
-    for i, frame in enumerate(container.decode(video=0)):
-        if i > end_index:
-            break
-        if i >= start_index and i in indices:
-            frames.append(frame)
-    return np.stack([x.to_ndarray(format="rgb24") for x in frames])
-
-# Load the video as an np.array, sampling uniformly 8 frames (can sample more for longer videos, up to 32 frames)
 video_path = hf_hub_download(repo_id="raushan-testing-hf/videos-test", filename="sample_demo_1.mp4", repo_type="dataset")
-container = av.open(video_path)
-total_frames = container.streams.video[0].frames
-indices = np.arange(0, total_frames, total_frames / 8).astype(int)
-video = read_video_pyav(container, indices)
-
-# For videos we have to feed a "video" type instead of "image"
 conversation = [
    {

        "role": "user",
        "content": [
-            {"type": "video"},
+            {"type": "video", "path": video_path},
            {"type": "text", "text": "Why is this video funny?"},
            ],
    },
 ]

-prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
-inputs = processor(videos=list(video), text=prompt, return_tensors="pt").to("cuda:0", torch.float16)
+inputs = processor.apply_chat_template(
+    conversation,
+    num_frames=8,
+    add_generation_prompt=True,
+    tokenize=True,
+    return_dict=True,
+    return_tensors="pt"
+).to(model.device, torch.float16)

 out = model.generate(**inputs, max_new_tokens=60)
 processor.batch_decode(out, skip_special_tokens=True, clean_up_tokenization_spaces=True)
--- a/docs/source/en/model_doc/mllama.md
+++ b/docs/source/en/model_doc/mllama.md
@@ -28,7 +28,8 @@ The Llama 3.2-Vision collection of multimodal large language models (LLMs) is a
 - For text-only inputs use `MllamaForCausalLM` for generation to avoid loading vision tower.
 - Each sample can contain multiple images, and the number of images can vary between samples. The processor will pad the inputs to the maximum number of images across samples and to a maximum number of tiles within each image.
 - The text passed to the processor should have the `"<|image|>"` tokens where the images should be inserted.
- The processor has its own `apply_chat_template` method to convert chat messages to text that can then be passed as text to the processor.
+- The processor has its own `apply_chat_template` method to convert chat messages to text that can then be passed as text to the processor. If you're using `transformers>=4.49.0`, you can also get a vectorized output from `apply_chat_template`. See the **Usage Examples** below for more details on how to use it.
+


 <Tip warning={true}>
@@ -53,9 +54,7 @@ model.set_output_embeddings(resized_embeddings)

 #### Instruct model
 ```python
-import requests
 import torch
-from PIL import Image
 from transformers import MllamaForConditionalGeneration, AutoProcessor

 model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
@@ -67,18 +66,13 @@ messages = [
        {
            "role": "user", 
            "content": [
-                {"type": "image"},
+                {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
                {"type": "text", "text": "What does the image show?"}
            ]
        }
    ],
 ]
-text = processor.apply_chat_template(messages, add_generation_prompt=True)
-
-url = "https://llava-vl.github.io/static/images/view.jpg"
-image = Image.open(requests.get(url, stream=True).raw)
-
-inputs = processor(text=text, images=image, return_tensors="pt").to(model.device)
+inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device)
 output = model.generate(**inputs, max_new_tokens=25)
 print(processor.decode(output[0]))
 ```
--- a/docs/source/en/model_doc/pixtral.md
+++ b/docs/source/en/model_doc/pixtral.md
@@ -38,38 +38,42 @@ Tips:
 ```
 "<s>[INST][IMG]\nWhat are the things I should be cautious about when I visit this place?[/INST]"
 ```
-Then, the processor will replace each `[IMG]` token with a number of `[IMG]` tokens that depend on the height and the width of each image. Each *row* of the image is separated by an `[IMG_BREAK]` token, and each image is separated by an `[IMG_END]` token. It's advised to use the `apply_chat_template` method of the processor, which takes care of all of this. See the [usage section](#usage) for more info.
+Then, the processor will replace each `[IMG]` token with a number of `[IMG]` tokens that depend on the height and the width of each image. Each *row* of the image is separated by an `[IMG_BREAK]` token, and each image is separated by an `[IMG_END]` token. It's advised to use the `apply_chat_template` method of the processor, which takes care of all of this and formats the text for you. If you're using `transformers>=4.49.0`, you can also get a vectorized output from `apply_chat_template`. See the [usage section](#usage) for more info.
+

 This model was contributed by [amyeroberts](https://huggingface.co/amyeroberts) and [ArthurZ](https://huggingface.co/ArthurZ). The original code can be found [here](https://github.com/vllm-project/vllm/pull/8377).

+
 ## Usage

 At inference time, it's advised to use the processor's `apply_chat_template` method, which correctly formats the prompt for the model:

 ```python
 from transformers import AutoProcessor, LlavaForConditionalGeneration
-from PIL import Image

 model_id = "mistral-community/pixtral-12b"
 processor = AutoProcessor.from_pretrained(model_id)
-model = LlavaForConditionalGeneration.from_pretrained(model_id).to("cuda")
-
-url_dog = "https://picsum.photos/id/237/200/300"
-url_mountain = "https://picsum.photos/seed/picsum/200/300"
+model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="cuda")

 chat = [
    {
      "role": "user", "content": [
        {"type": "text", "content": "Can this animal"}, 
-        {"type": "image"}, 
+        {"type": "image", "url": "https://picsum.photos/id/237/200/300"}, 
        {"type": "text", "content": "live here?"}, 
-        {"type": "image"}
+        {"type": "image", "url": "https://picsum.photos/seed/picsum/200/300"}
      ]
    }
 ]

-prompt = processor.apply_chat_template(chat)
-inputs = processor(text=prompt, images=[url_dog, url_mountain], return_tensors="pt").to(model.device)
+inputs = processor.apply_chat_template(
+    chat,
+    add_generation_prompt=True,
+    tokenize=True,
+    return_dict=True,
+    return_tensors="pt"
+).to(model.device)
+
 generate_ids = model.generate(**inputs, max_new_tokens=500)
 output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
 ```
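
The `[IMG]` expansion described in the tips can be made concrete with a small sketch. It takes the patch grid size directly, because how the grid is derived from an image's height and width (resizing, patch size) is not stated here. It also assumes that the last row's `[IMG_BREAK]` is replaced by `[IMG_END]`, which the text above does not spell out. `expand_image_placeholder` is a hypothetical name, not a processor method.

```python
def expand_image_placeholder(rows, cols):
    """Expand one image placeholder into its per-patch token sequence.

    Assumption: each row of `cols` [IMG] tokens ends with [IMG_BREAK],
    except the last row, which ends with [IMG_END]. Illustrative only.
    """
    if rows < 1 or cols < 1:
        raise ValueError("rows and cols must be at least 1")
    tokens = []
    for _ in range(rows):
        tokens.extend(["[IMG]"] * cols)
        tokens.append("[IMG_BREAK]")
    # The final row break becomes the end-of-image marker.
    tokens[-1] = "[IMG_END]"
    return tokens


print("".join(expand_image_placeholder(rows=2, cols=3)))
# [IMG][IMG][IMG][IMG_BREAK][IMG][IMG][IMG][IMG_END]
```

Under this assumption the sequence length is `rows * (cols + 1)`, which is why prompt length grows with image resolution.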
--- a/docs/source/en/model_doc/qwen2_5_vl.md
+++ b/docs/source/en/model_doc/qwen2_5_vl.md
@@ -32,21 +32,13 @@ The model can accept both images and videos as input. Here's an example code for

 ```python

-from PIL import Image
-import requests
 import torch
-from torchvision import io
-from typing import Dict
-from transformers.image_utils import load_images, load_video
 from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor

 # Load the model in half-precision on the available device(s)
 model = Qwen2_5_VLForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", device_map="auto")
 processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

-# Image
-url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
-image = Image.open(requests.get(url, stream=True).raw)

 conversation = [
    {
@@ -54,6 +46,7 @@ conversation = [
        "content":[
            {
                "type":"image",
+                "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
            },
            {
                "type":"text",
@@ -63,13 +56,14 @@ conversation = [
    }
 ]

+inputs = processor.apply_chat_template(
+    conversation,
+    add_generation_prompt=True,
+    tokenize=True,
+    return_dict=True,
+    return_tensors="pt"
+).to(model.device)

-# Preprocess the inputs
-text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
-# Excepted output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'
-
-inputs = processor(text=[text_prompt], images=[image], padding=True, return_tensors="pt")
-inputs = inputs.to('cuda')

 # Inference: Generation of the output
 output_ids = model.generate(**inputs, max_new_tokens=128)
@@ -78,25 +72,24 @@ output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, cl
 print(output_text)

 # Video
-video = load_video(video="/path/to/video.mp4")
 conversation = [
    {
        "role": "user",
        "content": [
-            {"type": "video"},
+            {"type": "video", "path": "/path/to/video.mp4"},
            {"type": "text", "text": "What happened in the video?"},
        ],
    }
 ]

-# Preprocess the inputs
-text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
-# Excepted output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|video_pad|><|vision_end|>What happened in the video?<|im_end|>\n<|im_start|>assistant\n'
-
-# Qwen2.5VL modifies the time positional encoding (MRoPE) according to the video's frame rate (FPS).
-# Therefore, the video's FPS information needs to be provided as input.
-inputs = processor(text=[text_prompt], videos=[video], fps=[1.0], padding=True, return_tensors="pt")
-inputs = inputs.to('cuda')
+inputs = processor.apply_chat_template(
+    conversation,
+    video_fps=1,
+    add_generation_prompt=True,
+    tokenize=True,
+    return_dict=True,
+    return_tensors="pt"
+).to(model.device)

 # Inference: Generation of the output
 output_ids = model.generate(**inputs, max_new_tokens=128)
@@ -110,21 +103,12 @@ print(output_text)
 The model can batch inputs composed of mixed samples of various types such as images, videos, and text. Here is an example.

 ```python
-images = load_images([
-    "/path/to/image1.jpg",
-    "/path/to/image2.jpg",
-    "/path/to/image3.jpg",
-    "/path/to/image4.jpg",
-    "/path/to/image5.jpg",
-])
-video = load_video(video="/path/to/video.mp4")
-
 # Conversation for the first image
 conversation1 = [
    {
        "role": "user",
        "content": [
-            {"type": "image"},
+            {"type": "image", "path": "/path/to/image1.jpg"},
            {"type": "text", "text": "Describe this image."}
        ]
    }
@@ -135,8 +119,8 @@ conversation2 = [
    {
        "role": "user",
        "content": [
-            {"type": "image"},
-            {"type": "image"},
+            {"type": "image", "path": "/path/to/image2.jpg"},
+            {"type": "image", "path": "/path/to/image3.jpg"},
            {"type": "text", "text": "What is written in the pictures?"}
        ]
    }
@@ -156,9 +140,9 @@ conversation4 = [
    {
        "role": "user",
        "content": [
-            {"type": "image"},
-            {"type": "image"},
-            {"type": "video"},
+            {"type": "image", "path": "/path/to/image3.jpg"},
+            {"type": "image", "path": "/path/to/image4.jpg"},
+            {"type": "video", "path": "/path/to/video.mp4"},
            {"type": "text", "text": "What are the common elements in these medias?"},
        ],
    }
@@ -166,15 +150,15 @@ conversation4 = [

 conversations = [conversation1, conversation2, conversation3, conversation4]
 # Preparation for batch inference
-texts = [processor.apply_chat_template(msg, add_generation_prompt=True) for msg in conversations]
-inputs = processor(
-    text=texts,
-    images=images,
-    videos=[video],
-    padding=True,
-    return_tensors="pt",
-)
-inputs = inputs.to('cuda')
+inputs = processor.apply_chat_template(
+    conversations,
+    video_fps=1,
+    add_generation_prompt=True,
+    tokenize=True,
+    return_dict=True,
+    return_tensors="pt"
+).to(model.device)
+

 # Batch Inference
 output_ids = model.generate(**inputs, max_new_tokens=128)
--- a/docs/source/en/model_doc/qwen2_vl.md
+++ b/docs/source/en/model_doc/qwen2_vl.md
@@ -39,20 +39,13 @@ The model can accept both images and videos as input. Here's an example code for

 ```python

-from PIL import Image
-import requests
 import torch
-from torchvision import io
-from typing import Dict
 from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor

 # Load the model in half-precision on the available device(s)
 model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", device_map="auto")
 processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

-# Image
-url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
-image = Image.open(requests.get(url, stream=True).raw)

 conversation = [
    {
@@ -60,6 +53,7 @@ conversation = [
        "content":[
            {
                "type":"image",
+                "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
            },
            {
                "type":"text",
@@ -69,13 +63,13 @@ conversation = [
    }
 ]

-
-# Preprocess the inputs
-text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
-# Excepted output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'
-
-inputs = processor(text=[text_prompt], images=[image], padding=True, return_tensors="pt")
-inputs = inputs.to('cuda')
+inputs = processor.apply_chat_template(
+    conversation,
+    add_generation_prompt=True,
+    tokenize=True,
+    return_dict=True,
+    return_tensors="pt"
+).to(model.device)

 # Inference: Generation of the output
 output_ids = model.generate(**inputs, max_new_tokens=128)
@@ -83,50 +77,28 @@ generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(in
 output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
 print(output_text)

+
+
 # Video
-def fetch_video(ele: Dict, nframe_factor=2):
-    if isinstance(ele['video'], str):
-        def round_by_factor(number: int, factor: int) -> int:
-            return round(number / factor) * factor
-
-        video = ele["video"]
-        if video.startswith("file://"):
-            video = video[7:]
-
-        video, _, info = io.read_video(
-            video,
-            start_pts=ele.get("video_start", 0.0),
-            end_pts=ele.get("video_end", None),
-            pts_unit="sec",
-            output_format="TCHW",
-        )
-        assert not ("fps" in ele and "nframes" in ele), "Only accept either `fps` or `nframes`"
-        if "nframes" in ele:
-            nframes = round_by_factor(ele["nframes"], nframe_factor)
-        else:
-            fps = ele.get("fps", 1.0)
-            nframes = round_by_factor(video.size(0) / info["video_fps"] * fps, nframe_factor)
-        idx = torch.linspace(0, video.size(0) - 1, nframes, dtype=torch.int64)
-        return video[idx]
-
-video_info = {"type": "video", "video": "/path/to/video.mp4", "fps": 1.0}
-video = fetch_video(video_info)
 conversation = [
    {
        "role": "user",
        "content": [
-            {"type": "video"},
+            {"type": "video", "path": "/path/to/video.mp4"},
            {"type": "text", "text": "What happened in the video?"},
        ],
    }
 ]

-# Preprocess the inputs
-text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
-# Excepted output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|video_pad|><|vision_end|>What happened in the video?<|im_end|>\n<|im_start|>assistant\n'
+inputs = processor.apply_chat_template(
+    conversation,
+    video_fps=1,
+    add_generation_prompt=True,
+    tokenize=True,
+    return_dict=True,
+    return_tensors="pt"
+).to(model.device)

-inputs = processor(text=[text_prompt], videos=[video], padding=True, return_tensors="pt")
-inputs = inputs.to('cuda')

 # Inference: Generation of the output
 output_ids = model.generate(**inputs, max_new_tokens=128)
@@ -140,23 +112,13 @@ print(output_text)
 The model can batch inputs composed of mixed samples of various types such as images, videos, and text. Here is an example.

 ```python
-image1 = Image.open("/path/to/image1.jpg")
-image2 = Image.open("/path/to/image2.jpg")
-image3 = Image.open("/path/to/image3.jpg")
-image4 = Image.open("/path/to/image4.jpg")
-image5 = Image.open("/path/to/image5.jpg")
-video = fetch_video({
-    "type": "video",
-    "video": "/path/to/video.mp4",
-    "fps": 1.0
-})

 # Conversation for the first image
 conversation1 = [
    {
        "role": "user",
        "content": [
-            {"type": "image"},
+            {"type": "image", "path": "/path/to/image1.jpg"},
            {"type": "text", "text": "Describe this image."}
        ]
    }
@@ -167,8 +129,8 @@ conversation2 = [
    {
        "role": "user",
        "content": [
-            {"type": "image"},
-            {"type": "image"},
+            {"type": "image", "path": "/path/to/image2.jpg"},
+            {"type": "image", "path": "/path/to/image3.jpg"},
            {"type": "text", "text": "What is written in the pictures?"}
        ]
    }
@@ -188,9 +150,9 @@ conversation4 = [
    {
        "role": "user",
        "content": [
-            {"type": "image"},
-            {"type": "image"},
-            {"type": "video"},
+            {"type": "image", "path": "/path/to/image3.jpg"},
+            {"type": "image", "path": "/path/to/image4.jpg"},
+            {"type": "video", "path": "/path/to/video.mp4"},
            {"type": "text", "text": "What are the common elements in these medias?"},
        ],
    }
@@ -198,15 +160,15 @@ conversation4 = [

 conversations = [conversation1, conversation2, conversation3, conversation4]
 # Preparation for batch inference
-texts = [processor.apply_chat_template(msg, add_generation_prompt=True) for msg in conversations]
-inputs = processor(
-    text=texts,
-    images=[image1, image2, image3, image4, image5],
-    videos=[video],
-    padding=True,
-    return_tensors="pt",
-)
-inputs = inputs.to('cuda')
+inputs = processor.apply_chat_template(
+    conversations,
+    video_fps=1,
+    add_generation_prompt=True,
+    tokenize=True,
+    return_dict=True,
+    return_tensors="pt"
+).to(model.device)
+

 # Batch Inference
 output_ids = model.generate(**inputs, max_new_tokens=128)
@@ -236,6 +198,7 @@ processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", min_pixel
 ```
 This ensures each image gets encoded using a number between 256-1024 tokens. The 28 comes from the fact that the model uses a patch size of 14 and a temporal patch size of 2 (14 x 2 = 28).
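
The arithmetic behind that token range can be checked with a few lines. This is only a sketch of the calculation stated above; the variable names are chosen for this example and are not library parameters.

```python
PATCH_SIZE = 14   # patch size, as stated above
FACTOR = 2        # the factor of 2 from the text (14 x 2 = 28)

side = PATCH_SIZE * FACTOR        # 28 pixels per token side
pixels_per_token = side * side    # 784 pixels covered by one token

min_pixels = 256 * pixels_per_token    # lower bound: 256 tokens
max_pixels = 1024 * pixels_per_token   # upper bound: 1024 tokens

print(min_pixels, max_pixels)  # prints: 200704 802816
```

Dividing an image's pixel count (after resizing into this range) by 784 gives an estimate of how many tokens it will occupy.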

+
 #### Multiple Image Inputs

 By default, images and video content are directly included in the conversation. When handling multiple images, it's helpful to add labels to the images and videos for better reference. Users can control this behavior with the following settings: