Chat template docs (#36163)

* decompose chat template docs

* add docs

* update model docs

* qwen2-5

* pixtral

* remove old chat template

* also video as list frames supported

* Update docs/source/en/chat_template_multimodal.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/chat_template_multimodal.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/chat_template_multimodal.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/chat_template_multimodal.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/chat_template_multimodal.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/chat_template_multimodal.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/chat_template_multimodal.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/chat_template_multimodal.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/chat_template_multimodal.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/chat_template_multimodal.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/chat_template_multimodal.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/chat_template_multimodal.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/chat_template_multimodal.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* remove audio for now

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
This commit is contained in:
Raushan Turganbay
2025-02-14 10:32:14 +01:00
committed by GitHub
parent 3bf02cf440
commit 1931a35140
14 changed files with 1725 additions and 1470 deletions

View File

@@ -47,9 +47,19 @@ Adding these attributes means that LLaVA will try to infer the number of image t
The attributes can be obtained from model config, as `model.config.vision_config.patch_size` or `model.config.vision_feature_select_strategy`. The `num_additional_image_tokens` should be `1` if the vision backbone adds a CLS token or `0` if nothing extra is added to the vision patches.
### Single image inference
### Formatting Prompts with Chat Templates
Each **checkpoint** is trained with a specific prompt format, depending on the underlying large language model backbone. To ensure correct formatting, use the processors `apply_chat_template` method.
**Important:**
- You must construct a conversation history — passing a plain string won't work.
- Each message should be a dictionary with `"role"` and `"content"` keys.
- The `"content"` should be a list of dictionaries for different modalities like `"text"` and `"image"`.
Heres an example of how to structure your input.
We will use [llava-hf/llava-1.5-7b-hf](https://huggingface.co/llava-hf/llava-1.5-7b-hf) and a conversation history of text and image. Each content field has to be a list of dicts, as follows:
For best results, we recommend users to use the processor's `apply_chat_template()` method to format your prompt correctly. For that you need to construct a conversation history, passing in a plain string will not format your prompt. Each message in the conversation history for chat templates is a dictionary with keys "role" and "content". The "content" should be a list of dictionaries, for "text" and "image" modalities, as follows:
```python
from transformers import AutoProcessor
@@ -84,60 +94,6 @@ print(text_prompt)
>>> "USER: <image>\n<Whats shown in this image? ASSISTANT: This image shows a red stop sign.</s>USER: Describe the image in more details. ASSISTANT:"
```
### Batched inference
LLaVa also supports batched inference. Here is how you can do it:
```python
import requests
from PIL import Image
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration
# Load the model in half-precision
model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf", torch_dtype=torch.float16, device_map="auto")
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
# Get two different images
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image_stop = Image.open(requests.get(url, stream=True).raw)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image_cats = Image.open(requests.get(url, stream=True).raw)
# Prepare a batch of two prompts
conversation_1 = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "What is shown in this image?"},
],
},
]
conversation_2 = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "What is shown in this image?"},
],
},
]
prompt_1 = processor.apply_chat_template(conversation_1, add_generation_prompt=True)
prompt_2 = processor.apply_chat_template(conversation_2, add_generation_prompt=True)
prompts = [prompt_1, prompt_2]
# We can simply feed images in the order they have to be used in the text prompt
inputs = processor(images=[image_stop, image_cats], text=prompts, padding=True, return_tensors="pt").to(model.device, torch.float16)
# Generate
generate_ids = model.generate(**inputs, max_new_tokens=30)
processor.batch_decode(generate_ids, skip_special_tokens=True)
```
- If you want to construct a chat prompt yourself, below is a list of prompt formats accepted by each llava checkpoint:
[llava-interleave models](https://huggingface.co/collections/llava-hf/llava-interleave-668e19a97da0036aad4a2f19) requires the following format:
@@ -162,6 +118,96 @@ For multiple turns conversation:
"USER: <image>\n<prompt1> ASSISTANT: <answer1></s>USER: <prompt2> ASSISTANT: <answer2></s>USER: <prompt3> ASSISTANT:"
```
🚀 **Bonus:** If you're using `transformers>=4.49.0`, you can also get a vectorized output from `apply_chat_template`. See the **Usage Examples** below for more details on how to use it.
## Usage examples
### Single input inference
```python
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration
# Load the model in half-precision
model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf", torch_dtype=torch.float16, device_map="auto")
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
conversation = [
{
"role": "user",
"content": [
{"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
{"type": "text", "text": "What is shown in this image?"},
],
},
]
inputs = processor.apply_chat_template(
conversation,
add_generation_prompt=True,
tokenize=True,
return_dict=True,
return_tensors="pt"
).to(model.device, torch.float16)
# Generate
generate_ids = model.generate(**inputs, max_new_tokens=30)
processor.batch_decode(generate_ids, skip_special_tokens=True)
```
### Batched inference
LLaVa also supports batched inference. Here is how you can do it:
```python
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration
# Load the model in half-precision
model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf", torch_dtype=torch.float16, device_map="auto")
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
# Prepare a batch of two prompts
conversation_1 = [
{
"role": "user",
"content": [
{"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
{"type": "text", "text": "What is shown in this image?"},
],
},
]
conversation_2 = [
{
"role": "user",
"content": [
{"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
{"type": "text", "text": "What is shown in this image?"},
],
},
]
inputs = processor.apply_chat_template(
[conversation_1, conversation_2],
add_generation_prompt=True,
tokenize=True,
return_dict=True,
padding=True,
return_tensors="pt"
).to(model.device, torch.float16)
# Generate
generate_ids = model.generate(**inputs, max_new_tokens=30)
processor.batch_decode(generate_ids, skip_special_tokens=True)
```
## Note regarding reproducing original implementation
In order to match the logits of the [original implementation](https://github.com/haotian-liu/LLaVA/tree/main), one needs to additionally specify `do_pad=True` when instantiating `LLavaImageProcessor`:

View File

@@ -59,9 +59,17 @@ Adding these attributes means that LLaVA will try to infer the number of image t
The attributes can be obtained from model config, as `model.config.vision_config.patch_size` or `model.config.vision_feature_select_strategy`. The `num_additional_image_tokens` should be `1` if the vision backbone adds a CLS token or `0` if nothing extra is added to the vision patches.
- Note that each checkpoint has been trained with a specific prompt format, depending on which large language model (LLM) was used. You can use the processor's `apply_chat_template` to format your prompts correctly. For that you have to construct a conversation history, passing a plain string will not format your prompt. Each message in the conversation history for chat templates is a dictionary with keys "role" and "content". The "content" should be a list of dictionaries, for "text" and "image" modalities. Below is an example of how to do that and the list of formats accepted by each checkpoint.
### Formatting Prompts with Chat Templates
We will use [llava-v1.6-mistral-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf) and a conversation history of text and image. Each content field has to be a list of dicts, as follows:
Each **checkpoint** is trained with a specific prompt format, depending on the underlying large language model backbone. To ensure correct formatting, use the processors `apply_chat_template` method.
**Important:**
- You must construct a conversation history — passing a plain string won't work.
- Each message should be a dictionary with `"role"` and `"content"` keys.
- The `"content"` should be a list of dictionaries for different modalities like `"text"` and `"image"`.
Heres an example of how to structure your input. We will use [llava-v1.6-mistral-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf) and a conversation history of text and image.
```python
from transformers import LlavaNextProcessor
@@ -125,6 +133,10 @@ print(text_prompt)
"<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<image>\nWhat is shown in this image?<|im_end|>\n<|im_start|>assistant\n"
```
🚀 **Bonus:** If you're using `transformers>=4.49.0`, you can also get a vectorized output from `apply_chat_template`. See the **Usage Examples** below for more details on how to use it.
## Usage example
### Single image inference

View File

@@ -56,9 +56,17 @@ Adding these attributes means that LLaVA will try to infer the number of image t
The attributes can be obtained from model config, as `model.config.vision_config.patch_size` or `model.config.vision_feature_select_strategy`. The `num_additional_image_tokens` should be `1` if the vision backbone adds a CLS token or `0` if nothing extra is added to the vision patches.
- Note that each checkpoint has been trained with a specific prompt format, depending on which large language model (LLM) was used. You can use tokenizer's `apply_chat_template` to format your prompts correctly. Below is an example of how to do that.
### Formatting Prompts with Chat Templates
We will use [LLaVA-NeXT-Video-7B-hf](https://huggingface.co/llava-hf/LLaVA-NeXT-Video-7B-hf) and a conversation history of videos and images. Each content field has to be a list of dicts, as follows:
Each **checkpoint** is trained with a specific prompt format, depending on the underlying large language model backbone. To ensure correct formatting, use the processors `apply_chat_template` method.
**Important:**
- You must construct a conversation history — passing a plain string won't work.
- Each message should be a dictionary with `"role"` and `"content"` keys.
- The `"content"` should be a list of dictionaries for different modalities like `"text"` and `"image"`.
Heres an example of how to structure your input. We will use [LLaVA-NeXT-Video-7B-hf](https://huggingface.co/llava-hf/LLaVA-NeXT-Video-7B-hf) and a conversation history of videos and images.
```python
from transformers import LlavaNextVideoProcessor
@@ -99,6 +107,10 @@ text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=
print(text_prompt)
```
🚀 **Bonus:** If you're using `transformers>=4.49.0`, you can also get a vectorized output from `apply_chat_template`. See the **Usage Examples** below for more details on how to use it.
## Usage example
### Single Media Mode
@@ -106,41 +118,16 @@ print(text_prompt)
The model can accept both images and videos as input. Here's an example code for inference in half-precision (`torch.float16`):
```python
import av
from huggingface_hub import hf_hub_download
import torch
import numpy as np
from transformers import LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor
def read_video_pyav(container, indices):
'''
Decode the video with PyAV decoder.
Args:
container (`av.container.input.InputContainer`): PyAV container.
indices (`List[int]`): List of frame indices to decode.
Returns:
result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
'''
frames = []
container.seek(0)
start_index = indices[0]
end_index = indices[-1]
for i, frame in enumerate(container.decode(video=0)):
if i > end_index:
break
if i >= start_index and i in indices:
frames.append(frame)
return np.stack([x.to_ndarray(format="rgb24") for x in frames])
# Load the model in half-precision
model = LlavaNextVideoForConditionalGeneration.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf", torch_dtype=torch.float16, device_map="auto")
processor = LlavaNextVideoProcessor.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf")
# Load the video as an np.array, sampling uniformly 8 frames (can sample more for longer videos)
video_path = hf_hub_download(repo_id="raushan-testing-hf/videos-test", filename="sample_demo_1.mp4", repo_type="dataset")
container = av.open(video_path)
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
video = read_video_pyav(container, indices)
conversation = [
{
@@ -148,13 +135,12 @@ conversation = [
"role": "user",
"content": [
{"type": "text", "text": "Why is this video funny?"},
{"type": "video"},
{"type": "video", "path": video_path},
],
},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=prompt, videos=video, return_tensors="pt")
inputs = processor.apply_chat_template(conversation, num_frames=8, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=60)
processor.batch_decode(out, skip_special_tokens=True, clean_up_tokenization_spaces=True)
@@ -166,20 +152,15 @@ processor.batch_decode(out, skip_special_tokens=True, clean_up_tokenization_spac
The model can also generate from an interleaved image-video inputs. However note, that it was not trained in interleaved image-video setting which might affect the performance. Below is an example usage for mixed media input, add the following lines to the above code snippet:
```python
from PIL import Image
import requests
# Generate from image and video mixed inputs
# Load and image and write a new prompt
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
conversation = [
{
"role": "user",
"content": [
{"type": "text", "text": "How many cats are there in the image?"},
{"type": "image"},
{"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
],
},
{
@@ -192,12 +173,11 @@ conversation = [
"role": "user",
"content": [
{"type": "text", "text": "Why is this video funny?"},
{"type": "video"},
{"type": "video", "path": video_path},
],
},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, videos=clip, padding=True, return_tensors="pt")
inputs = processor.apply_chat_template(conversation, num_frames=8, add_generation_prompt=True, tokenize=True, return_dict=True, padding=True, return_tensors="pt")
# Generate
generate_ids = model.generate(**inputs, max_length=50)

View File

@@ -47,8 +47,18 @@ Tips:
</Tip>
- Note that the model should use a specific prompt format, on which the large language model (LLM) was trained. You can use the processor's `apply_chat_template` to format your prompts correctly. For that you have to construct a conversation history, passing a plain string will not format your prompt. Each message in the conversation history for chat templates is a dictionary with keys "role" and "content". The "content" should be a list of dictionaries, for "text" and "image" modalities.
### Formatting Prompts with Chat Templates
Each **checkpoint** is trained with a specific prompt format, depending on the underlying large language model backbone. To ensure correct formatting, use the processors `apply_chat_template` method.
**Important:**
- You must construct a conversation history — passing a plain string won't work.
- Each message should be a dictionary with `"role"` and `"content"` keys.
- The `"content"` should be a list of dictionaries for different modalities like `"text"` and `"image"`.
Heres an example of how to structure your input.
We will use [llava-onevision-qwen2-7b-si-hf](https://huggingface.co/llava-hf/llava-onevision-qwen2-7b-si-hf) and a conversation history of text and image. Each content field has to be a list of dicts, as follows:
```python
@@ -84,6 +94,9 @@ print(text_prompt)
'<|im_start|>user\n<image>What is shown in this image?<|im_end|>\n<|im_start|>assistant\nPage showing the list of options.<|im_end|>'
```
🚀 **Bonus:** If you're using `transformers>=4.49.0`, you can also get a vectorized output from `apply_chat_template`. See the **Usage Examples** below for more details on how to use it.
This model was contributed by [RaushanTurganbay](https://huggingface.co/RaushanTurganbay).
The original code can be found [here](https://github.com/LLaVA-VL/LLaVA-NeXT/tree/main).
@@ -97,28 +110,28 @@ Here's how to load the model and perform inference in half-precision (`torch.flo
```python
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration
import torch
from PIL import Image
import requests
processor = AutoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf")
model = LlavaOnevisionForConditionalGeneration.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf", torch_dtype=torch.float16, low_cpu_mem_usage=True)
model.to("cuda:0")
processor = AutoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf")
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
"llava-hf/llava-onevision-qwen2-7b-ov-hf",
torch_dtype=torch.float16,
low_cpu_mem_usage=True,
device_map="cuda:0"
)
# prepare image and text prompt, using the appropriate prompt template
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)
conversation = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "image", "url": url},
{"type": "text", "text": "What is shown in this image?"},
],
},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda:0", torch.float16)
inputs = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt")
inputs = inputs.to("cuda:0", torch.float16)
# autoregressively complete prompt
output = model.generate(**inputs, max_new_tokens=100)
@@ -140,22 +153,12 @@ from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration
model = LlavaOnevisionForConditionalGeneration.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf", torch_dtype=torch.float16, device_map="auto")
processor = AutoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf")
# Get three different images
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image_stop = Image.open(requests.get(url, stream=True).raw)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image_cats = Image.open(requests.get(url, stream=True).raw)
url = "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg"
image_snowman = Image.open(requests.get(url, stream=True).raw)
# Prepare a batch of two prompts, where the first one is a multi-turn conversation and the second is not
conversation_1 = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
{"type": "text", "text": "What is shown in this image?"},
],
},
@@ -168,7 +171,7 @@ conversation_1 = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
{"type": "text", "text": "What about this image? How many cats do you see?"},
],
},
@@ -178,18 +181,20 @@ conversation_2 = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "image", "url": "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg"},
{"type": "text", "text": "What is shown in this image?"},
],
},
]
prompt_1 = processor.apply_chat_template(conversation_1, add_generation_prompt=True)
prompt_2 = processor.apply_chat_template(conversation_2, add_generation_prompt=True)
prompts = [prompt_1, prompt_2]
# We can simply feed images in the order they have to be used in the text prompt
inputs = processor(images=[image_stop, image_cats, image_snowman], text=prompts, padding=True, return_tensors="pt").to(model.device, torch.float16)
inputs = processor.apply_chat_template(
[conversation_1, conversation_2],
add_generation_prompt=True,
tokenize=True,
return_dict=True,
padding=True,
return_tensors="pt"
).to(model.device, torch.float16)
# Generate
generate_ids = model.generate(**inputs, max_new_tokens=30)
@@ -202,10 +207,7 @@ processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokeniza
LLaVa-OneVision also can perform inference with videos as input, where video frames are treated as multiple images. Here is how you can do it:
```python
import av
import numpy as np
from huggingface_hub import hf_hub_download
import torch
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration
@@ -213,48 +215,26 @@ from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration
model = LlavaOnevisionForConditionalGeneration.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf", torch_dtype=torch.float16, device_map="auto")
processor = AutoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf")
def read_video_pyav(container, indices):
'''
Decode the video with PyAV decoder.
Args:
container (`av.container.input.InputContainer`): PyAV container.
indices (`List[int]`): List of frame indices to decode.
Returns:
result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
'''
frames = []
container.seek(0)
start_index = indices[0]
end_index = indices[-1]
for i, frame in enumerate(container.decode(video=0)):
if i > end_index:
break
if i >= start_index and i in indices:
frames.append(frame)
return np.stack([x.to_ndarray(format="rgb24") for x in frames])
# Load the video as an np.array, sampling uniformly 8 frames (can sample more for longer videos, up to 32 frames)
video_path = hf_hub_download(repo_id="raushan-testing-hf/videos-test", filename="sample_demo_1.mp4", repo_type="dataset")
container = av.open(video_path)
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
video = read_video_pyav(container, indices)
# For videos we have to feed a "video" type instead of "image"
conversation = [
{
"role": "user",
"content": [
{"type": "video"},
{"type": "video", "path": video_path},
{"type": "text", "text": "Why is this video funny?"},
],
},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(videos=list(video), text=prompt, return_tensors="pt").to("cuda:0", torch.float16)
inputs = processor.apply_chat_template(
conversation,
num_frames=8
add_generation_prompt=True,
tokenize=True,
return_dict=True,
return_tensors="pt"
).to(model.device, torch.float16)
out = model.generate(**inputs, max_new_tokens=60)
processor.batch_decode(out, skip_special_tokens=True, clean_up_tokenization_spaces=True)

View File

@@ -28,7 +28,8 @@ The Llama 3.2-Vision collection of multimodal large language models (LLMs) is a
- For text-only inputs use `MllamaForCausalLM` for generation to avoid loading vision tower.
- Each sample can contain multiple images, and the number of images can vary between samples. The processor will pad the inputs to the maximum number of images across samples and to a maximum number of tiles within each image.
- The text passed to the processor should have the `"<|image|>"` tokens where the images should be inserted.
- The processor has its own `apply_chat_template` method to convert chat messages to text that can then be passed as text to the processor.
- The processor has its own `apply_chat_template` method to convert chat messages to text that can then be passed as text to the processor. If you're using `transformers>=4.49.0`, you can also get a vectorized output from `apply_chat_template`. See the **Usage Examples** below for more details on how to use it.
<Tip warning={true}>
@@ -53,9 +54,7 @@ model.set_output_embeddings(resized_embeddings)
#### Instruct model
```python
import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor
model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
@@ -67,18 +66,13 @@ messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
{"type": "text", "text": "What does the image show?"}
]
}
],
]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
url = "https://llava-vl.github.io/static/images/view.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=text, images=image, return_tensors="pt").to(model.device)
inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=25)
print(processor.decode(output[0]))
```

View File

@@ -38,38 +38,42 @@ Tips:
```
"<s>[INST][IMG]\nWhat are the things I should be cautious about when I visit this place?[/INST]"
```
Then, the processor will replace each `[IMG]` token with a number of `[IMG]` tokens that depend on the height and the width of each image. Each *row* of the image is separated by an `[IMG_BREAK]` token, and each image is separated by an `[IMG_END]` token. It's advised to use the `apply_chat_template` method of the processor, which takes care of all of this. See the [usage section](#usage) for more info.
Then, the processor will replace each `[IMG]` token with a number of `[IMG]` tokens that depend on the height and the width of each image. Each *row* of the image is separated by an `[IMG_BREAK]` token, and each image is separated by an `[IMG_END]` token. It's advised to use the `apply_chat_template` method of the processor, which takes care of all of this and formats the text for you. If you're using `transformers>=4.49.0`, you can also get a vectorized output from `apply_chat_template`. See the [usage section](#usage) for more info.
This model was contributed by [amyeroberts](https://huggingface.co/amyeroberts) and [ArthurZ](https://huggingface.co/ArthurZ). The original code can be found [here](https://github.com/vllm-project/vllm/pull/8377).
## Usage
At inference time, it's advised to use the processor's `apply_chat_template` method, which correctly formats the prompt for the model:
```python
from transformers import AutoProcessor, LlavaForConditionalGeneration
from PIL import Image
model_id = "mistral-community/pixtral-12b"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id).to("cuda")
url_dog = "https://picsum.photos/id/237/200/300"
url_mountain = "https://picsum.photos/seed/picsum/200/300"
model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="cuda")
chat = [
{
"role": "user", "content": [
{"type": "text", "content": "Can this animal"},
{"type": "image"},
{"type": "image", "ur": "https://picsum.photos/id/237/200/300"},
{"type": "text", "content": "live here?"},
{"type": "image"}
{"type": "image", "url": "https://picsum.photos/seed/picsum/200/300"}
]
}
]
prompt = processor.apply_chat_template(chat)
inputs = processor(text=prompt, images=[url_dog, url_mountain], return_tensors="pt").to(model.device)
inputs = processor.apply_chat_template(
chat,
add_generation_prompt=True,
tokenize=True,
return_dict=True,
return_tensors="pt"
).to(model.device)
generate_ids = model.generate(**inputs, max_new_tokens=500)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
```

View File

@@ -32,21 +32,13 @@ The model can accept both images and videos as input. Here's an example code for
```python
from PIL import Image
import requests
import torch
from torchvision import io
from typing import Dict
from transformers.image_utils import load_images, load_video
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
# Load the model in half-precision on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", device_map="auto")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
# Image
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
conversation = [
{
@@ -54,6 +46,7 @@ conversation = [
"content":[
{
"type":"image",
"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
},
{
"type":"text",
@@ -63,13 +56,14 @@ conversation = [
}
]
inputs = processor.apply_chat_template(
conversation,
add_generation_prompt=True,
tokenize=True,
return_dict=True,
return_tensors="pt"
).to(model.device)
# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Excepted output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'
inputs = processor(text=[text_prompt], images=[image], padding=True, return_tensors="pt")
inputs = inputs.to('cuda')
# Inference: Generation of the output
output_ids = model.generate(**inputs, max_new_tokens=128)
@@ -78,25 +72,24 @@ output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, cl
print(output_text)
# Video
video = load_video(video="/path/to/video.mp4")
conversation = [
{
"role": "user",
"content": [
{"type": "video"},
{"type": "video", "path": "/path/to/video.mp4"},
{"type": "text", "text": "What happened in the video?"},
],
}
]
# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Excepted output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|video_pad|><|vision_end|>What happened in the video?<|im_end|>\n<|im_start|>assistant\n'
# Qwen2.5VL modifies the time positional encoding (MRoPE) according to the video's frame rate (FPS).
# Therefore, the video's FPS information needs to be provided as input.
inputs = processor(text=[text_prompt], videos=[video], fps=[1.0], padding=True, return_tensors="pt")
inputs = inputs.to('cuda')
inputs = processor.apply_chat_template(
conversation,
video_fps=1,
add_generation_prompt=True,
tokenize=True,
return_dict=True,
return_tensors="pt"
).to(model.device)
# Inference: Generation of the output
output_ids = model.generate(**inputs, max_new_tokens=128)
@@ -110,21 +103,12 @@ print(output_text)
The model can batch inputs composed of mixed samples of various types such as images, videos, and text. Here is an example.
```python
images = load_images([
"/path/to/image1.jpg",
"/path/to/image2.jpg",
"/path/to/image3.jpg",
"/path/to/image4.jpg",
"/path/to/image5.jpg",
])
video = load_video(video="/path/to/video.mp4")
# Conversation for the first image
conversation1 = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "image", "path": "/path/to/image1.jpg"},
{"type": "text", "text": "Describe this image."}
]
}
@@ -135,8 +119,8 @@ conversation2 = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "image"},
{"type": "image", "path": "/path/to/image2.jpg"},
{"type": "image", "path": "/path/to/image3.jpg"},
{"type": "text", "text": "What is written in the pictures?"}
]
}
@@ -156,9 +140,9 @@ conversation4 = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "image"},
{"type": "video"},
{"type": "image", "path": "/path/to/image3.jpg"},
{"type": "image", "path": "/path/to/image4.jpg"},
{"type": "video", "path": "/path/to/video.jpg"},
{"type": "text", "text": "What are the common elements in these medias?"},
],
}
@@ -166,15 +150,15 @@ conversation4 = [
conversations = [conversation1, conversation2, conversation3, conversation4]
# Preparation for batch inference
texts = [processor.apply_chat_template(msg, add_generation_prompt=True) for msg in conversations]
inputs = processor(
text=texts,
images=images,
videos=[video],
padding=True,
return_tensors="pt",
)
inputs = inputs.to('cuda')
ipnuts = processor.apply_chat_template(
conversations,
video_fps=1,
add_generation_prompt=True,
tokenize=True,
return_dict=True,
return_tensors="pt"
).to(model.device)
# Batch Inference
output_ids = model.generate(**inputs, max_new_tokens=128)

View File

@@ -39,20 +39,13 @@ The model can accept both images and videos as input. Here's an example code for
```python
from PIL import Image
import requests
import torch
from torchvision import io
from typing import Dict
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
# Load the model in half-precision on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", device_map="auto")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
# Image
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
conversation = [
{
@@ -60,6 +53,7 @@ conversation = [
"content":[
{
"type":"image",
"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
},
{
"type":"text",
@@ -69,13 +63,13 @@ conversation = [
}
]
# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Excepted output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'
inputs = processor(text=[text_prompt], images=[image], padding=True, return_tensors="pt")
inputs = inputs.to('cuda')
inputs = processor.apply_chat_template(
conversation,
add_generation_prompt=True,
tokenize=True,
return_dict=True,
return_tensors="pt"
).to(model.device)
# Inference: Generation of the output
output_ids = model.generate(**inputs, max_new_tokens=128)
@@ -83,50 +77,28 @@ generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(in
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
print(output_text)
# Video
def fetch_video(ele: Dict, nframe_factor=2):
if isinstance(ele['video'], str):
def round_by_factor(number: int, factor: int) -> int:
return round(number / factor) * factor
video = ele["video"]
if video.startswith("file://"):
video = video[7:]
video, _, info = io.read_video(
video,
start_pts=ele.get("video_start", 0.0),
end_pts=ele.get("video_end", None),
pts_unit="sec",
output_format="TCHW",
)
assert not ("fps" in ele and "nframes" in ele), "Only accept either `fps` or `nframes`"
if "nframes" in ele:
nframes = round_by_factor(ele["nframes"], nframe_factor)
else:
fps = ele.get("fps", 1.0)
nframes = round_by_factor(video.size(0) / info["video_fps"] * fps, nframe_factor)
idx = torch.linspace(0, video.size(0) - 1, nframes, dtype=torch.int64)
return video[idx]
video_info = {"type": "video", "video": "/path/to/video.mp4", "fps": 1.0}
video = fetch_video(video_info)
conversation = [
{
"role": "user",
"content": [
{"type": "video"},
{"type": "video", "path": "/path/to/video.mp4"},
{"type": "text", "text": "What happened in the video?"},
],
}
]
# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Excepted output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|video_pad|><|vision_end|>What happened in the video?<|im_end|>\n<|im_start|>assistant\n'
inputs = processor.apply_chat_template(
conversation,
video_fps=1,
add_generation_prompt=True,
tokenize=True,
return_dict=True,
return_tensors="pt"
).to(model.device)
inputs = processor(text=[text_prompt], videos=[video], padding=True, return_tensors="pt")
inputs = inputs.to('cuda')
# Inference: Generation of the output
output_ids = model.generate(**inputs, max_new_tokens=128)
@@ -140,23 +112,13 @@ print(output_text)
The model can batch inputs composed of mixed samples of various types such as images, videos, and text. Here is an example.
```python
image1 = Image.open("/path/to/image1.jpg")
image2 = Image.open("/path/to/image2.jpg")
image3 = Image.open("/path/to/image3.jpg")
image4 = Image.open("/path/to/image4.jpg")
image5 = Image.open("/path/to/image5.jpg")
video = fetch_video({
"type": "video",
"video": "/path/to/video.mp4",
"fps": 1.0
})
# Conversation for the first image
conversation1 = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "image", "path": "/path/to/image1.jpg"},
{"type": "text", "text": "Describe this image."}
]
}
@@ -167,8 +129,8 @@ conversation2 = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "image"},
{"type": "image", "path": "/path/to/image2.jpg"},
{"type": "image", "path": "/path/to/image3.jpg"},
{"type": "text", "text": "What is written in the pictures?"}
]
}
@@ -188,9 +150,9 @@ conversation4 = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "image"},
{"type": "video"},
{"type": "image", "path": "/path/to/image3.jpg"},
{"type": "image", "path": "/path/to/image4.jpg"},
{"type": "video", "path": "/path/to/video.jpg"},
{"type": "text", "text": "What are the common elements in these medias?"},
],
}
@@ -198,15 +160,15 @@ conversation4 = [
conversations = [conversation1, conversation2, conversation3, conversation4]
# Preparation for batch inference
texts = [processor.apply_chat_template(msg, add_generation_prompt=True) for msg in conversations]
inputs = processor(
text=texts,
images=[image1, image2, image3, image4, image5],
videos=[video],
padding=True,
return_tensors="pt",
)
inputs = inputs.to('cuda')
ipnuts = processor.apply_chat_template(
conversations,
video_fps=1,
add_generation_prompt=True,
tokenize=True,
return_dict=True,
return_tensors="pt"
).to(model.device)
# Batch Inference
output_ids = model.generate(**inputs, max_new_tokens=128)
@@ -236,6 +198,7 @@ processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", min_pixel
```
This ensures each image gets encoded using a number between 256-1024 tokens. The 28 comes from the fact that the model uses a patch size of 14 and a temporal patch size of 2 (14 x 2 = 28).
#### Multiple Image Inputs
By default, images and video content are directly included in the conversation. When handling multiple images, it's helpful to add labels to the images and videos for better reference. Users can control this behavior with the following settings: