Add support for including in-memory videos (not just files/urls) in apply_chat_template (#39494)

* added code for handling video object ,as dictionary of frames and metadata, in chat template

* added new test where videos are passed as objects (dict of frames, metadata) in the chat template

* modified hardcoded video_len check that does not match with increased number of tests cases.

* Modify hardcoded video_len check that fails with increased number of tests

* update documentation of multi-modal chat templating with extra information about including video object in chat template.

* add array handling in load_video()

* temporary test video inlcuded

* skip testing smolvlm with videos that are list of frames

* update documentation & make fixup

* Address review comments
This commit is contained in:
Akib Jawad
2025-08-04 02:49:42 -07:00
committed by GitHub
parent 0d511f7a77
commit 2a9febd632
9 changed files with 106 additions and 16 deletions

View File

@@ -111,6 +111,7 @@ Some vision models also support video inputs. The message format is very similar
- The content `"type"` should be `"video"` to indicate the content is a video.
- For videos, it can be a link to the video (`"url"`) or it could be a file path (`"path"`). Videos loaded from a URL can only be decoded with [PyAV](https://pyav.basswood-io.com/docs/stable/) or [Decord](https://github.com/dmlc/decord).
- In addition to loading videos from a URL or file path, you can also pass decoded video data directly. This is useful if youve already preprocessed or decoded video frames elsewhere in memory (e.g., using OpenCV, decord, or torchvision). You don't need to save to files or store it in an URL.
> [!WARNING]
> Loading a video from `"url"` is only supported by the PyAV or Decord backends.
@@ -137,6 +138,52 @@ messages = [
]
```
### Example: Passing decoded video objects
```python
import numpy as np
video_object1 = np.random.randint(0, 255, size=(16, 224, 224, 3), dtype=np.uint8),
messages = [
{
"role": "system",
"content": [{"type": "text", "text": "You are a friendly chatbot who always responds in the style of a pirate"}],
},
{
"role": "user",
"content": [
{"type": "video", "video": video_object1},
{"type": "text", "text": "What do you see in this video?"}
],
},
]
```
You can also use existing (`"load_video()"`) function to load a video, edit the video in memory and pass it in the messages.
```python
# Make sure a video backend library (pyav, decord, or torchvision) is available.
from transformers.video_utils import load_video
# load a video file in memory for testing
video_object2, _ = load_video(
"https://test-videos.co.uk/vids/bigbuckbunny/mp4/h264/720/Big_Buck_Bunny_720_10s_10MB.mp4"
)
messages = [
{
"role": "system",
"content": [{"type": "text", "text": "You are a friendly chatbot who always responds in the style of a pirate"}],
},
{
"role": "user",
"content": [
{"type": "video", "video": video_object2},
{"type": "text", "text": "What do you see in this video?"}
],
},
]
```
Pass `messages` to [`~ProcessorMixin.apply_chat_template`] to tokenize the input content. There are a few extra parameters to include in [`~ProcessorMixin.apply_chat_template`] that controls the sampling process.
The `video_load_backend` parameter refers to a specific framework to load a video. It supports [PyAV](https://pyav.basswood-io.com/docs/stable/), [Decord](https://github.com/dmlc/decord), [OpenCV](https://github.com/opencv/opencv), and [torchvision](https://pytorch.org/vision/stable/index.html).