* update the glm4 model readme

* update test

* update GLM-4.1V model

* update as format

* update

* fix some tests

* fix the rest

* fix on a10, not t4

* nit: dummy import

---------

Co-authored-by: raushan <raushan@huggingface.co>
Author: Yuxuan Zhang
Date: 2025-07-08 14:22:04 +08:00
Committed by GitHub
Parent: bbca9782ca
Commit: 17b3c96c00
5 changed files with 154 additions and 76 deletions


@@ -18,7 +18,37 @@ rendered properly in your Markdown viewer.
## Overview ## Overview
To be released with the official model launch. The GLM family welcomes new members [GLM-4-0414](https://arxiv.org/pdf/2406.12793) series models.
The **GLM-4-32B-0414** series models feature 32 billion parameters. Their performance is comparable to OpenAI's GPT
series and DeepSeek's V3/R1 series. They also support very user-friendly local deployment features. GLM-4-32B-Base-0414
was pre-trained on 15T of high-quality data, including substantial reasoning-type synthetic data. This lays the
foundation for subsequent reinforcement learning extensions. In the post-training stage, we employed human preference
alignment for dialogue scenarios. Additionally, using techniques like rejection sampling and reinforcement learning, we
enhanced the model's performance in instruction following, engineering code, and function calling, thus strengthening
the atomic capabilities required for agent tasks. GLM-4-32B-0414 achieves good results in engineering code, Artifact
generation, function calling, search-based Q&A, and report generation. In particular, on several benchmarks, such as
code generation or specific Q&A tasks, GLM-4-32B-Base-0414 achieves performance comparable to that of larger models like
GPT-4o and DeepSeek-V3-0324 (671B).
**GLM-Z1-32B-0414** is a reasoning model with deep thinking capabilities. This was developed based on GLM-4-32B-0414
through cold start, extended reinforcement learning, and further training on tasks including mathematics, code, and
logic. Compared to the base model, GLM-Z1-32B-0414 significantly improves mathematical abilities and the capability to
solve complex tasks. During training, we also introduced general reinforcement learning based on pairwise ranking
feedback, which enhances the model's general capabilities.
**GLM-Z1-Rumination-32B-0414** is a deep reasoning model with rumination capabilities (against OpenAI's Deep Research).
Unlike typical deep thinking models, the rumination model is capable of deeper and longer thinking to solve more
open-ended and complex problems (e.g., writing a comparative analysis of AI development in two cities and their future
development plans). Z1-Rumination is trained through scaling end-to-end reinforcement learning with responses graded by
the ground truth answers or rubrics and can make use of search tools during its deep thinking process to handle complex
tasks. The model shows significant improvements in research-style writing and complex tasks.
Finally, **GLM-Z1-9B-0414** is a surprise. We employed all the aforementioned techniques to train a small model (9B).
GLM-Z1-9B-0414 exhibits excellent capabilities in mathematical reasoning and general tasks. Its overall performance is
top-ranked among all open-source models of the same size. Especially in resource-constrained scenarios, this model
achieves an excellent balance between efficiency and effectiveness, providing a powerful option for users seeking
lightweight deployment.
## Glm4Config ## Glm4Config


@@ -23,6 +23,29 @@ rendered properly in your Markdown viewer.
# GLM-4.1V # GLM-4.1V
## Overview
**GLM-4.1V-9B-Thinking** is a bilingual vision-language model optimized for reasoning, built on GLM-4-9B. It introduces
a "thinking paradigm" with reinforcement learning, achieving state-of-the-art results among 10B-class models and
rivaling 72B-scale models. It supports 64k context, 4K resolution, and arbitrary aspect ratios, with an open-source base
model for further research. You can check our paper [here](https://huggingface.co/papers/2507.01006). The abstract is reproduced below.
*We present GLM-4.1V-Thinking, a vision-language model (VLM) designed to advance general-purpose multimodal understanding
and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework.
We first develop a capable vision foundation model with significant potential through large-scale pre-training, which
arguably sets the upper bound for the final performance. We then propose Reinforcement Learning with Curriculum
Sampling (RLCS) to unlock the full potential of the model, leading to comprehensive capability enhancement across a
diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding,
GUI-based agents, and long document understanding. We open-source GLM-4.1V-9B-Thinking, which achieves state-of-the-art
performance among models of comparable size. In a comprehensive evaluation across 28 public benchmarks, our model
outperforms Qwen2.5-VL-7B on nearly all tasks and achieves comparable or even superior performance on 18 benchmarks
relative to the significantly larger Qwen2.5-VL-72B. Notably, GLM-4.1V-9B-Thinking also demonstrates competitive or
superior performance compared to closed-source models such as GPT-4o on challenging tasks including long document
understanding and STEM reasoning, further underscoring its strong capabilities. Code, models and more information
are released at https://github.com/THUDM/GLM-4.1V-Thinking.*
## Usage
The example below demonstrates how to generate text based on an image with [`Pipeline`] or the [`AutoModel`] class. The example below demonstrates how to generate text based on an image with [`Pipeline`] or the [`AutoModel`] class.
<hfoptions id="usage"> <hfoptions id="usage">


@@ -173,7 +173,10 @@ class Glm4vVideoProcessor(BaseVideoProcessor):
timestamps_list.append(timestamps) timestamps_list.append(timestamps)
processed_videos.append(video) processed_videos.append(video)
else: else:
raise AssertionError("Must set `do_sample_frames=True` to sample frames from GLM-4.1V Model.") # Assume 24 fps by default and prepare timestamps for the whole video when all frames are sampled
processed_videos = videos
timestamps_list = [[idx // 24 for idx in range(len(video))] for video in videos]
timestamps_list = timestamps_list[::2] # mrope
grouped_videos, grouped_videos_index = group_videos_by_shape(processed_videos) grouped_videos, grouped_videos_index = group_videos_by_shape(processed_videos)
resized_videos_grouped = {} resized_videos_grouped = {}
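
The `else` branch added in this hunk assigns a timestamp to every frame when no frames are sampled, assuming 24 fps. The arithmetic can be sketched in plain Python as below. This is only an illustration under stated assumptions: `default_timestamps` and `ASSUMED_FPS` are hypothetical names that do not exist in `Glm4vVideoProcessor`, and videos are modeled as plain lists of frames rather than tensors.

```python
# Hypothetical helper mirroring the list comprehension in the hunk above.
# It is not part of Glm4vVideoProcessor; videos are modeled as plain lists.
ASSUMED_FPS = 24  # the hunk hard-codes 24 fps when frames are not sampled


def default_timestamps(num_frames: int, fps: int = ASSUMED_FPS) -> list[int]:
    """Return the whole-second timestamp for every frame index."""
    return [idx // fps for idx in range(num_frames)]


# Two toy "videos" with 50 and 30 frames.
videos = [list(range(50)), list(range(30))]
timestamps_list = [default_timestamps(len(video)) for video in videos]

# Frames 0..23 map to second 0, frames 24..47 to second 1, and so on.
first_video = timestamps_list[0]

# Slicing the outer list with [::2] keeps every second *video* entry,
# while slicing each inner list keeps every second *frame* timestamp.
outer_sliced = timestamps_list[::2]
inner_sliced = [ts[::2] for ts in timestamps_list]
```

With these toy inputs, `outer_sliced` holds a single entry (the 50 timestamps of the first video), while `inner_sliced` holds two entries of 25 and 15 timestamps. The hunk as shown applies `[::2]` to the outer list; the diff does not say which of the two behaviours is intended.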


@@ -16,15 +16,12 @@
import gc import gc
import unittest import unittest
import requests
from transformers import ( from transformers import (
AutoProcessor, AutoProcessor,
Glm4vConfig, Glm4vConfig,
Glm4vForConditionalGeneration, Glm4vForConditionalGeneration,
Glm4vModel, Glm4vModel,
is_torch_available, is_torch_available,
is_vision_available,
) )
from transformers.testing_utils import ( from transformers.testing_utils import (
require_flash_attn, require_flash_attn,
@@ -47,10 +44,6 @@ if is_torch_available():
import torch import torch
if is_vision_available():
from PIL import Image
class Glm4vVisionText2TextModelTester: class Glm4vVisionText2TextModelTester:
def __init__( def __init__(
self, self,
@@ -177,6 +170,8 @@ class Glm4vModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase)
all_model_classes = (Glm4vModel, Glm4vForConditionalGeneration) if is_torch_available() else () all_model_classes = (Glm4vModel, Glm4vForConditionalGeneration) if is_torch_available() else ()
test_pruning = False test_pruning = False
test_head_masking = False test_head_masking = False
test_torchscript = False
model_split_percents = [0.7, 0.9] # model too big to split at 0.5
_is_composite = True _is_composite = True
def setUp(self): def setUp(self):
@@ -264,22 +259,34 @@ class Glm4vModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase)
torch.testing.assert_close(out_embeds, out_ids) torch.testing.assert_close(out_embeds, out_ids)
@unittest.skip("Model checkpoint not yet released")
@require_torch @require_torch
class Glm4vIntegrationTest(unittest.TestCase): class Glm4vIntegrationTest(unittest.TestCase):
def setUp(self): def setUp(self):
self.processor = AutoProcessor.from_pretrained("z") self.processor = AutoProcessor.from_pretrained("THUDM/GLM-4.1V-9B-Thinking")
self.messages = [ self.message = [
{ {
"role": "user", "role": "user",
"content": [ "content": [
{"type": "image"}, {
"type": "image",
"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
},
{"type": "text", "text": "What kind of dog is this?"},
],
}
]
self.message2 = [
{
"role": "user",
"content": [
{
"type": "image",
"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/coco_sample.png",
},
{"type": "text", "text": "What kind of dog is this?"}, {"type": "text", "text": "What kind of dog is this?"},
], ],
} }
] ]
url = "https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/Qwen2-VL/demo_small.jpg"
self.image = Image.open(requests.get(url, stream=True).raw)
def tearDown(self): def tearDown(self):
gc.collect() gc.collect()
@@ -291,20 +298,20 @@ class Glm4vIntegrationTest(unittest.TestCase):
"THUDM/GLM-4.1V-9B-Thinking", torch_dtype="auto", device_map="auto" "THUDM/GLM-4.1V-9B-Thinking", torch_dtype="auto", device_map="auto"
) )
text = self.processor.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True) inputs = self.processor.apply_chat_template(
inputs = self.processor(text=[text], images=[self.image], return_tensors="pt") self.message, tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt"
)
expected_input_ids = [151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 151652, 151655, 151655] # fmt: skip expected_input_ids = [151331, 151333, 151336, 198, 151339, 151343, 151343, 151343, 151343, 151343, 151343, 151343, 151343, 151343, 151343, 151343, 151343] # fmt: skip
assert expected_input_ids == inputs.input_ids[0].tolist()[:17] assert expected_input_ids == inputs.input_ids[0].tolist()[:17]
expected_pixel_slice = torch.tensor( expected_pixel_slice = torch.tensor(
[ [
[0.8792, 0.8792, 0.9084], [-0.0988, -0.0842, -0.0842],
[1.1858, 1.1858, 1.2296], [-0.5660, -0.5514, -0.4200],
[1.2004, 1.2004, 1.2150], [-0.0259, -0.0259, -0.0259],
[1.4340, 1.4340, 1.4194], [-0.1280, -0.0988, -0.2010],
[1.3902, 1.4048, 1.4194], [-0.4638, -0.5806, -0.6974],
[1.5216, 1.5362, 1.5362], [-1.2083, -1.2229, -1.2083],
], ],
dtype=torch.float32, dtype=torch.float32,
device="cpu", device="cpu",
@@ -315,8 +322,7 @@ class Glm4vIntegrationTest(unittest.TestCase):
inputs = inputs.to(torch_device) inputs = inputs.to(torch_device)
output = model.generate(**inputs, max_new_tokens=30) output = model.generate(**inputs, max_new_tokens=30)
EXPECTED_DECODED_TEXT = "system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices" EXPECTED_DECODED_TEXT = "\nWhat kind of dog is this?\n<think>Got it, let's look at the image. The animal in the picture is not a dog; it's a cat. Specifically, it looks"
self.assertEqual( self.assertEqual(
self.processor.decode(output[0], skip_special_tokens=True), self.processor.decode(output[0], skip_special_tokens=True),
EXPECTED_DECODED_TEXT, EXPECTED_DECODED_TEXT,
@@ -327,17 +333,17 @@ class Glm4vIntegrationTest(unittest.TestCase):
model = Glm4vForConditionalGeneration.from_pretrained( model = Glm4vForConditionalGeneration.from_pretrained(
"THUDM/GLM-4.1V-9B-Thinking", torch_dtype="auto", device_map="auto" "THUDM/GLM-4.1V-9B-Thinking", torch_dtype="auto", device_map="auto"
) )
text = self.processor.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True) batch_messages = [self.message] * 2
inputs = self.processor(text=[text, text], images=[self.image, self.image], return_tensors="pt").to( inputs = self.processor.apply_chat_template(
torch_device batch_messages, tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt"
) ).to(torch_device)
# it should not matter whether two images are the same size or not # it should not matter whether two images are the same size or not
output = model.generate(**inputs, max_new_tokens=30) output = model.generate(**inputs, max_new_tokens=30)
EXPECTED_DECODED_TEXT = [ EXPECTED_DECODED_TEXT = [
'system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices', "\nWhat kind of dog is this?\n<think>Got it, let's look at the image. The animal in the picture is not a dog; it's a cat. Specifically, it looks",
'system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices', "\nWhat kind of dog is this?\n<think>Got it, let's look at the image. The animal in the picture is not a dog; it's a cat. Specifically, it looks"
] # fmt: skip ] # fmt: skip
self.assertEqual( self.assertEqual(
self.processor.batch_decode(output, skip_special_tokens=True), self.processor.batch_decode(output, skip_special_tokens=True),
@@ -349,15 +355,15 @@ class Glm4vIntegrationTest(unittest.TestCase):
model = Glm4vForConditionalGeneration.from_pretrained( model = Glm4vForConditionalGeneration.from_pretrained(
"THUDM/GLM-4.1V-9B-Thinking", torch_dtype="auto", device_map="auto" "THUDM/GLM-4.1V-9B-Thinking", torch_dtype="auto", device_map="auto"
) )
text = self.processor.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True) inputs = self.processor.apply_chat_template(
inputs = self.processor(text=[text], images=[self.image], return_tensors="pt").to(torch_device) self.message, tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(torch_device)
output = model.generate(**inputs, max_new_tokens=30, num_return_sequences=3) output = model.generate(**inputs, max_new_tokens=30, do_sample=False, num_beams=2, num_return_sequences=2)
EXPECTED_DECODED_TEXT = [ EXPECTED_DECODED_TEXT = [
'system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices', "\nWhat kind of dog is this?\n<think>Got it, let's look at the image. The animal in the picture doesn't look like a dog; it's actually a cat. Specifically",
'system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices', "\nWhat kind of dog is this?\n<think>Got it, let's look at the image. The animal in the picture doesn't look like a dog; it's actually a cat, specifically"
'system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices',
] # fmt: skip ] # fmt: skip
self.assertEqual( self.assertEqual(
self.processor.batch_decode(output, skip_special_tokens=True), self.processor.batch_decode(output, skip_special_tokens=True),
@@ -369,22 +375,25 @@ class Glm4vIntegrationTest(unittest.TestCase):
model = Glm4vForConditionalGeneration.from_pretrained( model = Glm4vForConditionalGeneration.from_pretrained(
"THUDM/GLM-4.1V-9B-Thinking", torch_dtype="auto", device_map="auto" "THUDM/GLM-4.1V-9B-Thinking", torch_dtype="auto", device_map="auto"
) )
text = self.processor.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True) message_wo_image = [
messages2 = [ {"role": "user", "content": [{"type": "text", "text": "Who are you?"}]},
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who are you?"},
] ]
text2 = self.processor.apply_chat_template(messages2, tokenize=False, add_generation_prompt=True) batched_messages = [self.message, message_wo_image]
inputs = self.processor(text=[text, text2], images=[self.image], padding=True, return_tensors="pt").to( inputs = self.processor.apply_chat_template(
torch_device batched_messages,
) tokenize=True,
add_generation_prompt=True,
return_dict=True,
return_tensors="pt",
padding=True,
).to(torch_device)
# it should not matter whether two images are the same size or not # it should not matter whether two images are the same size or not
output = model.generate(**inputs, max_new_tokens=30) output = model.generate(**inputs, max_new_tokens=30)
EXPECTED_DECODED_TEXT = [ EXPECTED_DECODED_TEXT = [
'system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices', "\nWhat kind of dog is this?\n<think>Got it, let's look at the image. The animal in the picture is not a dog; it's a cat. Specifically, it looks",
'system\nYou are a helpful assistant.\nuser\nWho are you?\nassistant\nI am a large language model created by Alibaba Cloud. I am called Qwen.' '\nWho are you?\n<think>Got it, the user is asking "Who are you?" I need to respond appropriately. First, I should clarify that I\'m an AI assistant'
] # fmt: skip ] # fmt: skip
self.assertEqual( self.assertEqual(
self.processor.batch_decode(output, skip_special_tokens=True), self.processor.batch_decode(output, skip_special_tokens=True),
@@ -396,19 +405,22 @@ class Glm4vIntegrationTest(unittest.TestCase):
model = Glm4vForConditionalGeneration.from_pretrained( model = Glm4vForConditionalGeneration.from_pretrained(
"THUDM/GLM-4.1V-9B-Thinking", torch_dtype="auto", device_map="auto" "THUDM/GLM-4.1V-9B-Thinking", torch_dtype="auto", device_map="auto"
) )
text = self.processor.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True) batched_messages = [self.message, self.message2]
text2 = self.processor.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True) inputs = self.processor.apply_chat_template(
image2 = self.image.resize((224, 224)) batched_messages,
inputs = self.processor(text=[text, text2], images=[self.image, image2], padding=True, return_tensors="pt").to( tokenize=True,
torch_device add_generation_prompt=True,
) return_dict=True,
return_tensors="pt",
padding=True,
).to(torch_device)
# it should not matter whether two images are the same size or not # it should not matter whether two images are the same size or not
output = model.generate(**inputs, max_new_tokens=30) output = model.generate(**inputs, max_new_tokens=30)
EXPECTED_DECODED_TEXT = [ EXPECTED_DECODED_TEXT = [
'system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices', "\nWhat kind of dog is this?\n<think>Got it, let's look at the image. The animal in the picture has a stocky build, thick fur, and a face that's",
'system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular pets' "\nWhat kind of dog is this?\n<think>Got it, let's look at the image. Wait, the animals here are cats, not dogs. The question is about a dog, but"
] # fmt: skip ] # fmt: skip
self.assertEqual( self.assertEqual(
self.processor.batch_decode(output, skip_special_tokens=True), self.processor.batch_decode(output, skip_special_tokens=True),
@@ -425,18 +437,23 @@ class Glm4vIntegrationTest(unittest.TestCase):
attn_implementation="flash_attention_2", attn_implementation="flash_attention_2",
device_map="auto", device_map="auto",
) )
text = self.processor.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True) batched_messages = [self.message, self.message2]
inputs = self.processor(text=[text, text], images=[self.image, self.image], return_tensors="pt").to( inputs = self.processor.apply_chat_template(
torch_device batched_messages,
) tokenize=True,
add_generation_prompt=True,
return_dict=True,
return_tensors="pt",
padding=True,
).to(torch_device)
# it should not matter whether two images are the same size or not # it should not matter whether two images are the same size or not
output = model.generate(**inputs, max_new_tokens=30) output = model.generate(**inputs, max_new_tokens=30)
EXPECTED_DECODED_TEXT = [ EXPECTED_DECODED_TEXT = [
"system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices", "\nWhat kind of dog is this?\n<think>Got it, let's look at the image. The animal in the picture has a stocky build, thick fur, and a face that's",
"system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices", "\nWhat kind of dog is this?\n<think>Got it, let's look at the image. Wait, the animals here are cats, not dogs. The question is about a dog, but"
] ] # fmt: skip
self.assertEqual( self.assertEqual(
self.processor.batch_decode(output, skip_special_tokens=True), self.processor.batch_decode(output, skip_special_tokens=True),
EXPECTED_DECODED_TEXT, EXPECTED_DECODED_TEXT,
@@ -452,22 +469,25 @@ class Glm4vIntegrationTest(unittest.TestCase):
attn_implementation="flash_attention_2", attn_implementation="flash_attention_2",
device_map="auto", device_map="auto",
) )
text = self.processor.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True) message_wo_image = [
messages2 = [ {"role": "user", "content": [{"type": "text", "text": "Who are you?"}]},
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who are you?"},
] ]
text2 = self.processor.apply_chat_template(messages2, tokenize=False, add_generation_prompt=True) batched_messages = [self.message, message_wo_image]
inputs = self.processor(text=[text, text2], images=[self.image], padding=True, return_tensors="pt").to( inputs = self.processor.apply_chat_template(
torch_device batched_messages,
) tokenize=True,
add_generation_prompt=True,
return_dict=True,
return_tensors="pt",
padding=True,
).to(torch_device)
# it should not matter whether two images are the same size or not # it should not matter whether two images are the same size or not
output = model.generate(**inputs, max_new_tokens=30) output = model.generate(**inputs, max_new_tokens=30)
EXPECTED_DECODED_TEXT = [ EXPECTED_DECODED_TEXT = [
'system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices', "\nWhat kind of dog is this?\n<think>Got it, let's look at the image. The animal in the picture is not a dog; it's a cat. Specifically, it looks",
'system\nYou are a helpful assistant.\nuser\nWho are you?\nassistant\nI am a large language model created by Alibaba Cloud. I am called Qwen.' '\nWho are you?\n<think>Got it, let\'s look at the question. The user is asking "Who are you?" which is a common question when someone meets an AI'
] # fmt: skip ] # fmt: skip
self.assertEqual( self.assertEqual(
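
The updated integration tests put the image URL inside the message and let `apply_chat_template` tokenize in a single call, instead of formatting the text first and then calling the processor with a separately downloaded image. A minimal sketch of that pattern follows. The helper names `image_message`, `text_message`, and `encode_batch` are hypothetical, and `processor` is assumed to be an object such as the one returned by `AutoProcessor.from_pretrained`; only the keyword arguments are taken from the diff.

```python
from typing import Any

Message = dict[str, Any]


def image_message(url: str, text: str) -> list[Message]:
    """Build a single-turn conversation with one image part and one text part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": url},
                {"type": "text", "text": text},
            ],
        }
    ]


def text_message(text: str) -> list[Message]:
    """Build a single-turn, text-only conversation."""
    return [{"role": "user", "content": [{"type": "text", "text": text}]}]


def encode_batch(processor: Any, conversations: list[list[Message]]) -> Any:
    """Tokenize a batch of conversations with the keyword arguments used in the diff."""
    return processor.apply_chat_template(
        conversations,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
        padding=True,  # needed when prompts in the batch differ in length
    )


batch = [
    image_message(
        "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
        "What kind of dog is this?",
    ),
    text_message("Who are you?"),
]
```

In the tests, the result of the `apply_chat_template` call is moved to the device with `.to(torch_device)` and passed to `model.generate(**inputs, max_new_tokens=30)`.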


@@ -176,10 +176,12 @@ class VideoProcessingTestMixin:
torch.compiler.reset() torch.compiler.reset()
video_inputs = self.video_processor_tester.prepare_video_inputs(equal_resolution=False, return_tensors="torch") video_inputs = self.video_processor_tester.prepare_video_inputs(equal_resolution=False, return_tensors="torch")
video_processor = self.fast_video_processing_class(**self.video_processor_dict) video_processor = self.fast_video_processing_class(**self.video_processor_dict)
output_eager = video_processor(video_inputs, device=torch_device, return_tensors="pt") output_eager = video_processor(video_inputs, device=torch_device, do_sample_frames=False, return_tensors="pt")
video_processor = torch.compile(video_processor, mode="reduce-overhead") video_processor = torch.compile(video_processor, mode="reduce-overhead")
output_compiled = video_processor(video_inputs, device=torch_device, return_tensors="pt") output_compiled = video_processor(
video_inputs, device=torch_device, do_sample_frames=False, return_tensors="pt"
)
torch.testing.assert_close( torch.testing.assert_close(
output_eager[self.input_name], output_compiled[self.input_name], rtol=1e-4, atol=1e-4 output_eager[self.input_name], output_compiled[self.input_name], rtol=1e-4, atol=1e-4