Glm 4 doc (#39247)
* update the glm4 model readme
* update test
* update GLM-4.1V model
* update as format
* update
* fix some tests
* fix the rest
* fix on a10, not t4
* nit: dummy import

---------

Co-authored-by: raushan <raushan@huggingface.co>
@@ -18,7 +18,37 @@ rendered properly in your Markdown viewer.
 
 ## Overview
 
-To be released with the official model launch.
+The GLM family welcomes new members: the [GLM-4-0414](https://arxiv.org/pdf/2406.12793) series models.
+
+The **GLM-4-32B-0414** series models feature 32 billion parameters. Their performance is comparable to OpenAI’s GPT
+series and DeepSeek’s V3/R1 series. They also support very user-friendly local deployment features. GLM-4-32B-Base-0414
+was pre-trained on 15T of high-quality data, including substantial reasoning-type synthetic data. This lays the
+foundation for subsequent reinforcement learning extensions. In the post-training stage, we employed human preference
+alignment for dialogue scenarios. Additionally, using techniques like rejection sampling and reinforcement learning, we
+enhanced the model’s performance in instruction following, engineering code, and function calling, thus strengthening
+the atomic capabilities required for agent tasks. GLM-4-32B-0414 achieves good results in engineering code, Artifact
+generation, function calling, search-based Q&A, and report generation. In particular, on several benchmarks, such as
+code generation or specific Q&A tasks, GLM-4-32B-Base-0414 achieves performance comparable to larger models such as
+GPT-4o and DeepSeek-V3-0324 (671B).
+
+**GLM-Z1-32B-0414** is a reasoning model with deep thinking capabilities. This was developed based on GLM-4-32B-0414
+through cold start, extended reinforcement learning, and further training on tasks including mathematics, code, and
+logic. Compared to the base model, GLM-Z1-32B-0414 significantly improves mathematical abilities and the capability to
+solve complex tasks. During training, we also introduced general reinforcement learning based on pairwise ranking
+feedback, which enhances the model's general capabilities.
+
+**GLM-Z1-Rumination-32B-0414** is a deep reasoning model with rumination capabilities (against OpenAI's Deep Research).
+Unlike typical deep thinking models, the rumination model is capable of deeper and longer thinking to solve more
+open-ended and complex problems (e.g., writing a comparative analysis of AI development in two cities and their future
+development plans). Z1-Rumination is trained through scaling end-to-end reinforcement learning with responses graded by
+the ground truth answers or rubrics and can make use of search tools during its deep thinking process to handle complex
+tasks. The model shows significant improvements in research-style writing and complex tasks.
+
+Finally, **GLM-Z1-9B-0414** is a surprise. We employed all the aforementioned techniques to train a small model (9B).
+GLM-Z1-9B-0414 exhibits excellent capabilities in mathematical reasoning and general tasks. Its overall performance is
+top-ranked among all open-source models of the same size. Especially in resource-constrained scenarios, this model
+achieves an excellent balance between efficiency and effectiveness, providing a powerful option for users seeking
+lightweight deployment.
 
 ## Glm4Config
 
@@ -23,6 +23,29 @@ rendered properly in your Markdown viewer.
 
 # GLM-4.1V
 
+## Overview
+
+**GLM-4.1V-9B-Thinking** is a bilingual vision-language model optimized for reasoning, built on GLM-4-9B. It introduces
+a "thinking paradigm" with reinforcement learning, achieving state-of-the-art results among 10B-class models and
+rivaling 72B-scale models. It supports 64k context, 4K resolution, and arbitrary aspect ratios, with an open-source base
+model for further research. See the [paper](https://huggingface.co/papers/2507.01006); its abstract is below.
+
+*We present GLM-4.1V-Thinking, a vision-language model (VLM) designed to advance general-purpose multimodal understanding
+and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework.
+We first develop a capable vision foundation model with significant potential through large-scale pre-training, which
+arguably sets the upper bound for the final performance. We then propose Reinforcement Learning with Curriculum
+Sampling (RLCS) to unlock the full potential of the model, leading to comprehensive capability enhancement across a
+diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding,
+GUI-based agents, and long document understanding. We open-source GLM-4.1V-9B-Thinking, which achieves state-of-the-art
+performance among models of comparable size. In a comprehensive evaluation across 28 public benchmarks, our model
+outperforms Qwen2.5-VL-7B on nearly all tasks and achieves comparable or even superior performance on 18 benchmarks
+relative to the significantly larger Qwen2.5-VL-72B. Notably, GLM-4.1V-9B-Thinking also demonstrates competitive or
+superior performance compared to closed-source models such as GPT-4o on challenging tasks including long document
+understanding and STEM reasoning, further underscoring its strong capabilities. Code, models and more information
+are released at https://github.com/THUDM/GLM-4.1V-Thinking.*
+
+## Usage
+
 The example below demonstrates how to generate text based on an image with [`Pipeline`] or the [`AutoModel`] class.
 
 <hfoptions id="usage">
@@ -173,7 +173,10 @@ class Glm4vVideoProcessor(BaseVideoProcessor):
                 timestamps_list.append(timestamps)
                 processed_videos.append(video)
         else:
-            raise AssertionError("Must set `do_sample_frames=True` to sample frames from GLM-4.1V Model.")
+            # Assume 24 fps by default and prepare timestamps for the whole video when all frames are sampled
+            processed_videos = videos
+            timestamps_list = [[idx // 24 for idx in range(len(video))] for video in videos]
+            timestamps_list = timestamps_list[::2] # mrope
 
         grouped_videos, grouped_videos_index = group_videos_by_shape(processed_videos)
         resized_videos_grouped = {}
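
The fallback branch added in the hunk above gives each frame a whole-second timestamp by integer-dividing its index by an assumed 24 fps. A minimal standalone sketch of that arithmetic follows. The helper name `default_timestamps` and its `fps` parameter are hypothetical and not part of the video processor.

```python
def default_timestamps(num_frames: int, fps: int = 24) -> list[int]:
    """Return one whole-second timestamp per frame, assuming a constant frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    # Frame i falls in second i // fps.
    return [idx // fps for idx in range(num_frames)]


# Two videos of different lengths, represented here only by their frame counts.
frame_counts = [48, 50]
timestamps_list = [default_timestamps(n) for n in frame_counts]
```

For a 50-frame video the last timestamp is `49 // 24`, which is 2. The sketch leaves out the `[::2]` step from the hunk; as written there, that slice is applied to the outer per-video list rather than to the frames of each video.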
@@ -16,15 +16,12 @@
 import gc
 import unittest
 
-import requests
-
 from transformers import (
     AutoProcessor,
     Glm4vConfig,
     Glm4vForConditionalGeneration,
     Glm4vModel,
     is_torch_available,
-    is_vision_available,
 )
 from transformers.testing_utils import (
     require_flash_attn,
@@ -47,10 +44,6 @@ if is_torch_available():
     import torch
 
 
-if is_vision_available():
-    from PIL import Image
-
-
 class Glm4vVisionText2TextModelTester:
     def __init__(
         self,
@@ -177,6 +170,8 @@ class Glm4vModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase)
     all_model_classes = (Glm4vModel, Glm4vForConditionalGeneration) if is_torch_available() else ()
     test_pruning = False
     test_head_masking = False
+    test_torchscript = False
+    model_split_percents = [0.7, 0.9] # model too big to split at 0.5
     _is_composite = True
 
     def setUp(self):
@@ -264,22 +259,34 @@ class Glm4vModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase)
         torch.testing.assert_close(out_embeds, out_ids)
 
 
-@unittest.skip("Model checkpoint not yet released")
 @require_torch
 class Glm4vIntegrationTest(unittest.TestCase):
     def setUp(self):
-        self.processor = AutoProcessor.from_pretrained("z")
-        self.messages = [
+        self.processor = AutoProcessor.from_pretrained("THUDM/GLM-4.1V-9B-Thinking")
+        self.message = [
             {
                 "role": "user",
                 "content": [
-                    {"type": "image"},
+                    {
+                        "type": "image",
+                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
+                    },
+                    {"type": "text", "text": "What kind of dog is this?"},
+                ],
+            }
+        ]
+        self.message2 = [
+            {
+                "role": "user",
+                "content": [
+                    {
+                        "type": "image",
+                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/coco_sample.png",
+                    },
                     {"type": "text", "text": "What kind of dog is this?"},
                 ],
             }
         ]
-        url = "https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/Qwen2-VL/demo_small.jpg"
-        self.image = Image.open(requests.get(url, stream=True).raw)
 
     def tearDown(self):
         gc.collect()
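
The updated `setUp` puts the image URL inside each message, and the later hunks pass those messages straight to `apply_chat_template` instead of loading images separately. A small sketch of building such messages and a batch follows. `build_user_message` and `CAT_URL` are hypothetical names; only the dictionary layout and the URL are taken from the diff.

```python
from typing import Optional


def build_user_message(text: str, image_url: Optional[str] = None) -> list[dict]:
    """Build a single-turn conversation in the layout used by the integration tests."""
    content = []
    if image_url is not None:
        # The image is referenced by URL rather than passed as a loaded PIL image.
        content.append({"type": "image", "url": image_url})
    content.append({"type": "text", "text": text})
    return [{"role": "user", "content": content}]


CAT_URL = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"

message = build_user_message("What kind of dog is this?", CAT_URL)
message_wo_image = build_user_message("Who are you?")

# A batch is a list of conversations; entries may or may not contain an image.
batched_messages = [message, message_wo_image]
```

In the tests, a batch like this is handed to `self.processor.apply_chat_template(...)` with `tokenize=True`, `add_generation_prompt=True`, `return_dict=True`, `return_tensors="pt"` and `padding=True`.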
@@ -291,20 +298,20 @@ class Glm4vIntegrationTest(unittest.TestCase):
             "THUDM/GLM-4.1V-9B-Thinking", torch_dtype="auto", device_map="auto"
         )
 
-        text = self.processor.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True)
-        inputs = self.processor(text=[text], images=[self.image], return_tensors="pt")
-
-        expected_input_ids = [151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 151652, 151655, 151655] # fmt: skip
+        inputs = self.processor.apply_chat_template(
+            self.message, tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt"
+        )
+        expected_input_ids = [151331, 151333, 151336, 198, 151339, 151343, 151343, 151343, 151343, 151343, 151343, 151343, 151343, 151343, 151343, 151343, 151343] # fmt: skip
         assert expected_input_ids == inputs.input_ids[0].tolist()[:17]
 
         expected_pixel_slice = torch.tensor(
             [
-                [0.8792, 0.8792, 0.9084],
-                [1.1858, 1.1858, 1.2296],
-                [1.2004, 1.2004, 1.2150],
-                [1.4340, 1.4340, 1.4194],
-                [1.3902, 1.4048, 1.4194],
-                [1.5216, 1.5362, 1.5362],
+                [-0.0988, -0.0842, -0.0842],
+                [-0.5660, -0.5514, -0.4200],
+                [-0.0259, -0.0259, -0.0259],
+                [-0.1280, -0.0988, -0.2010],
+                [-0.4638, -0.5806, -0.6974],
+                [-1.2083, -1.2229, -1.2083],
             ],
             dtype=torch.float32,
             device="cpu",
@@ -315,8 +322,7 @@ class Glm4vIntegrationTest(unittest.TestCase):
         inputs = inputs.to(torch_device)
 
         output = model.generate(**inputs, max_new_tokens=30)
-        EXPECTED_DECODED_TEXT = "system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices"
-
+        EXPECTED_DECODED_TEXT = "\nWhat kind of dog is this?\n<think>Got it, let's look at the image. The animal in the picture is not a dog; it's a cat. Specifically, it looks"
         self.assertEqual(
             self.processor.decode(output[0], skip_special_tokens=True),
             EXPECTED_DECODED_TEXT,
@@ -327,17 +333,17 @@ class Glm4vIntegrationTest(unittest.TestCase):
         model = Glm4vForConditionalGeneration.from_pretrained(
             "THUDM/GLM-4.1V-9B-Thinking", torch_dtype="auto", device_map="auto"
         )
-        text = self.processor.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True)
-        inputs = self.processor(text=[text, text], images=[self.image, self.image], return_tensors="pt").to(
-            torch_device
-        )
+        batch_messages = [self.message] * 2
+        inputs = self.processor.apply_chat_template(
+            batch_messages, tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt"
+        ).to(torch_device)
 
         # it should not matter whether two images are the same size or not
         output = model.generate(**inputs, max_new_tokens=30)
 
         EXPECTED_DECODED_TEXT = [
-            'system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices',
-            'system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices',
+            "\nWhat kind of dog is this?\n<think>Got it, let's look at the image. The animal in the picture is not a dog; it's a cat. Specifically, it looks",
+            "\nWhat kind of dog is this?\n<think>Got it, let's look at the image. The animal in the picture is not a dog; it's a cat. Specifically, it looks"
         ] # fmt: skip
         self.assertEqual(
             self.processor.batch_decode(output, skip_special_tokens=True),
@@ -349,15 +355,15 @@ class Glm4vIntegrationTest(unittest.TestCase):
         model = Glm4vForConditionalGeneration.from_pretrained(
             "THUDM/GLM-4.1V-9B-Thinking", torch_dtype="auto", device_map="auto"
         )
-        text = self.processor.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True)
-        inputs = self.processor(text=[text], images=[self.image], return_tensors="pt").to(torch_device)
+        inputs = self.processor.apply_chat_template(
+            self.message, tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt"
+        ).to(torch_device)
 
-        output = model.generate(**inputs, max_new_tokens=30, num_return_sequences=3)
+        output = model.generate(**inputs, max_new_tokens=30, do_sample=False, num_beams=2, num_return_sequences=2)
 
         EXPECTED_DECODED_TEXT = [
-            'system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices',
-            'system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices',
-            'system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices',
+            "\nWhat kind of dog is this?\n<think>Got it, let's look at the image. The animal in the picture doesn't look like a dog; it's actually a cat. Specifically",
+            "\nWhat kind of dog is this?\n<think>Got it, let's look at the image. The animal in the picture doesn't look like a dog; it's actually a cat, specifically"
         ] # fmt: skip
         self.assertEqual(
             self.processor.batch_decode(output, skip_special_tokens=True),
@@ -369,22 +375,25 @@ class Glm4vIntegrationTest(unittest.TestCase):
         model = Glm4vForConditionalGeneration.from_pretrained(
             "THUDM/GLM-4.1V-9B-Thinking", torch_dtype="auto", device_map="auto"
         )
-        text = self.processor.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True)
-        messages2 = [
-            {"role": "system", "content": "You are a helpful assistant."},
-            {"role": "user", "content": "Who are you?"},
+        message_wo_image = [
+            {"role": "user", "content": [{"type": "text", "text": "Who are you?"}]},
         ]
-        text2 = self.processor.apply_chat_template(messages2, tokenize=False, add_generation_prompt=True)
-        inputs = self.processor(text=[text, text2], images=[self.image], padding=True, return_tensors="pt").to(
-            torch_device
-        )
+        batched_messages = [self.message, message_wo_image]
+        inputs = self.processor.apply_chat_template(
+            batched_messages,
+            tokenize=True,
+            add_generation_prompt=True,
+            return_dict=True,
+            return_tensors="pt",
+            padding=True,
+        ).to(torch_device)
 
         # it should not matter whether two images are the same size or not
         output = model.generate(**inputs, max_new_tokens=30)
 
         EXPECTED_DECODED_TEXT = [
-            'system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices',
-            'system\nYou are a helpful assistant.\nuser\nWho are you?\nassistant\nI am a large language model created by Alibaba Cloud. I am called Qwen.'
+            "\nWhat kind of dog is this?\n<think>Got it, let's look at the image. The animal in the picture is not a dog; it's a cat. Specifically, it looks",
+            '\nWho are you?\n<think>Got it, the user is asking "Who are you?" I need to respond appropriately. First, I should clarify that I\'m an AI assistant'
         ] # fmt: skip
         self.assertEqual(
             self.processor.batch_decode(output, skip_special_tokens=True),
@@ -396,19 +405,22 @@ class Glm4vIntegrationTest(unittest.TestCase):
         model = Glm4vForConditionalGeneration.from_pretrained(
             "THUDM/GLM-4.1V-9B-Thinking", torch_dtype="auto", device_map="auto"
         )
-        text = self.processor.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True)
-        text2 = self.processor.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True)
-        image2 = self.image.resize((224, 224))
-        inputs = self.processor(text=[text, text2], images=[self.image, image2], padding=True, return_tensors="pt").to(
-            torch_device
-        )
+        batched_messages = [self.message, self.message2]
+        inputs = self.processor.apply_chat_template(
+            batched_messages,
+            tokenize=True,
+            add_generation_prompt=True,
+            return_dict=True,
+            return_tensors="pt",
+            padding=True,
+        ).to(torch_device)
 
         # it should not matter whether two images are the same size or not
         output = model.generate(**inputs, max_new_tokens=30)
 
         EXPECTED_DECODED_TEXT = [
-            'system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices',
-            'system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular pets'
+            "\nWhat kind of dog is this?\n<think>Got it, let's look at the image. The animal in the picture has a stocky build, thick fur, and a face that's",
+            "\nWhat kind of dog is this?\n<think>Got it, let's look at the image. Wait, the animals here are cats, not dogs. The question is about a dog, but"
         ] # fmt: skip
         self.assertEqual(
             self.processor.batch_decode(output, skip_special_tokens=True),
@@ -425,18 +437,23 @@ class Glm4vIntegrationTest(unittest.TestCase):
             attn_implementation="flash_attention_2",
             device_map="auto",
         )
-        text = self.processor.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True)
-        inputs = self.processor(text=[text, text], images=[self.image, self.image], return_tensors="pt").to(
-            torch_device
-        )
+        batched_messages = [self.message, self.message2]
+        inputs = self.processor.apply_chat_template(
+            batched_messages,
+            tokenize=True,
+            add_generation_prompt=True,
+            return_dict=True,
+            return_tensors="pt",
+            padding=True,
+        ).to(torch_device)
 
         # it should not matter whether two images are the same size or not
         output = model.generate(**inputs, max_new_tokens=30)
 
         EXPECTED_DECODED_TEXT = [
-            "system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices",
-            "system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices",
-        ]
+            "\nWhat kind of dog is this?\n<think>Got it, let's look at the image. The animal in the picture has a stocky build, thick fur, and a face that's",
+            "\nWhat kind of dog is this?\n<think>Got it, let's look at the image. Wait, the animals here are cats, not dogs. The question is about a dog, but"
+        ] # fmt: skip
         self.assertEqual(
             self.processor.batch_decode(output, skip_special_tokens=True),
             EXPECTED_DECODED_TEXT,
@@ -452,22 +469,25 @@ class Glm4vIntegrationTest(unittest.TestCase):
             attn_implementation="flash_attention_2",
             device_map="auto",
         )
-        text = self.processor.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True)
-        messages2 = [
-            {"role": "system", "content": "You are a helpful assistant."},
-            {"role": "user", "content": "Who are you?"},
+        message_wo_image = [
+            {"role": "user", "content": [{"type": "text", "text": "Who are you?"}]},
         ]
-        text2 = self.processor.apply_chat_template(messages2, tokenize=False, add_generation_prompt=True)
-        inputs = self.processor(text=[text, text2], images=[self.image], padding=True, return_tensors="pt").to(
-            torch_device
-        )
+        batched_messages = [self.message, message_wo_image]
+        inputs = self.processor.apply_chat_template(
+            batched_messages,
+            tokenize=True,
+            add_generation_prompt=True,
+            return_dict=True,
+            return_tensors="pt",
+            padding=True,
+        ).to(torch_device)
 
         # it should not matter whether two images are the same size or not
         output = model.generate(**inputs, max_new_tokens=30)
 
         EXPECTED_DECODED_TEXT = [
-            'system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices',
-            'system\nYou are a helpful assistant.\nuser\nWho are you?\nassistant\nI am a large language model created by Alibaba Cloud. I am called Qwen.'
+            "\nWhat kind of dog is this?\n<think>Got it, let's look at the image. The animal in the picture is not a dog; it's a cat. Specifically, it looks",
+            '\nWho are you?\n<think>Got it, let\'s look at the question. The user is asking "Who are you?" which is a common question when someone meets an AI'
         ] # fmt: skip
 
         self.assertEqual(
@@ -176,10 +176,12 @@ class VideoProcessingTestMixin:
         torch.compiler.reset()
         video_inputs = self.video_processor_tester.prepare_video_inputs(equal_resolution=False, return_tensors="torch")
         video_processor = self.fast_video_processing_class(**self.video_processor_dict)
-        output_eager = video_processor(video_inputs, device=torch_device, return_tensors="pt")
+        output_eager = video_processor(video_inputs, device=torch_device, do_sample_frames=False, return_tensors="pt")
 
         video_processor = torch.compile(video_processor, mode="reduce-overhead")
-        output_compiled = video_processor(video_inputs, device=torch_device, return_tensors="pt")
+        output_compiled = video_processor(
+            video_inputs, device=torch_device, do_sample_frames=False, return_tensors="pt"
+        )
 
         torch.testing.assert_close(
             output_eager[self.input_name], output_compiled[self.input_name], rtol=1e-4, atol=1e-4