From 17b3c96c00cd8421bff85282aec32422bdfebd31 Mon Sep 17 00:00:00 2001 From: Yuxuan Zhang <2448370773@qq.com> Date: Tue, 8 Jul 2025 14:22:04 +0800 Subject: [PATCH] Glm 4 doc (#39247) * update the glm4 model readme * update test * update GLM-4.1V model * update as format * update * fix some tests * fix the rest * fix on a10, not t4 * nit: dummy import --------- Co-authored-by: raushan --- docs/source/en/model_doc/glm4.md | 32 +++- docs/source/en/model_doc/glm4v.md | 23 +++ .../models/glm4v/video_processing_glm4v.py | 5 +- tests/models/glm4v/test_modeling_glm4v.py | 164 ++++++++++-------- tests/test_video_processing_common.py | 6 +- 5 files changed, 154 insertions(+), 76 deletions(-) diff --git a/docs/source/en/model_doc/glm4.md b/docs/source/en/model_doc/glm4.md index f854bb658f..a7df833039 100644 --- a/docs/source/en/model_doc/glm4.md +++ b/docs/source/en/model_doc/glm4.md @@ -18,7 +18,37 @@ rendered properly in your Markdown viewer. ## Overview -To be released with the official model launch. +The GLM family welcomes new members: the [GLM-4-0414](https://arxiv.org/pdf/2406.12793) series models. + +The **GLM-4-32B-0414** series features 32 billion parameters. Its performance is comparable to OpenAI’s GPT +series and DeepSeek’s V3/R1 series. It also supports very user-friendly local deployment features. GLM-4-32B-Base-0414 +was pre-trained on 15T of high-quality data, including substantial reasoning-type synthetic data. This lays the +foundation for subsequent reinforcement learning extensions. In the post-training stage, we employed human preference +alignment for dialogue scenarios. Additionally, using techniques like rejection sampling and reinforcement learning, we +enhanced the model’s performance in instruction following, engineering code, and function calling, thus strengthening +the atomic capabilities required for agent tasks.
GLM-4-32B-0414 achieves good results in engineering code, Artifact +generation, function calling, search-based Q&A, and report generation. In particular, on several benchmarks, such as +code generation or specific Q&A tasks, GLM-4-32B-Base-0414 achieves comparable performance with those larger models like +GPT-4o and DeepSeek-V3-0324 (671B). + +**GLM-Z1-32B-0414** is a reasoning model with deep thinking capabilities. This was developed based on GLM-4-32B-0414 +through cold start, extended reinforcement learning, and further training on tasks including mathematics, code, and +logic. Compared to the base model, GLM-Z1-32B-0414 significantly improves mathematical abilities and the capability to +solve complex tasks. During training, we also introduced general reinforcement learning based on pairwise ranking +feedback, which enhances the model's general capabilities. + +**GLM-Z1-Rumination-32B-0414** is a deep reasoning model with rumination capabilities (against OpenAI's Deep Research). +Unlike typical deep thinking models, the rumination model is capable of deeper and longer thinking to solve more +open-ended and complex problems (e.g., writing a comparative analysis of AI development in two cities and their future +development plans). Z1-Rumination is trained through scaling end-to-end reinforcement learning with responses graded by +the ground truth answers or rubrics and can make use of search tools during its deep thinking process to handle complex +tasks. The model shows significant improvements in research-style writing and complex tasks. + +Finally, **GLM-Z1-9B-0414** is a surprise. We employed all the aforementioned techniques to train a small model (9B). +GLM-Z1-9B-0414 exhibits excellent capabilities in mathematical reasoning and general tasks. Its overall performance is +top-ranked among all open-source models of the same size. 
Especially in resource-constrained scenarios, this model +achieves an excellent balance between efficiency and effectiveness, providing a powerful option for users seeking +lightweight deployment. ## Glm4Config diff --git a/docs/source/en/model_doc/glm4v.md b/docs/source/en/model_doc/glm4v.md index d18a10e9b2..0884242150 100644 --- a/docs/source/en/model_doc/glm4v.md +++ b/docs/source/en/model_doc/glm4v.md @@ -23,6 +23,29 @@ rendered properly in your Markdown viewer. # GLM-4.1V +## Overview + +**GLM-4.1V-9B-Thinking** is a bilingual vision-language model optimized for reasoning, built on GLM-4-9B. It introduces +a "thinking paradigm" with reinforcement learning, achieving state-of-the-art results among 10B-class models and +rivaling 72B-scale models. It supports 64k context, 4K resolution, and arbitrary aspect ratios, with an open-source base +model for further research. You can check our paper [here](https://huggingface.co/papers/2507.01006). Below is the abstract. + +*We present GLM-4.1V-Thinking, a vision-language model (VLM) designed to advance general-purpose multimodal understanding +and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. +We first develop a capable vision foundation model with significant potential through large-scale pre-training, which +arguably sets the upper bound for the final performance. We then propose Reinforcement Learning with Curriculum +Sampling (RLCS) to unlock the full potential of the model, leading to comprehensive capability enhancement across a +diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, +GUI-based agents, and long document understanding. We open-source GLM-4.1V-9B-Thinking, which achieves state-of-the-art +performance among models of comparable size.
In a comprehensive evaluation across 28 public benchmarks, our model +outperforms Qwen2.5-VL-7B on nearly all tasks and achieves comparable or even superior performance on 18 benchmarks +relative to the significantly larger Qwen2.5-VL-72B. Notably, GLM-4.1V-9B-Thinking also demonstrates competitive or +superior performance compared to closed-source models such as GPT-4o on challenging tasks including long document +understanding and STEM reasoning, further underscoring its strong capabilities. Code, models and more information +are released at https://github.com/THUDM/GLM-4.1V-Thinking.* + +## Usage + The example below demonstrates how to generate text based on an image with [`Pipeline`] or the [`AutoModel`] class. diff --git a/src/transformers/models/glm4v/video_processing_glm4v.py b/src/transformers/models/glm4v/video_processing_glm4v.py index ac6a992107..4f3ffed27e 100644 --- a/src/transformers/models/glm4v/video_processing_glm4v.py +++ b/src/transformers/models/glm4v/video_processing_glm4v.py @@ -173,7 +173,10 @@ class Glm4vVideoProcessor(BaseVideoProcessor): timestamps_list.append(timestamps) processed_videos.append(video) else: - raise AssertionError("Must set `do_sample_frames=True` to sample frames from GLM-4.1V Model.") + # Assume 24 fps by default and prepare timestamps for the whole video when all frames are sampled + processed_videos = videos + timestamps_list = [[idx // 24 for idx in range(len(video))] for video in videos] + timestamps_list = timestamps_list[::2] # mrope grouped_videos, grouped_videos_index = group_videos_by_shape(processed_videos) resized_videos_grouped = {} diff --git a/tests/models/glm4v/test_modeling_glm4v.py b/tests/models/glm4v/test_modeling_glm4v.py index a9901ded23..39b66875c2 100644 --- a/tests/models/glm4v/test_modeling_glm4v.py +++ b/tests/models/glm4v/test_modeling_glm4v.py @@ -16,15 +16,12 @@ import gc import unittest -import requests - from transformers import ( AutoProcessor, Glm4vConfig, Glm4vForConditionalGeneration, 
Glm4vModel, is_torch_available, - is_vision_available, ) from transformers.testing_utils import ( require_flash_attn, @@ -47,10 +44,6 @@ if is_torch_available(): import torch -if is_vision_available(): - from PIL import Image - - class Glm4vVisionText2TextModelTester: def __init__( self, @@ -177,6 +170,8 @@ class Glm4vModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase) all_model_classes = (Glm4vModel, Glm4vForConditionalGeneration) if is_torch_available() else () test_pruning = False test_head_masking = False + test_torchscript = False + model_split_percents = [0.7, 0.9] # model too big to split at 0.5 _is_composite = True def setUp(self): @@ -264,22 +259,34 @@ class Glm4vModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase) torch.testing.assert_close(out_embeds, out_ids) -@unittest.skip("Model checkpoint not yet released") @require_torch class Glm4vIntegrationTest(unittest.TestCase): def setUp(self): - self.processor = AutoProcessor.from_pretrained("z") - self.messages = [ + self.processor = AutoProcessor.from_pretrained("THUDM/GLM-4.1V-9B-Thinking") + self.message = [ { "role": "user", "content": [ - {"type": "image"}, + { + "type": "image", + "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg", + }, + {"type": "text", "text": "What kind of dog is this?"}, + ], + } + ] + self.message2 = [ + { + "role": "user", + "content": [ + { + "type": "image", + "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/coco_sample.png", + }, {"type": "text", "text": "What kind of dog is this?"}, ], } ] - url = "https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/Qwen2-VL/demo_small.jpg" - self.image = Image.open(requests.get(url, stream=True).raw) def tearDown(self): gc.collect() @@ -291,20 +298,20 @@ class Glm4vIntegrationTest(unittest.TestCase): "THUDM/GLM-4.1V-9B-Thinking", torch_dtype="auto", device_map="auto" ) - text = 
self.processor.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True) - inputs = self.processor(text=[text], images=[self.image], return_tensors="pt") - - expected_input_ids = [151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 151652, 151655, 151655] # fmt: skip + inputs = self.processor.apply_chat_template( + self.message, tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt" + ) + expected_input_ids = [151331, 151333, 151336, 198, 151339, 151343, 151343, 151343, 151343, 151343, 151343, 151343, 151343, 151343, 151343, 151343, 151343] # fmt: skip assert expected_input_ids == inputs.input_ids[0].tolist()[:17] expected_pixel_slice = torch.tensor( [ - [0.8792, 0.8792, 0.9084], - [1.1858, 1.1858, 1.2296], - [1.2004, 1.2004, 1.2150], - [1.4340, 1.4340, 1.4194], - [1.3902, 1.4048, 1.4194], - [1.5216, 1.5362, 1.5362], + [-0.0988, -0.0842, -0.0842], + [-0.5660, -0.5514, -0.4200], + [-0.0259, -0.0259, -0.0259], + [-0.1280, -0.0988, -0.2010], + [-0.4638, -0.5806, -0.6974], + [-1.2083, -1.2229, -1.2083], ], dtype=torch.float32, device="cpu", @@ -315,8 +322,7 @@ class Glm4vIntegrationTest(unittest.TestCase): inputs = inputs.to(torch_device) output = model.generate(**inputs, max_new_tokens=30) - EXPECTED_DECODED_TEXT = "system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices" - + EXPECTED_DECODED_TEXT = "\nWhat kind of dog is this?\nGot it, let's look at the image. The animal in the picture is not a dog; it's a cat. 
Specifically, it looks" self.assertEqual( self.processor.decode(output[0], skip_special_tokens=True), EXPECTED_DECODED_TEXT, @@ -327,17 +333,17 @@ class Glm4vIntegrationTest(unittest.TestCase): model = Glm4vForConditionalGeneration.from_pretrained( "THUDM/GLM-4.1V-9B-Thinking", torch_dtype="auto", device_map="auto" ) - text = self.processor.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True) - inputs = self.processor(text=[text, text], images=[self.image, self.image], return_tensors="pt").to( - torch_device - ) + batch_messages = [self.message] * 2 + inputs = self.processor.apply_chat_template( + batch_messages, tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt" + ).to(torch_device) # it should not matter whether two images are the same size or not output = model.generate(**inputs, max_new_tokens=30) EXPECTED_DECODED_TEXT = [ - 'system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices', - 'system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices', + "\nWhat kind of dog is this?\nGot it, let's look at the image. The animal in the picture is not a dog; it's a cat. Specifically, it looks", + "\nWhat kind of dog is this?\nGot it, let's look at the image. The animal in the picture is not a dog; it's a cat. 
Specifically, it looks" ] # fmt: skip self.assertEqual( self.processor.batch_decode(output, skip_special_tokens=True), @@ -349,15 +355,15 @@ class Glm4vIntegrationTest(unittest.TestCase): model = Glm4vForConditionalGeneration.from_pretrained( "THUDM/GLM-4.1V-9B-Thinking", torch_dtype="auto", device_map="auto" ) - text = self.processor.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True) - inputs = self.processor(text=[text], images=[self.image], return_tensors="pt").to(torch_device) + inputs = self.processor.apply_chat_template( + self.message, tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt" + ).to(torch_device) - output = model.generate(**inputs, max_new_tokens=30, num_return_sequences=3) + output = model.generate(**inputs, max_new_tokens=30, do_sample=False, num_beams=2, num_return_sequences=2) EXPECTED_DECODED_TEXT = [ - 'system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices', - 'system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices', - 'system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices', + "\nWhat kind of dog is this?\nGot it, let's look at the image. The animal in the picture doesn't look like a dog; it's actually a cat. Specifically", + "\nWhat kind of dog is this?\nGot it, let's look at the image. 
The animal in the picture doesn't look like a dog; it's actually a cat, specifically" ] # fmt: skip self.assertEqual( self.processor.batch_decode(output, skip_special_tokens=True), @@ -369,22 +375,25 @@ class Glm4vIntegrationTest(unittest.TestCase): model = Glm4vForConditionalGeneration.from_pretrained( "THUDM/GLM-4.1V-9B-Thinking", torch_dtype="auto", device_map="auto" ) - text = self.processor.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True) - messages2 = [ - {"role": "system", "content": "You are a helpful assistant."}, - {"role": "user", "content": "Who are you?"}, + message_wo_image = [ + {"role": "user", "content": [{"type": "text", "text": "Who are you?"}]}, ] - text2 = self.processor.apply_chat_template(messages2, tokenize=False, add_generation_prompt=True) - inputs = self.processor(text=[text, text2], images=[self.image], padding=True, return_tensors="pt").to( - torch_device - ) + batched_messages = [self.message, message_wo_image] + inputs = self.processor.apply_chat_template( + batched_messages, + tokenize=True, + add_generation_prompt=True, + return_dict=True, + return_tensors="pt", + padding=True, + ).to(torch_device) # it should not matter whether two images are the same size or not output = model.generate(**inputs, max_new_tokens=30) EXPECTED_DECODED_TEXT = [ - 'system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices', - 'system\nYou are a helpful assistant.\nuser\nWho are you?\nassistant\nI am a large language model created by Alibaba Cloud. I am called Qwen.' + "\nWhat kind of dog is this?\nGot it, let's look at the image. The animal in the picture is not a dog; it's a cat. Specifically, it looks", + '\nWho are you?\nGot it, the user is asking "Who are you?" I need to respond appropriately. 
First, I should clarify that I\'m an AI assistant' ] # fmt: skip self.assertEqual( self.processor.batch_decode(output, skip_special_tokens=True), @@ -396,19 +405,22 @@ class Glm4vIntegrationTest(unittest.TestCase): model = Glm4vForConditionalGeneration.from_pretrained( "THUDM/GLM-4.1V-9B-Thinking", torch_dtype="auto", device_map="auto" ) - text = self.processor.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True) - text2 = self.processor.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True) - image2 = self.image.resize((224, 224)) - inputs = self.processor(text=[text, text2], images=[self.image, image2], padding=True, return_tensors="pt").to( - torch_device - ) + batched_messages = [self.message, self.message2] + inputs = self.processor.apply_chat_template( + batched_messages, + tokenize=True, + add_generation_prompt=True, + return_dict=True, + return_tensors="pt", + padding=True, + ).to(torch_device) # it should not matter whether two images are the same size or not output = model.generate(**inputs, max_new_tokens=30) EXPECTED_DECODED_TEXT = [ - 'system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices', - 'system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular pets' + "\nWhat kind of dog is this?\nGot it, let's look at the image. The animal in the picture has a stocky build, thick fur, and a face that's", + "\nWhat kind of dog is this?\nGot it, let's look at the image. Wait, the animals here are cats, not dogs. 
The question is about a dog, but" ] # fmt: skip self.assertEqual( self.processor.batch_decode(output, skip_special_tokens=True), @@ -425,18 +437,23 @@ class Glm4vIntegrationTest(unittest.TestCase): attn_implementation="flash_attention_2", device_map="auto", ) - text = self.processor.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True) - inputs = self.processor(text=[text, text], images=[self.image, self.image], return_tensors="pt").to( - torch_device - ) + batched_messages = [self.message, self.message2] + inputs = self.processor.apply_chat_template( + batched_messages, + tokenize=True, + add_generation_prompt=True, + return_dict=True, + return_tensors="pt", + padding=True, + ).to(torch_device) # it should not matter whether two images are the same size or not output = model.generate(**inputs, max_new_tokens=30) EXPECTED_DECODED_TEXT = [ - "system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices", - "system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices", - ] + "\nWhat kind of dog is this?\nGot it, let's look at the image. The animal in the picture has a stocky build, thick fur, and a face that's", + "\nWhat kind of dog is this?\nGot it, let's look at the image. Wait, the animals here are cats, not dogs. 
The question is about a dog, but" + ] # fmt: skip self.assertEqual( self.processor.batch_decode(output, skip_special_tokens=True), EXPECTED_DECODED_TEXT, @@ -452,22 +469,25 @@ class Glm4vIntegrationTest(unittest.TestCase): attn_implementation="flash_attention_2", device_map="auto", ) - text = self.processor.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True) - messages2 = [ - {"role": "system", "content": "You are a helpful assistant."}, - {"role": "user", "content": "Who are you?"}, + message_wo_image = [ + {"role": "user", "content": [{"type": "text", "text": "Who are you?"}]}, ] - text2 = self.processor.apply_chat_template(messages2, tokenize=False, add_generation_prompt=True) - inputs = self.processor(text=[text, text2], images=[self.image], padding=True, return_tensors="pt").to( - torch_device - ) + batched_messages = [self.message, message_wo_image] + inputs = self.processor.apply_chat_template( + batched_messages, + tokenize=True, + add_generation_prompt=True, + return_dict=True, + return_tensors="pt", + padding=True, + ).to(torch_device) # it should not matter whether two images are the same size or not output = model.generate(**inputs, max_new_tokens=30) EXPECTED_DECODED_TEXT = [ - 'system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices', - 'system\nYou are a helpful assistant.\nuser\nWho are you?\nassistant\nI am a large language model created by Alibaba Cloud. I am called Qwen.' + "\nWhat kind of dog is this?\nGot it, let's look at the image. The animal in the picture is not a dog; it's a cat. Specifically, it looks", + '\nWho are you?\nGot it, let\'s look at the question. The user is asking "Who are you?" 
which is a common question when someone meets an AI' ] # fmt: skip self.assertEqual( diff --git a/tests/test_video_processing_common.py b/tests/test_video_processing_common.py index 1bbdb4fdcc..ec6e059f91 100644 --- a/tests/test_video_processing_common.py +++ b/tests/test_video_processing_common.py @@ -176,10 +176,12 @@ class VideoProcessingTestMixin: torch.compiler.reset() video_inputs = self.video_processor_tester.prepare_video_inputs(equal_resolution=False, return_tensors="torch") video_processor = self.fast_video_processing_class(**self.video_processor_dict) - output_eager = video_processor(video_inputs, device=torch_device, return_tensors="pt") + output_eager = video_processor(video_inputs, device=torch_device, do_sample_frames=False, return_tensors="pt") video_processor = torch.compile(video_processor, mode="reduce-overhead") - output_compiled = video_processor(video_inputs, device=torch_device, return_tensors="pt") + output_compiled = video_processor( + video_inputs, device=torch_device, do_sample_frames=False, return_tensors="pt" + ) torch.testing.assert_close( output_eager[self.input_name], output_compiled[self.input_name], rtol=1e-4, atol=1e-4
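Review note on the `video_processing_glm4v.py` hunk: the new `else` branch replaces the former `AssertionError` with default timestamps computed at an assumed 24 fps. The sketch below reproduces those two statements outside the library so the behavior is easy to check by hand. The function name `fallback_timestamps` and the parameter `assumed_fps` are hypothetical and do not exist in the patch; plain lists stand in for video tensors, on the assumption that `len(video)` is the frame count.

```python
from typing import List, Sequence


def fallback_timestamps(
    videos: Sequence[Sequence[object]], assumed_fps: int = 24
) -> List[List[int]]:
    """Hypothetical stand-in for the two statements in the new `else` branch."""
    # One integer-second timestamp per frame: frame index floor-divided by the
    # assumed frame rate (24 fps in the patch).
    per_video = [[idx // assumed_fps for idx in range(len(video))] for video in videos]
    # As written in the hunk, the `[::2]` slice runs over the outer list
    # (every other video), not over the per-frame timestamps inside a video.
    return per_video[::2]


# Three dummy "videos" with 48, 24 and 30 frames.
videos = [[None] * 48, [None] * 24, [None] * 30]
result = fallback_timestamps(videos)
print(len(result), [len(ts) for ts in result])
```

With these inputs the slice keeps the first and third timestamp lists, so it may be worth confirming during review whether slicing per video was the intent behind the `# mrope` comment.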