Glm 4 doc (#39247)

* update the glm4 model readme
* update test
* update GLM-4.1V model
* update as format
* update
* fix some tests
* fix the rest
* fix on a10, not t4
* nit: dummy import

---------

Co-authored-by: raushan <raushan@huggingface.co>
2025-07-08 14:22:04 +08:00
parent bbca9782ca
commit 17b3c96c00
5 changed files with 154 additions and 76 deletions
--- a/docs/source/en/model_doc/glm4.md
+++ b/docs/source/en/model_doc/glm4.md
@@ -18,7 +18,37 @@ rendered properly in your Markdown viewer.
 ## Overview
-To be released with the official model launch.
+The GLM family welcomes the new [GLM-4-0414](https://arxiv.org/pdf/2406.12793) series of models.
 The **GLM-4-32B-0414** series models feature 32 billion parameters, and their performance is comparable to OpenAI’s GPT
 series and DeepSeek’s V3/R1 series. They also support user-friendly local deployment. GLM-4-32B-Base-0414
 was pre-trained on 15T of high-quality data, including substantial reasoning-type synthetic data. This lays the
 foundation for subsequent reinforcement learning extensions. In the post-training stage, we employed human preference
 alignment for dialogue scenarios. Additionally, using techniques like rejection sampling and reinforcement learning, we
 enhanced the model’s performance in instruction following, engineering code, and function calling, thus strengthening
 the atomic capabilities required for agent tasks. GLM-4-32B-0414 achieves good results in engineering code, Artifact
 generation, function calling, search-based Q&A, and report generation. In particular, on several benchmarks, such as
 code generation or specific Q&A tasks, GLM-4-32B-Base-0414 achieves performance comparable to that of larger models such as
 GPT-4o and DeepSeek-V3-0324 (671B).
 **GLM-Z1-32B-0414** is a reasoning model with deep thinking capabilities. This was developed based on GLM-4-32B-0414
 through cold start, extended reinforcement learning, and further training on tasks including mathematics, code, and
 logic. Compared to the base model, GLM-Z1-32B-0414 significantly improves mathematical abilities and the capability to
 solve complex tasks. During training, we also introduced general reinforcement learning based on pairwise ranking
 feedback, which enhances the model's general capabilities.
 **GLM-Z1-Rumination-32B-0414** is a deep reasoning model with rumination capabilities (positioned against OpenAI's Deep Research).
 Unlike typical deep thinking models, the rumination model is capable of deeper and longer thinking to solve more
 open-ended and complex problems (e.g., writing a comparative analysis of AI development in two cities and their future
 development plans). Z1-Rumination is trained through scaling end-to-end reinforcement learning with responses graded by
 the ground truth answers or rubrics and can make use of search tools during its deep thinking process to handle complex
 tasks. The model shows significant improvements in research-style writing and complex tasks.
 Finally, **GLM-Z1-9B-0414** is a surprise. We employed all the aforementioned techniques to train a small model (9B).
 GLM-Z1-9B-0414 exhibits excellent capabilities in mathematical reasoning and general tasks. Its overall performance is
 top-ranked among all open-source models of the same size. Especially in resource-constrained scenarios, this model
 achieves an excellent balance between efficiency and effectiveness, providing a powerful option for users seeking
 lightweight deployment.
 ## Glm4Config
--- a/docs/source/en/model_doc/glm4v.md
+++ b/docs/source/en/model_doc/glm4v.md
@@ -23,6 +23,29 @@ rendered properly in your Markdown viewer.
 # GLM-4.1V
 ## Overview
 **GLM-4.1V-9B-Thinking** is a bilingual vision-language model optimized for reasoning, built on GLM-4-9B. It introduces
 a "thinking paradigm" with reinforcement learning, achieving state-of-the-art results among 10B-class models and
 rivaling 72B-scale models. It supports 64k context, 4K resolution, and arbitrary aspect ratios, with an open-source base
 model for further research. You can read our paper [here](https://huggingface.co/papers/2507.01006); the abstract is below.
 *We present GLM-4.1V-Thinking, a vision-language model (VLM) designed to advance general-purpose multimodal understanding
 and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework.
 We first develop a capable vision foundation model with significant potential through large-scale pre-training, which
 arguably sets the upper bound for the final performance. We then propose Reinforcement Learning with Curriculum
 Sampling (RLCS) to unlock the full potential of the model, leading to comprehensive capability enhancement across a
 diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding,
 GUI-based agents, and long document understanding. We open-source GLM-4.1V-9B-Thinking, which achieves state-of-the-art
 performance among models of comparable size. In a comprehensive evaluation across 28 public benchmarks, our model
 outperforms Qwen2.5-VL-7B on nearly all tasks and achieves comparable or even superior performance on 18 benchmarks
 relative to the significantly larger Qwen2.5-VL-72B. Notably, GLM-4.1V-9B-Thinking also demonstrates competitive or
 superior performance compared to closed-source models such as GPT-4o on challenging tasks including long document
 understanding and STEM reasoning, further underscoring its strong capabilities. Code, models and more information
 are released at https://github.com/THUDM/GLM-4.1V-Thinking.*
 ## Usage
 The example below demonstrates how to generate text based on an image with [`Pipeline`] or the [`AutoModel`] class.
 <hfoptions id="usage">
--- a/src/transformers/models/glm4v/video_processing_glm4v.py
+++ b/src/transformers/models/glm4v/video_processing_glm4v.py
@@ -173,7 +173,10 @@ class Glm4vVideoProcessor(BaseVideoProcessor):
                timestamps_list.append(timestamps)
                processed_videos.append(video)
        else:
-            raise AssertionError("Must set `do_sample_frames=True` to sample frames from GLM-4.1V Model.")
+            # Assume 24 fps by default and prepare timestamps for the whole video when all frames are sampled
            processed_videos = videos
            timestamps_list = [[idx // 24 for idx in range(len(video))] for video in videos]
            timestamps_list = timestamps_list[::2]  # mrope
        grouped_videos, grouped_videos_index = group_videos_by_shape(processed_videos)
        resized_videos_grouped = {}
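
The fallback branch above assigns one timestamp per frame by assuming 24 fps. A minimal sketch of that arithmetic for a single video follows. `frame_timestamps` is a hypothetical helper name used only for illustration, not part of the processor, and the `[::2]` slicing step from the diff is left out.

```python
def frame_timestamps(num_frames: int, fps: int = 24) -> list[int]:
    """Return the whole-second timestamp of each frame, assuming a fixed frame rate."""
    # Integer division maps frames 0..fps-1 to second 0, the next fps frames to second 1, and so on.
    return [idx // fps for idx in range(num_frames)]


timestamps = frame_timestamps(50)
# Frames 0 and 23 fall in second 0, frame 24 in second 1, frame 49 in second 2.
print(timestamps[0], timestamps[23], timestamps[24], timestamps[49])
```

If the real frame rate differs from 24, the timestamps computed this way drift from the true ones, which is why the comment in the diff calls it an assumption.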
--- a/tests/models/glm4v/test_modeling_glm4v.py
+++ b/tests/models/glm4v/test_modeling_glm4v.py
@@ -16,15 +16,12 @@
 import gc
 import unittest
 import requests
 from transformers import (
    AutoProcessor,
    Glm4vConfig,
    Glm4vForConditionalGeneration,
    Glm4vModel,
    is_torch_available,
    is_vision_available,
 )
 from transformers.testing_utils import (
    require_flash_attn,
@@ -47,10 +44,6 @@ if is_torch_available():
    import torch
 if is_vision_available():
    from PIL import Image
 class Glm4vVisionText2TextModelTester:
    def __init__(
        self,
@@ -177,6 +170,8 @@ class Glm4vModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase)
    all_model_classes = (Glm4vModel, Glm4vForConditionalGeneration) if is_torch_available() else ()
    test_pruning = False
    test_head_masking = False
    test_torchscript = False
    model_split_percents = [0.7, 0.9]  # model too big to split at 0.5
    _is_composite = True
    def setUp(self):
@@ -264,22 +259,34 @@ class Glm4vModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase)
            torch.testing.assert_close(out_embeds, out_ids)
@unittest.skip("Model checkpoint not yet released")
@require_torch
 class Glm4vIntegrationTest(unittest.TestCase):
    def setUp(self):
-        self.processor = AutoProcessor.from_pretrained("z")
+        self.processor = AutoProcessor.from_pretrained("THUDM/GLM-4.1V-9B-Thinking")
-        self.messages = [
+        self.message = [
            {
                "role": "user",
                "content": [
-                    {"type": "image"},
+                    {
                        "type": "image",
                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
                    },
                    {"type": "text", "text": "What kind of dog is this?"},
                ],
            }
        ]
        self.message2 = [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/coco_sample.png",
                    },
                    {"type": "text", "text": "What kind of dog is this?"},
                ],
            }
        ]
        url = "https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/Qwen2-VL/demo_small.jpg"
        self.image = Image.open(requests.get(url, stream=True).raw)
    def tearDown(self):
        gc.collect()
@@ -291,20 +298,20 @@ class Glm4vIntegrationTest(unittest.TestCase):
            "THUDM/GLM-4.1V-9B-Thinking", torch_dtype="auto", device_map="auto"
        )
-        text = self.processor.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True)
+        inputs = self.processor.apply_chat_template(
-        inputs = self.processor(text=[text], images=[self.image], return_tensors="pt")
+            self.message, tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt"
-
+        )
-        expected_input_ids = [151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 151652, 151655, 151655]  # fmt: skip
+        expected_input_ids = [151331, 151333, 151336, 198, 151339, 151343, 151343, 151343, 151343, 151343, 151343, 151343, 151343, 151343, 151343, 151343, 151343]  # fmt: skip
        assert expected_input_ids == inputs.input_ids[0].tolist()[:17]
        expected_pixel_slice = torch.tensor(
            [
-                [0.8792, 0.8792, 0.9084],
+                [-0.0988, -0.0842, -0.0842],
-                [1.1858, 1.1858, 1.2296],
+                [-0.5660, -0.5514, -0.4200],
-                [1.2004, 1.2004, 1.2150],
+                [-0.0259, -0.0259, -0.0259],
-                [1.4340, 1.4340, 1.4194],
+                [-0.1280, -0.0988, -0.2010],
-                [1.3902, 1.4048, 1.4194],
+                [-0.4638, -0.5806, -0.6974],
-                [1.5216, 1.5362, 1.5362],
+                [-1.2083, -1.2229, -1.2083],
            ],
            dtype=torch.float32,
            device="cpu",
@@ -315,8 +322,7 @@ class Glm4vIntegrationTest(unittest.TestCase):
        inputs = inputs.to(torch_device)
        output = model.generate(**inputs, max_new_tokens=30)
-        EXPECTED_DECODED_TEXT = "system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices"
+        EXPECTED_DECODED_TEXT = "\nWhat kind of dog is this?\n<think>Got it, let's look at the image. The animal in the picture is not a dog; it's a cat. Specifically, it looks"
        self.assertEqual(
            self.processor.decode(output[0], skip_special_tokens=True),
            EXPECTED_DECODED_TEXT,
@@ -327,17 +333,17 @@ class Glm4vIntegrationTest(unittest.TestCase):
        model = Glm4vForConditionalGeneration.from_pretrained(
            "THUDM/GLM-4.1V-9B-Thinking", torch_dtype="auto", device_map="auto"
        )
-        text = self.processor.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True)
+        batch_messages = [self.message] * 2
-        inputs = self.processor(text=[text, text], images=[self.image, self.image], return_tensors="pt").to(
+        inputs = self.processor.apply_chat_template(
-            torch_device
+            batch_messages, tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt"
-        )
+        ).to(torch_device)
        # it should not matter whether two images are the same size or not
        output = model.generate(**inputs, max_new_tokens=30)
        EXPECTED_DECODED_TEXT = [
-            'system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices',
+            "\nWhat kind of dog is this?\n<think>Got it, let's look at the image. The animal in the picture is not a dog; it's a cat. Specifically, it looks",
-            'system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices',
+            "\nWhat kind of dog is this?\n<think>Got it, let's look at the image. The animal in the picture is not a dog; it's a cat. Specifically, it looks"
        ]  # fmt: skip
        self.assertEqual(
            self.processor.batch_decode(output, skip_special_tokens=True),
@@ -349,15 +355,15 @@ class Glm4vIntegrationTest(unittest.TestCase):
        model = Glm4vForConditionalGeneration.from_pretrained(
            "THUDM/GLM-4.1V-9B-Thinking", torch_dtype="auto", device_map="auto"
        )
-        text = self.processor.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True)
+        inputs = self.processor.apply_chat_template(
-        inputs = self.processor(text=[text], images=[self.image], return_tensors="pt").to(torch_device)
+            self.message, tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt"
        ).to(torch_device)
-        output = model.generate(**inputs, max_new_tokens=30, num_return_sequences=3)
+        output = model.generate(**inputs, max_new_tokens=30, do_sample=False, num_beams=2, num_return_sequences=2)
        EXPECTED_DECODED_TEXT = [
-            'system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices',
+            "\nWhat kind of dog is this?\n<think>Got it, let's look at the image. The animal in the picture doesn't look like a dog; it's actually a cat. Specifically",
-            'system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices',
+            "\nWhat kind of dog is this?\n<think>Got it, let's look at the image. The animal in the picture doesn't look like a dog; it's actually a cat, specifically"
-            'system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices',
        ]  # fmt: skip
        self.assertEqual(
            self.processor.batch_decode(output, skip_special_tokens=True),
@@ -369,22 +375,25 @@ class Glm4vIntegrationTest(unittest.TestCase):
        model = Glm4vForConditionalGeneration.from_pretrained(
            "THUDM/GLM-4.1V-9B-Thinking", torch_dtype="auto", device_map="auto"
        )
-        text = self.processor.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True)
+        message_wo_image = [
-        messages2 = [
+            {"role": "user", "content": [{"type": "text", "text": "Who are you?"}]},
-            {"role": "system", "content": "You are a helpful assistant."},
-            {"role": "user", "content": "Who are you?"},
        ]
-        text2 = self.processor.apply_chat_template(messages2, tokenize=False, add_generation_prompt=True)
+        batched_messages = [self.message, message_wo_image]
-        inputs = self.processor(text=[text, text2], images=[self.image], padding=True, return_tensors="pt").to(
+        inputs = self.processor.apply_chat_template(
-            torch_device
+            batched_messages,
-        )
+            tokenize=True,
            add_generation_prompt=True,
            return_dict=True,
            return_tensors="pt",
            padding=True,
        ).to(torch_device)
        # it should not matter whether two images are the same size or not
        output = model.generate(**inputs, max_new_tokens=30)
        EXPECTED_DECODED_TEXT = [
-            'system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices',
+            "\nWhat kind of dog is this?\n<think>Got it, let's look at the image. The animal in the picture is not a dog; it's a cat. Specifically, it looks",
-            'system\nYou are a helpful assistant.\nuser\nWho are you?\nassistant\nI am a large language model created by Alibaba Cloud. I am called Qwen.'
+            '\nWho are you?\n<think>Got it, the user is asking "Who are you?" I need to respond appropriately. First, I should clarify that I\'m an AI assistant'
        ]  # fmt: skip
        self.assertEqual(
            self.processor.batch_decode(output, skip_special_tokens=True),
@@ -396,19 +405,22 @@ class Glm4vIntegrationTest(unittest.TestCase):
        model = Glm4vForConditionalGeneration.from_pretrained(
            "THUDM/GLM-4.1V-9B-Thinking", torch_dtype="auto", device_map="auto"
        )
-        text = self.processor.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True)
+        batched_messages = [self.message, self.message2]
-        text2 = self.processor.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True)
+        inputs = self.processor.apply_chat_template(
-        image2 = self.image.resize((224, 224))
+            batched_messages,
-        inputs = self.processor(text=[text, text2], images=[self.image, image2], padding=True, return_tensors="pt").to(
+            tokenize=True,
-            torch_device
+            add_generation_prompt=True,
-        )
+            return_dict=True,
            return_tensors="pt",
            padding=True,
        ).to(torch_device)
        # it should not matter whether two images are the same size or not
        output = model.generate(**inputs, max_new_tokens=30)
        EXPECTED_DECODED_TEXT = [
-            'system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices',
+            "\nWhat kind of dog is this?\n<think>Got it, let's look at the image. The animal in the picture has a stocky build, thick fur, and a face that's",
-            'system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular pets'
+            "\nWhat kind of dog is this?\n<think>Got it, let's look at the image. Wait, the animals here are cats, not dogs. The question is about a dog, but"
        ]  # fmt: skip
        self.assertEqual(
            self.processor.batch_decode(output, skip_special_tokens=True),
@@ -425,18 +437,23 @@ class Glm4vIntegrationTest(unittest.TestCase):
            attn_implementation="flash_attention_2",
            device_map="auto",
        )
-        text = self.processor.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True)
+        batched_messages = [self.message, self.message2]
-        inputs = self.processor(text=[text, text], images=[self.image, self.image], return_tensors="pt").to(
+        inputs = self.processor.apply_chat_template(
-            torch_device
+            batched_messages,
-        )
+            tokenize=True,
            add_generation_prompt=True,
            return_dict=True,
            return_tensors="pt",
            padding=True,
        ).to(torch_device)
        # it should not matter whether two images are the same size or not
        output = model.generate(**inputs, max_new_tokens=30)
        EXPECTED_DECODED_TEXT = [
-            "system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices",
+            "\nWhat kind of dog is this?\n<think>Got it, let's look at the image. The animal in the picture has a stocky build, thick fur, and a face that's",
-            "system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices",
+            "\nWhat kind of dog is this?\n<think>Got it, let's look at the image. Wait, the animals here are cats, not dogs. The question is about a dog, but"
-        ]
+        ]  # fmt: skip
        self.assertEqual(
            self.processor.batch_decode(output, skip_special_tokens=True),
            EXPECTED_DECODED_TEXT,
@@ -452,22 +469,25 @@ class Glm4vIntegrationTest(unittest.TestCase):
            attn_implementation="flash_attention_2",
            device_map="auto",
        )
-        text = self.processor.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True)
+        message_wo_image = [
-        messages2 = [
+            {"role": "user", "content": [{"type": "text", "text": "Who are you?"}]},
-            {"role": "system", "content": "You are a helpful assistant."},
-            {"role": "user", "content": "Who are you?"},
        ]
-        text2 = self.processor.apply_chat_template(messages2, tokenize=False, add_generation_prompt=True)
+        batched_messages = [self.message, message_wo_image]
-        inputs = self.processor(text=[text, text2], images=[self.image], padding=True, return_tensors="pt").to(
+        inputs = self.processor.apply_chat_template(
-            torch_device
+            batched_messages,
-        )
+            tokenize=True,
            add_generation_prompt=True,
            return_dict=True,
            return_tensors="pt",
            padding=True,
        ).to(torch_device)
        # it should not matter whether two images are the same size or not
        output = model.generate(**inputs, max_new_tokens=30)
        EXPECTED_DECODED_TEXT = [
-            'system\nYou are a helpful assistant.\nuser\nWhat kind of dog is this?\nassistant\nThe dog in the picture appears to be a Labrador Retriever. Labradors are known for their friendly and intelligent nature, making them popular choices',
+            "\nWhat kind of dog is this?\n<think>Got it, let's look at the image. The animal in the picture is not a dog; it's a cat. Specifically, it looks",
-            'system\nYou are a helpful assistant.\nuser\nWho are you?\nassistant\nI am a large language model created by Alibaba Cloud. I am called Qwen.'
+            '\nWho are you?\n<think>Got it, let\'s look at the question. The user is asking "Who are you?" which is a common question when someone meets an AI'
        ]  # fmt: skip
        self.assertEqual(
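
The updated tests build each conversation as a list of role/content dictionaries and pass a list of such conversations to `apply_chat_template`. The sketch below only constructs that nested structure. `image_turn` and `text_turn` are hypothetical helper names introduced here for illustration, and the processor call is shown as a comment because it needs the downloaded checkpoint.

```python
from typing import Any

Message = dict[str, Any]


def image_turn(url: str, question: str) -> list[Message]:
    """One-turn conversation with an image (referenced by URL) followed by a text question."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": url},
                {"type": "text", "text": question},
            ],
        }
    ]


def text_turn(question: str) -> list[Message]:
    """One-turn, text-only conversation using the same list-of-parts content layout."""
    return [{"role": "user", "content": [{"type": "text", "text": question}]}]


# A batch is a list of conversations; entries may mix image and text-only prompts.
batched_messages = [
    image_turn(
        "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
        "What kind of dog is this?",
    ),
    text_turn("Who are you?"),
]

# With a loaded processor, the tests in this diff tokenize the batch in one call:
# inputs = processor.apply_chat_template(
#     batched_messages, tokenize=True, add_generation_prompt=True,
#     return_dict=True, return_tensors="pt", padding=True,
# )
```

Because the image is referenced inside the message, the batch no longer needs a separate `images=[...]` argument, which is what lets the tests drop the manual image loading.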
--- a/tests/test_video_processing_common.py
+++ b/tests/test_video_processing_common.py
@@ -176,10 +176,12 @@ class VideoProcessingTestMixin:
        torch.compiler.reset()
        video_inputs = self.video_processor_tester.prepare_video_inputs(equal_resolution=False, return_tensors="torch")
        video_processor = self.fast_video_processing_class(**self.video_processor_dict)
-        output_eager = video_processor(video_inputs, device=torch_device, return_tensors="pt")
+        output_eager = video_processor(video_inputs, device=torch_device, do_sample_frames=False, return_tensors="pt")
        video_processor = torch.compile(video_processor, mode="reduce-overhead")
-        output_compiled = video_processor(video_inputs, device=torch_device, return_tensors="pt")
+        output_compiled = video_processor(
            video_inputs, device=torch_device, do_sample_frames=False, return_tensors="pt"
        )
        torch.testing.assert_close(
            output_eager[self.input_name], output_compiled[self.input_name], rtol=1e-4, atol=1e-4
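
The comparison above passes `rtol=1e-4` and `atol=1e-4`. As documented for `torch.testing.assert_close`, an element passes when `|actual - expected| <= atol + rtol * |expected|`. The scalar sketch below restates that rule in plain Python. `within_tolerance` is a hypothetical name, and the sketch ignores the tensor, dtype, and NaN handling that the real function performs.

```python
def within_tolerance(actual: float, expected: float, rtol: float = 1e-4, atol: float = 1e-4) -> bool:
    """Scalar version of the closeness rule: absolute slack plus slack relative to the expected value."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Allowed error around 1.0 is 1e-4 + 1e-4 * 1.0 = 2e-4.
print(within_tolerance(1.00005, 1.0))  # difference 5e-5, inside the bound
print(within_tolerance(1.001, 1.0))    # difference 1e-3, outside the bound
```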