Add MLLama (#33703)

* current changes * nit * Add cross_attenttion_mask to processor * multi-image fixed * Add cross_attenttion_mask to processor * cross attn works in all cases * WIP refactoring function for image processor * WIP refactoring image processor functions * Refactor preprocess to use global loops instead of list nested list comps * Docstrings * Add channels unification * fix dtype issues * Update docsrings and format * Consistent max_image_tiles * current script * updates * Add convert to rgb * Add image processor tests * updates! * update * god damn it I am dumb sometimes * Precompute aspect ratios * now this works, full match * fix 😉 * nits * style * fix model and conversion * nit * nit * kinda works * hack for sdpa non-contiguous bias * nits here and there * latest c hanges * merge? * run forward * Add aspect_ratio_mask * vision attention mask * update script and config variable names * nit * nits * be able to load * style * nits * there * nits * make forward run * small update * enable generation multi-turn * nit * nit * Clean up a bit for errors and typos * A bit more constant fixes * 90B keys and shapes match * Fix for 11B model * Fixup, remove debug part * Docs * Make max_aspect_ratio_id to be minimal * Update image processing code to match new implementation * Adjust conversion for final checkpoint state * Change dim in repeat_interleave (accordig to meta code) * tmp fix for num_tiles * Fix for conversion (gate<->up, q/k_proj rope permute) * nits * codestyle * Vision encoder fixes * pass cross attn mask further * Refactor aspect ratio mask * Disable text-only generation * Fix cross attention layers order, remove q/k norm rotation for cross atention layers * Refactor gated position embeddings * fix bugs but needs test with new weights * rope scaling should be llama3 * Fix rope scaling name * Remove debug for linear layer * fix copies * Make mask prepare private func * Remove linear patch embed * Make precomputed embeddings as nn.Embedding module * 
MllamaPrecomputedAspectRatioEmbedding with config init * Remove unused self.output_dim * nit, intermediate layers * Rename ln and pos_embed * vision_chunk_size -> image_size * return_intermediate -> intermediate_layers_indices * vision_input_dim -> hidden_size * Fix copied from statements * fix most tests * Fix more copied from * layer_id->layer_idx * Comment * Fix tests for processor * Copied from for _prepare_4d_causal_attention_mask_with_cache_position * Style fix * Add MllamaForCausalLM * WIP fixing tests * Remove duplicated layers * Remove dummy file * Fix style * Fix consistency * Fix some TODOs * fix language_model instantiation, add docstring * Move docstring, remove todos for precomputed embeds (we cannot init them properly) * Add initial docstrings * Fix * fix some tests * lets skip these * nits, remove print, style * Add one more copied from * Improve test message * Make validate func private * Fix dummy objects * Refactor `data_format` a bit + add comment * typos/nits Co-authored-by: Pablo Montalvo <39954772+molbap@users.noreply.github.com> * fix dummy objects and imports * Add chat template config json * remove num_kv_heads from vision attention * fix * move some commits and add more tests * fix test * Remove `update_key_name` from modeling utils * remove num-kv-heads again * some prelimiary docs * Update chat template + tests * nit, conversion script max_num_tiles from params * Fix warning for text-only generation * Update conversion script for instruct models * Update chat template in converstion + test * add tests for CausalLM model * model_max_length, avoid null chat_template * Refactor conversion script * Fix forward * Fix integration tests * Refactor vision config + docs * Fix default * Refactor text config * Doc fixes * Remove unused args, fix docs example * Squashed commit of the following: commit b51ce5a2efffbecdefbf6fc92ee87372ec9d8830 Author: qubvel <qubvel@gmail.com> Date: Wed Sep 18 13:39:15 2024 +0000 Move model + add output hidden states 
and output attentions * Fix num_channels * Add mllama text and mllama vision models * Fixing repo consistency * Style fix * Fixing repo consistency * Fixing unused config params * Fix failed tests after refactoring * hidden_activation -> hidden_act for text mlp * Remove from_pretrained from sub-configs * Apply suggestions from code review Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * Update src/transformers/models/mllama/convert_mllama_weights_to_hf.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * Reuse lambda in conversion script * Remove run.py * Update docs/source/en/model_doc/mllama.md Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * Update src/transformers/models/mllama/processing_mllama.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * Remove unused LlamaTokenizerFast * Fix logging * Refactor gating * Remove cycle for collecting intermediate states * Refactor text-only check, add integration test for text-only * Revert from pretrained to configs * Fix example * Add auto `bos_token` adding in processor * Fix tips * Update src/transformers/models/auto/tokenization_auto.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * Enable supports_gradient_checkpointing model flag * add eager/sdpa options * don't skip attn tests and bring back GC skips (did i really remove those?) 
* Fix signature, but get error with None gradient * Fix output attention tests * Disable GC back * Change no split modules * Fix dropout * Style * Add Mllama to sdpa list * Add post init for vision model * Refine config for MllamaForCausalLMModelTest and skipped tests for CausalLM model * if skipped, say it, don't pass * Clean vision tester config * Doc for args * Update tests/models/mllama/test_modeling_mllama.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * Add cross_attention_mask to test * typehint * Remove todo * Enable gradient checkpointing * Docstring * Style * Fixing and skipping some tests for new cache * Mark flaky test * Skip `test_sdpa_can_compile_dynamic` test * Fixing some offload tests * Add direct GenerationMixin inheritance * Remove unused code * Add initializer_range to vision config * update the test to make sure we show if split * fix gc? * Fix repo consistency * Undo modeling utils debug changes * Fix link * mllama -> Mllama * [mllama] -> [Mllama] * Enable compile test for CausalLM model (text-only) * Fix TextModel prefix * Update doc * Docs for forward, type hints, and vision model prefix * make sure to reset * fix init * small script refactor and styling * nit * updates! * some nits * Interpolate embeddings for 560 size and update integration tests * nit * does not suppor static cache! * update * fix * nit2 * this? 
* Fix conversion * Style * 4x memory improvement with image cache AFAIK * Token decorator for tests * Skip failing tests * update processor errors * fix split issues * style * weird * style * fix failing tests * update * nit fixing the whisper tests * fix path * update --------- Co-authored-by: raushan <raushan@huggingface.co> Co-authored-by: pavel <ubuntu@ip-10-90-0-11.ec2.internal> Co-authored-by: qubvel <qubvel@gmail.com> Co-authored-by: Pablo Montalvo <39954772+molbap@users.noreply.github.com> Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
2024-09-25 19:56:25 +02:00
parent 94f18cf23c
commit 19d58d31f1
31 changed files with 6183 additions and 98 deletions
--- a/tests/models/mllama/test_processor_mllama.py
+++ b/tests/models/mllama/test_processor_mllama.py
@@ -0,0 +1,179 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import unittest
+
+from transformers import MllamaProcessor
+from transformers.testing_utils import require_torch, require_vision
+from transformers.utils import is_vision_available
+
+
+if is_vision_available():
+    from PIL import Image
+
+
+@require_torch
+@require_vision
+class MllamaProcessorTest(unittest.TestCase):
+    def setUp(self):
+        self.checkpoint = "hf-internal-testing/mllama-11b"  # TODO: change
+        self.processor = MllamaProcessor.from_pretrained(self.checkpoint)
+        self.image1 = Image.new("RGB", (224, 220))
+        self.image2 = Image.new("RGB", (512, 128))
+        self.image_token = self.processor.image_token
+        self.image_token_id = self.processor.image_token_id
+        self.pad_token_id = self.processor.tokenizer.pad_token_id
+        self.bos_token = self.processor.bos_token
+        self.bos_token_id = self.processor.tokenizer.bos_token_id
+
+    def test_apply_chat_template(self):
+        # Messages whose content is a list mixing image placeholders and text entries
+        messages = [
+            {
+                "role": "user",
+                "content": [
+                    {"type": "image"},
+                    {"type": "image"},
+                    {"type": "text", "text": "What do these images show?"},
+                ],
+            },
+            {
+                "role": "assistant",
+                "content": [
+                    {"type": "text", "text": "The first image shows the statue of Liberty in New York."},
+                ],
+            },
+            {
+                "role": "user",
+                "content": [
+                    {"type": "text", "text": "And who is that?"},
+                ],
+            },
+        ]
+
+        rendered = self.processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
+
+        expected_rendered = (
+            "<|begin_of_text|>"
+            "<|start_header_id|>user<|end_header_id|>\n\n"
+            "<|image|><|image|>What do these images show?"
+            "<|eot_id|>"
+            "<|start_header_id|>assistant<|end_header_id|>\n\n"
+            "The first image shows the statue of Liberty in New York."
+            "<|eot_id|>"
+            "<|start_header_id|>user<|end_header_id|>\n\n"
+            "And who is that?"
+            "<|eot_id|>"
+            "<|start_header_id|>assistant<|end_header_id|>\n\n"
+        )
+        self.assertEqual(rendered, expected_rendered)
+
+        messages = [
+            {
+                "role": "system",
+                "content": [
+                    {"type": "text", "text": "This is a test sentence."},
+                ],
+            },
+            {
+                "role": "user",
+                "content": [
+                    {"type": "text", "text": "This is a response."},
+                ],
+            },
+        ]
+        input_ids = self.processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True)
+        expected_ids = [
+            128000,  # <|begin_of_text|>
+            128006,  # <|start_header_id|>
+            9125,  # "system"
+            128007,  # <|end_header_id|>
+            271,  # "\n\n"
+            2028,
+            374,
+            264,
+            1296,
+            11914,
+            13,  # "This is a test sentence."
+            128009,  # <|eot_id|>
+            128006,  # <|start_header_id|>
+            882,  # "user"
+            128007,  # <|end_header_id|>
+            271,  # "\n\n"
+            2028,
+            374,
+            264,
+            2077,
+            13,  # "This is a response."
+            128009,  # <|eot_id|>
+            128006,  # <|start_header_id|>
+            78191,  # "assistant"
+            128007,  # <|end_header_id|>
+            271,  # "\n\n"
+        ]
+
+        self.assertEqual(input_ids, expected_ids)
+
+        # test image in multiple locations
+        messages = [
+            {
+                "role": "user",
+                "content": [
+                    {"type": "text", "text": "Describe this image in two sentences"},
+                    {"type": "image"},
+                    {"type": "text", "text": " Test sentence   "},
+                    {"type": "image"},
+                    {"type": "text", "text": "ok\n"},
+                ],
+            }
+        ]
+
+        rendered = self.processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
+        expected_rendered = (
+            "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
+            "Describe this image in two sentences<|image|> Test sentence   <|image|>ok\n<|eot_id|>"
+            "<|start_header_id|>assistant<|end_header_id|>\n\n"
+        )
+        self.assertEqual(rendered, expected_rendered)
+
+        input_ids = self.processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True)
+        # fmt: off
+        expected_ids = [
+            128000, 128006, 882, 128007, 271, 75885, 420, 2217, 304, 1403, 23719, 128256,
+            3475, 11914, 262, 128256, 564, 198, 128009, 128006, 78191, 128007, 271,
+        ]
+        # fmt: on
+        self.assertEqual(input_ids, expected_ids)
+
+        # plain-string content must render the same as the equivalent list content
+        messages_list = [
+            {
+                "role": "user",
+                "content": [
+                    {"type": "image"},
+                    {"type": "text", "text": "Describe this image in two sentences"},
+                ],
+            }
+        ]
+        messages_str = [
+            {
+                "role": "user",
+                "content": "<|image|>Describe this image in two sentences",
+            }
+        ]
+
+        rendered_list = self.processor.apply_chat_template(messages_list, add_generation_prompt=True, tokenize=False)
+        rendered_str = self.processor.apply_chat_template(messages_str, add_generation_prompt=True, tokenize=False)
+        self.assertEqual(rendered_list, rendered_str)
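
The layout these assertions expect can be reproduced without downloading a checkpoint. The following is a minimal sketch, assuming the rendered format is exactly what the expected strings in the test show. `render_chat` is a hypothetical helper written for illustration only; it is not the Jinja chat template shipped with the processor, and the real template may handle whitespace or unsupported content types differently.

```python
# Hypothetical pure-Python stand-in for the chat template exercised by the test.
# It only mirrors the string layout asserted above; it is not transformers code.
IMAGE_TOKEN = "<|image|>"


def render_chat(messages, add_generation_prompt=True):
    """Render messages into the header/eot layout the test expects."""
    out = "<|begin_of_text|>"
    for message in messages:
        out += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        content = message["content"]
        if isinstance(content, str):
            # Plain-string content is emitted as-is (assumption: no trimming).
            out += content
        else:
            # List content: image entries become the image token,
            # text entries are concatenated without separators.
            for part in content:
                if part["type"] == "image":
                    out += IMAGE_TOKEN
                elif part["type"] == "text":
                    out += part["text"]
        out += "<|eot_id|>"
    if add_generation_prompt:
        # Open an assistant turn so generation continues from here.
        out += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return out


messages_list = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in two sentences"},
        ],
    }
]
messages_str = [
    {"role": "user", "content": "<|image|>Describe this image in two sentences"}
]

# Both content formats should yield the same prompt string.
assert render_chat(messages_list) == render_chat(messages_str)
```

Under these assumptions, the list form and the string form produce identical prompts, which is the property the last assertion in the test checks against the real processor.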