Add MM Grounding DINO (#37925)

* first commit

  Added modular implementation for MM Grounding DINO from starting point created by add-new-model-like. Added conversion script from mmdetection to huggingface.

  TODO: Some tests are failing so that needs to be fixed.
* fixed a bug with modular definition of MMGroundingDinoForObjectDetection where box and class heads were not correctly assigned to inner model
* cleaned up a hack in the conversion script
* Fixed the expected values in integration tests

  Cross att masking and cpu-gpu consistency tests are still failing however.
* changes for make style and quality
* add documentation
* clean up contrastive embedding
* add mm grounding dino to loss mapping
* add model link to config docstring
* hack fix for mm grounding dino consistency tests
* add special cases for unused config attr check
* add all models and update docs
* update model doc to the new style
* Use super_kwargs for modular config
* Move init to the _init_weights function
* Add copied from for tests
* fixup
* update typehints
* Fix-copies for tests
* fix-copies
* Fix init test
* fix snippets in docs
* fix consistency
* fix consistency
* update conversion script
* fix nits in readme and remove old comments from conversion script
* add license
* remove unused config args
* remove unnecessary if/else in model init
* fix quality
* Update references
* fix test
* fixup

---------

Co-authored-by: qubvel <qubvel@gmail.com>
2025-08-01 16:43:23 +02:00
parent 50145474b7
commit 3951d4ad5d
17 changed files with 4884 additions and 1 deletions
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -1051,6 +1051,8 @@
        title: Mistral3
      - local: model_doc/mllama
        title: mllama
      - local: model_doc/mm-grounding-dino
        title: MM Grounding DINO
      - local: model_doc/nougat
        title: Nougat
      - local: model_doc/omdet-turbo
--- a/docs/source/en/model_doc/mm-grounding-dino.md
+++ b/docs/source/en/model_doc/mm-grounding-dino.md
@@ -0,0 +1,124 @@
 <!--Copyright 2025 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
 rendered properly in your Markdown viewer.
 -->
 <div style="float: right;">
    <div class="flex flex-wrap space-x-1">
           <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
    </div>
 </div>
 # MM Grounding DINO
 [MM Grounding DINO](https://arxiv.org/abs/2401.02361) model was proposed in [An Open and Comprehensive Pipeline for Unified Object Grounding and Detection](https://arxiv.org/abs/2401.02361) by Xiangyu Zhao, Yicheng Chen, Shilin Xu, Xiangtai Li, Xinjiang Wang, Yining Li, Haian Huang.
 MM Grounding DINO improves upon [Grounding DINO](https://huggingface.co/docs/transformers/model_doc/grounding-dino) by refining the contrastive class head and removing the parameter sharing in the decoder, which raises zero-shot detection performance on both COCO (50.6(+2.2) AP) and LVIS (31.9(+11.8) val AP and 41.4(+12.6) minival AP).
 You can find all the original MM Grounding DINO checkpoints under the [MM Grounding DINO](https://huggingface.co/collections/openmmlab-community/mm-grounding-dino-688cbde05b814c4e2832f9df) collection. This model also supports LLMDet inference. You can find LLMDet checkpoints under the [LLMDet](https://huggingface.co/collections/iSEE-Laboratory/llmdet-688475906dc235d5f1dc678e) collection.
 > [!TIP]
 > Click on the MM Grounding DINO models in the right sidebar for more examples of how to apply MM Grounding DINO to different zero-shot object detection tasks.
 The example below demonstrates how to detect objects in an image from free-text labels with the [`AutoModelForZeroShotObjectDetection`] class.
 <hfoptions id="usage">
 <hfoption id="AutoModel">
 ```py
 import torch
 from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor
 from transformers.image_utils import load_image
 # Prepare processor and model
 model_id = "openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg_v3det"
 device = "cuda" if torch.cuda.is_available() else "cpu"
 processor = AutoProcessor.from_pretrained(model_id)
 model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)
 # Prepare inputs
 image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
 image = load_image(image_url)
 text_labels = [["a cat", "a remote control"]]
 inputs = processor(images=image, text=text_labels, return_tensors="pt").to(device)
 # Run inference
 with torch.no_grad():
    outputs = model(**inputs)
 # Postprocess outputs
 results = processor.post_process_grounded_object_detection(
    outputs,
    threshold=0.4,
    target_sizes=[(image.height, image.width)]
 )
 # Retrieve the first image result
 result = results[0]
 for box, score, labels in zip(result["boxes"], result["scores"], result["labels"]):
    box = [round(x, 2) for x in box.tolist()]
    print(f"Detected {labels} with confidence {round(score.item(), 3)} at location {box}")
 ```
 </hfoption>
 </hfoptions>
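The printing loop at the end of the example can be factored into a small helper. The sketch below is only an illustration: it assumes the per-image result has already been converted to plain Python lists (for instance with `.tolist()`), and `format_detections` and its `min_score` argument are hypothetical names that are not part of the library. The input values are made up and are not real model output.

```python
def format_detections(boxes, scores, labels, min_score=0.0):
    """Return one readable line per detection, highest score first."""
    rows = [
        (label, round(score, 3), [round(x, 2) for x in box])
        for box, score, label in zip(boxes, scores, labels)
        if score >= min_score
    ]
    # Sort on the rounded score so the most confident detection comes first
    rows.sort(key=lambda row: row[1], reverse=True)
    return [
        f"Detected {label} with confidence {score} at location {box}"
        for label, score, box in rows
    ]


# Made-up values for illustration only
lines = format_detections(
    boxes=[[10.0, 20.0, 110.0, 220.0], [5.127, 6.0, 50.0, 60.0]],
    scores=[0.45, 0.91234],
    labels=["a remote control", "a cat"],
    min_score=0.4,
)
for line in lines:
    print(line)
```

Keeping the formatting separate from inference makes it easy to apply a second, stricter score cut-off without rerunning the model.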
 ## Notes
 - Here's a table of models and their object detection performance results on COCO (results from [official repo](https://github.com/open-mmlab/mmdetection/blob/main/configs/mm_grounding_dino/README.md)):
    |                                                              Model                                                             | Backbone |      Pre-Train Data      |   Style   |  COCO mAP  |
    | ------------------------------------------------------------------------------------------------------------------------------ | -------- | ------------------------ | --------- | ---------- |
    |  [mm_grounding_dino_tiny_o365v1_goldg](https://huggingface.co/openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg)                       |  Swin-T  |        O365,GoldG        | Zero-shot | 50.4(+2.3) |
    |  [mm_grounding_dino_tiny_o365v1_goldg_grit](https://huggingface.co/openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg_grit)             |  Swin-T  |     O365,GoldG,GRIT      | Zero-shot | 50.5(+2.1) |
    |  [mm_grounding_dino_tiny_o365v1_goldg_v3det](https://huggingface.co/openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg_v3det)           |  Swin-T  |     O365,GoldG,V3Det     | Zero-shot | 50.6(+2.2) |
    |  [mm_grounding_dino_tiny_o365v1_goldg_grit_v3det](https://huggingface.co/openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg_grit_v3det) |  Swin-T  |  O365,GoldG,GRIT,V3Det   | Zero-shot | 50.4(+2.0) |
    |  [mm_grounding_dino_base_o365v1_goldg_v3det](https://huggingface.co/openmmlab-community/mm_grounding_dino_base_o365v1_goldg_v3det)           |  Swin-B  |     O365,GoldG,V3Det     | Zero-shot |    52.5    |
    |  [mm_grounding_dino_base_all](https://huggingface.co/openmmlab-community/mm_grounding_dino_base_all)                                         |  Swin-B  |         O365,ALL         |     -     |    59.5    |
    |  [mm_grounding_dino_large_o365v2_oiv6_goldg](https://huggingface.co/openmmlab-community/mm_grounding_dino_large_o365v2_oiv6_goldg)           |  Swin-L  | O365V2,OpenImageV6,GoldG | Zero-shot |    53.0    |
    |  [mm_grounding_dino_large_all](https://huggingface.co/openmmlab-community/mm_grounding_dino_large_all)                                       |  Swin-L  |  O365V2,OpenImageV6,ALL  |     -     |    60.3    |
 - Here's a table of MM Grounding DINO tiny models and their object detection performance on LVIS (results from [official repo](https://github.com/open-mmlab/mmdetection/blob/main/configs/mm_grounding_dino/README.md)):
    |                                                              Model                                                             |    Pre-Train Data     | MiniVal APr | MiniVal APc | MiniVal APf | MiniVal AP  | Val1.0 APr | Val1.0 APc | Val1.0 APf |  Val1.0 AP  |
    | ------------------------------------------------------------------------------------------------------------------------------ | --------------------- | ----------- | ----------- | ----------- | ----------- | ---------- | ---------- | ---------- | ----------- |
    |  [mm_grounding_dino_tiny_o365v1_goldg](https://huggingface.co/openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg)                       |      O365,GoldG       |    28.1     |    30.2     |    42.0     | 35.7(+6.9)  |    17.1    |    22.4    |    36.5    | 27.0(+6.9)  |
    |  [mm_grounding_dino_tiny_o365v1_goldg_grit](https://huggingface.co/openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg_grit)             |    O365,GoldG,GRIT    |    26.6     |    32.4     |    41.8     | 36.5(+7.7)  |    17.3    |    22.6    |    36.4    | 27.1(+7.0)  |
    |  [mm_grounding_dino_tiny_o365v1_goldg_v3det](https://huggingface.co/openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg_v3det)           |   O365,GoldG,V3Det    |    33.0     |    36.0     |    45.9     | 40.5(+11.7) |    21.5    |    25.5    |    40.2    | 30.6(+10.5) |
    |  [mm_grounding_dino_tiny_o365v1_goldg_grit_v3det](https://huggingface.co/openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg_grit_v3det) | O365,GoldG,GRIT,V3Det |    34.2     |    37.4     |    46.2     | 41.4(+12.6) |    23.6    |    27.6    |    40.5    | 31.9(+11.8) |
 - This implementation also supports inference for [LLMDet](https://github.com/iSEE-Laboratory/LLMDet). Here's a table of LLMDet models and their performance on LVIS (results from [official repo](https://github.com/iSEE-Laboratory/LLMDet)):
    |                             Model                         | Pre-Train Data            |  MiniVal APr | MiniVal APc | MiniVal APf | MiniVal AP  | Val1.0 APr | Val1.0 APc | Val1.0 APf |  Val1.0 AP  |
    | --------------------------------------------------------- | -------------------------------------------- | ------------ | ----------- | ----------- | ----------- | ---------- | ---------- | ---------- | ----------- |
    | [llmdet_tiny](https://huggingface.co/iSEE-Laboratory/llmdet_tiny)   | (O365,GoldG,GRIT,V3Det) + GroundingCap-1M    | 44.7         | 37.3        | 39.5        | 50.7        | 34.9       | 26.0       | 30.1       | 44.3        |
    | [llmdet_base](https://huggingface.co/iSEE-Laboratory/llmdet_base)   | (O365,GoldG,V3Det) + GroundingCap-1M         | 48.3         | 40.8        | 43.1        | 54.3        | 38.5       | 28.2       | 34.3       | 47.8        |
    | [llmdet_large](https://huggingface.co/iSEE-Laboratory/llmdet_large) | (O365V2,OpenImageV6,GoldG) + GroundingCap-1M | 51.1         | 45.1        | 46.1        | 56.6        | 42.0       | 31.6       | 38.8       | 50.2        |
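Several cells in the tables above combine a score with a parenthesised gain, such as `50.6(+2.2)`. The sketch below splits such a cell and recovers the implied baseline, reading the parenthesised value as the improvement over the Grounding DINO baseline, which is how the overview sentence uses it. `split_score` and `implied_baseline` are hypothetical helper names introduced only for this illustration.

```python
import re


def split_score(cell):
    """Split a cell such as '50.6(+2.2)' into (score, gain); gain is None when absent."""
    match = re.fullmatch(r"\s*([\d.]+)\s*(?:\(([+-]?[\d.]+)\))?\s*", cell)
    if match is None:
        raise ValueError(f"unrecognised cell: {cell!r}")
    score = float(match.group(1))
    gain = float(match.group(2)) if match.group(2) is not None else None
    return score, gain


def implied_baseline(cell):
    """Return score minus gain, rounded to one decimal, or None if no gain is given."""
    score, gain = split_score(cell)
    return None if gain is None else round(score - gain, 1)


print(implied_baseline("50.6(+2.2)"))   # COCO AP of the o365v1_goldg_v3det tiny model
print(implied_baseline("41.4(+12.6)"))  # LVIS minival AP of the grit_v3det tiny model
print(split_score("52.5"))              # cell without a reported gain
```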
 ## MMGroundingDinoConfig
 [[autodoc]] MMGroundingDinoConfig
 ## MMGroundingDinoModel
 [[autodoc]] MMGroundingDinoModel
    - forward
 ## MMGroundingDinoForObjectDetection
 [[autodoc]] MMGroundingDinoForObjectDetection
    - forward
--- a/src/transformers/loss/loss_utils.py
+++ b/src/transformers/loss/loss_utils.py
@@ -158,6 +158,7 @@ LOSS_MAPPING = {
    "ConditionalDetrForObjectDetection": DeformableDetrForObjectDetectionLoss,
    "DabDetrForObjectDetection": DeformableDetrForObjectDetectionLoss,
    "GroundingDinoForObjectDetection": GroundingDinoForObjectDetectionLoss,
    "MMGroundingDinoForObjectDetection": GroundingDinoForObjectDetectionLoss,
    "ConditionalDetrForSegmentation": DeformableDetrForSegmentationLoss,
    "RTDetrForObjectDetection": RTDetrForObjectDetectionLoss,
    "RTDetrV2ForObjectDetection": RTDetrForObjectDetectionLoss,
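The hunk above registers the new class name against the loss already used by Grounding DINO. A minimal sketch of that idea follows, assuming the mapping is consulted by class name; `LOSS_MAPPING_SKETCH`, `resolve_loss` and the placeholder loss are hypothetical stand-ins and do not reproduce the library's actual lookup code.

```python
def grounding_dino_loss_placeholder(*args, **kwargs):
    """Stand-in for GroundingDinoForObjectDetectionLoss; it computes nothing."""
    raise NotImplementedError("placeholder only")


# Both model classes point at the same loss function object
LOSS_MAPPING_SKETCH = {
    "GroundingDinoForObjectDetection": grounding_dino_loss_placeholder,
    "MMGroundingDinoForObjectDetection": grounding_dino_loss_placeholder,
}


def resolve_loss(model_class_name):
    """Look up the loss registered for a model class name."""
    try:
        return LOSS_MAPPING_SKETCH[model_class_name]
    except KeyError:
        raise KeyError(f"no loss registered for {model_class_name}") from None


shared = resolve_loss("MMGroundingDinoForObjectDetection") is resolve_loss(
    "GroundingDinoForObjectDetection"
)
print(shared)
```

Sharing one entry means any change to the Grounding DINO loss applies to both models without further edits.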
--- a/src/transformers/models/auto/configuration_auto.py
+++ b/src/transformers/models/auto/configuration_auto.py
@@ -244,6 +244,7 @@ CONFIG_MAPPING_NAMES = OrderedDict[str, str](
        ("mixtral", "MixtralConfig"),
        ("mlcd", "MLCDVisionConfig"),
        ("mllama", "MllamaConfig"),
        ("mm-grounding-dino", "MMGroundingDinoConfig"),
        ("mobilebert", "MobileBertConfig"),
        ("mobilenet_v1", "MobileNetV1Config"),
        ("mobilenet_v2", "MobileNetV2Config"),
@@ -657,6 +658,7 @@ MODEL_NAMES_MAPPING = OrderedDict[str, str](
        ("mlcd", "MLCD"),
        ("mllama", "Mllama"),
        ("mluke", "mLUKE"),
        ("mm-grounding-dino", "MM Grounding DINO"),
        ("mms", "MMS"),
        ("mobilebert", "MobileBERT"),
        ("mobilenet_v1", "MobileNetV1"),
--- a/src/transformers/models/auto/image_processing_auto.py
+++ b/src/transformers/models/auto/image_processing_auto.py
@@ -130,6 +130,7 @@ else:
            ("mistral3", ("PixtralImageProcessor", "PixtralImageProcessorFast")),
            ("mlcd", ("CLIPImageProcessor", "CLIPImageProcessorFast")),
            ("mllama", ("MllamaImageProcessor",)),
            ("mm-grounding-dino", ("GroundingDinoImageProcessor", "GroundingDinoImageProcessorFast")),
            ("mobilenet_v1", ("MobileNetV1ImageProcessor", "MobileNetV1ImageProcessorFast")),
            ("mobilenet_v2", ("MobileNetV2ImageProcessor", "MobileNetV2ImageProcessorFast")),
            ("mobilevit", ("MobileViTImageProcessor", "MobileViTImageProcessorFast")),
--- a/src/transformers/models/auto/modeling_auto.py
+++ b/src/transformers/models/auto/modeling_auto.py
@@ -233,6 +233,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
        ("mixtral", "MixtralModel"),
        ("mlcd", "MLCDVisionModel"),
        ("mllama", "MllamaModel"),
        ("mm-grounding-dino", "MMGroundingDinoModel"),
        ("mobilebert", "MobileBertModel"),
        ("mobilenet_v1", "MobileNetV1Model"),
        ("mobilenet_v2", "MobileNetV2Model"),
@@ -1057,6 +1058,7 @@ MODEL_FOR_ZERO_SHOT_OBJECT_DETECTION_MAPPING_NAMES = OrderedDict(
    [
        # Model for Zero Shot Object Detection mapping
        ("grounding-dino", "GroundingDinoForObjectDetection"),
        ("mm-grounding-dino", "MMGroundingDinoForObjectDetection"),
        ("omdet-turbo", "OmDetTurboForObjectDetection"),
        ("owlv2", "Owlv2ForObjectDetection"),
        ("owlvit", "OwlViTForObjectDetection"),
--- a/src/transformers/models/auto/processing_auto.py
+++ b/src/transformers/models/auto/processing_auto.py
@@ -100,6 +100,7 @@ PROCESSOR_MAPPING_NAMES = OrderedDict(
        ("mgp-str", "MgpstrProcessor"),
        ("mistral3", "PixtralProcessor"),
        ("mllama", "MllamaProcessor"),
        ("mm-grounding-dino", "GroundingDinoProcessor"),
        ("moonshine", "Wav2Vec2Processor"),
        ("oneformer", "OneFormerProcessor"),
        ("owlv2", "Owlv2Processor"),
--- a/src/transformers/models/auto/tokenization_auto.py
+++ b/src/transformers/models/auto/tokenization_auto.py
@@ -430,6 +430,7 @@ TOKENIZER_MAPPING_NAMES = OrderedDict[str, tuple[Optional[str], Optional[str]]](
        ),
        ("mllama", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
        ("mluke", ("MLukeTokenizer" if is_sentencepiece_available() else None, None)),
        ("mm-grounding-dino", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
        ("mobilebert", ("MobileBertTokenizer", "MobileBertTokenizerFast" if is_tokenizers_available() else None)),
        ("modernbert", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
        ("moonshine", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
--- a/src/transformers/models/mm_grounding_dino/__init__.py
+++ b/src/transformers/models/mm_grounding_dino/__init__.py
@@ -0,0 +1,27 @@
 # Copyright 2025 The HuggingFace Team. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 from typing import TYPE_CHECKING
 from ...utils import _LazyModule
 from ...utils.import_utils import define_import_structure
 if TYPE_CHECKING:
    from .configuration_mm_grounding_dino import *
    from .modeling_mm_grounding_dino import *
 else:
    import sys
    _file = globals()["__file__"]
    sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
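The `else` branch above replaces the package module with a lazy proxy so that the heavy submodules are imported only when one of their names is first used. The toy class below illustrates that general technique with the standard library only; `LazyModuleSketch` is a hypothetical name and this is not how `_LazyModule` is actually implemented.

```python
import importlib
import types


class LazyModuleSketch(types.ModuleType):
    """Toy module that imports the provider of an attribute on first access."""

    def __init__(self, name, attr_to_module):
        super().__init__(name)
        # Maps a public attribute name to the module that defines it
        self._attr_to_module = dict(attr_to_module)

    def __getattr__(self, attr):
        # Python calls __getattr__ only after normal lookup has failed
        providers = self.__dict__.get("_attr_to_module", {})
        if attr not in providers:
            raise AttributeError(f"module {self.__name__!r} has no attribute {attr!r}")
        value = getattr(importlib.import_module(providers[attr]), attr)
        setattr(self, attr, value)  # cache so later lookups skip __getattr__
        return value


lazy = LazyModuleSketch("lazy_demo", {"sqrt": "math", "OrderedDict": "collections"})
print("sqrt" in vars(lazy))  # nothing has been resolved yet
print(lazy.sqrt(9.0))        # first access triggers the import
print("sqrt" in vars(lazy))  # the resolved attribute is now cached
```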
--- a/src/transformers/models/mm_grounding_dino/configuration_mm_grounding_dino.py
+++ b/src/transformers/models/mm_grounding_dino/configuration_mm_grounding_dino.py
@@ -0,0 +1,292 @@
 #                🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
 #           This file was automatically generated from src/transformers/models/mm_grounding_dino/modular_mm_grounding_dino.py.
 #               Do NOT edit this file manually as any edits will be overwritten by the generation of
 #             the file from the modular. If any change should be done, please apply the change to the
 #                          modular_mm_grounding_dino.py file directly. One of our CI enforces this.
 #                🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
 # coding=utf-8
 # Copyright 2025 The HuggingFace Inc. team.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 from ...configuration_utils import PretrainedConfig
 from ...utils import logging
 from ...utils.backbone_utils import verify_backbone_config_arguments
 from ..auto import CONFIG_MAPPING
 logger = logging.get_logger(__name__)
 class MMGroundingDinoConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`MMGroundingDinoModel`]. It is used to instantiate a
    MM Grounding DINO model according to the specified arguments, defining the model architecture. Instantiating a
    configuration with the defaults will yield a similar configuration to that of the MM Grounding DINO tiny architecture
    [openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg_v3det](https://huggingface.co/openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg_v3det).
    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.
    Args:
        backbone_config (`PretrainedConfig` or `dict`, *optional*, defaults to `SwinConfig()`):
            The configuration of the backbone model.
        backbone (`str`, *optional*):
            Name of backbone to use when `backbone_config` is `None`. If `use_pretrained_backbone` is `True`, this
            will load the corresponding pretrained weights from the timm or transformers library. If `use_pretrained_backbone`
            is `False`, this loads the backbone's config and uses that to initialize the backbone with random weights.
        use_pretrained_backbone (`bool`, *optional*, defaults to `False`):
            Whether to use pretrained weights for the backbone.
        use_timm_backbone (`bool`, *optional*, defaults to `False`):
            Whether to load `backbone` from the timm library. If `False`, the backbone is loaded from the transformers
            library.
        backbone_kwargs (`dict`, *optional*):
            Keyword arguments to be passed to AutoBackbone when loading from a checkpoint
            e.g. `{'out_indices': (0, 1, 2, 3)}`. Cannot be specified if `backbone_config` is set.
        text_config (`Union[AutoConfig, dict]`, *optional*, defaults to `BertConfig`):
            The config object or dictionary of the text backbone.
        num_queries (`int`, *optional*, defaults to 900):
            Number of object queries, i.e. detection slots. This is the maximal number of objects
            [`MMGroundingDinoModel`] can detect in a single image.
        encoder_layers (`int`, *optional*, defaults to 6):
            Number of encoder layers.
        encoder_ffn_dim (`int`, *optional*, defaults to 2048):
            Dimension of the "intermediate" (often named feed-forward) layer in the encoder.
        encoder_attention_heads (`int`, *optional*, defaults to 8):
            Number of attention heads for each attention layer in the Transformer encoder.
        decoder_layers (`int`, *optional*, defaults to 6):
            Number of decoder layers.
        decoder_ffn_dim (`int`, *optional*, defaults to 2048):
            Dimension of the "intermediate" (often named feed-forward) layer in decoder.
        decoder_attention_heads (`int`, *optional*, defaults to 8):
            Number of attention heads for each attention layer in the Transformer decoder.
        is_encoder_decoder (`bool`, *optional*, defaults to `True`):
            Whether the model is used as an encoder/decoder or not.
        activation_function (`str` or `function`, *optional*, defaults to `"relu"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
            `"relu"`, `"silu"` and `"gelu_new"` are supported.
        d_model (`int`, *optional*, defaults to 256):
            Dimension of the layers.
        dropout (`float`, *optional*, defaults to 0.1):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities.
        activation_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for activations inside the fully connected layer.
        auxiliary_loss (`bool`, *optional*, defaults to `False`):
            Whether auxiliary decoding losses (loss at each decoder layer) are to be used.
        position_embedding_type (`str`, *optional*, defaults to `"sine"`):
            Type of position embeddings to be used on top of the image features. One of `"sine"` or `"learned"`.
        num_feature_levels (`int`, *optional*, defaults to 4):
            The number of input feature levels.
        encoder_n_points (`int`, *optional*, defaults to 4):
            The number of sampled keys in each feature level for each attention head in the encoder.
        decoder_n_points (`int`, *optional*, defaults to 4):
            The number of sampled keys in each feature level for each attention head in the decoder.
        two_stage (`bool`, *optional*, defaults to `True`):
            Whether to apply a two-stage deformable DETR, where the region proposals are also generated by a variant of
            Grounding DINO, which are further fed into the decoder for iterative bounding box refinement.
        class_cost (`float`, *optional*, defaults to 1.0):
            Relative weight of the classification error in the Hungarian matching cost.
        bbox_cost (`float`, *optional*, defaults to 5.0):
            Relative weight of the L1 error of the bounding box coordinates in the Hungarian matching cost.
        giou_cost (`float`, *optional*, defaults to 2.0):
            Relative weight of the generalized IoU loss of the bounding box in the Hungarian matching cost.
        bbox_loss_coefficient (`float`, *optional*, defaults to 5.0):
            Relative weight of the L1 bounding box loss in the object detection loss.
        giou_loss_coefficient (`float`, *optional*, defaults to 2.0):
            Relative weight of the generalized IoU loss in the object detection loss.
        focal_alpha (`float`, *optional*, defaults to 0.25):
            Alpha parameter in the focal loss.
        disable_custom_kernels (`bool`, *optional*, defaults to `False`):
            Disable the use of custom CUDA and CPU kernels. This option is necessary for the ONNX export, as custom
            kernels are not supported by PyTorch ONNX export.
        max_text_len (`int`, *optional*, defaults to 256):
            The maximum length of the text input.
        text_enhancer_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the text enhancer.
        fusion_droppath (`float`, *optional*, defaults to 0.1):
            The droppath ratio for the fusion module.
        fusion_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the fusion module.
        embedding_init_target (`bool`, *optional*, defaults to `True`):
            Whether to initialize the target with Embedding weights.
        query_dim (`int`, *optional*, defaults to 4):
            The dimension of the query vector.
        positional_embedding_temperature (`float`, *optional*, defaults to 20):
            The temperature for Sine Positional Embedding that is used together with vision backbone.
        init_std (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        layer_norm_eps (`float`, *optional*, defaults to 1e-05):
            The epsilon used by the layer normalization layers.
    Examples:
    ```python
    >>> from transformers import MMGroundingDinoConfig, MMGroundingDinoModel
    >>> # Initializing a MM Grounding DINO configuration
    >>> configuration = MMGroundingDinoConfig()
    >>> # Initializing a model (with random weights) from the configuration
    >>> model = MMGroundingDinoModel(configuration)
    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""
    model_type = "mm-grounding-dino"
    attribute_map = {
        "hidden_size": "d_model",
        "num_attention_heads": "encoder_attention_heads",
    }
    def __init__(
        self,
        backbone_config=None,
        backbone=None,
        use_pretrained_backbone=False,
        use_timm_backbone=False,
        backbone_kwargs=None,
        text_config=None,
        num_queries=900,
        encoder_layers=6,
        encoder_ffn_dim=2048,
        encoder_attention_heads=8,
        decoder_layers=6,
        decoder_ffn_dim=2048,
        decoder_attention_heads=8,
        is_encoder_decoder=True,
        activation_function="relu",
        d_model=256,
        dropout=0.1,
        attention_dropout=0.0,
        activation_dropout=0.0,
        auxiliary_loss=False,
        position_embedding_type="sine",
        num_feature_levels=4,
        encoder_n_points=4,
        decoder_n_points=4,
        two_stage=True,
        class_cost=1.0,
        bbox_cost=5.0,
        giou_cost=2.0,
        bbox_loss_coefficient=5.0,
        giou_loss_coefficient=2.0,
        focal_alpha=0.25,
        disable_custom_kernels=False,
        # other parameters
        max_text_len=256,
        text_enhancer_dropout=0.0,
        fusion_droppath=0.1,
        fusion_dropout=0.0,
        embedding_init_target=True,
        query_dim=4,
        positional_embedding_temperature=20,
        init_std=0.02,
        layer_norm_eps=1e-5,
        **kwargs,
    ):
        super().__init__(is_encoder_decoder=is_encoder_decoder, **kwargs)
        if backbone_config is None and backbone is None:
            logger.info("`backbone_config` is `None`. Initializing the config with the default `Swin` backbone.")
            backbone_config = CONFIG_MAPPING["swin"](
                window_size=7,
                image_size=224,
                embed_dim=96,
                depths=[2, 2, 6, 2],
                num_heads=[3, 6, 12, 24],
                out_indices=[2, 3, 4],
            )
        elif isinstance(backbone_config, dict):
            backbone_model_type = backbone_config.pop("model_type")
            config_class = CONFIG_MAPPING[backbone_model_type]
            backbone_config = config_class.from_dict(backbone_config)
        verify_backbone_config_arguments(
            use_timm_backbone=use_timm_backbone,
            use_pretrained_backbone=use_pretrained_backbone,
            backbone=backbone,
            backbone_config=backbone_config,
            backbone_kwargs=backbone_kwargs,
        )
        if text_config is None:
            text_config = {}
            logger.info("text_config is None. Initializing the text config with default values (`BertConfig`).")
        self.backbone_config = backbone_config
        self.backbone = backbone
        self.use_pretrained_backbone = use_pretrained_backbone
        self.use_timm_backbone = use_timm_backbone
        self.backbone_kwargs = backbone_kwargs
        self.num_queries = num_queries
        self.d_model = d_model
        self.encoder_ffn_dim = encoder_ffn_dim
        self.encoder_layers = encoder_layers
        self.encoder_attention_heads = encoder_attention_heads
        self.decoder_ffn_dim = decoder_ffn_dim
        self.decoder_layers = decoder_layers
        self.decoder_attention_heads = decoder_attention_heads
        self.dropout = dropout
        self.attention_dropout = attention_dropout
        self.activation_dropout = activation_dropout
        self.activation_function = activation_function
        self.auxiliary_loss = auxiliary_loss
        self.position_embedding_type = position_embedding_type
        # deformable attributes
        self.num_feature_levels = num_feature_levels
        self.encoder_n_points = encoder_n_points
        self.decoder_n_points = decoder_n_points
        self.two_stage = two_stage
        # Hungarian matcher
        self.class_cost = class_cost
        self.bbox_cost = bbox_cost
        self.giou_cost = giou_cost
        # Loss coefficients
        self.bbox_loss_coefficient = bbox_loss_coefficient
        self.giou_loss_coefficient = giou_loss_coefficient
        self.focal_alpha = focal_alpha
        self.disable_custom_kernels = disable_custom_kernels
        # Text backbone
        if isinstance(text_config, dict):
            text_config["model_type"] = text_config["model_type"] if "model_type" in text_config else "bert"
            text_config = CONFIG_MAPPING[text_config["model_type"]](**text_config)
        elif text_config is None:
            text_config = CONFIG_MAPPING["bert"]()
        self.text_config = text_config
        self.max_text_len = max_text_len
        # Text Enhancer
        self.text_enhancer_dropout = text_enhancer_dropout
        # Fusion
        self.fusion_droppath = fusion_droppath
        self.fusion_dropout = fusion_dropout
        # Others
        self.embedding_init_target = embedding_init_target
        self.query_dim = query_dim
        self.positional_embedding_temperature = positional_embedding_temperature
        self.init_std = init_std
        self.layer_norm_eps = layer_norm_eps
    @property
    def num_attention_heads(self) -> int:
        return self.encoder_attention_heads
    @property
    def hidden_size(self) -> int:
        return self.d_model
 __all__ = ["MMGroundingDinoConfig"]
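The `attribute_map` in the config above lets generic names such as `hidden_size` resolve to the model-specific `d_model`. The sketch below shows one way such aliasing can work in plain Python; `AliasedConfigSketch` is a hypothetical class written for illustration and is not the `PretrainedConfig` implementation.

```python
class AliasedConfigSketch:
    """Toy config whose generic attribute names are aliases of model-specific ones."""

    attribute_map = {
        "hidden_size": "d_model",
        "num_attention_heads": "encoder_attention_heads",
    }

    def __init__(self, d_model=256, encoder_attention_heads=8):
        self.d_model = d_model
        self.encoder_attention_heads = encoder_attention_heads

    def __getattr__(self, name):
        # Reached only when normal lookup fails, which is the case for alias names
        target = type(self).attribute_map.get(name)
        if target is None:
            raise AttributeError(name)
        return getattr(self, target)

    def __setattr__(self, name, value):
        # Writes to an alias are redirected to the underlying attribute
        super().__setattr__(type(self).attribute_map.get(name, name), value)


config = AliasedConfigSketch()
print(config.hidden_size)          # reads d_model through the alias
config.hidden_size = 512           # writes through the alias
print(config.d_model)
print(config.num_attention_heads)
```

Because the alias is never stored on the instance, the two names cannot drift apart.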
--- a/src/transformers/models/mm_grounding_dino/convert_mm_grounding_dino_to_hf.py
+++ b/src/transformers/models/mm_grounding_dino/convert_mm_grounding_dino_to_hf.py
@@ -0,0 +1,504 @@
 # coding=utf-8
 # Copyright 2025 The HuggingFace Inc. team.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 import argparse
 import re
 import requests
 import torch
 from PIL import Image
 from transformers.models.bert.tokenization_bert import BertTokenizer
 from transformers.models.grounding_dino.image_processing_grounding_dino import GroundingDinoImageProcessor
 from transformers.models.grounding_dino.processing_grounding_dino import GroundingDinoProcessor
 from transformers.models.mm_grounding_dino.configuration_mm_grounding_dino import MMGroundingDinoConfig
 from transformers.models.mm_grounding_dino.modeling_mm_grounding_dino import MMGroundingDinoForObjectDetection
 from transformers.models.swin.configuration_swin import SwinConfig
 MODEL_NAME_TO_CHECKPOINT_URL_MAPPING = {
    "mm_grounding_dino_tiny_o365v1_goldg": "https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg/grounding_dino_swin-t_pretrain_obj365_goldg_20231122_132602-4ea751ce.pth",
    "mm_grounding_dino_tiny_o365v1_goldg_grit": "https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_20231128_200818-169cc352.pth",
    "mm_grounding_dino_tiny_o365v1_goldg_v3det": "https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_v3det/grounding_dino_swin-t_pretrain_obj365_goldg_v3det_20231218_095741-e316e297.pth",
    "mm_grounding_dino_tiny_o365v1_goldg_grit_v3det": "https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth",
    "mm_grounding_dino_base_o365v1_goldg_v3det": "https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-b_pretrain_obj365_goldg_v3det/grounding_dino_swin-b_pretrain_obj365_goldg_v3de-f83eef00.pth",
    "mm_grounding_dino_base_all": "https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-b_pretrain_all/grounding_dino_swin-b_pretrain_all-f9818a7c.pth",
    "mm_grounding_dino_large_o365v2_oiv6_goldg": "https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-l_pretrain_obj365_goldg/grounding_dino_swin-l_pretrain_obj365_goldg-34dcdc53.pth",
    "mm_grounding_dino_large_all": "https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-l_pretrain_all/grounding_dino_swin-l_pretrain_all-56d69e78.pth",
    "llmdet_tiny": "https://huggingface.co/fushh7/LLMDet/resolve/main/tiny.pth?download=true",
    "llmdet_base": "https://huggingface.co/fushh7/LLMDet/resolve/main/base.pth?download=true",
    "llmdet_large": "https://huggingface.co/fushh7/LLMDet/resolve/main/large.pth?download=true",
 }
 MODEL_NAME_TO_EXPECTED_OUTPUT_MAPPING = {
    "mm_grounding_dino_tiny_o365v1_goldg": {
        "scores": torch.tensor([0.7722, 0.7584, 0.7984, 0.7163]),
        "boxes": torch.tensor(
            [
                [0.5212, 0.1594, 0.5792, 0.3895],
                [0.5424, 0.0513, 0.9996, 0.7757],
                [0.0629, 0.1526, 0.2746, 0.2447],
                [0.0091, 0.1127, 0.4945, 0.9911],
            ]
        ),
    },
    "mm_grounding_dino_tiny_o365v1_goldg_grit": {
        "scores": torch.tensor([0.7865, 0.7180, 0.7665, 0.8177]),
        "boxes": torch.tensor(
            [
                [0.0084, 0.1129, 0.4940, 0.9895],
                [0.5214, 0.1597, 0.5786, 0.3875],
                [0.5413, 0.0507, 0.9998, 0.7768],
                [0.0631, 0.1527, 0.2740, 0.2449],
            ]
        ),
    },
    "mm_grounding_dino_tiny_o365v1_goldg_v3det": {
        "scores": torch.tensor([0.5690, 0.5553, 0.6075, 0.5775]),
        "boxes": torch.tensor(
            [
                [0.5393, 0.0502, 0.9989, 0.7763],
                [0.0090, 0.1125, 0.4950, 0.9895],
                [0.5207, 0.1589, 0.5794, 0.3889],
                [0.0625, 0.1519, 0.2750, 0.2446],
            ]
        ),
    },
    "mm_grounding_dino_tiny_o365v1_goldg_grit_v3det": {
        "scores": torch.tensor([0.8381, 0.8204, 0.7970, 0.7175]),
        "boxes": torch.tensor(
            [
                [0.0099, 0.1129, 0.4942, 0.9903],
                [0.5413, 0.0506, 0.9998, 0.7753],
                [0.0626, 0.1527, 0.2744, 0.2443],
                [0.5211, 0.1596, 0.5790, 0.3890],
            ]
        ),
    },
    "mm_grounding_dino_base_o365v1_goldg_v3det": {
        "scores": torch.tensor([0.8418, 0.8364, 0.8342, 0.7885]),
        "boxes": torch.tensor(
            [
                [0.5427, 0.0502, 0.9996, 0.7770],
                [0.0628, 0.1529, 0.2747, 0.2448],
                [0.0085, 0.1132, 0.4947, 0.9898],
                [0.5208, 0.1597, 0.5787, 0.3910],
            ]
        ),
    },
    "mm_grounding_dino_base_all": {
        "scores": torch.tensor([0.4713]),
        "boxes": torch.tensor([[0.5423, 0.0507, 0.9998, 0.7761]]),
    },
    "mm_grounding_dino_large_o365v2_oiv6_goldg": {
        "scores": torch.tensor([0.7824, 0.8275, 0.7715, 0.8211]),
        "boxes": torch.tensor(
            [
                [0.0082, 0.1133, 0.4945, 0.9889],
                [0.5410, 0.0508, 0.9998, 0.7771],
                [0.0632, 0.1526, 0.2740, 0.2439],
                [0.5205, 0.1599, 0.5787, 0.3906],
            ]
        ),
    },
    "mm_grounding_dino_large_all": {
        "scores": torch.tensor([0.7373, 0.6208, 0.6913, 0.4523]),
        "boxes": torch.tensor(
            [
                [0.5424, 0.0509, 0.9997, 0.7765],
                [0.0632, 0.1529, 0.2744, 0.2447],
                [0.0121, 0.1125, 0.4947, 0.9884],
                [0.5206, 0.1597, 0.5789, 0.3933],
            ]
        ),
    },
    "llmdet_tiny": {
        "scores": torch.tensor([0.7262, 0.7552, 0.7656, 0.8207]),
        "boxes": torch.tensor(
            [
                [0.0114, 0.1132, 0.4947, 0.9854],
                [0.5387, 0.0513, 0.9992, 0.7765],
                [0.5212, 0.1605, 0.5788, 0.3890],
                [0.0634, 0.1536, 0.2743, 0.2440],
            ]
        ),
    },
    "llmdet_base": {
        "scores": torch.tensor([0.8646, 0.7567, 0.6978, 0.8084]),
        "boxes": torch.tensor(
            [
                [0.0632, 0.1529, 0.2745, 0.2438],
                [0.5420, 0.0512, 0.9989, 0.7774],
                [0.0110, 0.1134, 0.4950, 0.9875],
                [0.5209, 0.1602, 0.5789, 0.3908],
            ]
        ),
    },
    "llmdet_large": {
        "scores": torch.tensor([0.7107, 0.8626, 0.7458, 0.8166]),
        "boxes": torch.tensor(
            [
                [0.0147, 0.1128, 0.4957, 0.9858],
                [0.0634, 0.1528, 0.2744, 0.2447],
                [0.5414, 0.0511, 0.9997, 0.7776],
                [0.5209, 0.1602, 0.5792, 0.3916],
            ]
        ),
    },
 }
 # fmt: off
 ORIGINAL_TO_CONVERTED_KEY_MAPPING = {
    # vision backbone
    r"backbone.patch_embed.projection.(weight|bias)":                                                               r"model.backbone.conv_encoder.model.embeddings.patch_embeddings.projection.\1",
    r"backbone.patch_embed.norm.(weight|bias)":                                                                     r"model.backbone.conv_encoder.model.embeddings.norm.\1",
    r"backbone.stages.(\d+).blocks.(\d+).attn.w_msa.(relative_position_bias_table|relative_position_index)":        r"model.backbone.conv_encoder.model.encoder.layers.\1.blocks.\2.attention.self.\3",
    r"backbone.stages.(\d+).blocks.(\d+).norm1.(weight|bias)":                                                      r"model.backbone.conv_encoder.model.encoder.layers.\1.blocks.\2.layernorm_before.\3",
    r"backbone.stages.(\d+).blocks.(\d+).attn.w_msa.(query|key|value).(weight|bias)":                               r"model.backbone.conv_encoder.model.encoder.layers.\1.blocks.\2.attention.self.\3.\4",
    r"backbone.stages.(\d+).blocks.(\d+).attn.w_msa.proj.(weight|bias)":                                            r"model.backbone.conv_encoder.model.encoder.layers.\1.blocks.\2.attention.output.dense.\3",
    r"backbone.stages.(\d+).blocks.(\d+).norm2.(weight|bias)":                                                      r"model.backbone.conv_encoder.model.encoder.layers.\1.blocks.\2.layernorm_after.\3",
    r"backbone.stages.(\d+).blocks.(\d+).ffn.layers.0.0.(weight|bias)":                                             r"model.backbone.conv_encoder.model.encoder.layers.\1.blocks.\2.intermediate.dense.\3",
    r"backbone.stages.(\d+).blocks.(\d+).ffn.layers.1.(weight|bias)":                                               r"model.backbone.conv_encoder.model.encoder.layers.\1.blocks.\2.output.dense.\3",
    r"backbone.stages.(\d+).downsample.reduction.weight":                                                           r"model.backbone.conv_encoder.model.encoder.layers.\1.downsample.reduction.weight",
    r"backbone.stages.(\d+).downsample.norm.(weight|bias)":                                                         r"model.backbone.conv_encoder.model.encoder.layers.\1.downsample.norm.\2",
    r"backbone.norms.(\d+).(weight|bias)":                                                                            r"model.backbone.conv_encoder.model.hidden_states_norms.stage\1.\2",
    r"neck.convs.(\d+).conv.(weight|bias)":                                                                         r"model.input_proj_vision.\1.0.\2",
    r"neck.convs.(\d+).gn.(weight|bias)":                                                                           r"model.input_proj_vision.\1.1.\2",
    r"neck.extra_convs.(\d+).conv.(weight|bias)":                                                                   r"model.input_proj_vision.\1.0.\2",
    r"neck.extra_convs.(\d+).gn.(weight|bias)":                                                                     r"model.input_proj_vision.\1.1.\2",
    # text backbone
    r"language_model.language_backbone.body.model.(.*)":                                                            r"model.text_backbone.\1",
    r"text_feat_map.(weight|bias)":                                                                                 r"model.text_projection.\1",
    # encoder
    r"encoder.fusion_layers.(\d+).gamma_v":                                                                         r"model.encoder.layers.\1.fusion_layer.vision_param",
    r"encoder.fusion_layers.(\d+).gamma_l":                                                                         r"model.encoder.layers.\1.fusion_layer.text_param",
    r"encoder.fusion_layers.(\d+).layer_norm_v.(weight|bias)":                                                      r"model.encoder.layers.\1.fusion_layer.layer_norm_vision.\2",
    r"encoder.fusion_layers.(\d+).attn.v_proj.(weight|bias)":                                                       r"model.encoder.layers.\1.fusion_layer.attn.vision_proj.\2",
    r"encoder.fusion_layers.(\d+).attn.values_v_proj.(weight|bias)":                                                r"model.encoder.layers.\1.fusion_layer.attn.values_vision_proj.\2",
    r"encoder.fusion_layers.(\d+).attn.out_v_proj.(weight|bias)":                                                   r"model.encoder.layers.\1.fusion_layer.attn.out_vision_proj.\2",
    r"encoder.fusion_layers.(\d+).layer_norm_l.(weight|bias)":                                                      r"model.encoder.layers.\1.fusion_layer.layer_norm_text.\2",
    r"encoder.fusion_layers.(\d+).attn.l_proj.(weight|bias)":                                                       r"model.encoder.layers.\1.fusion_layer.attn.text_proj.\2",
    r"encoder.fusion_layers.(\d+).attn.values_l_proj.(weight|bias)":                                                r"model.encoder.layers.\1.fusion_layer.attn.values_text_proj.\2",
    r"encoder.fusion_layers.(\d+).attn.out_l_proj.(weight|bias)":                                                   r"model.encoder.layers.\1.fusion_layer.attn.out_text_proj.\2",
    r"encoder.layers.(\d+).self_attn.(sampling_offsets|attention_weights|value_proj|output_proj).(weight|bias)":    r"model.encoder.layers.\1.deformable_layer.self_attn.\2.\3",
    r"encoder.layers.(\d+).norms.0.(weight|bias)":                                                                  r"model.encoder.layers.\1.deformable_layer.self_attn_layer_norm.\2",
    r"encoder.layers.(\d+).ffn.layers.0.0.(weight|bias)":                                                           r"model.encoder.layers.\1.deformable_layer.fc1.\2",
    r"encoder.layers.(\d+).ffn.layers.1.(weight|bias)":                                                             r"model.encoder.layers.\1.deformable_layer.fc2.\2",
    r"encoder.layers.(\d+).norms.1.(weight|bias)":                                                                  r"model.encoder.layers.\1.deformable_layer.final_layer_norm.\2",
    r"encoder.text_layers.(\d+).self_attn.attn.(query|key|value)_proj_(weight|bias)":                               r"model.encoder.layers.\1.text_enhancer_layer.self_attn.\2.\3",
    r"encoder.text_layers.(\d+).self_attn.attn.out_proj.(weight|bias)":                                             r"model.encoder.layers.\1.text_enhancer_layer.self_attn.out_proj.\2",
    r"encoder.text_layers.(\d+).norms.0.(weight|bias)":                                                             r"model.encoder.layers.\1.text_enhancer_layer.layer_norm_before.\2",
    r"encoder.text_layers.(\d+).ffn.layers.0.0.(weight|bias)":                                                      r"model.encoder.layers.\1.text_enhancer_layer.fc1.\2",
    r"encoder.text_layers.(\d+).ffn.layers.1.(weight|bias)":                                                        r"model.encoder.layers.\1.text_enhancer_layer.fc2.\2",
    r"encoder.text_layers.(\d+).norms.1.(weight|bias)":                                                             r"model.encoder.layers.\1.text_enhancer_layer.layer_norm_after.\2",
    r"encoder.bbox_head.cls_branch.bias":                                                                           r"model.encoder_output_class_embed.bias",
    r"encoder.bbox_head.reg_branch.0.(weight|bias)":                                                                r"model.encoder_output_bbox_embed.layers.0.\1",
    r"encoder.bbox_head.reg_branch.2.(weight|bias)":                                                                r"model.encoder_output_bbox_embed.layers.1.\1",
    r"encoder.bbox_head.reg_branch.4.(weight|bias)":                                                                r"model.encoder_output_bbox_embed.layers.2.\1",
    # decoder
    r"decoder.norm.(weight|bias)":                                                                                  r"model.decoder.layer_norm.\1",
    r"decoder.ref_point_head.layers.(\d+).(weight|bias)":                                                           r"model.decoder.reference_points_head.layers.\1.\2",
    r"decoder.layers.(\d+).self_attn.attn.(query|key|value)_proj_(weight|bias)":                                    r"model.decoder.layers.\1.self_attn.\2.\3",
    r"decoder.layers.(\d+).self_attn.attn.out_proj.(weight|bias)":                                                  r"model.decoder.layers.\1.self_attn.out_proj.\2",
    r"decoder.layers.(\d+).norms.0.(weight|bias)":                                                                  r"model.decoder.layers.\1.self_attn_layer_norm.\2",
    r"decoder.layers.(\d+).cross_attn_text.attn.(query|key|value)_proj_(weight|bias)":                              r"model.decoder.layers.\1.encoder_attn_text.\2.\3",
    r"decoder.layers.(\d+).cross_attn_text.attn.out_proj.(weight|bias)":                                            r"model.decoder.layers.\1.encoder_attn_text.out_proj.\2",
    r"decoder.layers.(\d+).norms.1.(weight|bias)":                                                                  r"model.decoder.layers.\1.encoder_attn_text_layer_norm.\2",
    r"decoder.layers.(\d+).cross_attn.(sampling_offsets|attention_weights|value_proj|output_proj).(weight|bias)":   r"model.decoder.layers.\1.encoder_attn.\2.\3",
    r"decoder.layers.(\d+).norms.2.(weight|bias)":                                                                  r"model.decoder.layers.\1.encoder_attn_layer_norm.\2",
    r"decoder.layers.(\d+).ffn.layers.0.0.(weight|bias)":                                                           r"model.decoder.layers.\1.fc1.\2",
    r"decoder.layers.(\d+).ffn.layers.1.(weight|bias)":                                                             r"model.decoder.layers.\1.fc2.\2",
    r"decoder.layers.(\d+).norms.3.(weight|bias)":                                                                  r"model.decoder.layers.\1.final_layer_norm.\2",
    r"decoder.bbox_head.cls_branches.(\d+).bias":                                                                   r"model.decoder.class_embed.\1.bias",
    r"decoder.bbox_head.reg_branches.(\d+).0.(weight|bias)":                                                        r"model.decoder.bbox_embed.\1.layers.0.\2",
    r"decoder.bbox_head.reg_branches.(\d+).2.(weight|bias)":                                                        r"model.decoder.bbox_embed.\1.layers.1.\2",
    r"decoder.bbox_head.reg_branches.(\d+).4.(weight|bias)":                                                        r"model.decoder.bbox_embed.\1.layers.2.\2",
    # other
    r"level_embed":                                                                                                 r"model.level_embed",
    r"query_embedding.weight":                                                                                      r"model.query_position_embeddings.weight",
    r"memory_trans_fc.(weight|bias)":                                                                               r"model.enc_output.\1",
    r"memory_trans_norm.(weight|bias)":                                                                             r"model.enc_output_norm.\1",
    r"bbox_head.cls_branches.(\d+).bias":                                                                           r"class_embed.\1.bias",
    r"bbox_head.reg_branches.(\d+).0.(weight|bias)":                                                                r"bbox_embed.\1.layers.0.\2",
    r"bbox_head.reg_branches.(\d+).2.(weight|bias)":                                                                r"bbox_embed.\1.layers.1.\2",
    r"bbox_head.reg_branches.(\d+).4.(weight|bias)":                                                                r"bbox_embed.\1.layers.2.\2",
 }
 # fmt: on
 def get_mm_grounding_dino_config(model_name: str) -> MMGroundingDinoConfig:
    if "tiny" in model_name:
        swin_image_size = 224
        swin_window_size = 7
        swin_embed_dim = 96
        swin_depths = (2, 2, 6, 2)
        swin_num_heads = (3, 6, 12, 24)
        swin_out_features = ["stage2", "stage3", "stage4"]
        num_feature_levels = 4
    elif "base" in model_name:
        swin_image_size = 384
        swin_window_size = 12
        swin_embed_dim = 128
        swin_depths = (2, 2, 18, 2)
        swin_num_heads = (4, 8, 16, 32)
        swin_out_features = ["stage2", "stage3", "stage4"]
        num_feature_levels = 4
    elif "large" in model_name:
        swin_image_size = 384
        swin_window_size = 12
        swin_embed_dim = 192
        swin_depths = (2, 2, 18, 2)
        swin_num_heads = (6, 12, 24, 48)
        swin_out_features = ["stage1", "stage2", "stage3", "stage4"]
        num_feature_levels = 5
    else:
        raise ValueError(
            f"Model name: {model_name} is not supported. Only `tiny`, `base` and `large` models are currently supported."
        )
    backbone_config = SwinConfig(
        image_size=swin_image_size,
        window_size=swin_window_size,
        embed_dim=swin_embed_dim,
        depths=swin_depths,
        num_heads=swin_num_heads,
        out_features=swin_out_features,
    )
    model_config = MMGroundingDinoConfig(
        backbone_config=backbone_config,
        num_feature_levels=num_feature_levels,
    )
    return model_config
 def get_mm_grounding_dino_processor() -> GroundingDinoProcessor:
    img_processor = GroundingDinoImageProcessor()
    txt_processor = BertTokenizer.from_pretrained("bert-base-uncased")
    processor = GroundingDinoProcessor(img_processor, txt_processor)
    return processor
 # Copied from: https://github.com/iSEE-Laboratory/LLMDet/blob/96ec8c82a9d97b170db759e043afd5b81445d0f1/hf_model/mmdet2groundingdino_swint.py#L8C1-L13C13
 def correct_unfold_reduction_order(x: torch.Tensor) -> torch.Tensor:
    out_channel, in_channel = x.shape
    x = x.reshape(out_channel, in_channel // 4, 4).transpose(1, 2)
    x = x[:, [0, 2, 1, 3], :]
    x = x.reshape(out_channel, in_channel)
    return x
 # Copied from: https://github.com/iSEE-Laboratory/LLMDet/blob/96ec8c82a9d97b170db759e043afd5b81445d0f1/hf_model/mmdet2groundingdino_swint.py#L15C1-L20C13
 def correct_unfold_norm_order(x: torch.Tensor) -> torch.Tensor:
    in_channel = x.shape[0]
    x = x.reshape(in_channel // 4, 4).transpose(0, 1)
    x = x[[0, 2, 1, 3], :]
    x = x.reshape(in_channel)
    return x
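# Illustrative sketch, not part of the conversion script: both `correct_unfold_*` helpers above apply the same
# permutation along the channel axis. The hypothetical helper `unfold_permutation` below restates that permutation
# in plain Python (assumption: the channel count is divisible by 4) so it is easy to see which source channel
# ends up at which output position.

```python
def unfold_permutation(num_channels: int) -> list[int]:
    """Return, for each output position, the index of the source channel.

    Mirrors reshape(n // 4, 4) -> transpose -> select rows [0, 2, 1, 3] -> flatten.
    """
    if num_channels % 4 != 0:
        raise ValueError("num_channels must be divisible by 4")
    groups = num_channels // 4
    order = []
    # After the transpose, row `offset` holds channels offset, offset + 4, offset + 8, ...
    # Selecting rows [0, 2, 1, 3] swaps the two middle rows.
    for offset in (0, 2, 1, 3):
        for group in range(groups):
            order.append(group * 4 + offset)
    return order


permutation = unfold_permutation(8)
print(permutation)
```

# The result is always a permutation of range(num_channels): no channel is dropped or duplicated.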
 def preprocess_old_state(state_dict: dict, config: MMGroundingDinoConfig) -> dict:
    """
    Preprocesses old state dict to enable 1-1 mapping:
        - split qkv projections in Swin backbone
        - reorder reduction and norm parameters in Swin backbone
        - shift output norm indices in Swin backbone
        - shift output proj indices in neck
        - split q,k,v projections in text self and cross attentions in encoder and decoder
        - duplicate detection head parameters for decoder and encoder
    """
    new_state_dict = state_dict.copy()
    for k in state_dict:
        if k.startswith("backbone"):
            if "downsample.reduction" in k:
                new_state_dict[k] = correct_unfold_reduction_order(new_state_dict.pop(k))
            elif "downsample.norm" in k:
                new_state_dict[k] = correct_unfold_norm_order(new_state_dict.pop(k))
            elif "w_msa.qkv" in k:
                q_param, k_param, v_param = new_state_dict.pop(k).chunk(3)
                new_state_dict[k.replace("qkv", "query")] = q_param
                new_state_dict[k.replace("qkv", "key")] = k_param
                new_state_dict[k.replace("qkv", "value")] = v_param
            elif "backbone.norm" in k:
                match = re.match(r"backbone.norm(\d+).(weight|bias)", k)
                new_state_dict[f"backbone.norms.{int(match.group(1)) + 1}.{match.group(2)}"] = new_state_dict.pop(k)
        elif k.startswith("neck.extra_convs"):
            num_normal_convs = len(config.backbone_config.out_indices)
            if "gn" in k:
                match = re.match(r"neck.extra_convs.(\d+).gn.(weight|bias)", k)
                new_state_dict[f"neck.extra_convs.{num_normal_convs + int(match.group(1))}.gn.{match.group(2)}"] = (
                    new_state_dict.pop(k)
                )
            elif "conv" in k:
                match = re.match(r"neck.extra_convs.(\d+).conv.(weight|bias)", k)
                new_state_dict[f"neck.extra_convs.{num_normal_convs + int(match.group(1))}.conv.{match.group(2)}"] = (
                    new_state_dict.pop(k)
                )
        elif k.startswith("encoder"):
            if "self_attn.attn.in_proj" in k:
                q_param, k_param, v_param = new_state_dict.pop(k).chunk(3)
                new_state_dict[k.replace("in_proj", "query_proj")] = q_param
                new_state_dict[k.replace("in_proj", "key_proj")] = k_param
                new_state_dict[k.replace("in_proj", "value_proj")] = v_param
        elif k.startswith("decoder"):
            if "self_attn.attn.in_proj" in k or "cross_attn_text.attn.in_proj" in k:
                q_param, k_param, v_param = new_state_dict.pop(k).chunk(3)
                new_state_dict[k.replace("in_proj", "query_proj")] = q_param
                new_state_dict[k.replace("in_proj", "key_proj")] = k_param
                new_state_dict[k.replace("in_proj", "value_proj")] = v_param
        elif k.startswith("bbox_head"):
            num_decoder_layers = config.decoder_layers
            match = re.match(r"bbox_head.(cls|reg)_branches.(\d+).(.*)", k)
            cls_or_reg = match.group(1)
            layer_idx = int(match.group(2))
            suffix = match.group(3)
            if layer_idx < num_decoder_layers:
                new_key = f"decoder.bbox_head.{cls_or_reg}_branches.{layer_idx}.{suffix}"
                new_state_dict[new_key] = new_state_dict[k]  # copy
            else:
                new_key = f"encoder.bbox_head.{cls_or_reg}_branch.{suffix}"
                new_state_dict[new_key] = new_state_dict.pop(k)  # move
        # remove unused params
        if (
            k == "dn_query_generator.label_embedding.weight"
            or k == "language_model.language_backbone.body.model.embeddings.position_ids"
            or k == "image_seperate.weight"
            or k.startswith("lmm")
            or k.startswith("connector")
            or k.startswith("region_connector")
            or k.startswith("ref_point_head")
        ):
            new_state_dict.pop(k)
    return new_state_dict
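# Illustrative sketch, not part of the conversion script: the fused `in_proj` parameters handled above stack the
# query, key and value blocks along the first dimension. The hypothetical helper `split_fused_projection` shows
# the same three-way split and key renaming on a plain Python list (assumption: the number of rows is divisible
# by 3, which is what makes the three chunks equal in size).

```python
def split_fused_projection(key: str, rows: list) -> dict:
    """Split a fused in_proj parameter into separate query/key/value entries."""
    if len(rows) % 3 != 0:
        raise ValueError("a fused projection must stack three equally sized blocks")
    size = len(rows) // 3
    chunks = {
        "query": rows[:size],
        "key": rows[size : 2 * size],
        "value": rows[2 * size :],
    }
    # "...attn.in_proj_bias" becomes "...attn.query_proj_bias", and so on.
    return {key.replace("in_proj", f"{name}_proj"): chunk for name, chunk in chunks.items()}


fused = {"decoder.layers.0.self_attn.attn.in_proj_bias": [1, 2, 3, 4, 5, 6]}
split = {}
for fused_key, fused_rows in fused.items():
    split.update(split_fused_projection(fused_key, fused_rows))
print(split)
```

# The renamed keys match the `(query|key|value)_proj_(weight|bias)` patterns in the key mapping.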
 # Copied from transformers/models/siglip2/convert_siglip2_to_hf.py
 def convert_old_keys_to_new_keys(state_dict_keys: list) -> dict:
    """
    This function should be applied only once, on the concatenated keys to efficiently rename using
    the key mappings.
    """
    output_dict = {}
    if state_dict_keys is not None:
        old_text = "\n".join(state_dict_keys)
        new_text = old_text
        for pattern, replacement in ORIGINAL_TO_CONVERTED_KEY_MAPPING.items():
            if replacement is None:
                new_text = re.sub(pattern, "", new_text)  # an empty line
                continue
            new_text = re.sub(pattern, replacement, new_text)
        output_dict = dict(zip(old_text.split("\n"), new_text.split("\n")))
    return output_dict
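# Illustrative sketch, not part of the conversion script: the function above joins all keys into one string so
# that each regex runs once over the whole text instead of once per key. The example below repeats that idea
# with two entries copied from ORIGINAL_TO_CONVERTED_KEY_MAPPING; `EXAMPLE_KEY_MAPPING` and `rename_keys` are
# hypothetical names used only here.

```python
import re

# Two entries copied from the full mapping, for illustration only.
EXAMPLE_KEY_MAPPING = {
    r"text_feat_map.(weight|bias)": r"model.text_projection.\1",
    r"decoder.norm.(weight|bias)": r"model.decoder.layer_norm.\1",
}


def rename_keys(keys: list[str]) -> dict[str, str]:
    """Map each original key to its renamed form with one regex pass per pattern."""
    old_text = "\n".join(keys)
    new_text = old_text
    for pattern, replacement in EXAMPLE_KEY_MAPPING.items():
        new_text = re.sub(pattern, replacement, new_text)
    # Line i of the old text corresponds to line i of the new text.
    return dict(zip(old_text.split("\n"), new_text.split("\n")))


renamed = rename_keys(["text_feat_map.weight", "decoder.norm.bias", "level_embed"])
print(renamed)
```

# Keys that match no pattern, such as "level_embed" in this reduced mapping, pass through unchanged.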
 def convert_mm_to_hf_state(original_state: dict, hf_cfg: MMGroundingDinoConfig) -> dict:
    original_state = preprocess_old_state(original_state, hf_cfg)
    original_state_keys = list(original_state.keys())
    original_to_hf_key_map = convert_old_keys_to_new_keys(original_state_keys)
    hf_state = {}
    for original_key in original_state_keys:
        hf_key = original_to_hf_key_map[original_key]
        hf_state[hf_key] = original_state.pop(original_key)
    return hf_state
 def prepare_test_inputs():
    image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(image_url, stream=True).raw)
    text = [["cat", "remote"]]
    return image, text
@torch.no_grad()
 def convert_mm_grounding_dino_checkpoint(
    model_name: str,
    verify_outputs: bool,
    push_to_hub: bool,
    hub_user_name: str,
 ) -> tuple[MMGroundingDinoConfig, dict]:
    # Load original state
    checkpoint_url = MODEL_NAME_TO_CHECKPOINT_URL_MAPPING[model_name]
    print(f"Loading checkpoint from: {checkpoint_url}")
    ckpt = torch.hub.load_state_dict_from_url(checkpoint_url, map_location="cpu")
    mm_state = ckpt["state_dict"]
    # Create hf model and processor
    print("Creating model...")
    hf_cfg = get_mm_grounding_dino_config(model_name)
    hf_state = convert_mm_to_hf_state(mm_state, hf_cfg)
    hf_model = MMGroundingDinoForObjectDetection(hf_cfg).eval()
    hf_model.load_state_dict(hf_state)
    hf_processor = get_mm_grounding_dino_processor()
    # Verify outputs if needed
    if verify_outputs:
        print("Running inference to verify outputs...")
        image, text = prepare_test_inputs()
        model_inputs = hf_processor(images=image, text=text, return_tensors="pt")
        model_outputs = hf_model(**model_inputs)
        results = hf_processor.post_process_grounded_object_detection(
            model_outputs,
            model_inputs.input_ids,
            box_threshold=0.4,
            text_threshold=0.3,
        )
        result = results[0]
        print(result)
        expected = MODEL_NAME_TO_EXPECTED_OUTPUT_MAPPING[model_name]
        for key in expected:
            torch.testing.assert_close(result[key], expected[key], atol=1e-3, rtol=1e-3)
        print("Outputs match.")
    # Push to hub if needed
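    if push_to_hub:
        if not hub_user_name:
            raise ValueError("`--hub-user-name` must be provided when `--push-to-hub` is set.")
        print("Pushing to hub...")
        hub_url = f"{hub_user_name}/{model_name}"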
        hf_model.push_to_hub(hub_url)
        hf_processor.push_to_hub(hub_url)
        print(f"Pushed to huggingface hub at: {hub_url}.")
    return hf_cfg, hf_state
 def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model-name",
        required=True,
        type=str,
        choices=list(MODEL_NAME_TO_CHECKPOINT_URL_MAPPING.keys()),
        help="Name of the original MM Grounding DINO checkpoint to convert.",
    )
    parser.add_argument("--hub-user-name", type=str, help="User name on the huggingface hub.")
    parser.add_argument("--push-to-hub", action="store_true", help="Whether to push model to hub or not.")
    parser.add_argument(
        "--verify-outputs", action="store_true", help="Whether to verify that model output is correct or not."
    )
    return parser.parse_args()
 if __name__ == "__main__":
    args = parse_args()
    convert_mm_grounding_dino_checkpoint(
        args.model_name,
        args.verify_outputs,
        args.push_to_hub,
        args.hub_user_name,
    )
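# Illustrative sketch, not part of the conversion script: argparse turns the dashed flags defined in
# `parse_args` into underscore attribute names. The hypothetical `build_parser` below repeats the four flags
# (without the `choices` restriction and help texts) and parses an explicit argument list instead of `sys.argv`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Reduced copy of the script's parser, for illustration only."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-name", required=True, type=str)
    parser.add_argument("--hub-user-name", type=str)
    parser.add_argument("--push-to-hub", action="store_true")
    parser.add_argument("--verify-outputs", action="store_true")
    return parser


# Equivalent to passing `--model-name llmdet_tiny --verify-outputs` on the command line.
args = build_parser().parse_args(["--model-name", "llmdet_tiny", "--verify-outputs"])
print(args.model_name, args.verify_outputs, args.push_to_hub, args.hub_user_name)
```

# Flags that are omitted fall back to their defaults: `store_true` flags to False, `--hub-user-name` to None.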
--- a/src/transformers/models/mm_grounding_dino/modeling_mm_grounding_dino.py
+++ b/src/transformers/models/mm_grounding_dino/modeling_mm_grounding_dino.py
--- a/src/transformers/models/mm_grounding_dino/modular_mm_grounding_dino.py
+++ b/src/transformers/models/mm_grounding_dino/modular_mm_grounding_dino.py
@@ -0,0 +1,434 @@
 # coding=utf-8
 # Copyright 2025 The HuggingFace Inc. team.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 import math
 import torch
 from torch import nn
 from ...configuration_utils import PretrainedConfig
 from ...utils import logging
 from ...utils.backbone_utils import verify_backbone_config_arguments
 from ..auto import CONFIG_MAPPING
 from ..auto.modeling_auto import AutoModel
 from ..grounding_dino.configuration_grounding_dino import GroundingDinoConfig
 from ..grounding_dino.modeling_grounding_dino import (
    GroundingDinoContrastiveEmbedding,
    GroundingDinoConvEncoder,
    GroundingDinoConvModel,
    GroundingDinoDecoder,
    GroundingDinoEncoder,
    GroundingDinoForObjectDetection,
    GroundingDinoMLPPredictionHead,
    GroundingDinoModel,
    GroundingDinoPreTrainedModel,
    build_position_encoding,
 )
 logger = logging.get_logger(__name__)
 class MMGroundingDinoConfig(GroundingDinoConfig, PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`MMGroundingDinoModel`]. It is used to instantiate a
    MM Grounding DINO model according to the specified arguments, defining the model architecture. Instantiating a
    configuration with the defaults will yield a similar configuration to that of the MM Grounding DINO tiny architecture
    [openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg_v3det](https://huggingface.co/openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg_v3det).
    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.
    Args:
        backbone_config (`PretrainedConfig` or `dict`, *optional*, defaults to `SwinConfig()`):
            The configuration of the backbone model.
        backbone (`str`, *optional*):
            Name of backbone to use when `backbone_config` is `None`. If `use_pretrained_backbone` is `True`, this
            will load the corresponding pretrained weights from the timm or transformers library. If `use_pretrained_backbone`
            is `False`, this loads the backbone's config and uses that to initialize the backbone with random weights.
        use_pretrained_backbone (`bool`, *optional*, defaults to `False`):
            Whether to use pretrained weights for the backbone.
        use_timm_backbone (`bool`, *optional*, defaults to `False`):
            Whether to load `backbone` from the timm library. If `False`, the backbone is loaded from the transformers
            library.
        backbone_kwargs (`dict`, *optional*):
            Keyword arguments to be passed to AutoBackbone when loading from a checkpoint
            e.g. `{'out_indices': (0, 1, 2, 3)}`. Cannot be specified if `backbone_config` is set.
        text_config (`Union[AutoConfig, dict]`, *optional*, defaults to `BertConfig`):
            The config object or dictionary of the text backbone.
        num_queries (`int`, *optional*, defaults to 900):
            Number of object queries, i.e. detection slots. This is the maximal number of objects
            [`MMGroundingDinoModel`] can detect in a single image.
        encoder_layers (`int`, *optional*, defaults to 6):
            Number of encoder layers.
        encoder_ffn_dim (`int`, *optional*, defaults to 2048):
            Dimension of the "intermediate" (often named feed-forward) layer in the encoder.
        encoder_attention_heads (`int`, *optional*, defaults to 8):
            Number of attention heads for each attention layer in the Transformer encoder.
        decoder_layers (`int`, *optional*, defaults to 6):
            Number of decoder layers.
        decoder_ffn_dim (`int`, *optional*, defaults to 2048):
            Dimension of the "intermediate" (often named feed-forward) layer in the decoder.
        decoder_attention_heads (`int`, *optional*, defaults to 8):
            Number of attention heads for each attention layer in the Transformer decoder.
        is_encoder_decoder (`bool`, *optional*, defaults to `True`):
            Whether the model is used as an encoder/decoder or not.
        activation_function (`str` or `function`, *optional*, defaults to `"relu"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
            `"relu"`, `"silu"` and `"gelu_new"` are supported.
        d_model (`int`, *optional*, defaults to 256):
            Dimension of the layers.
        dropout (`float`, *optional*, defaults to 0.1):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities.
        activation_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for activations inside the fully connected layer.
        auxiliary_loss (`bool`, *optional*, defaults to `False`):
            Whether auxiliary decoding losses (loss at each decoder layer) are to be used.
        position_embedding_type (`str`, *optional*, defaults to `"sine"`):
            Type of position embeddings to be used on top of the image features. One of `"sine"` or `"learned"`.
        num_feature_levels (`int`, *optional*, defaults to 4):
            The number of input feature levels.
        encoder_n_points (`int`, *optional*, defaults to 4):
            The number of sampled keys in each feature level for each attention head in the encoder.
        decoder_n_points (`int`, *optional*, defaults to 4):
            The number of sampled keys in each feature level for each attention head in the decoder.
        two_stage (`bool`, *optional*, defaults to `True`):
            Whether to apply a two-stage deformable DETR, where the region proposals are also generated by a variant of
            Grounding DINO, which are further fed into the decoder for iterative bounding box refinement.
        class_cost (`float`, *optional*, defaults to 1.0):
            Relative weight of the classification error in the Hungarian matching cost.
        bbox_cost (`float`, *optional*, defaults to 5.0):
            Relative weight of the L1 error of the bounding box coordinates in the Hungarian matching cost.
        giou_cost (`float`, *optional*, defaults to 2.0):
            Relative weight of the generalized IoU loss of the bounding box in the Hungarian matching cost.
        bbox_loss_coefficient (`float`, *optional*, defaults to 5.0):
            Relative weight of the L1 bounding box loss in the object detection loss.
        giou_loss_coefficient (`float`, *optional*, defaults to 2.0):
            Relative weight of the generalized IoU loss in the object detection loss.
        focal_alpha (`float`, *optional*, defaults to 0.25):
            Alpha parameter in the focal loss.
        disable_custom_kernels (`bool`, *optional*, defaults to `False`):
            Disable the use of custom CUDA and CPU kernels. This option is necessary for the ONNX export, as custom
            kernels are not supported by PyTorch ONNX export.
        max_text_len (`int`, *optional*, defaults to 256):
            The maximum length of the text input.
        text_enhancer_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the text enhancer.
        fusion_droppath (`float`, *optional*, defaults to 0.1):
            The droppath ratio for the fusion module.
        fusion_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the fusion module.
        embedding_init_target (`bool`, *optional*, defaults to `True`):
            Whether to initialize the target with Embedding weights.
        query_dim (`int`, *optional*, defaults to 4):
            The dimension of the query vector.
        positional_embedding_temperature (`float`, *optional*, defaults to 20):
            The temperature for Sine Positional Embedding that is used together with vision backbone.
        init_std (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        layer_norm_eps (`float`, *optional*, defaults to 1e-05):
            The epsilon used by the layer normalization layers.
    Examples:
    ```python
    >>> from transformers import MMGroundingDinoConfig, MMGroundingDinoModel
    >>> # Initializing a MM Grounding DINO configuration
    >>> configuration = MMGroundingDinoConfig()
    >>> # Initializing a model (with random weights) from the configuration
    >>> model = MMGroundingDinoModel(configuration)
    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""
    model_type = "mm-grounding-dino"
    def __init__(
        self,
        backbone_config=None,
        backbone=None,
        use_pretrained_backbone=False,
        use_timm_backbone=False,
        backbone_kwargs=None,
        text_config=None,
        num_queries=900,
        encoder_layers=6,
        encoder_ffn_dim=2048,
        encoder_attention_heads=8,
        decoder_layers=6,
        decoder_ffn_dim=2048,
        decoder_attention_heads=8,
        is_encoder_decoder=True,
        activation_function="relu",
        d_model=256,
        dropout=0.1,
        attention_dropout=0.0,
        activation_dropout=0.0,
        auxiliary_loss=False,
        position_embedding_type="sine",
        num_feature_levels=4,
        encoder_n_points=4,
        decoder_n_points=4,
        two_stage=True,
        class_cost=1.0,
        bbox_cost=5.0,
        giou_cost=2.0,
        bbox_loss_coefficient=5.0,
        giou_loss_coefficient=2.0,
        focal_alpha=0.25,
        disable_custom_kernels=False,
        # other parameters
        max_text_len=256,
        text_enhancer_dropout=0.0,
        fusion_droppath=0.1,
        fusion_dropout=0.0,
        embedding_init_target=True,
        query_dim=4,
        positional_embedding_temperature=20,
        init_std=0.02,
        layer_norm_eps=1e-5,
        **kwargs,
    ):
        PretrainedConfig.__init__(self, is_encoder_decoder=is_encoder_decoder, **kwargs)
        if backbone_config is None and backbone is None:
            logger.info("`backbone_config` is `None`. Initializing the config with the default `Swin` backbone.")
            backbone_config = CONFIG_MAPPING["swin"](
                window_size=7,
                image_size=224,
                embed_dim=96,
                depths=[2, 2, 6, 2],
                num_heads=[3, 6, 12, 24],
                out_indices=[2, 3, 4],
            )
        elif isinstance(backbone_config, dict):
            backbone_model_type = backbone_config.pop("model_type")
            config_class = CONFIG_MAPPING[backbone_model_type]
            backbone_config = config_class.from_dict(backbone_config)
        verify_backbone_config_arguments(
            use_timm_backbone=use_timm_backbone,
            use_pretrained_backbone=use_pretrained_backbone,
            backbone=backbone,
            backbone_config=backbone_config,
            backbone_kwargs=backbone_kwargs,
        )
        if text_config is None:
            text_config = {}
            logger.info("text_config is None. Initializing the text config with default values (`BertConfig`).")
        self.backbone_config = backbone_config
        self.backbone = backbone
        self.use_pretrained_backbone = use_pretrained_backbone
        self.use_timm_backbone = use_timm_backbone
        self.backbone_kwargs = backbone_kwargs
        self.num_queries = num_queries
        self.d_model = d_model
        self.encoder_ffn_dim = encoder_ffn_dim
        self.encoder_layers = encoder_layers
        self.encoder_attention_heads = encoder_attention_heads
        self.decoder_ffn_dim = decoder_ffn_dim
        self.decoder_layers = decoder_layers
        self.decoder_attention_heads = decoder_attention_heads
        self.dropout = dropout
        self.attention_dropout = attention_dropout
        self.activation_dropout = activation_dropout
        self.activation_function = activation_function
        self.auxiliary_loss = auxiliary_loss
        self.position_embedding_type = position_embedding_type
        # deformable attributes
        self.num_feature_levels = num_feature_levels
        self.encoder_n_points = encoder_n_points
        self.decoder_n_points = decoder_n_points
        self.two_stage = two_stage
        # Hungarian matcher
        self.class_cost = class_cost
        self.bbox_cost = bbox_cost
        self.giou_cost = giou_cost
        # Loss coefficients
        self.bbox_loss_coefficient = bbox_loss_coefficient
        self.giou_loss_coefficient = giou_loss_coefficient
        self.focal_alpha = focal_alpha
        self.disable_custom_kernels = disable_custom_kernels
        # Text backbone
        if isinstance(text_config, dict):
            text_config["model_type"] = text_config["model_type"] if "model_type" in text_config else "bert"
            text_config = CONFIG_MAPPING[text_config["model_type"]](**text_config)
        elif text_config is None:
            text_config = CONFIG_MAPPING["bert"]()
        self.text_config = text_config
        self.max_text_len = max_text_len
        # Text Enhancer
        self.text_enhancer_dropout = text_enhancer_dropout
        # Fusion
        self.fusion_droppath = fusion_droppath
        self.fusion_dropout = fusion_dropout
        # Others
        self.embedding_init_target = embedding_init_target
        self.query_dim = query_dim
        self.positional_embedding_temperature = positional_embedding_temperature
        self.init_std = init_std
        self.layer_norm_eps = layer_norm_eps
 class MMGroundingDinoContrastiveEmbedding(GroundingDinoContrastiveEmbedding):
    def __init__(self, config):
        super().__init__(config)
        self.bias = nn.Parameter(torch.tensor(0.0))
    def forward(
        self,
        vision_hidden_state: torch.FloatTensor,
        text_hidden_state: torch.FloatTensor,
        text_token_mask: torch.BoolTensor,
    ) -> torch.FloatTensor:
        res = vision_hidden_state @ text_hidden_state.transpose(-1, -2)
        res = res / math.sqrt(vision_hidden_state.shape[-1])
        res = res + self.bias
        res.masked_fill_(~text_token_mask[:, None, :], float("-inf"))
        # padding to max_text_len
        new_res = torch.full((*res.shape[:-1], self.max_text_len), float("-inf"), device=res.device)
        new_res[..., : res.shape[-1]] = res
        return new_res
 class MMGroundingDinoPreTrainedModel(GroundingDinoPreTrainedModel):
    def _init_weights(self, module):
        super()._init_weights(module)
        if isinstance(module, MMGroundingDinoContrastiveEmbedding):
            nn.init.constant_(module.bias, -math.log((1 - 0.01) / 0.01))
 class MMGroundingDinoConvEncoder(GroundingDinoConvEncoder):
    pass
 class MMGroundingDinoConvModel(GroundingDinoConvModel):
    pass
 class MMGroundingDinoEncoder(GroundingDinoEncoder):
    pass
 class MMGroundingDinoDecoder(GroundingDinoDecoder):
    pass
 class MMGroundingDinoModel(GroundingDinoModel, MMGroundingDinoPreTrainedModel):
    def __init__(self, config: MMGroundingDinoConfig):
        MMGroundingDinoPreTrainedModel.__init__(self, config)
        # Create backbone + positional encoding
        backbone = MMGroundingDinoConvEncoder(config)
        position_embeddings = build_position_encoding(config)
        self.backbone = MMGroundingDinoConvModel(backbone, position_embeddings)
        # Create input projection layers
        num_backbone_outs = len(backbone.intermediate_channel_sizes)
        input_proj_list = []
        for i in range(num_backbone_outs):
            in_channels = backbone.intermediate_channel_sizes[i]
            input_proj_list.append(
                nn.Sequential(
                    nn.Conv2d(in_channels, config.d_model, kernel_size=1),
                    nn.GroupNorm(32, config.d_model),
                )
            )
        for _ in range(config.num_feature_levels - num_backbone_outs):
            input_proj_list.append(
                nn.Sequential(
                    nn.Conv2d(in_channels, config.d_model, kernel_size=3, stride=2, padding=1),
                    nn.GroupNorm(32, config.d_model),
                )
            )
            in_channels = config.d_model
        self.input_proj_vision = nn.ModuleList(input_proj_list)
        # Create text backbone
        self.text_backbone = AutoModel.from_config(config.text_config, add_pooling_layer=False)
        self.text_projection = nn.Linear(config.text_config.hidden_size, config.d_model)
        if config.embedding_init_target or not config.two_stage:
            self.query_position_embeddings = nn.Embedding(config.num_queries, config.d_model)
        self.encoder = MMGroundingDinoEncoder(config)
        self.decoder = MMGroundingDinoDecoder(config)
        self.level_embed = nn.Parameter(torch.Tensor(config.num_feature_levels, config.d_model))
        self.enc_output = nn.Linear(config.d_model, config.d_model)
        self.enc_output_norm = nn.LayerNorm(config.d_model, config.layer_norm_eps)
        self.encoder_output_bbox_embed = MMGroundingDinoMLPPredictionHead(
            input_dim=config.d_model, hidden_dim=config.d_model, output_dim=4, num_layers=3
        )
        self.encoder_output_class_embed = MMGroundingDinoContrastiveEmbedding(config)
        self.post_init()
 class MMGroundingDinoMLPPredictionHead(GroundingDinoMLPPredictionHead):
    pass
 class MMGroundingDinoForObjectDetection(GroundingDinoForObjectDetection, MMGroundingDinoPreTrainedModel):
    _tied_weights_keys = [
        r"bbox_embed\.[1-9]\d*",
        r"model\.decoder\.bbox_embed\.[0-9]\d*",
        r"class_embed\.[1-9]\d*",
        r"model\.decoder\.class_embed\.[0-9]\d*",
    ]
    def __init__(self, config: MMGroundingDinoConfig):
        MMGroundingDinoPreTrainedModel.__init__(self, config)
        self.model = MMGroundingDinoModel(config)
        self.class_embed = nn.ModuleList(
            [MMGroundingDinoContrastiveEmbedding(config) for _ in range(config.decoder_layers)]
        )
        self.bbox_embed = nn.ModuleList(
            [
                MMGroundingDinoMLPPredictionHead(
                    input_dim=config.d_model, hidden_dim=config.d_model, output_dim=4, num_layers=3
                )
                for _ in range(config.decoder_layers)
            ]
        )
        # hack for box-refinement
        self.model.decoder.bbox_embed = self.bbox_embed
        # hack implementation for two-stage
        self.model.decoder.class_embed = self.class_embed
        # Initialize weights and apply final processing
        self.post_init()
 __all__ = [
    "MMGroundingDinoConfig",
    "MMGroundingDinoForObjectDetection",
    "MMGroundingDinoModel",
    "MMGroundingDinoPreTrainedModel",
 ]
--- a/tests/models/grounding_dino/test_modeling_grounding_dino.py
+++ b/tests/models/grounding_dino/test_modeling_grounding_dino.py
@@ -818,7 +818,9 @@ class GroundingDinoModelIntegrationTests(unittest.TestCase):
        prompt = ". ".join(id2label.values()) + "."
        text_inputs = tokenizer([prompt, prompt], return_tensors="pt")
-        image_inputs = image_processor(images=ds["image"], annotations=ds["annotations"], return_tensors="pt")
+        image_inputs = image_processor(
+            images=list(ds["image"]), annotations=list(ds["annotations"]), return_tensors="pt"
+        )
        # Passing auxiliary_loss=True to compare with the expected loss
        model = GroundingDinoForObjectDetection.from_pretrained(
--- a/tests/models/mm_grounding_dino/__init__.py
+++ b/tests/models/mm_grounding_dino/__init__.py
--- a/tests/models/mm_grounding_dino/test_modeling_mm_grounding_dino.py
+++ b/tests/models/mm_grounding_dino/test_modeling_mm_grounding_dino.py
@@ -0,0 +1,871 @@
 # Copyright 2025 The HuggingFace Inc. team. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """Testing suite for the PyTorch MM Grounding DINO model."""
 import collections
 import inspect
 import math
 import re
 import unittest
 from datasets import load_dataset
 from transformers import (
    MMGroundingDinoConfig,
    SwinConfig,
    is_torch_available,
    is_vision_available,
 )
 from transformers.file_utils import cached_property
 from transformers.testing_utils import (
    is_flaky,
    require_timm,
    require_torch,
    require_torch_accelerator,
    require_vision,
    slow,
    torch_device,
 )
 from ...test_configuration_common import ConfigTester
 from ...test_modeling_common import ModelTesterMixin, _config_zero_init, floats_tensor
 from ...test_pipeline_mixin import PipelineTesterMixin
 if is_torch_available():
    import torch
    from transformers import MMGroundingDinoConfig, MMGroundingDinoForObjectDetection, MMGroundingDinoModel
    from transformers.pytorch_utils import id_tensor_storage
 if is_vision_available():
    from PIL import Image
    from transformers import AutoProcessor
 # Copied from tests.models.grounding_dino.test_modeling_grounding_dino.generate_fake_bounding_boxes
 def generate_fake_bounding_boxes(n_boxes):
    """Generate bounding boxes in the format (center_x, center_y, width, height)"""
    # Validate the input
    if not isinstance(n_boxes, int):
        raise TypeError("n_boxes must be an integer")
    if n_boxes <= 0:
        raise ValueError("n_boxes must be a positive integer")
    # Generate random bounding boxes in the format (center_x, center_y, width, height)
    bounding_boxes = torch.rand((n_boxes, 4))
    # Extract the components
    center_x = bounding_boxes[:, 0]
    center_y = bounding_boxes[:, 1]
    width = bounding_boxes[:, 2]
    height = bounding_boxes[:, 3]
    # Ensure width and height do not exceed bounds
    width = torch.min(width, torch.tensor(1.0))
    height = torch.min(height, torch.tensor(1.0))
    # Ensure the bounding box stays within the normalized space
    center_x = torch.where(center_x - width / 2 < 0, width / 2, center_x)
    center_x = torch.where(center_x + width / 2 > 1, 1 - width / 2, center_x)
    center_y = torch.where(center_y - height / 2 < 0, height / 2, center_y)
    center_y = torch.where(center_y + height / 2 > 1, 1 - height / 2, center_y)
    # Combine back into bounding boxes
    bounding_boxes = torch.stack([center_x, center_y, width, height], dim=1)
    return bounding_boxes
 # Copied from tests.models.grounding_dino.test_modeling_grounding_dino.GroundingDinoModelTester with GroundingDino->MMGroundingDino
 class MMGroundingDinoModelTester:
    def __init__(
        self,
        parent,
        batch_size=4,
        is_training=True,
        use_labels=True,
        hidden_size=32,
        num_hidden_layers=2,
        num_attention_heads=4,
        intermediate_size=4,
        hidden_act="gelu",
        hidden_dropout_prob=0.1,
        attention_probs_dropout_prob=0.1,
        num_queries=2,
        num_channels=3,
        image_size=98,
        n_targets=8,
        num_labels=2,
        num_feature_levels=4,
        encoder_n_points=2,
        decoder_n_points=6,
        max_text_len=7,
    ):
        self.parent = parent
        self.batch_size = batch_size
        self.is_training = is_training
        self.use_labels = use_labels
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.intermediate_size = intermediate_size
        self.hidden_act = hidden_act
        self.hidden_dropout_prob = hidden_dropout_prob
        self.attention_probs_dropout_prob = attention_probs_dropout_prob
        self.num_queries = num_queries
        self.num_channels = num_channels
        self.image_size = image_size
        self.n_targets = n_targets
        self.num_labels = num_labels
        self.num_feature_levels = num_feature_levels
        self.encoder_n_points = encoder_n_points
        self.decoder_n_points = decoder_n_points
        self.max_text_len = max_text_len
        # we also set the expected seq length for both encoder and decoder
        self.encoder_seq_length_vision = (
            math.ceil(self.image_size / 8) ** 2
            + math.ceil(self.image_size / 16) ** 2
            + math.ceil(self.image_size / 32) ** 2
            + math.ceil(self.image_size / 64) ** 2
        )
        self.encoder_seq_length_text = self.max_text_len
        self.decoder_seq_length = self.num_queries
    def prepare_config_and_inputs(self):
        pixel_values = floats_tensor([self.batch_size, self.num_channels, self.image_size, self.image_size])
        pixel_mask = torch.ones([self.batch_size, self.image_size, self.image_size], device=torch_device)
        # When using `MMGroundingDino`, the text input template is '{label1}. {label2}. {label3}. ... {labelN}.'
        # Therefore, to avoid errors when running tests with `labels`, `input_ids` have to follow this structure.
        # Otherwise, `build_label_maps` throws an error when trying to split the `input_ids` into segments.
        input_ids = torch.tensor([101, 3869, 1012, 11420, 3869, 1012, 102], device=torch_device)
        input_ids = input_ids.unsqueeze(0).expand(self.batch_size, -1)
        labels = None
        if self.use_labels:
            # labels is a list of Dict (each Dict being the labels for a given example in the batch)
            labels = []
            for i in range(self.batch_size):
                target = {}
                target["class_labels"] = torch.randint(
                    high=self.num_labels, size=(self.n_targets,), device=torch_device
                )
                target["boxes"] = generate_fake_bounding_boxes(self.n_targets).to(torch_device)
                target["masks"] = torch.rand(self.n_targets, self.image_size, self.image_size, device=torch_device)
                labels.append(target)
        config = self.get_config()
        return config, pixel_values, pixel_mask, input_ids, labels
    def get_config(self):
        swin_config = SwinConfig(
            window_size=7,
            embed_dim=8,
            depths=[1, 1, 1, 1],
            num_heads=[1, 1, 1, 1],
            image_size=self.image_size,
            out_features=["stage2", "stage3", "stage4"],
            out_indices=[2, 3, 4],
        )
        text_backbone = {
            "hidden_size": 8,
            "num_hidden_layers": 2,
            "num_attention_heads": 2,
            "intermediate_size": 8,
            "max_position_embeddings": 8,
            "model_type": "bert",
        }
        return MMGroundingDinoConfig(
            d_model=self.hidden_size,
            encoder_layers=self.num_hidden_layers,
            decoder_layers=self.num_hidden_layers,
            encoder_attention_heads=self.num_attention_heads,
            decoder_attention_heads=self.num_attention_heads,
            encoder_ffn_dim=self.intermediate_size,
            decoder_ffn_dim=self.intermediate_size,
            dropout=self.hidden_dropout_prob,
            attention_dropout=self.attention_probs_dropout_prob,
            num_queries=self.num_queries,
            num_labels=self.num_labels,
            num_feature_levels=self.num_feature_levels,
            encoder_n_points=self.encoder_n_points,
            decoder_n_points=self.decoder_n_points,
            use_timm_backbone=False,
            backbone_config=swin_config,
            max_text_len=self.max_text_len,
            text_config=text_backbone,
        )
    def prepare_config_and_inputs_for_common(self):
        config, pixel_values, pixel_mask, input_ids, labels = self.prepare_config_and_inputs()
        inputs_dict = {"pixel_values": pixel_values, "pixel_mask": pixel_mask, "input_ids": input_ids}
        return config, inputs_dict
    def create_and_check_model(self, config, pixel_values, pixel_mask, input_ids, labels):
        model = MMGroundingDinoModel(config=config)
        model.to(torch_device)
        model.eval()
        result = model(pixel_values=pixel_values, pixel_mask=pixel_mask, input_ids=input_ids)
        self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.num_queries, self.hidden_size))
    def create_and_check_object_detection_head_model(self, config, pixel_values, pixel_mask, input_ids, labels):
        model = MMGroundingDinoForObjectDetection(config=config)
        model.to(torch_device)
        model.eval()
        result = model(pixel_values=pixel_values, pixel_mask=pixel_mask, input_ids=input_ids)
        self.parent.assertEqual(result.logits.shape, (self.batch_size, self.num_queries, config.max_text_len))
        self.parent.assertEqual(result.pred_boxes.shape, (self.batch_size, self.num_queries, 4))
        result = model(pixel_values=pixel_values, pixel_mask=pixel_mask, input_ids=input_ids, labels=labels)
        self.parent.assertEqual(result.loss.shape, ())
        self.parent.assertEqual(result.logits.shape, (self.batch_size, self.num_queries, config.max_text_len))
        self.parent.assertEqual(result.pred_boxes.shape, (self.batch_size, self.num_queries, 4))
@require_torch
 # Copied from tests.models.grounding_dino.test_modeling_grounding_dino.GroundingDinoModelTest with Grounding->MMGrounding
 class MMGroundingDinoModelTest(ModelTesterMixin, PipelineTesterMixin, unittest.TestCase):
    all_model_classes = (MMGroundingDinoModel, MMGroundingDinoForObjectDetection) if is_torch_available() else ()
    is_encoder_decoder = True
    test_torchscript = False
    test_pruning = False
    test_head_masking = False
    test_missing_keys = False
    pipeline_model_mapping = (
        {
            "image-feature-extraction": MMGroundingDinoModel,
            "zero-shot-object-detection": MMGroundingDinoForObjectDetection,
        }
        if is_torch_available()
        else {}
    )
    # special case for head models
    def _prepare_for_class(self, inputs_dict, model_class, return_labels=False):
        inputs_dict = super()._prepare_for_class(inputs_dict, model_class, return_labels=return_labels)
        if return_labels:
            if model_class.__name__ == "MMGroundingDinoForObjectDetection":
                labels = []
                for i in range(self.model_tester.batch_size):
                    target = {}
                    target["class_labels"] = torch.ones(
                        size=(self.model_tester.n_targets,), device=torch_device, dtype=torch.long
                    )
                    target["boxes"] = torch.ones(
                        self.model_tester.n_targets, 4, device=torch_device, dtype=torch.float
                    )
                    target["masks"] = torch.ones(
                        self.model_tester.n_targets,
                        self.model_tester.image_size,
                        self.model_tester.image_size,
                        device=torch_device,
                        dtype=torch.float,
                    )
                    labels.append(target)
                inputs_dict["labels"] = labels
        return inputs_dict
    def setUp(self):
        self.model_tester = MMGroundingDinoModelTester(self)
        self.config_tester = ConfigTester(
            self,
            config_class=MMGroundingDinoConfig,
            has_text_modality=False,
            common_properties=["d_model", "encoder_attention_heads", "decoder_attention_heads"],
        )
    def test_config(self):
        self.config_tester.run_common_tests()
    def test_model(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
        self.model_tester.create_and_check_model(*config_and_inputs)
    def test_object_detection_head_model(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
        self.model_tester.create_and_check_object_detection_head_model(*config_and_inputs)
    @unittest.skip(reason="MMGrounding DINO does not use inputs_embeds")
    def test_inputs_embeds(self):
        pass
    @unittest.skip(reason="MMGrounding DINO does not have a get_input_embeddings method")
    def test_model_get_set_embeddings(self):
        pass
    @unittest.skip(reason="MMGrounding DINO does not use token embeddings")
    def test_resize_tokens_embeddings(self):
        pass
    @unittest.skip(reason="Feed forward chunking is not implemented")
    def test_feed_forward_chunking(self):
        pass
    def test_attention_outputs(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
        config.return_dict = True
        for model_class in self.all_model_classes:
            inputs_dict["output_attentions"] = True
            inputs_dict["output_hidden_states"] = False
            config.return_dict = True
            model = model_class._from_config(config, attn_implementation="eager")
            config = model.config
            model.to(torch_device)
            model.eval()
            with torch.no_grad():
                outputs = model(**self._prepare_for_class(inputs_dict, model_class))
            attentions = outputs.encoder_attentions[-1]
            self.assertEqual(len(attentions), self.model_tester.num_hidden_layers)
            # check that output_attentions also works when set via the config
            del inputs_dict["output_attentions"]
            config.output_attentions = True
            model = model_class(config)
            model.to(torch_device)
            model.eval()
            with torch.no_grad():
                outputs = model(**self._prepare_for_class(inputs_dict, model_class))
            attentions = outputs.encoder_attentions[-1]
            self.assertEqual(len(attentions), self.model_tester.num_hidden_layers)
            self.assertListEqual(
                list(attentions[0].shape[-3:]),
                [
                    self.model_tester.num_attention_heads,
                    self.model_tester.num_feature_levels,
                    self.model_tester.encoder_n_points,
                ],
            )
            out_len = len(outputs)
            correct_outlen = 12
            # loss is at first position
            if "labels" in inputs_dict:
                correct_outlen += 1  # loss is added to beginning
            # Object Detection model returns pred_logits and pred_boxes and input_ids
            if model_class.__name__ == "MMGroundingDinoForObjectDetection":
                correct_outlen += 3
            self.assertEqual(out_len, correct_outlen)
            # decoder attentions
            decoder_attentions = outputs.decoder_attentions[0]
            self.assertIsInstance(decoder_attentions, (list, tuple))
            self.assertEqual(len(decoder_attentions), self.model_tester.num_hidden_layers)
            self.assertListEqual(
                list(decoder_attentions[0].shape[-3:]),
                [self.model_tester.num_attention_heads, self.model_tester.num_queries, self.model_tester.num_queries],
            )
            # cross attentions
            cross_attentions = outputs.decoder_attentions[-1]
            self.assertIsInstance(cross_attentions, (list, tuple))
            self.assertEqual(len(cross_attentions), self.model_tester.num_hidden_layers)
            self.assertListEqual(
                list(cross_attentions[0].shape[-3:]),
                [
                    self.model_tester.num_attention_heads,
                    self.model_tester.num_feature_levels,
                    self.model_tester.decoder_n_points,
                ],
            )
            # Check attention is always last and order is fine
            inputs_dict["output_attentions"] = True
            inputs_dict["output_hidden_states"] = True
            model = model_class(config)
            model.to(torch_device)
            model.eval()
            with torch.no_grad():
                outputs = model(**self._prepare_for_class(inputs_dict, model_class))
            self.assertEqual(out_len + 3, len(outputs))
            self_attentions = outputs.encoder_attentions[-1]
            self.assertEqual(len(self_attentions), self.model_tester.num_hidden_layers)
            self.assertListEqual(
                list(self_attentions[0].shape[-3:]),
                [
                    self.model_tester.num_attention_heads,
                    self.model_tester.num_feature_levels,
                    self.model_tester.encoder_n_points,
                ],
            )
    # overwrite since hidden_states are called encoder_text_hidden_states
    def test_hidden_states_output(self):
        def check_hidden_states_output(inputs_dict, config, model_class):
            model = model_class(config)
            model.to(torch_device)
            model.eval()
            with torch.no_grad():
                outputs = model(**self._prepare_for_class(inputs_dict, model_class))
            hidden_states = outputs.encoder_vision_hidden_states
            expected_num_layers = getattr(
                self.model_tester, "expected_num_hidden_layers", self.model_tester.num_hidden_layers + 1
            )
            self.assertEqual(len(hidden_states), expected_num_layers)
            seq_len = self.model_tester.encoder_seq_length_vision
            self.assertListEqual(
                list(hidden_states[0].shape[-2:]),
                [seq_len, self.model_tester.hidden_size],
            )
            hidden_states = outputs.encoder_text_hidden_states
            self.assertEqual(len(hidden_states), expected_num_layers)
            seq_len = self.model_tester.encoder_seq_length_text
            self.assertListEqual(
                list(hidden_states[0].shape[-2:]),
                [seq_len, self.model_tester.hidden_size],
            )
            hidden_states = outputs.decoder_hidden_states
            self.assertIsInstance(hidden_states, (list, tuple))
            self.assertEqual(len(hidden_states), expected_num_layers)
            seq_len = getattr(self.model_tester, "seq_length", None)
            decoder_seq_length = getattr(self.model_tester, "decoder_seq_length", seq_len)
            self.assertListEqual(
                list(hidden_states[0].shape[-2:]),
                [decoder_seq_length, self.model_tester.hidden_size],
            )
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
        for model_class in self.all_model_classes:
            inputs_dict["output_hidden_states"] = True
            check_hidden_states_output(inputs_dict, config, model_class)
            # check that output_hidden_states also works when set via the config
            del inputs_dict["output_hidden_states"]
            config.output_hidden_states = True
            check_hidden_states_output(inputs_dict, config, model_class)
    # removed retain_grad and grad on decoder_hidden_states, as queries don't require grad
    def test_retain_grad_hidden_states_attentions(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
        config.output_hidden_states = True
        config.output_attentions = True
        # no need to test all models as different heads yield the same functionality
        model_class = self.all_model_classes[0]
        model = model_class(config)
        model.to(torch_device)
        inputs = self._prepare_for_class(inputs_dict, model_class)
        outputs = model(**inputs)
        output = outputs[0]
        encoder_hidden_states = outputs.encoder_vision_hidden_states[0]
        encoder_attentions = outputs.encoder_attentions[0][0]
        encoder_hidden_states.retain_grad()
        encoder_attentions.retain_grad()
        cross_attentions = outputs.decoder_attentions[-1][0]
        cross_attentions.retain_grad()
        output.flatten()[0].backward(retain_graph=True)
        self.assertIsNotNone(encoder_hidden_states.grad)
        self.assertIsNotNone(encoder_attentions.grad)
        self.assertIsNotNone(cross_attentions.grad)
    def test_forward_signature(self):
        config, _ = self.model_tester.prepare_config_and_inputs_for_common()
        for model_class in self.all_model_classes:
            model = model_class(config)
            signature = inspect.signature(model.forward)
            # signature.parameters is an OrderedDict => so arg_names order is deterministic
            arg_names = [*signature.parameters.keys()]
            expected_arg_names = ["pixel_values", "input_ids"]
            self.assertListEqual(arg_names[: len(expected_arg_names)], expected_arg_names)
    @require_timm
    def test_different_timm_backbone(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
        # let's pick a random timm backbone
        config.backbone = "tf_mobilenetv3_small_075"
        config.use_timm_backbone = True
        config.backbone_config = None
        config.backbone_kwargs = {"in_chans": 3, "out_indices": (2, 3, 4)}
        for model_class in self.all_model_classes:
            model = model_class(config)
            model.to(torch_device)
            model.eval()
            with torch.no_grad():
                outputs = model(**self._prepare_for_class(inputs_dict, model_class))
            if model_class.__name__ == "MMGroundingDinoForObjectDetection":
                expected_shape = (
                    self.model_tester.batch_size,
                    self.model_tester.num_queries,
                    config.max_text_len,
                )
                self.assertEqual(outputs.logits.shape, expected_shape)
            self.assertTrue(outputs)
    @require_timm
    def test_hf_backbone(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
        # Load a pretrained HF checkpoint as backbone
        config.backbone = "microsoft/resnet-18"
        config.backbone_config = None
        config.use_timm_backbone = False
        config.use_pretrained_backbone = True
        config.backbone_kwargs = {"out_indices": [2, 3, 4]}
        for model_class in self.all_model_classes:
            model = model_class(config)
            model.to(torch_device)
            model.eval()
            with torch.no_grad():
                outputs = model(**self._prepare_for_class(inputs_dict, model_class))
            if model_class.__name__ == "MMGroundingDinoForObjectDetection":
                expected_shape = (
                    self.model_tester.batch_size,
                    self.model_tester.num_queries,
                    config.max_text_len,
                )
                self.assertEqual(outputs.logits.shape, expected_shape)
            self.assertTrue(outputs)
    # Ignore copy
    def test_initialization(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
        configs_no_init = _config_zero_init(config)
        for model_class in self.all_model_classes:
            model = model_class(config=configs_no_init)
            for name, param in model.named_parameters():
                if param.requires_grad:
                    if (
                        "level_embed" in name
                        or "sampling_offsets.bias" in name
                        or "text_param" in name
                        or "vision_param" in name
                        or "value_proj" in name
                        or "output_proj" in name
                        or "reference_points" in name
                        or "vision_proj" in name
                        or "text_proj" in name
                        or ("class_embed" in name and "bias" in name)
                    ):
                        continue
                    self.assertIn(
                        ((param.data.mean() * 1e9).round() / 1e9).item(),
                        [0.0, 1.0],
                        msg=f"Parameter {name} of model {model_class} seems not properly initialized",
                    )
    # Copied from tests.models.deformable_detr.test_modeling_deformable_detr.DeformableDetrModelTest.test_two_stage_training with DeformableDetr->MMGroundingDino
    def test_two_stage_training(self):
        model_class = MMGroundingDinoForObjectDetection
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
        config.return_dict = True
        config.two_stage = True
        config.auxiliary_loss = True
        config.with_box_refine = True
        model = model_class(config)
        model.to(torch_device)
        model.train()
        inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
        loss = model(**inputs).loss
        loss.backward()
    def test_tied_weights_keys(self):
        config, _ = self.model_tester.prepare_config_and_inputs_for_common()
        config.tie_word_embeddings = True
        for model_class in self.all_model_classes:
            model_tied = model_class(config)
            ptrs = collections.defaultdict(list)
            for name, tensor in model_tied.state_dict().items():
                ptrs[id_tensor_storage(tensor)].append(name)
            # These are all the pointers of shared tensors.
            tied_params = [names for _, names in ptrs.items() if len(names) > 1]
            tied_weight_keys = model_tied._tied_weights_keys if model_tied._tied_weights_keys is not None else []
            # Detect we get a hit for each key
            for key in tied_weight_keys:
                if not any(re.search(key, p) for group in tied_params for p in group):
                    raise ValueError(f"{key} is not a tied weight key for {model_class}.")
            # Remove the tied weights found from tied params -> only one should be left afterwards
            for key in tied_weight_keys:
                for i in range(len(tied_params)):
                    tied_params[i] = [p for p in tied_params[i] if re.search(key, p) is None]
            # When sharing weights, MMGroundingDino also uses the shared ones in MMGroundingDinoDecoder.
            # Therefore, unlike DeformableDetr, we expect the group lengths to be 2:
            # one entry for self.bbox_embed in MMGroundingDinoForObjectDetection and another one
            # in the decoder
            tied_params = [group for group in tied_params if len(group) > 2]
            self.assertListEqual(
                tied_params,
                [],
                f"Missing `_tied_weights_keys` for {model_class}: add all of {tied_params} except one.",
            )
# We will verify our results on an image of cute cats
def prepare_img():
    image = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
    return image
def prepare_text():
    text = "a cat."
    return text
@require_timm
@require_vision
@slow
class MMGroundingDinoModelIntegrationTests(unittest.TestCase):
    @cached_property
    def default_processor(self):
        return (
            AutoProcessor.from_pretrained("openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg_v3det")
            if is_vision_available()
            else None
        )
    def test_inference_object_detection_head(self):
        model = MMGroundingDinoForObjectDetection.from_pretrained(
            "openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg_v3det"
        ).to(torch_device)
        processor = self.default_processor
        image = prepare_img()
        text = prepare_text()
        encoding = processor(images=image, text=text, return_tensors="pt").to(torch_device)
        with torch.no_grad():
            outputs = model(**encoding)
        expected_shape_logits = torch.Size((1, model.config.num_queries, model.config.max_text_len))
        self.assertEqual(outputs.logits.shape, expected_shape_logits)
        expected_boxes = torch.tensor(
            [[0.7666, 0.4142, 0.4590], [0.2557, 0.5480, 0.4812], [0.5049, 0.5133, 0.9767]]
        ).to(torch_device)
        expected_logits = torch.tensor(
            [[-5.1160, -0.2143, -0.2089], [-5.0592, -0.4269, -0.4169], [-4.9087, -1.7608, -1.7372]]
        ).to(torch_device)
        torch.testing.assert_close(outputs.logits[0, :3, :3], expected_logits, rtol=1e-3, atol=1e-3)
        expected_shape_boxes = torch.Size((1, model.config.num_queries, 4))
        self.assertEqual(outputs.pred_boxes.shape, expected_shape_boxes)
        torch.testing.assert_close(outputs.pred_boxes[0, :3, :3], expected_boxes, rtol=1e-4, atol=1e-4)
        # verify postprocessing
        results = processor.image_processor.post_process_object_detection(
            outputs, threshold=0.35, target_sizes=[(image.height, image.width)]
        )[0]
        expected_scores = torch.tensor([0.4480, 0.3973]).to(torch_device)
        expected_slice_boxes = torch.tensor([343.7321, 23.8182, 637.5044, 373.8593]).to(torch_device)
        self.assertEqual(len(results["scores"]), 2)
        torch.testing.assert_close(results["scores"], expected_scores, rtol=1e-3, atol=1e-3)
        torch.testing.assert_close(results["boxes"][0, :], expected_slice_boxes, rtol=1e-2, atol=1e-2)
        # verify grounded postprocessing
        expected_labels = ["a cat", "a cat"]
        results = processor.post_process_grounded_object_detection(
            outputs=outputs,
            input_ids=encoding.input_ids,
            threshold=0.35,
            text_threshold=0.3,
            target_sizes=[(image.height, image.width)],
        )[0]
        torch.testing.assert_close(results["scores"], expected_scores, rtol=1e-3, atol=1e-3)
        torch.testing.assert_close(results["boxes"][0, :], expected_slice_boxes, rtol=1e-2, atol=1e-2)
        self.assertListEqual(results["text_labels"], expected_labels)
    @require_torch_accelerator
    @is_flaky()
    def test_inference_object_detection_head_equivalence_cpu_gpu(self):
        processor = self.default_processor
        image = prepare_img()
        text = prepare_text()
        encoding = processor(images=image, text=text, return_tensors="pt")
        # 1. run model on CPU
        model = MMGroundingDinoForObjectDetection.from_pretrained(
            "openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg_v3det"
        )
        # HACK: the issue happens during top-k (k=900) after the encoder:
        # there are some flips between CPU and GPU query ordering (idxs 195<->196 and 267<->268 on my machine),
        # which cause different query position embedding assignments,
        # which in turn significantly change the decoder pass due to self-attention
        model.config.num_queries = 100
        model.model.query_position_embeddings.weight.data = model.model.query_position_embeddings.weight.data[:100]
        with torch.no_grad():
            cpu_outputs = model(**encoding)
        # 2. run model on GPU
        model.to(torch_device)
        encoding = encoding.to(torch_device)
        with torch.no_grad():
            gpu_outputs = model(**encoding)
        # 3. assert equivalence
        for key in cpu_outputs.keys():
            torch.testing.assert_close(cpu_outputs[key], gpu_outputs[key].cpu(), rtol=1e-3, atol=1e-3)
        expected_logits = torch.tensor(
            [[-5.0188, -1.0069, -1.0005], [-5.1177, -1.0537, -1.0444], [-5.3986, -2.4935, -2.4716]]
        )
        torch.testing.assert_close(cpu_outputs.logits[0, :3, :3], expected_logits, rtol=1e-3, atol=1e-3)
        # assert postprocessing
        results_cpu = processor.image_processor.post_process_object_detection(
            cpu_outputs, threshold=0.35, target_sizes=[(image.height, image.width)]
        )[0]
        results_gpu = processor.image_processor.post_process_object_detection(
            gpu_outputs, threshold=0.35, target_sizes=[(image.height, image.width)]
        )[0]
        torch.testing.assert_close(results_cpu["scores"], results_gpu["scores"].cpu(), rtol=1e-3, atol=1e-3)
        torch.testing.assert_close(results_cpu["boxes"], results_gpu["boxes"].cpu(), rtol=1e-3, atol=1e-3)
    @is_flaky()
    def test_cross_attention_mask(self):
        model = MMGroundingDinoForObjectDetection.from_pretrained(
            "openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg_v3det"
        ).to(torch_device)
        # HACK: the issue happens during top-k (k=900) after the encoder:
        # there are some flips between CPU and GPU query ordering,
        # which cause different query position embedding assignments,
        # which in turn significantly change the decoder pass due to self-attention
        model.config.num_queries = 100
        model.model.query_position_embeddings.weight.data = model.model.query_position_embeddings.weight.data[:100]
        processor = self.default_processor
        image = prepare_img()
        text1 = "a cat."
        text2 = "a remote control."
        text_batched = [text1, text2]
        encoding1 = processor(images=image, text=text1, return_tensors="pt").to(torch_device)
        encoding2 = processor(images=image, text=text2, return_tensors="pt").to(torch_device)
        # If we batch the text and cross-attention masking is working, the batched result should be equal to
        # the single-text result
        encoding_batched = processor(
            images=[image] * len(text_batched), text=text_batched, padding="longest", return_tensors="pt"
        ).to(torch_device)
        with torch.no_grad():
            outputs1 = model(**encoding1)
            outputs2 = model(**encoding2)
            outputs_batched = model(**encoding_batched)
        torch.testing.assert_close(outputs1.logits, outputs_batched.logits[:1], rtol=1e-3, atol=1e-3)
        # For some reason 12 elements differ by more than 1e-3, but the rest are fine, hence the looser atol
        self.assertTrue(torch.allclose(outputs2.logits, outputs_batched.logits[1:], atol=1.8e-3))
    def test_mm_grounding_dino_loss(self):
        ds = load_dataset("EduardoPacheco/aquarium-sample", split="train")
        image_processor = self.default_processor.image_processor
        tokenizer = self.default_processor.tokenizer
        id2label = {0: "fish", 1: "jellyfish", 2: "penguins", 3: "sharks", 4: "puffins", 5: "stingrays", 6: "starfish"}
        prompt = ". ".join(id2label.values()) + "."
        text_inputs = tokenizer([prompt, prompt], return_tensors="pt")
        image_inputs = image_processor(
            images=list(ds["image"]), annotations=list(ds["annotations"]), return_tensors="pt"
        )
        # Passing auxiliary_loss=True to compare with the expected loss
        model = MMGroundingDinoForObjectDetection.from_pretrained(
            "openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg_v3det",
            auxiliary_loss=True,
        )
        # Interested in the loss only
        model.eval()
        with torch.no_grad():
            outputs = model(**text_inputs, **image_inputs)
        # The loss differs between CPU and GPU; these expected values may also change in the future.
        expected_loss_dict = {
            "loss_ce": torch.tensor(1.1799),
            "loss_bbox": torch.tensor(0.2348),
            "loss_giou": torch.tensor(0.5834),
            "loss_ce_0": torch.tensor(1.1199),
            "loss_bbox_0": torch.tensor(0.3083),
            "loss_giou_0": torch.tensor(0.6555),
            "loss_ce_1": torch.tensor(1.2075),
            "loss_bbox_1": torch.tensor(0.2641),
            "loss_giou_1": torch.tensor(0.6073),
            "loss_ce_2": torch.tensor(1.2915),
            "loss_bbox_2": torch.tensor(0.2616),
            "loss_giou_2": torch.tensor(0.5730),
            "loss_ce_3": torch.tensor(1.0243),
            "loss_bbox_3": torch.tensor(0.2799),
            "loss_giou_3": torch.tensor(0.6326),
            "loss_ce_4": torch.tensor(1.2019),
            "loss_bbox_4": torch.tensor(0.2430),
            "loss_giou_4": torch.tensor(0.5679),
            "loss_ce_enc": torch.tensor(10.2381),
            "loss_bbox_enc": torch.tensor(0.2886),
            "loss_giou_enc": torch.tensor(0.6335),
        }
        expected_loss = torch.tensor(52.4340)
        for key in expected_loss_dict:
            self.assertTrue(torch.allclose(outputs.loss_dict[key], expected_loss_dict[key], atol=1e-3))
        self.assertTrue(torch.allclose(outputs.loss, expected_loss, atol=1e-3))
--- a/utils/check_config_attributes.py
+++ b/utils/check_config_attributes.py
@@ -221,6 +221,14 @@ SPECIAL_CASES_TO_ALLOW = {
        "giou_cost",
        "giou_loss_coefficient",
    ],
    "MMGroundingDinoConfig": [
        "bbox_cost",
        "bbox_loss_coefficient",
        "class_cost",
        "focal_alpha",
        "giou_cost",
        "giou_loss_coefficient",
    ],
    "RTDetrConfig": [
        "eos_coefficient",
        "focal_loss_alpha",