Add MM Grounding DINO (#37925)

* first commit

Added modular implementation for MM Grounding DINO from starting point created by add-new-model-like. Added conversion script from mmdetection to huggingface.

TODO: Some tests are failing so that needs to be fixed.

* fixed a bug with modular definition of MMGroundingDinoForObjectDetection where box and class heads were not correctly assigned to inner model

* cleaned up a hack in the conversion script

* Fixed the expected values in integration tests

Cross att masking and cpu-gpu consistency tests are still failing however.

* changes for make style and quality

* add documentation

* clean up contrastive embedding

* add mm grounding dino to loss mapping

* add model link to config docstring

* hack fix for mm grounding dino consistency tests

* add special cases for unused config attr check

* add all models and update docs

* update model doc to the new style

* Use super_kwargs for modular config

* Move init to the _init_weights function

* Add copied from for tests

* fixup

* update typehints

* Fix-copies for tests

* fix-copies

* Fix init test

* fix snippets in docs

* fix consistency

* fix consistency

* update conversion script

* fix nits in readme and remove old comments from conversion script

* add license

* remove unused config args

* remove unnecessary if/else in model init

* fix quality

* Update references

* fix test

* fixup

---------

Co-authored-by: qubvel <qubvel@gmail.com>
This commit is contained in:
rziga
2025-08-01 16:43:23 +02:00
committed by GitHub
parent 50145474b7
commit 3951d4ad5d
17 changed files with 4884 additions and 1 deletions

View File

@@ -1051,6 +1051,8 @@
title: Mistral3
- local: model_doc/mllama
title: mllama
- local: model_doc/mm-grounding-dino
title: MM Grounding DINO
- local: model_doc/nougat
title: Nougat
- local: model_doc/omdet-turbo

View File

@@ -0,0 +1,124 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>
</div>
# MM Grounding DINO
[MM Grounding DINO](https://arxiv.org/abs/2401.02361) model was proposed in [An Open and Comprehensive Pipeline for Unified Object Grounding and Detection](https://arxiv.org/abs/2401.02361) by Xiangyu Zhao, Yicheng Chen, Shilin Xu, Xiangtai Li, Xinjiang Wang, Yining Li, Haian Huang>.
MM Grounding DINO improves upon the [Grounding DINO](https://huggingface.co/docs/transformers/model_doc/grounding-dino) by improving the contrastive class head and removing the parameter sharing in the decoder, improving zero-shot detection performance on both COCO (50.6(+2.2) AP) and LVIS (31.9(+11.8) val AP and 41.4(+12.6) minival AP).
You can find all the original MM Grounding DINO checkpoints under the [MM Grounding DINO](https://huggingface.co/collections/openmmlab-community/mm-grounding-dino-688cbde05b814c4e2832f9df) collection. This model also supports LLMDet inference. You can find LLMDet checkpoints under the [LLMDet](https://huggingface.co/collections/iSEE-Laboratory/llmdet-688475906dc235d5f1dc678e) collection.
> [!TIP]
> Click on the MM Grounding DINO models in the right sidebar for more examples of how to apply MM Grounding DINO to different MM Grounding DINO tasks.
The example below demonstrates how to generate text based on an image with the [`AutoModelForZeroShotObjectDetection`] class.
<hfoptions id="usage">
<hfoption id="AutoModel">
```py
import torch
from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor
from transformers.image_utils import load_image
# Prepare processor and model
model_id = "openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg_v3det"
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)
# Prepare inputs
image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = load_image(image_url)
text_labels = [["a cat", "a remote control"]]
inputs = processor(images=image, text=text_labels, return_tensors="pt").to(device)
# Run inference
with torch.no_grad():
outputs = model(**inputs)
# Postprocess outputs
results = processor.post_process_grounded_object_detection(
outputs,
threshold=0.4,
target_sizes=[(image.height, image.width)]
)
# Retrieve the first image result
result = results[0]
for box, score, labels in zip(result["boxes"], result["scores"], result["labels"]):
box = [round(x, 2) for x in box.tolist()]
print(f"Detected {labels} with confidence {round(score.item(), 3)} at location {box}")
```
</hfoption>
</hfoptions>
## Notes
- Here's a table of models and their object detection performance results on COCO (results from [official repo](https://github.com/open-mmlab/mmdetection/blob/main/configs/mm_grounding_dino/README.md)):
| Model | Backbone | Pre-Train Data | Style | COCO mAP |
| ------------------------------------------------------------------------------------------------------------------------------ | -------- | ------------------------ | --------- | ---------- |
| [mm_grounding_dino_tiny_o365v1_goldg](https://huggingface.co/openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg) | Swin-T | O365,GoldG | Zero-shot | 50.4(+2.3) |
| [mm_grounding_dino_tiny_o365v1_goldg_grit](https://huggingface.co/openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg_grit) | Swin-T | O365,GoldG,GRIT | Zero-shot | 50.5(+2.1) |
| [mm_grounding_dino_tiny_o365v1_goldg_v3det](https://huggingface.co/openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg_v3det) | Swin-T | O365,GoldG,V3Det | Zero-shot | 50.6(+2.2) |
| [mm_grounding_dino_tiny_o365v1_goldg_grit_v3det](https://huggingface.co/openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg_grit_v3det) | Swin-T | O365,GoldG,GRIT,V3Det | Zero-shot | 50.4(+2.0) |
| [mm_grounding_dino_base_o365v1_goldg_v3det](https://huggingface.co/openmmlab-community/mm_grounding_dino_base_o365v1_goldg_v3det) | Swin-B | O365,GoldG,V3Det | Zero-shot | 52.5 |
| [mm_grounding_dino_base_all](https://huggingface.co/openmmlab-community/mm_grounding_dino_base_all) | Swin-B | O365,ALL | - | 59.5 |
| [mm_grounding_dino_large_o365v2_oiv6_goldg](https://huggingface.co/openmmlab-community/mm_grounding_dino_large_o365v2_oiv6_goldg) | Swin-L | O365V2,OpenImageV6,GoldG | Zero-shot | 53.0 |
| [mm_grounding_dino_large_all](https://huggingface.co/openmmlab-community/mm_grounding_dino_large_all) | Swin-L | O365V2,OpenImageV6,ALL | - | 60.3 |
- Here's a table of MM Grounding DINO tiny models and their object detection performance on LVIS (results from [official repo](https://github.com/open-mmlab/mmdetection/blob/main/configs/mm_grounding_dino/README.md)):
| Model | Pre-Train Data | MiniVal APr | MiniVal APc | MiniVal APf | MiniVal AP | Val1.0 APr | Val1.0 APc | Val1.0 APf | Val1.0 AP |
| ------------------------------------------------------------------------------------------------------------------------------ | --------------------- | ----------- | ----------- | ----------- | ----------- | ---------- | ---------- | ---------- | ----------- |
| [mm_grounding_dino_tiny_o365v1_goldg](https://huggingface.co/openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg) | O365,GoldG | 28.1 | 30.2 | 42.0 | 35.7(+6.9) | 17.1 | 22.4 | 36.5 | 27.0(+6.9) |
| [mm_grounding_dino_tiny_o365v1_goldg_grit](https://huggingface.co/openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg_grit) | O365,GoldG,GRIT | 26.6 | 32.4 | 41.8 | 36.5(+7.7) | 17.3 | 22.6 | 36.4 | 27.1(+7.0) |
| [mm_grounding_dino_tiny_o365v1_goldg_v3det](https://huggingface.co/openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg_v3det) | O365,GoldG,V3Det | 33.0 | 36.0 | 45.9 | 40.5(+11.7) | 21.5 | 25.5 | 40.2 | 30.6(+10.5) |
| [mm_grounding_dino_tiny_o365v1_goldg_grit_v3det](https://huggingface.co/openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg_grit_v3det) | O365,GoldG,GRIT,V3Det | 34.2 | 37.4 | 46.2 | 41.4(+12.6) | 23.6 | 27.6 | 40.5 | 31.9(+11.8) |
- This implementation also supports inference for [LLMDet](https://github.com/iSEE-Laboratory/LLMDet). Here's a table of LLMDet models and their performance on LVIS (results from [official repo](https://github.com/iSEE-Laboratory/LLMDet)):
| Model | Pre-Train Data | MiniVal APr | MiniVal APc | MiniVal APf | MiniVal AP | Val1.0 APr | Val1.0 APc | Val1.0 APf | Val1.0 AP |
| --------------------------------------------------------- | -------------------------------------------- | ------------ | ----------- | ----------- | ----------- | ---------- | ---------- | ---------- | ----------- |
| [llmdet_tiny](https://huggingface.co/iSEE-Laboratory/llmdet_tiny) | (O365,GoldG,GRIT,V3Det) + GroundingCap-1M | 44.7 | 37.3 | 39.5 | 50.7 | 34.9 | 26.0 | 30.1 | 44.3 |
| [llmdet_base](https://huggingface.co/iSEE-Laboratory/llmdet_base) | (O365,GoldG,V3Det) + GroundingCap-1M | 48.3 | 40.8 | 43.1 | 54.3 | 38.5 | 28.2 | 34.3 | 47.8 |
| [llmdet_large](https://huggingface.co/iSEE-Laboratory/llmdet_large) | (O365V2,OpenImageV6,GoldG) + GroundingCap-1M | 51.1 | 45.1 | 46.1 | 56.6 | 42.0 | 31.6 | 38.8 | 50.2 |
## MMGroundingDinoConfig
[[autodoc]] MMGroundingDinoConfig
## MMGroundingDinoModel
[[autodoc]] MMGroundingDinoModel
- forward
## MMGroundingDinoForObjectDetection
[[autodoc]] MMGroundingDinoForObjectDetection
- forward

View File

@@ -158,6 +158,7 @@ LOSS_MAPPING = {
"ConditionalDetrForObjectDetection": DeformableDetrForObjectDetectionLoss,
"DabDetrForObjectDetection": DeformableDetrForObjectDetectionLoss,
"GroundingDinoForObjectDetection": GroundingDinoForObjectDetectionLoss,
"MMGroundingDinoForObjectDetection": GroundingDinoForObjectDetectionLoss,
"ConditionalDetrForSegmentation": DeformableDetrForSegmentationLoss,
"RTDetrForObjectDetection": RTDetrForObjectDetectionLoss,
"RTDetrV2ForObjectDetection": RTDetrForObjectDetectionLoss,

View File

@@ -244,6 +244,7 @@ CONFIG_MAPPING_NAMES = OrderedDict[str, str](
("mixtral", "MixtralConfig"),
("mlcd", "MLCDVisionConfig"),
("mllama", "MllamaConfig"),
("mm-grounding-dino", "MMGroundingDinoConfig"),
("mobilebert", "MobileBertConfig"),
("mobilenet_v1", "MobileNetV1Config"),
("mobilenet_v2", "MobileNetV2Config"),
@@ -657,6 +658,7 @@ MODEL_NAMES_MAPPING = OrderedDict[str, str](
("mlcd", "MLCD"),
("mllama", "Mllama"),
("mluke", "mLUKE"),
("mm-grounding-dino", "MM Grounding DINO"),
("mms", "MMS"),
("mobilebert", "MobileBERT"),
("mobilenet_v1", "MobileNetV1"),

View File

@@ -130,6 +130,7 @@ else:
("mistral3", ("PixtralImageProcessor", "PixtralImageProcessorFast")),
("mlcd", ("CLIPImageProcessor", "CLIPImageProcessorFast")),
("mllama", ("MllamaImageProcessor",)),
("mm-grounding-dino", ("GroundingDinoImageProcessor", "GroundingDinoImageProcessorFast")),
("mobilenet_v1", ("MobileNetV1ImageProcessor", "MobileNetV1ImageProcessorFast")),
("mobilenet_v2", ("MobileNetV2ImageProcessor", "MobileNetV2ImageProcessorFast")),
("mobilevit", ("MobileViTImageProcessor", "MobileViTImageProcessorFast")),

View File

@@ -233,6 +233,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
("mixtral", "MixtralModel"),
("mlcd", "MLCDVisionModel"),
("mllama", "MllamaModel"),
("mm-grounding-dino", "MMGroundingDinoModel"),
("mobilebert", "MobileBertModel"),
("mobilenet_v1", "MobileNetV1Model"),
("mobilenet_v2", "MobileNetV2Model"),
@@ -1057,6 +1058,7 @@ MODEL_FOR_ZERO_SHOT_OBJECT_DETECTION_MAPPING_NAMES = OrderedDict(
[
# Model for Zero Shot Object Detection mapping
("grounding-dino", "GroundingDinoForObjectDetection"),
("mm-grounding-dino", "MMGroundingDinoForObjectDetection"),
("omdet-turbo", "OmDetTurboForObjectDetection"),
("owlv2", "Owlv2ForObjectDetection"),
("owlvit", "OwlViTForObjectDetection"),

View File

@@ -100,6 +100,7 @@ PROCESSOR_MAPPING_NAMES = OrderedDict(
("mgp-str", "MgpstrProcessor"),
("mistral3", "PixtralProcessor"),
("mllama", "MllamaProcessor"),
("mm-grounding-dino", "GroundingDinoProcessor"),
("moonshine", "Wav2Vec2Processor"),
("oneformer", "OneFormerProcessor"),
("owlv2", "Owlv2Processor"),

View File

@@ -430,6 +430,7 @@ TOKENIZER_MAPPING_NAMES = OrderedDict[str, tuple[Optional[str], Optional[str]]](
),
("mllama", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
("mluke", ("MLukeTokenizer" if is_sentencepiece_available() else None, None)),
("mm-grounding-dino", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
("mobilebert", ("MobileBertTokenizer", "MobileBertTokenizerFast" if is_tokenizers_available() else None)),
("modernbert", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
("moonshine", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),

View File

@@ -0,0 +1,27 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure
if TYPE_CHECKING:
from .configuration_mm_grounding_dino import *
from .modeling_mm_grounding_dino import *
else:
import sys
_file = globals()["__file__"]
sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)

View File

@@ -0,0 +1,292 @@
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
# This file was automatically generated from src/transformers/models/mm_grounding_dino/modular_mm_grounding_dino.py.
# Do NOT edit this file manually as any edits will be overwritten by the generation of
# the file from the modular. If any change should be done, please apply the change to the
# modular_mm_grounding_dino.py file directly. One of our CI enforces this.
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
# coding=utf-8
# Copyright 2025 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from ...configuration_utils import PretrainedConfig
from ...utils import logging
from ...utils.backbone_utils import verify_backbone_config_arguments
from ..auto import CONFIG_MAPPING
logger = logging.get_logger(__name__)
class MMGroundingDinoConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`MMGroundingDinoModel`]. It is used to instantiate a
MM Grounding DINO model according to the specified arguments, defining the model architecture. Instantiating a
configuration with the defaults will yield a similar configuration to that of the MM Grounding DINO tiny architecture
[openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg_v3det](https://huggingface.co/openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg_v3det).
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
backbone_config (`PretrainedConfig` or `dict`, *optional*, defaults to `ResNetConfig()`):
The configuration of the backbone model.
backbone (`str`, *optional*):
Name of backbone to use when `backbone_config` is `None`. If `use_pretrained_backbone` is `True`, this
will load the corresponding pretrained weights from the timm or transformers library. If `use_pretrained_backbone`
is `False`, this loads the backbone's config and uses that to initialize the backbone with random weights.
use_pretrained_backbone (`bool`, *optional*, defaults to `False`):
Whether to use pretrained weights for the backbone.
use_timm_backbone (`bool`, *optional*, defaults to `False`):
Whether to load `backbone` from the timm library. If `False`, the backbone is loaded from the transformers
library.
backbone_kwargs (`dict`, *optional*):
Keyword arguments to be passed to AutoBackbone when loading from a checkpoint
e.g. `{'out_indices': (0, 1, 2, 3)}`. Cannot be specified if `backbone_config` is set.
text_config (`Union[AutoConfig, dict]`, *optional*, defaults to `BertConfig`):
The config object or dictionary of the text backbone.
num_queries (`int`, *optional*, defaults to 900):
Number of object queries, i.e. detection slots. This is the maximal number of objects
[`MMGroundingDinoModel`] can detect in a single image.
encoder_layers (`int`, *optional*, defaults to 6):
Number of encoder layers.
encoder_ffn_dim (`int`, *optional*, defaults to 2048):
Dimension of the "intermediate" (often named feed-forward) layer in decoder.
encoder_attention_heads (`int`, *optional*, defaults to 8):
Number of attention heads for each attention layer in the Transformer encoder.
decoder_layers (`int`, *optional*, defaults to 6):
Number of decoder layers.
decoder_ffn_dim (`int`, *optional*, defaults to 2048):
Dimension of the "intermediate" (often named feed-forward) layer in decoder.
decoder_attention_heads (`int`, *optional*, defaults to 8):
Number of attention heads for each attention layer in the Transformer decoder.
is_encoder_decoder (`bool`, *optional*, defaults to `True`):
Whether the model is used as an encoder/decoder or not.
activation_function (`str` or `function`, *optional*, defaults to `"relu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
`"relu"`, `"silu"` and `"gelu_new"` are supported.
d_model (`int`, *optional*, defaults to 256):
Dimension of the layers.
dropout (`float`, *optional*, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (`float`, *optional*, defaults to 0.0):
The dropout ratio for the attention probabilities.
activation_dropout (`float`, *optional*, defaults to 0.0):
The dropout ratio for activations inside the fully connected layer.
auxiliary_loss (`bool`, *optional*, defaults to `False`):
Whether auxiliary decoding losses (loss at each decoder layer) are to be used.
position_embedding_type (`str`, *optional*, defaults to `"sine"`):
Type of position embeddings to be used on top of the image features. One of `"sine"` or `"learned"`.
num_feature_levels (`int`, *optional*, defaults to 4):
The number of input feature levels.
encoder_n_points (`int`, *optional*, defaults to 4):
The number of sampled keys in each feature level for each attention head in the encoder.
decoder_n_points (`int`, *optional*, defaults to 4):
The number of sampled keys in each feature level for each attention head in the decoder.
two_stage (`bool`, *optional*, defaults to `True`):
Whether to apply a two-stage deformable DETR, where the region proposals are also generated by a variant of
Grounding DINO, which are further fed into the decoder for iterative bounding box refinement.
class_cost (`float`, *optional*, defaults to 1.0):
Relative weight of the classification error in the Hungarian matching cost.
bbox_cost (`float`, *optional*, defaults to 5.0):
Relative weight of the L1 error of the bounding box coordinates in the Hungarian matching cost.
giou_cost (`float`, *optional*, defaults to 2.0):
Relative weight of the generalized IoU loss of the bounding box in the Hungarian matching cost.
bbox_loss_coefficient (`float`, *optional*, defaults to 5.0):
Relative weight of the L1 bounding box loss in the object detection loss.
giou_loss_coefficient (`float`, *optional*, defaults to 2.0):
Relative weight of the generalized IoU loss in the object detection loss.
focal_alpha (`float`, *optional*, defaults to 0.25):
Alpha parameter in the focal loss.
disable_custom_kernels (`bool`, *optional*, defaults to `False`):
Disable the use of custom CUDA and CPU kernels. This option is necessary for the ONNX export, as custom
kernels are not supported by PyTorch ONNX export.
max_text_len (`int`, *optional*, defaults to 256):
The maximum length of the text input.
text_enhancer_dropout (`float`, *optional*, defaults to 0.0):
The dropout ratio for the text enhancer.
fusion_droppath (`float`, *optional*, defaults to 0.1):
The droppath ratio for the fusion module.
fusion_dropout (`float`, *optional*, defaults to 0.0):
The dropout ratio for the fusion module.
embedding_init_target (`bool`, *optional*, defaults to `True`):
Whether to initialize the target with Embedding weights.
query_dim (`int`, *optional*, defaults to 4):
The dimension of the query vector.
positional_embedding_temperature (`float`, *optional*, defaults to 20):
The temperature for Sine Positional Embedding that is used together with vision backbone.
init_std (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (`float`, *optional*, defaults to 1e-05):
The epsilon used by the layer normalization layers.
Examples:
```python
>>> from transformers import MMGroundingDinoConfig, MMGroundingDinoModel
>>> # Initializing a MM Grounding DINO configuration
>>> configuration = MMGroundingDinoConfig()
>>> # Initializing a model (with random weights) from the configuration
>>> model = MMGroundingDinoModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = "mm-grounding-dino"
attribute_map = {
"hidden_size": "d_model",
"num_attention_heads": "encoder_attention_heads",
}
def __init__(
self,
backbone_config=None,
backbone=None,
use_pretrained_backbone=False,
use_timm_backbone=False,
backbone_kwargs=None,
text_config=None,
num_queries=900,
encoder_layers=6,
encoder_ffn_dim=2048,
encoder_attention_heads=8,
decoder_layers=6,
decoder_ffn_dim=2048,
decoder_attention_heads=8,
is_encoder_decoder=True,
activation_function="relu",
d_model=256,
dropout=0.1,
attention_dropout=0.0,
activation_dropout=0.0,
auxiliary_loss=False,
position_embedding_type="sine",
num_feature_levels=4,
encoder_n_points=4,
decoder_n_points=4,
two_stage=True,
class_cost=1.0,
bbox_cost=5.0,
giou_cost=2.0,
bbox_loss_coefficient=5.0,
giou_loss_coefficient=2.0,
focal_alpha=0.25,
disable_custom_kernels=False,
# other parameters
max_text_len=256,
text_enhancer_dropout=0.0,
fusion_droppath=0.1,
fusion_dropout=0.0,
embedding_init_target=True,
query_dim=4,
positional_embedding_temperature=20,
init_std=0.02,
layer_norm_eps=1e-5,
**kwargs,
):
super().__init__(is_encoder_decoder=is_encoder_decoder, **kwargs)
if backbone_config is None and backbone is None:
logger.info("`backbone_config` is `None`. Initializing the config with the default `Swin` backbone.")
backbone_config = CONFIG_MAPPING["swin"](
window_size=7,
image_size=224,
embed_dim=96,
depths=[2, 2, 6, 2],
num_heads=[3, 6, 12, 24],
out_indices=[2, 3, 4],
)
elif isinstance(backbone_config, dict):
backbone_model_type = backbone_config.pop("model_type")
config_class = CONFIG_MAPPING[backbone_model_type]
backbone_config = config_class.from_dict(backbone_config)
verify_backbone_config_arguments(
use_timm_backbone=use_timm_backbone,
use_pretrained_backbone=use_pretrained_backbone,
backbone=backbone,
backbone_config=backbone_config,
backbone_kwargs=backbone_kwargs,
)
if text_config is None:
text_config = {}
logger.info("text_config is None. Initializing the text config with default values (`BertConfig`).")
self.backbone_config = backbone_config
self.backbone = backbone
self.use_pretrained_backbone = use_pretrained_backbone
self.use_timm_backbone = use_timm_backbone
self.backbone_kwargs = backbone_kwargs
self.num_queries = num_queries
self.d_model = d_model
self.encoder_ffn_dim = encoder_ffn_dim
self.encoder_layers = encoder_layers
self.encoder_attention_heads = encoder_attention_heads
self.decoder_ffn_dim = decoder_ffn_dim
self.decoder_layers = decoder_layers
self.decoder_attention_heads = decoder_attention_heads
self.dropout = dropout
self.attention_dropout = attention_dropout
self.activation_dropout = activation_dropout
self.activation_function = activation_function
self.auxiliary_loss = auxiliary_loss
self.position_embedding_type = position_embedding_type
# deformable attributes
self.num_feature_levels = num_feature_levels
self.encoder_n_points = encoder_n_points
self.decoder_n_points = decoder_n_points
self.two_stage = two_stage
# Hungarian matcher
self.class_cost = class_cost
self.bbox_cost = bbox_cost
self.giou_cost = giou_cost
# Loss coefficients
self.bbox_loss_coefficient = bbox_loss_coefficient
self.giou_loss_coefficient = giou_loss_coefficient
self.focal_alpha = focal_alpha
self.disable_custom_kernels = disable_custom_kernels
# Text backbone
if isinstance(text_config, dict):
text_config["model_type"] = text_config["model_type"] if "model_type" in text_config else "bert"
text_config = CONFIG_MAPPING[text_config["model_type"]](**text_config)
elif text_config is None:
text_config = CONFIG_MAPPING["bert"]()
self.text_config = text_config
self.max_text_len = max_text_len
# Text Enhancer
self.text_enhancer_dropout = text_enhancer_dropout
# Fusion
self.fusion_droppath = fusion_droppath
self.fusion_dropout = fusion_dropout
# Others
self.embedding_init_target = embedding_init_target
self.query_dim = query_dim
self.positional_embedding_temperature = positional_embedding_temperature
self.init_std = init_std
self.layer_norm_eps = layer_norm_eps
@property
def num_attention_heads(self) -> int:
return self.encoder_attention_heads
@property
def hidden_size(self) -> int:
return self.d_model
__all__ = ["MMGroundingDinoConfig"]

View File

@@ -0,0 +1,504 @@
# coding=utf-8
# Copyright 2025 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import re
import requests
import torch
from PIL import Image
from transformers.models.bert.tokenization_bert import BertTokenizer
from transformers.models.grounding_dino.image_processing_grounding_dino import GroundingDinoImageProcessor
from transformers.models.grounding_dino.processing_grounding_dino import GroundingDinoProcessor
from transformers.models.mm_grounding_dino.configuration_mm_grounding_dino import MMGroundingDinoConfig
from transformers.models.mm_grounding_dino.modeling_mm_grounding_dino import MMGroundingDinoForObjectDetection
from transformers.models.swin.configuration_swin import SwinConfig
MODEL_NAME_TO_CHECKPOINT_URL_MAPPING = {
"mm_grounding_dino_tiny_o365v1_goldg": "https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg/grounding_dino_swin-t_pretrain_obj365_goldg_20231122_132602-4ea751ce.pth",
"mm_grounding_dino_tiny_o365v1_goldg_grit": "https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_20231128_200818-169cc352.pth",
"mm_grounding_dino_tiny_o365v1_goldg_v3det": "https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_v3det/grounding_dino_swin-t_pretrain_obj365_goldg_v3det_20231218_095741-e316e297.pth",
"mm_grounding_dino_tiny_o365v1_goldg_grit_v3det": "https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth",
"mm_grounding_dino_base_o365v1_goldg_v3det": "https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-b_pretrain_obj365_goldg_v3det/grounding_dino_swin-b_pretrain_obj365_goldg_v3de-f83eef00.pth",
"mm_grounding_dino_base_all": "https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-b_pretrain_all/grounding_dino_swin-b_pretrain_all-f9818a7c.pth",
"mm_grounding_dino_large_o365v2_oiv6_goldg": "https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-l_pretrain_obj365_goldg/grounding_dino_swin-l_pretrain_obj365_goldg-34dcdc53.pth",
"mm_grounding_dino_large_all": "https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-l_pretrain_all/grounding_dino_swin-l_pretrain_all-56d69e78.pth",
"llmdet_tiny": "https://huggingface.co/fushh7/LLMDet/resolve/main/tiny.pth?download=true",
"llmdet_base": "https://huggingface.co/fushh7/LLMDet/resolve/main/base.pth?download=true",
"llmdet_large": "https://huggingface.co/fushh7/LLMDet/resolve/main/large.pth?download=true",
}
MODEL_NAME_TO_EXPECTED_OUTPUT_MAPPING = {
"mm_grounding_dino_tiny_o365v1_goldg": {
"scores": torch.tensor([0.7722, 0.7584, 0.7984, 0.7163]),
"boxes": torch.tensor(
[
[0.5212, 0.1594, 0.5792, 0.3895],
[0.5424, 0.0513, 0.9996, 0.7757],
[0.0629, 0.1526, 0.2746, 0.2447],
[0.0091, 0.1127, 0.4945, 0.9911],
]
),
},
"mm_grounding_dino_tiny_o365v1_goldg_grit": {
"scores": torch.tensor([0.7865, 0.7180, 0.7665, 0.8177]),
"boxes": torch.tensor(
[
[0.0084, 0.1129, 0.4940, 0.9895],
[0.5214, 0.1597, 0.5786, 0.3875],
[0.5413, 0.0507, 0.9998, 0.7768],
[0.0631, 0.1527, 0.2740, 0.2449],
]
),
},
"mm_grounding_dino_tiny_o365v1_goldg_v3det": {
"scores": torch.tensor([0.5690, 0.5553, 0.6075, 0.5775]),
"boxes": torch.tensor(
[
[0.5393, 0.0502, 0.9989, 0.7763],
[0.0090, 0.1125, 0.4950, 0.9895],
[0.5207, 0.1589, 0.5794, 0.3889],
[0.0625, 0.1519, 0.2750, 0.2446],
]
),
},
"mm_grounding_dino_tiny_o365v1_goldg_grit_v3det": {
"scores": torch.tensor([0.8381, 0.8204, 0.7970, 0.7175]),
"boxes": torch.tensor(
[
[0.0099, 0.1129, 0.4942, 0.9903],
[0.5413, 0.0506, 0.9998, 0.7753],
[0.0626, 0.1527, 0.2744, 0.2443],
[0.5211, 0.1596, 0.5790, 0.3890],
]
),
},
"mm_grounding_dino_base_o365v1_goldg_v3det": {
"scores": torch.tensor([0.8418, 0.8364, 0.8342, 0.7885]),
"boxes": torch.tensor(
[
[0.5427, 0.0502, 0.9996, 0.7770],
[0.0628, 0.1529, 0.2747, 0.2448],
[0.0085, 0.1132, 0.4947, 0.9898],
[0.5208, 0.1597, 0.5787, 0.3910],
]
),
},
"mm_grounding_dino_base_all": {
"scores": torch.tensor([0.4713]),
"boxes": torch.tensor([[0.5423, 0.0507, 0.9998, 0.7761]]),
},
"mm_grounding_dino_large_o365v2_oiv6_goldg": {
"scores": torch.tensor([0.7824, 0.8275, 0.7715, 0.8211]),
"boxes": torch.tensor(
[
[0.0082, 0.1133, 0.4945, 0.9889],
[0.5410, 0.0508, 0.9998, 0.7771],
[0.0632, 0.1526, 0.2740, 0.2439],
[0.5205, 0.1599, 0.5787, 0.3906],
]
),
},
"mm_grounding_dino_large_all": {
"scores": torch.tensor([0.7373, 0.6208, 0.6913, 0.4523]),
"boxes": torch.tensor(
[
[0.5424, 0.0509, 0.9997, 0.7765],
[0.0632, 0.1529, 0.2744, 0.2447],
[0.0121, 0.1125, 0.4947, 0.9884],
[0.5206, 0.1597, 0.5789, 0.3933],
]
),
},
"llmdet_tiny": {
"scores": torch.tensor([0.7262, 0.7552, 0.7656, 0.8207]),
"boxes": torch.tensor(
[
[0.0114, 0.1132, 0.4947, 0.9854],
[0.5387, 0.0513, 0.9992, 0.7765],
[0.5212, 0.1605, 0.5788, 0.3890],
[0.0634, 0.1536, 0.2743, 0.2440],
]
),
},
"llmdet_base": {
"scores": torch.tensor([0.8646, 0.7567, 0.6978, 0.8084]),
"boxes": torch.tensor(
[
[0.0632, 0.1529, 0.2745, 0.2438],
[0.5420, 0.0512, 0.9989, 0.7774],
[0.0110, 0.1134, 0.4950, 0.9875],
[0.5209, 0.1602, 0.5789, 0.3908],
]
),
},
"llmdet_large": {
"scores": torch.tensor([0.7107, 0.8626, 0.7458, 0.8166]),
"boxes": torch.tensor(
[
[0.0147, 0.1128, 0.4957, 0.9858],
[0.0634, 0.1528, 0.2744, 0.2447],
[0.5414, 0.0511, 0.9997, 0.7776],
[0.5209, 0.1602, 0.5792, 0.3916],
]
),
},
}
# fmt: off
ORIGINAL_TO_CONVERTED_KEY_MAPPING = {
# vision backbone
r"backbone.patch_embed.projection.(weight|bias)": r"model.backbone.conv_encoder.model.embeddings.patch_embeddings.projection.\1",
r"backbone.patch_embed.norm.(weight|bias)": r"model.backbone.conv_encoder.model.embeddings.norm.\1",
r"backbone.stages.(\d+).blocks.(\d+).attn.w_msa.(relative_position_bias_table|relative_position_index)": r"model.backbone.conv_encoder.model.encoder.layers.\1.blocks.\2.attention.self.\3",
r"backbone.stages.(\d+).blocks.(\d+).norm1.(weight|bias)": r"model.backbone.conv_encoder.model.encoder.layers.\1.blocks.\2.layernorm_before.\3",
r"backbone.stages.(\d+).blocks.(\d+).attn.w_msa.(query|key|value).(weight|bias)": r"model.backbone.conv_encoder.model.encoder.layers.\1.blocks.\2.attention.self.\3.\4",
r"backbone.stages.(\d+).blocks.(\d+).attn.w_msa.proj.(weight|bias)": r"model.backbone.conv_encoder.model.encoder.layers.\1.blocks.\2.attention.output.dense.\3",
r"backbone.stages.(\d+).blocks.(\d+).norm2.(weight|bias)": r"model.backbone.conv_encoder.model.encoder.layers.\1.blocks.\2.layernorm_after.\3",
r"backbone.stages.(\d+).blocks.(\d+).ffn.layers.0.0.(weight|bias)": r"model.backbone.conv_encoder.model.encoder.layers.\1.blocks.\2.intermediate.dense.\3",
r"backbone.stages.(\d+).blocks.(\d+).ffn.layers.1.(weight|bias)": r"model.backbone.conv_encoder.model.encoder.layers.\1.blocks.\2.output.dense.\3",
r"backbone.stages.(\d+).downsample.reduction.weight": r"model.backbone.conv_encoder.model.encoder.layers.\1.downsample.reduction.weight",
r"backbone.stages.(\d+).downsample.norm.(weight|bias)": r"model.backbone.conv_encoder.model.encoder.layers.\1.downsample.norm.\2",
r"backbone.norms.(\d+).(weight|bias)": r"model.backbone.conv_encoder.model.hidden_states_norms.stage\1.\2",
r"neck.convs.(\d+).conv.(weight|bias)": r"model.input_proj_vision.\1.0.\2",
r"neck.convs.(\d+).gn.(weight|bias)": r"model.input_proj_vision.\1.1.\2",
r"neck.extra_convs.(\d+).conv.(weight|bias)": r"model.input_proj_vision.\1.0.\2",
r"neck.extra_convs.(\d+).gn.(weight|bias)": r"model.input_proj_vision.\1.1.\2",
# text backbone
r"language_model.language_backbone.body.model.(.*)": r"model.text_backbone.\1",
r"text_feat_map.(weight|bias)": r"model.text_projection.\1",
# encoder
r"encoder.fusion_layers.(\d+).gamma_v": r"model.encoder.layers.\1.fusion_layer.vision_param",
r"encoder.fusion_layers.(\d+).gamma_l": r"model.encoder.layers.\1.fusion_layer.text_param",
r"encoder.fusion_layers.(\d+).layer_norm_v.(weight|bias)": r"model.encoder.layers.\1.fusion_layer.layer_norm_vision.\2",
r"encoder.fusion_layers.(\d+).attn.v_proj.(weight|bias)": r"model.encoder.layers.\1.fusion_layer.attn.vision_proj.\2",
r"encoder.fusion_layers.(\d+).attn.values_v_proj.(weight|bias)": r"model.encoder.layers.\1.fusion_layer.attn.values_vision_proj.\2",
r"encoder.fusion_layers.(\d+).attn.out_v_proj.(weight|bias)": r"model.encoder.layers.\1.fusion_layer.attn.out_vision_proj.\2",
r"encoder.fusion_layers.(\d+).layer_norm_l.(weight|bias)": r"model.encoder.layers.\1.fusion_layer.layer_norm_text.\2",
r"encoder.fusion_layers.(\d+).attn.l_proj.(weight|bias)": r"model.encoder.layers.\1.fusion_layer.attn.text_proj.\2",
r"encoder.fusion_layers.(\d+).attn.values_l_proj.(weight|bias)": r"model.encoder.layers.\1.fusion_layer.attn.values_text_proj.\2",
r"encoder.fusion_layers.(\d+).attn.out_l_proj.(weight|bias)": r"model.encoder.layers.\1.fusion_layer.attn.out_text_proj.\2",
r"encoder.layers.(\d+).self_attn.(sampling_offsets|attention_weights|value_proj|output_proj).(weight|bias)": r"model.encoder.layers.\1.deformable_layer.self_attn.\2.\3",
r"encoder.layers.(\d+).norms.0.(weight|bias)": r"model.encoder.layers.\1.deformable_layer.self_attn_layer_norm.\2",
r"encoder.layers.(\d+).ffn.layers.0.0.(weight|bias)": r"model.encoder.layers.\1.deformable_layer.fc1.\2",
r"encoder.layers.(\d+).ffn.layers.1.(weight|bias)": r"model.encoder.layers.\1.deformable_layer.fc2.\2",
r"encoder.layers.(\d+).norms.1.(weight|bias)": r"model.encoder.layers.\1.deformable_layer.final_layer_norm.\2",
r"encoder.text_layers.(\d+).self_attn.attn.(query|key|value)_proj_(weight|bias)": r"model.encoder.layers.\1.text_enhancer_layer.self_attn.\2.\3",
r"encoder.text_layers.(\d+).self_attn.attn.out_proj.(weight|bias)": r"model.encoder.layers.\1.text_enhancer_layer.self_attn.out_proj.\2",
r"encoder.text_layers.(\d+).norms.0.(weight|bias)": r"model.encoder.layers.\1.text_enhancer_layer.layer_norm_before.\2",
r"encoder.text_layers.(\d+).ffn.layers.0.0.(weight|bias)": r"model.encoder.layers.\1.text_enhancer_layer.fc1.\2",
r"encoder.text_layers.(\d+).ffn.layers.1.(weight|bias)": r"model.encoder.layers.\1.text_enhancer_layer.fc2.\2",
r"encoder.text_layers.(\d+).norms.1.(weight|bias)": r"model.encoder.layers.\1.text_enhancer_layer.layer_norm_after.\2",
r"encoder.bbox_head.cls_branch.bias": r"model.encoder_output_class_embed.bias",
r"encoder.bbox_head.reg_branch.0.(weight|bias)": r"model.encoder_output_bbox_embed.layers.0.\1",
r"encoder.bbox_head.reg_branch.2.(weight|bias)": r"model.encoder_output_bbox_embed.layers.1.\1",
r"encoder.bbox_head.reg_branch.4.(weight|bias)": r"model.encoder_output_bbox_embed.layers.2.\1",
# decoder
r"decoder.norm.(weight|bias)": r"model.decoder.layer_norm.\1",
r"decoder.ref_point_head.layers.(\d+).(weight|bias)": r"model.decoder.reference_points_head.layers.\1.\2",
r"decoder.layers.(\d+).self_attn.attn.(query|key|value)_proj_(weight|bias)": r"model.decoder.layers.\1.self_attn.\2.\3",
r"decoder.layers.(\d+).self_attn.attn.out_proj.(weight|bias)": r"model.decoder.layers.\1.self_attn.out_proj.\2",
r"decoder.layers.(\d+).norms.0.(weight|bias)": r"model.decoder.layers.\1.self_attn_layer_norm.\2",
r"decoder.layers.(\d+).cross_attn_text.attn.(query|key|value)_proj_(weight|bias)": r"model.decoder.layers.\1.encoder_attn_text.\2.\3",
r"decoder.layers.(\d+).cross_attn_text.attn.out_proj.(weight|bias)": r"model.decoder.layers.\1.encoder_attn_text.out_proj.\2",
r"decoder.layers.(\d+).norms.1.(weight|bias)": r"model.decoder.layers.\1.encoder_attn_text_layer_norm.\2",
r"decoder.layers.(\d+).cross_attn.(sampling_offsets|attention_weights|value_proj|output_proj).(weight|bias)": r"model.decoder.layers.\1.encoder_attn.\2.\3",
r"decoder.layers.(\d+).norms.2.(weight|bias)": r"model.decoder.layers.\1.encoder_attn_layer_norm.\2",
r"decoder.layers.(\d+).ffn.layers.0.0.(weight|bias)": r"model.decoder.layers.\1.fc1.\2",
r"decoder.layers.(\d+).ffn.layers.1.(weight|bias)": r"model.decoder.layers.\1.fc2.\2",
r"decoder.layers.(\d+).norms.3.(weight|bias)": r"model.decoder.layers.\1.final_layer_norm.\2",
r"decoder.bbox_head.cls_branches.(\d+).bias": r"model.decoder.class_embed.\1.bias",
r"decoder.bbox_head.reg_branches.(\d+).0.(weight|bias)": r"model.decoder.bbox_embed.\1.layers.0.\2",
r"decoder.bbox_head.reg_branches.(\d+).2.(weight|bias)": r"model.decoder.bbox_embed.\1.layers.1.\2",
r"decoder.bbox_head.reg_branches.(\d+).4.(weight|bias)": r"model.decoder.bbox_embed.\1.layers.2.\2",
# other
r"level_embed": r"model.level_embed",
r"query_embedding.weight": r"model.query_position_embeddings.weight",
r"memory_trans_fc.(weight|bias)": r"model.enc_output.\1",
r"memory_trans_norm.(weight|bias)": r"model.enc_output_norm.\1",
r"bbox_head.cls_branches.(\d+).bias": r"class_embed.\1.bias",
r"bbox_head.reg_branches.(\d+).0.(weight|bias)": r"bbox_embed.\1.layers.0.\2",
r"bbox_head.reg_branches.(\d+).2.(weight|bias)": r"bbox_embed.\1.layers.1.\2",
r"bbox_head.reg_branches.(\d+).4.(weight|bias)": r"bbox_embed.\1.layers.2.\2",
}
# fmt: on
def get_mm_grounding_dino_config(model_name: str) -> MMGroundingDinoConfig:
if "tiny" in model_name:
swin_image_size = 224
swin_window_size = 7
swin_embed_dim = 96
swin_depths = (2, 2, 6, 2)
swin_num_heads = (3, 6, 12, 24)
swin_out_features = ["stage2", "stage3", "stage4"]
num_feature_levels = 4
elif "base" in model_name:
swin_image_size = 384
swin_window_size = 12
swin_embed_dim = 128
swin_depths = (2, 2, 18, 2)
swin_num_heads = (4, 8, 16, 32)
swin_out_features = ["stage2", "stage3", "stage4"]
num_feature_levels = 4
elif "large" in model_name:
swin_image_size = 384
swin_window_size = 12
swin_embed_dim = 192
swin_depths = (2, 2, 18, 2)
swin_num_heads = (6, 12, 24, 48)
swin_out_features = ["stage1", "stage2", "stage3", "stage4"]
num_feature_levels = 5
else:
raise ValueError(
f"Model name: {model_name} is not supported. Only `tiny`, `base` and `large` models are currently supported."
)
backbone_config = SwinConfig(
image_size=swin_image_size,
window_size=swin_window_size,
embed_dim=swin_embed_dim,
depths=swin_depths,
num_heads=swin_num_heads,
out_features=swin_out_features,
)
model_config = MMGroundingDinoConfig(
backbone_config=backbone_config,
num_feature_levels=num_feature_levels,
)
return model_config
def get_mm_grounding_dino_processor() -> GroundingDinoProcessor:
img_processor = GroundingDinoImageProcessor()
txt_processor = BertTokenizer.from_pretrained("bert-base-uncased")
processor = GroundingDinoProcessor(img_processor, txt_processor)
return processor
# Copied from: https://github.com/iSEE-Laboratory/LLMDet/blob/96ec8c82a9d97b170db759e043afd5b81445d0f1/hf_model/mmdet2groundingdino_swint.py#L8C1-L13C13
def correct_unfold_reduction_order(x: torch.Tensor) -> torch.Tensor:
out_channel, in_channel = x.shape
x = x.reshape(out_channel, in_channel // 4, 4).transpose(1, 2)
x = x[:, [0, 2, 1, 3], :]
x = x.reshape(out_channel, in_channel)
return x
# Copied from: https://github.com/iSEE-Laboratory/LLMDet/blob/96ec8c82a9d97b170db759e043afd5b81445d0f1/hf_model/mmdet2groundingdino_swint.py#L15C1-L20C13
def correct_unfold_norm_order(x: torch.Tensor) -> torch.Tensor:
in_channel = x.shape[0]
x = x.reshape(in_channel // 4, 4).transpose(0, 1)
x = x[[0, 2, 1, 3], :]
x = x.reshape(in_channel)
return x
def preprocess_old_state(state_dict: dict, config: MMGroundingDinoConfig) -> dict:
"""
Preprocesses old state dict to enable 1-1 mapping:
- split qkv projections in Swin backbone
- reorder reduction and norm parameters in Swin backbone
- shift output norm indices in Swin backbone
- shift output proj indices in neck
- split q,k,v projections in text self and cross attentions in encoder and decoder
- duplicate detection head parameters for decoder and encoder
"""
new_state_dict = state_dict.copy()
for k in state_dict:
if k.startswith("backbone"):
if "downsample.reduction" in k:
new_state_dict[k] = correct_unfold_reduction_order(new_state_dict.pop(k))
elif "downsample.norm" in k:
new_state_dict[k] = correct_unfold_norm_order(new_state_dict.pop(k))
elif "w_msa.qkv" in k:
q_param, k_param, v_param = new_state_dict.pop(k).chunk(3)
new_state_dict[k.replace("qkv", "query")] = q_param
new_state_dict[k.replace("qkv", "key")] = k_param
new_state_dict[k.replace("qkv", "value")] = v_param
elif "backbone.norm" in k:
match = re.match(r"backbone.norm(\d+).(weight|bias)", k)
new_state_dict[f"backbone.norms.{int(match.group(1)) + 1}.{match.group(2)}"] = new_state_dict.pop(k)
elif k.startswith("neck.extra_convs"):
num_normal_convs = len(config.backbone_config.out_indices)
if "gn" in k:
match = re.match(r"neck.extra_convs.(\d+).gn.(weight|bias)", k)
new_state_dict[f"neck.extra_convs.{num_normal_convs + int(match.group(1))}.gn.{match.group(2)}"] = (
new_state_dict.pop(k)
)
elif "conv" in k:
match = re.match(r"neck.extra_convs.(\d+).conv.(weight|bias)", k)
new_state_dict[f"neck.extra_convs.{num_normal_convs + int(match.group(1))}.conv.{match.group(2)}"] = (
new_state_dict.pop(k)
)
elif k.startswith("encoder"):
if "self_attn.attn.in_proj" in k:
q_param, k_param, v_param = new_state_dict.pop(k).chunk(3)
new_state_dict[k.replace("in", "query")] = q_param
new_state_dict[k.replace("in", "key")] = k_param
new_state_dict[k.replace("in", "value")] = v_param
elif k.startswith("decoder"):
if "self_attn.attn.in_proj" in k or "cross_attn_text.attn.in_proj" in k:
q_param, k_param, v_param = new_state_dict.pop(k).chunk(3)
new_state_dict[k.replace("in", "query")] = q_param
new_state_dict[k.replace("in", "key")] = k_param
new_state_dict[k.replace("in", "value")] = v_param
elif k.startswith("bbox_head"):
num_decoder_layers = config.decoder_layers
match = re.match(r"bbox_head.(cls|reg)_branches.(\d+).(.*)", k)
cls_or_reg = match.group(1)
layer_idx = int(match.group(2))
suffix = match.group(3)
if layer_idx < num_decoder_layers:
new_key = f"decoder.bbox_head.{cls_or_reg}_branches.{layer_idx}.{suffix}"
new_state_dict[new_key] = new_state_dict[k] # copy
else:
new_key = f"encoder.bbox_head.{cls_or_reg}_branch.{suffix}"
new_state_dict[new_key] = new_state_dict.pop(k) # move
# remove unused params
if (
k == "dn_query_generator.label_embedding.weight"
or k == "language_model.language_backbone.body.model.embeddings.position_ids"
or k == "image_seperate.weight"
or k.startswith("lmm")
or k.startswith("connector")
or k.startswith("region_connector")
or k.startswith("ref_point_head")
):
new_state_dict.pop(k)
return new_state_dict
# Copied from transformers/models/siglip2/convert_siglip2_to_hf.py
def convert_old_keys_to_new_keys(state_dict_keys: list) -> dict:
"""
This function should be applied only once, on the concatenated keys to efficiently rename using
the key mappings.
"""
output_dict = {}
if state_dict_keys is not None:
old_text = "\n".join(state_dict_keys)
new_text = old_text
for pattern, replacement in ORIGINAL_TO_CONVERTED_KEY_MAPPING.items():
if replacement is None:
new_text = re.sub(pattern, "", new_text) # an empty line
continue
new_text = re.sub(pattern, replacement, new_text)
output_dict = dict(zip(old_text.split("\n"), new_text.split("\n")))
return output_dict
def convert_mm_to_hf_state(original_state: dict, hf_cfg: MMGroundingDinoConfig) -> dict:
original_state = preprocess_old_state(original_state, hf_cfg)
original_state_keys = list(original_state.keys())
original_to_hf_key_map = convert_old_keys_to_new_keys(original_state_keys)
hf_state = {}
for original_key in original_state_keys:
hf_key = original_to_hf_key_map[original_key]
hf_state[hf_key] = original_state.pop(original_key)
return hf_state
def prepare_test_inputs():
image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)
text = [["cat", "remote"]]
return image, text
@torch.no_grad()
def convert_mm_grounding_dino_checkpoint(
model_name: str,
verify_outputs: bool,
push_to_hub: bool,
hub_user_name: str,
) -> tuple[MMGroundingDinoConfig, dict]:
# Load original state
checkpoint_url = MODEL_NAME_TO_CHECKPOINT_URL_MAPPING[model_name]
print(f"Loading checkpoint from: {checkpoint_url}")
ckpt = torch.hub.load_state_dict_from_url(checkpoint_url, map_location="cpu")
mm_state = ckpt["state_dict"]
# Create hf model and processor
print("Creating model...")
hf_cfg = get_mm_grounding_dino_config(model_name)
hf_state = convert_mm_to_hf_state(mm_state, hf_cfg)
hf_model = MMGroundingDinoForObjectDetection(hf_cfg).eval()
hf_model.load_state_dict(hf_state)
hf_processor = get_mm_grounding_dino_processor()
# Verify outputs if needed
if verify_outputs:
print("Running inference to verify outputs...")
image, text = prepare_test_inputs()
model_inputs = hf_processor(images=image, text=text, return_tensors="pt")
model_outputs = hf_model(**model_inputs)
results = hf_processor.post_process_grounded_object_detection(
model_outputs,
model_inputs.input_ids,
box_threshold=0.4,
text_threshold=0.3,
)
result = results[0]
print(result)
expected = MODEL_NAME_TO_EXPECTED_OUTPUT_MAPPING[model_name]
for key in expected:
torch.testing.assert_close(result[key], expected[key], atol=1e-3, rtol=1e-3)
print("Outputs match.")
# Push to hub if needed
if push_to_hub:
print("Pushing to hub...")
hub_url = f"{hub_user_name}/{model_name}"
hf_model.push_to_hub(hub_url)
hf_processor.push_to_hub(hub_url)
print(f"Pushed to huggingface hub at: {hub_url}.")
return hf_cfg, hf_state
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument(
"--model-name",
required=True,
type=str,
choices=list(MODEL_NAME_TO_CHECKPOINT_URL_MAPPING.keys()),
help="URL to the original mm grounding dino checkpoint.",
)
parser.add_argument("--hub-user-name", type=str, help="User name on the huggingface hub.")
parser.add_argument("--push-to-hub", action="store_true", help="Whether to push model to hub or not.")
parser.add_argument(
"--verify-outputs", action="store_true", help="Whether to verify that model output is correct or not."
)
return parser.parse_args()
if __name__ == "__main__":
args = parse_args()
convert_mm_grounding_dino_checkpoint(
args.model_name,
args.verify_outputs,
args.push_to_hub,
args.hub_user_name,
)

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,434 @@
# coding=utf-8
# Copyright 2025 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
import torch
from torch import nn
from ...configuration_utils import PretrainedConfig
from ...utils import logging
from ...utils.backbone_utils import verify_backbone_config_arguments
from ..auto import CONFIG_MAPPING
from ..auto.modeling_auto import AutoModel
from ..grounding_dino.configuration_grounding_dino import GroundingDinoConfig
from ..grounding_dino.modeling_grounding_dino import (
GroundingDinoContrastiveEmbedding,
GroundingDinoConvEncoder,
GroundingDinoConvModel,
GroundingDinoDecoder,
GroundingDinoEncoder,
GroundingDinoForObjectDetection,
GroundingDinoMLPPredictionHead,
GroundingDinoModel,
GroundingDinoPreTrainedModel,
build_position_encoding,
)
logger = logging.get_logger(__name__)
class MMGroundingDinoConfig(GroundingDinoConfig, PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`MMGroundingDinoModel`]. It is used to instantiate a
MM Grounding DINO model according to the specified arguments, defining the model architecture. Instantiating a
configuration with the defaults will yield a similar configuration to that of the MM Grounding DINO tiny architecture
[openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg_v3det](https://huggingface.co/openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg_v3det).
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
backbone_config (`PretrainedConfig` or `dict`, *optional*, defaults to `ResNetConfig()`):
The configuration of the backbone model.
backbone (`str`, *optional*):
Name of backbone to use when `backbone_config` is `None`. If `use_pretrained_backbone` is `True`, this
will load the corresponding pretrained weights from the timm or transformers library. If `use_pretrained_backbone`
is `False`, this loads the backbone's config and uses that to initialize the backbone with random weights.
use_pretrained_backbone (`bool`, *optional*, defaults to `False`):
Whether to use pretrained weights for the backbone.
use_timm_backbone (`bool`, *optional*, defaults to `False`):
Whether to load `backbone` from the timm library. If `False`, the backbone is loaded from the transformers
library.
backbone_kwargs (`dict`, *optional*):
Keyword arguments to be passed to AutoBackbone when loading from a checkpoint
e.g. `{'out_indices': (0, 1, 2, 3)}`. Cannot be specified if `backbone_config` is set.
text_config (`Union[AutoConfig, dict]`, *optional*, defaults to `BertConfig`):
The config object or dictionary of the text backbone.
num_queries (`int`, *optional*, defaults to 900):
Number of object queries, i.e. detection slots. This is the maximal number of objects
[`MMGroundingDinoModel`] can detect in a single image.
encoder_layers (`int`, *optional*, defaults to 6):
Number of encoder layers.
encoder_ffn_dim (`int`, *optional*, defaults to 2048):
Dimension of the "intermediate" (often named feed-forward) layer in decoder.
encoder_attention_heads (`int`, *optional*, defaults to 8):
Number of attention heads for each attention layer in the Transformer encoder.
decoder_layers (`int`, *optional*, defaults to 6):
Number of decoder layers.
decoder_ffn_dim (`int`, *optional*, defaults to 2048):
Dimension of the "intermediate" (often named feed-forward) layer in decoder.
decoder_attention_heads (`int`, *optional*, defaults to 8):
Number of attention heads for each attention layer in the Transformer decoder.
is_encoder_decoder (`bool`, *optional*, defaults to `True`):
Whether the model is used as an encoder/decoder or not.
activation_function (`str` or `function`, *optional*, defaults to `"relu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
`"relu"`, `"silu"` and `"gelu_new"` are supported.
d_model (`int`, *optional*, defaults to 256):
Dimension of the layers.
dropout (`float`, *optional*, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (`float`, *optional*, defaults to 0.0):
The dropout ratio for the attention probabilities.
activation_dropout (`float`, *optional*, defaults to 0.0):
The dropout ratio for activations inside the fully connected layer.
auxiliary_loss (`bool`, *optional*, defaults to `False`):
Whether auxiliary decoding losses (loss at each decoder layer) are to be used.
position_embedding_type (`str`, *optional*, defaults to `"sine"`):
Type of position embeddings to be used on top of the image features. One of `"sine"` or `"learned"`.
num_feature_levels (`int`, *optional*, defaults to 4):
The number of input feature levels.
encoder_n_points (`int`, *optional*, defaults to 4):
The number of sampled keys in each feature level for each attention head in the encoder.
decoder_n_points (`int`, *optional*, defaults to 4):
The number of sampled keys in each feature level for each attention head in the decoder.
two_stage (`bool`, *optional*, defaults to `True`):
Whether to apply a two-stage deformable DETR, where the region proposals are also generated by a variant of
Grounding DINO, which are further fed into the decoder for iterative bounding box refinement.
class_cost (`float`, *optional*, defaults to 1.0):
Relative weight of the classification error in the Hungarian matching cost.
bbox_cost (`float`, *optional*, defaults to 5.0):
Relative weight of the L1 error of the bounding box coordinates in the Hungarian matching cost.
giou_cost (`float`, *optional*, defaults to 2.0):
Relative weight of the generalized IoU loss of the bounding box in the Hungarian matching cost.
bbox_loss_coefficient (`float`, *optional*, defaults to 5.0):
Relative weight of the L1 bounding box loss in the object detection loss.
giou_loss_coefficient (`float`, *optional*, defaults to 2.0):
Relative weight of the generalized IoU loss in the object detection loss.
focal_alpha (`float`, *optional*, defaults to 0.25):
Alpha parameter in the focal loss.
disable_custom_kernels (`bool`, *optional*, defaults to `False`):
Disable the use of custom CUDA and CPU kernels. This option is necessary for the ONNX export, as custom
kernels are not supported by PyTorch ONNX export.
max_text_len (`int`, *optional*, defaults to 256):
The maximum length of the text input.
text_enhancer_dropout (`float`, *optional*, defaults to 0.0):
The dropout ratio for the text enhancer.
fusion_droppath (`float`, *optional*, defaults to 0.1):
The droppath ratio for the fusion module.
fusion_dropout (`float`, *optional*, defaults to 0.0):
The dropout ratio for the fusion module.
embedding_init_target (`bool`, *optional*, defaults to `True`):
Whether to initialize the target with Embedding weights.
query_dim (`int`, *optional*, defaults to 4):
The dimension of the query vector.
positional_embedding_temperature (`float`, *optional*, defaults to 20):
The temperature for Sine Positional Embedding that is used together with vision backbone.
init_std (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (`float`, *optional*, defaults to 1e-05):
The epsilon used by the layer normalization layers.
Examples:
```python
>>> from transformers import MMGroundingDinoConfig, MMGroundingDinoModel
>>> # Initializing a MM Grounding DINO configuration
>>> configuration = MMGroundingDinoConfig()
>>> # Initializing a model (with random weights) from the configuration
>>> model = MMGroundingDinoModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = "mm-grounding-dino"
def __init__(
self,
backbone_config=None,
backbone=None,
use_pretrained_backbone=False,
use_timm_backbone=False,
backbone_kwargs=None,
text_config=None,
num_queries=900,
encoder_layers=6,
encoder_ffn_dim=2048,
encoder_attention_heads=8,
decoder_layers=6,
decoder_ffn_dim=2048,
decoder_attention_heads=8,
is_encoder_decoder=True,
activation_function="relu",
d_model=256,
dropout=0.1,
attention_dropout=0.0,
activation_dropout=0.0,
auxiliary_loss=False,
position_embedding_type="sine",
num_feature_levels=4,
encoder_n_points=4,
decoder_n_points=4,
two_stage=True,
class_cost=1.0,
bbox_cost=5.0,
giou_cost=2.0,
bbox_loss_coefficient=5.0,
giou_loss_coefficient=2.0,
focal_alpha=0.25,
disable_custom_kernels=False,
# other parameters
max_text_len=256,
text_enhancer_dropout=0.0,
fusion_droppath=0.1,
fusion_dropout=0.0,
embedding_init_target=True,
query_dim=4,
positional_embedding_temperature=20,
init_std=0.02,
layer_norm_eps=1e-5,
**kwargs,
):
PretrainedConfig.__init__(is_encoder_decoder=is_encoder_decoder, **kwargs)
if backbone_config is None and backbone is None:
logger.info("`backbone_config` is `None`. Initializing the config with the default `Swin` backbone.")
backbone_config = CONFIG_MAPPING["swin"](
window_size=7,
image_size=224,
embed_dim=96,
depths=[2, 2, 6, 2],
num_heads=[3, 6, 12, 24],
out_indices=[2, 3, 4],
)
elif isinstance(backbone_config, dict):
backbone_model_type = backbone_config.pop("model_type")
config_class = CONFIG_MAPPING[backbone_model_type]
backbone_config = config_class.from_dict(backbone_config)
verify_backbone_config_arguments(
use_timm_backbone=use_timm_backbone,
use_pretrained_backbone=use_pretrained_backbone,
backbone=backbone,
backbone_config=backbone_config,
backbone_kwargs=backbone_kwargs,
)
if text_config is None:
text_config = {}
logger.info("text_config is None. Initializing the text config with default values (`BertConfig`).")
self.backbone_config = backbone_config
self.backbone = backbone
self.use_pretrained_backbone = use_pretrained_backbone
self.use_timm_backbone = use_timm_backbone
self.backbone_kwargs = backbone_kwargs
self.num_queries = num_queries
self.d_model = d_model
self.encoder_ffn_dim = encoder_ffn_dim
self.encoder_layers = encoder_layers
self.encoder_attention_heads = encoder_attention_heads
self.decoder_ffn_dim = decoder_ffn_dim
self.decoder_layers = decoder_layers
self.decoder_attention_heads = decoder_attention_heads
self.dropout = dropout
self.attention_dropout = attention_dropout
self.activation_dropout = activation_dropout
self.activation_function = activation_function
self.auxiliary_loss = auxiliary_loss
self.position_embedding_type = position_embedding_type
# deformable attributes
self.num_feature_levels = num_feature_levels
self.encoder_n_points = encoder_n_points
self.decoder_n_points = decoder_n_points
self.two_stage = two_stage
# Hungarian matcher
self.class_cost = class_cost
self.bbox_cost = bbox_cost
self.giou_cost = giou_cost
# Loss coefficients
self.bbox_loss_coefficient = bbox_loss_coefficient
self.giou_loss_coefficient = giou_loss_coefficient
self.focal_alpha = focal_alpha
self.disable_custom_kernels = disable_custom_kernels
# Text backbone
if isinstance(text_config, dict):
text_config["model_type"] = text_config["model_type"] if "model_type" in text_config else "bert"
text_config = CONFIG_MAPPING[text_config["model_type"]](**text_config)
elif text_config is None:
text_config = CONFIG_MAPPING["bert"]()
self.text_config = text_config
self.max_text_len = max_text_len
# Text Enhancer
self.text_enhancer_dropout = text_enhancer_dropout
# Fusion
self.fusion_droppath = fusion_droppath
self.fusion_dropout = fusion_dropout
# Others
self.embedding_init_target = embedding_init_target
self.query_dim = query_dim
self.positional_embedding_temperature = positional_embedding_temperature
self.init_std = init_std
self.layer_norm_eps = layer_norm_eps
class MMGroundingDinoContrastiveEmbedding(GroundingDinoContrastiveEmbedding):
def __init__(self, config):
super().__init__(config)
self.bias = nn.Parameter(torch.tensor(0.0))
def forward(
self,
vision_hidden_state: torch.FloatTensor,
text_hidden_state: torch.FloatTensor,
text_token_mask: torch.BoolTensor,
) -> torch.FloatTensor:
res = vision_hidden_state @ text_hidden_state.transpose(-1, -2)
res = res / math.sqrt(vision_hidden_state.shape[-1])
res = res + self.bias
res.masked_fill_(~text_token_mask[:, None, :], float("-inf"))
# padding to max_text_len
new_res = torch.full((*res.shape[:-1], self.max_text_len), float("-inf"), device=res.device)
new_res[..., : res.shape[-1]] = res
return new_res
class MMGroundingDinoPreTrainedModel(GroundingDinoPreTrainedModel):
def _init_weights(self, module):
super()._init_weights(module)
if isinstance(module, MMGroundingDinoContrastiveEmbedding):
nn.init.constant_(module.bias, -math.log((1 - 0.01) / 0.01))
class MMGroundingDinoConvEncoder(GroundingDinoConvEncoder):
pass
class MMGroundingDinoConvModel(GroundingDinoConvModel):
pass
class MMGroundingDinoEncoder(GroundingDinoEncoder):
pass
class MMGroundingDinoDecoder(GroundingDinoDecoder):
pass
class MMGroundingDinoModel(GroundingDinoModel, MMGroundingDinoPreTrainedModel):
def __init__(self, config: MMGroundingDinoConfig):
MMGroundingDinoPreTrainedModel.__init__(config)
# Create backbone + positional encoding
backbone = MMGroundingDinoConvEncoder(config)
position_embeddings = build_position_encoding(config)
self.backbone = MMGroundingDinoConvModel(backbone, position_embeddings)
# Create input projection layers
num_backbone_outs = len(backbone.intermediate_channel_sizes)
input_proj_list = []
for i in range(num_backbone_outs):
in_channels = backbone.intermediate_channel_sizes[i]
input_proj_list.append(
nn.Sequential(
nn.Conv2d(in_channels, config.d_model, kernel_size=1),
nn.GroupNorm(32, config.d_model),
)
)
for _ in range(config.num_feature_levels - num_backbone_outs):
input_proj_list.append(
nn.Sequential(
nn.Conv2d(in_channels, config.d_model, kernel_size=3, stride=2, padding=1),
nn.GroupNorm(32, config.d_model),
)
)
in_channels = config.d_model
self.input_proj_vision = nn.ModuleList(input_proj_list)
# Create text backbone
self.text_backbone = AutoModel.from_config(config.text_config, add_pooling_layer=False)
self.text_projection = nn.Linear(config.text_config.hidden_size, config.d_model)
if config.embedding_init_target or not config.two_stage:
self.query_position_embeddings = nn.Embedding(config.num_queries, config.d_model)
self.encoder = MMGroundingDinoEncoder(config)
self.decoder = MMGroundingDinoDecoder(config)
self.level_embed = nn.Parameter(torch.Tensor(config.num_feature_levels, config.d_model))
self.enc_output = nn.Linear(config.d_model, config.d_model)
self.enc_output_norm = nn.LayerNorm(config.d_model, config.layer_norm_eps)
self.encoder_output_bbox_embed = MMGroundingDinoMLPPredictionHead(
input_dim=config.d_model, hidden_dim=config.d_model, output_dim=4, num_layers=3
)
self.encoder_output_class_embed = MMGroundingDinoContrastiveEmbedding(config)
self.post_init()
class MMGroundingDinoMLPPredictionHead(GroundingDinoMLPPredictionHead):
pass
class MMGroundingDinoForObjectDetection(GroundingDinoForObjectDetection, MMGroundingDinoPreTrainedModel):
_tied_weights_keys = [
r"bbox_embed\.[1-9]\d*",
r"model\.decoder\.bbox_embed\.[0-9]\d*",
r"class_embed\.[1-9]\d*",
r"model\.decoder\.class_embed\.[0-9]\d*",
]
def __init__(self, config: MMGroundingDinoConfig):
MMGroundingDinoPreTrainedModel.__init__(config)
self.model = MMGroundingDinoModel(config)
self.class_embed = nn.ModuleList(
[MMGroundingDinoContrastiveEmbedding(config) for _ in range(config.decoder_layers)]
)
self.bbox_embed = nn.ModuleList(
[
MMGroundingDinoMLPPredictionHead(
input_dim=config.d_model, hidden_dim=config.d_model, output_dim=4, num_layers=3
)
for _ in range(config.decoder_layers)
]
)
# hack for box-refinement
self.model.decoder.bbox_embed = self.bbox_embed
# hack implementation for two-stage
self.model.decoder.class_embed = self.class_embed
# Initialize weights and apply final processing
self.post_init()
__all__ = [
"MMGroundingDinoConfig",
"MMGroundingDinoForObjectDetection",
"MMGroundingDinoModel",
"MMGroundingDinoPreTrainedModel",
]

View File

@@ -818,7 +818,9 @@ class GroundingDinoModelIntegrationTests(unittest.TestCase):
prompt = ". ".join(id2label.values()) + "."
text_inputs = tokenizer([prompt, prompt], return_tensors="pt")
image_inputs = image_processor(images=ds["image"], annotations=ds["annotations"], return_tensors="pt")
image_inputs = image_processor(
images=list(ds["image"]), annotations=list(ds["annotations"]), return_tensors="pt"
)
# Passing auxiliary_loss=True to compare with the expected loss
model = GroundingDinoForObjectDetection.from_pretrained(

View File

@@ -0,0 +1,871 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Testing suite for the PyTorch MM Grounding DINO model."""
import collections
import inspect
import math
import re
import unittest
from datasets import load_dataset
from transformers import (
MMGroundingDinoConfig,
SwinConfig,
is_torch_available,
is_vision_available,
)
from transformers.file_utils import cached_property
from transformers.testing_utils import (
is_flaky,
require_timm,
require_torch,
require_torch_accelerator,
require_vision,
slow,
torch_device,
)
from ...test_configuration_common import ConfigTester
from ...test_modeling_common import ModelTesterMixin, _config_zero_init, floats_tensor
from ...test_pipeline_mixin import PipelineTesterMixin
if is_torch_available():
import torch
from transformers import MMGroundingDinoConfig, MMGroundingDinoForObjectDetection, MMGroundingDinoModel
from transformers.pytorch_utils import id_tensor_storage
if is_vision_available():
from PIL import Image
from transformers import AutoProcessor
# Copied from tests.models.grounding_dino.test_modeling_grounding_dino.generate_fake_bounding_boxes
def generate_fake_bounding_boxes(n_boxes):
"""Generate bounding boxes in the format (center_x, center_y, width, height)"""
# Validate the input
if not isinstance(n_boxes, int):
raise TypeError("n_boxes must be an integer")
if n_boxes <= 0:
raise ValueError("n_boxes must be a positive integer")
# Generate random bounding boxes in the format (center_x, center_y, width, height)
bounding_boxes = torch.rand((n_boxes, 4))
# Extract the components
center_x = bounding_boxes[:, 0]
center_y = bounding_boxes[:, 1]
width = bounding_boxes[:, 2]
height = bounding_boxes[:, 3]
# Ensure width and height do not exceed bounds
width = torch.min(width, torch.tensor(1.0))
height = torch.min(height, torch.tensor(1.0))
# Ensure the bounding box stays within the normalized space
center_x = torch.where(center_x - width / 2 < 0, width / 2, center_x)
center_x = torch.where(center_x + width / 2 > 1, 1 - width / 2, center_x)
center_y = torch.where(center_y - height / 2 < 0, height / 2, center_y)
center_y = torch.where(center_y + height / 2 > 1, 1 - height / 2, center_y)
# Combine back into bounding boxes
bounding_boxes = torch.stack([center_x, center_y, width, height], dim=1)
return bounding_boxes
# Copied from tests.models.grounding_dino.test_modeling_grounding_dino.GroundingDinoModelTester with GroundingDino->MMGroundingDino
class MMGroundingDinoModelTester:
def __init__(
self,
parent,
batch_size=4,
is_training=True,
use_labels=True,
hidden_size=32,
num_hidden_layers=2,
num_attention_heads=4,
intermediate_size=4,
hidden_act="gelu",
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
num_queries=2,
num_channels=3,
image_size=98,
n_targets=8,
num_labels=2,
num_feature_levels=4,
encoder_n_points=2,
decoder_n_points=6,
max_text_len=7,
):
self.parent = parent
self.batch_size = batch_size
self.is_training = is_training
self.use_labels = use_labels
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.intermediate_size = intermediate_size
self.hidden_act = hidden_act
self.hidden_dropout_prob = hidden_dropout_prob
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.num_queries = num_queries
self.num_channels = num_channels
self.image_size = image_size
self.n_targets = n_targets
self.num_labels = num_labels
self.num_feature_levels = num_feature_levels
self.encoder_n_points = encoder_n_points
self.decoder_n_points = decoder_n_points
self.max_text_len = max_text_len
# we also set the expected seq length for both encoder and decoder
self.encoder_seq_length_vision = (
math.ceil(self.image_size / 8) ** 2
+ math.ceil(self.image_size / 16) ** 2
+ math.ceil(self.image_size / 32) ** 2
+ math.ceil(self.image_size / 64) ** 2
)
self.encoder_seq_length_text = self.max_text_len
self.decoder_seq_length = self.num_queries
def prepare_config_and_inputs(self):
pixel_values = floats_tensor([self.batch_size, self.num_channels, self.image_size, self.image_size])
pixel_mask = torch.ones([self.batch_size, self.image_size, self.image_size], device=torch_device)
# When using `MMGroundingDino` the text input template is '{label1}. {label2}. {label3. ... {labelN}.'
# Therefore to avoid errors when running tests with `labels` `input_ids` have to follow this structure.
# Otherwise when running `build_label_maps` it will throw an error when trying to split the input_ids into segments.
input_ids = torch.tensor([101, 3869, 1012, 11420, 3869, 1012, 102], device=torch_device)
input_ids = input_ids.unsqueeze(0).expand(self.batch_size, -1)
labels = None
if self.use_labels:
# labels is a list of Dict (each Dict being the labels for a given example in the batch)
labels = []
for i in range(self.batch_size):
target = {}
target["class_labels"] = torch.randint(
high=self.num_labels, size=(self.n_targets,), device=torch_device
)
target["boxes"] = generate_fake_bounding_boxes(self.n_targets).to(torch_device)
target["masks"] = torch.rand(self.n_targets, self.image_size, self.image_size, device=torch_device)
labels.append(target)
config = self.get_config()
return config, pixel_values, pixel_mask, input_ids, labels
def get_config(self):
swin_config = SwinConfig(
window_size=7,
embed_dim=8,
depths=[1, 1, 1, 1],
num_heads=[1, 1, 1, 1],
image_size=self.image_size,
out_features=["stage2", "stage3", "stage4"],
out_indices=[2, 3, 4],
)
text_backbone = {
"hidden_size": 8,
"num_hidden_layers": 2,
"num_attention_heads": 2,
"intermediate_size": 8,
"max_position_embeddings": 8,
"model_type": "bert",
}
return MMGroundingDinoConfig(
d_model=self.hidden_size,
encoder_layers=self.num_hidden_layers,
decoder_layers=self.num_hidden_layers,
encoder_attention_heads=self.num_attention_heads,
decoder_attention_heads=self.num_attention_heads,
encoder_ffn_dim=self.intermediate_size,
decoder_ffn_dim=self.intermediate_size,
dropout=self.hidden_dropout_prob,
attention_dropout=self.attention_probs_dropout_prob,
num_queries=self.num_queries,
num_labels=self.num_labels,
num_feature_levels=self.num_feature_levels,
encoder_n_points=self.encoder_n_points,
decoder_n_points=self.decoder_n_points,
use_timm_backbone=False,
backbone_config=swin_config,
max_text_len=self.max_text_len,
text_config=text_backbone,
)
def prepare_config_and_inputs_for_common(self):
config, pixel_values, pixel_mask, input_ids, labels = self.prepare_config_and_inputs()
inputs_dict = {"pixel_values": pixel_values, "pixel_mask": pixel_mask, "input_ids": input_ids}
return config, inputs_dict
def create_and_check_model(self, config, pixel_values, pixel_mask, input_ids, labels):
model = MMGroundingDinoModel(config=config)
model.to(torch_device)
model.eval()
result = model(pixel_values=pixel_values, pixel_mask=pixel_mask, input_ids=input_ids)
self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.num_queries, self.hidden_size))
def create_and_check_object_detection_head_model(self, config, pixel_values, pixel_mask, input_ids, labels):
model = MMGroundingDinoForObjectDetection(config=config)
model.to(torch_device)
model.eval()
result = model(pixel_values=pixel_values, pixel_mask=pixel_mask, input_ids=input_ids)
self.parent.assertEqual(result.logits.shape, (self.batch_size, self.num_queries, config.max_text_len))
self.parent.assertEqual(result.pred_boxes.shape, (self.batch_size, self.num_queries, 4))
result = model(pixel_values=pixel_values, pixel_mask=pixel_mask, input_ids=input_ids, labels=labels)
self.parent.assertEqual(result.loss.shape, ())
self.parent.assertEqual(result.logits.shape, (self.batch_size, self.num_queries, config.max_text_len))
self.parent.assertEqual(result.pred_boxes.shape, (self.batch_size, self.num_queries, 4))
@require_torch
# Copied from tests.models.grounding_dino.test_modeling_grounding_dino.GroundingDinoModelTest with Grounding->MMGrounding
class MMGroundingDinoModelTest(ModelTesterMixin, PipelineTesterMixin, unittest.TestCase):
all_model_classes = (MMGroundingDinoModel, MMGroundingDinoForObjectDetection) if is_torch_available() else ()
is_encoder_decoder = True
test_torchscript = False
test_pruning = False
test_head_masking = False
test_missing_keys = False
pipeline_model_mapping = (
{
"image-feature-extraction": MMGroundingDinoModel,
"zero-shot-object-detection": MMGroundingDinoForObjectDetection,
}
if is_torch_available()
else {}
)
# special case for head models
def _prepare_for_class(self, inputs_dict, model_class, return_labels=False):
inputs_dict = super()._prepare_for_class(inputs_dict, model_class, return_labels=return_labels)
if return_labels:
if model_class.__name__ == "MMGroundingDinoForObjectDetection":
labels = []
for i in range(self.model_tester.batch_size):
target = {}
target["class_labels"] = torch.ones(
size=(self.model_tester.n_targets,), device=torch_device, dtype=torch.long
)
target["boxes"] = torch.ones(
self.model_tester.n_targets, 4, device=torch_device, dtype=torch.float
)
target["masks"] = torch.ones(
self.model_tester.n_targets,
self.model_tester.image_size,
self.model_tester.image_size,
device=torch_device,
dtype=torch.float,
)
labels.append(target)
inputs_dict["labels"] = labels
return inputs_dict
def setUp(self):
self.model_tester = MMGroundingDinoModelTester(self)
self.config_tester = ConfigTester(
self,
config_class=MMGroundingDinoConfig,
has_text_modality=False,
common_properties=["d_model", "encoder_attention_heads", "decoder_attention_heads"],
)
def test_config(self):
self.config_tester.run_common_tests()
def test_model(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_model(*config_and_inputs)
def test_object_detection_head_model(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_object_detection_head_model(*config_and_inputs)
@unittest.skip(reason="MMGrounding DINO does not use inputs_embeds")
def test_inputs_embeds(self):
pass
@unittest.skip(reason="MMGrounding DINO does not have a get_input_embeddings method")
def test_model_get_set_embeddings(self):
pass
@unittest.skip(reason="MMGrounding DINO does not use token embeddings")
def test_resize_tokens_embeddings(self):
pass
@unittest.skip(reason="Feed forward chunking is not implemented")
def test_feed_forward_chunking(self):
pass
def test_attention_outputs(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
config.return_dict = True
for model_class in self.all_model_classes:
inputs_dict["output_attentions"] = True
inputs_dict["output_hidden_states"] = False
config.return_dict = True
model = model_class._from_config(config, attn_implementation="eager")
config = model.config
model.to(torch_device)
model.eval()
with torch.no_grad():
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
attentions = outputs.encoder_attentions[-1]
self.assertEqual(len(attentions), self.model_tester.num_hidden_layers)
# check that output_attentions also work using config
del inputs_dict["output_attentions"]
config.output_attentions = True
model = model_class(config)
model.to(torch_device)
model.eval()
with torch.no_grad():
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
attentions = outputs.encoder_attentions[-1]
self.assertEqual(len(attentions), self.model_tester.num_hidden_layers)
self.assertListEqual(
list(attentions[0].shape[-3:]),
[
self.model_tester.num_attention_heads,
self.model_tester.num_feature_levels,
self.model_tester.encoder_n_points,
],
)
out_len = len(outputs)
correct_outlen = 12
# loss is at first position
if "labels" in inputs_dict:
correct_outlen += 1 # loss is added to beginning
# Object Detection model returns pred_logits and pred_boxes and input_ids
if model_class.__name__ == "MMGroundingDinoForObjectDetection":
correct_outlen += 3
self.assertEqual(out_len, correct_outlen)
# decoder attentions
decoder_attentions = outputs.decoder_attentions[0]
self.assertIsInstance(decoder_attentions, (list, tuple))
self.assertEqual(len(decoder_attentions), self.model_tester.num_hidden_layers)
self.assertListEqual(
list(decoder_attentions[0].shape[-3:]),
[self.model_tester.num_attention_heads, self.model_tester.num_queries, self.model_tester.num_queries],
)
# cross attentions
cross_attentions = outputs.decoder_attentions[-1]
self.assertIsInstance(cross_attentions, (list, tuple))
self.assertEqual(len(cross_attentions), self.model_tester.num_hidden_layers)
self.assertListEqual(
list(cross_attentions[0].shape[-3:]),
[
self.model_tester.num_attention_heads,
self.model_tester.num_feature_levels,
self.model_tester.decoder_n_points,
],
)
# Check attention is always last and order is fine
inputs_dict["output_attentions"] = True
inputs_dict["output_hidden_states"] = True
model = model_class(config)
model.to(torch_device)
model.eval()
with torch.no_grad():
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
self.assertEqual(out_len + 3, len(outputs))
self_attentions = outputs.encoder_attentions[-1]
self.assertEqual(len(self_attentions), self.model_tester.num_hidden_layers)
self.assertListEqual(
list(self_attentions[0].shape[-3:]),
[
self.model_tester.num_attention_heads,
self.model_tester.num_feature_levels,
self.model_tester.encoder_n_points,
],
)
# overwrite since hidden_states are called encoder_text_hidden_states
def test_hidden_states_output(self):
def check_hidden_states_output(inputs_dict, config, model_class):
model = model_class(config)
model.to(torch_device)
model.eval()
with torch.no_grad():
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
hidden_states = outputs.encoder_vision_hidden_states
expected_num_layers = getattr(
self.model_tester, "expected_num_hidden_layers", self.model_tester.num_hidden_layers + 1
)
self.assertEqual(len(hidden_states), expected_num_layers)
seq_len = self.model_tester.encoder_seq_length_vision
self.assertListEqual(
list(hidden_states[0].shape[-2:]),
[seq_len, self.model_tester.hidden_size],
)
hidden_states = outputs.encoder_text_hidden_states
self.assertEqual(len(hidden_states), expected_num_layers)
seq_len = self.model_tester.encoder_seq_length_text
self.assertListEqual(
list(hidden_states[0].shape[-2:]),
[seq_len, self.model_tester.hidden_size],
)
hidden_states = outputs.decoder_hidden_states
self.assertIsInstance(hidden_states, (list, tuple))
self.assertEqual(len(hidden_states), expected_num_layers)
seq_len = getattr(self.model_tester, "seq_length", None)
decoder_seq_length = getattr(self.model_tester, "decoder_seq_length", seq_len)
self.assertListEqual(
list(hidden_states[0].shape[-2:]),
[decoder_seq_length, self.model_tester.hidden_size],
)
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
for model_class in self.all_model_classes:
inputs_dict["output_hidden_states"] = True
check_hidden_states_output(inputs_dict, config, model_class)
# check that output_hidden_states also work using config
del inputs_dict["output_hidden_states"]
config.output_hidden_states = True
check_hidden_states_output(inputs_dict, config, model_class)
# removed retain_grad and grad on decoder_hidden_states, as queries don't require grad
def test_retain_grad_hidden_states_attentions(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
config.output_hidden_states = True
config.output_attentions = True
# no need to test all models as different heads yield the same functionality
model_class = self.all_model_classes[0]
model = model_class(config)
model.to(torch_device)
inputs = self._prepare_for_class(inputs_dict, model_class)
outputs = model(**inputs)
output = outputs[0]
encoder_hidden_states = outputs.encoder_vision_hidden_states[0]
encoder_attentions = outputs.encoder_attentions[0][0]
encoder_hidden_states.retain_grad()
encoder_attentions.retain_grad()
cross_attentions = outputs.decoder_attentions[-1][0]
cross_attentions.retain_grad()
output.flatten()[0].backward(retain_graph=True)
self.assertIsNotNone(encoder_hidden_states.grad)
self.assertIsNotNone(encoder_attentions.grad)
self.assertIsNotNone(cross_attentions.grad)
def test_forward_signature(self):
config, _ = self.model_tester.prepare_config_and_inputs_for_common()
for model_class in self.all_model_classes:
model = model_class(config)
signature = inspect.signature(model.forward)
# signature.parameters is an OrderedDict => so arg_names order is deterministic
arg_names = [*signature.parameters.keys()]
expected_arg_names = ["pixel_values", "input_ids"]
self.assertListEqual(arg_names[: len(expected_arg_names)], expected_arg_names)
def test_different_timm_backbone(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
# let's pick a random timm backbone
config.backbone = "tf_mobilenetv3_small_075"
config.use_timm_backbone = True
config.backbone_config = None
config.backbone_kwargs = {"in_chans": 3, "out_indices": (2, 3, 4)}
for model_class in self.all_model_classes:
model = model_class(config)
model.to(torch_device)
model.eval()
with torch.no_grad():
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
if model_class.__name__ == "MMGroundingDinoForObjectDetection":
expected_shape = (
self.model_tester.batch_size,
self.model_tester.num_queries,
config.max_text_len,
)
self.assertEqual(outputs.logits.shape, expected_shape)
self.assertTrue(outputs)
@require_timm
def test_hf_backbone(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
# Load a pretrained HF checkpoint as backbone
config.backbone = "microsoft/resnet-18"
config.backbone_config = None
config.use_timm_backbone = False
config.use_pretrained_backbone = True
config.backbone_kwargs = {"out_indices": [2, 3, 4]}
for model_class in self.all_model_classes:
model = model_class(config)
model.to(torch_device)
model.eval()
with torch.no_grad():
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
if model_class.__name__ == "MMGroundingDinoForObjectDetection":
expected_shape = (
self.model_tester.batch_size,
self.model_tester.num_queries,
config.max_text_len,
)
self.assertEqual(outputs.logits.shape, expected_shape)
self.assertTrue(outputs)
# Ignore copy
def test_initialization(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
configs_no_init = _config_zero_init(config)
for model_class in self.all_model_classes:
model = model_class(config=configs_no_init)
for name, param in model.named_parameters():
if param.requires_grad:
if (
"level_embed" in name
or "sampling_offsets.bias" in name
or "text_param" in name
or "vision_param" in name
or "value_proj" in name
or "output_proj" in name
or "reference_points" in name
or "vision_proj" in name
or "text_proj" in name
or ("class_embed" in name and "bias" in name)
):
continue
self.assertIn(
((param.data.mean() * 1e9).round() / 1e9).item(),
[0.0, 1.0],
msg=f"Parameter {name} of model {model_class} seems not properly initialized",
)
# Copied from tests.models.deformable_detr.test_modeling_deformable_detr.DeformableDetrModelTest.test_two_stage_training with DeformableDetr->MMGroundingDino
def test_two_stage_training(self):
model_class = MMGroundingDinoForObjectDetection
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
config.return_dict = True
config.two_stage = True
config.auxiliary_loss = True
config.with_box_refine = True
model = model_class(config)
model.to(torch_device)
model.train()
inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
loss = model(**inputs).loss
loss.backward()
def test_tied_weights_keys(self):
config, _ = self.model_tester.prepare_config_and_inputs_for_common()
config.tie_word_embeddings = True
for model_class in self.all_model_classes:
model_tied = model_class(config)
ptrs = collections.defaultdict(list)
for name, tensor in model_tied.state_dict().items():
ptrs[id_tensor_storage(tensor)].append(name)
# These are all the pointers of shared tensors.
tied_params = [names for _, names in ptrs.items() if len(names) > 1]
tied_weight_keys = model_tied._tied_weights_keys if model_tied._tied_weights_keys is not None else []
# Detect we get a hit for each key
for key in tied_weight_keys:
if not any(re.search(key, p) for group in tied_params for p in group):
raise ValueError(f"{key} is not a tied weight key for {model_class}.")
# Removed tied weights found from tied params -> there should only be one left after
for key in tied_weight_keys:
for i in range(len(tied_params)):
tied_params[i] = [p for p in tied_params[i] if re.search(key, p) is None]
# MMGroundingDino when sharing weights also uses the shared ones in MMGroundingDinoDecoder
# Therefore, differently from DeformableDetr, we expect the group lens to be 2
# one for self.bbox_embed in MMGroundingDinoForObejectDetection and another one
# in the decoder
tied_params = [group for group in tied_params if len(group) > 2]
self.assertListEqual(
tied_params,
[],
f"Missing `_tied_weights_keys` for {model_class}: add all of {tied_params} except one.",
)
# We will verify our results on an image of cute cats
def prepare_img():
image = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
return image
def prepare_text():
text = "a cat."
return text
@require_timm
@require_vision
@slow
class MMGroundingDinoModelIntegrationTests(unittest.TestCase):
@cached_property
def default_processor(self):
return (
AutoProcessor.from_pretrained("openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg_v3det")
if is_vision_available()
else None
)
def test_inference_object_detection_head(self):
model = MMGroundingDinoForObjectDetection.from_pretrained(
"openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg_v3det"
).to(torch_device)
processor = self.default_processor
image = prepare_img()
text = prepare_text()
encoding = processor(images=image, text=text, return_tensors="pt").to(torch_device)
with torch.no_grad():
outputs = model(**encoding)
expected_shape_logits = torch.Size((1, model.config.num_queries, model.config.d_model))
self.assertEqual(outputs.logits.shape, expected_shape_logits)
expected_boxes = torch.tensor(
[[0.7666, 0.4142, 0.4590], [0.2557, 0.5480, 0.4812], [0.5049, 0.5133, 0.9767]]
).to(torch_device)
expected_logits = torch.tensor(
[[-5.1160, -0.2143, -0.2089], [-5.0592, -0.4269, -0.4169], [-4.9087, -1.7608, -1.7372]]
).to(torch_device)
torch.testing.assert_close(outputs.logits[0, :3, :3], expected_logits, rtol=1e-3, atol=1e-3)
expected_shape_boxes = torch.Size((1, model.config.num_queries, 4))
self.assertEqual(outputs.pred_boxes.shape, expected_shape_boxes)
torch.testing.assert_close(outputs.pred_boxes[0, :3, :3], expected_boxes, rtol=1e-4, atol=1e-4)
# verify postprocessing
results = processor.image_processor.post_process_object_detection(
outputs, threshold=0.35, target_sizes=[(image.height, image.width)]
)[0]
expected_scores = torch.tensor([0.4480, 0.3973]).to(torch_device)
expected_slice_boxes = torch.tensor([343.7321, 23.8182, 637.5044, 373.8593]).to(torch_device)
self.assertEqual(len(results["scores"]), 2)
torch.testing.assert_close(results["scores"], expected_scores, rtol=1e-3, atol=1e-3)
torch.testing.assert_close(results["boxes"][0, :], expected_slice_boxes, rtol=1e-2, atol=1e-2)
# verify grounded postprocessing
expected_labels = ["a cat", "a cat"]
results = processor.post_process_grounded_object_detection(
outputs=outputs,
input_ids=encoding.input_ids,
threshold=0.35,
text_threshold=0.3,
target_sizes=[(image.height, image.width)],
)[0]
torch.testing.assert_close(results["scores"], expected_scores, rtol=1e-3, atol=1e-3)
torch.testing.assert_close(results["boxes"][0, :], expected_slice_boxes, rtol=1e-2, atol=1e-2)
self.assertListEqual(results["text_labels"], expected_labels)
@require_torch_accelerator
@is_flaky()
def test_inference_object_detection_head_equivalence_cpu_gpu(self):
processor = self.default_processor
image = prepare_img()
text = prepare_text()
encoding = processor(images=image, text=text, return_tensors="pt")
# 1. run model on CPU
model = MMGroundingDinoForObjectDetection.from_pretrained(
"openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg_v3det"
)
# HACK: the issue happens during top-k (k=900) after the encoder
# there are some flips between cpu and gpu query ordering (idxs 195<->196 and 267<->268 on my machine)
# which causes different query position embedding assingments
# which in turn significantly changes the decoder pass due to self attention
model.config.num_queries = 100
model.model.query_position_embeddings.weight.data = model.model.query_position_embeddings.weight.data[:100]
with torch.no_grad():
cpu_outputs = model(**encoding)
# 2. run model on GPU
model.to(torch_device)
encoding = encoding.to(torch_device)
with torch.no_grad():
gpu_outputs = model(**encoding)
# 3. assert equivalence
for key in cpu_outputs.keys():
torch.testing.assert_close(cpu_outputs[key], gpu_outputs[key].cpu(), rtol=1e-3, atol=1e-3)
expected_logits = torch.tensor(
[[-5.0188, -1.0069, -1.0005], [-5.1177, -1.0537, -1.0444], [-5.3986, -2.4935, -2.4716]]
)
torch.testing.assert_close(cpu_outputs.logits[0, :3, :3], expected_logits, rtol=1e-3, atol=1e-3)
# assert postprocessing
results_cpu = processor.image_processor.post_process_object_detection(
cpu_outputs, threshold=0.35, target_sizes=[(image.height, image.width)]
)[0]
result_gpu = processor.image_processor.post_process_object_detection(
gpu_outputs, threshold=0.35, target_sizes=[(image.height, image.width)]
)[0]
torch.testing.assert_close(results_cpu["scores"], result_gpu["scores"].cpu(), rtol=1e-3, atol=1e-3)
torch.testing.assert_close(results_cpu["boxes"], result_gpu["boxes"].cpu(), rtol=1e-3, atol=1e-3)
@is_flaky()
def test_cross_attention_mask(self):
model = MMGroundingDinoForObjectDetection.from_pretrained(
"openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg_v3det"
).to(torch_device)
# HACK: the issue happens during top-k (k=900) after the encoder
# there are some flips between cpu and gpu query ordering
# which causes different query position embedding assingments
# which in turn significantly changes the decoder pass due to self attention
model.config.num_queries = 100
model.model.query_position_embeddings.weight.data = model.model.query_position_embeddings.weight.data[:100]
processor = self.default_processor
image = prepare_img()
text1 = "a cat."
text2 = "a remote control."
text_batched = [text1, text2]
encoding1 = processor(images=image, text=text1, return_tensors="pt").to(torch_device)
encoding2 = processor(images=image, text=text2, return_tensors="pt").to(torch_device)
# If we batch the text and cross attention masking is working the batched result should be equal to
# The singe text result
encoding_batched = processor(
images=[image] * len(text_batched), text=text_batched, padding="longest", return_tensors="pt"
).to(torch_device)
with torch.no_grad():
outputs1 = model(**encoding1)
outputs2 = model(**encoding2)
outputs_batched = model(**encoding_batched)
torch.testing.assert_close(outputs1.logits, outputs_batched.logits[:1], rtol=1e-3, atol=1e-3)
# For some reason 12 elements are > 1e-3, but the rest are fine
self.assertTrue(torch.allclose(outputs2.logits, outputs_batched.logits[1:], atol=1.8e-3))
def test_mm_grounding_dino_loss(self):
ds = load_dataset("EduardoPacheco/aquarium-sample", split="train")
image_processor = self.default_processor.image_processor
tokenizer = self.default_processor.tokenizer
id2label = {0: "fish", 1: "jellyfish", 2: "penguins", 3: "sharks", 4: "puffins", 5: "stingrays", 6: "starfish"}
prompt = ". ".join(id2label.values()) + "."
text_inputs = tokenizer([prompt, prompt], return_tensors="pt")
image_inputs = image_processor(
images=list(ds["image"]), annotations=list(ds["annotations"]), return_tensors="pt"
)
# Passing auxiliary_loss=True to compare with the expected loss
model = MMGroundingDinoForObjectDetection.from_pretrained(
"openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg_v3det",
auxiliary_loss=True,
)
# Interested in the loss only
model.eval()
with torch.no_grad():
outputs = model(**text_inputs, **image_inputs)
# Loss differs by CPU and GPU, also this can be changed in future.
expected_loss_dict = {
"loss_ce": torch.tensor(1.1799),
"loss_bbox": torch.tensor(0.2348),
"loss_giou": torch.tensor(0.5834),
"loss_ce_0": torch.tensor(1.1199),
"loss_bbox_0": torch.tensor(0.3083),
"loss_giou_0": torch.tensor(0.6555),
"loss_ce_1": torch.tensor(1.2075),
"loss_bbox_1": torch.tensor(0.2641),
"loss_giou_1": torch.tensor(0.6073),
"loss_ce_2": torch.tensor(1.2915),
"loss_bbox_2": torch.tensor(0.2616),
"loss_giou_2": torch.tensor(0.5730),
"loss_ce_3": torch.tensor(1.0243),
"loss_bbox_3": torch.tensor(0.2799),
"loss_giou_3": torch.tensor(0.6326),
"loss_ce_4": torch.tensor(1.2019),
"loss_bbox_4": torch.tensor(0.2430),
"loss_giou_4": torch.tensor(0.5679),
"loss_ce_enc": torch.tensor(10.2381),
"loss_bbox_enc": torch.tensor(0.2886),
"loss_giou_enc": torch.tensor(0.6335),
}
expected_loss = torch.tensor(52.4340)
for key in expected_loss_dict:
self.assertTrue(torch.allclose(outputs.loss_dict[key], expected_loss_dict[key], atol=1e-3))
self.assertTrue(torch.allclose(outputs.loss, expected_loss, atol=1e-3))

View File

@@ -221,6 +221,14 @@ SPECIAL_CASES_TO_ALLOW = {
"giou_cost",
"giou_loss_coefficient",
],
"MMGroundingDinoConfig": [
"bbox_cost",
"bbox_loss_coefficient",
"class_cost",
"focal_alpha",
"giou_cost",
"giou_loss_coefficient",
],
"RTDetrConfig": [
"eos_coefficient",
"focal_loss_alpha",