Add MM Grounding DINO (#37925)

* first commit

Added a modular implementation of MM Grounding DINO, starting from the scaffold created by add-new-model-like. Added a conversion script from mmdetection to Hugging Face.

TODO: Some tests are still failing and need to be fixed.

* fixed a bug in the modular definition of MMGroundingDinoForObjectDetection where the box and class heads were not correctly assigned to the inner model

* cleaned up a hack in the conversion script

* Fixed the expected values in integration tests

However, the cross-attention masking and CPU-GPU consistency tests are still failing.

* changes for make style and quality

* add documentation

* clean up contrastive embedding

* add mm grounding dino to loss mapping

* add model link to config docstring

* hack fix for mm grounding dino consistency tests

* add special cases for unused config attr check

* add all models and update docs

* update model doc to the new style

* Use super_kwargs for modular config

* Move init to the _init_weights function

* Add copied from for tests

* fixup

* update typehints

* Fix-copies for tests

* fix-copies

* Fix init test

* fix snippets in docs

* fix consistency

* fix consistency

* update conversion script

* fix nits in readme and remove old comments from conversion script

* add license

* remove unused config args

* remove unnecessary if/else in model init

* fix quality

* Update references

* fix test

* fixup

---------

Co-authored-by: qubvel <qubvel@gmail.com>
rziga
2025-08-01 16:43:23 +02:00
committed by GitHub
parent 50145474b7
commit 3951d4ad5d
17 changed files with 4884 additions and 1 deletions


@@ -818,7 +818,9 @@ class GroundingDinoModelIntegrationTests(unittest.TestCase):
 prompt = ". ".join(id2label.values()) + "."
 text_inputs = tokenizer([prompt, prompt], return_tensors="pt")
-image_inputs = image_processor(images=ds["image"], annotations=ds["annotations"], return_tensors="pt")
+image_inputs = image_processor(
+    images=list(ds["image"]), annotations=list(ds["annotations"]), return_tensors="pt"
+)
 # Passing auxiliary_loss=True to compare with the expected loss
 model = GroundingDinoForObjectDetection.from_pretrained(