Grounding DINO Processor standardization (#34853)

* Add input ids to model output

* Add text preprocessing for processor

* Fix snippet

* Add test for equivalence

* Add type checking guard

* Fixing typehint

* Fix test for added `input_ids` in output

* Add deprecations and "text_labels" to output

* Adjust tests

* Fix test

* Update code examples

* Minor docs and code improvement

* Remove one-liner functions and rename class to CamelCase

* Update docstring

* Fixup
This commit is contained in:
Pavel Iakubovskii
2025-01-17 14:18:16 +00:00
committed by GitHub
parent 42b2857b01
commit 099d93d2e9
5 changed files with 217 additions and 80 deletions

View File

@@ -56,25 +56,26 @@ Here's how to use the model for zero-shot object detection:
>>> image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(image_url, stream=True).raw)
>>> # Check for cats and remote controls
>>> text = "a cat. a remote control."
>>> text_labels = [["a cat", "a remote control"]]
>>> inputs = processor(images=image, text=text, return_tensors="pt").to(device)
>>> inputs = processor(images=image, text=text_labels, return_tensors="pt").to(device)
>>> with torch.no_grad():
... outputs = model(**inputs)
>>> results = processor.post_process_grounded_object_detection(
... outputs,
... inputs.input_ids,
... box_threshold=0.4,
... threshold=0.4,
... text_threshold=0.3,
... target_sizes=[image.size[::-1]]
... target_sizes=[(image.height, image.width)]
... )
>>> print(results)
[{'boxes': tensor([[344.6959, 23.1090, 637.1833, 374.2751],
[ 12.2666, 51.9145, 316.8582, 472.4392],
[ 38.5742, 70.0015, 176.7838, 118.1806]], device='cuda:0'),
'labels': ['a cat', 'a cat', 'a remote control'],
'scores': tensor([0.4785, 0.4381, 0.4776], device='cuda:0')}]
>>> # Retrieve the first image result
>>> result = results[0]
>>> for box, score, text_label in zip(result["boxes"], result["scores"], result["text_labels"]):
... box = [round(x, 2) for x in box.tolist()]
... print(f"Detected {text_label} with confidence {round(score.item(), 3)} at location {box}")
Detected a cat with confidence 0.479 at location [344.7, 23.11, 637.18, 374.28]
Detected a cat with confidence 0.438 at location [12.27, 51.91, 316.86, 472.44]
Detected a remote control with confidence 0.478 at location [38.57, 70.0, 176.78, 118.18]
```
## Grounded SAM