Grounding DINO Processor standardization (#34853)

* Add input ids to model output
* Add text preprocessing for processor
* Fix snippet
* Add test for equivalence
* Add type checking guard
* Fixing typehint
* Fix test for added `input_ids` in output
* Add deprecations and "text_labels" to output
* Adjust tests
* Fix test
* Update code examples
* Minor docs and code improvement
* Remove one-liner functions and rename class to CamelCase
* Update docstring
* Fixup
2025-01-17 14:18:16 +00:00
parent 42b2857b01
commit 099d93d2e9
5 changed files with 217 additions and 80 deletions
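
The documentation diff in this commit switches the example's text input from a single period-separated prompt string (`"a cat. a remote control."`) to a nested list of labels (`[["a cat", "a remote control"]]`, one inner list per image). The sketch below shows how the two formats relate. It is an illustration only: the helper names `labels_to_prompt` and `prompt_to_labels` are hypothetical and are not part of the library, and the processor's actual text preprocessing may differ in its details.

```python
from typing import List


def labels_to_prompt(text_labels: List[str]) -> str:
    """Join the labels for one image into a period-separated prompt.

    Hypothetical helper, not the library's implementation.
    """
    return ". ".join(label.strip() for label in text_labels) + "."


def prompt_to_labels(prompt: str) -> List[str]:
    """Split a period-separated prompt back into its individual labels.

    Hypothetical helper; empty fragments (e.g. after the final period)
    are dropped.
    """
    return [part.strip() for part in prompt.split(".") if part.strip()]


# New-style input: one list of labels per image in the batch.
batch_text_labels = [["a cat", "a remote control"]]

# Old-style input: one prompt string per image.
batch_prompts = [labels_to_prompt(labels) for labels in batch_text_labels]
```

Under these assumptions, `batch_prompts[0]` matches the string used in the old example, and `prompt_to_labels` recovers the original list from it.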
--- a/docs/source/en/model_doc/grounding-dino.md
+++ b/docs/source/en/model_doc/grounding-dino.md
@@ -56,25 +56,26 @@ Here's how to use the model for zero-shot object detection:
 >>> image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
 >>> image = Image.open(requests.get(image_url, stream=True).raw)
 >>> # Check for cats and remote controls
->>> text = "a cat. a remote control."
+>>> text_labels = [["a cat", "a remote control"]]

->>> inputs = processor(images=image, text=text, return_tensors="pt").to(device)
+>>> inputs = processor(images=image, text=text_labels, return_tensors="pt").to(device)
 >>> with torch.no_grad():
 ...     outputs = model(**inputs)

 >>> results = processor.post_process_grounded_object_detection(
 ...     outputs,
-...     inputs.input_ids,
-...     box_threshold=0.4,
+...     threshold=0.4,
 ...     text_threshold=0.3,
-...     target_sizes=[image.size[::-1]]
+...     target_sizes=[(image.height, image.width)]
 ... )
->>> print(results)
-[{'boxes': tensor([[344.6959,  23.1090, 637.1833, 374.2751],
-        [ 12.2666,  51.9145, 316.8582, 472.4392],
-        [ 38.5742,  70.0015, 176.7838, 118.1806]], device='cuda:0'),
-  'labels': ['a cat', 'a cat', 'a remote control'],
-  'scores': tensor([0.4785, 0.4381, 0.4776], device='cuda:0')}]
+>>> # Retrieve the first image result
+>>> result = results[0]
+>>> for box, score, text_label in zip(result["boxes"], result["scores"], result["text_labels"]):
+...     box = [round(x, 2) for x in box.tolist()]
+...     print(f"Detected {text_label} with confidence {round(score.item(), 3)} at location {box}")
+Detected a cat with confidence 0.479 at location [344.7, 23.11, 637.18, 374.28]
+Detected a cat with confidence 0.438 at location [12.27, 51.91, 316.86, 472.44]
+Detected a remote control with confidence 0.478 at location [38.57, 70.0, 176.78, 118.18]
 ```

 ## Grounded SAM