From 516ee6adc2a6ac2f4800790cabaad66a1cb4dcf4 Mon Sep 17 00:00:00 2001
From: Sergio Paniego Blanco
Date: Thu, 12 Sep 2024 11:25:44 +0200
Subject: [PATCH] Fix incomplete sentence in `Zero-shot object detection` documentation (#33430)

Rephrase sentence in zero-shot object detection docs
---
 docs/source/en/tasks/zero_shot_object_detection.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/en/tasks/zero_shot_object_detection.md b/docs/source/en/tasks/zero_shot_object_detection.md
index 03e849a6c7..5ac4706bff 100644
--- a/docs/source/en/tasks/zero_shot_object_detection.md
+++ b/docs/source/en/tasks/zero_shot_object_detection.md
@@ -26,8 +26,8 @@ is an open-vocabulary object detector. It means that it can detect objects in im
 the need to fine-tune the model on labeled datasets.
 OWL-ViT leverages multi-modal representations to perform open-vocabulary detection. It combines [CLIP](../model_doc/clip) with
-lightweight object classification and localization heads. Open-vocabulary detection is achieved by embedding free-text queries with the text encoder of CLIP and using them as input to the object classification and localization heads.
-associate images and their corresponding textual descriptions, and ViT processes image patches as inputs. The authors
+lightweight object classification and localization heads. Open-vocabulary detection is achieved by embedding free-text queries with the text encoder of CLIP and using them as input to the object classification and localization heads,
+which associate images with their corresponding textual descriptions, while ViT processes image patches as inputs. The authors
 of OWL-ViT first trained CLIP from scratch and then fine-tuned OWL-ViT end to end on standard object detection datasets using a bipartite matching loss.