From 516ee6adc2a6ac2f4800790cabaad66a1cb4dcf4 Mon Sep 17 00:00:00 2001
From: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
Date: Thu, 12 Sep 2024 11:25:44 +0200
Subject: [PATCH] Fix incomplete sentence in `Zero-shot object detection`
 documentation (#33430)

Rephrase sentence in zero-shot object detection docs
---
 docs/source/en/tasks/zero_shot_object_detection.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/en/tasks/zero_shot_object_detection.md b/docs/source/en/tasks/zero_shot_object_detection.md
index 03e849a6c7..5ac4706bff 100644
--- a/docs/source/en/tasks/zero_shot_object_detection.md
+++ b/docs/source/en/tasks/zero_shot_object_detection.md
@@ -26,8 +26,8 @@ is an open-vocabulary object detector. It means that it can detect objects in im
 the need to fine-tune the model on labeled datasets.
 
 OWL-ViT leverages multi-modal representations to perform open-vocabulary detection. It combines [CLIP](../model_doc/clip) with
-lightweight object classification and localization heads. Open-vocabulary detection is achieved by embedding free-text queries with the text encoder of CLIP and using them as input to the object classification and localization heads.
-associate images and their corresponding textual descriptions, and ViT processes image patches as inputs. The authors
+lightweight object classification and localization heads. Open-vocabulary detection is achieved by embedding free-text queries with the text encoder of CLIP and using them as input to the object classification and localization heads.
+CLIP learns to associate images with their corresponding textual descriptions, while its ViT image encoder processes image patches as inputs. The authors
 of OWL-ViT first trained CLIP from scratch and then fine-tuned OWL-ViT end to end on standard object detection datasets using
 a bipartite matching loss.