Fix incomplete sentence in Zero-shot object detection documentation (#33430)

Rephrase sentence in zero-shot object detection docs
Sergio Paniego Blanco
2024-09-12 11:25:44 +02:00
committed by GitHub
parent e0ff4321d1
commit 516ee6adc2

@@ -26,8 +26,8 @@ is an open-vocabulary object detector. It means that it can detect objects in im
 the need to fine-tune the model on labeled datasets.
 OWL-ViT leverages multi-modal representations to perform open-vocabulary detection. It combines [CLIP](../model_doc/clip) with
-lightweight object classification and localization heads. Open-vocabulary detection is achieved by embedding free-text queries with the text encoder of CLIP and using them as input to the object classification and localization heads.
-associate images and their corresponding textual descriptions, and ViT processes image patches as inputs. The authors
+lightweight object classification and localization heads. Open-vocabulary detection is achieved by embedding free-text queries with the text encoder of CLIP and using them as input to the object classification and localization heads,
+which associate images with their corresponding textual descriptions, while ViT processes image patches as inputs. The authors
 of OWL-ViT first trained CLIP from scratch and then fine-tuned OWL-ViT end to end on standard object detection datasets using
 a bipartite matching loss.
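
The paragraph touched by this change describes text-query embeddings being used as input to the classification head. A toy sketch of that idea follows. It is not the OWL-ViT implementation: the three-dimensional vectors, the `classify_regions` helper, the cosine-similarity scoring, and the `threshold` value are all assumptions made for illustration. In the real model the embeddings come from the CLIP text encoder and the ViT image encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def classify_regions(region_embeddings, query_embeddings, threshold=0.5):
    """Label each image region with its best-matching text query.

    Hypothetical helper: regions whose best similarity falls below
    `threshold` are treated as background and dropped.
    """
    detections = []
    for index, region in enumerate(region_embeddings):
        scores = {label: cosine(region, q) for label, q in query_embeddings.items()}
        best_label = max(scores, key=scores.get)
        if scores[best_label] >= threshold:
            detections.append((index, best_label, scores[best_label]))
    return detections


# Made-up embeddings standing in for text-encoder output (one per free-text query).
queries = {"cat": [1.0, 0.0, 0.0], "remote control": [0.0, 1.0, 0.0]}

# Made-up embeddings standing in for per-region image features.
regions = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.0, 1.0]]

detections = classify_regions(regions, queries)
for index, label, score in detections:
    print(f"region {index}: {label} ({score:.2f})")
```

Because the label set is just the dictionary of query embeddings, changing the queries changes what can be detected without retraining anything, which is the sense in which the detector is open-vocabulary.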