Add ViTDet (#25524)
* First draft * Fix READMEs * Update return_dict * Add more tests * Fix docstrings * Address comments * Address more comments * Address more comments * Address more comments, fix test * Fix test
This commit is contained in:
@@ -558,6 +558,8 @@
|
||||
title: Vision Transformer (ViT)
|
||||
- local: model_doc/vit_hybrid
|
||||
title: ViT Hybrid
|
||||
- local: model_doc/vitdet
|
||||
title: ViTDet
|
||||
- local: model_doc/vit_mae
|
||||
title: ViTMAE
|
||||
- local: model_doc/vit_msn
|
||||
|
||||
@@ -252,6 +252,7 @@ The documentation is organized into five sections:
|
||||
1. **[Vision Transformer (ViT)](model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
|
||||
1. **[VisualBERT](model_doc/visual_bert)** (from UCLA NLP) released with the paper [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
|
||||
1. **[ViT Hybrid](model_doc/vit_hybrid)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
|
||||
1. **[VitDet](model_doc/vitdet)** (from Meta AI) released with the paper [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) by Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He.
|
||||
1. **[ViTMAE](model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
|
||||
1. **[ViTMSN](model_doc/vit_msn)** (from Meta AI) released with the paper [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas.
|
||||
1. **[ViViT](model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid.
|
||||
@@ -471,6 +472,7 @@ Flax), PyTorch, and/or TensorFlow.
|
||||
| VisualBERT | ✅ | ❌ | ❌ |
|
||||
| ViT | ✅ | ✅ | ✅ |
|
||||
| ViT Hybrid | ✅ | ❌ | ❌ |
|
||||
| VitDet | ✅ | ❌ | ❌ |
|
||||
| ViTMAE | ✅ | ✅ | ❌ |
|
||||
| ViTMSN | ✅ | ❌ | ❌ |
|
||||
| ViViT | ✅ | ❌ | ❌ |
|
||||
|
||||
39
docs/source/en/model_doc/vitdet.md
Normal file
39
docs/source/en/model_doc/vitdet.md
Normal file
@@ -0,0 +1,39 @@
|
||||
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# ViTDet
|
||||
|
||||
## Overview
|
||||
|
||||
The ViTDet model was proposed in [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) by Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He.
|
||||
VitDet leverages the plain [Vision Transformer](vit) for the task of object detection.
|
||||
|
||||
The abstract from the paper is the following:
|
||||
|
||||
*We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone network for object detection. This design enables the original ViT architecture to be fine-tuned for object detection without needing to redesign a hierarchical backbone for pre-training. With minimal adaptations for fine-tuning, our plain-backbone detector can achieve competitive results. Surprisingly, we observe: (i) it is sufficient to build a simple feature pyramid from a single-scale feature map (without the common FPN design) and (ii) it is sufficient to use window attention (without shifting) aided with very few cross-window propagation blocks. With plain ViT backbones pre-trained as Masked Autoencoders (MAE), our detector, named ViTDet, can compete with the previous leading methods that were all based on hierarchical backbones, reaching up to 61.3 AP_box on the COCO dataset using only ImageNet-1K pre-training. We hope our study will draw attention to research on plain-backbone detectors.*
|
||||
|
||||
Tips:
|
||||
|
||||
- For the moment, only the backbone is available.
|
||||
|
||||
This model was contributed by [nielsr](https://huggingface.co/nielsr).
|
||||
The original code can be found [here](https://github.com/facebookresearch/detectron2/tree/main/projects/ViTDet).
|
||||
|
||||
|
||||
## VitDetConfig
|
||||
|
||||
[[autodoc]] VitDetConfig
|
||||
|
||||
## VitDetModel
|
||||
|
||||
[[autodoc]] VitDetModel
|
||||
- forward
|
||||
Reference in New Issue
Block a user