Pvt model (#24720)
* pull and push updates * add docs * fix modeling * Add and run test * make copies * add task * fix tests and fix small issues * Checks on a Pull Request * fix docs * add desc pvt.md
This commit is contained in:
@@ -514,6 +514,8 @@
|
||||
title: NAT
|
||||
- local: model_doc/poolformer
|
||||
title: PoolFormer
|
||||
- local: model_doc/pvt
|
||||
title: Pyramid Vision Transformer (PVT)
|
||||
- local: model_doc/regnet
|
||||
title: RegNet
|
||||
- local: model_doc/resnet
|
||||
|
||||
@@ -199,6 +199,7 @@ The documentation is organized into five sections:
|
||||
1. **[PLBart](model_doc/plbart)** (from UCLA NLP) released with the paper [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) by Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang.
|
||||
1. **[PoolFormer](model_doc/poolformer)** (from Sea AI Labs) released with the paper [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) by Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng.
|
||||
1. **[ProphetNet](model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
|
||||
1. **[PVT](model_doc/pvt)** (from Nanjing University, The University of Hong Kong etc.) released with the paper [Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions](https://arxiv.org/pdf/2102.12122.pdf) by Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao.
|
||||
1. **[QDQBert](model_doc/qdqbert)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius.
|
||||
1. **[RAG](model_doc/rag)** (from Facebook) released with the paper [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401) by Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela.
|
||||
1. **[REALM](model_doc/realm.html)** (from Google Research) released with the paper [REALM: Retrieval-Augmented Language Model Pre-Training](https://arxiv.org/abs/2002.08909) by Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat and Ming-Wei Chang.
|
||||
@@ -411,6 +412,7 @@ Flax), PyTorch, and/or TensorFlow.
|
||||
| PLBart | ✅ | ❌ | ❌ |
|
||||
| PoolFormer | ✅ | ❌ | ❌ |
|
||||
| ProphetNet | ✅ | ❌ | ❌ |
|
||||
| PVT | ✅ | ❌ | ❌ |
|
||||
| QDQBert | ✅ | ❌ | ❌ |
|
||||
| RAG | ✅ | ✅ | ❌ |
|
||||
| REALM | ✅ | ❌ | ❌ |
|
||||
|
||||
71
docs/source/en/model_doc/pvt.md
Normal file
71
docs/source/en/model_doc/pvt.md
Normal file
@@ -0,0 +1,71 @@
|
||||
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# Pyramid Vision Transformer (PVT)
|
||||
|
||||
## Overview
|
||||
|
||||
The PVT model was proposed in
|
||||
[Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions](https://arxiv.org/abs/2102.12122)
|
||||
by Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao. The PVT is a type of
|
||||
vision transformer that utilizes a pyramid structure to make it an effective backbone for dense prediction tasks. Specifically
|
||||
it allows for more fine-grained inputs (4 x 4 pixels per patch) to be used, while simultaneously shrinking the sequence length
|
||||
of the Transformer as it deepens - reducing the computational cost. Additionally, a spatial-reduction attention (SRA) layer
|
||||
is used to further reduce the resource consumption when learning high-resolution features.
|
||||
|
||||
The abstract from the paper is the following:
|
||||
|
||||
*Although convolutional neural networks (CNNs) have achieved great success in computer vision, this work investigates a
|
||||
simpler, convolution-free backbone network useful for many dense prediction tasks. Unlike the recently proposed Vision
|
||||
Transformer (ViT) that was designed for image classification specifically, we introduce the Pyramid Vision Transformer
|
||||
(PVT), which overcomes the difficulties of porting Transformer to various dense prediction tasks. PVT has several
|
||||
merits compared to current state of the arts. Different from ViT that typically yields low resolution outputs and
|
||||
incurs high computational and memory costs, PVT not only can be trained on dense partitions of an image to achieve high
|
||||
output resolution, which is important for dense prediction, but also uses a progressive shrinking pyramid to reduce the
|
||||
computations of large feature maps. PVT inherits the advantages of both CNN and Transformer, making it a unified
|
||||
backbone for various vision tasks without convolutions, where it can be used as a direct replacement for CNN backbones.
|
||||
We validate PVT through extensive experiments, showing that it boosts the performance of many downstream tasks, including
|
||||
object detection, instance and semantic segmentation. For example, with a comparable number of parameters, PVT+RetinaNet
|
||||
achieves 40.4 AP on the COCO dataset, surpassing ResNet50+RetinNet (36.3 AP) by 4.1 absolute AP (see Figure 2). We hope
|
||||
that PVT could serve as an alternative and useful backbone for pixel-level predictions and facilitate future research.*
|
||||
|
||||
This model was contributed by [Xrenya](<https://huggingface.co/Xrenya). The original code can be found [here](https://github.com/whai362/PVT).
|
||||
|
||||
|
||||
- PVTv1 on ImageNet-1K
|
||||
|
||||
| **Model variant** |**Size** |**Acc@1**|**Params (M)**|
|
||||
|--------------------|:-------:|:-------:|:------------:|
|
||||
| PVT-Tiny | 224 | 75.1 | 13.2 |
|
||||
| PVT-Small | 224 | 79.8 | 24.5 |
|
||||
| PVT-Medium | 224 | 81.2 | 44.2 |
|
||||
| PVT-Large | 224 | 81.7 | 61.4 |
|
||||
|
||||
|
||||
## PvtConfig
|
||||
|
||||
[[autodoc]] PvtConfig
|
||||
|
||||
## PvtImageProcessor
|
||||
|
||||
[[autodoc]] PvtImageProcessor
|
||||
- preprocess
|
||||
|
||||
## PvtForImageClassification
|
||||
|
||||
[[autodoc]] PvtForImageClassification
|
||||
- forward
|
||||
|
||||
## PvtModel
|
||||
|
||||
[[autodoc]] PvtModel
|
||||
- forward
|
||||
@@ -34,7 +34,8 @@ The task illustrated in this tutorial is supported by the following model archit
|
||||
|
||||
<!--This tip is automatically generated by `make fix-copies`, do not fill manually!-->
|
||||
|
||||
[BEiT](../model_doc/beit), [BiT](../model_doc/bit), [ConvNeXT](../model_doc/convnext), [ConvNeXTV2](../model_doc/convnextv2), [CvT](../model_doc/cvt), [Data2VecVision](../model_doc/data2vec-vision), [DeiT](../model_doc/deit), [DiNAT](../model_doc/dinat), [DINOv2](../model_doc/dinov2), [EfficientFormer](../model_doc/efficientformer), [EfficientNet](../model_doc/efficientnet), [FocalNet](../model_doc/focalnet), [ImageGPT](../model_doc/imagegpt), [LeViT](../model_doc/levit), [MobileNetV1](../model_doc/mobilenet_v1), [MobileNetV2](../model_doc/mobilenet_v2), [MobileViT](../model_doc/mobilevit), [MobileViTV2](../model_doc/mobilevitv2), [NAT](../model_doc/nat), [Perceiver](../model_doc/perceiver), [PoolFormer](../model_doc/poolformer), [RegNet](../model_doc/regnet), [ResNet](../model_doc/resnet), [SegFormer](../model_doc/segformer), [SwiftFormer](../model_doc/swiftformer), [Swin Transformer](../model_doc/swin), [Swin Transformer V2](../model_doc/swinv2), [VAN](../model_doc/van), [ViT](../model_doc/vit), [ViT Hybrid](../model_doc/vit_hybrid), [ViTMSN](../model_doc/vit_msn)
|
||||
[BEiT](../model_doc/beit), [BiT](../model_doc/bit), [ConvNeXT](../model_doc/convnext), [ConvNeXTV2](../model_doc/convnextv2), [CvT](../model_doc/cvt), [Data2VecVision](../model_doc/data2vec-vision), [DeiT](../model_doc/deit), [DiNAT](../model_doc/dinat), [DINOv2](../model_doc/dinov2), [EfficientFormer](../model_doc/efficientformer), [EfficientNet](../model_doc/efficientnet), [FocalNet](../model_doc/focalnet), [ImageGPT](../model_doc/imagegpt), [LeViT](../model_doc/levit), [MobileNetV1](../model_doc/mobilenet_v1), [MobileNetV2](../model_doc/mobilenet_v2), [MobileViT](../model_doc/mobilevit), [MobileViTV2](../model_doc/mobilevitv2), [NAT](../model_doc/nat), [Perceiver](../model_doc/perceiver), [PoolFormer](../model_doc/poolformer), [PVT](../model_doc/pvt), [RegNet](../model_doc/regnet), [ResNet](../model_doc/resnet), [SegFormer](../model_doc/segformer), [SwiftFormer](../model_doc/swiftformer), [Swin Transformer](../model_doc/swin), [Swin Transformer V2](../model_doc/swinv2), [VAN](../model_doc/van), [ViT](../model_doc/vit), [ViT Hybrid](../model_doc/vit_hybrid), [ViTMSN](../model_doc/vit_msn)
|
||||
|
||||
<!--End of the generated tip-->
|
||||
|
||||
</Tip>
|
||||
|
||||
Reference in New Issue
Block a user