Add Neighborhood Attention Transformer (NAT) and Dilated NAT (DiNAT) models (#20219)
* Add DiNAT * Adds DiNAT + tests * Minor fixes * Added HF model * Add natten to dependencies. * Cleanup * Minor fixup * Reformat * Optional NATTEN import. * Reformat & add doc to _toctree * Reformat (finally) * Dummy objects for DiNAT * Add NAT + minor changes Adds NAT as its own independent model + docs, tests Adds NATTEN to ext deps to ensure ci picks it up. * Remove natten from `all` and `dev-torch` deps, add manual pip install to ci tests * Minor fixes. * Fix READMEs. * Requested changes to docs + minor fixes. * Requested changes. * Add NAT/DiNAT tests to layoutlm_job * Correction to Dinat doc. * Requested changes.
This commit is contained in:
@@ -396,6 +396,8 @@
|
||||
title: DeiT
|
||||
- local: model_doc/detr
|
||||
title: DETR
|
||||
- local: model_doc/dinat
|
||||
title: DiNAT
|
||||
- local: model_doc/dit
|
||||
title: DiT
|
||||
- local: model_doc/dpt
|
||||
@@ -412,6 +414,8 @@
|
||||
title: MobileNetV2
|
||||
- local: model_doc/mobilevit
|
||||
title: MobileViT
|
||||
- local: model_doc/nat
|
||||
title: NAT
|
||||
- local: model_doc/poolformer
|
||||
title: PoolFormer
|
||||
- local: model_doc/regnet
|
||||
|
||||
@@ -83,6 +83,7 @@ The documentation is organized into five sections:
|
||||
1. **[DeiT](model_doc/deit)** (from Facebook) released with the paper [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou.
|
||||
1. **[DETR](model_doc/detr)** (from Facebook) released with the paper [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko.
|
||||
1. **[DialoGPT](model_doc/dialogpt)** (from Microsoft Research) released with the paper [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://arxiv.org/abs/1911.00536) by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
|
||||
1. **[DiNAT](model_doc/dinat)** (from SHI Labs) released with the paper [Dilated Neighborhood Attention Transformer](https://arxiv.org/abs/2209.15001) by Ali Hassani and Humphrey Shi.
|
||||
1. **[DistilBERT](model_doc/distilbert)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), RoBERTa into [DistilRoBERTa](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), Multilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation) and a German version of DistilBERT.
|
||||
1. **[DiT](model_doc/dit)** (from Microsoft Research) released with the paper [DiT: Self-supervised Pre-training for Document Image Transformer](https://arxiv.org/abs/2203.02378) by Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei.
|
||||
1. **[Donut](model_doc/donut)** (from NAVER), released together with the paper [OCR-free Document Understanding Transformer](https://arxiv.org/abs/2111.15664) by Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park.
|
||||
@@ -136,6 +137,7 @@ The documentation is organized into five sections:
|
||||
1. **[MPNet](model_doc/mpnet)** (from Microsoft Research) released with the paper [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu.
|
||||
1. **[MT5](model_doc/mt5)** (from Google AI) released with the paper [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
|
||||
1. **[MVP](model_doc/mvp)** (from RUC AI Box) released with the paper [MVP: Multi-task Supervised Pre-training for Natural Language Generation](https://arxiv.org/abs/2206.12131) by Tianyi Tang, Junyi Li, Wayne Xin Zhao and Ji-Rong Wen.
|
||||
1. **[NAT](model_doc/nat)** (from SHI Labs) released with the paper [Neighborhood Attention Transformer](https://arxiv.org/abs/2204.07143) by Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi.
|
||||
1. **[Nezha](model_doc/nezha)** (from Huawei Noah’s Ark Lab) released with the paper [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) by Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu.
|
||||
1. **[NLLB](model_doc/nllb)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team.
|
||||
1. **[Nyströmformer](model_doc/nystromformer)** (from the University of Wisconsin - Madison) released with the paper [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) by Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh.
|
||||
@@ -244,6 +246,7 @@ Flax), PyTorch, and/or TensorFlow.
|
||||
| Deformable DETR | ❌ | ❌ | ✅ | ❌ | ❌ |
|
||||
| DeiT | ❌ | ❌ | ✅ | ✅ | ❌ |
|
||||
| DETR | ❌ | ❌ | ✅ | ❌ | ❌ |
|
||||
| DiNAT | ❌ | ❌ | ✅ | ❌ | ❌ |
|
||||
| DistilBERT | ✅ | ✅ | ✅ | ✅ | ✅ |
|
||||
| DonutSwin | ❌ | ❌ | ✅ | ❌ | ❌ |
|
||||
| DPR | ✅ | ✅ | ✅ | ✅ | ❌ |
|
||||
@@ -290,6 +293,7 @@ Flax), PyTorch, and/or TensorFlow.
|
||||
| MPNet | ✅ | ✅ | ✅ | ✅ | ❌ |
|
||||
| MT5 | ✅ | ✅ | ✅ | ✅ | ✅ |
|
||||
| MVP | ✅ | ✅ | ✅ | ❌ | ❌ |
|
||||
| NAT | ❌ | ❌ | ✅ | ❌ | ❌ |
|
||||
| Nezha | ❌ | ❌ | ✅ | ❌ | ❌ |
|
||||
| Nyströmformer | ❌ | ❌ | ✅ | ❌ | ❌ |
|
||||
| OpenAI GPT | ✅ | ✅ | ✅ | ✅ | ❌ |
|
||||
|
||||
78
docs/source/en/model_doc/dinat.mdx
Normal file
78
docs/source/en/model_doc/dinat.mdx
Normal file
@@ -0,0 +1,78 @@
|
||||
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# Dilated Neighborhood Attention Transformer
|
||||
|
||||
## Overview
|
||||
|
||||
DiNAT was proposed in [Dilated Neighborhood Attention Transformer](https://arxiv.org/abs/2209.15001)
|
||||
by Ali Hassani and Humphrey Shi.
|
||||
|
||||
It extends [NAT](nat) by adding a Dilated Neighborhood Attention pattern to capture global context,
|
||||
and shows significant performance improvements over it.
|
||||
|
||||
The abstract from the paper is the following:
|
||||
|
||||
*Transformers are quickly becoming one of the most heavily applied deep learning architectures across modalities,
|
||||
domains, and tasks. In vision, on top of ongoing efforts into plain transformers, hierarchical transformers have
|
||||
also gained significant attention, thanks to their performance and easy integration into existing frameworks.
|
||||
These models typically employ localized attention mechanisms, such as the sliding-window Neighborhood Attention (NA)
|
||||
or Swin Transformer's Shifted Window Self Attention. While effective at reducing self attention's quadratic complexity,
|
||||
local attention weakens two of the most desirable properties of self attention: long range inter-dependency modeling,
|
||||
and global receptive field. In this paper, we introduce Dilated Neighborhood Attention (DiNA), a natural, flexible and
|
||||
efficient extension to NA that can capture more global context and expand receptive fields exponentially at no
|
||||
additional cost. NA's local attention and DiNA's sparse global attention complement each other, and therefore we
|
||||
introduce Dilated Neighborhood Attention Transformer (DiNAT), a new hierarchical vision transformer built upon both.
|
||||
DiNAT variants enjoy significant improvements over strong baselines such as NAT, Swin, and ConvNeXt.
|
||||
Our large model is faster and ahead of its Swin counterpart by 1.5% box AP in COCO object detection,
|
||||
1.3% mask AP in COCO instance segmentation, and 1.1% mIoU in ADE20K semantic segmentation.
|
||||
Paired with new frameworks, our large variant is the new state of the art panoptic segmentation model on COCO (58.2 PQ)
|
||||
and ADE20K (48.5 PQ), and instance segmentation model on Cityscapes (44.5 AP) and ADE20K (35.4 AP) (no extra data).
|
||||
It also matches the state of the art specialized semantic segmentation models on ADE20K (58.2 mIoU),
|
||||
and ranks second on Cityscapes (84.5 mIoU) (no extra data). *
|
||||
|
||||
Tips:
|
||||
- One can use the [`AutoImageProcessor`] API to prepare images for the model.
|
||||
- DiNAT can be used as a *backbone*. When `output_hidden_states = True`,
|
||||
it will output both `hidden_states` and `reshaped_hidden_states`. The `reshaped_hidden_states` have a shape of `(batch, num_channels, height, width)` rather than `(batch_size, height, width, num_channels)`.
|
||||
|
||||
Notes:
|
||||
- DiNAT depends on [NATTEN](https://github.com/SHI-Labs/NATTEN/)'s implementation of Neighborhood Attention and Dilated Neighborhood Attention.
|
||||
You can install it with pre-built wheels for Linux by referring to [shi-labs.com/natten](https://shi-labs.com/natten), or build on your system by running `pip install natten`.
|
||||
Note that the latter will likely take time to compile. NATTEN does not support Windows devices yet.
|
||||
- Patch size of 4 is only supported at the moment.
|
||||
|
||||
<img
|
||||
src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/dilated-neighborhood-attention-pattern.jpg"
|
||||
alt="drawing" width="600"/>
|
||||
|
||||
<small> Neighborhood Attention with different dilation values.
|
||||
Taken from the <a href="https://arxiv.org/abs/2209.15001">original paper</a>.</small>
|
||||
|
||||
This model was contributed by [Ali Hassani](https://huggingface.co/alihassanijr).
|
||||
The original code can be found [here](https://github.com/SHI-Labs/Neighborhood-Attention-Transformer).
|
||||
|
||||
|
||||
## DinatConfig
|
||||
|
||||
[[autodoc]] DinatConfig
|
||||
|
||||
|
||||
## DinatModel
|
||||
|
||||
[[autodoc]] DinatModel
|
||||
- forward
|
||||
|
||||
## DinatForImageClassification
|
||||
|
||||
[[autodoc]] DinatForImageClassification
|
||||
- forward
|
||||
73
docs/source/en/model_doc/nat.mdx
Normal file
73
docs/source/en/model_doc/nat.mdx
Normal file
@@ -0,0 +1,73 @@
|
||||
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# Neighborhood Attention Transformer
|
||||
|
||||
## Overview
|
||||
|
||||
NAT was proposed in [Neighborhood Attention Transformer](https://arxiv.org/abs/2204.07143)
|
||||
by Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi.
|
||||
|
||||
It is a hierarchical vision transformer based on Neighborhood Attention, a sliding-window self attention pattern.
|
||||
|
||||
The abstract from the paper is the following:
|
||||
|
||||
*We present Neighborhood Attention (NA), the first efficient and scalable sliding-window attention mechanism for vision.
|
||||
NA is a pixel-wise operation, localizing self attention (SA) to the nearest neighboring pixels, and therefore enjoys a
|
||||
linear time and space complexity compared to the quadratic complexity of SA. The sliding-window pattern allows NA's
|
||||
receptive field to grow without needing extra pixel shifts, and preserves translational equivariance, unlike
|
||||
Swin Transformer's Window Self Attention (WSA). We develop NATTEN (Neighborhood Attention Extension), a Python package
|
||||
with efficient C++ and CUDA kernels, which allows NA to run up to 40% faster than Swin's WSA while using up to 25% less
|
||||
memory. We further present Neighborhood Attention Transformer (NAT), a new hierarchical transformer design based on NA
|
||||
that boosts image classification and downstream vision performance. Experimental results on NAT are competitive;
|
||||
NAT-Tiny reaches 83.2% top-1 accuracy on ImageNet, 51.4% mAP on MS-COCO and 48.4% mIoU on ADE20K, which is 1.9%
|
||||
ImageNet accuracy, 1.0% COCO mAP, and 2.6% ADE20K mIoU improvement over a Swin model with similar size. *
|
||||
|
||||
Tips:
|
||||
- One can use the [`AutoImageProcessor`] API to prepare images for the model.
|
||||
- NAT can be used as a *backbone*. When `output_hidden_states = True`,
|
||||
it will output both `hidden_states` and `reshaped_hidden_states`.
|
||||
The `reshaped_hidden_states` have a shape of `(batch, num_channels, height, width)` rather than
|
||||
`(batch_size, height, width, num_channels)`.
|
||||
|
||||
Notes:
|
||||
- NAT depends on [NATTEN](https://github.com/SHI-Labs/NATTEN/)'s implementation of Neighborhood Attention.
|
||||
You can install it with pre-built wheels for Linux by referring to [shi-labs.com/natten](https://shi-labs.com/natten),
|
||||
or build on your system by running `pip install natten`.
|
||||
Note that the latter will likely take time to compile. NATTEN does not support Windows devices yet.
|
||||
- Patch size of 4 is only supported at the moment.
|
||||
|
||||
<img
|
||||
src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/neighborhood-attention-pattern.jpg"
|
||||
alt="drawing" width="600"/>
|
||||
|
||||
<small> Neighborhood Attention compared to other attention patterns.
|
||||
Taken from the <a href="https://arxiv.org/abs/2204.07143">original paper</a>.</small>
|
||||
|
||||
This model was contributed by [Ali Hassani](https://huggingface.co/alihassanijr).
|
||||
The original code can be found [here](https://github.com/SHI-Labs/Neighborhood-Attention-Transformer).
|
||||
|
||||
|
||||
## NatConfig
|
||||
|
||||
[[autodoc]] NatConfig
|
||||
|
||||
|
||||
## NatModel
|
||||
|
||||
[[autodoc]] NatModel
|
||||
- forward
|
||||
|
||||
## NatForImageClassification
|
||||
|
||||
[[autodoc]] NatForImageClassification
|
||||
- forward
|
||||
Reference in New Issue
Block a user