Add OneFormer Model (#20577)

* Add Oneformer Model

* Add OneFormer Tests

* Add UNIVERSAL_SEGMENTATION_MAPPING

* Fix config

* 🐛 Fix error encountered while writing tests

* 🔨 Fix instance segmentation post processing

* Format Files and Add Documentation

* Add Documentation mdx file

* Run make fixup

* Run make fix-copies

* Remove unnecessary code

* Format modeling_oneformer.py

* Add OneFormer to ImageSegmentationPipeline

* Format files

* Add Demo link to Readme

* Fix fomatting errors

* Fix test failures

* Update Table in index.mdx

* Fix version

* Fix style

* Remove OneFormer from TF

* Fix Imports

* Fix dummy objects

* Fix tests

* Add newline

* Remove OneFormerFeatureExtractor

* Remove CUDA Kernels

* Use AutoBackbone for Swin

* Fix description

* Use Image Processor

* Fix copies

* Fix formatting

* Fix import order

* Fix flake8 errors

* Fix doc errors

* Add Hindi Readme entry

* Update supported backbones

* Update supported backbones

* Undo Changes

* Fix type of config

* Fix isort

* Fix auto.mdx

* Fix swin config

* Replace DinatBackbone with AutoBackbone

* Use SwinBackbone

* Use SwinBackbone

* Fix conversion script

* Fix arguments

* Add argument description

* Fix style

* Add OneFormerProcessor

* Fix OneFormerProcessor Tests

* Fix mapping

* Fix imports

* Fix inits

* Fix style

* Fix comment

* Fix docstring

* Move OneFormer to MultiModal

* Fix Copies

* Remove size divisor

* Fix check_repo.py

* Fix copies

* Add Processor for Testing Pipeline

* Fix padding for tokens

* Fix variables

* Fix formatting with correct black version

* Add Image Processor Test

* Apply suggestions

* Revert common modeling

* Add check for task

* Fix conversion script

* Fix initialization order

* Fix tests

* Undo Pipeline Changes

* Fix layers in MLP

* Fix copies

* Update image paths

* Fix copies

* Apply suggestions
This commit is contained in:
Jitesh Jain
2023-01-19 14:01:07 +05:30
committed by GitHub
parent 6d67664380
commit 5b949623c7
35 changed files with 8161 additions and 0 deletions

View File

@@ -94,6 +94,7 @@ In Computer Vision:
- [Panoptic Segmentation with MaskFormer](https://huggingface.co/facebook/maskformer-swin-small-coco)
- [Depth Estimation with DPT](https://huggingface.co/docs/transformers/model_doc/dpt)
- [Video Classification with VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae)
- [Universal Segmentation with OneFormer](https://huggingface.co/shi-labs/oneformer_ade20k_dinat_large)
In Audio:
- [Automatic Speech Recognition with Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base-960h)
@@ -371,6 +372,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (from Huawei Noahs Ark Lab) released with the paper [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) by Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu.
1. **[NLLB](https://huggingface.co/docs/transformers/model_doc/nllb)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team.
1. **[Nyströmformer](https://huggingface.co/docs/transformers/model_doc/nystromformer)** (from the University of Wisconsin - Madison) released with the paper [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) by Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh.
1. **[OneFormer](https://huggingface.co/docs/transformers/main/model_doc/oneformer)** (from SHI Labs) released with the paper [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) by Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi.
1. **[OPT](https://huggingface.co/docs/transformers/master/model_doc/opt)** (from Meta AI) released with the paper [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) by Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al.
1. **[OWL-ViT](https://huggingface.co/docs/transformers/model_doc/owlvit)** (from Google AI) released with the paper [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230) by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby.
1. **[Pegasus](https://huggingface.co/docs/transformers/model_doc/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.

View File

@@ -92,6 +92,7 @@ En visión de ordenador:
- [Detección de objetos con DETR](https://huggingface.co/facebook/detr-resnet-50)
- [Segmentación semántica con SegFormer](https://huggingface.co/nvidia/segformer-b0-finetuned-ade-512-512)
- [Segmentación panóptica con DETR](https://huggingface.co/facebook/detr-resnet-50-panoptic)
- [Segmentación Universal con OneFormer (Segmentación Semántica, de Instancia y Panóptica con un solo modelo)](https://huggingface.co/shi-labs/oneformer_ade20k_dinat_large)
En Audio:
- [Reconocimiento de voz automático con Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base-960h)
@@ -364,6 +365,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt
1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (from Huawei Noahs Ark Lab) released with the paper [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) by Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu.
1. **[NLLB](https://huggingface.co/docs/transformers/model_doc/nllb)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team.
1. **[Nyströmformer](https://huggingface.co/docs/transformers/model_doc/nystromformer)** (from the University of Wisconsin - Madison) released with the paper [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) by Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh.
1. **[OneFormer](https://huggingface.co/docs/transformers/main/model_doc/oneformer)** (from SHI Labs) released with the paper [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) by Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi.
1. **[OPT](https://huggingface.co/docs/transformers/master/model_doc/opt)** (from Meta AI) released with the paper [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) by Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al.
1. **[OWL-ViT](https://huggingface.co/docs/transformers/model_doc/owlvit)** (from Google AI) released with the paper [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230) by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby.
1. **[Pegasus](https://huggingface.co/docs/transformers/model_doc/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.

View File

@@ -337,6 +337,7 @@ conda install -c huggingface transformers
1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (हुआवेई नूह के आर्क लैब से) साथ में कागज़ [NEZHA: चीनी भाषा समझ के लिए तंत्रिका प्रासंगिक प्रतिनिधित्व](https :/ /arxiv.org/abs/1909.00204) जुन्किउ वेई, ज़ियाओज़े रेन, ज़िआओगुआंग ली, वेनयोंग हुआंग, यी लियाओ, याशेंग वांग, जियाशू लिन, शिन जियांग, जिओ चेन और कुन लियू द्वारा।
1. **[NLLB](https://huggingface.co/docs/transformers/model_doc/nllb)** (फ्रॉम मेटा) साथ में पेपर [नो लैंग्वेज लेफ्ट बिहाइंड: स्केलिंग ह्यूमन-सेंटेड मशीन ट्रांसलेशन] (https://arxiv.org/abs/2207.04672) एनएलएलबी टीम द्वारा प्रकाशित।
1. **[Nyströmformer](https://huggingface.co/docs/transformers/model_doc/nystromformer)** (विस्कॉन्सिन विश्वविद्यालय - मैडिसन से) साथ में कागज [Nyströmformer: A Nyström- आधारित एल्गोरिथम आत्म-ध्यान का अनुमान लगाने के लिए ](https://arxiv.org/abs/2102.03902) युनयांग ज़िओंग, झानपेंग ज़ेंग, रुद्रसिस चक्रवर्ती, मिंगक्सिंग टैन, ग्लेन फंग, यिन ली, विकास सिंह द्वारा पोस्ट किया गया।
1. **[OneFormer](https://huggingface.co/docs/transformers/main/model_doc/oneformer)** (SHI Labs से) पेपर [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) जितेश जैन, जिआचेन ली, मांगटिक चिउ, अली हसनी, निकिता ओरलोव, हम्फ्री शि के द्वारा जारी किया गया है।
1. **[OPT](https://huggingface.co/docs/transformers/master/model_doc/opt)** (from Meta AI) released with the paper [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) by Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al.
1. **[OWL-ViT](https://huggingface.co/docs/transformers/model_doc/owlvit)** (Google AI से) साथ में कागज [विज़न ट्रांसफॉर्मर्स के साथ सिंपल ओपन-वोकैबुलरी ऑब्जेक्ट डिटेक्शन](https:/ /arxiv.org/abs/2205.06230) मैथियास मिंडरर, एलेक्सी ग्रिट्सेंको, ऑस्टिन स्टोन, मैक्सिम न्यूमैन, डिर्क वीसेनबोर्न, एलेक्सी डोसोवित्स्की, अरविंद महेंद्रन, अनुराग अर्नब, मुस्तफा देहघानी, ज़ुओरन शेन, जिओ वांग, ज़ियाओहुआ झाई, थॉमस किफ़, और नील हॉल्सबी द्वारा पोस्ट किया गया।
1. **[Pegasus](https://huggingface.co/docs/transformers/model_doc/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.

View File

@@ -399,6 +399,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (Huawei Noahs Ark Lab から) Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu から公開された研究論文: [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204)
1. **[NLLB](https://huggingface.co/docs/transformers/model_doc/nllb)** (Meta から) the NLLB team から公開された研究論文: [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672)
1. **[Nyströmformer](https://huggingface.co/docs/transformers/model_doc/nystromformer)** (the University of Wisconsin - Madison から) Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh から公開された研究論文: [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902)
1. **[OneFormer](https://huggingface.co/docs/transformers/main/model_doc/oneformer)** (SHI Labs から) Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi から公開された研究論文: [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220)
1. **[OPT](https://huggingface.co/docs/transformers/master/model_doc/opt)** (Meta AI から) Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al から公開された研究論文: [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068)
1. **[OWL-ViT](https://huggingface.co/docs/transformers/model_doc/owlvit)** (Google AI から) Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby から公開された研究論文: [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230)
1. **[Pegasus](https://huggingface.co/docs/transformers/model_doc/pegasus)** (Google から) Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu から公開された研究論文: [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777)

View File

@@ -314,6 +314,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (Huawei Noahs Ark Lab 에서) Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu 의 [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) 논문과 함께 발표했습니다.
1. **[NLLB](https://huggingface.co/docs/transformers/model_doc/nllb)** (Meta 에서) the NLLB team 의 [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) 논문과 함께 발표했습니다.
1. **[Nyströmformer](https://huggingface.co/docs/transformers/model_doc/nystromformer)** (the University of Wisconsin - Madison 에서) Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh 의 [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) 논문과 함께 발표했습니다.
1. **[OneFormer](https://huggingface.co/docs/transformers/main/model_doc/oneformer)** (SHI Labs 에서) Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi 의 [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) 논문과 함께 발표했습니다.
1. **[OPT](https://huggingface.co/docs/transformers/master/model_doc/opt)** (Meta AI 에서) Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al 의 [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) 논문과 함께 발표했습니다.
1. **[OWL-ViT](https://huggingface.co/docs/transformers/model_doc/owlvit)** (Google AI 에서) Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby 의 [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230) 논문과 함께 발표했습니다.
1. **[Pegasus](https://huggingface.co/docs/transformers/model_doc/pegasus)** (Google 에서) Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu 의 [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) 논문과 함께 발표했습니다.

View File

@@ -338,6 +338,7 @@ conda install -c huggingface transformers
1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (来自华为诺亚方舟实验室) 伴随论文 [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) 由 Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu 发布。
1. **[NLLB](https://huggingface.co/docs/transformers/model_doc/nllb)** (来自 Meta) 伴随论文 [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) 由 the NLLB team 发布。
1. **[Nyströmformer](https://huggingface.co/docs/transformers/model_doc/nystromformer)** (来自 the University of Wisconsin - Madison) 伴随论文 [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) 由 Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh 发布。
1. **[OneFormer](https://huggingface.co/docs/transformers/main/model_doc/oneformer)** (来自 SHI Labs) 伴随论文 [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) 由 Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi 发布。
1. **[OPT](https://huggingface.co/docs/transformers/master/model_doc/opt)** (来自 Meta AI) 伴随论文 [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) 由 Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al 发布。
1. **[OWL-ViT](https://huggingface.co/docs/transformers/model_doc/owlvit)** (来自 Google AI) 伴随论文 [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230) 由 Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby 发布。
1. **[Pegasus](https://huggingface.co/docs/transformers/model_doc/pegasus)** (来自 Google) 伴随论文 [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) 由 Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu 发布。

View File

@@ -350,6 +350,7 @@ conda install -c huggingface transformers
1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (from Huawei Noahs Ark Lab) released with the paper [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) by Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu.
1. **[NLLB](https://huggingface.co/docs/transformers/model_doc/nllb)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team.
1. **[Nyströmformer](https://huggingface.co/docs/transformers/model_doc/nystromformer)** (from the University of Wisconsin - Madison) released with the paper [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) by Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh.
1. **[OneFormer](https://huggingface.co/docs/transformers/main/model_doc/oneformer)** (from SHI Labs) released with the paper [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) by Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi.
1. **[OPT](https://huggingface.co/docs/transformers/master/model_doc/opt)** (from Meta AI) released with the paper [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) by Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al.
1. **[OWL-ViT](https://huggingface.co/docs/transformers/model_doc/owlvit)** (from Google AI) released with the paper [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230) by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby.
1. **[Pegasus](https://huggingface.co/docs/transformers/model_doc/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.

View File

@@ -130,6 +130,7 @@ Die Bibliothek enthält derzeit JAX-, PyTorch- und TensorFlow-Implementierungen,
1. **[Nezha](model_doc/nezha)** (from Huawei Noahs Ark Lab) released with the paper [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) by Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu.
1. **[NLLB](model_doc/nllb)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team.
1. **[Nyströmformer](model_doc/nystromformer)** (from the University of Wisconsin - Madison) released with the paper [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) by Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh.
1. **[OneFormer](model_doc/oneformer)** (from SHI Labs) released with the paper [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) by Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi.
1. **[OPT](master/model_doc/opt)** (from Meta AI) released with the paper [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) by Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al.
1. **[OWL-ViT](model_doc/owlvit)** (from Google AI) released with the paper [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230) by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby.
1. **[Pegasus](model_doc/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.

View File

@@ -538,6 +538,8 @@
title: LayoutXLM
- local: model_doc/lxmert
title: LXMERT
- local: model_doc/oneformer
title: OneFormer
- local: model_doc/owlvit
title: OWL-ViT
- local: model_doc/perceiver

View File

@@ -151,6 +151,7 @@ The documentation is organized into five sections:
1. **[Nezha](model_doc/nezha)** (from Huawei Noahs Ark Lab) released with the paper [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) by Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu.
1. **[NLLB](model_doc/nllb)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team.
1. **[Nyströmformer](model_doc/nystromformer)** (from the University of Wisconsin - Madison) released with the paper [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) by Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh.
1. **[OneFormer](model_doc/oneformer)** (from SHI Labs) released with the paper [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) by Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi.
1. **[OPT](master/model_doc/opt)** (from Meta AI) released with the paper [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) by Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al.
1. **[OWL-ViT](model_doc/owlvit)** (from Google AI) released with the paper [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230) by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby.
1. **[Pegasus](model_doc/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
@@ -322,6 +323,7 @@ Flax), PyTorch, and/or TensorFlow.
| NAT | ❌ | ❌ | ✅ | ❌ | ❌ |
| Nezha | ❌ | ❌ | ✅ | ❌ | ❌ |
| Nyströmformer | ❌ | ❌ | ✅ | ❌ | ❌ |
| OneFormer | ❌ | ❌ | ✅ | ❌ | ❌ |
| OpenAI GPT | ✅ | ✅ | ✅ | ✅ | ❌ |
| OpenAI GPT-2 | ✅ | ✅ | ✅ | ✅ | ✅ |
| OPT | ❌ | ❌ | ✅ | ✅ | ✅ |

View File

@@ -0,0 +1,72 @@
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# OneFormer
## Overview
The OneFormer model was proposed in [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) by Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi. OneFormer is a universal image segmentation framework that can be trained on a single panoptic dataset to perform semantic, instance, and panoptic segmentation tasks. OneFormer uses a task token to condition the model on the task in focus, making the architecture task-guided for training, and task-dynamic for inference.
<img width="600" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/oneformer_teaser.png"/>
The abstract from the paper is the following:
*Universal Image Segmentation is not a new concept. Past attempts to unify image segmentation in the last decades include scene parsing, panoptic segmentation, and, more recently, new panoptic architectures. However, such panoptic architectures do not truly unify image segmentation because they need to be trained individually on the semantic, instance, or panoptic segmentation to achieve the best performance. Ideally, a truly universal framework should be trained only once and achieve SOTA performance across all three image segmentation tasks. To that end, we propose OneFormer, a universal image segmentation framework that unifies segmentation with a multi-task train-once design. We first propose a task-conditioned joint training strategy that enables training on ground truths of each domain (semantic, instance, and panoptic segmentation) within a single multi-task training process. Secondly, we introduce a task token to condition our model on the task at hand, making our model task-dynamic to support multi-task training and inference. Thirdly, we propose using a query-text contrastive loss during training to establish better inter-task and inter-class distinctions. Notably, our single OneFormer model outperforms specialized Mask2Former models across all three segmentation tasks on ADE20k, CityScapes, and COCO, despite the latter being trained on each of the three tasks individually with three times the resources. With new ConvNeXt and DiNAT backbones, we observe even more performance improvement. We believe OneFormer is a significant step towards making image segmentation more universal and accessible.*
Tips:
- OneFormer requires two inputs during inference: *image* and *task token*.
- During training, OneFormer only uses panoptic annotations.
- If you want to train the model in a distributed environment across multiple nodes, then one should update the
`get_num_masks` function inside in the `OneFormerLoss` class of `modeling_oneformer.py`. When training on multiple nodes, this should be
set to the average number of target masks across all nodes, as can be seen in the original implementation [here](https://github.com/SHI-Labs/OneFormer/blob/33ebb56ed34f970a30ae103e786c0cb64c653d9a/oneformer/modeling/criterion.py#L287).
- One can use [`OneFormerProcessor`] to prepare input images and task inputs for the model and optional targets for the model. [`OneformerProcessor`] wraps [`OneFormerImageProcessor`] and [`CLIPTokenizer`] into a single instance to both prepare the images and encode the task inputs.
- To get the final segmentation, depending on the task, you can call [`~OneFormerProcessor.post_process_semantic_segmentation`] or [`~OneFormerImageProcessor.post_process_instance_segmentation`] or [`~OneFormerImageProcessor.post_process_panoptic_segmentation`]. All three tasks can be solved using [`OneFormerForUniversalSegmentation`] output, panoptic segmentation accepts an optional `label_ids_to_fuse` argument to fuse instances of the target object/s (e.g. sky) together.
The figure below illustrates the architecture of OneFormer. Taken from the [original paper](https://arxiv.org/abs/2211.06220).
<img width="600" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/oneformer_architecture.png"/>
This model was contributed by [Jitesh Jain](https://huggingface.co/praeclarumjj3). The original code can be found [here](https://github.com/SHI-Labs/OneFormer).
## OneFormer specific outputs
[[autodoc]] models.oneformer.modeling_oneformer.OneFormerModelOutput
[[autodoc]] models.oneformer.modeling_oneformer.OneFormerForUniversalSegmentationOutput
## OneFormerConfig
[[autodoc]] OneFormerConfig
## OneFormerImageProcessor
[[autodoc]] OneFormerImageProcessor
- preprocess
- encode_inputs
- post_process_semantic_segmentation
- post_process_instance_segmentation
- post_process_panoptic_segmentation
## OneFormerProcessor
[[autodoc]] OneFormerProcessor
## OneFormerModel
[[autodoc]] OneFormerModel
- forward
## OneFormerForUniversalSegmentation
[[autodoc]] OneFormerForUniversalSegmentation
- forward

View File

@@ -109,6 +109,7 @@ La biblioteca actualmente contiene implementaciones de JAX, PyTorch y TensorFlow
1. **[MPNet](model_doc/mpnet)** (de Microsoft Research) publicado con el paper [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) por Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu.
1. **[MT5](model_doc/mt5)** (de Google AI) publicado con el paper [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) por Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
1. **[Nyströmformer](model_doc/nystromformer)** (de la Universidad de Wisconsin - Madison) publicado con el paper [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) por Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh.
1. **[OneFormer](model_doc/oneformer)** (de la SHI Labs) publicado con el paper [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) por Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi.
1. **[Pegasus](model_doc/pegasus)** (de Google) publicado con el paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) por Jingqing Zhang, Yao Zhao, Mohammad Saleh y Peter J. Liu.
1. **[Perceiver IO](model_doc/perceiver)** (de Deepmind) publicado con el paper [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) por Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira.
1. **[PhoBERT](model_doc/phobert)** (de VinAI Research) publicado con el paper [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) por Dat Quoc Nguyen y Anh Tuan Nguyen.

View File

@@ -118,6 +118,7 @@ La libreria attualmente contiene implementazioni in JAX, PyTorch e TensorFlow, p
1. **[MPNet](model_doc/mpnet)** (da Microsoft Research) rilasciato con il paper [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) da Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu.
1. **[MT5](model_doc/mt5)** (da Google AI) rilasciato con il paper [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) da Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
1. **[Nyströmformer](model_doc/nystromformer)** (dalla Università del Wisconsin - Madison) rilasciato con il paper [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) da Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh.
1. **[OneFormer](model_doc/oneformer)** (da SHI Labs) rilasciato con il paper [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) da Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi.
1. **[OPT](master/model_doc/opt)** (da Meta AI) rilasciato con il paper [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) da Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al.
1. **[Pegasus](model_doc/pegasus)** (da Google) rilasciato con il paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) da Jingqing Zhang, Yao Zhao, Mohammad Saleh e Peter J. Liu.
1. **[Perceiver IO](model_doc/perceiver)** (da Deepmind) rilasciato con il paper [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) da Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira.

View File

@@ -139,6 +139,7 @@ specific language governing permissions and limitations under the License.
1. **[Nezha](model_doc/nezha)** (from Huawei Noahs Ark Lab) released with the paper [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) by Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu.
1. **[NLLB](model_doc/nllb)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team.
1. **[Nyströmformer](model_doc/nystromformer)** (from the University of Wisconsin - Madison) released with the paper [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) by Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh.
1. **[OneFormer](model_doc/oneformer)** (from SHI Labs) released with the paper [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) by Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi.
1. **[OPT](master/model_doc/opt)** (from Meta AI) released with the paper [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) by Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al.
1. **[OWL-ViT](model_doc/owlvit)** (from Google AI) released with the paper [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230) by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby.
1. **[Pegasus](model_doc/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.

View File

@@ -123,6 +123,7 @@ Atualmente a biblioteca contém implementações do PyTorch, TensorFlow e JAX, p
1. **[MPNet](model_doc/mpnet)** (from Microsoft Research) released with the paper [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu.
1. **[MT5](model_doc/mt5)** (from Google AI) released with the paper [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
1. **[Nyströmformer](model_doc/nystromformer)** (from the University of Wisconsin - Madison) released with the paper [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) by Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh.
1. **[OneFormer](model_doc/oneformer)** (from SHI Labs) released with the paper [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) by Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi.
1. **[Pegasus](model_doc/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
1. **[Perceiver IO](model_doc/perceiver)** (from Deepmind) released with the paper [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira.
1. **[PhoBERT](model_doc/phobert)** (from VinAI Research) released with the paper [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) by Dat Quoc Nguyen and Anh Tuan Nguyen.

View File

@@ -348,6 +348,7 @@ _import_structure = {
"NYSTROMFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP",
"NystromformerConfig",
],
"models.oneformer": ["ONEFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "OneFormerConfig", "OneFormerProcessor"],
"models.openai": ["OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP", "OpenAIGPTConfig", "OpenAIGPTTokenizer"],
"models.opt": ["OPTConfig"],
"models.owlvit": [
@@ -799,6 +800,7 @@ else:
_import_structure["models.mobilenet_v1"].extend(["MobileNetV1FeatureExtractor", "MobileNetV1ImageProcessor"])
_import_structure["models.mobilenet_v2"].extend(["MobileNetV2FeatureExtractor", "MobileNetV2ImageProcessor"])
_import_structure["models.mobilevit"].extend(["MobileViTFeatureExtractor", "MobileViTImageProcessor"])
_import_structure["models.oneformer"].extend(["OneFormerImageProcessor"])
_import_structure["models.owlvit"].extend(["OwlViTFeatureExtractor", "OwlViTImageProcessor"])
_import_structure["models.perceiver"].extend(["PerceiverFeatureExtractor", "PerceiverImageProcessor"])
_import_structure["models.poolformer"].extend(["PoolFormerFeatureExtractor", "PoolFormerImageProcessor"])
@@ -1871,6 +1873,14 @@ else:
"NystromformerPreTrainedModel",
]
)
_import_structure["models.oneformer"].extend(
[
"ONEFORMER_PRETRAINED_MODEL_ARCHIVE_LIST",
"OneFormerForUniversalSegmentation",
"OneFormerModel",
"OneFormerPreTrainedModel",
]
)
_import_structure["models.openai"].extend(
[
"OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_LIST",
@@ -3729,6 +3739,7 @@ if TYPE_CHECKING:
from .models.nat import NAT_PRETRAINED_CONFIG_ARCHIVE_MAP, NatConfig
from .models.nezha import NEZHA_PRETRAINED_CONFIG_ARCHIVE_MAP, NezhaConfig
from .models.nystromformer import NYSTROMFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, NystromformerConfig
from .models.oneformer import ONEFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, OneFormerConfig, OneFormerProcessor
from .models.openai import OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP, OpenAIGPTConfig, OpenAIGPTTokenizer
from .models.opt import OPTConfig
from .models.owlvit import (
@@ -4122,6 +4133,7 @@ if TYPE_CHECKING:
from .models.mobilenet_v1 import MobileNetV1FeatureExtractor, MobileNetV1ImageProcessor
from .models.mobilenet_v2 import MobileNetV2FeatureExtractor, MobileNetV2ImageProcessor
from .models.mobilevit import MobileViTFeatureExtractor, MobileViTImageProcessor
from .models.oneformer import OneFormerImageProcessor
from .models.owlvit import OwlViTFeatureExtractor, OwlViTImageProcessor
from .models.perceiver import PerceiverFeatureExtractor, PerceiverImageProcessor
from .models.poolformer import PoolFormerFeatureExtractor, PoolFormerImageProcessor
@@ -5000,6 +5012,12 @@ if TYPE_CHECKING:
NystromformerModel,
NystromformerPreTrainedModel,
)
from .models.oneformer import (
ONEFORMER_PRETRAINED_MODEL_ARCHIVE_LIST,
OneFormerForUniversalSegmentation,
OneFormerModel,
OneFormerPreTrainedModel,
)
from .models.openai import (
OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_LIST,
OpenAIGPTDoubleHeadsModel,

View File

@@ -122,6 +122,7 @@ from . import (
nezha,
nllb,
nystromformer,
oneformer,
openai,
opt,
owlvit,

View File

@@ -120,6 +120,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
("nat", "NatConfig"),
("nezha", "NezhaConfig"),
("nystromformer", "NystromformerConfig"),
("oneformer", "OneFormerConfig"),
("openai-gpt", "OpenAIGPTConfig"),
("opt", "OPTConfig"),
("owlvit", "OwlViTConfig"),
@@ -276,6 +277,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
("nat", "NAT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("nezha", "NEZHA_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("nystromformer", "NYSTROMFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("oneformer", "ONEFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("openai-gpt", "OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("opt", "OPT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("owlvit", "OWLVIT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
@@ -447,6 +449,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
("nezha", "Nezha"),
("nllb", "NLLB"),
("nystromformer", "Nyströmformer"),
("oneformer", "OneFormer"),
("openai-gpt", "OpenAI GPT"),
("opt", "OPT"),
("owlvit", "OWL-ViT"),

View File

@@ -69,6 +69,7 @@ IMAGE_PROCESSOR_MAPPING_NAMES = OrderedDict(
("mobilevit", "MobileViTImageProcessor"),
("mobilevit", "MobileViTImageProcessor"),
("nat", "ViTImageProcessor"),
("oneformer", "OneFormerImageProcessor"),
("owlvit", "OwlViTImageProcessor"),
("perceiver", "PerceiverImageProcessor"),
("poolformer", "PoolFormerImageProcessor"),

View File

@@ -120,6 +120,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
("nezha", "NezhaModel"),
("nllb", "M2M100Model"),
("nystromformer", "NystromformerModel"),
("oneformer", "OneFormerModel"),
("openai-gpt", "OpenAIGPTModel"),
("opt", "OPTModel"),
("owlvit", "OwlViTModel"),
@@ -457,6 +458,7 @@ MODEL_FOR_UNIVERSAL_SEGMENTATION_MAPPING_NAMES = OrderedDict(
("detr", "DetrForSegmentation"),
("mask2former", "Mask2FormerForUniversalSegmentation"),
("maskformer", "MaskFormerForInstanceSegmentation"),
("oneformer", "OneFormerForUniversalSegmentation"),
]
)

View File

@@ -53,6 +53,7 @@ PROCESSOR_MAPPING_NAMES = OrderedDict(
("layoutlmv3", "LayoutLMv3Processor"),
("layoutxlm", "LayoutXLMProcessor"),
("markuplm", "MarkupLMProcessor"),
("oneformer", "OneFormerProcessor"),
("owlvit", "OwlViTProcessor"),
("sew", "Wav2Vec2Processor"),
("sew-d", "Wav2Vec2Processor"),

View File

@@ -208,6 +208,7 @@ else:
"AlbertTokenizerFast" if is_tokenizers_available() else None,
),
),
("oneformer", ("CLIPTokenizer", "CLIPTokenizerFast" if is_tokenizers_available() else None)),
("openai-gpt", ("OpenAIGPTTokenizer", "OpenAIGPTTokenizerFast" if is_tokenizers_available() else None)),
("opt", ("GPT2Tokenizer", None)),
("owlvit", ("CLIPTokenizer", "CLIPTokenizerFast" if is_tokenizers_available() else None)),

View File

@@ -0,0 +1,77 @@
# flake8: noqa
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.
# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available, is_vision_available
_import_structure = {
"configuration_oneformer": ["ONEFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "OneFormerConfig"],
"processing_oneformer": ["OneFormerProcessor"],
}
try:
if not is_vision_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
_import_structure["image_processing_oneformer"] = ["OneFormerImageProcessor"]
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
_import_structure["modeling_oneformer"] = [
"ONEFORMER_PRETRAINED_MODEL_ARCHIVE_LIST",
"OneFormerForUniversalSegmentation",
"OneFormerModel",
"OneFormerPreTrainedModel",
]
if TYPE_CHECKING:
from .configuration_oneformer import ONEFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, OneFormerConfig
from .processing_oneformer import OneFormerProcessor
try:
if not is_vision_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
from .image_processing_oneformer import OneFormerImageProcessor
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
from .modeling_oneformer import (
ONEFORMER_PRETRAINED_MODEL_ARCHIVE_LIST,
OneFormerForUniversalSegmentation,
OneFormerModel,
OneFormerPreTrainedModel,
)
else:
import sys
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure)

View File

@@ -0,0 +1,266 @@
# coding=utf-8
# Copyright 2022 SHI Labs and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""OneFormer model configuration"""
import copy
from typing import Dict, Optional
from ...configuration_utils import PretrainedConfig
from ...utils import logging
from ..auto import CONFIG_MAPPING
logger = logging.get_logger(__name__)
ONEFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"shi-labs/oneformer_ade20k_swin_tiny": (
"https://huggingface.co/shi-labs/oneformer_ade20k_swin_tiny/blob/main/config.json"
),
# See all OneFormer models at https://huggingface.co/models?filter=oneformer
}
class OneFormerConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`OneFormerModel`]. It is used to instantiate a
OneFormer model according to the specified arguments, defining the model architecture. Instantiating a
configuration with the defaults will yield a similar configuration to that of the OneFormer
[shi-labs/oneformer_ade20k_swin_tiny](https://huggingface.co/shi-labs/oneformer_ade20k_swin_tiny) architecture
trained on [ADE20k-150](https://huggingface.co/datasets/scene_parse_150).
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
backbone_config (`PretrainedConfig`, *optional*, defaults to `SwinConfig`)
The configuration of the backbone model.
ignore_value (`int`, *optional*, defaults to 255)
Values to be ignored in GT label while calculating loss.
num_queries (`int`, *optional*, defaults to 150)
Number of object queries.
no_object_weight (`float`, *optional*, defaults to 0.1)
Weight for no-object class predictions.
class_weight (`float`, *optional*, defaults to 2.0)
Weight for Classification CE loss.
mask_weight (`float`, *optional*, defaults to 5.0)
Weight for binary CE loss.
dice_weight (`float`, *optional*, defaults to 5.0)
Weight for dice loss.
contrastive_weight (`float`, *optional*, defaults to 0.5)
Weight for contrastive loss.
contrastive_temperature (`float`, *optional*, defaults to 0.07)
Initial value for scaling the contrastive logits.
train_num_points (`int`, *optional*, defaults to 12544)
Number of points to sample while calculating losses on mask predictions.
oversample_ratio (`float`, *optional*, defaults to 3.0)
Ratio to decide how many points to oversample.
importance_sample_ratio (`float`, *optional*, defaults to 0.75)
Ratio of points that are sampled via importance sampling.
init_std (`float`, *optional*, defaults to 0.02)
Standard deviation for normal intialization.
init_xavier_std (`float`, *optional*, defaults to 0.02)
Standard deviation for xavier uniform initialization.
layer_norm_eps (`float`, *optional*, defaults to 1e-05)
Epsilon for layer normalization.
is_training (`bool`, *optional*, defaults to False)
Whether to run in training or inference mode.
use_auxiliary_loss (`bool`, *optional*, defaults to True)
Whether to calculate loss using intermediate predictions from transformer decoder.
output_auxiliary_logits (`bool`, *optional*, defaults to True)
Whether to return intermediate predictions from transformer decoder.
strides (`list`, *optional*, defaults to [4, 8, 16, 32])
List containing the strides for feature maps in the encoder.
task_seq_len (`int`, *optional*, defaults to 77)
Sequence length for tokenizing text list input.
max_seq_len (`int`, *optional*, defaults to 77)
Sequence length for tokenizing task input.
text_encoder_width (`int`, *optional*, defaults to 256)
Hidden size for text encoder.
text_encoder_context_length (`int`, *optional*, defaults to 77):
Input sequence length for text encoder.
text_encoder_num_layers (`int`, *optional*, defaults to 6)
Number of layers for transformer in text encoder.
text_encoder_vocab_size (`int`, *optional*, defaults to 49408)
Vocabulary size for tokenizer.
text_encoder_proj_layers (`int`, *optional*, defaults to 2)
Number of layers in MLP for project text queries.
text_encoder_n_ctx (`int`, *optional*, defaults to 16)
Number of learnable text context queries.
conv_dim (`int`, *optional*, defaults to 256)
Feature map dimension to map outputs from the backbone.
mask_dim (`int`, *optional*, defaults to 256)
Dimension for feature maps in pixel decoder.
hidden_dim (`int`, *optional*, defaults to 256)
Dimension for hidden states in transformer decoder.
encoder_feedforward_dim (`int`, *optional*, defaults to 1024)
Dimension for FFN layer in pixel decoder.
norm (`str`, *optional*, defaults to `GN`)
Type of normalization.
encoder_layers (`int`, *optional*, defaults to 6)
Number of layers in pixel decoder.
decoder_layers (`int`, *optional*, defaults to 10)
Number of layers in transformer decoder.
use_task_norm (`bool`, *optional*, defaults to `True`)
Whether to normalize the task token.
num_attention_heads (`int`, *optional*, defaults to 8)
Number of attention heads in transformer layers in the pixel and transformer decoders.
dropout (`float`, *optional*, defaults to 0.1)
Dropout probability for pixel and transformer decoders.
dim_feedforward (`int`, *optional*, defaults to 2048)
Dimension for FFN layer in transformer decoder.
pre_norm (`bool`, *optional*, defaults to `False`)
Whether to normalize hidden states before attention layers in transformer decoder.
enforce_input_proj (`bool`, *optional*, defaults to `False`)
Whether to project hidden states in transformer decoder.
query_dec_layers (`int`, *optional*, defaults to 2)
Number of layers in query transformer.
common_stride (`int`, *optional*, defaults to 4)
Common stride used for features in pixel decoder.
Examples:
```python
>>> from transformers import OneFormerConfig, OneFormerModel
>>> # Initializing a OneFormer shi-labs/oneformer_ade20k_swin_tiny configuration
>>> configuration = OneFormerConfig()
>>> # Initializing a model (with random weights) from the shi-labs/oneformer_ade20k_swin_tiny style configuration
>>> model = OneFormerModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```
"""
model_type = "oneformer"
attribute_map = {"hidden_size": "hidden_dim"}
def __init__(
self,
backbone_config: Optional[Dict] = None,
ignore_value: int = 255,
num_queries: int = 150,
no_object_weight: int = 0.1,
class_weight: float = 2.0,
mask_weight: float = 5.0,
dice_weight: float = 5.0,
contrastive_weight: float = 0.5,
contrastive_temperature: float = 0.07,
train_num_points: int = 12544,
oversample_ratio: float = 3.0,
importance_sample_ratio: float = 0.75,
init_std: float = 0.02,
init_xavier_std: float = 1.0,
layer_norm_eps: float = 1e-05,
is_training: bool = False,
use_auxiliary_loss: bool = True,
output_auxiliary_logits: bool = True,
strides: Optional[list] = [4, 8, 16, 32],
task_seq_len: int = 77,
max_seq_len: int = 77,
text_encoder_width: int = 256,
text_encoder_context_length: int = 77,
text_encoder_num_layers: int = 6,
text_encoder_vocab_size: int = 49408,
text_encoder_proj_layers: int = 2,
text_encoder_n_ctx: int = 16,
conv_dim: int = 256,
mask_dim: int = 256,
hidden_dim: int = 256,
encoder_feedforward_dim: int = 1024,
norm: str = "GN",
encoder_layers: int = 6,
decoder_layers: int = 10,
use_task_norm: bool = True,
num_attention_heads: int = 8,
dropout: float = 0.1,
dim_feedforward: int = 2048,
pre_norm: bool = False,
enforce_input_proj: bool = False,
query_dec_layers: int = 2,
common_stride: int = 4,
**kwargs,
):
if backbone_config is None:
logger.info("`backbone_config` is unset. Initializing the config with the default `Swin` backbone.")
backbone_config = CONFIG_MAPPING["swin"](
image_size=224,
in_channels=3,
patch_size=4,
embed_dim=96,
depths=[2, 2, 6, 2],
num_heads=[3, 6, 12, 24],
window_size=7,
drop_path_rate=0.3,
use_absolute_embeddings=False,
out_features=["stage1", "stage2", "stage3", "stage4"],
)
elif isinstance(backbone_config, dict):
backbone_model_type = backbone_config.get("model_type")
config_class = CONFIG_MAPPING[backbone_model_type]
backbone_config = config_class.from_dict(backbone_config)
self.backbone_config = backbone_config
self.ignore_value = ignore_value
self.num_queries = num_queries
self.no_object_weight = no_object_weight
self.class_weight = class_weight
self.mask_weight = mask_weight
self.dice_weight = dice_weight
self.contrastive_weight = contrastive_weight
self.contrastive_temperature = contrastive_temperature
self.train_num_points = train_num_points
self.oversample_ratio = oversample_ratio
self.importance_sample_ratio = importance_sample_ratio
self.init_std = init_std
self.init_xavier_std = init_xavier_std
self.layer_norm_eps = layer_norm_eps
self.is_training = is_training
self.use_auxiliary_loss = use_auxiliary_loss
self.output_auxiliary_logits = output_auxiliary_logits
self.strides = strides
self.task_seq_len = task_seq_len
self.max_seq_len = max_seq_len
self.text_encoder_width = text_encoder_width
self.text_encoder_context_length = text_encoder_context_length
self.text_encoder_num_layers = text_encoder_num_layers
self.text_encoder_vocab_size = text_encoder_vocab_size
self.text_encoder_proj_layers = text_encoder_proj_layers
self.text_encoder_n_ctx = text_encoder_n_ctx
self.conv_dim = conv_dim
self.mask_dim = mask_dim
self.hidden_dim = hidden_dim
self.encoder_feedforward_dim = encoder_feedforward_dim
self.norm = norm
self.encoder_layers = encoder_layers
self.decoder_layers = decoder_layers
self.use_task_norm = use_task_norm
self.num_attention_heads = num_attention_heads
self.dropout = dropout
self.dim_feedforward = dim_feedforward
self.pre_norm = pre_norm
self.enforce_input_proj = enforce_input_proj
self.query_dec_layers = query_dec_layers
self.common_stride = common_stride
self.num_hidden_layers = decoder_layers
super().__init__(**kwargs)
def to_dict(self) -> Dict[str, any]:
"""
Serializes this instance to a Python dictionary. Override the default [`~PretrainedConfig.to_dict`]. Returns:
`Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance,
"""
output = copy.deepcopy(self.__dict__)
output["backbone_config"] = self.backbone_config.to_dict()
output["model_type"] = self.__class__.model_type
return output

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,205 @@
# coding=utf-8
# Copyright 2022 SHI Labs and The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Image/Text processor class for OneFormer
"""
from typing import List
from transformers.utils import is_torch_available
from ...processing_utils import ProcessorMixin
if is_torch_available():
import torch
class OneFormerProcessor(ProcessorMixin):
r"""
Constructs an OneFormer processor which wraps [`OneFormerImageProcessor`] and
[`CLIPTokenizer`]/[`CLIPTokenizerFast`] into a single processor that inherits both the image processor and
tokenizer functionalities.
Args:
image_processor ([`OneFormerImageProcessor`]):
The image processor is a required input.
tokenizer ([`CLIPTokenizer`, `CLIPTokenizerFast`]):
The tokenizer is a required input.
max_seq_len (`int`, *optional*, defaults to 77)):
Sequence length for input text list.
task_seq_len (`int`, *optional*, defaults to 77):
Sequence length for input task token.
"""
attributes = ["image_processor", "tokenizer"]
image_processor_class = "OneFormerImageProcessor"
tokenizer_class = ("CLIPTokenizer", "CLIPTokenizerFast")
def __init__(
self, image_processor=None, tokenizer=None, max_seq_length: int = 77, task_seq_length: int = 77, **kwargs
):
if image_processor is None:
raise ValueError("You need to specify an `image_processor`.")
if tokenizer is None:
raise ValueError("You need to specify a `tokenizer`.")
self.max_seq_length = max_seq_length
self.task_seq_length = task_seq_length
super().__init__(image_processor, tokenizer)
def _preprocess_text(self, text_list=None, max_length=77):
if text_list is None:
raise ValueError("tokens cannot be None.")
tokens = self.tokenizer(text_list, padding="max_length", max_length=max_length, truncation=True)
attention_masks, input_ids = tokens["attention_mask"], tokens["input_ids"]
token_inputs = []
for attn_mask, input_id in zip(attention_masks, input_ids):
token = torch.tensor(attn_mask) * torch.tensor(input_id)
token_inputs.append(token.unsqueeze(0))
token_inputs = torch.cat(token_inputs, dim=0)
return token_inputs
def __call__(self, images=None, task_inputs=None, segmentation_maps=None, **kwargs):
"""
Main method to prepare for the model one or several task input(s) and image(s). This method forwards the
`task_inputs` and `kwargs` arguments to CLIPTokenizer's [`~CLIPTokenizer.__call__`] if `task_inputs` is not
`None` to encode. To prepare the image(s), this method forwards the `images` and `kwargs` arguments to
OneFormerImageProcessor's [`~OneFormerImageProcessor.__call__`] if `images` is not `None`. Please refer to the
doctsring of the above two methods for more information.
Args:
task_inputs (`str`, `List[str]`):
The sequence or batch of task_inputs sequences to be encoded. Each sequence can be a string or a list
of strings of the template "the task is {task}".
images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`,
`List[torch.Tensor]`):
The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is a
number of channels, H and W are image height and width.
segmentation_maps (`ImageInput`, *optional*):
The corresponding semantic segmentation maps with the pixel-wise annotations.
(`bool`, *optional*, defaults to `True`):
Whether or not to pad images up to the largest image in a batch and create a pixel mask.
If left to the default, will return a pixel mask that is:
- 1 for pixels that are real (i.e. **not masked**),
- 0 for pixels that are padding (i.e. **masked**).
Returns:
[`BatchFeature`]: A [`BatchFeature`] with the following fields:
- **task_inputs** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
- **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
"""
if task_inputs is None:
raise ValueError("You have to specify the task_input. Found None.")
elif images is None:
raise ValueError("You have to specify the image. Found None.")
if not all(task in ["semantic", "instance", "panoptic"] for task in task_inputs):
raise ValueError("task_inputs must be semantic, instance, or panoptic.")
encoded_inputs = self.image_processor(images, task_inputs, segmentation_maps, **kwargs)
if isinstance(task_inputs, str):
task_inputs = [task_inputs]
if isinstance(task_inputs, List) and all(isinstance(task_input, str) for task_input in task_inputs):
task_token_inputs = []
for task in task_inputs:
task_input = f"the task is {task}"
task_token_inputs.append(task_input)
encoded_inputs["task_inputs"] = self._preprocess_text(task_token_inputs, max_length=self.task_seq_length)
else:
raise TypeError("Task Inputs should be a string or a list of strings.")
if hasattr(encoded_inputs, "text_inputs"):
texts_list = encoded_inputs.text_inputs
text_inputs = []
for texts in texts_list:
text_input_list = self._preprocess_text(texts, max_length=self.max_seq_length)
text_inputs.append(text_input_list.unsqueeze(0))
encoded_inputs["text_inputs"] = torch.cat(text_inputs, dim=0)
return encoded_inputs
def encode_inputs(self, images=None, task_inputs=None, segmentation_maps=None, **kwargs):
"""
This method forwards all its arguments to [`OneFormerImageProcessor.encode_inputs`] and then tokenizes the
task_inputs. Please refer to the docstring of this method for more information.
"""
if task_inputs is None:
raise ValueError("You have to specify the task_input. Found None.")
elif images is None:
raise ValueError("You have to specify the image. Found None.")
if not all(task in ["semantic", "instance", "panoptic"] for task in task_inputs):
raise ValueError("task_inputs must be semantic, instance, or panoptic.")
encoded_inputs = self.image_processor.encode_inputs(images, task_inputs, segmentation_maps, **kwargs)
if isinstance(task_inputs, str):
task_inputs = [task_inputs]
if isinstance(task_inputs, List) and all(isinstance(task_input, str) for task_input in task_inputs):
task_token_inputs = []
for task in task_inputs:
task_input = f"the task is {task}"
task_token_inputs.append(task_input)
encoded_inputs["task_inputs"] = self._preprocess_text(task_token_inputs, max_length=self.task_seq_length)
else:
raise TypeError("Task Inputs should be a string or a list of strings.")
if hasattr(encoded_inputs, "text_inputs"):
texts_list = encoded_inputs.text_inputs
text_inputs = []
for texts in texts_list:
text_input_list = self._preprocess_text(texts, max_length=self.max_seq_length)
text_inputs.append(text_input_list.unsqueeze(0))
encoded_inputs["text_inputs"] = torch.cat(text_inputs, dim=0)
return encoded_inputs
def post_process_semantic_segmentation(self, *args, **kwargs):
"""
This method forwards all its arguments to [`OneFormerImageProcessor.post_process_semantic_segmentation`].
Please refer to the docstring of this method for more information.
"""
return self.image_processor.post_process_semantic_segmentation(*args, **kwargs)
def post_process_instance_segmentation(self, *args, **kwargs):
"""
This method forwards all its arguments to [`OneFormerImageProcessor.post_process_instance_segmentation`].
Please refer to the docstring of this method for more information.
"""
return self.image_processor.post_process_instance_segmentation(*args, **kwargs)
def post_process_panoptic_segmentation(self, *args, **kwargs):
"""
This method forwards all its arguments to [`OneFormerImageProcessor.post_process_panoptic_segmentation`].
Please refer to the docstring of this method for more information.
"""
return self.image_processor.post_process_panoptic_segmentation(*args, **kwargs)

View File

@@ -4242,6 +4242,30 @@ class NystromformerPreTrainedModel(metaclass=DummyObject):
requires_backends(self, ["torch"])
ONEFORMER_PRETRAINED_MODEL_ARCHIVE_LIST = None
class OneFormerForUniversalSegmentation(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class OneFormerModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class OneFormerPreTrainedModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_LIST = None

View File

@@ -318,6 +318,13 @@ class MobileViTImageProcessor(metaclass=DummyObject):
requires_backends(self, ["vision"])
class OneFormerImageProcessor(metaclass=DummyObject):
_backends = ["vision"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["vision"])
class OwlViTFeatureExtractor(metaclass=DummyObject):
_backends = ["vision"]

View File

View File

@@ -0,0 +1,456 @@
# coding=utf-8
# Copyright 2022 HuggingFace Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import unittest
import numpy as np
from huggingface_hub import hf_hub_download
from transformers.testing_utils import require_torch, require_vision
from transformers.utils import is_torch_available, is_vision_available
from ...test_feature_extraction_common import FeatureExtractionSavingTestMixin, prepare_image_inputs
if is_torch_available():
import torch
if is_vision_available():
from transformers import OneFormerImageProcessor
from transformers.models.oneformer.image_processing_oneformer import binary_mask_to_rle
from transformers.models.oneformer.modeling_oneformer import OneFormerForUniversalSegmentationOutput
if is_vision_available():
from PIL import Image
def prepare_metadata(class_info_file, repo_path="shi-labs/oneformer_demo"):
with open(hf_hub_download(repo_path, class_info_file, repo_type="dataset"), "r") as f:
class_info = json.load(f)
metadata = {}
class_names = []
thing_ids = []
for key, info in class_info.items():
metadata[key] = info["name"]
class_names.append(info["name"])
if info["isthing"]:
thing_ids.append(int(key))
metadata["thing_ids"] = thing_ids
metadata["class_names"] = class_names
return metadata
class OneFormerImageProcessorTester(unittest.TestCase):
def __init__(
self,
parent,
batch_size=7,
num_channels=3,
min_resolution=30,
max_resolution=400,
size=None,
do_resize=True,
do_normalize=True,
image_mean=[0.5, 0.5, 0.5],
image_std=[0.5, 0.5, 0.5],
num_labels=10,
reduce_labels=False,
ignore_index=255,
repo_path="shi-labs/oneformer_demo",
class_info_file="ade20k_panoptic.json",
num_text=10,
):
self.parent = parent
self.batch_size = batch_size
self.num_channels = num_channels
self.min_resolution = min_resolution
self.max_resolution = max_resolution
self.do_resize = do_resize
self.size = {"shortest_edge": 32, "longest_edge": 1333} if size is None else size
self.do_normalize = do_normalize
self.image_mean = image_mean
self.image_std = image_std
self.class_info_file = class_info_file
self.metadata = prepare_metadata(class_info_file, repo_path)
self.num_text = num_text
self.repo_path = repo_path
# for the post_process_functions
self.batch_size = 2
self.num_queries = 10
self.num_classes = 10
self.height = 3
self.width = 4
self.num_labels = num_labels
self.reduce_labels = reduce_labels
self.ignore_index = ignore_index
def prepare_feat_extract_dict(self):
return {
"do_resize": self.do_resize,
"size": self.size,
"do_normalize": self.do_normalize,
"image_mean": self.image_mean,
"image_std": self.image_std,
"num_labels": self.num_labels,
"reduce_labels": self.reduce_labels,
"ignore_index": self.ignore_index,
"class_info_file": self.class_info_file,
"metadata": self.metadata,
"num_text": self.num_text,
}
def get_expected_values(self, image_inputs, batched=False):
"""
This function computes the expected height and width when providing images to OneFormerImageProcessor,
assuming do_resize is set to True with a scalar size.
"""
if not batched:
image = image_inputs[0]
if isinstance(image, Image.Image):
w, h = image.size
else:
h, w = image.shape[1], image.shape[2]
if w < h:
expected_height = int(self.size["shortest_edge"] * h / w)
expected_width = self.size["shortest_edge"]
elif w > h:
expected_height = self.size["shortest_edge"]
expected_width = int(self.size["shortest_edge"] * w / h)
else:
expected_height = self.size["shortest_edge"]
expected_width = self.size["shortest_edge"]
else:
expected_values = []
for image in image_inputs:
expected_height, expected_width = self.get_expected_values([image])
expected_values.append((expected_height, expected_width))
expected_height = max(expected_values, key=lambda item: item[0])[0]
expected_width = max(expected_values, key=lambda item: item[1])[1]
return expected_height, expected_width
def get_fake_oneformer_outputs(self):
return OneFormerForUniversalSegmentationOutput(
# +1 for null class
class_queries_logits=torch.randn((self.batch_size, self.num_queries, self.num_classes + 1)),
masks_queries_logits=torch.randn((self.batch_size, self.num_queries, self.height, self.width)),
)
@require_torch
@require_vision
class OneFormerImageProcessingTest(FeatureExtractionSavingTestMixin, unittest.TestCase):
image_processing_class = OneFormerImageProcessor if (is_vision_available() and is_torch_available()) else None
# only for test_feat_extracttion_common.test_feat_extract_to_json_string
feature_extraction_class = image_processing_class
def setUp(self):
self.image_processing_tester = OneFormerImageProcessorTester(self)
@property
def feat_extract_dict(self):
return self.image_processing_tester.prepare_feat_extract_dict()
def test_feat_extract_properties(self):
image_processor = self.image_processing_class(**self.feat_extract_dict)
self.assertTrue(hasattr(image_processor, "image_mean"))
self.assertTrue(hasattr(image_processor, "image_std"))
self.assertTrue(hasattr(image_processor, "do_normalize"))
self.assertTrue(hasattr(image_processor, "do_resize"))
self.assertTrue(hasattr(image_processor, "size"))
self.assertTrue(hasattr(image_processor, "ignore_index"))
self.assertTrue(hasattr(image_processor, "class_info_file"))
self.assertTrue(hasattr(image_processor, "num_text"))
self.assertTrue(hasattr(image_processor, "repo_path"))
self.assertTrue(hasattr(image_processor, "metadata"))
self.assertTrue(hasattr(image_processor, "reduce_labels"))
def test_batch_feature(self):
pass
def test_call_pil(self):
# Initialize image_processor
image_processor = self.image_processing_class(**self.feat_extract_dict)
# create random PIL images
image_inputs = prepare_image_inputs(self.image_processing_tester, equal_resolution=False)
for image in image_inputs:
self.assertIsInstance(image, Image.Image)
# Test not batched input
encoded_images = image_processor(image_inputs[0], ["semantic"], return_tensors="pt").pixel_values
expected_height, expected_width = self.image_processing_tester.get_expected_values(image_inputs)
self.assertEqual(
encoded_images.shape,
(1, self.image_processing_tester.num_channels, expected_height, expected_width),
)
# Test batched
expected_height, expected_width = self.image_processing_tester.get_expected_values(image_inputs, batched=True)
encoded_images = image_processor(
image_inputs, ["semantic"] * len(image_inputs), return_tensors="pt"
).pixel_values
self.assertEqual(
encoded_images.shape,
(
self.image_processing_tester.batch_size,
self.image_processing_tester.num_channels,
expected_height,
expected_width,
),
)
def test_call_numpy(self):
# Initialize image_processor
image_processor = self.image_processing_class(**self.feat_extract_dict)
# create random numpy tensors
image_inputs = prepare_image_inputs(self.image_processing_tester, equal_resolution=False, numpify=True)
for image in image_inputs:
self.assertIsInstance(image, np.ndarray)
# Test not batched input
encoded_images = image_processor(image_inputs[0], ["semantic"], return_tensors="pt").pixel_values
expected_height, expected_width = self.image_processing_tester.get_expected_values(image_inputs)
self.assertEqual(
encoded_images.shape,
(1, self.image_processing_tester.num_channels, expected_height, expected_width),
)
# Test batched
expected_height, expected_width = self.image_processing_tester.get_expected_values(image_inputs, batched=True)
encoded_images = image_processor(
image_inputs, ["semantic"] * len(image_inputs), return_tensors="pt"
).pixel_values
self.assertEqual(
encoded_images.shape,
(
self.image_processing_tester.batch_size,
self.image_processing_tester.num_channels,
expected_height,
expected_width,
),
)
def test_call_pytorch(self):
# Initialize image_processor
image_processor = self.image_processing_class(**self.feat_extract_dict)
# create random PyTorch tensors
image_inputs = prepare_image_inputs(self.image_processing_tester, equal_resolution=False, torchify=True)
for image in image_inputs:
self.assertIsInstance(image, torch.Tensor)
# Test not batched input
encoded_images = image_processor(image_inputs[0], ["semantic"], return_tensors="pt").pixel_values
expected_height, expected_width = self.image_processing_tester.get_expected_values(image_inputs)
self.assertEqual(
encoded_images.shape,
(1, self.image_processing_tester.num_channels, expected_height, expected_width),
)
# Test batched
expected_height, expected_width = self.image_processing_tester.get_expected_values(image_inputs, batched=True)
encoded_images = image_processor(
image_inputs, ["semantic"] * len(image_inputs), return_tensors="pt"
).pixel_values
self.assertEqual(
encoded_images.shape,
(
self.image_processing_tester.batch_size,
self.image_processing_tester.num_channels,
expected_height,
expected_width,
),
)
def test_equivalence_pad_and_create_pixel_mask(self):
# Initialize image_processors
image_processor_1 = self.image_processing_class(**self.feat_extract_dict)
image_processor_2 = self.image_processing_class(
do_resize=False,
do_normalize=False,
do_rescale=False,
num_labels=self.image_processing_tester.num_classes,
class_info_file="ade20k_panoptic.json",
num_text=self.image_processing_tester.num_text,
repo_path="shi-labs/oneformer_demo",
)
# create random PyTorch tensors
image_inputs = prepare_image_inputs(self.image_processing_tester, equal_resolution=False, torchify=True)
for image in image_inputs:
self.assertIsInstance(image, torch.Tensor)
# Test whether the method "pad_and_return_pixel_mask" and calling the image processor return the same tensors
encoded_images_with_method = image_processor_1.encode_inputs(
image_inputs, ["semantic"] * len(image_inputs), return_tensors="pt"
)
encoded_images = image_processor_2(image_inputs, ["semantic"] * len(image_inputs), return_tensors="pt")
self.assertTrue(
torch.allclose(encoded_images_with_method["pixel_values"], encoded_images["pixel_values"], atol=1e-4)
)
self.assertTrue(
torch.allclose(encoded_images_with_method["pixel_mask"], encoded_images["pixel_mask"], atol=1e-4)
)
def comm_get_image_processor_inputs(
self, with_segmentation_maps=False, is_instance_map=False, segmentation_type="np"
):
image_processor = self.image_processing_class(**self.feat_extract_dict)
# prepare image and target
num_labels = self.image_processing_tester.num_labels
annotations = None
instance_id_to_semantic_id = None
image_inputs = prepare_image_inputs(self.image_processing_tester, equal_resolution=False)
if with_segmentation_maps:
high = num_labels
if is_instance_map:
labels_expanded = list(range(num_labels)) * 2
instance_id_to_semantic_id = {
instance_id: label_id for instance_id, label_id in enumerate(labels_expanded)
}
annotations = [
np.random.randint(0, high * 2, (img.size[1], img.size[0])).astype(np.uint8) for img in image_inputs
]
if segmentation_type == "pil":
annotations = [Image.fromarray(annotation) for annotation in annotations]
inputs = image_processor(
image_inputs,
["semantic"] * len(image_inputs),
annotations,
return_tensors="pt",
instance_id_to_semantic_id=instance_id_to_semantic_id,
pad_and_return_pixel_mask=True,
)
return inputs
def test_init_without_params(self):
pass
def test_call_with_segmentation_maps(self):
def common(is_instance_map=False, segmentation_type=None):
inputs = self.comm_get_image_processor_inputs(
with_segmentation_maps=True, is_instance_map=is_instance_map, segmentation_type=segmentation_type
)
mask_labels = inputs["mask_labels"]
class_labels = inputs["class_labels"]
pixel_values = inputs["pixel_values"]
text_inputs = inputs["text_inputs"]
# check the batch_size
for mask_label, class_label, text_input in zip(mask_labels, class_labels, text_inputs):
self.assertEqual(mask_label.shape[0], class_label.shape[0])
# this ensure padding has happened
self.assertEqual(mask_label.shape[1:], pixel_values.shape[2:])
self.assertEqual(len(text_input), self.image_processing_tester.num_text)
common()
common(is_instance_map=True)
common(is_instance_map=False, segmentation_type="pil")
common(is_instance_map=True, segmentation_type="pil")
def test_binary_mask_to_rle(self):
fake_binary_mask = np.zeros((20, 50))
fake_binary_mask[0, 20:] = 1
fake_binary_mask[1, :15] = 1
fake_binary_mask[5, :10] = 1
rle = binary_mask_to_rle(fake_binary_mask)
self.assertEqual(len(rle), 4)
self.assertEqual(rle[0], 21)
self.assertEqual(rle[1], 45)
def test_post_process_semantic_segmentation(self):
fature_extractor = self.image_processing_class(
num_labels=self.image_processing_tester.num_classes,
max_seq_length=77,
task_seq_length=77,
class_info_file="ade20k_panoptic.json",
num_text=self.image_processing_tester.num_text,
repo_path="shi-labs/oneformer_demo",
)
outputs = self.image_processing_tester.get_fake_oneformer_outputs()
segmentation = fature_extractor.post_process_semantic_segmentation(outputs)
self.assertEqual(len(segmentation), self.image_processing_tester.batch_size)
self.assertEqual(
segmentation[0].shape,
(
self.image_processing_tester.height,
self.image_processing_tester.width,
),
)
target_sizes = [(1, 4) for i in range(self.image_processing_tester.batch_size)]
segmentation = fature_extractor.post_process_semantic_segmentation(outputs, target_sizes=target_sizes)
self.assertEqual(segmentation[0].shape, target_sizes[0])
def test_post_process_instance_segmentation(self):
image_processor = self.image_processing_class(
num_labels=self.image_processing_tester.num_classes,
max_seq_length=77,
task_seq_length=77,
class_info_file="ade20k_panoptic.json",
num_text=self.image_processing_tester.num_text,
repo_path="shi-labs/oneformer_demo",
)
outputs = self.image_processing_tester.get_fake_oneformer_outputs()
segmentation = image_processor.post_process_instance_segmentation(outputs, threshold=0)
self.assertTrue(len(segmentation) == self.image_processing_tester.batch_size)
for el in segmentation:
self.assertTrue("segmentation" in el)
self.assertTrue("segments_info" in el)
self.assertEqual(type(el["segments_info"]), list)
self.assertEqual(
el["segmentation"].shape, (self.image_processing_tester.height, self.image_processing_tester.width)
)
def test_post_process_panoptic_segmentation(self):
image_processor = self.image_processing_class(
num_labels=self.image_processing_tester.num_classes,
max_seq_length=77,
task_seq_length=77,
class_info_file="ade20k_panoptic.json",
num_text=self.image_processing_tester.num_text,
repo_path="shi-labs/oneformer_demo",
)
outputs = self.image_processing_tester.get_fake_oneformer_outputs()
segmentation = image_processor.post_process_panoptic_segmentation(outputs, threshold=0)
self.assertTrue(len(segmentation) == self.image_processing_tester.batch_size)
for el in segmentation:
self.assertTrue("segmentation" in el)
self.assertTrue("segments_info" in el)
self.assertEqual(type(el["segments_info"]), list)
self.assertEqual(
el["segmentation"].shape, (self.image_processing_tester.height, self.image_processing_tester.width)
)

View File

@@ -0,0 +1,536 @@
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Testing suite for the PyTorch OneFormer model. """
import copy
import inspect
import unittest
import numpy as np
from tests.test_modeling_common import floats_tensor
from transformers import OneFormerConfig, is_torch_available, is_vision_available
from transformers.testing_utils import require_torch, require_torch_multi_gpu, require_vision, slow, torch_device
from transformers.utils import cached_property
from ...test_configuration_common import ConfigTester
from ...test_modeling_common import ModelTesterMixin
if is_torch_available():
import torch
from transformers import OneFormerForUniversalSegmentation, OneFormerModel
if is_vision_available():
from transformers import OneFormerProcessor
if is_vision_available():
from PIL import Image
def _config_zero_init(config):
configs_no_init = copy.deepcopy(config)
for key in configs_no_init.__dict__.keys():
if "_range" in key or "_std" in key or "initializer_factor" in key or "layer_scale" in key:
setattr(configs_no_init, key, 1e-10)
return configs_no_init
class OneFormerModelTester:
def __init__(
self,
parent,
batch_size=2,
is_training=True,
use_auxiliary_loss=False,
num_queries=10,
num_channels=3,
min_size=32 * 8,
max_size=32 * 8,
num_labels=4,
hidden_dim=64,
sequence_length=77,
n_ctx=4,
):
self.parent = parent
self.batch_size = batch_size
self.is_training = is_training
self.use_auxiliary_loss = use_auxiliary_loss
self.num_queries = num_queries
self.num_channels = num_channels
self.min_size = min_size
self.max_size = max_size
self.num_labels = num_labels
self.hidden_dim = hidden_dim
self.sequence_length = sequence_length
self.n_ctx = n_ctx
def prepare_config_and_inputs(self):
pixel_values = floats_tensor([self.batch_size, self.num_channels, self.min_size, self.max_size]).to(
torch_device
)
task_inputs = torch.randint(high=49408, size=(self.batch_size, self.sequence_length)).to(torch_device).long()
pixel_mask = torch.ones([self.batch_size, self.min_size, self.max_size], device=torch_device)
text_inputs = (
torch.randint(high=49408, size=(self.batch_size, self.num_queries - self.n_ctx, self.sequence_length))
.to(torch_device)
.long()
)
mask_labels = (
torch.rand([self.batch_size, self.num_labels, self.min_size, self.max_size], device=torch_device) > 0.5
).float()
class_labels = (torch.rand((self.batch_size, self.num_labels), device=torch_device) > 0.5).long()
config = self.get_config()
return config, pixel_values, task_inputs, text_inputs, pixel_mask, mask_labels, class_labels
def get_config(self):
config = OneFormerConfig(
hidden_size=self.hidden_dim,
)
config.num_queries = self.num_queries
config.num_labels = self.num_labels
config.backbone_config.depths = [1, 1, 1, 1]
config.backbone_config.num_channels = self.num_channels
config.encoder_feedforward_dim = 64
config.dim_feedforward = 128
config.hidden_dim = self.hidden_dim
config.mask_dim = self.hidden_dim
config.conv_dim = self.hidden_dim
config.text_encoder_width = self.hidden_dim
config.task_seq_len = self.sequence_length
config.max_seq_len = self.sequence_length
config.text_encoder_context_length = self.sequence_length
config.text_encoder_n_ctx = self.n_ctx
return config
def prepare_config_and_inputs_for_common(self):
config, pixel_values, task_inputs, pixel_mask, _, _, _ = self.prepare_config_and_inputs()
inputs_dict = {"pixel_values": pixel_values, "pixel_mask": pixel_mask, "task_inputs": task_inputs}
return config, inputs_dict
def check_output_hidden_state(self, output, config):
encoder_hidden_states = output.encoder_hidden_states
pixel_decoder_hidden_states = output.pixel_decoder_hidden_states
transformer_decoder_hidden_states = output.transformer_decoder_hidden_states
self.parent.assertTrue(len(encoder_hidden_states), len(config.backbone_config.depths))
self.parent.assertTrue(len(pixel_decoder_hidden_states), config.encoder_layers)
self.parent.assertTrue(len(transformer_decoder_hidden_states), config.decoder_layers - 1)
def create_and_check_oneformer_model(
self, config, pixel_values, task_inputs, pixel_mask, output_hidden_states=False
):
with torch.no_grad():
model = OneFormerModel(config=config)
model.to(torch_device)
model.eval()
output = model(pixel_values=pixel_values, task_inputs=task_inputs, pixel_mask=pixel_mask)
output = model(pixel_values, task_inputs=task_inputs, output_hidden_states=True)
# the correct shape of output.transformer_decoder_hidden_states ensure the correcteness of the
# encoder and pixel decoder
self.parent.assertEqual(
output.transformer_decoder_object_queries.shape,
(self.batch_size, self.num_queries, self.hidden_dim),
)
# let's ensure the other two hidden state exists
self.parent.assertTrue(output.pixel_decoder_hidden_states is not None)
self.parent.assertTrue(output.encoder_hidden_states is not None)
if output_hidden_states:
self.check_output_hidden_state(output, config)
def create_and_check_oneformer_universal_segmentation_head_model(
self, config, pixel_values, task_inputs, text_inputs, pixel_mask, mask_labels, class_labels
):
model = OneFormerForUniversalSegmentation(config=config)
model.to(torch_device)
model.eval()
def comm_check_on_output(result):
# let's still check that all the required stuff is there
self.parent.assertTrue(result.transformer_decoder_hidden_states is not None)
self.parent.assertTrue(result.pixel_decoder_hidden_states is not None)
self.parent.assertTrue(result.encoder_hidden_states is not None)
# okay, now we need to check the logits shape
# due to the encoder compression, masks have a //4 spatial size
self.parent.assertEqual(
result.masks_queries_logits.shape,
(self.batch_size, self.num_queries, self.min_size // 4, self.max_size // 4),
)
# + 1 for null class
self.parent.assertEqual(
result.class_queries_logits.shape, (self.batch_size, self.num_queries, self.num_labels + 1)
)
with torch.no_grad():
result = model(pixel_values=pixel_values, task_inputs=task_inputs, pixel_mask=pixel_mask)
result = model(pixel_values, task_inputs)
comm_check_on_output(result)
config.is_training = True
model = OneFormerForUniversalSegmentation(config=config)
model.to(torch_device)
model.eval()
with torch.no_grad():
result = model(
pixel_values=pixel_values,
task_inputs=task_inputs,
pixel_mask=pixel_mask,
mask_labels=mask_labels,
class_labels=class_labels,
text_inputs=text_inputs,
)
comm_check_on_output(result)
self.parent.assertTrue(result.loss is not None)
self.parent.assertEqual(result.loss.shape, torch.Size([1]))
@require_torch
class OneFormerModelTest(ModelTesterMixin, unittest.TestCase):
all_model_classes = (OneFormerModel, OneFormerForUniversalSegmentation) if is_torch_available() else ()
is_encoder_decoder = False
test_pruning = False
test_head_masking = False
test_missing_keys = False
def setUp(self):
self.model_tester = OneFormerModelTester(self)
self.config_tester = ConfigTester(self, config_class=OneFormerConfig, has_text_modality=False)
def test_config(self):
self.config_tester.run_common_tests()
def test_oneformer_model(self):
config, inputs = self.model_tester.prepare_config_and_inputs_for_common()
self.model_tester.create_and_check_oneformer_model(config, **inputs, output_hidden_states=False)
def test_oneformer_universal_segmentation_head_model(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_oneformer_universal_segmentation_head_model(*config_and_inputs)
def test_model_main_input_name(self):
for model_class in self.all_model_classes:
model_signature = inspect.signature(getattr(model_class, "forward"))
# The main input is the name of the argument after `self`
observed_main_input_name = list(model_signature.parameters.keys())[1:3]
self.assertEqual(model_class.main_input_name, observed_main_input_name)
@unittest.skip(reason="OneFormer uses two main inputs")
def test_torchscript_simple(self):
pass
@unittest.skip(reason="OneFormer uses two main inputs")
def test_torchscript_output_attentions(self):
pass
@unittest.skip(reason="OneFormer uses two main inputs")
def test_torchscript_output_hidden_state(self):
pass
@unittest.skip(reason="OneFormer does not use inputs_embeds")
def test_inputs_embeds(self):
pass
@unittest.skip(reason="OneFormer does not have a get_input_embeddings method")
def test_model_common_attributes(self):
pass
@unittest.skip(reason="OneFormer is not a generative model")
def test_generate_without_input_ids(self):
pass
@unittest.skip(reason="OneFormer does not use token embeddings")
def test_resize_tokens_embeddings(self):
pass
@require_torch_multi_gpu
@unittest.skip(
reason="OneFormer has some layers using `add_module` which doesn't work well with `nn.DataParallel`"
)
def test_multi_gpu_data_parallel_forward(self):
pass
def test_forward_signature(self):
config, _ = self.model_tester.prepare_config_and_inputs_for_common()
for model_class in self.all_model_classes:
model = model_class(config)
signature = inspect.signature(model.forward)
# signature.parameters is an OrderedDict => so arg_names order is deterministic
arg_names = [*signature.parameters.keys()]
expected_arg_names = ["pixel_values", "task_inputs"]
self.assertListEqual(arg_names[:2], expected_arg_names)
@slow
def test_model_from_pretrained(self):
for model_name in ["shi-labs/oneformer_ade20k_swin_tiny"]:
model = OneFormerModel.from_pretrained(model_name)
self.assertIsNotNone(model)
def test_model_with_labels(self):
size = (self.model_tester.min_size,) * 2
inputs = {
"pixel_values": torch.randn((2, 3, *size), device=torch_device),
"task_inputs": torch.randint(high=49408, size=(2, 77), device=torch_device).long(),
"text_inputs": torch.randint(high=49408, size=(2, 134, 77), device=torch_device).long(),
"mask_labels": torch.randn((2, 150, *size), device=torch_device),
"class_labels": torch.zeros(2, 150, device=torch_device).long(),
}
config = OneFormerConfig()
config.is_training = True
model = OneFormerForUniversalSegmentation(config).to(torch_device)
outputs = model(**inputs)
self.assertTrue(outputs.loss is not None)
def test_hidden_states_output(self):
config, inputs = self.model_tester.prepare_config_and_inputs_for_common()
self.model_tester.create_and_check_oneformer_model(config, **inputs, output_hidden_states=True)
def test_attention_outputs(self):
config, inputs = self.model_tester.prepare_config_and_inputs_for_common()
for model_class in self.all_model_classes:
model = model_class(config).to(torch_device)
outputs = model(**inputs, output_attentions=True)
self.assertTrue(outputs.attentions is not None)
def test_initialization(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
config.contrastive_temperature = 1
configs_no_init = _config_zero_init(config)
for model_class in self.all_model_classes:
model = model_class(config=configs_no_init)
for name, param in model.named_parameters():
if param.requires_grad:
self.assertIn(
((param.data.mean() * 1e9).round() / 1e9).item(),
[0.0, 1.0],
msg=f"Parameter {name} of model {model_class} seems not properly initialized",
)
def test_training(self):
if not self.model_tester.is_training:
return
# only OneFormerForUniversalSegmentation has the loss
model_class = self.all_model_classes[1]
(
config,
pixel_values,
task_inputs,
text_inputs,
pixel_mask,
mask_labels,
class_labels,
) = self.model_tester.prepare_config_and_inputs()
config.is_training = True
model = model_class(config)
model.to(torch_device)
model.train()
loss = model(
pixel_values, task_inputs, text_inputs=text_inputs, mask_labels=mask_labels, class_labels=class_labels
).loss
loss.backward()
def test_retain_grad_hidden_states_attentions(self):
# only OneFormerForUniversalSegmentation has the loss
model_class = self.all_model_classes[1]
(
config,
pixel_values,
task_inputs,
text_inputs,
pixel_mask,
mask_labels,
class_labels,
) = self.model_tester.prepare_config_and_inputs()
config.output_hidden_states = True
config.output_attentions = True
config.is_training = True
model = model_class(config)
model.to(torch_device)
model.train()
outputs = model(
pixel_values, task_inputs, text_inputs=text_inputs, mask_labels=mask_labels, class_labels=class_labels
)
encoder_hidden_states = outputs.encoder_hidden_states[0]
encoder_hidden_states.retain_grad()
pixel_decoder_hidden_states = outputs.pixel_decoder_hidden_states[0]
pixel_decoder_hidden_states.retain_grad()
transformer_decoder_class_predictions = outputs.transformer_decoder_class_predictions
transformer_decoder_class_predictions.retain_grad()
transformer_decoder_mask_predictions = outputs.transformer_decoder_mask_predictions
transformer_decoder_mask_predictions.retain_grad()
attentions = outputs.attentions[0][0]
attentions.retain_grad()
outputs.loss.backward(retain_graph=True)
self.assertIsNotNone(encoder_hidden_states.grad)
self.assertIsNotNone(pixel_decoder_hidden_states.grad)
self.assertIsNotNone(transformer_decoder_class_predictions.grad)
self.assertIsNotNone(transformer_decoder_mask_predictions.grad)
self.assertIsNotNone(attentions.grad)
TOLERANCE = 1e-4
# We will verify our results on an image of cute cats
def prepare_img():
image = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
return image
@require_vision
@slow
class OneFormerModelIntegrationTest(unittest.TestCase):
@cached_property
def model_checkpoints(self):
return "shi-labs/oneformer_ade20k_swin_tiny"
@cached_property
def default_processor(self):
return OneFormerProcessor.from_pretrained(self.model_checkpoints) if is_vision_available() else None
def test_inference_no_head(self):
model = OneFormerModel.from_pretrained(self.model_checkpoints).to(torch_device)
processor = self.default_processor
image = prepare_img()
inputs = processor(image, ["semantic"], return_tensors="pt").to(torch_device)
inputs_shape = inputs["pixel_values"].shape
# check size
self.assertEqual(inputs_shape, (1, 3, 512, 682))
task_inputs_shape = inputs["task_inputs"].shape
# check size
self.assertEqual(task_inputs_shape, (1, 77))
with torch.no_grad():
outputs = model(**inputs)
expected_slice_hidden_state = torch.tensor(
[[0.2724, 0.8287, 0.6025], [1.2706, 1.1252, 1.1445], [1.1357, 0.6150, 0.4185]]
).to(torch_device)
self.assertTrue(
torch.allclose(
outputs.encoder_hidden_states[-1][0, 0, :3, :3], expected_slice_hidden_state, atol=TOLERANCE
)
)
expected_slice_hidden_state = torch.tensor(
[[1.0581, 1.2275, 1.2000], [1.1901, 1.2925, 1.2861], [1.1578, 1.2558, 1.3212]]
).to(torch_device)
self.assertTrue(
torch.allclose(
outputs.pixel_decoder_hidden_states[0][0, 0, :3, :3], expected_slice_hidden_state, atol=TOLERANCE
)
)
expected_slice_hidden_state = torch.tensor(
[[3.0711, -1.1855, -5.1095], [3.5536, -3.2710, -5.2052], [2.6020, -4.3605, -4.1422]]
).to(torch_device)
self.assertTrue(
torch.allclose(
outputs.transformer_decoder_class_predictions[0, :3, :3], expected_slice_hidden_state, atol=TOLERANCE
)
)
def test_inference_universal_segmentation_head(self):
model = OneFormerForUniversalSegmentation.from_pretrained(self.model_checkpoints).to(torch_device).eval()
processor = self.default_processor
image = prepare_img()
inputs = processor(image, ["semantic"], return_tensors="pt").to(torch_device)
inputs_shape = inputs["pixel_values"].shape
# check size
self.assertEqual(inputs_shape, (1, 3, 512, 682))
with torch.no_grad():
outputs = model(**inputs)
# masks_queries_logits
masks_queries_logits = outputs.masks_queries_logits
self.assertEqual(
masks_queries_logits.shape,
(1, model.config.num_queries, inputs_shape[-2] // 4, (inputs_shape[-1] + 2) // 4),
)
expected_slice = [[[3.1215, 4.1250, 4.1106], [2.8183, 3.4623, 3.5512], [2.4550, 2.9841, 3.5081]]]
expected_slice = torch.tensor(expected_slice).to(torch_device)
self.assertTrue(torch.allclose(masks_queries_logits[0, 0, :3, :3], expected_slice, atol=TOLERANCE))
# class_queries_logits
class_queries_logits = outputs.class_queries_logits
self.assertEqual(
class_queries_logits.shape,
(1, model.config.num_queries, model.config.num_labels + 1),
)
expected_slice = torch.tensor(
[[3.0711, -1.1855, -5.1095], [3.5536, -3.2710, -5.2052], [2.6020, -4.3605, -4.1422]]
).to(torch_device)
self.assertTrue(torch.allclose(class_queries_logits[0, :3, :3], expected_slice, atol=TOLERANCE))
def test_with_segmentation_maps_and_loss(self):
dummy_model = OneFormerForUniversalSegmentation.from_pretrained(self.model_checkpoints)
processor = self.default_processor
processor.image_processor.num_text = dummy_model.config.num_queries - dummy_model.config.text_encoder_n_ctx
dummy_model.config.is_training = True
model = OneFormerForUniversalSegmentation(dummy_model.config).to(torch_device).eval()
del dummy_model
inputs = processor(
[np.zeros((3, 512, 640)), np.zeros((3, 512, 640))],
["semantic", "semantic"],
segmentation_maps=[np.zeros((384, 384)).astype(np.float32), np.zeros((384, 384)).astype(np.float32)],
return_tensors="pt",
)
inputs["pixel_values"] = inputs["pixel_values"].to(torch_device)
inputs["task_inputs"] = inputs["task_inputs"].to(torch_device)
inputs["text_inputs"] = inputs["text_inputs"].to(torch_device)
inputs["mask_labels"] = [el.to(torch_device) for el in inputs["mask_labels"]]
inputs["class_labels"] = [el.to(torch_device) for el in inputs["class_labels"]]
with torch.no_grad():
outputs = model(**inputs)
self.assertTrue(outputs.loss is not None)

View File

@@ -0,0 +1,833 @@
# coding=utf-8
# Copyright 2022 HuggingFace Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import os
import tempfile
import unittest
import numpy as np
from datasets import load_dataset
from huggingface_hub import hf_hub_download
from transformers.testing_utils import check_json_file_has_correct_format, require_torch, require_vision
from transformers.utils import is_torch_available, is_vision_available
from ...test_feature_extraction_common import prepare_image_inputs
if is_torch_available():
import torch
if is_vision_available():
from transformers import CLIPTokenizer, OneFormerImageProcessor, OneFormerProcessor
from transformers.models.oneformer.image_processing_oneformer import binary_mask_to_rle
from transformers.models.oneformer.modeling_oneformer import OneFormerForUniversalSegmentationOutput
if is_vision_available():
from PIL import Image
def prepare_metadata(class_info_file, repo_path="shi-labs/oneformer_demo"):
with open(hf_hub_download(repo_path, class_info_file, repo_type="dataset"), "r") as f:
class_info = json.load(f)
metadata = {}
class_names = []
thing_ids = []
for key, info in class_info.items():
metadata[key] = info["name"]
class_names.append(info["name"])
if info["isthing"]:
thing_ids.append(int(key))
metadata["thing_ids"] = thing_ids
metadata["class_names"] = class_names
return metadata
class OneFormerProcessorTester(unittest.TestCase):
def __init__(
self,
parent,
batch_size=7,
num_channels=3,
min_resolution=30,
max_resolution=400,
size=None,
do_resize=True,
do_normalize=True,
image_mean=[0.5, 0.5, 0.5],
image_std=[0.5, 0.5, 0.5],
num_labels=10,
reduce_labels=False,
ignore_index=255,
max_seq_length=77,
task_seq_length=77,
model_repo="shi-labs/oneformer_ade20k_swin_tiny",
class_info_file="ade20k_panoptic.json",
num_text=10,
):
self.parent = parent
self.batch_size = batch_size
self.num_channels = num_channels
self.min_resolution = min_resolution
self.max_resolution = max_resolution
self.do_resize = do_resize
self.size = {"shortest_edge": 32, "longest_edge": 1333} if size is None else size
self.do_normalize = do_normalize
self.image_mean = image_mean
self.image_std = image_std
self.max_seq_length = max_seq_length
self.task_seq_length = task_seq_length
self.class_info_file = class_info_file
self.metadata = prepare_metadata(class_info_file)
self.num_text = num_text
self.model_repo = model_repo
# for the post_process_functions
self.batch_size = 2
self.num_queries = 10
self.num_classes = 10
self.height = 3
self.width = 4
self.num_labels = num_labels
self.reduce_labels = reduce_labels
self.ignore_index = ignore_index
def prepare_processor_dict(self):
image_processor_dict = {
"do_resize": self.do_resize,
"size": self.size,
"do_normalize": self.do_normalize,
"image_mean": self.image_mean,
"image_std": self.image_std,
"num_labels": self.num_labels,
"reduce_labels": self.reduce_labels,
"ignore_index": self.ignore_index,
"class_info_file": self.class_info_file,
"metadata": self.metadata,
"num_text": self.num_text,
}
image_processor = OneFormerImageProcessor(**image_processor_dict)
tokenizer = CLIPTokenizer.from_pretrained(self.model_repo)
return {
"image_processor": image_processor,
"tokenizer": tokenizer,
"max_seq_length": self.max_seq_length,
"task_seq_length": self.task_seq_length,
}
def get_expected_values(self, image_inputs, batched=False):
"""
This function computes the expected height and width when providing images to OneFormerProcessor,
assuming do_resize is set to True with a scalar size. It also provides the expected sequence length
for the task_inputs and text_list_input.
"""
if not batched:
image = image_inputs[0]
if isinstance(image, Image.Image):
w, h = image.size
else:
h, w = image.shape[1], image.shape[2]
if w < h:
expected_height = int(self.size["shortest_edge"] * h / w)
expected_width = self.size["shortest_edge"]
elif w > h:
expected_height = self.size["shortest_edge"]
expected_width = int(self.size["shortest_edge"] * w / h)
else:
expected_height = self.size["shortest_edge"]
expected_width = self.size["shortest_edge"]
else:
expected_values = []
for image in image_inputs:
expected_height, expected_width, expected_sequence_length = self.get_expected_values([image])
expected_values.append((expected_height, expected_width, expected_sequence_length))
expected_height = max(expected_values, key=lambda item: item[0])[0]
expected_width = max(expected_values, key=lambda item: item[1])[1]
expected_sequence_length = self.max_seq_length
return expected_height, expected_width, expected_sequence_length
def get_fake_oneformer_outputs(self):
return OneFormerForUniversalSegmentationOutput(
# +1 for null class
class_queries_logits=torch.randn((self.batch_size, self.num_queries, self.num_classes + 1)),
masks_queries_logits=torch.randn((self.batch_size, self.num_queries, self.height, self.width)),
)
@require_torch
@require_vision
class OneFormerProcessingTest(unittest.TestCase):
processing_class = OneFormerProcessor if (is_vision_available() and is_torch_available()) else None
# only for test_feat_extracttion_common.test_feat_extract_to_json_string
feature_extraction_class = processing_class
def setUp(self):
self.processing_tester = OneFormerProcessorTester(self)
@property
def processor_dict(self):
return self.processing_tester.prepare_processor_dict()
def test_feat_extract_properties(self):
processor = self.processing_class(**self.processor_dict)
self.assertTrue(hasattr(processor, "image_processor"))
self.assertTrue(hasattr(processor, "tokenizer"))
self.assertTrue(hasattr(processor, "max_seq_length"))
self.assertTrue(hasattr(processor, "task_seq_length"))
def test_batch_feature(self):
pass
def test_call_pil(self):
# Initialize processor
processor = self.processing_class(**self.processor_dict)
# create random PIL images
image_inputs = prepare_image_inputs(self.processing_tester, equal_resolution=False)
for image in image_inputs:
self.assertIsInstance(image, Image.Image)
# Test not batched input
encoded_images = processor(image_inputs[0], ["semantic"], return_tensors="pt").pixel_values
expected_height, expected_width, expected_sequence_length = self.processing_tester.get_expected_values(
image_inputs
)
self.assertEqual(
encoded_images.shape,
(1, self.processing_tester.num_channels, expected_height, expected_width),
)
tokenized_task_inputs = processor(image_inputs[0], ["semantic"], return_tensors="pt").task_inputs
self.assertEqual(
tokenized_task_inputs.shape,
(1, expected_sequence_length),
)
# Test batched
expected_height, expected_width, expected_sequence_length = self.processing_tester.get_expected_values(
image_inputs, batched=True
)
encoded_images = processor(image_inputs, ["semantic"] * len(image_inputs), return_tensors="pt").pixel_values
self.assertEqual(
encoded_images.shape,
(
self.processing_tester.batch_size,
self.processing_tester.num_channels,
expected_height,
expected_width,
),
)
tokenized_task_inputs = processor(
image_inputs, ["semantic"] * len(image_inputs), return_tensors="pt"
).task_inputs
self.assertEqual(
tokenized_task_inputs.shape,
(self.processing_tester.batch_size, expected_sequence_length),
)
def test_call_numpy(self):
# Initialize processor
processor = self.processing_class(**self.processor_dict)
# create random numpy tensors
image_inputs = prepare_image_inputs(self.processing_tester, equal_resolution=False, numpify=True)
for image in image_inputs:
self.assertIsInstance(image, np.ndarray)
# Test not batched input
encoded_images = processor(image_inputs[0], ["semantic"], return_tensors="pt").pixel_values
expected_height, expected_width, expected_sequence_length = self.processing_tester.get_expected_values(
image_inputs
)
self.assertEqual(
encoded_images.shape,
(1, self.processing_tester.num_channels, expected_height, expected_width),
)
tokenized_task_inputs = processor(image_inputs[0], ["semantic"], return_tensors="pt").task_inputs
self.assertEqual(
tokenized_task_inputs.shape,
(1, expected_sequence_length),
)
# Test batched
expected_height, expected_width, expected_sequence_length = self.processing_tester.get_expected_values(
image_inputs, batched=True
)
encoded_images = processor(image_inputs, ["semantic"] * len(image_inputs), return_tensors="pt").pixel_values
self.assertEqual(
encoded_images.shape,
(
self.processing_tester.batch_size,
self.processing_tester.num_channels,
expected_height,
expected_width,
),
)
tokenized_task_inputs = processor(
image_inputs, ["semantic"] * len(image_inputs), return_tensors="pt"
).task_inputs
self.assertEqual(
tokenized_task_inputs.shape,
(self.processing_tester.batch_size, expected_sequence_length),
)
def test_call_pytorch(self):
# Initialize processor
processor = self.processing_class(**self.processor_dict)
# create random PyTorch tensors
image_inputs = prepare_image_inputs(self.processing_tester, equal_resolution=False, torchify=True)
for image in image_inputs:
self.assertIsInstance(image, torch.Tensor)
# Test not batched input
encoded_images = processor(image_inputs[0], ["semantic"], return_tensors="pt").pixel_values
expected_height, expected_width, expected_sequence_length = self.processing_tester.get_expected_values(
image_inputs
)
self.assertEqual(
encoded_images.shape,
(1, self.processing_tester.num_channels, expected_height, expected_width),
)
tokenized_task_inputs = processor(image_inputs[0], ["semantic"], return_tensors="pt").task_inputs
self.assertEqual(
tokenized_task_inputs.shape,
(1, expected_sequence_length),
)
# Test batched
expected_height, expected_width, expected_sequence_length = self.processing_tester.get_expected_values(
image_inputs, batched=True
)
encoded_images = processor(image_inputs, ["semantic"] * len(image_inputs), return_tensors="pt").pixel_values
self.assertEqual(
encoded_images.shape,
(
self.processing_tester.batch_size,
self.processing_tester.num_channels,
expected_height,
expected_width,
),
)
tokenized_task_inputs = processor(
image_inputs, ["semantic"] * len(image_inputs), return_tensors="pt"
).task_inputs
self.assertEqual(
tokenized_task_inputs.shape,
(self.processing_tester.batch_size, expected_sequence_length),
)
def test_equivalence_pad_and_create_pixel_mask(self):
# Initialize processors
processor_1 = self.processing_class(**self.processor_dict)
image_processor = OneFormerImageProcessor(
do_resize=False,
do_normalize=False,
do_rescale=False,
num_labels=self.processing_tester.num_classes,
class_info_file="ade20k_panoptic.json",
num_text=self.processing_tester.num_text,
)
tokenizer = CLIPTokenizer.from_pretrained("shi-labs/oneformer_ade20k_swin_tiny")
processor_2 = self.processing_class(
image_processor=image_processor, tokenizer=tokenizer, max_seq_length=77, task_seq_length=77
)
# create random PyTorch tensors
image_inputs = prepare_image_inputs(self.processing_tester, equal_resolution=False, torchify=True)
for image in image_inputs:
self.assertIsInstance(image, torch.Tensor)
# Test whether the method "pad_and_return_pixel_mask" and calling the image processor return the same tensors
encoded_images_with_method = processor_1.encode_inputs(
image_inputs, ["semantic"] * len(image_inputs), return_tensors="pt"
)
encoded_images = processor_2(image_inputs, ["semantic"] * len(image_inputs), return_tensors="pt")
self.assertTrue(
torch.allclose(encoded_images_with_method["pixel_values"], encoded_images["pixel_values"], atol=1e-4)
)
self.assertTrue(
torch.allclose(encoded_images_with_method["pixel_mask"], encoded_images["pixel_mask"], atol=1e-4)
)
def comm_get_processor_inputs(self, with_segmentation_maps=False, is_instance_map=False, segmentation_type="np"):
processor = self.processing_class(**self.processor_dict)
# prepare image and target
num_labels = self.processing_tester.num_labels
annotations = None
instance_id_to_semantic_id = None
image_inputs = prepare_image_inputs(self.processing_tester, equal_resolution=False)
if with_segmentation_maps:
high = num_labels
if is_instance_map:
labels_expanded = list(range(num_labels)) * 2
instance_id_to_semantic_id = {
instance_id: label_id for instance_id, label_id in enumerate(labels_expanded)
}
annotations = [
np.random.randint(0, high * 2, (img.size[1], img.size[0])).astype(np.uint8) for img in image_inputs
]
if segmentation_type == "pil":
annotations = [Image.fromarray(annotation) for annotation in annotations]
inputs = processor(
image_inputs,
["semantic"] * len(image_inputs),
annotations,
return_tensors="pt",
instance_id_to_semantic_id=instance_id_to_semantic_id,
pad_and_return_pixel_mask=True,
)
return inputs
def test_init_without_params(self):
pass
def test_feat_extract_from_and_save_pretrained(self):
feat_extract_first = self.feature_extraction_class(**self.processor_dict)
with tempfile.TemporaryDirectory() as tmpdirname:
feat_extract_first.save_pretrained(tmpdirname)
check_json_file_has_correct_format(os.path.join(tmpdirname, "preprocessor_config.json"))
feat_extract_second = self.feature_extraction_class.from_pretrained(tmpdirname)
self.assertEqual(feat_extract_second.image_processor.to_dict(), feat_extract_first.image_processor.to_dict())
self.assertIsInstance(feat_extract_first.image_processor, OneFormerImageProcessor)
self.assertIsInstance(feat_extract_first.tokenizer, CLIPTokenizer)
def test_call_with_segmentation_maps(self):
def common(is_instance_map=False, segmentation_type=None):
inputs = self.comm_get_processor_inputs(
with_segmentation_maps=True, is_instance_map=is_instance_map, segmentation_type=segmentation_type
)
mask_labels = inputs["mask_labels"]
class_labels = inputs["class_labels"]
pixel_values = inputs["pixel_values"]
text_inputs = inputs["text_inputs"]
# check the batch_size
for mask_label, class_label, text_input in zip(mask_labels, class_labels, text_inputs):
self.assertEqual(mask_label.shape[0], class_label.shape[0])
# this ensure padding has happened
self.assertEqual(mask_label.shape[1:], pixel_values.shape[2:])
self.assertEqual(text_input.shape[0], self.processing_tester.num_text)
common()
common(is_instance_map=True)
common(is_instance_map=False, segmentation_type="pil")
common(is_instance_map=True, segmentation_type="pil")
def test_integration_semantic_segmentation(self):
# load 2 images and corresponding panoptic annotations from the hub
dataset = load_dataset("nielsr/ade20k-panoptic-demo")
image1 = dataset["train"][0]["image"]
image2 = dataset["train"][1]["image"]
segments_info1 = dataset["train"][0]["segments_info"]
segments_info2 = dataset["train"][1]["segments_info"]
annotation1 = dataset["train"][0]["label"]
annotation2 = dataset["train"][1]["label"]
def rgb_to_id(color):
if isinstance(color, np.ndarray) and len(color.shape) == 3:
if color.dtype == np.uint8:
color = color.astype(np.int32)
return color[:, :, 0] + 256 * color[:, :, 1] + 256 * 256 * color[:, :, 2]
return int(color[0] + 256 * color[1] + 256 * 256 * color[2])
def create_panoptic_map(annotation, segments_info):
annotation = np.array(annotation)
# convert RGB to segment IDs per pixel
# 0 is the "ignore" label, for which we don't need to make binary masks
panoptic_map = rgb_to_id(annotation)
# create mapping between segment IDs and semantic classes
inst2class = {segment["id"]: segment["category_id"] for segment in segments_info}
return panoptic_map, inst2class
panoptic_map1, inst2class1 = create_panoptic_map(annotation1, segments_info1)
panoptic_map2, inst2class2 = create_panoptic_map(annotation2, segments_info2)
image_processor = OneFormerImageProcessor(
reduce_labels=True,
ignore_index=0,
size=(512, 512),
class_info_file="ade20k_panoptic.json",
num_text=self.processing_tester.num_text,
)
tokenizer = CLIPTokenizer.from_pretrained("shi-labs/oneformer_ade20k_swin_tiny")
processor = OneFormerProcessor(
image_processor=image_processor,
tokenizer=tokenizer,
max_seq_length=77,
task_seq_length=77,
)
# prepare the images and annotations
pixel_values_list = [np.moveaxis(np.array(image1), -1, 0), np.moveaxis(np.array(image2), -1, 0)]
inputs = processor.encode_inputs(
pixel_values_list,
["semantic", "semantic"],
[panoptic_map1, panoptic_map2],
instance_id_to_semantic_id=[inst2class1, inst2class2],
return_tensors="pt",
)
# verify the pixel values, task inputs, text inputs and pixel mask
self.assertEqual(inputs["pixel_values"].shape, (2, 3, 512, 711))
self.assertEqual(inputs["pixel_mask"].shape, (2, 512, 711))
self.assertEqual(inputs["task_inputs"].shape, (2, 77))
self.assertEqual(inputs["text_inputs"].shape, (2, self.processing_tester.num_text, 77))
# verify the class labels
self.assertEqual(len(inputs["class_labels"]), 2)
# fmt: off
expected_class_labels = torch.tensor([4, 17, 32, 42, 12, 3, 5, 0, 43, 96, 104, 31, 125, 138, 87, 149]) # noqa: E231
# fmt: on
self.assertTrue(torch.allclose(inputs["class_labels"][0], expected_class_labels))
# fmt: off
expected_class_labels = torch.tensor([19, 67, 82, 17, 12, 42, 3, 14, 5, 0, 115, 43, 8, 138, 125, 143]) # noqa: E231
# fmt: on
self.assertTrue(torch.allclose(inputs["class_labels"][1], expected_class_labels))
# verify the task inputs
self.assertEqual(len(inputs["task_inputs"]), 2)
self.assertEqual(inputs["task_inputs"][0].sum().item(), 141082)
self.assertEqual(inputs["task_inputs"][0].sum().item(), inputs["task_inputs"][1].sum().item())
# verify the text inputs
self.assertEqual(len(inputs["text_inputs"]), 2)
self.assertEqual(inputs["text_inputs"][0].sum().item(), 1095752)
self.assertEqual(inputs["text_inputs"][1].sum().item(), 1062468)
# verify the mask labels
self.assertEqual(len(inputs["mask_labels"]), 2)
self.assertEqual(inputs["mask_labels"][0].shape, (16, 512, 711))
self.assertEqual(inputs["mask_labels"][1].shape, (16, 512, 711))
self.assertEqual(inputs["mask_labels"][0].sum().item(), 315193.0)
self.assertEqual(inputs["mask_labels"][1].sum().item(), 350747.0)
def test_integration_instance_segmentation(self):
# load 2 images and corresponding panoptic annotations from the hub
dataset = load_dataset("nielsr/ade20k-panoptic-demo")
image1 = dataset["train"][0]["image"]
image2 = dataset["train"][1]["image"]
segments_info1 = dataset["train"][0]["segments_info"]
segments_info2 = dataset["train"][1]["segments_info"]
annotation1 = dataset["train"][0]["label"]
annotation2 = dataset["train"][1]["label"]
def rgb_to_id(color):
if isinstance(color, np.ndarray) and len(color.shape) == 3:
if color.dtype == np.uint8:
color = color.astype(np.int32)
return color[:, :, 0] + 256 * color[:, :, 1] + 256 * 256 * color[:, :, 2]
return int(color[0] + 256 * color[1] + 256 * 256 * color[2])
def create_panoptic_map(annotation, segments_info):
annotation = np.array(annotation)
# convert RGB to segment IDs per pixel
# 0 is the "ignore" label, for which we don't need to make binary masks
panoptic_map = rgb_to_id(annotation)
# create mapping between segment IDs and semantic classes
inst2class = {segment["id"]: segment["category_id"] for segment in segments_info}
return panoptic_map, inst2class
panoptic_map1, inst2class1 = create_panoptic_map(annotation1, segments_info1)
panoptic_map2, inst2class2 = create_panoptic_map(annotation2, segments_info2)
image_processor = OneFormerImageProcessor(
reduce_labels=True,
ignore_index=0,
size=(512, 512),
class_info_file="ade20k_panoptic.json",
num_text=self.processing_tester.num_text,
)
tokenizer = CLIPTokenizer.from_pretrained("shi-labs/oneformer_ade20k_swin_tiny")
processor = OneFormerProcessor(
image_processor=image_processor,
tokenizer=tokenizer,
max_seq_length=77,
task_seq_length=77,
)
# prepare the images and annotations
pixel_values_list = [np.moveaxis(np.array(image1), -1, 0), np.moveaxis(np.array(image2), -1, 0)]
inputs = processor.encode_inputs(
pixel_values_list,
["instance", "instance"],
[panoptic_map1, panoptic_map2],
instance_id_to_semantic_id=[inst2class1, inst2class2],
return_tensors="pt",
)
# verify the pixel values, task inputs, text inputs and pixel mask
self.assertEqual(inputs["pixel_values"].shape, (2, 3, 512, 711))
self.assertEqual(inputs["pixel_mask"].shape, (2, 512, 711))
self.assertEqual(inputs["task_inputs"].shape, (2, 77))
self.assertEqual(inputs["text_inputs"].shape, (2, self.processing_tester.num_text, 77))
# verify the class labels
self.assertEqual(len(inputs["class_labels"]), 2)
# fmt: off
expected_class_labels = torch.tensor([32, 42, 42, 42, 42, 42, 42, 42, 32, 12, 12, 12, 12, 12, 42, 42, 12, 12, 12, 42, 12, 12, 12, 12, 12, 12, 12, 12, 12, 42, 42, 42, 12, 42, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 43, 43, 43, 43, 104, 43, 31, 125, 31, 125, 138, 87, 125, 149, 138, 125, 87, 87]) # noqa: E231
# fmt: on
self.assertTrue(torch.allclose(inputs["class_labels"][0], expected_class_labels))
# fmt: off
expected_class_labels = torch.tensor([19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 67, 82, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 12, 12, 42, 12, 12, 12, 12, 14, 12, 12, 12, 12, 12, 12, 12, 12, 14, 12, 12, 115, 43, 43, 115, 43, 43, 43, 8, 8, 8, 138, 138, 125, 143]) # noqa: E231
# fmt: on
self.assertTrue(torch.allclose(inputs["class_labels"][1], expected_class_labels))
# verify the task inputs
self.assertEqual(len(inputs["task_inputs"]), 2)
self.assertEqual(inputs["task_inputs"][0].sum().item(), 144985)
self.assertEqual(inputs["task_inputs"][0].sum().item(), inputs["task_inputs"][1].sum().item())
# verify the text inputs
self.assertEqual(len(inputs["text_inputs"]), 2)
self.assertEqual(inputs["text_inputs"][0].sum().item(), 1037040)
self.assertEqual(inputs["text_inputs"][1].sum().item(), 1044078)
# verify the mask labels
self.assertEqual(len(inputs["mask_labels"]), 2)
self.assertEqual(inputs["mask_labels"][0].shape, (73, 512, 711))
self.assertEqual(inputs["mask_labels"][1].shape, (57, 512, 711))
self.assertEqual(inputs["mask_labels"][0].sum().item(), 35040.0)
self.assertEqual(inputs["mask_labels"][1].sum().item(), 98228.0)
def test_integration_panoptic_segmentation(self):
# load 2 images and corresponding panoptic annotations from the hub
dataset = load_dataset("nielsr/ade20k-panoptic-demo")
image1 = dataset["train"][0]["image"]
image2 = dataset["train"][1]["image"]
segments_info1 = dataset["train"][0]["segments_info"]
segments_info2 = dataset["train"][1]["segments_info"]
annotation1 = dataset["train"][0]["label"]
annotation2 = dataset["train"][1]["label"]
def rgb_to_id(color):
if isinstance(color, np.ndarray) and len(color.shape) == 3:
if color.dtype == np.uint8:
color = color.astype(np.int32)
return color[:, :, 0] + 256 * color[:, :, 1] + 256 * 256 * color[:, :, 2]
return int(color[0] + 256 * color[1] + 256 * 256 * color[2])
def create_panoptic_map(annotation, segments_info):
annotation = np.array(annotation)
# convert RGB to segment IDs per pixel
# 0 is the "ignore" label, for which we don't need to make binary masks
panoptic_map = rgb_to_id(annotation)
# create mapping between segment IDs and semantic classes
inst2class = {segment["id"]: segment["category_id"] for segment in segments_info}
return panoptic_map, inst2class
panoptic_map1, inst2class1 = create_panoptic_map(annotation1, segments_info1)
panoptic_map2, inst2class2 = create_panoptic_map(annotation2, segments_info2)
image_processor = OneFormerImageProcessor(
reduce_labels=True,
ignore_index=0,
size=(512, 512),
class_info_file="ade20k_panoptic.json",
num_text=self.processing_tester.num_text,
)
tokenizer = CLIPTokenizer.from_pretrained("shi-labs/oneformer_ade20k_swin_tiny")
processor = OneFormerProcessor(
image_processor=image_processor,
tokenizer=tokenizer,
max_seq_length=77,
task_seq_length=77,
)
# prepare the images and annotations
pixel_values_list = [np.moveaxis(np.array(image1), -1, 0), np.moveaxis(np.array(image2), -1, 0)]
inputs = processor.encode_inputs(
pixel_values_list,
["panoptic", "panoptic"],
[panoptic_map1, panoptic_map2],
instance_id_to_semantic_id=[inst2class1, inst2class2],
return_tensors="pt",
)
# verify the pixel values, task inputs, text inputs and pixel mask
self.assertEqual(inputs["pixel_values"].shape, (2, 3, 512, 711))
self.assertEqual(inputs["pixel_mask"].shape, (2, 512, 711))
self.assertEqual(inputs["task_inputs"].shape, (2, 77))
self.assertEqual(inputs["text_inputs"].shape, (2, self.processing_tester.num_text, 77))
# verify the class labels
self.assertEqual(len(inputs["class_labels"]), 2)
# fmt: off
expected_class_labels = torch.tensor([4, 17, 32, 42, 42, 42, 42, 42, 42, 42, 32, 12, 12, 12, 12, 12, 42, 42, 12, 12, 12, 42, 12, 12, 12, 12, 12, 3, 12, 12, 12, 12, 42, 42, 42, 12, 42, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 5, 12, 12, 12, 12, 12, 12, 12, 0, 43, 43, 43, 96, 43, 104, 43, 31, 125, 31, 125, 138, 87, 125, 149, 138, 125, 87, 87]) # noqa: E231
# fmt: on
self.assertTrue(torch.allclose(inputs["class_labels"][0], expected_class_labels))
# fmt: off
expected_class_labels = torch.tensor([19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 67, 82, 19, 19, 17, 19, 19, 19, 19, 19, 19, 19, 19, 19, 12, 12, 42, 12, 12, 12, 12, 3, 14, 12, 12, 12, 12, 12, 12, 12, 12, 14, 5, 12, 12, 0, 115, 43, 43, 115, 43, 43, 43, 8, 8, 8, 138, 138, 125, 143]) # noqa: E231
# fmt: on
self.assertTrue(torch.allclose(inputs["class_labels"][1], expected_class_labels))
# verify the task inputs
self.assertEqual(len(inputs["task_inputs"]), 2)
self.assertEqual(inputs["task_inputs"][0].sum().item(), 136240)
self.assertEqual(inputs["task_inputs"][0].sum().item(), inputs["task_inputs"][1].sum().item())
# verify the text inputs
self.assertEqual(len(inputs["text_inputs"]), 2)
self.assertEqual(inputs["text_inputs"][0].sum().item(), 1048653)
self.assertEqual(inputs["text_inputs"][1].sum().item(), 1067160)
# verify the mask labels
self.assertEqual(len(inputs["mask_labels"]), 2)
self.assertEqual(inputs["mask_labels"][0].shape, (79, 512, 711))
self.assertEqual(inputs["mask_labels"][1].shape, (61, 512, 711))
self.assertEqual(inputs["mask_labels"][0].sum().item(), 315193.0)
self.assertEqual(inputs["mask_labels"][1].sum().item(), 350747.0)
def test_binary_mask_to_rle(self):
fake_binary_mask = np.zeros((20, 50))
fake_binary_mask[0, 20:] = 1
fake_binary_mask[1, :15] = 1
fake_binary_mask[5, :10] = 1
rle = binary_mask_to_rle(fake_binary_mask)
self.assertEqual(len(rle), 4)
self.assertEqual(rle[0], 21)
self.assertEqual(rle[1], 45)
def test_post_process_semantic_segmentation(self):
image_processor = OneFormerImageProcessor(
reduce_labels=True,
ignore_index=0,
size=(512, 512),
class_info_file="ade20k_panoptic.json",
num_text=self.processing_tester.num_text,
)
tokenizer = CLIPTokenizer.from_pretrained("shi-labs/oneformer_ade20k_swin_tiny")
processor = OneFormerProcessor(
image_processor=image_processor,
tokenizer=tokenizer,
max_seq_length=77,
task_seq_length=77,
)
outputs = self.processing_tester.get_fake_oneformer_outputs()
segmentation = processor.post_process_semantic_segmentation(outputs)
self.assertEqual(len(segmentation), self.processing_tester.batch_size)
self.assertEqual(
segmentation[0].shape,
(
self.processing_tester.height,
self.processing_tester.width,
),
)
target_sizes = [(1, 4) for i in range(self.processing_tester.batch_size)]
segmentation = processor.post_process_semantic_segmentation(outputs, target_sizes=target_sizes)
self.assertEqual(segmentation[0].shape, target_sizes[0])
def test_post_process_instance_segmentation(self):
image_processor = OneFormerImageProcessor(
reduce_labels=True,
ignore_index=0,
size=(512, 512),
class_info_file="ade20k_panoptic.json",
num_text=self.processing_tester.num_text,
)
tokenizer = CLIPTokenizer.from_pretrained("shi-labs/oneformer_ade20k_swin_tiny")
processor = OneFormerProcessor(
image_processor=image_processor,
tokenizer=tokenizer,
max_seq_length=77,
task_seq_length=77,
)
outputs = self.processing_tester.get_fake_oneformer_outputs()
segmentation = processor.post_process_instance_segmentation(outputs, threshold=0)
self.assertTrue(len(segmentation) == self.processing_tester.batch_size)
for el in segmentation:
self.assertTrue("segmentation" in el)
self.assertTrue("segments_info" in el)
self.assertEqual(type(el["segments_info"]), list)
self.assertEqual(el["segmentation"].shape, (self.processing_tester.height, self.processing_tester.width))
def test_post_process_panoptic_segmentation(self):
image_processor = OneFormerImageProcessor(
reduce_labels=True,
ignore_index=0,
size=(512, 512),
class_info_file="ade20k_panoptic.json",
num_text=self.processing_tester.num_text,
)
tokenizer = CLIPTokenizer.from_pretrained("shi-labs/oneformer_ade20k_swin_tiny")
processor = OneFormerProcessor(
image_processor=image_processor,
tokenizer=tokenizer,
max_seq_length=77,
task_seq_length=77,
)
outputs = self.processing_tester.get_fake_oneformer_outputs()
segmentation = processor.post_process_panoptic_segmentation(outputs, threshold=0)
self.assertTrue(len(segmentation) == self.processing_tester.batch_size)
for el in segmentation:
self.assertTrue("segmentation" in el)
self.assertTrue("segments_info" in el)
self.assertEqual(type(el["segments_info"]), list)
self.assertEqual(el["segmentation"].shape, (self.processing_tester.height, self.processing_tester.width))

View File

@@ -124,6 +124,8 @@ src/transformers/models/mobilevit/modeling_tf_mobilevit.py
src/transformers/models/nat/configuration_nat.py
src/transformers/models/nat/modeling_nat.py
src/transformers/models/nezha/configuration_nezha.py
src/transformers/models/oneformer/configuration_oneformer.py
src/transformers/models/oneformer/modeling_oneformer.py
src/transformers/models/openai/configuration_openai.py
src/transformers/models/opt/configuration_opt.py
src/transformers/models/opt/modeling_opt.py