Maskformer (#15682)
* maskformer
* conflicts
* conflicts
* minor fixes
* feature extractor test fix
  refactor MaskFormerLoss following conversation
  MaskFormer related types should not trigger a module time import error
  missed one
  removed all the types that are not used
  update config mapping
  minor updates in the doc
  resolved conversation that doesn't need a discussion
  minor changes
  resolved conversations
  fixed DetrDecoder
* minor changes
  minor changes
  fixed mdx file
  test feature_extractor return types
  functional losses -> classes
  removed the return type test for the feature extractor
  minor changes + style + quality
* conflicts?
* rebase master
* readme
* added missing files
* deleted poolformers tests that were in the wrong place
* CI
* minor changes
* Apply suggestions from code review
  Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
* resolved conversations
* minor changes
* conversations

[Unispeech] Fix slow tests (#15818)
* remove soundfile old way of loading audio
* Adapt slow test

[Barthez Tokenizer] Fix saving (#15815)

[TFXLNet] Correct tf xlnet generate (#15822)
* [TFXLNet] Correct tf xlnet
* adapt test comment

Fix the push run (#15807)

Fix semantic segmentation pipeline test (#15826)

Fix dummy_inputs() to dummy_inputs in symbolic_trace doc (#15776)

Add model specific output classes to PoolFormer model docs (#15746)
* Added model specific output classes to poolformer docs
* Fixed Segformer typo in Poolformer docs

Adding the option to return_timestamps on pure CTC ASR models. (#15792)
* Adding the option to return_timestamps on pure CTC ASR models.
* Remove `math.prod` which was introduced in Python 3.8
* int are not floats.
* Reworking the PR to support "char" vs "word" output.
* Fixup!
* Update src/transformers/pipelines/automatic_speech_recognition.py
  Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* Update src/transformers/pipelines/automatic_speech_recognition.py
  Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* Update src/transformers/pipelines/automatic_speech_recognition.py
  Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* Update src/transformers/pipelines/automatic_speech_recognition.py
  Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* Update src/transformers/pipelines/automatic_speech_recognition.py
  Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* Update src/transformers/pipelines/automatic_speech_recognition.py
  Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* Update src/transformers/pipelines/automatic_speech_recognition.py
  Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* Update src/transformers/pipelines/automatic_speech_recognition.py
  Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* Update src/transformers/pipelines/automatic_speech_recognition.py
  Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* Quality.
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

HFTracer.trace should use/return self.graph to be compatible with torch.fx.Tracer (#15824)

Fix tf.concatenate + test past_key_values for TF models (#15774)
* fix wrong method name tf.concatenate
* add tests related to causal LM / decoder
* make style and quality
* clean-up
* Fix TFBertModel's extended_attention_mask when past_key_values is provided
* Fix tests
* fix copies
* More tf.int8 -> tf.int32 in TF test template
* clean-up
* Update TF test template
* revert the previous commit + update the TF test template
* Fix TF template extended_attention_mask when past_key_values is provided
* Fix some styles manually
* clean-up
* Fix ValueError: too many values to unpack in the test
* Fix more: too many values to unpack in the test
* Add a comment for extended_attention_mask when there is past_key_values
* Fix TFElectra extended_attention_mask when past_key_values is provided
* Add tests to other TF models
* Fix for TF Electra test: add prepare_config_and_inputs_for_decoder
* Fix not passing training arg to lm_head in TFRobertaForCausalLM
* Fix tests (with past) for TF Roberta
* add testing for past_key_values for TFElectra model

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

[examples/summarization and translation] fix readme (#15833)

Add ONNX Runtime quantization for text classification notebook (#15817)

Re-enable doctests for the quicktour (#15828)
* Re-enable doctests for the quicktour
* Re-enable doctests for task_summary (#15830)
* Remove & Framework split model report (#15825)

Add TFConvNextModel (#15750)
* feat: initial implementation of convnext in tensorflow.
* fix: sample code for the classification model.
* chore: added checked for from the classification model.
* chore: set bias initializer in the classification head.
* chore: updated license terms.
* chore: removed unused imports
* feat: enabled argument during using drop_path.
* chore: replaced tf.identity with layers.Activation(linear).
* chore: edited default checkpoint.
* fix: minor bugs in the initializations.
* partial-fix: tf model errors for loading pretrained pt weights.
* partial-fix: call method updated
* partial-fix: cross loading of weights (4x3 variables to be matched)
* chore: removed unneeded comment.
* removed playground.py
* rebasing
* rebasing and removing playground.py.
* fix: renaming TFConvNextStage conv and layer norm layers
* chore: added initializers and other minor additions.
* chore: added initializers and other minor additions.
* add: tests for convnext.
* fix: integration tester class.
* fix: issues mentioned in pr feedback (round 1).
* fix: how output_hidden_states arg is propagated inside the network.
* feat: handling of arg for pure cnn models.
* chore: added a note on equal contribution in model docs.
* rebasing
* rebasing and removing playground.py.
* feat: encapsulation for the convnext trunk.
* Fix variable naming; Test-related corrections; Run make fixup
* chore: added Joao as a contributor to convnext.
* rebasing
* rebasing and removing playground.py.
* rebasing
* rebasing and removing playground.py.
* chore: corrected copyright year and added comment on NHWC.
* chore: fixed the black version and ran formatting.
* chore: ran make style.
* chore: removed from_pt argument from test, ran make style.
* rebasing
* rebasing and removing playground.py.
* rebasing
* rebasing and removing playground.py.
* fix: tests in the convnext subclass, ran make style.
* rebasing
* rebasing and removing playground.py.
* rebasing
* rebasing and removing playground.py.
* chore: moved convnext test to the correct location
* fix: locations for the test file of convnext.
* fix: convnext tests.
* chore: applied sgugger's suggestion for dealing w/ output_attentions.
* chore: added comments.
* chore: applied updated quality environment style.
* chore: applied formatting with quality environment.
* chore: revert to the previous tests/test_modeling_common.py.
* chore: revert to the original test_modeling_common.py
* chore: revert to previous states for test_modeling_tf_common.py and modeling_tf_utils.py
* fix: tests for convnext.
* chore: removed output_attentions argument from convnext config.
* chore: revert to the earlier tf utils.
* fix: output shapes of the hidden states
* chore: removed unnecessary comment
* chore: reverting to the right test_modeling_tf_common.py.
* Styling nits

Co-authored-by: ariG23498 <aritra.born2fly@gmail.com>
Co-authored-by: Joao Gante <joao@huggingface.co>
Co-authored-by: Sylvain Gugger <Sylvain.gugger@gmail.com>

* minor changes
* doc fix in feature extractor
* doc
* typos
* removed detr logic from config
* removed detr logic from config
* removed num_labels
* small fix in the config
* auxilary -> auxiliary
* make style
* some test is failing
* fix a weird char in config preventing doc-builder
* retry to fix the doc-builder issue
* make style
* new try to fix the doc builder
* CI
* change weights to facebook

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Co-authored-by: ariG23498 <aritra.born2fly@gmail.com>
Co-authored-by: Joao Gante <joao@huggingface.co>
Co-authored-by: Sylvain Gugger <Sylvain.gugger@gmail.com>
Committed by GitHub
parent e535c389aa
commit d83d22f578
@@ -281,6 +281,7 @@ Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
1. **[LXMERT](https://huggingface.co/docs/transformers/model_doc/lxmert)** (from UNC Chapel Hill) released with the paper [LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering](https://arxiv.org/abs/1908.07490) by Hao Tan and Mohit Bansal.
1. **[M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100)** (from Facebook) released with the paper [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.
1. **[MarianMT](https://huggingface.co/docs/transformers/model_doc/marian)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team.
1. **[MaskFormer](https://huggingface.co/docs/transformers/master/model_doc/maskformer)** (from Meta and UIUC) released with the paper [Per-Pixel Classification is Not All You Need for Semantic Segmentation](https://arxiv.org/abs/2107.06278) by Bowen Cheng, Alexander G. Schwing, Alexander Kirillov.
1. **[MBart](https://huggingface.co/docs/transformers/model_doc/mbart)** (from Facebook) released with the paper [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
1. **[MBart-50](https://huggingface.co/docs/transformers/model_doc/mbart)** (from Facebook) released with the paper [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan.
1. **[Megatron-BERT](https://huggingface.co/docs/transformers/model_doc/megatron-bert)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
@@ -259,6 +259,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[LXMERT](https://huggingface.co/docs/transformers/model_doc/lxmert)** (from UNC Chapel Hill) released with the paper [LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering](https://arxiv.org/abs/1908.07490) by Hao Tan and Mohit Bansal.
1. **[M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100)** (from Facebook) released with the paper [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.
1. **[MarianMT](https://huggingface.co/docs/transformers/model_doc/marian)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team.
1. **[MaskFormer](https://huggingface.co/docs/transformers/master/model_doc/maskformer)** (from Meta and UIUC) released with the paper [Per-Pixel Classification is Not All You Need for Semantic Segmentation](https://arxiv.org/abs/2107.06278) by Bowen Cheng, Alexander G. Schwing, Alexander Kirillov.
1. **[MBart](https://huggingface.co/docs/transformers/model_doc/mbart)** (from Facebook) released with the paper [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
1. **[MBart-50](https://huggingface.co/docs/transformers/model_doc/mbart)** (from Facebook) released with the paper [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan.
1. **[Megatron-BERT](https://huggingface.co/docs/transformers/model_doc/megatron-bert)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
@@ -283,6 +283,7 @@ conda install -c huggingface transformers
1. **[LXMERT](https://huggingface.co/docs/transformers/model_doc/lxmert)** (来自 UNC Chapel Hill) 伴随论文 [LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering](https://arxiv.org/abs/1908.07490) 由 Hao Tan and Mohit Bansal 发布。
1. **[M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100)** (来自 Facebook) 伴随论文 [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) 由 Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin 发布。
1. **[MarianMT](https://huggingface.co/docs/transformers/model_doc/marian)** 用 [OPUS](http://opus.nlpl.eu/) 数据训练的机器翻译模型由 Jörg Tiedemann 发布。[Marian Framework](https://marian-nmt.github.io/) 由微软翻译团队开发。
1. **[MaskFormer](https://huggingface.co/docs/transformers/master/model_doc/maskformer)** (from Meta and UIUC) released with the paper [Per-Pixel Classification is Not All You Need for Semantic Segmentation](https://arxiv.org/abs/2107.06278) by Bowen Cheng, Alexander G. Schwing, Alexander Kirillov.
1. **[MBart](https://huggingface.co/docs/transformers/model_doc/mbart)** (来自 Facebook) 伴随论文 [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) 由 Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer 发布。
1. **[MBart-50](https://huggingface.co/docs/transformers/model_doc/mbart)** (来自 Facebook) 伴随论文 [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) 由 Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan 发布。
1. **[Megatron-BERT](https://huggingface.co/docs/transformers/model_doc/megatron-bert)** (来自 NVIDIA) 伴随论文 [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) 由 Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro 发布。
@@ -295,6 +295,7 @@ conda install -c huggingface transformers
1. **[LXMERT](https://huggingface.co/docs/transformers/model_doc/lxmert)** (from UNC Chapel Hill) released with the paper [LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering](https://arxiv.org/abs/1908.07490) by Hao Tan and Mohit Bansal.
1. **[M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100)** (from Facebook) released with the paper [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.
1. **[MarianMT](https://huggingface.co/docs/transformers/model_doc/marian)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team.
1. **[MaskFormer](https://huggingface.co/docs/transformers/master/model_doc/maskformer)** (from Meta and UIUC) released with the paper [Per-Pixel Classification is Not All You Need for Semantic Segmentation](https://arxiv.org/abs/2107.06278) by Bowen Cheng, Alexander G. Schwing, Alexander Kirillov.
1. **[MBart](https://huggingface.co/docs/transformers/model_doc/mbart)** (from Facebook) released with the paper [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
1. **[MBart-50](https://huggingface.co/docs/transformers/model_doc/mbart)** (from Facebook) released with the paper [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan.
1. **[Megatron-BERT](https://huggingface.co/docs/transformers/model_doc/megatron-bert)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
@@ -230,6 +230,8 @@
  title: LXMERT
- local: model_doc/marian
  title: MarianMT
- local: model_doc/maskformer
  title: MaskFormer
- local: model_doc/m2m_100
  title: M2M100
- local: model_doc/mbart
@@ -105,6 +105,7 @@ conversion utilities for the following models.
1. **[LXMERT](model_doc/lxmert)** (from UNC Chapel Hill) released with the paper [LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering](https://arxiv.org/abs/1908.07490) by Hao Tan and Mohit Bansal.
1. **[M2M100](model_doc/m2m_100)** (from Facebook) released with the paper [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.
1. **[MarianMT](model_doc/marian)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team.
1. **[MaskFormer](model_doc/maskformer)** (from Meta and UIUC) released with the paper [Per-Pixel Classification is Not All You Need for Semantic Segmentation](https://arxiv.org/abs/2107.06278) by Bowen Cheng, Alexander G. Schwing, Alexander Kirillov.
1. **[MBart](model_doc/mbart)** (from Facebook) released with the paper [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
1. **[MBart-50](model_doc/mbart)** (from Facebook) released with the paper [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan.
1. **[Megatron-BERT](model_doc/megatron-bert)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
@@ -209,6 +210,7 @@ Flax), PyTorch, and/or TensorFlow.
| LXMERT | ✅ | ✅ | ✅ | ✅ | ❌ |
| M2M100 | ✅ | ❌ | ✅ | ❌ | ❌ |
| Marian | ✅ | ❌ | ✅ | ✅ | ✅ |
| MaskFormer | ❌ | ❌ | ✅ | ❌ | ❌ |
| mBART | ✅ | ✅ | ✅ | ✅ | ✅ |
| MegatronBert | ❌ | ❌ | ✅ | ❌ | ❌ |
| MobileBERT | ✅ | ✅ | ✅ | ✅ | ❌ |
71
docs/source/model_doc/maskformer.mdx
Normal file
@@ -0,0 +1,71 @@
<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# MaskFormer

<Tip>

This is a recently introduced model, so the API hasn't been tested extensively. There may be some bugs, or slight
breaking changes made to fix them in the future. If you see something strange, file a [GitHub Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title).

</Tip>

## Overview

The MaskFormer model was proposed in [Per-Pixel Classification is Not All You Need for Semantic Segmentation](https://arxiv.org/abs/2107.06278) by Bowen Cheng, Alexander G. Schwing, Alexander Kirillov. MaskFormer addresses semantic segmentation with a mask classification paradigm instead of performing classic pixel-level classification.

The abstract from the paper is the following:

*Modern approaches typically formulate semantic segmentation as a per-pixel classification task, while instance-level segmentation is handled with an alternative mask classification. Our key insight: mask classification is sufficiently general to solve both semantic- and instance-level segmentation tasks in a unified manner using the exact same model, loss, and training procedure. Following this observation, we propose MaskFormer, a simple mask classification model which predicts a set of binary masks, each associated with a single global class label prediction. Overall, the proposed mask classification-based method simplifies the landscape of effective approaches to semantic and panoptic segmentation tasks and shows excellent empirical results. In particular, we observe that MaskFormer outperforms per-pixel classification baselines when the number of classes is large. Our mask classification-based method outperforms both current state-of-the-art semantic (55.6 mIoU on ADE20K) and panoptic segmentation (52.7 PQ on COCO) models.*

Tips:
- MaskFormer's Transformer decoder is identical to the decoder of [DETR](detr). During training, the authors of DETR found it helpful to use auxiliary losses in the decoder, especially to help the model output the correct number of objects of each class. If you set the parameter `use_auxiliary_loss` of [`MaskFormerConfig`] to `True`, then prediction feedforward neural networks and Hungarian losses are added after each decoder layer (with the FFNs sharing parameters).
- If you want to train the model in a distributed environment across multiple nodes, you should update the
  `get_num_masks` function inside the `MaskFormerLoss` class of `modeling_maskformer.py`. When training on multiple nodes, this should be
  set to the average number of target masks across all nodes, as can be seen in the original implementation [here](https://github.com/facebookresearch/MaskFormer/blob/da3e60d85fdeedcb31476b5edd7d328826ce56cc/mask_former/modeling/criterion.py#L169).
- You can use [`MaskFormerFeatureExtractor`] to prepare images, and optionally targets, for the model.
- To get the final segmentation, depending on the task, you can call [`~MaskFormerFeatureExtractor.post_process_semantic_segmentation`] or [`~MaskFormerFeatureExtractor.post_process_panoptic_segmentation`]. Both tasks can be solved using the [`MaskFormerForInstanceSegmentation`] output; panoptic segmentation additionally needs an `is_thing_map` to know which instances must be merged together.
||||||
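
To make the post-processing tip more concrete, here is a minimal, framework-free sketch of how mask-classification outputs can be combined into a semantic map, following the inference rule described in the MaskFormer paper: each query's class probabilities (with the trailing "no object" class dropped) weight that query's per-pixel mask probabilities, and every pixel takes the highest-scoring class. The function name `semantic_map` and the toy inputs are illustrative assumptions; this is not the code of `post_process_semantic_segmentation`, which works on batched tensors.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def sigmoid(value: float) -> float:
    return 1.0 / (1.0 + math.exp(-value))


def semantic_map(class_logits: List[List[float]], mask_logits: List[List[List[float]]]) -> List[List[int]]:
    """Combine per-query class and mask predictions into per-pixel class ids.

    class_logits: [num_queries][num_classes + 1], the last entry is "no object".
    mask_logits:  [num_queries][height][width].
    """
    # Drop the "no object" probability after the softmax.
    class_probs = [softmax(query)[:-1] for query in class_logits]
    num_queries = len(class_probs)
    num_classes = len(class_probs[0])
    height, width = len(mask_logits[0]), len(mask_logits[0][0])

    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # score[c] = sum over queries of p_query(c) * mask_query[y, x]
            scores = [
                sum(class_probs[q][c] * sigmoid(mask_logits[q][y][x]) for q in range(num_queries))
                for c in range(num_classes)
            ]
            row.append(max(range(num_classes), key=scores.__getitem__))
        result.append(row)
    return result


# Two queries, two real classes (+ "no object"), a 2x2 image.
class_logits = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0]]
mask_logits = [
    [[5.0, -5.0], [5.0, -5.0]],  # query 0 covers the left column
    [[-5.0, 5.0], [-5.0, 5.0]],  # query 1 covers the right column
]
segmentation = semantic_map(class_logits, mask_logits)
print(segmentation)  # [[0, 1], [0, 1]]
```

Because several queries can vote for the same class, the scores are summed over queries before the arg-max; no per-query thresholding is involved in this semantic variant.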

The figure below illustrates the architecture of MaskFormer. Taken from the [original paper](https://arxiv.org/abs/2107.06278).

<img width="600" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/maskformer_architecture.png"/>

This model was contributed by [francesco](https://huggingface.co/francesco). The original code can be found [here](https://github.com/facebookresearch/MaskFormer).
||||||
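
For the multi-node tip in the list above, the quantity to compute is small: the total number of target masks over all nodes, divided by the number of nodes, and clamped so that it never drops below one. The sketch below only simulates this with a plain list of per-node counts; the helper name `average_num_masks` is hypothetical. In a real multi-node run the sum would presumably come from a collective operation such as `torch.distributed.all_reduce`, which is an assumption about the setup rather than something this file shows.

```python
from typing import Sequence


def average_num_masks(masks_per_node: Sequence[int]) -> float:
    """Average number of target masks per node, never smaller than 1."""
    world_size = len(masks_per_node)
    total = sum(masks_per_node)  # stands in for a sum all-reduce across nodes
    # Clamp to 1 so the loss normalizer cannot become zero.
    return max(total / world_size, 1.0)


print(average_num_masks([3, 5, 4, 8]))  # 5.0
print(average_num_masks([0, 0]))  # 1.0
```

The clamp matters for batches without any target masks: without it the loss would be divided by zero.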

## MaskFormer specific outputs

[[autodoc]] models.maskformer.modeling_maskformer.MaskFormerModelOutput

[[autodoc]] models.maskformer.modeling_maskformer.MaskFormerForInstanceSegmentationOutput

## MaskFormerConfig

[[autodoc]] MaskFormerConfig

## MaskFormerFeatureExtractor

[[autodoc]] MaskFormerFeatureExtractor
    - __call__
    - encode_inputs
    - post_process_segmentation
    - post_process_semantic_segmentation
    - post_process_panoptic_segmentation

## MaskFormerModel

[[autodoc]] MaskFormerModel
    - forward

## MaskFormerForInstanceSegmentation

[[autodoc]] MaskFormerForInstanceSegmentation
    - forward
@@ -247,6 +247,7 @@ _import_structure = {
    "models.lxmert": ["LXMERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "LxmertConfig", "LxmertTokenizer"],
    "models.m2m_100": ["M2M_100_PRETRAINED_CONFIG_ARCHIVE_MAP", "M2M100Config"],
    "models.marian": ["MarianConfig"],
    "models.maskformer": ["MASKFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "MaskFormerConfig"],
    "models.mbart": ["MBartConfig"],
    "models.mbart50": [],
    "models.megatron_bert": ["MEGATRON_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "MegatronBertConfig"],
@@ -527,6 +528,7 @@ if is_vision_available():
    _import_structure["models.layoutlmv2"].append("LayoutLMv2FeatureExtractor")
    _import_structure["models.layoutlmv2"].append("LayoutLMv2Processor")
    _import_structure["models.layoutxlm"].append("LayoutXLMProcessor")
    _import_structure["models.maskformer"].append("MaskFormerFeatureExtractor")
    _import_structure["models.perceiver"].append("PerceiverFeatureExtractor")
    _import_structure["models.poolformer"].append("PoolFormerFeatureExtractor")
    _import_structure["models.segformer"].append("SegformerFeatureExtractor")
@@ -1147,6 +1149,14 @@ if is_torch_available():
        ]
    )
    _import_structure["models.marian"].extend(["MarianForCausalLM", "MarianModel", "MarianMTModel"])
    _import_structure["models.maskformer"].extend(
        [
            "MASKFORMER_PRETRAINED_MODEL_ARCHIVE_LIST",
            "MaskFormerForInstanceSegmentation",
            "MaskFormerModel",
            "MaskFormerPreTrainedModel",
        ]
    )
    _import_structure["models.mbart"].extend(
        [
            "MBartForCausalLM",
@@ -2532,6 +2542,7 @@ if TYPE_CHECKING:
    from .models.lxmert import LXMERT_PRETRAINED_CONFIG_ARCHIVE_MAP, LxmertConfig, LxmertTokenizer
    from .models.m2m_100 import M2M_100_PRETRAINED_CONFIG_ARCHIVE_MAP, M2M100Config
    from .models.marian import MarianConfig
    from .models.maskformer import MASKFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, MaskFormerConfig
    from .models.mbart import MBartConfig
    from .models.megatron_bert import MEGATRON_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, MegatronBertConfig
    from .models.mmbt import MMBTConfig
@@ -2763,6 +2774,7 @@ if TYPE_CHECKING:
        from .models.imagegpt import ImageGPTFeatureExtractor
        from .models.layoutlmv2 import LayoutLMv2FeatureExtractor, LayoutLMv2Processor
        from .models.layoutxlm import LayoutXLMProcessor
        from .models.maskformer import MaskFormerFeatureExtractor
        from .models.perceiver import PerceiverFeatureExtractor
        from .models.poolformer import PoolFormerFeatureExtractor
        from .models.segformer import SegformerFeatureExtractor
@@ -3273,6 +3285,12 @@ if TYPE_CHECKING:
            M2M100PreTrainedModel,
        )
        from .models.marian import MarianForCausalLM, MarianModel, MarianMTModel
        from .models.maskformer import (
            MASKFORMER_PRETRAINED_MODEL_ARCHIVE_LIST,
            MaskFormerForInstanceSegmentation,
            MaskFormerModel,
            MaskFormerPreTrainedModel,
        )
        from .models.mbart import (
            MBartForCausalLM,
            MBartForConditionalGeneration,
@@ -70,6 +70,7 @@ from . import (
    lxmert,
    m2m_100,
    marian,
    maskformer,
    mbart,
    mbart50,
    megatron_bert,
@@ -30,6 +30,7 @@ logger = logging.get_logger(__name__)
CONFIG_MAPPING_NAMES = OrderedDict(
    [
        # Add configs here
        ("maskformer", "MaskFormerConfig"),
        ("poolformer", "PoolFormerConfig"),
        ("convnext", "ConvNextConfig"),
        ("yoso", "YosoConfig"),
@@ -129,6 +130,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
    [
        # Add archive maps here
        ("maskformer", "MASKFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("poolformer", "POOLFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("convnext", "CONVNEXT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("yoso", "YOSO_PRETRAINED_CONFIG_ARCHIVE_MAP"),
@@ -215,6 +217,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
MODEL_NAMES_MAPPING = OrderedDict(
    [
        # Add full (and cased) model names here
        ("maskformer", "MaskFormer"),
        ("poolformer", "PoolFormer"),
        ("convnext", "ConvNext"),
        ("yoso", "YOSO"),
@@ -28,6 +28,7 @@ logger = logging.get_logger(__name__)
MODEL_MAPPING_NAMES = OrderedDict(
    [
        # Base model mapping
        ("maskformer", "MaskFormerModel"),
        ("poolformer", "PoolFormerModel"),
        ("convnext", "ConvNextModel"),
        ("yoso", "YosoModel"),
|||||||
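
The hunks above register MaskFormer under the same `"maskformer"` key in several name mappings, each of which stores class names as plain strings. A minimal sketch of why the shared key matters is given below: going from a configuration class name to the matching base model name is a lookup through that key. The two toy dictionaries and the helper `model_class_name_for` are hypothetical; how the auto classes resolve these names internally is not shown in this diff.

```python
from collections import OrderedDict

# Toy copies of two mappings, keyed by model type (values are class *names*).
CONFIG_NAMES = OrderedDict(
    [
        ("maskformer", "MaskFormerConfig"),
        ("poolformer", "PoolFormerConfig"),
    ]
)
MODEL_NAMES = OrderedDict(
    [
        ("maskformer", "MaskFormerModel"),
        ("poolformer", "PoolFormerModel"),
    ]
)


def model_class_name_for(config_class_name: str) -> str:
    """Find the base model name registered under the same model type."""
    for model_type, name in CONFIG_NAMES.items():
        if name == config_class_name:
            return MODEL_NAMES[model_type]
    raise KeyError(f"No model type registered for {config_class_name}")


print(model_class_name_for("MaskFormerConfig"))  # MaskFormerModel
```

Keeping only strings in the mappings means that registering a model does not require importing its modeling code.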
56
src/transformers/models/maskformer/__init__.py
Normal file
@@ -0,0 +1,56 @@
# flake8: noqa
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.

# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...file_utils import _LazyModule, is_torch_available, is_vision_available


_import_structure = {
    "configuration_maskformer": ["MASKFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "MaskFormerConfig"],
}

if is_vision_available():
    _import_structure["feature_extraction_maskformer"] = ["MaskFormerFeatureExtractor"]


if is_torch_available():
    _import_structure["modeling_maskformer"] = [
        "MASKFORMER_PRETRAINED_MODEL_ARCHIVE_LIST",
        "MaskFormerForInstanceSegmentation",
        "MaskFormerModel",
        "MaskFormerPreTrainedModel",
    ]

if TYPE_CHECKING:
    from .configuration_maskformer import MASKFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, MaskFormerConfig

    if is_vision_available():
        from .feature_extraction_maskformer import MaskFormerFeatureExtractor
    if is_torch_available():
        from .modeling_maskformer import (
            MASKFORMER_PRETRAINED_MODEL_ARCHIVE_LIST,
            MaskFormerForInstanceSegmentation,
            MaskFormerModel,
            MaskFormerPreTrainedModel,
        )


else:
    import sys

    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure)
||||||
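
The `__init__.py` above declares which names live in which submodule and hands that table to `_LazyModule`, so heavy submodules are imported only when one of their names is first used. The sketch below illustrates the general idea with the standard library only. `LazyNamespace` is a hypothetical stand-in, not the actual `_LazyModule` implementation, and it resolves names from `json` and `math` instead of real submodules.

```python
import importlib
from types import ModuleType
from typing import Dict, List


class LazyNamespace(ModuleType):
    """Import the module that owns a name only when the name is first accessed."""

    def __init__(self, name: str, import_structure: Dict[str, List[str]]):
        super().__init__(name)
        # Invert {module: [names]} into {name: module} for direct lookup.
        self._name_to_module = {
            attr: module for module, attrs in import_structure.items() for attr in attrs
        }
        self.loaded: List[str] = []

    def __getattr__(self, attr: str):
        # Python calls __getattr__ only when normal attribute lookup fails,
        # so this runs at most once per name thanks to the setattr below.
        if attr not in self._name_to_module:
            raise AttributeError(f"module {self.__name__!r} has no attribute {attr!r}")
        module = importlib.import_module(self._name_to_module[attr])
        value = getattr(module, attr)
        setattr(self, attr, value)  # cache for subsequent accesses
        self.loaded.append(attr)
        return value


lazy = LazyNamespace("demo", {"json": ["dumps", "loads"], "math": ["sqrt"]})
print(lazy.loaded)  # []
print(lazy.sqrt(9.0))  # 3.0
print(lazy.loaded)  # ['sqrt']
```

The `if TYPE_CHECKING:` branch in the real file exists for the opposite audience: static type checkers see ordinary imports, while the running interpreter takes the lazy `else:` branch.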
197
src/transformers/models/maskformer/configuration_maskformer.py
Normal file
@@ -0,0 +1,197 @@
# coding=utf-8
# Copyright 2022 Meta Platforms, Inc. and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" MaskFormer model configuration"""
import copy
from typing import Dict, Optional

from ...configuration_utils import PretrainedConfig
from ...utils import logging
from ..auto.configuration_auto import AutoConfig
from ..detr import DetrConfig
from ..swin import SwinConfig


MASKFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP = {
    "facebook/maskformer-swin-base-ade": "https://huggingface.co/facebook/maskformer-swin-base-ade/blob/main/config.json"
    # See all MaskFormer models at https://huggingface.co/models?filter=maskformer
}

logger = logging.get_logger(__name__)


class MaskFormerConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`MaskFormerModel`]. It is used to instantiate a
    MaskFormer model according to the specified arguments, defining the model architecture. Instantiating a
    configuration with the defaults will yield a similar configuration to that of the
    "facebook/maskformer-swin-base-ade" architecture trained on
    [ADE20k-150](https://huggingface.co/datasets/scene_parse_150).

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Currently, MaskFormer only supports the [Swin Transformer](swin) as backbone.

    Args:
        fpn_feature_size (`int`, *optional*, defaults to 256):
            The feature pyramid network's features size.
        mask_feature_size (`int`, *optional*, defaults to 256):
            The masks' features size, this value will also be used to specify the Feature Pyramid Network features'
            size.
        no_object_weight (`float`, *optional*, defaults to 0.1):
            Weight to apply to the null (no object) class.
        use_auxiliary_loss (`bool`, *optional*, defaults to `False`):
            If `True`, [`MaskFormerForInstanceSegmentationOutput`] will contain the auxiliary losses computed using the
            logits from each decoder's stage.
        backbone_config (`Dict`, *optional*):
            The configuration passed to the backbone. If unset, the configuration corresponding to
            `swin-base-patch4-window12-384` will be used.
        decoder_config (`Dict`, *optional*):
            The configuration passed to the transformer decoder model. If unset, the base config for `detr-resnet-50`
            will be used.
        init_std (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        init_xavier_std (`float`, *optional*, defaults to 1):
            The scaling factor used for the Xavier initialization gain in the HM Attention map module.
        dice_weight (`float`, *optional*, defaults to 1.0):
            The weight for the dice loss.
        cross_entropy_weight (`float`, *optional*, defaults to 1.0):
            The weight for the cross entropy loss.
        mask_weight (`float`, *optional*, defaults to 20.0):
            The weight for the mask loss.

    Raises:
        `ValueError`:
            Raised if the backbone model type selected is not in `["swin"]` or the decoder model type selected is not
            in `["detr"]`

    Examples:

    ```python
    >>> from transformers import MaskFormerConfig, MaskFormerModel

    >>> # Initializing a MaskFormer facebook/maskformer-swin-base-ade configuration
    >>> configuration = MaskFormerConfig()

    >>> # Initializing a model from the facebook/maskformer-swin-base-ade style configuration
    >>> model = MaskFormerModel(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```

    """
    model_type = "maskformer"
    attribute_map = {"hidden_size": "mask_feature_size"}
    backbones_supported = ["swin"]
    decoders_supported = ["detr"]

    def __init__(
        self,
        fpn_feature_size: int = 256,
        mask_feature_size: int = 256,
        no_object_weight: float = 0.1,
        use_auxiliary_loss: bool = False,
        backbone_config: Optional[Dict] = None,
        decoder_config: Optional[Dict] = None,
        init_std: float = 0.02,
        init_xavier_std: float = 1.0,
        dice_weight: float = 1.0,
        cross_entropy_weight: float = 1.0,
        mask_weight: float = 20.0,
        **kwargs,
    ):
        if backbone_config is None:
            # fall back to https://huggingface.co/microsoft/swin-base-patch4-window12-384-in22k
            backbone_config = SwinConfig(
                image_size=384,
                in_channels=3,
                patch_size=4,
                embed_dim=128,
                depths=[2, 2, 18, 2],
                num_heads=[4, 8, 16, 32],
                window_size=12,
                drop_path_rate=0.3,
            )
        else:
            backbone_model_type = backbone_config.pop("model_type")
            if backbone_model_type not in self.backbones_supported:
                raise ValueError(
                    f"Backbone {backbone_model_type} not supported, please use one of {','.join(self.backbones_supported)}"
                )
            backbone_config = AutoConfig.for_model(backbone_model_type, **backbone_config)

        if decoder_config is None:
            # fall back to https://huggingface.co/facebook/detr-resnet-50
            decoder_config = DetrConfig()
        else:
            decoder_type = decoder_config.pop("model_type")
            if decoder_type not in self.decoders_supported:
                raise ValueError(
                    f"Transformer Decoder {decoder_type} not supported, please use one of {','.join(self.decoders_supported)}"
                )
            decoder_config = AutoConfig.for_model(decoder_type, **decoder_config)

        self.backbone_config = backbone_config
        self.decoder_config = decoder_config
        # main feature dimension for the model
        self.fpn_feature_size = fpn_feature_size
        self.mask_feature_size = mask_feature_size
        # initializer
        self.init_std = init_std
        self.init_xavier_std = init_xavier_std
        # Hungarian matcher && loss
        self.cross_entropy_weight = cross_entropy_weight
        self.dice_weight = dice_weight
        self.mask_weight = mask_weight
        self.use_auxiliary_loss = use_auxiliary_loss
        self.no_object_weight = no_object_weight

        self.num_attention_heads = self.decoder_config.encoder_attention_heads
        self.num_hidden_layers = self.decoder_config.num_hidden_layers
        super().__init__(**kwargs)

    @classmethod
    def from_backbone_and_decoder_configs(
        cls, backbone_config: PretrainedConfig, decoder_config: PretrainedConfig, **kwargs
    ):
        """Instantiate a [`MaskFormerConfig`] (or a derived class) from a pre-trained backbone model configuration and DETR model
        configuration.

            Args:
                backbone_config ([`PretrainedConfig`]):
                    The backbone configuration.
                decoder_config ([`PretrainedConfig`]):
                    The transformer decoder configuration to use.

            Returns:
                [`MaskFormerConfig`]: An instance of a configuration object
        """
        return cls(
            backbone_config=backbone_config.to_dict(),
            decoder_config=decoder_config.to_dict(),
            **kwargs,
        )

    def to_dict(self) -> Dict[str, any]:
        """
        Serializes this instance to a Python dictionary. Overrides the default [`~PretrainedConfig.to_dict`].

        Returns:
            `Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance.
        """
        output = copy.deepcopy(self.__dict__)
        output["backbone_config"] = self.backbone_config.to_dict()
        output["decoder_config"] = self.decoder_config.to_dict()
        output["model_type"] = self.__class__.model_type
        return output
||||||
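
The constructor above applies the same three steps to `backbone_config` and `decoder_config`: fall back to a default when nothing is passed, otherwise pop `model_type` from the dictionary, validate it against a list of supported types, and build the sub-configuration from the remaining keys. The sketch below shows that pattern in isolation with plain dictionaries. The helper `resolve_backbone_config` is hypothetical, and the fallback dictionary only mirrors a few of the Swin defaults listed above.

```python
from typing import Any, Dict, Optional

SUPPORTED_BACKBONES = ["swin"]


def resolve_backbone_config(backbone_config: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Return a validated backbone configuration dictionary."""
    if backbone_config is None:
        # Fallback values taken from the default Swin settings shown above.
        return {"model_type": "swin", "embed_dim": 128, "window_size": 12}

    # Copy first so that popping "model_type" does not modify the caller's dict.
    remaining = dict(backbone_config)
    model_type = remaining.pop("model_type")
    if model_type not in SUPPORTED_BACKBONES:
        raise ValueError(
            f"Backbone {model_type} not supported, please use one of {','.join(SUPPORTED_BACKBONES)}"
        )
    return {"model_type": model_type, **remaining}


user_config = {"model_type": "swin", "embed_dim": 96}
print(resolve_backbone_config(user_config))  # {'model_type': 'swin', 'embed_dim': 96}
print(user_config)  # unchanged: {'model_type': 'swin', 'embed_dim': 96}
```

One deliberate difference from the constructor: the sketch copies the dictionary before calling `pop`, whereas `backbone_config.pop("model_type")` in the code above removes the key from the dictionary that the caller passed in.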
@@ -0,0 +1,727 @@
# coding=utf-8
# Copyright 2022 Meta Platforms, Inc. and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import sys
from argparse import ArgumentParser
from dataclasses import dataclass
from pathlib import Path
from pprint import pformat
from typing import Any, Dict, Iterator, List, Set, Tuple

import torch
import torchvision.transforms as T
from PIL import Image
from torch import Tensor, nn

import requests
from detectron2.checkpoint import DetectionCheckpointer
from detectron2.config import get_cfg
from detectron2.data import MetadataCatalog
from detectron2.projects.deeplab import add_deeplab_config
from transformers.models.maskformer.feature_extraction_maskformer import MaskFormerFeatureExtractor
from transformers.models.maskformer.modeling_maskformer import (
    MaskFormerConfig,
    MaskFormerForInstanceSegmentation,
    MaskFormerForInstanceSegmentationOutput,
    MaskFormerModel,
    MaskFormerModelOutput,
)
from transformers.utils import logging


StateDict = Dict[str, Tensor]

logging.set_verbosity_info()
logger = logging.get_logger()

torch.manual_seed(0)


class TrackedStateDict:
    def __init__(self, to_track: Dict):
        """This class "tracks" a python dictionary by keeping track of which items have been set.

        Args:
            to_track (Dict): The dictionary we wish to track
        """
        self.to_track = to_track
        self._seen: Set[str] = set()

    def __getitem__(self, key: str) -> Any:
        return self.to_track[key]

    def __setitem__(self, key: str, item: Any):
        self._seen.add(key)
        self.to_track[key] = item

    def diff(self) -> List[str]:
        """This method returns the difference between the keys in the tracked state dict and the keys we have set so far.
        This is an effective way to check whether we have updated all the keys.

        Returns:
            List[str]: List of keys not yet updated
        """
        return list(set(self.to_track.keys()) - self._seen)

    def copy(self) -> Dict:
        # proxy the call to the internal dictionary
        return self.to_track.copy()


# We will verify our results on an image of cute cats
def prepare_img():
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    img_data = requests.get(url, stream=True).raw
    im = Image.open(img_data)
    return im


@dataclass
class Args:
    """Fake command line arguments needed by maskformer/detectron implementation"""

    config_file: str


def setup_cfg(args: Args):
    # load config from file and command-line arguments
    cfg = get_cfg()
    add_deeplab_config(cfg)
    add_mask_former_config(cfg)
    cfg.merge_from_file(args.config_file)
    cfg.freeze()
    return cfg


class OriginalMaskFormerConfigToOursConverter:
    def __call__(self, original_config: object) -> MaskFormerConfig:

        model = original_config.MODEL
        mask_former = model.MASK_FORMER
        swin = model.SWIN

        dataset_catalog = MetadataCatalog.get(original_config.DATASETS.TEST[0])
        id2label = {idx: label for idx, label in enumerate(dataset_catalog.stuff_classes)}
        label2id = {label: idx for idx, label in id2label.items()}

        config: MaskFormerConfig = MaskFormerConfig(
            fpn_feature_size=model.SEM_SEG_HEAD.CONVS_DIM,
            mask_feature_size=model.SEM_SEG_HEAD.MASK_DIM,
            num_labels=model.SEM_SEG_HEAD.NUM_CLASSES,
            no_object_weight=mask_former.NO_OBJECT_WEIGHT,
            num_queries=mask_former.NUM_OBJECT_QUERIES,
            backbone_config=dict(
                pretrain_img_size=swin.PRETRAIN_IMG_SIZE,
                image_size=swin.PRETRAIN_IMG_SIZE,
                in_channels=3,
                patch_size=swin.PATCH_SIZE,
                embed_dim=swin.EMBED_DIM,
                depths=swin.DEPTHS,
                num_heads=swin.NUM_HEADS,
                window_size=swin.WINDOW_SIZE,
                drop_path_rate=swin.DROP_PATH_RATE,
                model_type="swin",
            ),
            dice_weight=mask_former.DICE_WEIGHT,
            cross_entropy_weight=1.0,
            mask_weight=mask_former.MASK_WEIGHT,
            decoder_config=dict(
                model_type="detr",
                max_position_embeddings=1024,
                encoder_layers=6,
                encoder_ffn_dim=2048,
                encoder_attention_heads=8,
                decoder_layers=mask_former.DEC_LAYERS,
                decoder_ffn_dim=mask_former.DIM_FEEDFORWARD,
                decoder_attention_heads=mask_former.NHEADS,
                encoder_layerdrop=0.0,
                decoder_layerdrop=0.0,
                d_model=mask_former.HIDDEN_DIM,
                dropout=mask_former.DROPOUT,
                attention_dropout=0.0,
                activation_dropout=0.0,
                init_std=0.02,
                init_xavier_std=1.0,
                scale_embedding=False,
                auxiliary_loss=False,
                dilation=False,
                # default pretrained config values
            ),
            id2label=id2label,
            label2id=label2id,
        )

        return config


class OriginalMaskFormerConfigToFeatureExtractorConverter:
    def __call__(self, original_config: object) -> MaskFormerFeatureExtractor:
        model = original_config.MODEL
        model_input = original_config.INPUT

        return MaskFormerFeatureExtractor(
            image_mean=(torch.tensor(model.PIXEL_MEAN) / 255).tolist(),
            image_std=(torch.tensor(model.PIXEL_STD) / 255).tolist(),
            size=model_input.MIN_SIZE_TEST,
            max_size=model_input.MAX_SIZE_TEST,
            size_divisibility=32,  # 32 is required by swin
        )


|
class OriginalMaskFormerCheckpointToOursConverter:
    def __init__(self, original_model: nn.Module, config: MaskFormerConfig):
        self.original_model = original_model
        self.config = config

    def pop_all(self, renamed_keys: List[Tuple[str, str]], dst_state_dict: StateDict, src_state_dict: StateDict):
        for (src_key, dst_key) in renamed_keys:
            dst_state_dict[dst_key] = src_state_dict.pop(src_key)

    def replace_backbone(self, dst_state_dict: StateDict, src_state_dict: StateDict, config: MaskFormerConfig):
        dst_prefix: str = "pixel_level_module.encoder"
        src_prefix: str = "backbone"

        renamed_keys = [
            (
                f"{src_prefix}.patch_embed.proj.weight",
                f"{dst_prefix}.model.embeddings.patch_embeddings.projection.weight",
            ),
            (f"{src_prefix}.patch_embed.proj.bias", f"{dst_prefix}.model.embeddings.patch_embeddings.projection.bias"),
            (f"{src_prefix}.patch_embed.norm.weight", f"{dst_prefix}.model.embeddings.norm.weight"),
            (f"{src_prefix}.patch_embed.norm.bias", f"{dst_prefix}.model.embeddings.norm.bias"),
        ]
        num_layers = len(config.backbone_config.depths)
        for layer_idx in range(num_layers):
            for block_idx in range(config.backbone_config.depths[layer_idx]):
                renamed_keys.extend(
                    [  # src, dst
                        (
                            f"{src_prefix}.layers.{layer_idx}.blocks.{block_idx}.norm1.weight",
                            f"{dst_prefix}.model.encoder.layers.{layer_idx}.blocks.{block_idx}.layernorm_before.weight",
                        ),
                        (
                            f"{src_prefix}.layers.{layer_idx}.blocks.{block_idx}.norm1.bias",
                            f"{dst_prefix}.model.encoder.layers.{layer_idx}.blocks.{block_idx}.layernorm_before.bias",
                        ),
                        (
                            f"{src_prefix}.layers.{layer_idx}.blocks.{block_idx}.attn.relative_position_bias_table",
                            f"{dst_prefix}.model.encoder.layers.{layer_idx}.blocks.{block_idx}.attention.self.relative_position_bias_table",
                        ),
                    ]
                )
                # now we need to handle the attentions
                # read in the fused qkv weights + bias of the self-attention input projection

                src_att_weight = src_state_dict[f"{src_prefix}.layers.{layer_idx}.blocks.{block_idx}.attn.qkv.weight"]
                src_att_bias = src_state_dict[f"{src_prefix}.layers.{layer_idx}.blocks.{block_idx}.attn.qkv.bias"]

                # the fused tensor stacks query, key and value (in that order) along dim 0
                size = src_att_weight.shape[0]
                offset = size // 3
                dst_state_dict[
                    f"{dst_prefix}.model.encoder.layers.{layer_idx}.blocks.{block_idx}.attention.self.query.weight"
                ] = src_att_weight[:offset, :]
                dst_state_dict[
                    f"{dst_prefix}.model.encoder.layers.{layer_idx}.blocks.{block_idx}.attention.self.query.bias"
                ] = src_att_bias[:offset]

                dst_state_dict[
                    f"{dst_prefix}.model.encoder.layers.{layer_idx}.blocks.{block_idx}.attention.self.key.weight"
                ] = src_att_weight[offset : offset * 2, :]
                dst_state_dict[
                    f"{dst_prefix}.model.encoder.layers.{layer_idx}.blocks.{block_idx}.attention.self.key.bias"
                ] = src_att_bias[offset : offset * 2]

                dst_state_dict[
                    f"{dst_prefix}.model.encoder.layers.{layer_idx}.blocks.{block_idx}.attention.self.value.weight"
                ] = src_att_weight[-offset:, :]
                dst_state_dict[
                    f"{dst_prefix}.model.encoder.layers.{layer_idx}.blocks.{block_idx}.attention.self.value.bias"
                ] = src_att_bias[-offset:]

                # let's pop them
                src_state_dict.pop(f"{src_prefix}.layers.{layer_idx}.blocks.{block_idx}.attn.qkv.weight")
                src_state_dict.pop(f"{src_prefix}.layers.{layer_idx}.blocks.{block_idx}.attn.qkv.bias")
                # proj
                renamed_keys.extend(
                    [
                        (
                            f"{src_prefix}.layers.{layer_idx}.blocks.{block_idx}.attn.proj.weight",
                            f"{dst_prefix}.model.encoder.layers.{layer_idx}.blocks.{block_idx}.attention.output.dense.weight",
                        ),
                        (
                            f"{src_prefix}.layers.{layer_idx}.blocks.{block_idx}.attn.proj.bias",
                            f"{dst_prefix}.model.encoder.layers.{layer_idx}.blocks.{block_idx}.attention.output.dense.bias",
                        ),
                    ]
                )

                # second norm
                renamed_keys.extend(
                    [
                        (
                            f"{src_prefix}.layers.{layer_idx}.blocks.{block_idx}.norm2.weight",
                            f"{dst_prefix}.model.encoder.layers.{layer_idx}.blocks.{block_idx}.layernorm_after.weight",
                        ),
                        (
                            f"{src_prefix}.layers.{layer_idx}.blocks.{block_idx}.norm2.bias",
                            f"{dst_prefix}.model.encoder.layers.{layer_idx}.blocks.{block_idx}.layernorm_after.bias",
                        ),
                    ]
                )

                # mlp
                renamed_keys.extend(
                    [
                        (
                            f"{src_prefix}.layers.{layer_idx}.blocks.{block_idx}.mlp.fc1.weight",
                            f"{dst_prefix}.model.encoder.layers.{layer_idx}.blocks.{block_idx}.intermediate.dense.weight",
                        ),
                        (
                            f"{src_prefix}.layers.{layer_idx}.blocks.{block_idx}.mlp.fc1.bias",
                            f"{dst_prefix}.model.encoder.layers.{layer_idx}.blocks.{block_idx}.intermediate.dense.bias",
                        ),
                        (
                            f"{src_prefix}.layers.{layer_idx}.blocks.{block_idx}.mlp.fc2.weight",
                            f"{dst_prefix}.model.encoder.layers.{layer_idx}.blocks.{block_idx}.output.dense.weight",
                        ),
                        (
                            f"{src_prefix}.layers.{layer_idx}.blocks.{block_idx}.mlp.fc2.bias",
                            f"{dst_prefix}.model.encoder.layers.{layer_idx}.blocks.{block_idx}.output.dense.bias",
                        ),
                    ]
                )

                renamed_keys.extend(
                    [
                        (
                            f"{src_prefix}.layers.{layer_idx}.blocks.{block_idx}.attn.relative_position_index",
                            f"{dst_prefix}.model.encoder.layers.{layer_idx}.blocks.{block_idx}.attention.self.relative_position_index",
                        )
                    ]
                )

            if layer_idx < num_layers - 1:
                # patch merging
                renamed_keys.extend(
                    [
                        (
                            f"{src_prefix}.layers.{layer_idx}.downsample.reduction.weight",
                            f"{dst_prefix}.model.encoder.layers.{layer_idx}.downsample.reduction.weight",
                        ),
                        (
                            f"{src_prefix}.layers.{layer_idx}.downsample.norm.weight",
                            f"{dst_prefix}.model.encoder.layers.{layer_idx}.downsample.norm.weight",
                        ),
                        (
                            f"{src_prefix}.layers.{layer_idx}.downsample.norm.bias",
                            f"{dst_prefix}.model.encoder.layers.{layer_idx}.downsample.norm.bias",
                        ),
                    ]
                )

            # hidden states norms
            renamed_keys.extend(
                [
                    (
                        f"{src_prefix}.norm{layer_idx}.weight",
                        f"{dst_prefix}.hidden_states_norms.{layer_idx}.weight",
                    ),
                    (
                        f"{src_prefix}.norm{layer_idx}.bias",
                        f"{dst_prefix}.hidden_states_norms.{layer_idx}.bias",
                    ),
                ]
            )
        self.pop_all(renamed_keys, dst_state_dict, src_state_dict)

    def replace_pixel_module(self, dst_state_dict: StateDict, src_state_dict: StateDict):
        dst_prefix: str = "pixel_level_module.decoder"
        src_prefix: str = "sem_seg_head.pixel_decoder"

        self.replace_backbone(dst_state_dict, src_state_dict, self.config)

        def rename_keys_for_conv(detectron_conv: str, mine_conv: str):
            return [
                (f"{detectron_conv}.weight", f"{mine_conv}.0.weight"),
                # index 1 is the norm layer; the activation that follows it has no parameters to rename
                (f"{detectron_conv}.norm.weight", f"{mine_conv}.1.weight"),
                (f"{detectron_conv}.norm.bias", f"{mine_conv}.1.bias"),
            ]

        renamed_keys = [
            (f"{src_prefix}.mask_features.weight", f"{dst_prefix}.mask_projection.weight"),
            (f"{src_prefix}.mask_features.bias", f"{dst_prefix}.mask_projection.bias"),
            # the layers in the original one are in reverse order, stem is the last one!
        ]

        renamed_keys.extend(rename_keys_for_conv(f"{src_prefix}.layer_4", f"{dst_prefix}.fpn.stem"))

        # add all the fpn layers (here we need some config parameters to know the size in advance)
        for src_i, dst_i in zip(range(3, 0, -1), range(0, 3)):
            renamed_keys.extend(
                rename_keys_for_conv(f"{src_prefix}.adapter_{src_i}", f"{dst_prefix}.fpn.layers.{dst_i}.proj")
            )
            renamed_keys.extend(
                rename_keys_for_conv(f"{src_prefix}.layer_{src_i}", f"{dst_prefix}.fpn.layers.{dst_i}.block")
            )

        self.pop_all(renamed_keys, dst_state_dict, src_state_dict)

    def rename_keys_in_detr_decoder(self, dst_state_dict: StateDict, src_state_dict: StateDict):
        dst_prefix: str = "transformer_module.decoder"
        src_prefix: str = "sem_seg_head.predictor.transformer.decoder"
        # not sure why we are not popping directly here!
        # here we list all keys to be renamed (original name on the left, our name on the right)
        rename_keys = []
        for i in range(self.config.decoder_config.decoder_layers):
            # decoder layers: 2 times output projection, 2 feedforward neural networks and 3 layernorms
            rename_keys.append(
                (
                    f"{src_prefix}.layers.{i}.self_attn.out_proj.weight",
                    f"{dst_prefix}.layers.{i}.self_attn.out_proj.weight",
                )
            )
            rename_keys.append(
                (
                    f"{src_prefix}.layers.{i}.self_attn.out_proj.bias",
                    f"{dst_prefix}.layers.{i}.self_attn.out_proj.bias",
                )
            )
            rename_keys.append(
                (
                    f"{src_prefix}.layers.{i}.multihead_attn.out_proj.weight",
                    f"{dst_prefix}.layers.{i}.encoder_attn.out_proj.weight",
                )
            )
            rename_keys.append(
                (
                    f"{src_prefix}.layers.{i}.multihead_attn.out_proj.bias",
                    f"{dst_prefix}.layers.{i}.encoder_attn.out_proj.bias",
                )
            )
            rename_keys.append((f"{src_prefix}.layers.{i}.linear1.weight", f"{dst_prefix}.layers.{i}.fc1.weight"))
            rename_keys.append((f"{src_prefix}.layers.{i}.linear1.bias", f"{dst_prefix}.layers.{i}.fc1.bias"))
            rename_keys.append((f"{src_prefix}.layers.{i}.linear2.weight", f"{dst_prefix}.layers.{i}.fc2.weight"))
            rename_keys.append((f"{src_prefix}.layers.{i}.linear2.bias", f"{dst_prefix}.layers.{i}.fc2.bias"))
            rename_keys.append(
                (f"{src_prefix}.layers.{i}.norm1.weight", f"{dst_prefix}.layers.{i}.self_attn_layer_norm.weight")
            )
            rename_keys.append(
                (f"{src_prefix}.layers.{i}.norm1.bias", f"{dst_prefix}.layers.{i}.self_attn_layer_norm.bias")
            )
            rename_keys.append(
                (f"{src_prefix}.layers.{i}.norm2.weight", f"{dst_prefix}.layers.{i}.encoder_attn_layer_norm.weight")
            )
            rename_keys.append(
                (f"{src_prefix}.layers.{i}.norm2.bias", f"{dst_prefix}.layers.{i}.encoder_attn_layer_norm.bias")
            )
            rename_keys.append(
                (f"{src_prefix}.layers.{i}.norm3.weight", f"{dst_prefix}.layers.{i}.final_layer_norm.weight")
            )
            rename_keys.append(
                (f"{src_prefix}.layers.{i}.norm3.bias", f"{dst_prefix}.layers.{i}.final_layer_norm.bias")
            )

        return rename_keys

    def replace_q_k_v_in_detr_decoder(self, dst_state_dict: StateDict, src_state_dict: StateDict):
        dst_prefix: str = "transformer_module.decoder"
        src_prefix: str = "sem_seg_head.predictor.transformer.decoder"
        # NOTE: the hard-coded 256 slices below assume the decoder hidden size (`d_model`) is 256
        for i in range(self.config.decoder_config.decoder_layers):
            # read in weights + bias of input projection layer of self-attention
            in_proj_weight = src_state_dict.pop(f"{src_prefix}.layers.{i}.self_attn.in_proj_weight")
            in_proj_bias = src_state_dict.pop(f"{src_prefix}.layers.{i}.self_attn.in_proj_bias")
            # next, add query, keys and values (in that order) to the state dict
            dst_state_dict[f"{dst_prefix}.layers.{i}.self_attn.q_proj.weight"] = in_proj_weight[:256, :]
            dst_state_dict[f"{dst_prefix}.layers.{i}.self_attn.q_proj.bias"] = in_proj_bias[:256]
            dst_state_dict[f"{dst_prefix}.layers.{i}.self_attn.k_proj.weight"] = in_proj_weight[256:512, :]
            dst_state_dict[f"{dst_prefix}.layers.{i}.self_attn.k_proj.bias"] = in_proj_bias[256:512]
            dst_state_dict[f"{dst_prefix}.layers.{i}.self_attn.v_proj.weight"] = in_proj_weight[-256:, :]
            dst_state_dict[f"{dst_prefix}.layers.{i}.self_attn.v_proj.bias"] = in_proj_bias[-256:]
            # read in weights + bias of input projection layer of cross-attention
            in_proj_weight_cross_attn = src_state_dict.pop(f"{src_prefix}.layers.{i}.multihead_attn.in_proj_weight")
            in_proj_bias_cross_attn = src_state_dict.pop(f"{src_prefix}.layers.{i}.multihead_attn.in_proj_bias")
            # next, add query, keys and values (in that order) of cross-attention to the state dict
            dst_state_dict[f"{dst_prefix}.layers.{i}.encoder_attn.q_proj.weight"] = in_proj_weight_cross_attn[:256, :]
            dst_state_dict[f"{dst_prefix}.layers.{i}.encoder_attn.q_proj.bias"] = in_proj_bias_cross_attn[:256]
            dst_state_dict[f"{dst_prefix}.layers.{i}.encoder_attn.k_proj.weight"] = in_proj_weight_cross_attn[
                256:512, :
            ]
            dst_state_dict[f"{dst_prefix}.layers.{i}.encoder_attn.k_proj.bias"] = in_proj_bias_cross_attn[256:512]
            dst_state_dict[f"{dst_prefix}.layers.{i}.encoder_attn.v_proj.weight"] = in_proj_weight_cross_attn[-256:, :]
            dst_state_dict[f"{dst_prefix}.layers.{i}.encoder_attn.v_proj.bias"] = in_proj_bias_cross_attn[-256:]

    def replace_detr_decoder(self, dst_state_dict: StateDict, src_state_dict: StateDict):
        dst_prefix: str = "transformer_module.decoder"
        src_prefix: str = "sem_seg_head.predictor.transformer.decoder"
        renamed_keys = self.rename_keys_in_detr_decoder(dst_state_dict, src_state_dict)
        # add more
        renamed_keys.extend(
            [
                (f"{src_prefix}.norm.weight", f"{dst_prefix}.layernorm.weight"),
                (f"{src_prefix}.norm.bias", f"{dst_prefix}.layernorm.bias"),
            ]
        )

        self.pop_all(renamed_keys, dst_state_dict, src_state_dict)

        self.replace_q_k_v_in_detr_decoder(dst_state_dict, src_state_dict)

    def replace_transformer_module(self, dst_state_dict: StateDict, src_state_dict: StateDict):
        dst_prefix: str = "transformer_module"
        src_prefix: str = "sem_seg_head.predictor"

        self.replace_detr_decoder(dst_state_dict, src_state_dict)

        renamed_keys = [
            (f"{src_prefix}.query_embed.weight", f"{dst_prefix}.queries_embedder.weight"),
            (f"{src_prefix}.input_proj.weight", f"{dst_prefix}.input_projection.weight"),
            (f"{src_prefix}.input_proj.bias", f"{dst_prefix}.input_projection.bias"),
        ]

        self.pop_all(renamed_keys, dst_state_dict, src_state_dict)

    def replace_instance_segmentation_module(self, dst_state_dict: StateDict, src_state_dict: StateDict):
        # NOTE: the destination has no prefix, so the keys below are built without a leading "."
        dst_prefix: str = ""
        src_prefix: str = "sem_seg_head.predictor"

        renamed_keys = [
            (f"{src_prefix}.class_embed.weight", f"{dst_prefix}class_predictor.weight"),
            (f"{src_prefix}.class_embed.bias", f"{dst_prefix}class_predictor.bias"),
        ]

        mlp_len = 3
        for i in range(mlp_len):
            renamed_keys.extend(
                [
                    (f"{src_prefix}.mask_embed.layers.{i}.weight", f"{dst_prefix}mask_embedder.{i}.0.weight"),
                    (f"{src_prefix}.mask_embed.layers.{i}.bias", f"{dst_prefix}mask_embedder.{i}.0.bias"),
                ]
            )
        logger.info(f"Replacing keys {pformat(renamed_keys)}")
        self.pop_all(renamed_keys, dst_state_dict, src_state_dict)

    def convert(self, mask_former: MaskFormerModel) -> MaskFormerModel:
        dst_state_dict = TrackedStateDict(mask_former.state_dict())
        src_state_dict = self.original_model.state_dict()

        self.replace_pixel_module(dst_state_dict, src_state_dict)
        self.replace_transformer_module(dst_state_dict, src_state_dict)

        logger.info(f"Missed keys are {pformat(dst_state_dict.diff())}")
        logger.info(f"Not copied keys are {pformat(src_state_dict.keys())}")
        logger.info("🙌 Done")

        mask_former.load_state_dict(dst_state_dict)

        return mask_former

    def convert_instance_segmentation(
        self, mask_former: MaskFormerForInstanceSegmentation
    ) -> MaskFormerForInstanceSegmentation:
        dst_state_dict = TrackedStateDict(mask_former.state_dict())
        src_state_dict = self.original_model.state_dict()

        self.replace_instance_segmentation_module(dst_state_dict, src_state_dict)

        mask_former.load_state_dict(dst_state_dict)

        return mask_former

    @staticmethod
    def using_dirs(checkpoints_dir: Path, config_dir: Path) -> Iterator[Tuple[Path, Path]]:
        # `Path.glob` returns a lazy generator, not a list
        checkpoints: Iterator[Path] = checkpoints_dir.glob("**/*.pkl")

        for checkpoint in checkpoints:
            logger.info(f"💪 Converting {checkpoint.stem}")
            # find associated config file
            config: Path = config_dir / checkpoint.parents[0].stem / "swin" / f"{checkpoint.stem}.yaml"

            yield config, checkpoint


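The converter class above rests on two operations: moving a tensor from its original key to a new key with `pop`, and slicing a fused query/key/value projection into three equal parts. The sketch below illustrates both using plain Python lists in place of tensors. The key names and the helper `split_qkv` are hypothetical and chosen only for illustration.

```python
def pop_all(renamed_keys, dst, src):
    """Move each entry from its source key to its destination key."""
    for src_key, dst_key in renamed_keys:
        dst[dst_key] = src.pop(src_key)


def split_qkv(fused_rows):
    """Split a fused [q; k; v] matrix (stacked along rows) into three equal parts."""
    if len(fused_rows) % 3 != 0:
        raise ValueError("a fused qkv matrix needs a row count divisible by 3")
    offset = len(fused_rows) // 3
    return fused_rows[:offset], fused_rows[offset : 2 * offset], fused_rows[-offset:]


# Toy "checkpoint" with hidden size 2; nested lists stand in for tensors.
src = {
    "backbone.norm.weight": [1.0, 1.0],
    "backbone.attn.qkv.weight": [[1, 0], [0, 1], [2, 0], [0, 2], [3, 0], [0, 3]],
}
dst = {}

pop_all([("backbone.norm.weight", "encoder.layernorm.weight")], dst, src)
q, k, v = split_qkv(src.pop("backbone.attn.qkv.weight"))
dst["encoder.attention.query.weight"] = q
dst["encoder.attention.key.weight"] = k
dst["encoder.attention.value.weight"] = v

# Whatever is left in `src` was not converted; here nothing should remain.
print(sorted(dst))
print(src)
```

Popping rather than copying means the source dictionary shrinks as conversion proceeds, which is what lets `convert` log the "Not copied keys" at the end.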
def test(original_model, our_model: MaskFormerForInstanceSegmentation):
    with torch.no_grad():

        original_model = original_model.eval()
        our_model = our_model.eval()

        im = prepare_img()

        tr = T.Compose(
            [
                T.Resize((384, 384)),
                T.ToTensor(),
                T.Normalize(
                    mean=torch.tensor([123.675, 116.280, 103.530]) / 255.0,
                    std=torch.tensor([58.395, 57.120, 57.375]) / 255.0,
                ),
            ],
        )

        x = tr(im).unsqueeze(0)

        original_model_backbone_features = original_model.backbone(x.clone())

        our_model_output: MaskFormerModelOutput = our_model.model(x.clone(), output_hidden_states=True)

        for original_model_feature, our_model_feature in zip(
            original_model_backbone_features.values(), our_model_output.encoder_hidden_states
        ):

            assert torch.allclose(
                original_model_feature, our_model_feature, atol=1e-3
            ), "The backbone features are not the same."

        original_model_pixel_out = original_model.sem_seg_head.pixel_decoder.forward_features(
            original_model_backbone_features
        )

        assert torch.allclose(
            original_model_pixel_out[0], our_model_output.pixel_decoder_last_hidden_state, atol=1e-4
        ), "The pixel decoder features are not the same."

        # let's test the full model
        original_model_out = original_model([{"image": x.squeeze(0)}])

        original_segmentation = original_model_out[0]["sem_seg"]

        our_model_out: MaskFormerForInstanceSegmentationOutput = our_model(x)

        feature_extractor = MaskFormerFeatureExtractor()

        our_segmentation = feature_extractor.post_process_segmentation(our_model_out, target_size=(384, 384))

        assert torch.allclose(
            original_segmentation, our_segmentation, atol=1e-3
        ), "The segmentation image is not the same."

        logger.info("✅ Test passed!")


def get_name(checkpoint_file: Path):
    model_name_raw: str = checkpoint_file.stem
    # model_name_raw is something like maskformer_panoptic_swin_base_IN21k_384_bs64_554k
    parent_name: str = checkpoint_file.parents[0].stem
    backbone = "swin"
    dataset = ""
    if "coco" in parent_name:
        dataset = "coco"
    elif "ade" in parent_name:
        dataset = "ade"
    else:
        raise ValueError(f"{parent_name} must be wrong since we didn't find 'coco' or 'ade' in it")

    backbone_types = ["tiny", "small", "base", "large"]

    backbone_type = list(filter(lambda x: x in model_name_raw, backbone_types))[0]

    model_name = f"maskformer-{backbone}-{backbone_type}-{dataset}"

    return model_name


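To make the naming scheme of `get_name` concrete, here is a hedged standalone sketch of the same derivation. The function name `derive_model_name` and the directory names `coco-panoptic` and `ade20k` are assumptions for illustration; the checkpoint file name comes from the comment in `get_name`. Unlike the original, this variant raises a `ValueError` instead of an `IndexError` when no backbone size is found.

```python
from pathlib import Path


def derive_model_name(checkpoint_file: Path) -> str:
    """Build "maskformer-swin-<size>-<dataset>" from a checkpoint path."""
    parent_name = checkpoint_file.parents[0].stem
    for dataset in ("coco", "ade"):
        if dataset in parent_name:
            break
    else:
        raise ValueError(f"no known dataset found in {parent_name!r}")

    for size in ("tiny", "small", "base", "large"):
        if size in checkpoint_file.stem:
            return f"maskformer-swin-{size}-{dataset}"
    raise ValueError(f"no known backbone size found in {checkpoint_file.stem!r}")


coco_name = derive_model_name(
    Path("checkpoints/coco-panoptic/maskformer_panoptic_swin_base_IN21k_384_bs64_554k.pkl")
)
ade_name = derive_model_name(Path("checkpoints/ade20k/maskformer_swin_tiny_bs16_160k.pkl"))
print(coco_name)
print(ade_name)
```

Both versions rely on substring matching, so a directory or file name that happens to contain one of the keywords would be matched as well.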
if __name__ == "__main__":

    parser = ArgumentParser(
        description="Command line to convert the original maskformers (with swin backbone) to our implementations."
    )

    parser.add_argument(
        "--checkpoints_dir",
        type=Path,
        help="A directory containing the model's checkpoints. The directory has to have the following structure: <DIR_NAME>/<DATASET_NAME>/<CONFIG_NAME>.pkl",
    )
    parser.add_argument(
        "--configs_dir",
        type=Path,
        help="A directory containing the model's configs, see detectron2 doc. The directory has to have the following structure: <DIR_NAME>/<DATASET_NAME>/swin/<CONFIG_NAME>.yaml",
    )
    parser.add_argument(
        "--pytorch_dump_folder_path",
        required=True,
        type=Path,
        help="Path to the folder to output PyTorch models.",
    )
    parser.add_argument(
        "--maskformer_dir",
        required=True,
        type=Path,
        help="A path to MaskFormer's original implementation directory. You can download from here: https://github.com/facebookresearch/MaskFormer",
    )

    args = parser.parse_args()

    checkpoints_dir: Path = args.checkpoints_dir
    config_dir: Path = args.configs_dir
    save_directory: Path = args.pytorch_dump_folder_path
    maskformer_dir: Path = args.maskformer_dir
    # add the parent of the MaskFormer directory to sys.path
    sys.path.append(str(maskformer_dir.parent))
    # and import what's needed
    from MaskFormer.mask_former import add_mask_former_config
    from MaskFormer.mask_former.mask_former_model import MaskFormer as OriginalMaskFormer

    if not save_directory.exists():
        save_directory.mkdir(parents=True)

    for config_file, checkpoint_file in OriginalMaskFormerCheckpointToOursConverter.using_dirs(
        checkpoints_dir, config_dir
    ):

        feature_extractor = OriginalMaskFormerConfigToFeatureExtractorConverter()(
            setup_cfg(Args(config_file=config_file))
        )

        original_config = setup_cfg(Args(config_file=config_file))
        mask_former_kwargs = OriginalMaskFormer.from_config(original_config)

        original_model = OriginalMaskFormer(**mask_former_kwargs).eval()

        DetectionCheckpointer(original_model).load(str(checkpoint_file))

        config: MaskFormerConfig = OriginalMaskFormerConfigToOursConverter()(original_config)

        mask_former = MaskFormerModel(config=config).eval()

        converter = OriginalMaskFormerCheckpointToOursConverter(original_model, config)

        maskformer = converter.convert(mask_former)

        mask_former_for_instance_segmentation = MaskFormerForInstanceSegmentation(config=config).eval()

        mask_former_for_instance_segmentation.model = mask_former
        mask_former_for_instance_segmentation = converter.convert_instance_segmentation(
            mask_former_for_instance_segmentation
        )

        test(original_model, mask_former_for_instance_segmentation)

        model_name = get_name(checkpoint_file)
        logger.info(f"🪄 Saving {model_name}")

        feature_extractor.save_pretrained(save_directory / model_name)
        mask_former_for_instance_segmentation.save_pretrained(save_directory / model_name)

        feature_extractor.push_to_hub(
            repo_path_or_name=save_directory / model_name,
            commit_message="Add model",
            use_temp_dir=True,
        )
        mask_former_for_instance_segmentation.push_to_hub(
            repo_path_or_name=save_directory / model_name,
            commit_message="Add model",
            use_temp_dir=True,
        )
@@ -0,0 +1,569 @@
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Feature extractor class for MaskFormer."""

from typing import TYPE_CHECKING, Dict, List, Optional, Tuple, Union

import numpy as np
from PIL import Image

from ...feature_extraction_utils import BatchFeature, FeatureExtractionMixin
from ...file_utils import TensorType, is_torch_available
from ...image_utils import ImageFeatureExtractionMixin, ImageInput, is_torch_tensor
from ...utils import logging


if is_torch_available():
    import torch
    from torch import Tensor, nn
    from torch.nn.functional import interpolate

if TYPE_CHECKING:
    from transformers.models.maskformer.modeling_maskformer import MaskFormerForInstanceSegmentationOutput

logger = logging.get_logger(__name__)


class MaskFormerFeatureExtractor(FeatureExtractionMixin, ImageFeatureExtractionMixin):
    r"""
    Constructs a MaskFormer feature extractor. The feature extractor can be used to prepare image(s) and optional
    targets for the model.

    This feature extractor inherits from [`FeatureExtractionMixin`] which contains most of the main methods. Users
    should refer to this superclass for more information regarding those methods.

    Args:
        do_resize (`bool`, *optional*, defaults to `True`):
            Whether to resize the input to a certain `size`.
        size (`int`, *optional*, defaults to 800):
            Resize the input to the given size. Only has an effect if `do_resize` is set to `True`. If size is a
            sequence like `(width, height)`, output size will be matched to this. If size is an int, the smaller edge
            of the image will be matched to this number, i.e., if `height > width`, then the image will be rescaled
            to `(size * height / width, size)`.
        max_size (`int`, *optional*, defaults to 1333):
            The largest size an image dimension can have (otherwise it's capped). Only has an effect if `do_resize` is
            set to `True`.
        size_divisibility (`int`, *optional*, defaults to 32):
            Some backbones need images divisible by a certain number. If not passed, it defaults to the value used in
            Swin Transformer.
        do_normalize (`bool`, *optional*, defaults to `True`):
            Whether or not to normalize the input with mean and standard deviation.
        image_mean (`List[float]`, *optional*, defaults to `[0.485, 0.456, 0.406]`):
            The sequence of means for each channel, to be used when normalizing images. Defaults to the ImageNet mean.
        image_std (`List[float]`, *optional*, defaults to `[0.229, 0.224, 0.225]`):
            The sequence of standard deviations for each channel, to be used when normalizing images. Defaults to the
            ImageNet std.
        ignore_index (`int`, *optional*, defaults to 255):
            Value of the index (label) to ignore.
    """

    model_input_names = ["pixel_values", "pixel_mask"]

    def __init__(
        self,
        do_resize=True,
        size=800,
        max_size=1333,
        size_divisibility=32,
        do_normalize=True,
        image_mean=None,
        image_std=None,
        ignore_index=255,
        **kwargs
    ):
        super().__init__(**kwargs)
        self.do_resize = do_resize
        self.size = size
        self.max_size = max_size
        self.size_divisibility = size_divisibility
        self.ignore_index = ignore_index
        self.do_normalize = do_normalize
        self.image_mean = image_mean if image_mean is not None else [0.485, 0.456, 0.406]  # ImageNet mean
        self.image_std = image_std if image_std is not None else [0.229, 0.224, 0.225]  # ImageNet std

def _resize(self, image, size, target=None, max_size=None):
|
||||||
|
"""
|
||||||
|
Resize the image to the given size. Size can be min_size (scalar) or (width, height) tuple. If size is an int,
|
||||||
|
smaller edge of the image will be matched to this number.
|
||||||
|
|
||||||
|
If given, also resize the target accordingly.
|
||||||
|
"""
|
||||||
|
if not isinstance(image, Image.Image):
|
||||||
|
image = self.to_pil_image(image)
|
||||||
|
|
||||||
|
def get_size_with_aspect_ratio(image_size, size, max_size=None):
|
||||||
|
width, height = image_size
|
||||||
|
if max_size is not None:
|
||||||
|
min_original_size = float(min((width, height)))
|
||||||
|
max_original_size = float(max((width, height)))
|
||||||
|
if max_original_size / min_original_size * size > max_size:
|
||||||
|
size = int(round(max_size * min_original_size / max_original_size))
|
||||||
|
|
||||||
|
if (width <= height and width == size) or (height <= width and height == size):
|
||||||
|
return (height, width)
|
||||||
|
|
||||||
|
if width < height:
|
||||||
|
output_width = size
|
||||||
|
output_height = int(size * height / width)
|
||||||
|
else:
|
||||||
|
output_height = size
|
||||||
|
output_width = int(size * width / height)
|
||||||
|
|
||||||
|
return (output_height, output_width)
|
||||||
|
|
||||||
|
def get_size(image_size, size, max_size=None):
|
||||||
|
if isinstance(size, (list, tuple)):
|
||||||
|
return size
|
||||||
|
else:
|
||||||
|
# size returned must be (width, height) since we use PIL to resize images
# so we reverse the tuple
|
||||||
|
return get_size_with_aspect_ratio(image_size, size, max_size)[::-1]
|
||||||
|
|
||||||
|
width, height = get_size(image.size, size, max_size)
|
||||||
|
|
||||||
|
if self.size_divisibility > 0:
|
||||||
|
height = int(np.ceil(height / self.size_divisibility)) * self.size_divisibility
|
||||||
|
width = int(np.ceil(width / self.size_divisibility)) * self.size_divisibility
|
||||||
|
|
||||||
|
size = (width, height)
|
||||||
|
rescaled_image = self.resize(image, size=size)
|
||||||
|
|
||||||
|
has_target = target is not None
|
||||||
|
|
||||||
|
if has_target:
|
||||||
|
target = target.copy()
|
||||||
|
# store original_size
|
||||||
|
target["original_size"] = image.size
|
||||||
|
if "masks" in target:
|
||||||
|
masks = torch.from_numpy(target["masks"])[:, None].float()
|
||||||
|
# use PyTorch as current workaround
|
||||||
|
# TODO replace by self.resize
|
||||||
|
interpolated_masks = (
|
||||||
|
nn.functional.interpolate(masks, size=(height, width), mode="nearest")[:, 0] > 0.5
|
||||||
|
).float()
|
||||||
|
target["masks"] = interpolated_masks.numpy()
|
||||||
|
|
||||||
|
return rescaled_image, target
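The size computation performed by `_resize` can be sketched in isolation. This is a minimal, dependency-free sketch rather than the library's code: the function name `shortest_edge_size` and its `divisibility` argument are hypothetical, and the early return for images whose shorter edge already equals `size` is left out because the general branch yields the same result.

```python
import math


def shortest_edge_size(width, height, size, max_size=None, divisibility=0):
    """Return (height, width) after matching the shorter edge to `size`.

    Hypothetical helper mirroring the arithmetic of the nested functions
    in `_resize`; it only computes sizes and does not touch any image.
    """
    if max_size is not None:
        short_edge = float(min(width, height))
        long_edge = float(max(width, height))
        # Shrink the requested size if the longer edge would exceed max_size.
        if long_edge / short_edge * size > max_size:
            size = int(round(max_size * short_edge / long_edge))

    if width < height:
        out_width, out_height = size, int(size * height / width)
    else:
        out_height, out_width = size, int(size * width / height)

    if divisibility > 0:
        # Round both edges up to the next multiple of `divisibility`.
        out_height = math.ceil(out_height / divisibility) * divisibility
        out_width = math.ceil(out_width / divisibility) * divisibility

    return out_height, out_width


print(shortest_edge_size(640, 480, 32))                   # (32, 42)
print(shortest_edge_size(640, 480, 32, divisibility=32))  # (32, 64)
print(shortest_edge_size(1000, 100, 200, max_size=500))   # (50, 500)
```

Note that the result is ordered `(height, width)`, whereas PIL expects `(width, height)`, which is why the code above reverses the tuple before resizing.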
|
||||||
|
|
||||||
|
def __call__(
|
||||||
|
self,
|
||||||
|
images: ImageInput,
|
||||||
|
annotations: Union[List[Dict], List[List[Dict]]] = None,
|
||||||
|
pad_and_return_pixel_mask: Optional[bool] = True,
|
||||||
|
return_tensors: Optional[Union[str, TensorType]] = None,
|
||||||
|
**kwargs,
|
||||||
|
) -> BatchFeature:
|
||||||
|
"""
|
||||||
|
Main method to prepare one or several image(s) and optional annotations for the model. Images are by default
padded up to the largest image in a batch, and a pixel mask is created that indicates which pixels are
real and which are padding.
|
||||||
|
|
||||||
|
<Tip warning={true}>
|
||||||
|
|
||||||
|
NumPy arrays and PyTorch tensors are converted to PIL images when resizing, so passing PIL images is the most
efficient option.
|
||||||
|
|
||||||
|
</Tip>
|
||||||
|
|
||||||
|
Args:
|
||||||
|
images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
|
||||||
|
The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
|
||||||
|
tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is a
|
||||||
|
number of channels, H and W are image height and width.
|
||||||
|
|
||||||
|
annotations (`Dict`, `List[Dict]`, *optional*):
|
||||||
|
The corresponding annotations as dictionary of numpy arrays with the following keys:
|
||||||
|
- **masks** (`np.ndarray`) The target mask of shape `(num_classes, height, width)`.
|
||||||
|
- **labels** (`np.ndarray`) The target labels of shape `(num_classes)`.
|
||||||
|
|
||||||
|
pad_and_return_pixel_mask (`bool`, *optional*, defaults to `True`):
|
||||||
|
Whether or not to pad images up to the largest image in a batch and create a pixel mask.
|
||||||
|
|
||||||
|
If left to the default, will return a pixel mask that is:
|
||||||
|
|
||||||
|
- 1 for pixels that are real (i.e. **not masked**),
|
||||||
|
- 0 for pixels that are padding (i.e. **masked**).
|
||||||
|
|
||||||
|
return_tensors (`str` or [`~file_utils.TensorType`], *optional*):
|
||||||
|
If set, will return tensors instead of NumPy arrays. If set to `'pt'`, return PyTorch `torch.Tensor`
|
||||||
|
objects.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
[`BatchFeature`]: A [`BatchFeature`] with the following fields:
|
||||||
|
|
||||||
|
- **pixel_values** -- Pixel values to be fed to a model.
|
||||||
|
- **pixel_mask** -- Pixel mask to be fed to a model (when `pad_and_return_pixel_mask=True` or if
|
||||||
|
*"pixel_mask"* is in `self.model_input_names`).
|
||||||
|
- **mask_labels** -- Optional mask labels of shape `(batch_size, num_classes, height, width)` to be fed to a
|
||||||
|
model (when `annotations` are provided).
|
||||||
|
- **class_labels** -- Optional class labels of shape `(batch_size, num_classes)` to be fed to a model (when
|
||||||
|
`annotations` are provided).
|
||||||
|
"""
|
||||||
|
# Input type checking for clearer error
|
||||||
|
|
||||||
|
valid_images = False
|
||||||
|
valid_annotations = False
|
||||||
|
|
||||||
|
# Check that images has a valid type
|
||||||
|
if isinstance(images, (Image.Image, np.ndarray)) or is_torch_tensor(images):
|
||||||
|
valid_images = True
|
||||||
|
elif isinstance(images, (list, tuple)):
|
||||||
|
if len(images) == 0 or isinstance(images[0], (Image.Image, np.ndarray)) or is_torch_tensor(images[0]):
|
||||||
|
valid_images = True
|
||||||
|
|
||||||
|
if not valid_images:
|
||||||
|
raise ValueError(
|
||||||
|
"Images must be of type `PIL.Image.Image`, `np.ndarray` or `torch.Tensor` (single example), "
|
||||||
|
"`List[PIL.Image.Image]`, `List[np.ndarray]` or `List[torch.Tensor]` (batch of examples)."
|
||||||
|
)
|
||||||
|
|
||||||
|
is_batched = bool(
|
||||||
|
isinstance(images, (list, tuple))
|
||||||
|
and (isinstance(images[0], (Image.Image, np.ndarray)) or is_torch_tensor(images[0]))
|
||||||
|
)
|
||||||
|
|
||||||
|
if not is_batched:
|
||||||
|
images = [images]
|
||||||
|
if annotations is not None:
|
||||||
|
annotations = [annotations]
|
||||||
|
|
||||||
|
# Check that annotations has a valid type
|
||||||
|
if annotations is not None:
|
||||||
|
valid_annotations = type(annotations) is list and "masks" in annotations[0] and "labels" in annotations[0]
|
||||||
|
if not valid_annotations:
|
||||||
|
raise ValueError(
|
||||||
|
"Annotations must be of type `Dict` (single image) or `List[Dict]` (batch of images). "
"The annotations must be numpy arrays in the following format: "
"{ 'masks' : the target mask, with shape [C,H,W], 'labels' : the target labels, with shape [C]}"
|
||||||
|
)
|
||||||
|
|
||||||
|
# transformations (resizing + normalization)
|
||||||
|
if self.do_resize and self.size is not None:
|
||||||
|
if annotations is not None:
|
||||||
|
for idx, (image, target) in enumerate(zip(images, annotations)):
|
||||||
|
image, target = self._resize(image=image, target=target, size=self.size, max_size=self.max_size)
|
||||||
|
images[idx] = image
|
||||||
|
annotations[idx] = target
|
||||||
|
else:
|
||||||
|
for idx, image in enumerate(images):
|
||||||
|
images[idx] = self._resize(image=image, target=None, size=self.size, max_size=self.max_size)[0]
|
||||||
|
|
||||||
|
if self.do_normalize:
|
||||||
|
images = [self.normalize(image=image, mean=self.image_mean, std=self.image_std) for image in images]
|
||||||
|
# NOTE: the images always have to be padded, since they have to be stacked along the batch dimension
|
||||||
|
encoded_inputs = self.encode_inputs(
|
||||||
|
images, annotations, pad_and_return_pixel_mask, return_tensors=return_tensors
|
||||||
|
)
|
||||||
|
|
||||||
|
# Convert to TensorType
|
||||||
|
tensor_type = return_tensors
|
||||||
|
if not isinstance(tensor_type, TensorType):
|
||||||
|
tensor_type = TensorType(tensor_type)
|
||||||
|
|
||||||
|
if not tensor_type == TensorType.PYTORCH:
|
||||||
|
raise ValueError("Only PyTorch is supported for the moment.")
|
||||||
|
else:
|
||||||
|
if not is_torch_available():
|
||||||
|
raise ImportError("Unable to convert output to PyTorch tensors format, PyTorch is not installed.")
|
||||||
|
|
||||||
|
return encoded_inputs
|
||||||
|
|
||||||
|
def _max_by_axis(self, the_list: List[List[int]]) -> List[int]:
|
||||||
|
maxes = the_list[0]
|
||||||
|
for sublist in the_list[1:]:
|
||||||
|
for index, item in enumerate(sublist):
|
||||||
|
maxes[index] = max(maxes[index], item)
|
||||||
|
return maxes
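`_max_by_axis` binds `maxes` to `the_list[0]` and then updates it in place, so the first sublist passed in is modified. The caller in this file builds fresh lists from `image.shape`, so that is harmless here. As a hedged alternative, the same element-wise maximum can be written without mutation; the name `max_by_axis` below is a hypothetical free function, not part of the file.

```python
from typing import List


def max_by_axis(shapes: List[List[int]]) -> List[int]:
    """Element-wise maximum over equally long shape lists, leaving the inputs untouched."""
    # zip(*shapes) groups the i-th entry of every shape together.
    return [max(dims) for dims in zip(*shapes)]


shapes = [[3, 30, 40], [3, 50, 20]]
print(max_by_axis(shapes))  # [3, 50, 40]
print(shapes)               # [[3, 30, 40], [3, 50, 20]]
```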
|
||||||
|
|
||||||
|
def encode_inputs(
|
||||||
|
self,
|
||||||
|
pixel_values_list: List["torch.Tensor"],
|
||||||
|
annotations: Optional[List[Dict]] = None,
|
||||||
|
pad_and_return_pixel_mask: Optional[bool] = True,
|
||||||
|
return_tensors: Optional[Union[str, TensorType]] = None,
|
||||||
|
):
|
||||||
|
"""
|
||||||
|
Pad images up to the largest image in a batch and create a corresponding `pixel_mask`.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
pixel_values_list (`List[torch.Tensor]`):
|
||||||
|
List of images (pixel values) to be padded. Each image should be a tensor of shape `(channels, height,
|
||||||
|
width)`.
|
||||||
|
|
||||||
|
annotations (`Dict`, `List[Dict]`, *optional*):
|
||||||
|
The corresponding annotations as dictionary of numpy arrays with the following keys:
|
||||||
|
- **masks** (`np.ndarray`) The target mask of shape `(num_classes, height, width)`.
|
||||||
|
- **labels** (`np.ndarray`) The target labels of shape `(num_classes)`.
|
||||||
|
|
||||||
|
pad_and_return_pixel_mask (`bool`, *optional*, defaults to `True`):
|
||||||
|
Whether or not to pad images up to the largest image in a batch and create a pixel mask.
|
||||||
|
|
||||||
|
If left to the default, will return a pixel mask that is:
|
||||||
|
|
||||||
|
- 1 for pixels that are real (i.e. **not masked**),
|
||||||
|
- 0 for pixels that are padding (i.e. **masked**).
|
||||||
|
|
||||||
|
return_tensors (`str` or [`~file_utils.TensorType`], *optional*):
|
||||||
|
If set, will return tensors instead of NumPy arrays. If set to `'pt'`, return PyTorch `torch.Tensor`
|
||||||
|
objects.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
[`BatchFeature`]: A [`BatchFeature`] with the following fields:
|
||||||
|
|
||||||
|
- **pixel_values** -- Pixel values to be fed to a model.
|
||||||
|
- **pixel_mask** -- Pixel mask to be fed to a model (when `pad_and_return_pixel_mask=True` or if
|
||||||
|
*"pixel_mask"* is in `self.model_input_names`).
|
||||||
|
- **mask_labels** -- Optional mask labels of shape `(batch_size, num_classes, height, width)` to be fed to a
|
||||||
|
model (when `annotations` are provided).
|
||||||
|
- **class_labels** -- Optional class labels of shape `(batch_size, num_classes)` to be fed to a model (when
|
||||||
|
`annotations` are provided).
|
||||||
|
"""
|
||||||
|
|
||||||
|
max_size = self._max_by_axis([list(image.shape) for image in pixel_values_list])
|
||||||
|
channels, height, width = max_size
|
||||||
|
pixel_values = []
|
||||||
|
pixel_mask = []
|
||||||
|
mask_labels = []
|
||||||
|
class_labels = []
|
||||||
|
for idx, image in enumerate(pixel_values_list):
|
||||||
|
# create padded image
|
||||||
|
if pad_and_return_pixel_mask:
|
||||||
|
padded_image = np.zeros((channels, height, width), dtype=np.float32)
|
||||||
|
padded_image[: image.shape[0], : image.shape[1], : image.shape[2]] = np.copy(image)
|
||||||
|
image = padded_image
|
||||||
|
pixel_values.append(image)
|
||||||
|
# if we have a target, pad it
|
||||||
|
if annotations:
|
||||||
|
annotation = annotations[idx]
|
||||||
|
masks = annotation["masks"]
|
||||||
|
if pad_and_return_pixel_mask:
|
||||||
|
padded_masks = np.zeros((masks.shape[0], height, width), dtype=masks.dtype)
|
||||||
|
padded_masks[:, : masks.shape[1], : masks.shape[2]] = np.copy(masks)
|
||||||
|
masks = padded_masks
|
||||||
|
mask_labels.append(masks)
|
||||||
|
class_labels.append(annotation["labels"])
|
||||||
|
if pad_and_return_pixel_mask:
|
||||||
|
# create pixel mask
|
||||||
|
mask = np.zeros((height, width), dtype=np.int64)
|
||||||
|
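# use the unpadded size: `image` has been rebound to the padded array at this point
mask[: pixel_values_list[idx].shape[1], : pixel_values_list[idx].shape[2]] = True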
|
||||||
|
pixel_mask.append(mask)
|
||||||
|
|
||||||
|
# return as BatchFeature
|
||||||
|
data = {"pixel_values": pixel_values, "pixel_mask": pixel_mask}
|
||||||
|
|
||||||
|
if annotations:
|
||||||
|
data["mask_labels"] = mask_labels
|
||||||
|
data["class_labels"] = class_labels
|
||||||
|
|
||||||
|
encoded_inputs = BatchFeature(data=data, tensor_type=return_tensors)
|
||||||
|
|
||||||
|
return encoded_inputs
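The padding step of `encode_inputs` can be illustrated with a small NumPy-only sketch. It is written under assumptions: `pad_to_largest` is a hypothetical helper, annotations are ignored, and the pixel mask is built from each image's height and width as they were before padding, so that 1 marks real pixels and 0 marks padding.

```python
import numpy as np


def pad_to_largest(images):
    """Zero-pad (C, H, W) arrays at the bottom/right to the largest size in the batch.

    Returns the stacked padded images and a (batch, H, W) pixel mask.
    """
    channels = max(img.shape[0] for img in images)
    height = max(img.shape[1] for img in images)
    width = max(img.shape[2] for img in images)

    pixel_values, pixel_mask = [], []
    for img in images:
        c, h, w = img.shape  # size before padding
        padded = np.zeros((channels, height, width), dtype=np.float32)
        padded[:c, :h, :w] = img
        mask = np.zeros((height, width), dtype=np.int64)
        mask[:h, :w] = 1  # real pixels only
        pixel_values.append(padded)
        pixel_mask.append(mask)

    return np.stack(pixel_values), np.stack(pixel_mask)


batch = [np.ones((3, 2, 4)), np.ones((3, 5, 3))]
values, mask = pad_to_largest(batch)
print(values.shape, mask.shape)  # (2, 3, 5, 4) (2, 5, 4)
```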
|
||||||
|
|
||||||
|
def post_process_segmentation(
|
||||||
|
self, outputs: "MaskFormerForInstanceSegmentationOutput", target_size: Tuple[int, int] = None
|
||||||
|
) -> "torch.Tensor":
|
||||||
|
"""
|
||||||
|
Converts the output of [`MaskFormerForInstanceSegmentationOutput`] into image segmentation predictions. Only
|
||||||
|
supports PyTorch.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
outputs ([`MaskFormerForInstanceSegmentationOutput`]):
|
||||||
|
The outputs from [`MaskFormerForInstanceSegmentation`].
|
||||||
|
|
||||||
|
target_size (`Tuple[int, int]`, *optional*):
|
||||||
|
If set, the `masks_queries_logits` will be resized to `target_size`.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
`torch.Tensor`:
|
||||||
|
A tensor of shape `(batch_size, num_labels, height, width)`.
|
||||||
|
"""
|
||||||
|
# class_queries_logits has shape [BATCH, QUERIES, CLASSES + 1]
|
||||||
|
class_queries_logits = outputs.class_queries_logits
|
||||||
|
# masks_queries_logits has shape [BATCH, QUERIES, HEIGHT, WIDTH]
|
||||||
|
masks_queries_logits = outputs.masks_queries_logits
|
||||||
|
if target_size is not None:
|
||||||
|
masks_queries_logits = interpolate(
|
||||||
|
masks_queries_logits,
|
||||||
|
size=target_size,
|
||||||
|
mode="bilinear",
|
||||||
|
align_corners=False,
|
||||||
|
)
|
||||||
|
# remove the null class `[..., :-1]`
|
||||||
|
masks_classes = class_queries_logits.softmax(dim=-1)[..., :-1]
|
||||||
|
# mask probs has shape [BATCH, QUERIES, HEIGHT, WIDTH]
|
||||||
|
masks_probs = masks_queries_logits.sigmoid()
|
||||||
|
# now we want to sum over the queries,
|
||||||
|
# $ out_{c,h,w} = \sum_q p_{q,c} * m_{q,h,w} $
|
||||||
|
# where $ softmax(p) \in R^{q, c} $ is the mask classes
|
||||||
|
# and $ sigmoid(m) \in R^{q, h, w}$ is the mask probabilities
|
||||||
|
# b(atch)q(uery)c(lasses), b(atch)q(uery)h(eight)w(idth)
|
||||||
|
segmentation = torch.einsum("bqc, bqhw -> bchw", masks_classes, masks_probs)
|
||||||
|
|
||||||
|
return segmentation
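The combination step in `post_process_segmentation`, `out[c, h, w] = sum_q p[q, c] * m[q, h, w]`, can be reproduced with NumPy to see the shapes involved. This is a sketch under assumptions: `combine_queries` is a hypothetical name, NumPy stands in for PyTorch, and the optional resizing to `target_size` is skipped.

```python
import numpy as np


def combine_queries(class_logits, mask_logits):
    """Sum per-query mask probabilities weighted by per-query class probabilities.

    class_logits: (batch, queries, classes + 1), last class is the null class
    mask_logits:  (batch, queries, height, width)
    returns:      (batch, classes, height, width)
    """
    # Numerically stable softmax over the class dimension.
    shifted = class_logits - class_logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    class_probs = exp / exp.sum(axis=-1, keepdims=True)
    class_probs = class_probs[..., :-1]  # drop the null class
    mask_probs = 1.0 / (1.0 + np.exp(-mask_logits))  # sigmoid
    return np.einsum("bqc,bqhw->bchw", class_probs, mask_probs)


rng = np.random.default_rng(0)
segmentation = combine_queries(rng.normal(size=(2, 4, 4)), rng.normal(size=(2, 4, 5, 6)))
print(segmentation.shape)                 # (2, 3, 5, 6)
print(segmentation.argmax(axis=1).shape)  # (2, 5, 6)
```

Taking the `argmax` over the class dimension, as the last line does, corresponds to what `post_process_semantic_segmentation` computes from this tensor.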
|
||||||
|
|
||||||
|
def remove_low_and_no_objects(self, masks, scores, labels, object_mask_threshold, num_labels):
|
||||||
|
"""
|
||||||
|
Filter the given queries, keeping those whose label is not the null class and whose score is above
`object_mask_threshold`. Returns the associated values of `masks`, `scores` and `labels`.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
masks (`torch.Tensor`):
|
||||||
|
A tensor of shape `(num_queries, height, width)`.
|
||||||
|
scores (`torch.Tensor`):
|
||||||
|
A tensor of shape `(num_queries)`.
|
||||||
|
labels (`torch.Tensor`):
|
||||||
|
A tensor of shape `(num_queries)`.
|
||||||
|
object_mask_threshold (`float`):
|
||||||
|
A number between 0 and 1; queries whose score is not above it are discarded.
|
||||||
|
|
||||||
|
Raises:
|
||||||
|
`ValueError`: Raised when the first dimension doesn't match in all input tensors.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
`Tuple[torch.Tensor, torch.Tensor, torch.Tensor]`: The `masks`, `scores` and `labels` of the queries whose
score is above `object_mask_threshold` and whose label is not the null class.
|
||||||
|
"""
|
||||||
|
if not (masks.shape[0] == scores.shape[0] == labels.shape[0]):
|
||||||
|
raise ValueError("masks, scores and labels must have the same first dimension!")
|
||||||
|
|
||||||
|
to_keep = labels.ne(num_labels) & (scores > object_mask_threshold)
|
||||||
|
|
||||||
|
return masks[to_keep], scores[to_keep], labels[to_keep]
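The keep rule used by `remove_low_and_no_objects` can be stated on plain Python lists. The helper below is a hypothetical illustration (the name `keep_confident_queries` and the list-based interface are assumptions): it returns the indices of the queries that survive, where `null_label` plays the role of `num_labels`.

```python
def keep_confident_queries(scores, labels, threshold, null_label):
    """Indices of queries whose label is not the null class and whose score exceeds the threshold."""
    return [
        index
        for index, (score, label) in enumerate(zip(scores, labels))
        if label != null_label and score > threshold
    ]


# Query 1 is dropped because it predicts the null class (3),
# query 2 because its score does not exceed 0.8.
print(keep_confident_queries([0.9, 0.95, 0.5], [1, 3, 0], threshold=0.8, null_label=3))  # [0]
```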
|
||||||
|
|
||||||
|
def post_process_semantic_segmentation(
|
||||||
|
self, outputs: "MaskFormerForInstanceSegmentationOutput", target_size: Tuple[int, int] = None
|
||||||
|
) -> "torch.Tensor":
|
||||||
|
"""
|
||||||
|
Converts the output of [`MaskFormerForInstanceSegmentationOutput`] into semantic segmentation predictions. Only
|
||||||
|
supports PyTorch.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
outputs ([`MaskFormerForInstanceSegmentationOutput`]):
|
||||||
|
The outputs from [`MaskFormerForInstanceSegmentation`].
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
`torch.Tensor`: A tensor of shape `(batch_size, height, width)`.
|
||||||
|
"""
|
||||||
|
segmentation = self.post_process_segmentation(outputs, target_size)
|
||||||
|
semantic_segmentation = segmentation.argmax(dim=1)
|
||||||
|
return semantic_segmentation
|
||||||
|
|
||||||
|
def post_process_panoptic_segmentation(
|
||||||
|
self,
|
||||||
|
outputs: "MaskFormerForInstanceSegmentationOutput",
|
||||||
|
object_mask_threshold: float = 0.8,
|
||||||
|
overlap_mask_area_threshold: float = 0.8,
|
||||||
|
is_thing_map: Optional[Dict[int, bool]] = None,
|
||||||
|
) -> List[Dict]:
|
||||||
|
"""
|
||||||
|
Converts the output of [`MaskFormerForInstanceSegmentationOutput`] into image panoptic segmentation
|
||||||
|
predictions. Only supports PyTorch.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
outputs ([`MaskFormerForInstanceSegmentationOutput`]):
|
||||||
|
The outputs from [`MaskFormerForInstanceSegmentation`].
|
||||||
|
object_mask_threshold (`float`, *optional*, defaults to 0.8):
|
||||||
|
The object mask threshold.
|
||||||
|
overlap_mask_area_threshold (`float`, *optional*, defaults to 0.8):
|
||||||
|
The overlap mask area threshold to use.
|
||||||
|
is_thing_map (`Dict[int, bool]`, *optional*):
|
||||||
|
Dictionary mapping class indices to either `True` or `False`, depending on whether or not they are a
|
||||||
|
thing. If not set, defaults to the `is_thing_map` of COCO panoptic.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
`List[Dict]`: A list of dictionaries, one per image, each dictionary containing two keys:
|
||||||
|
- **segmentation** -- a tensor of shape `(height, width)` where each pixel represents a `segment_id`.
|
||||||
|
- **segments** -- a list of dictionaries, each with the following keys:
|
||||||
|
- **id** -- an integer representing the `segment_id`.
|
||||||
|
- **category_id** -- an integer representing the segment's label.
|
||||||
|
- **is_thing** -- a boolean, `True` if `is_thing_map` marks `category_id` as a thing, `False` otherwise.
|
||||||
|
"""
|
||||||
|
|
||||||
|
if is_thing_map is None:
|
||||||
|
logger.warning("`is_thing_map` unset. Default to COCO.")
|
||||||
|
# default to is_thing_map of COCO panoptic
|
||||||
|
is_thing_map = {i: i <= 90 for i in range(201)}
|
||||||
|
# class_queries_logits has shape [BATCH, QUERIES, CLASSES + 1]
|
||||||
|
class_queries_logits = outputs.class_queries_logits
|
||||||
|
# keep track of the number of labels, subtract 1 for the null class
|
||||||
|
num_labels = class_queries_logits.shape[-1] - 1
|
||||||
|
# masks_queries_logits has shape [BATCH, QUERIES, HEIGHT, WIDTH]
|
||||||
|
masks_queries_logits = outputs.masks_queries_logits
|
||||||
|
# since all images are padded, they all have the same spatial dimensions
|
||||||
|
_, _, height, width = masks_queries_logits.shape
|
||||||
|
# for each query, the best scores and their indices
|
||||||
|
pred_scores, pred_labels = nn.functional.softmax(class_queries_logits, dim=-1).max(-1)
|
||||||
|
# pred_scores and pred_labels shape = [BATCH, NUM_QUERIES]
|
||||||
|
mask_probs = masks_queries_logits.sigmoid()
|
||||||
|
# mask probs has shape [BATCH, QUERIES, HEIGHT, WIDTH]
|
||||||
|
# now we need to iterate over the batch to process the segmentation we got from the queries using our thresholds. Even if the original predicted masks have the same shape across the batch, they won't after thresholding, so batch-wise operations are impossible
|
||||||
|
results: List[Dict[str, Tensor]] = []
|
||||||
|
for (mask_probs, pred_scores, pred_labels) in zip(mask_probs, pred_scores, pred_labels):
|
||||||
|
mask_probs, pred_scores, pred_labels = self.remove_low_and_no_objects(
|
||||||
|
mask_probs, pred_scores, pred_labels, object_mask_threshold, num_labels
|
||||||
|
)
|
||||||
|
we_detect_something = mask_probs.shape[0] > 0
|
||||||
|
|
||||||
|
segmentation = torch.zeros((height, width), dtype=torch.int32, device=mask_probs.device)
|
||||||
|
segments: List[Dict] = []
|
||||||
|
|
||||||
|
if we_detect_something:
|
||||||
|
current_segment_id = 0
|
||||||
|
# weight each mask by its score
|
||||||
|
mask_probs *= pred_scores.view(-1, 1, 1)
|
||||||
|
# find out for each pixel which query is the most likely one
|
||||||
|
mask_labels = mask_probs.argmax(0)
|
||||||
|
# mask_labels shape = [H,W] where each pixel holds the index of its most likely query
|
||||||
|
stuff_memory_list: Dict[int, int] = {}
|
||||||
|
# this is a map from stuff classes to segment ids, used to keep track of the segment already assigned to each stuff class
|
||||||
|
for k in range(pred_labels.shape[0]):
|
||||||
|
pred_class = pred_labels[k].item()
|
||||||
|
# check if pred_class is not a "thing", so it can be merged with other instances. For example, class "sky" cannot have more than one instance
|
||||||
|
is_stuff = not is_thing_map[pred_class]
|
||||||
|
# get the mask associated with the k-th query
|
||||||
|
mask_k = mask_labels == k
|
||||||
|
# create the area, since bool we just need to sum :)
|
||||||
|
mask_k_area = mask_k.sum()
|
||||||
|
# this is the area of all the stuff in query k
|
||||||
|
# TODO not 100% sure why the k-th query is taken here
|
||||||
|
original_area = (mask_probs[k] >= 0.5).sum()
|
||||||
|
|
||||||
|
mask_does_exist = mask_k_area > 0 and original_area > 0
|
||||||
|
|
||||||
|
if mask_does_exist:
|
||||||
|
# find out how much of the original area mask_k covers
|
||||||
|
area_ratio = mask_k_area / original_area
|
||||||
|
mask_k_is_overlapping_enough = area_ratio.item() > overlap_mask_area_threshold
|
||||||
|
|
||||||
|
if mask_k_is_overlapping_enough:
|
||||||
|
# merge stuff regions
|
||||||
|
if pred_class in stuff_memory_list:
|
||||||
|
current_segment_id = stuff_memory_list[pred_class]
|
||||||
|
else:
|
||||||
|
current_segment_id += 1
|
||||||
|
# then we update our mask with the current segment
|
||||||
|
segmentation[mask_k] = current_segment_id
|
||||||
|
segments.append(
|
||||||
|
{
|
||||||
|
"id": current_segment_id,
|
||||||
|
"category_id": pred_class,
|
||||||
|
"is_thing": not is_stuff,
|
||||||
|
}
|
||||||
|
)
|
||||||
|
if is_stuff:
|
||||||
|
stuff_memory_list[pred_class] = current_segment_id
|
||||||
|
results.append({"segmentation": segmentation, "segments": segments})
|
||||||
|
return results
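The bookkeeping that merges "stuff" regions in `post_process_panoptic_segmentation` can be sketched separately from the tensor work. The sketch below is not the code above: `assign_segment_ids` is a hypothetical helper that only receives the predicted class of each surviving query, skips the area checks, and, as a design choice of the sketch, keeps a dedicated `next_id` counter so that reusing a stored stuff id never rewinds the numbering of later segments.

```python
from typing import Dict, List


def assign_segment_ids(pred_classes: List[int], is_thing_map: Dict[int, bool]) -> List[dict]:
    """Give every "thing" query a fresh id and let queries of the same "stuff" class share one id."""
    next_id = 0
    stuff_ids: Dict[int, int] = {}  # stuff class -> segment id already assigned
    segments = []
    for pred_class in pred_classes:
        is_thing = is_thing_map[pred_class]
        if not is_thing and pred_class in stuff_ids:
            # Same stuff class seen before: merge into the existing segment.
            segment_id = stuff_ids[pred_class]
        else:
            next_id += 1
            segment_id = next_id
            if not is_thing:
                stuff_ids[pred_class] = segment_id
        segments.append({"id": segment_id, "category_id": pred_class, "is_thing": is_thing})
    return segments


# Class 1 is a thing, class 100 is stuff (for example "sky").
segments = assign_segment_ids([100, 1, 100, 1], {1: True, 100: False})
print([segment["id"] for segment in segments])  # [1, 2, 1, 3]
```

Both queries of class 100 end up in segment 1, while the two instances of class 1 receive distinct ids.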
|
||||||
2499
src/transformers/models/maskformer/modeling_maskformer.py
Normal file
File diff suppressed because it is too large
@@ -2401,6 +2401,30 @@ class MarianMTModel(metaclass=DummyObject):
|
|||||||
requires_backends(self, ["torch"])
|
||||||
|
|
||||||
|
|
||||||
|
MASKFORMER_PRETRAINED_MODEL_ARCHIVE_LIST = None
|
||||||
|
|
||||||
|
|
||||||
|
class MaskFormerForInstanceSegmentation(metaclass=DummyObject):
|
||||||
|
_backends = ["torch"]
|
||||||
|
|
||||||
|
def __init__(self, *args, **kwargs):
|
||||||
|
requires_backends(self, ["torch"])
|
||||||
|
|
||||||
|
|
||||||
|
class MaskFormerModel(metaclass=DummyObject):
|
||||||
|
_backends = ["torch"]
|
||||||
|
|
||||||
|
def __init__(self, *args, **kwargs):
|
||||||
|
requires_backends(self, ["torch"])
|
||||||
|
|
||||||
|
|
||||||
|
class MaskFormerPreTrainedModel(metaclass=DummyObject):
|
||||||
|
_backends = ["torch"]
|
||||||
|
|
||||||
|
def __init__(self, *args, **kwargs):
|
||||||
|
requires_backends(self, ["torch"])
|
||||||
|
|
||||||
|
|
||||||
class MBartForCausalLM(metaclass=DummyObject):
|
||||||
_backends = ["torch"]
|
||||||
|
|
||||||
|
|||||||
@@ -80,6 +80,13 @@ class LayoutXLMProcessor(metaclass=DummyObject):
|
|||||||
requires_backends(self, ["vision"])
|
||||||
|
|
||||||
|
|
||||||
|
class MaskFormerFeatureExtractor(metaclass=DummyObject):
|
||||||
|
_backends = ["vision"]
|
||||||
|
|
||||||
|
def __init__(self, *args, **kwargs):
|
||||||
|
requires_backends(self, ["vision"])
|
||||||
|
|
||||||
|
|
||||||
class PerceiverFeatureExtractor(metaclass=DummyObject):
|
||||||
_backends = ["vision"]
|
||||||
|
|
||||||
|
|||||||
0
tests/maskformer/__init__.py
Normal file
303
tests/maskformer/test_feature_extraction_maskformer.py
Normal file
@@ -0,0 +1,303 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2022 HuggingFace Inc.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
|
||||||
|
|
||||||
|
import unittest
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
|
||||||
|
from transformers.file_utils import is_torch_available, is_vision_available
|
||||||
|
from transformers.testing_utils import require_torch, require_vision
|
||||||
|
|
||||||
|
from ..test_feature_extraction_common import FeatureExtractionSavingTestMixin, prepare_image_inputs
|
||||||
|
|
||||||
|
|
||||||
|
if is_torch_available():
|
||||||
|
import torch
|
||||||
|
|
||||||
|
if is_vision_available():
|
||||||
|
from transformers import MaskFormerFeatureExtractor
|
||||||
|
|
||||||
|
if is_vision_available():
|
||||||
|
from PIL import Image
|
||||||
|
|
||||||
|
|
||||||
|
class MaskFormerFeatureExtractionTester(unittest.TestCase):
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
parent,
|
||||||
|
batch_size=7,
|
||||||
|
num_channels=3,
|
||||||
|
min_resolution=30,
|
||||||
|
max_resolution=400,
|
||||||
|
do_resize=True,
|
||||||
|
size=32,
|
||||||
|
max_size=1333, # by setting max_size > max_resolution we're effectively not testing this :p
|
||||||
|
do_normalize=True,
|
||||||
|
image_mean=[0.5, 0.5, 0.5],
|
||||||
|
image_std=[0.5, 0.5, 0.5],
|
||||||
|
):
|
||||||
|
self.parent = parent
|
||||||
|
self.batch_size = batch_size
|
||||||
|
self.num_channels = num_channels
|
||||||
|
self.min_resolution = min_resolution
|
||||||
|
self.max_resolution = max_resolution
|
||||||
|
self.do_resize = do_resize
|
||||||
|
self.size = size
|
||||||
|
self.max_size = max_size
|
||||||
|
self.do_normalize = do_normalize
|
||||||
|
self.image_mean = image_mean
|
||||||
|
self.image_std = image_std
|
||||||
|
self.size_divisibility = 0
|
||||||
|
|
||||||
|
def prepare_feat_extract_dict(self):
|
||||||
|
return {
|
||||||
|
"do_resize": self.do_resize,
|
||||||
|
"size": self.size,
|
||||||
|
"max_size": self.max_size,
|
||||||
|
"do_normalize": self.do_normalize,
|
||||||
|
"image_mean": self.image_mean,
|
||||||
|
"image_std": self.image_std,
|
||||||
|
"size_divisibility": self.size_divisibility,
|
||||||
|
}
|
||||||
|
|
||||||
|
def get_expected_values(self, image_inputs, batched=False):
|
||||||
|
"""
|
||||||
|
This function computes the expected height and width when providing images to MaskFormerFeatureExtractor,
|
||||||
|
assuming do_resize is set to True with a scalar size.
|
||||||
|
"""
|
||||||
|
if not batched:
|
||||||
|
image = image_inputs[0]
|
||||||
|
if isinstance(image, Image.Image):
|
||||||
|
w, h = image.size
|
||||||
|
else:
|
||||||
|
h, w = image.shape[1], image.shape[2]
|
||||||
|
if w < h:
|
||||||
|
expected_height = int(self.size * h / w)
|
||||||
|
expected_width = self.size
|
||||||
|
elif w > h:
|
||||||
|
expected_height = self.size
|
||||||
|
expected_width = int(self.size * w / h)
|
||||||
|
else:
|
||||||
|
expected_height = self.size
|
||||||
|
expected_width = self.size
|
||||||
|
|
||||||
|
else:
|
||||||
|
expected_values = []
|
||||||
|
for image in image_inputs:
|
||||||
|
expected_height, expected_width = self.get_expected_values([image])
|
||||||
|
expected_values.append((expected_height, expected_width))
|
||||||
|
expected_height = max(expected_values, key=lambda item: item[0])[0]
|
||||||
|
expected_width = max(expected_values, key=lambda item: item[1])[1]
|
||||||
|
|
||||||
|
return expected_height, expected_width
|
||||||
|
|
||||||
|
|
||||||
|
@require_torch
|
||||||
|
@require_vision
|
||||||
|
class MaskFormerFeatureExtractionTest(FeatureExtractionSavingTestMixin, unittest.TestCase):
|
||||||
|
|
||||||
|
feature_extraction_class = MaskFormerFeatureExtractor if (is_vision_available() and is_torch_available()) else None
|
||||||
|
|
||||||
|
def setUp(self):
|
||||||
|
self.feature_extract_tester = MaskFormerFeatureExtractionTester(self)
|
||||||
|
|
||||||
|
@property
|
||||||
|
def feat_extract_dict(self):
|
||||||
|
return self.feature_extract_tester.prepare_feat_extract_dict()
|
||||||
|
|
||||||
|
def test_feat_extract_properties(self):
|
||||||
|
feature_extractor = self.feature_extraction_class(**self.feat_extract_dict)
|
||||||
|
self.assertTrue(hasattr(feature_extractor, "image_mean"))
|
||||||
|
self.assertTrue(hasattr(feature_extractor, "image_std"))
|
||||||
|
self.assertTrue(hasattr(feature_extractor, "do_normalize"))
|
||||||
|
self.assertTrue(hasattr(feature_extractor, "do_resize"))
|
||||||
|
self.assertTrue(hasattr(feature_extractor, "size"))
|
||||||
|
self.assertTrue(hasattr(feature_extractor, "max_size"))
|
||||||
|
|
||||||
|
def test_batch_feature(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
def test_call_pil(self):
|
||||||
|
# Initialize feature_extractor
|
||||||
|
feature_extractor = self.feature_extraction_class(**self.feat_extract_dict)
|
||||||
|
# create random PIL images
|
||||||
|
image_inputs = prepare_image_inputs(self.feature_extract_tester, equal_resolution=False)
|
||||||
|
for image in image_inputs:
|
||||||
|
self.assertIsInstance(image, Image.Image)
|
||||||
|
|
||||||
|
# Test not batched input
|
||||||
|
encoded_images = feature_extractor(image_inputs[0], return_tensors="pt").pixel_values
|
||||||
|
|
||||||
|
expected_height, expected_width = self.feature_extract_tester.get_expected_values(image_inputs)
|
||||||
|
|
||||||
|
self.assertEqual(
|
||||||
|
encoded_images.shape,
|
||||||
|
(1, self.feature_extract_tester.num_channels, expected_height, expected_width),
|
||||||
|
)
|
||||||
|
|
||||||
|
# Test batched
|
||||||
|
expected_height, expected_width = self.feature_extract_tester.get_expected_values(image_inputs, batched=True)
|
||||||
|
|
||||||
|
encoded_images = feature_extractor(image_inputs, return_tensors="pt").pixel_values
|
||||||
|
self.assertEqual(
|
||||||
|
encoded_images.shape,
|
||||||
|
(
|
||||||
|
self.feature_extract_tester.batch_size,
|
||||||
|
self.feature_extract_tester.num_channels,
|
||||||
|
expected_height,
|
||||||
|
expected_width,
|
||||||
|
),
|
||||||
|
)
|
||||||
|
|
||||||
|
def test_call_numpy(self):
|
||||||
|
# Initialize feature_extractor
|
||||||
|
feature_extractor = self.feature_extraction_class(**self.feat_extract_dict)
|
||||||
|
# create random numpy tensors
|
||||||
|
image_inputs = prepare_image_inputs(self.feature_extract_tester, equal_resolution=False, numpify=True)
|
||||||
|
for image in image_inputs:
|
||||||
|
self.assertIsInstance(image, np.ndarray)
|
||||||
|
|
||||||
|
# Test not batched input
|
||||||
|
encoded_images = feature_extractor(image_inputs[0], return_tensors="pt").pixel_values
|
||||||
|
|
||||||
|
expected_height, expected_width = self.feature_extract_tester.get_expected_values(image_inputs)
|
||||||
|
|
||||||
|
self.assertEqual(
|
||||||
|
encoded_images.shape,
|
||||||
|
(1, self.feature_extract_tester.num_channels, expected_height, expected_width),
|
||||||
|
)
|
||||||
|
|
||||||
|
# Test batched
|
||||||
|
encoded_images = feature_extractor(image_inputs, return_tensors="pt").pixel_values
|
||||||
|
|
||||||
|
expected_height, expected_width = self.feature_extract_tester.get_expected_values(image_inputs, batched=True)
|
||||||
|
|
||||||
|
self.assertEqual(
|
||||||
|
encoded_images.shape,
|
||||||
|
(
|
||||||
|
self.feature_extract_tester.batch_size,
|
||||||
|
self.feature_extract_tester.num_channels,
|
||||||
|
expected_height,
|
||||||
|
expected_width,
|
||||||
|
),
|
||||||
|
)
|
||||||
|
|
||||||
|
def test_call_pytorch(self):
|
||||||
|
# Initialize feature_extractor
|
||||||
|
feature_extractor = self.feature_extraction_class(**self.feat_extract_dict)
|
||||||
|
# create random PyTorch tensors
|
||||||
|
image_inputs = prepare_image_inputs(self.feature_extract_tester, equal_resolution=False, torchify=True)
|
||||||
|
for image in image_inputs:
|
||||||
|
self.assertIsInstance(image, torch.Tensor)
|
||||||
|
|
||||||
|
# Test not batched input
|
||||||
|
encoded_images = feature_extractor(image_inputs[0], return_tensors="pt").pixel_values
|
||||||
|
|
||||||
|
expected_height, expected_width = self.feature_extract_tester.get_expected_values(image_inputs)
|
||||||
|
|
||||||
|
self.assertEqual(
|
||||||
|
encoded_images.shape,
|
||||||
|
(1, self.feature_extract_tester.num_channels, expected_height, expected_width),
|
||||||
|
)
|
||||||
|
|
||||||
|
# Test batched
|
||||||
|
encoded_images = feature_extractor(image_inputs, return_tensors="pt").pixel_values
|
||||||
|
|
||||||
|
expected_height, expected_width = self.feature_extract_tester.get_expected_values(image_inputs, batched=True)
|
||||||
|
|
||||||
|
self.assertEqual(
|
||||||
|
encoded_images.shape,
|
||||||
|
(
|
||||||
|
self.feature_extract_tester.batch_size,
|
||||||
|
self.feature_extract_tester.num_channels,
|
||||||
|
expected_height,
|
||||||
|
expected_width,
|
||||||
|
),
|
||||||
|
)
|
||||||
|
|
||||||
|
def test_equivalence_pad_and_create_pixel_mask(self):
|
||||||
|
# Initialize feature_extractors
|
||||||
|
feature_extractor_1 = self.feature_extraction_class(**self.feat_extract_dict)
|
||||||
|
feature_extractor_2 = self.feature_extraction_class(do_resize=False, do_normalize=False)
|
||||||
|
# create random PyTorch tensors
|
||||||
|
image_inputs = prepare_image_inputs(self.feature_extract_tester, equal_resolution=False, torchify=True)
|
||||||
|
for image in image_inputs:
|
||||||
|
self.assertIsInstance(image, torch.Tensor)
|
||||||
|
|
||||||
|
        # Test whether calling the method "encode_inputs" and calling the feature extractor return the same tensors
        encoded_images_with_method = feature_extractor_1.encode_inputs(image_inputs, return_tensors="pt")
        encoded_images = feature_extractor_2(image_inputs, return_tensors="pt")

        self.assertTrue(
            torch.allclose(encoded_images_with_method["pixel_values"], encoded_images["pixel_values"], atol=1e-4)
        )
        self.assertTrue(
            torch.allclose(encoded_images_with_method["pixel_mask"], encoded_images["pixel_mask"], atol=1e-4)
        )
|
||||||
|
|
||||||
|
def comm_get_feature_extractor_inputs(self, with_annotations=False):
|
||||||
|
feature_extractor = self.feature_extraction_class(**self.feat_extract_dict)
|
||||||
|
# prepare image and target
|
||||||
|
num_classes = 8
|
||||||
|
batch_size = self.feature_extract_tester.batch_size
|
||||||
|
annotations = None
|
||||||
|
|
||||||
|
if with_annotations:
|
||||||
|
annotations = [
|
||||||
|
{
|
||||||
|
"masks": np.random.rand(num_classes, 384, 384).astype(np.float32),
|
||||||
|
"labels": (np.random.rand(num_classes) > 0.5).astype(np.int64),
|
||||||
|
}
|
||||||
|
for _ in range(batch_size)
|
||||||
|
]
|
||||||
|
|
||||||
|
image_inputs = prepare_image_inputs(self.feature_extract_tester, equal_resolution=False)
|
||||||
|
|
||||||
|
inputs = feature_extractor(image_inputs, annotations, return_tensors="pt", pad_and_return_pixel_mask=True)
|
||||||
|
|
||||||
|
return inputs
|
||||||
|
|
||||||
|
def test_with_size_divisibility(self):
|
||||||
|
size_divisibilities = [8, 16, 32]
|
||||||
|
weird_input_sizes = [(407, 802), (582, 1094)]
|
||||||
|
for size_divisibility in size_divisibilities:
|
||||||
|
feat_extract_dict = {**self.feat_extract_dict, **{"size_divisibility": size_divisibility}}
|
||||||
|
feature_extractor = self.feature_extraction_class(**feat_extract_dict)
|
||||||
|
for weird_input_size in weird_input_sizes:
|
||||||
|
inputs = feature_extractor([np.ones((3, *weird_input_size))], return_tensors="pt")
|
||||||
|
pixel_values = inputs["pixel_values"]
|
||||||
|
# check if divisible
|
||||||
|
self.assertTrue((pixel_values.shape[-1] % size_divisibility) == 0)
|
||||||
|
self.assertTrue((pixel_values.shape[-2] % size_divisibility) == 0)
|
||||||
|
|
||||||
|
def test_call_with_numpy_annotations(self):
|
||||||
|
num_classes = 8
|
||||||
|
batch_size = self.feature_extract_tester.batch_size
|
||||||
|
|
||||||
|
inputs = self.comm_get_feature_extractor_inputs(with_annotations=True)
|
||||||
|
|
||||||
|
# check the batch_size
|
||||||
|
for el in inputs.values():
|
||||||
|
self.assertEqual(el.shape[0], batch_size)
|
||||||
|
|
||||||
|
pixel_values = inputs["pixel_values"]
|
||||||
|
mask_labels = inputs["mask_labels"]
|
||||||
|
class_labels = inputs["class_labels"]
|
||||||
|
|
||||||
|
self.assertEqual(pixel_values.shape[-2], mask_labels.shape[-2])
|
||||||
|
self.assertEqual(pixel_values.shape[-1], mask_labels.shape[-1])
|
||||||
|
self.assertEqual(mask_labels.shape[1], class_labels.shape[1])
|
||||||
|
self.assertEqual(mask_labels.shape[1], num_classes)
|
||||||
405
tests/maskformer/test_modeling_maskformer.py
Normal file
@@ -0,0 +1,405 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2022 The HuggingFace Inc. team. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
""" Testing suite for the PyTorch MaskFormer model. """
|
||||||
|
|
||||||
|
import inspect
|
||||||
|
import unittest
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
|
||||||
|
from tests.test_modeling_common import floats_tensor
|
||||||
|
from transformers import MaskFormerConfig, is_torch_available, is_vision_available
|
||||||
|
from transformers.file_utils import cached_property
|
||||||
|
from transformers.testing_utils import require_torch, require_vision, slow, torch_device
|
||||||
|
|
||||||
|
from ..test_configuration_common import ConfigTester
|
||||||
|
from ..test_modeling_common import ModelTesterMixin
|
||||||
|
|
||||||
|
|
||||||
|
if is_torch_available():
|
||||||
|
import torch
|
||||||
|
|
||||||
|
from transformers import MaskFormerForInstanceSegmentation, MaskFormerModel
|
||||||
|
|
||||||
|
if is_vision_available():
    from PIL import Image

    from transformers import MaskFormerFeatureExtractor
|
||||||
|
|
||||||
|
|
||||||
|
class MaskFormerModelTester:
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
parent,
|
||||||
|
batch_size=2,
|
||||||
|
is_training=True,
|
||||||
|
use_auxiliary_loss=False,
|
||||||
|
num_queries=100,
|
||||||
|
num_channels=3,
|
||||||
|
min_size=384,
|
||||||
|
max_size=640,
|
||||||
|
num_labels=150,
|
||||||
|
mask_feature_size=256,
|
||||||
|
):
|
||||||
|
self.parent = parent
|
||||||
|
self.batch_size = batch_size
|
||||||
|
self.is_training = is_training
|
||||||
|
self.use_auxiliary_loss = use_auxiliary_loss
|
||||||
|
self.num_queries = num_queries
|
||||||
|
self.num_channels = num_channels
|
||||||
|
self.min_size = min_size
|
||||||
|
self.max_size = max_size
|
||||||
|
self.num_labels = num_labels
|
||||||
|
self.mask_feature_size = mask_feature_size
|
||||||
|
|
||||||
|
def prepare_config_and_inputs(self):
|
||||||
|
pixel_values = floats_tensor([self.batch_size, self.num_channels, self.min_size, self.max_size])
|
||||||
|
|
||||||
|
pixel_mask = torch.ones([self.batch_size, self.min_size, self.max_size], device=torch_device)
|
||||||
|
|
||||||
|
mask_labels = (
|
||||||
|
torch.rand([self.batch_size, self.num_labels, self.min_size, self.max_size], device=torch_device) > 0.5
|
||||||
|
).float()
|
||||||
|
class_labels = (torch.rand((self.batch_size, self.num_labels), device=torch_device) > 0.5).long()
|
||||||
|
|
||||||
|
config = self.get_config()
|
||||||
|
return config, pixel_values, pixel_mask, mask_labels, class_labels
|
||||||
|
|
||||||
|
def get_config(self):
|
||||||
|
return MaskFormerConfig(
|
||||||
|
num_queries=self.num_queries,
|
||||||
|
num_channels=self.num_channels,
|
||||||
|
num_labels=self.num_labels,
|
||||||
|
mask_feature_size=self.mask_feature_size,
|
||||||
|
)
|
||||||
|
|
||||||
|
def prepare_config_and_inputs_for_common(self):
|
||||||
|
config, pixel_values, pixel_mask, _, _ = self.prepare_config_and_inputs()
|
||||||
|
inputs_dict = {"pixel_values": pixel_values, "pixel_mask": pixel_mask}
|
||||||
|
return config, inputs_dict
|
||||||
|
|
||||||
|
    def check_output_hidden_state(self, output, config):
        encoder_hidden_states = output.encoder_hidden_states
        pixel_decoder_hidden_states = output.pixel_decoder_hidden_states
        transformer_decoder_hidden_states = output.transformer_decoder_hidden_states

        # NOTE: assertTrue treats its second argument as the failure message, so these calls
        # only verify that each tuple of hidden states is non-empty; no lengths are compared
        self.parent.assertTrue(len(encoder_hidden_states), len(config.backbone_config.depths))
        self.parent.assertTrue(len(pixel_decoder_hidden_states), len(config.backbone_config.depths))
        self.parent.assertTrue(len(transformer_decoder_hidden_states), config.decoder_config.decoder_layers)
|
||||||
|
|
||||||
|
    def create_and_check_maskformer_model(self, config, pixel_values, pixel_mask, output_hidden_states=False):
        with torch.no_grad():
            model = MaskFormerModel(config=config)
            model.to(torch_device)
            model.eval()

            output = model(pixel_values=pixel_values, pixel_mask=pixel_mask)
            output = model(pixel_values, output_hidden_states=True)
        # the correct shape of output.transformer_decoder_last_hidden_state ensures the correctness of the
        # encoder and pixel decoder
        self.parent.assertEqual(
            output.transformer_decoder_last_hidden_state.shape,
            (self.batch_size, self.num_queries, self.mask_feature_size),
        )
        # let's ensure the other two hidden states exist
        self.parent.assertTrue(output.pixel_decoder_last_hidden_state is not None)
        self.parent.assertTrue(output.encoder_last_hidden_state is not None)

        if output_hidden_states:
            self.check_output_hidden_state(output, config)
|
||||||
|
|
||||||
|
def create_and_check_maskformer_instance_segmentation_head_model(
|
||||||
|
self, config, pixel_values, pixel_mask, mask_labels, class_labels
|
||||||
|
):
|
||||||
|
model = MaskFormerForInstanceSegmentation(config=config)
|
||||||
|
model.to(torch_device)
|
||||||
|
model.eval()
|
||||||
|
|
||||||
|
def comm_check_on_output(result):
|
||||||
|
# let's still check that all the required stuff is there
|
||||||
|
self.parent.assertTrue(result.transformer_decoder_hidden_states is not None)
|
||||||
|
self.parent.assertTrue(result.pixel_decoder_last_hidden_state is not None)
|
||||||
|
self.parent.assertTrue(result.encoder_last_hidden_state is not None)
|
||||||
|
# okay, now we need to check the logits shape
|
||||||
|
# due to the encoder compression, masks have a //4 spatial size
|
||||||
|
self.parent.assertEqual(
|
||||||
|
result.masks_queries_logits.shape,
|
||||||
|
(self.batch_size, self.num_queries, self.min_size // 4, self.max_size // 4),
|
||||||
|
)
|
||||||
|
# + 1 for null class
|
||||||
|
self.parent.assertEqual(
|
||||||
|
result.class_queries_logits.shape, (self.batch_size, self.num_queries, self.num_labels + 1)
|
||||||
|
)
|
||||||
|
|
||||||
|
with torch.no_grad():
|
||||||
|
result = model(pixel_values=pixel_values, pixel_mask=pixel_mask)
|
||||||
|
result = model(pixel_values)
|
||||||
|
|
||||||
|
comm_check_on_output(result)
|
||||||
|
|
||||||
|
result = model(
|
||||||
|
pixel_values=pixel_values, pixel_mask=pixel_mask, mask_labels=mask_labels, class_labels=class_labels
|
||||||
|
)
|
||||||
|
|
||||||
|
comm_check_on_output(result)
|
||||||
|
|
||||||
|
self.parent.assertTrue(result.loss is not None)
|
||||||
|
self.parent.assertEqual(result.loss.shape, torch.Size([1]))
|
||||||
|
|
||||||
|
|
||||||
|
@require_torch
|
||||||
|
class MaskFormerModelTest(ModelTesterMixin, unittest.TestCase):
|
||||||
|
|
||||||
|
all_model_classes = (MaskFormerModel, MaskFormerForInstanceSegmentation) if is_torch_available() else ()
|
||||||
|
|
||||||
|
is_encoder_decoder = False
|
||||||
|
test_torchscript = False
|
||||||
|
test_pruning = False
|
||||||
|
test_head_masking = False
|
||||||
|
test_missing_keys = False
|
||||||
|
|
||||||
|
def setUp(self):
|
||||||
|
self.model_tester = MaskFormerModelTester(self)
|
||||||
|
self.config_tester = ConfigTester(self, config_class=MaskFormerConfig, has_text_modality=False)
|
||||||
|
|
||||||
|
def test_config(self):
|
||||||
|
self.config_tester.run_common_tests()
|
||||||
|
|
||||||
|
def test_maskformer_model(self):
|
||||||
|
config, inputs = self.model_tester.prepare_config_and_inputs_for_common()
|
||||||
|
self.model_tester.create_and_check_maskformer_model(config, **inputs, output_hidden_states=False)
|
||||||
|
|
||||||
|
def test_maskformer_instance_segmentation_head_model(self):
|
||||||
|
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||||
|
self.model_tester.create_and_check_maskformer_instance_segmentation_head_model(*config_and_inputs)
|
||||||
|
|
||||||
|
@unittest.skip(reason="MaskFormer does not use inputs_embeds")
|
||||||
|
def test_inputs_embeds(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
@unittest.skip(reason="MaskFormer does not have a get_input_embeddings method")
|
||||||
|
def test_model_common_attributes(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
@unittest.skip(reason="MaskFormer is not a generative model")
|
||||||
|
def test_generate_without_input_ids(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
@unittest.skip(reason="MaskFormer does not use token embeddings")
|
||||||
|
def test_resize_tokens_embeddings(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
def test_forward_signature(self):
|
||||||
|
config, _ = self.model_tester.prepare_config_and_inputs_for_common()
|
||||||
|
|
||||||
|
for model_class in self.all_model_classes:
|
||||||
|
model = model_class(config)
|
||||||
|
signature = inspect.signature(model.forward)
|
||||||
|
# signature.parameters is an OrderedDict => so arg_names order is deterministic
|
||||||
|
arg_names = [*signature.parameters.keys()]
|
||||||
|
|
||||||
|
expected_arg_names = ["pixel_values"]
|
||||||
|
self.assertListEqual(arg_names[:1], expected_arg_names)
|
||||||
|
|
||||||
|
@slow
|
||||||
|
def test_model_from_pretrained(self):
|
||||||
|
for model_name in ["facebook/maskformer-swin-small-coco"]:
|
||||||
|
model = MaskFormerModel.from_pretrained(model_name)
|
||||||
|
self.assertIsNotNone(model)
|
||||||
|
|
||||||
|
@slow
|
||||||
|
def test_model_with_labels(self):
|
||||||
|
inputs = {
|
||||||
|
"pixel_values": torch.randn((2, 3, 384, 384)),
|
||||||
|
"mask_labels": torch.randn((2, 10, 384, 384)),
|
||||||
|
"class_labels": torch.zeros(2, 10).long(),
|
||||||
|
}
|
||||||
|
|
||||||
|
model = MaskFormerForInstanceSegmentation(MaskFormerConfig())
|
||||||
|
outputs = model(**inputs)
|
||||||
|
self.assertTrue(outputs.loss is not None)
|
||||||
|
|
||||||
|
def test_hidden_states_output(self):
|
||||||
|
config, inputs = self.model_tester.prepare_config_and_inputs_for_common()
|
||||||
|
self.model_tester.create_and_check_maskformer_model(config, **inputs, output_hidden_states=True)
|
||||||
|
|
||||||
|
    def test_attention_outputs(self):
        config, inputs = self.model_tester.prepare_config_and_inputs_for_common()

        for model_class in self.all_model_classes:
            # move the model to the same device as the inputs
            model = model_class(config).to(torch_device)
            outputs = model(**inputs, output_attentions=True)
            self.assertTrue(outputs.attentions is not None)
|
||||||
|
|
||||||
|
def test_training(self):
|
||||||
|
if not self.model_tester.is_training:
|
||||||
|
return
|
||||||
|
# only MaskFormerForInstanceSegmentation has the loss
|
||||||
|
model_class = self.all_model_classes[1]
|
||||||
|
config, pixel_values, pixel_mask, mask_labels, class_labels = self.model_tester.prepare_config_and_inputs()
|
||||||
|
|
||||||
|
model = model_class(config)
|
||||||
|
model.to(torch_device)
|
||||||
|
model.train()
|
||||||
|
|
||||||
|
loss = model(pixel_values, mask_labels=mask_labels, class_labels=class_labels).loss
|
||||||
|
loss.backward()
|
||||||
|
|
||||||
|
def test_retain_grad_hidden_states_attentions(self):
|
||||||
|
# only MaskFormerForInstanceSegmentation has the loss
|
||||||
|
model_class = self.all_model_classes[1]
|
||||||
|
config, pixel_values, pixel_mask, mask_labels, class_labels = self.model_tester.prepare_config_and_inputs()
|
||||||
|
config.output_hidden_states = True
|
||||||
|
config.output_attentions = True
|
||||||
|
|
||||||
|
model = model_class(config)
|
||||||
|
model.to(torch_device)
|
||||||
|
model.train()
|
||||||
|
|
||||||
|
outputs = model(pixel_values, mask_labels=mask_labels, class_labels=class_labels)
|
||||||
|
|
||||||
|
encoder_hidden_states = outputs.encoder_hidden_states[0]
|
||||||
|
encoder_hidden_states.retain_grad()
|
||||||
|
|
||||||
|
pixel_decoder_hidden_states = outputs.pixel_decoder_hidden_states[0]
|
||||||
|
pixel_decoder_hidden_states.retain_grad()
|
||||||
|
# we set requires_grad=True on inputs_embeds (line 2152); the original implementation doesn't
|
||||||
|
transformer_decoder_hidden_states = outputs.transformer_decoder_hidden_states[0]
|
||||||
|
transformer_decoder_hidden_states.retain_grad()
|
||||||
|
|
||||||
|
attentions = outputs.attentions[0]
|
||||||
|
attentions.retain_grad()
|
||||||
|
|
||||||
|
outputs.loss.backward(retain_graph=True)
|
||||||
|
|
||||||
|
self.assertIsNotNone(encoder_hidden_states.grad)
|
||||||
|
self.assertIsNotNone(pixel_decoder_hidden_states.grad)
|
||||||
|
self.assertIsNotNone(transformer_decoder_hidden_states.grad)
|
||||||
|
self.assertIsNotNone(attentions.grad)
|
||||||
|
|
||||||
|
|
||||||
|
TOLERANCE = 1e-4
|
||||||
|
|
||||||
|
|
||||||
|
# We will verify our results on an image of cute cats
|
||||||
|
def prepare_img():
|
||||||
|
image = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
|
||||||
|
return image
|
||||||
|
|
||||||
|
|
||||||
|
@require_vision
|
||||||
|
@slow
|
||||||
|
class MaskFormerModelIntegrationTest(unittest.TestCase):
|
||||||
|
@cached_property
|
||||||
|
def model_checkpoints(self):
|
||||||
|
return "facebook/maskformer-swin-small-coco"
|
||||||
|
|
||||||
|
@cached_property
|
||||||
|
def default_feature_extractor(self):
|
||||||
|
return MaskFormerFeatureExtractor.from_pretrained(self.model_checkpoints) if is_vision_available() else None
|
||||||
|
|
||||||
|
def test_inference_no_head(self):
|
||||||
|
model = MaskFormerModel.from_pretrained(self.model_checkpoints).to(torch_device)
|
||||||
|
feature_extractor = self.default_feature_extractor
|
||||||
|
image = prepare_img()
|
||||||
|
inputs = feature_extractor(image, return_tensors="pt").to(torch_device)
|
||||||
|
inputs_shape = inputs["pixel_values"].shape
|
||||||
|
# check size is divisible by 32
|
||||||
|
self.assertTrue((inputs_shape[-1] % 32) == 0 and (inputs_shape[-2] % 32) == 0)
|
||||||
|
# check size
|
||||||
|
self.assertEqual(inputs_shape, (1, 3, 800, 1088))
|
||||||
|
|
||||||
|
with torch.no_grad():
|
||||||
|
outputs = model(**inputs)
|
||||||
|
|
||||||
|
expected_slice_hidden_state = torch.tensor(
|
||||||
|
[[-0.0482, 0.9228, 0.4951], [-0.2547, 0.8017, 0.8527], [-0.0069, 0.3385, -0.0089]]
|
||||||
|
).to(torch_device)
|
||||||
|
self.assertTrue(
|
||||||
|
torch.allclose(
|
||||||
|
outputs.encoder_last_hidden_state[0, 0, :3, :3], expected_slice_hidden_state, atol=TOLERANCE
|
||||||
|
)
|
||||||
|
)
|
||||||
|
|
||||||
|
expected_slice_hidden_state = torch.tensor(
|
||||||
|
[[-0.8422, -0.8434, -0.9718], [-1.0144, -0.5565, -0.4195], [-1.0038, -0.4484, -0.1961]]
|
||||||
|
).to(torch_device)
|
||||||
|
self.assertTrue(
|
||||||
|
torch.allclose(
|
||||||
|
outputs.pixel_decoder_last_hidden_state[0, 0, :3, :3], expected_slice_hidden_state, atol=TOLERANCE
|
||||||
|
)
|
||||||
|
)
|
||||||
|
|
||||||
|
expected_slice_hidden_state = torch.tensor(
|
||||||
|
[[0.2852, -0.0159, 0.9735], [0.6254, 0.1858, 0.8529], [-0.0680, -0.4116, 1.8413]]
|
||||||
|
).to(torch_device)
|
||||||
|
self.assertTrue(
|
||||||
|
torch.allclose(
|
||||||
|
outputs.transformer_decoder_last_hidden_state[0, :3, :3], expected_slice_hidden_state, atol=TOLERANCE
|
||||||
|
)
|
||||||
|
)
|
||||||
|
|
||||||
|
    def test_inference_instance_segmentation_head(self):
        model = MaskFormerForInstanceSegmentation.from_pretrained(self.model_checkpoints).to(torch_device).eval()
        feature_extractor = self.default_feature_extractor
        image = prepare_img()
        inputs = feature_extractor(image, return_tensors="pt").to(torch_device)
        inputs_shape = inputs["pixel_values"].shape
        # check size is divisible by 32
        self.assertTrue((inputs_shape[-1] % 32) == 0 and (inputs_shape[-2] % 32) == 0)
        # check size
        self.assertEqual(inputs_shape, (1, 3, 800, 1088))

        with torch.no_grad():
            outputs = model(**inputs)
        # masks_queries_logits
        masks_queries_logits = outputs.masks_queries_logits
        self.assertEqual(
            masks_queries_logits.shape, (1, model.config.num_queries, inputs_shape[-2] // 4, inputs_shape[-1] // 4)
        )
        # expected values must live on the same device as the model outputs
        expected_slice = torch.tensor(
            [[-1.3738, -1.7725, -1.9365], [-1.5978, -1.9869, -2.1524], [-1.5796, -1.9271, -2.0940]]
        ).to(torch_device)
        self.assertTrue(torch.allclose(masks_queries_logits[0, 0, :3, :3], expected_slice, atol=TOLERANCE))
        # class_queries_logits
        class_queries_logits = outputs.class_queries_logits
        self.assertEqual(class_queries_logits.shape, (1, model.config.num_queries, model.config.num_labels + 1))
        expected_slice = torch.tensor(
            [
                [1.6512e00, -5.2572e00, -3.3519e00],
                [3.6169e-02, -5.9025e00, -2.9313e00],
                [1.0766e-04, -7.7630e00, -5.1263e00],
            ]
        ).to(torch_device)
        self.assertTrue(torch.allclose(outputs.class_queries_logits[0, :3, :3], expected_slice, atol=TOLERANCE))
|
||||||
|
|
||||||
|
    def test_with_annotations_and_loss(self):
        model = MaskFormerForInstanceSegmentation.from_pretrained(self.model_checkpoints).to(torch_device).eval()
        feature_extractor = self.default_feature_extractor

        # move the encoded batch to the same device as the model
        inputs = feature_extractor(
            [np.zeros((3, 800, 1333)), np.zeros((3, 800, 1333))],
            annotations=[
                {"masks": np.random.rand(10, 384, 384).astype(np.float32), "labels": np.zeros(10).astype(np.int64)},
                {"masks": np.random.rand(10, 384, 384).astype(np.float32), "labels": np.zeros(10).astype(np.int64)},
            ],
            return_tensors="pt",
        ).to(torch_device)

        with torch.no_grad():
            outputs = model(**inputs)

        self.assertTrue(outputs.loss is not None)
|
||||||
@@ -169,6 +169,7 @@ IGNORE_NON_AUTO_CONFIGURED = PRIVATE_MODELS.copy() + [
     "VisualBertForMultipleChoice",
     "TFWav2Vec2ForCTC",
     "TFHubertForCTC",
+    "MaskFormerForInstanceSegmentation",
 ]
 
 # Update this list for models that have multiple model types for the same