Add MAE (#15120)
* First draft
* More improvements
* More improvements
* More improvements
* Fix embeddings
* Add conversion script
* Finish conversion script
* More improvements
* Fix forward pass
* Remove print statements
* Add weights initialization
* Add initialization of decoder weights
* Add support for other models in the conversion script
* Fix patch_size for huge model
* Fix most of the tests
* Fix integration test
* Fix docs
* Fix archive_list
* Apply suggestions from code review
* Improve documentation
* Apply more suggestions
* Skip some tests due to non-deterministic behaviour
* Fix test_initialization
* Remove unnecessary initialization of nn.Embedding
* Improve docs
* Fix dummies
* Remove ViTMAEFeatureExtractor from docs
* Add model to README and table of contents
* Delete inference file
@@ -312,6 +312,7 @@ Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
 1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (from Microsoft Research) released with the paper [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER
 AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.
 1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
+1. **[ViTMAE](https://huggingface.co/docs/transformers/master/model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
 1. **[VisualBERT](https://huggingface.co/docs/transformers/model_doc/visual_bert)** (from UCLA NLP) released with the paper [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
 1. **[WavLM](https://huggingface.co/docs/transformers/master/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
 1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
@@ -291,6 +291,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
 1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (from Microsoft Research) released with the paper [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.
 1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
 1. **[VisualBERT](https://huggingface.co/docs/transformers/model_doc/visual_bert)** (from UCLA NLP) released with the paper [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
+1. **[ViTMAE](https://huggingface.co/docs/transformers/master/model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
 1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
 1. **[Wav2Vec2Phoneme](https://huggingface.co/docs/master/transformers/model_doc/wav2vec2_phoneme)** (from Facebook AI) released with the paper [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) by Qiantong Xu, Alexei Baevski, Michael Auli.
 1. **[WavLM](https://huggingface.co/docs/transformers/master/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
@@ -315,6 +315,7 @@ conda install -c huggingface transformers
 1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (来自 Microsoft Research) 伴随论文 [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) 由 Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu 发布。
 1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (来自 Google AI) 伴随论文 [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) 由 Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby 发布。
 1. **[VisualBERT](https://huggingface.co/docs/transformers/model_doc/visual_bert)** (来自 UCLA NLP) 伴随论文 [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) 由 Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang 发布。
+1. **[ViTMAE](https://huggingface.co/docs/transformers/master/model_doc/vit_mae)** (来自 Meta AI) 伴随论文 [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) 由 Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick 发布。
 1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (来自 Facebook AI) 伴随论文 [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) 由 Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli 发布。
 1. **[Wav2Vec2Phoneme](https://huggingface.co/docs/master/transformers/model_doc/wav2vec2_phoneme)** (来自 Facebook AI) 伴随论文 [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) 由 Qiantong Xu, Alexei Baevski, Michael Auli 发布。
 1. **[WavLM](https://huggingface.co/docs/transformers/master/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
@@ -327,6 +327,7 @@ conda install -c huggingface transformers
 1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (from Microsoft Research) released with the paper [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.
 1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
 1. **[VisualBERT](https://huggingface.co/docs/transformers/model_doc/visual_bert)** (from UCLA NLP) released with the paper [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
+1. **[ViTMAE](https://huggingface.co/docs/transformers/master/model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
 1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
 1. **[Wav2Vec2Phoneme](https://huggingface.co/docs/master/transformers/model_doc/wav2vec2_phoneme)** (from Facebook AI) released with the paper [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) by Qiantong Xu, Alexei Baevski, Michael Auli.
 1. **[WavLM](https://huggingface.co/docs/transformers/master/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
@@ -288,6 +288,8 @@
   title: Vision Text Dual Encoder
 - local: model_doc/vit
   title: Vision Transformer (ViT)
+- local: model_doc/vit_mae
+  title: ViTMAE
 - local: model_doc/visual_bert
   title: VisualBERT
 - local: model_doc/wav2vec2
@@ -171,6 +171,7 @@ conversion utilities for the following models.
 1. **[UniSpeech](model_doc/unispeech)** (from Microsoft Research) released with the paper [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.
 1. **[UniSpeechSat](model_doc/unispeech-sat)** (from Microsoft Research) released with the paper [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.
 1. **[Vision Transformer (ViT)](model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
+1. **[ViTMAE](model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
 1. **[VisualBERT](model_doc/visual_bert)** (from UCLA NLP) released with the paper [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
 1. **[WavLM](model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
 1. **[Wav2Vec2](model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
@@ -269,6 +270,7 @@ Flax), PyTorch, and/or TensorFlow.
 | VisionTextDualEncoder | ❌ | ❌ | ✅ | ❌ | ✅ |
 | VisualBert | ❌ | ❌ | ✅ | ❌ | ❌ |
 | ViT | ❌ | ❌ | ✅ | ✅ | ✅ |
+| ViTMAE | ❌ | ❌ | ✅ | ❌ | ❌ |
 | Wav2Vec2 | ✅ | ❌ | ✅ | ✅ | ✅ |
 | WavLM | ❌ | ❌ | ✅ | ❌ | ❌ |
 | XLM | ✅ | ❌ | ✅ | ✅ | ❌ |
63
docs/source/model_doc/vit_mae.mdx
Normal file
@@ -0,0 +1,63 @@
<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# ViTMAE

## Overview

The ViTMAE model was proposed in [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377v2) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li,
Piotr Dollár, Ross Girshick. The paper shows that, by pre-training a Vision Transformer (ViT) to reconstruct pixel values for masked patches, one can get results after
fine-tuning that outperform supervised pre-training.
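
As a rough illustration of the objective described above, the sketch below computes a mean squared error over masked patches only, which is how the MAE paper defines its reconstruction loss. The function name `masked_patch_mse`, the plain-list data layout, and the toy values are assumptions made for this example; they are not part of the `transformers` API.

```python
def masked_patch_mse(pred, target, mask):
    """Mean squared error averaged over masked patches only.

    pred, target: lists of patches, each patch a list of pixel values.
    mask: list of 0/1 flags, 1 = patch was masked, 0 = patch was visible.
    """
    # Per-patch MSE, computed for every patch regardless of the mask.
    per_patch = [
        sum((p - t) ** 2 for p, t in zip(pred_patch, target_patch)) / len(target_patch)
        for pred_patch, target_patch in zip(pred, target)
    ]
    num_masked = sum(mask)
    if num_masked == 0:
        raise ValueError("mask must select at least one patch")
    # Visible patches (mask == 0) contribute nothing to the loss.
    return sum(loss * m for loss, m in zip(per_patch, mask)) / num_masked


# Toy example (hypothetical values): two patches of two pixels, second one masked.
pred = [[0.0, 0.0], [1.0, 1.0]]
target = [[1.0, 1.0], [1.0, 3.0]]
loss = masked_patch_mse(pred, target, mask=[0, 1])
```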

The abstract from the paper is the following:

*This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the
input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates
only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask
tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs
enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity
models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream
tasks outperforms supervised pre-training and shows promising scaling behavior.*

Tips:

- MAE (masked autoencoding) is a method for self-supervised pre-training of Vision Transformers (ViTs). The pre-training objective is relatively simple:
  a large portion (75%) of the image patches is masked, and the model must reconstruct their raw pixel values. One can use [`ViTMAEForPreTraining`] for this purpose.
- After pre-training, one "throws away" the decoder used to reconstruct pixels and uses only the encoder for fine-tuning/linear probing. This means that after
  fine-tuning, one can directly plug the weights into a [`ViTForImageClassification`].
- One can use [`ViTFeatureExtractor`] to prepare images for the model. See the code examples for more info.
- Note that the encoder of MAE is only used to encode the visible patches. The encoded patches are then concatenated with mask tokens, which the decoder (which also
  consists of Transformer blocks) takes as input. Each mask token is a shared, learned vector that indicates the presence of a missing patch to be predicted. Fixed
  sin/cos position embeddings are added to the inputs of both the encoder and the decoder.
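
The masking step and the decoder input described in the tips can be sketched in plain Python. The sketch mirrors the sort-random-noise approach of the reference MAE code, but it is only an illustration: the helper names `random_masking` and `build_decoder_input`, the string stand-ins for token vectors, and the 196-patch example (a 14x14 grid) are assumptions, not the `transformers` implementation.

```python
import random


def random_masking(num_patches, mask_ratio=0.75, seed=None):
    """Choose visible patches by sorting per-patch random noise.

    Returns (ids_keep, mask, ids_restore):
      ids_keep    - indices of the visible patches, in shuffled order
      mask        - 0 for visible, 1 for masked, in original patch order
      ids_restore - for each original position, its rank in the shuffled order
    """
    rng = random.Random(seed)
    len_keep = int(num_patches * (1 - mask_ratio))
    noise = [rng.random() for _ in range(num_patches)]
    # Patches with the smallest noise are kept, the rest are masked.
    ids_shuffle = sorted(range(num_patches), key=noise.__getitem__)
    ids_restore = [0] * num_patches
    for rank, idx in enumerate(ids_shuffle):
        ids_restore[idx] = rank
    ids_keep = ids_shuffle[:len_keep]
    mask = [0 if ids_restore[i] < len_keep else 1 for i in range(num_patches)]
    return ids_keep, mask, ids_restore


def build_decoder_input(encoded_visible, ids_restore, mask_token):
    """Append shared mask tokens to the encoded patches, then undo the shuffle."""
    num_patches = len(ids_restore)
    shuffled = list(encoded_visible) + [mask_token] * (num_patches - len(encoded_visible))
    return [shuffled[ids_restore[i]] for i in range(num_patches)]


# Hypothetical usage: 196 patches, 75% masked, strings standing in for vectors.
ids_keep, mask, ids_restore = random_masking(196, mask_ratio=0.75, seed=0)
visible = [f"enc({i})" for i in ids_keep]  # stand-in for encoder outputs
tokens = build_decoder_input(visible, ids_restore, "[MASK]")
```

Only the entries in `ids_keep` would be passed through the encoder, which is where the training speed-up mentioned in the abstract comes from; the decoder then sees the full-length sequence with mask tokens restored to their original positions.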

<img src="https://user-images.githubusercontent.com/11435359/146857310-f258c86c-fde6-48e8-9cee-badd2b21bd2c.png"
alt="drawing" width="600"/>

<small> MAE architecture. Taken from the <a href="https://arxiv.org/abs/2111.06377">original paper.</a> </small>

This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/facebookresearch/mae).

## ViTMAEConfig

[[autodoc]] ViTMAEConfig


## ViTMAEModel

[[autodoc]] ViTMAEModel
    - forward


## ViTMAEForPreTraining

[[autodoc]] transformers.ViTMAEForPreTraining
    - forward
@@ -312,6 +312,7 @@ _import_structure = {
     "models.vision_text_dual_encoder": ["VisionTextDualEncoderConfig", "VisionTextDualEncoderProcessor"],
     "models.visual_bert": ["VISUAL_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "VisualBertConfig"],
     "models.vit": ["VIT_PRETRAINED_CONFIG_ARCHIVE_MAP", "ViTConfig"],
+    "models.vit_mae": ["VIT_MAE_PRETRAINED_CONFIG_ARCHIVE_MAP", "ViTMAEConfig"],
     "models.wav2vec2": [
         "WAV_2_VEC_2_PRETRAINED_CONFIG_ARCHIVE_MAP",
         "Wav2Vec2Config",
@@ -627,6 +628,7 @@ if is_torch_available():
     _import_structure["modeling_utils"] = ["Conv1D", "PreTrainedModel", "apply_chunking_to_forward", "prune_layer"]
 
     # PyTorch models structure
+
     _import_structure["models.albert"].extend(
         [
             "ALBERT_PRETRAINED_MODEL_ARCHIVE_LIST",
@@ -1402,6 +1404,15 @@ if is_torch_available():
             "ViTPreTrainedModel",
         ]
     )
+    _import_structure["models.vit_mae"].extend(
+        [
+            "VIT_MAE_PRETRAINED_MODEL_ARCHIVE_LIST",
+            "ViTMAEForPreTraining",
+            "ViTMAELayer",
+            "ViTMAEModel",
+            "ViTMAEPreTrainedModel",
+        ]
+    )
     _import_structure["models.wav2vec2"].extend(
         [
             "WAV_2_VEC_2_PRETRAINED_MODEL_ARCHIVE_LIST",
@@ -2401,6 +2412,7 @@ if TYPE_CHECKING:
     from .models.vision_text_dual_encoder import VisionTextDualEncoderConfig, VisionTextDualEncoderProcessor
     from .models.visual_bert import VISUAL_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, VisualBertConfig
     from .models.vit import VIT_PRETRAINED_CONFIG_ARCHIVE_MAP, ViTConfig
+    from .models.vit_mae import VIT_MAE_PRETRAINED_CONFIG_ARCHIVE_MAP, ViTMAEConfig
     from .models.wav2vec2 import (
         WAV_2_VEC_2_PRETRAINED_CONFIG_ARCHIVE_MAP,
         Wav2Vec2Config,
@@ -3307,6 +3319,13 @@ if TYPE_CHECKING:
|
|||||||
ViTModel,
|
ViTModel,
|
||||||
ViTPreTrainedModel,
|
ViTPreTrainedModel,
|
||||||
)
|
)
|
||||||
|
from .models.vit_mae import (
|
||||||
|
VIT_MAE_PRETRAINED_MODEL_ARCHIVE_LIST,
|
||||||
|
ViTMAEForPreTraining,
|
||||||
|
ViTMAELayer,
|
||||||
|
ViTMAEModel,
|
||||||
|
ViTMAEPreTrainedModel,
|
||||||
|
)
|
||||||
from .models.wav2vec2 import (
|
from .models.wav2vec2 import (
|
||||||
WAV_2_VEC_2_PRETRAINED_MODEL_ARCHIVE_LIST,
|
WAV_2_VEC_2_PRETRAINED_MODEL_ARCHIVE_LIST,
|
||||||
Wav2Vec2ForAudioFrameClassification,
|
Wav2Vec2ForAudioFrameClassification,
|
||||||
|
|||||||
@@ -108,6 +108,7 @@ from . import (
|
|||||||
vision_text_dual_encoder,
|
vision_text_dual_encoder,
|
||||||
visual_bert,
|
visual_bert,
|
||||||
vit,
|
vit,
|
||||||
|
vit_mae,
|
||||||
wav2vec2,
|
wav2vec2,
|
||||||
wav2vec2_phoneme,
|
wav2vec2_phoneme,
|
||||||
wav2vec2_with_lm,
|
wav2vec2_with_lm,
|
||||||
|
|||||||
@@ -30,6 +30,7 @@ logger = logging.get_logger(__name__)
|
|||||||
CONFIG_MAPPING_NAMES = OrderedDict(
|
CONFIG_MAPPING_NAMES = OrderedDict(
|
||||||
[
|
[
|
||||||
# Add configs here
|
# Add configs here
|
||||||
|
("vit_mae", "ViTMAEConfig"),
|
||||||
("realm", "RealmConfig"),
|
("realm", "RealmConfig"),
|
||||||
("nystromformer", "NystromformerConfig"),
|
("nystromformer", "NystromformerConfig"),
|
||||||
("imagegpt", "ImageGPTConfig"),
|
("imagegpt", "ImageGPTConfig"),
|
||||||
@@ -118,6 +119,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
|
|||||||
CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
|
CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
|
||||||
[
|
[
|
||||||
# Add archive maps here
|
# Add archive maps here
|
||||||
|
("vit_mae", "VIT_MAE_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||||
("realm", "REALM_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
("realm", "REALM_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||||
("nystromformer", "NYSTROMFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
("nystromformer", "NYSTROMFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||||
("imagegpt", "IMAGEGPT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
("imagegpt", "IMAGEGPT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||||
@@ -194,6 +196,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
|
|||||||
MODEL_NAMES_MAPPING = OrderedDict(
|
MODEL_NAMES_MAPPING = OrderedDict(
|
||||||
[
|
[
|
||||||
# Add full (and cased) model names here
|
# Add full (and cased) model names here
|
||||||
|
("vit_mae", "ViTMAE"),
|
||||||
("realm", "Realm"),
|
("realm", "Realm"),
|
||||||
("nystromformer", "Nystromformer"),
|
("nystromformer", "Nystromformer"),
|
||||||
("imagegpt", "ImageGPT"),
|
("imagegpt", "ImageGPT"),
|
||||||
|
|||||||
@@ -28,6 +28,7 @@ logger = logging.get_logger(__name__)
|
|||||||
MODEL_MAPPING_NAMES = OrderedDict(
|
MODEL_MAPPING_NAMES = OrderedDict(
|
||||||
[
|
[
|
||||||
# Base model mapping
|
# Base model mapping
|
||||||
|
("vit_mae", "ViTMAEModel"),
|
||||||
("nystromformer", "NystromformerModel"),
|
("nystromformer", "NystromformerModel"),
|
||||||
("imagegpt", "ImageGPTModel"),
|
("imagegpt", "ImageGPTModel"),
|
||||||
("qdqbert", "QDQBertModel"),
|
("qdqbert", "QDQBertModel"),
|
||||||
@@ -109,6 +110,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
|
|||||||
MODEL_FOR_PRETRAINING_MAPPING_NAMES = OrderedDict(
|
MODEL_FOR_PRETRAINING_MAPPING_NAMES = OrderedDict(
|
||||||
[
|
[
|
||||||
# Model for pre-training mapping
|
# Model for pre-training mapping
|
||||||
|
("vit_mae", "ViTMAEForPreTraining"),
|
||||||
("fnet", "FNetForPreTraining"),
|
("fnet", "FNetForPreTraining"),
|
||||||
("visual_bert", "VisualBertForPreTraining"),
|
("visual_bert", "VisualBertForPreTraining"),
|
||||||
("layoutlm", "LayoutLMForMaskedLM"),
|
("layoutlm", "LayoutLMForMaskedLM"),
|
||||||
|
|||||||
53
src/transformers/models/vit_mae/__init__.py
Normal file
@@ -0,0 +1,53 @@
|
|||||||
|
# flake8: noqa
|
||||||
|
# There's no way to ignore "F401 '...' imported but unused" warnings in this
|
||||||
|
# module, but to preserve other warnings. So, don't check this module at all.
|
||||||
|
|
||||||
|
# Copyright 2020 The HuggingFace Team. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
from typing import TYPE_CHECKING
|
||||||
|
|
||||||
|
from ...file_utils import _LazyModule, is_flax_available, is_tf_available, is_torch_available
|
||||||
|
|
||||||
|
|
||||||
|
_import_structure = {
|
||||||
|
"configuration_vit_mae": ["VIT_MAE_PRETRAINED_CONFIG_ARCHIVE_MAP", "ViTMAEConfig"],
|
||||||
|
}
|
||||||
|
|
||||||
|
if is_torch_available():
|
||||||
|
_import_structure["modeling_vit_mae"] = [
|
||||||
|
"VIT_MAE_PRETRAINED_MODEL_ARCHIVE_LIST",
|
||||||
|
"ViTMAEForPreTraining",
|
||||||
|
"ViTMAELayer",
|
||||||
|
"ViTMAEModel",
|
||||||
|
"ViTMAEPreTrainedModel",
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
if TYPE_CHECKING:
|
||||||
|
from .configuration_vit_mae import VIT_MAE_PRETRAINED_CONFIG_ARCHIVE_MAP, ViTMAEConfig
|
||||||
|
|
||||||
|
if is_torch_available():
|
||||||
|
from .modeling_vit_mae import (
|
||||||
|
VIT_MAE_PRETRAINED_MODEL_ARCHIVE_LIST,
|
||||||
|
ViTMAEForPreTraining,
|
||||||
|
ViTMAELayer,
|
||||||
|
ViTMAEModel,
|
||||||
|
ViTMAEPreTrainedModel,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
else:
|
||||||
|
import sys
|
||||||
|
|
||||||
|
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure)
|
||||||
141
src/transformers/models/vit_mae/configuration_vit_mae.py
Normal file
@@ -0,0 +1,141 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2022 Facebook AI and The HuggingFace Inc. team. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
""" ViT MAE model configuration"""
|
||||||
|
|
||||||
|
from ...configuration_utils import PretrainedConfig
|
||||||
|
from ...utils import logging
|
||||||
|
|
||||||
|
|
||||||
|
logger = logging.get_logger(__name__)
|
||||||
|
|
||||||
|
VIT_MAE_PRETRAINED_CONFIG_ARCHIVE_MAP = {
|
||||||
|
"facebook/vit-mae-base": "https://huggingface.co/facebook/vit-mae-base/resolve/main/config.json",
|
||||||
|
# See all ViT MAE models at https://huggingface.co/models?filter=vit-mae
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
class ViTMAEConfig(PretrainedConfig):
|
||||||
|
r"""
|
||||||
|
This is the configuration class to store the configuration of a [`ViTMAEModel`]. It is used to instantiate a ViT
|
||||||
|
MAE model according to the specified arguments, defining the model architecture. Instantiating a configuration with
|
||||||
|
the defaults will yield a similar configuration to that of the ViT
|
||||||
|
[facebook/vit-mae-base](https://huggingface.co/facebook/vit-mae-base) architecture.
|
||||||
|
|
||||||
|
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
|
||||||
|
documentation from [`PretrainedConfig`] for more information.
|
||||||
|
|
||||||
|
|
||||||
|
Args:
|
||||||
|
hidden_size (`int`, *optional*, defaults to 768):
|
||||||
|
Dimensionality of the encoder layers and the pooler layer.
|
||||||
|
num_hidden_layers (`int`, *optional*, defaults to 12):
|
||||||
|
Number of hidden layers in the Transformer encoder.
|
||||||
|
num_attention_heads (`int`, *optional*, defaults to 12):
|
||||||
|
Number of attention heads for each attention layer in the Transformer encoder.
|
||||||
|
intermediate_size (`int`, *optional*, defaults to 3072):
|
||||||
|
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
|
||||||
|
hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`):
|
||||||
|
The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
|
||||||
|
`"relu"`, `"selu"` and `"gelu_new"` are supported.
|
||||||
|
hidden_dropout_prob (`float`, *optional*, defaults to 0.0):
|
||||||
|
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
|
||||||
|
attention_probs_dropout_prob (`float`, *optional*, defaults to 0.0):
|
||||||
|
The dropout ratio for the attention probabilities.
|
||||||
|
initializer_range (`float`, *optional*, defaults to 0.02):
|
||||||
|
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
|
||||||
|
layer_norm_eps (`float`, *optional*, defaults to 1e-12):
|
||||||
|
The epsilon used by the layer normalization layers.
|
||||||
|
image_size (`int`, *optional*, defaults to 224):
|
||||||
|
The size (resolution) of each image.
|
||||||
|
patch_size (`int`, *optional*, defaults to 16):
|
||||||
|
The size (resolution) of each patch.
|
||||||
|
num_channels (`int`, *optional*, defaults to 3):
|
||||||
|
The number of input channels.
|
||||||
|
qkv_bias (`bool`, *optional*, defaults to `True`):
|
||||||
|
Whether to add a bias to the queries, keys and values.
|
||||||
|
decoder_num_attention_heads (`int`, *optional*, defaults to 16):
|
||||||
|
Number of attention heads for each attention layer in the decoder.
|
||||||
|
decoder_hidden_size (`int`, *optional*, defaults to 512):
|
||||||
|
Dimensionality of the decoder.
|
||||||
|
decoder_num_hidden_layers (`int`, *optional*, defaults to 8):
|
||||||
|
Number of hidden layers in the decoder.
|
||||||
|
decoder_intermediate_size (`int`, *optional*, defaults to 2048):
|
||||||
|
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the decoder.
|
||||||
|
mask_ratio (`float`, *optional*, defaults to 0.75):
|
||||||
|
The ratio of the number of masked tokens in the input sequence.
|
||||||
|
norm_pix_loss (`bool`, *optional*, defaults to `False`):
|
||||||
|
Whether or not to train with normalized pixels (see Table 3 in the paper).
|
||||||
|
|
||||||
|
Example:
|
||||||
|
|
||||||
|
```python
|
||||||
|
>>> from transformers import ViTMAEModel, ViTMAEConfig
|
||||||
|
|
||||||
|
>>> # Initializing a ViT MAE vit-mae-base style configuration
|
||||||
|
>>> configuration = ViTMAEConfig()
|
||||||
|
|
||||||
|
>>> # Initializing a model from the vit-mae-base style configuration
|
||||||
|
>>> model = ViTMAEModel(configuration)
|
||||||
|
|
||||||
|
>>> # Accessing the model configuration
|
||||||
|
>>> configuration = model.config
|
||||||
|
```"""
|
||||||
|
model_type = "vit_mae"
|
||||||
|
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
hidden_size=768,
|
||||||
|
num_hidden_layers=12,
|
||||||
|
num_attention_heads=12,
|
||||||
|
intermediate_size=3072,
|
||||||
|
hidden_act="gelu",
|
||||||
|
hidden_dropout_prob=0.0,
|
||||||
|
attention_probs_dropout_prob=0.0,
|
||||||
|
initializer_range=0.02,
|
||||||
|
layer_norm_eps=1e-12,
|
||||||
|
is_encoder_decoder=False,
|
||||||
|
image_size=224,
|
||||||
|
patch_size=16,
|
||||||
|
num_channels=3,
|
||||||
|
qkv_bias=True,
|
||||||
|
decoder_num_attention_heads=16,
|
||||||
|
decoder_hidden_size=512,
|
||||||
|
decoder_num_hidden_layers=8,
|
||||||
|
decoder_intermediate_size=2048,
|
||||||
|
mask_ratio=0.75,
|
||||||
|
norm_pix_loss=False,
|
||||||
|
**kwargs
|
||||||
|
):
|
||||||
|
super().__init__(**kwargs)
|
||||||
|
|
||||||
|
self.hidden_size = hidden_size
|
||||||
|
self.num_hidden_layers = num_hidden_layers
|
||||||
|
self.num_attention_heads = num_attention_heads
|
||||||
|
self.intermediate_size = intermediate_size
|
||||||
|
self.hidden_act = hidden_act
|
||||||
|
self.hidden_dropout_prob = hidden_dropout_prob
|
||||||
|
self.attention_probs_dropout_prob = attention_probs_dropout_prob
|
||||||
|
self.initializer_range = initializer_range
|
||||||
|
self.layer_norm_eps = layer_norm_eps
|
||||||
|
self.image_size = image_size
|
||||||
|
self.patch_size = patch_size
|
||||||
|
self.num_channels = num_channels
|
||||||
|
self.qkv_bias = qkv_bias
|
||||||
|
self.decoder_num_attention_heads = decoder_num_attention_heads
|
||||||
|
self.decoder_hidden_size = decoder_hidden_size
|
||||||
|
self.decoder_num_hidden_layers = decoder_num_hidden_layers
|
||||||
|
self.decoder_intermediate_size = decoder_intermediate_size
|
||||||
|
self.mask_ratio = mask_ratio
|
||||||
|
self.norm_pix_loss = norm_pix_loss
|
||||||
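The `image_size`, `patch_size` and `mask_ratio` defaults above determine how many patches the encoder sees. The following is a minimal sketch of that arithmetic, not code from this commit: `patch_counts` is a hypothetical helper, and it assumes the number of visible patches is obtained by truncating `num_patches * (1 - mask_ratio)` to an integer.

```python
def patch_counts(image_size=224, patch_size=16, mask_ratio=0.75):
    """Return (total, visible, masked) patch counts for a square image.

    Hypothetical helper for illustration only; it mirrors the config defaults.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    num_patches = per_side * per_side
    # Assumption: the visible count is truncated toward zero.
    num_visible = int(num_patches * (1 - mask_ratio))
    return num_patches, num_visible, num_patches - num_visible


# Default (base) settings: 16x16 patches on a 224x224 image.
base = patch_counts()
# The conversion script uses patch_size=14 for the "huge" checkpoint.
huge = patch_counts(patch_size=14)
print(base, huge)
```

With the defaults, only a quarter of the patches reach the encoder, which is why the decoder needs `ids_restore` to put mask tokens back in their original positions.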
178
src/transformers/models/vit_mae/convert_vit_mae_to_pytorch.py
Normal file
@@ -0,0 +1,178 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2022 The HuggingFace Inc. team.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
"""Convert ViT MAE checkpoints from the original repository: https://github.com/facebookresearch/mae"""
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
|
||||||
|
import torch
|
||||||
|
from PIL import Image
|
||||||
|
|
||||||
|
import requests
|
||||||
|
from transformers import ViTMAEConfig, ViTMAEFeatureExtractor, ViTMAEForPreTraining
|
||||||
|
|
||||||
|
|
||||||
|
def rename_key(name):
|
||||||
|
if "cls_token" in name:
|
||||||
|
name = name.replace("cls_token", "vit.embeddings.cls_token")
|
||||||
|
if "mask_token" in name:
|
||||||
|
name = name.replace("mask_token", "decoder.mask_token")
|
||||||
|
if "decoder_pos_embed" in name:
|
||||||
|
name = name.replace("decoder_pos_embed", "decoder.decoder_pos_embed")
|
||||||
|
if "pos_embed" in name and "decoder" not in name:
|
||||||
|
name = name.replace("pos_embed", "vit.embeddings.position_embeddings")
|
||||||
|
if "patch_embed.proj" in name:
|
||||||
|
name = name.replace("patch_embed.proj", "vit.embeddings.patch_embeddings.projection")
|
||||||
|
if "patch_embed.norm" in name:
|
||||||
|
name = name.replace("patch_embed.norm", "vit.embeddings.norm")
|
||||||
|
if "decoder_blocks" in name:
|
||||||
|
name = name.replace("decoder_blocks", "decoder.decoder_layers")
|
||||||
|
if "blocks" in name:
|
||||||
|
name = name.replace("blocks", "vit.encoder.layer")
|
||||||
|
if "attn.proj" in name:
|
||||||
|
name = name.replace("attn.proj", "attention.output.dense")
|
||||||
|
if "attn" in name:
|
||||||
|
name = name.replace("attn", "attention.self")
|
||||||
|
if "norm1" in name:
|
||||||
|
name = name.replace("norm1", "layernorm_before")
|
||||||
|
if "norm2" in name:
|
||||||
|
name = name.replace("norm2", "layernorm_after")
|
||||||
|
if "mlp.fc1" in name:
|
||||||
|
name = name.replace("mlp.fc1", "intermediate.dense")
|
||||||
|
if "mlp.fc2" in name:
|
||||||
|
name = name.replace("mlp.fc2", "output.dense")
|
||||||
|
if "decoder_embed" in name:
|
||||||
|
name = name.replace("decoder_embed", "decoder.decoder_embed")
|
||||||
|
if "decoder_norm" in name:
|
||||||
|
name = name.replace("decoder_norm", "decoder.decoder_norm")
|
||||||
|
if "decoder_pred" in name:
|
||||||
|
name = name.replace("decoder_pred", "decoder.decoder_pred")
|
||||||
|
if "norm.weight" in name and "decoder" not in name:
|
||||||
|
name = name.replace("norm.weight", "vit.layernorm.weight")
|
||||||
|
if "norm.bias" in name and "decoder" not in name:
|
||||||
|
name = name.replace("norm.bias", "vit.layernorm.bias")
|
||||||
|
|
||||||
|
return name
|
||||||
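The substring checks in `rename_key` depend on their order and on the `"decoder" not in name` guards, because some original names contain others (for example `pos_embed` inside `decoder_pos_embed`). A minimal sketch of that guard, using a hypothetical two-rule helper `rename_pos_embed` that is not part of the commit:

```python
def rename_pos_embed(name, guard=True):
    """Apply only the two position-embedding rules from rename_key.

    Hypothetical reduction for illustration; `guard` toggles the
    '"decoder" not in name' check so its effect can be observed.
    """
    if "decoder_pos_embed" in name:
        name = name.replace("decoder_pos_embed", "decoder.decoder_pos_embed")
    if "pos_embed" in name and (not guard or "decoder" not in name):
        name = name.replace("pos_embed", "vit.embeddings.position_embeddings")
    return name


encoder_key = rename_pos_embed("pos_embed")
decoder_key = rename_pos_embed("decoder_pos_embed")
# Without the guard the decoder key is rewritten a second time.
broken_key = rename_pos_embed("decoder_pos_embed", guard=False)
print(encoder_key, decoder_key, broken_key)
```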
|
|
||||||
|
|
||||||
|
def convert_state_dict(orig_state_dict, config):
|
||||||
|
for key in orig_state_dict.copy().keys():
|
||||||
|
val = orig_state_dict.pop(key)
|
||||||
|
|
||||||
|
if "qkv" in key:
|
||||||
|
key_split = key.split(".")
|
||||||
|
layer_num = int(key_split[1])
|
||||||
|
if "decoder_blocks" in key:
|
||||||
|
dim = config.decoder_hidden_size
|
||||||
|
prefix = "decoder.decoder_layers."
|
||||||
|
if "weight" in key:
|
||||||
|
orig_state_dict[f"{prefix}{layer_num}.attention.attention.query.weight"] = val[:dim, :]
|
||||||
|
orig_state_dict[f"{prefix}{layer_num}.attention.attention.key.weight"] = val[dim : dim * 2, :]
|
||||||
|
orig_state_dict[f"{prefix}{layer_num}.attention.attention.value.weight"] = val[-dim:, :]
|
||||||
|
elif "bias" in key:
|
||||||
|
orig_state_dict[f"{prefix}{layer_num}.attention.attention.query.bias"] = val[:dim]
|
||||||
|
orig_state_dict[f"{prefix}{layer_num}.attention.attention.key.bias"] = val[dim : dim * 2]
|
||||||
|
orig_state_dict[f"{prefix}{layer_num}.attention.attention.value.bias"] = val[-dim:]
|
||||||
|
else:
|
||||||
|
dim = config.hidden_size
|
||||||
|
prefix = "vit.encoder.layer."
|
||||||
|
if "weight" in key:
|
||||||
|
orig_state_dict[f"{prefix}{layer_num}.attention.attention.query.weight"] = val[:dim, :]
|
||||||
|
orig_state_dict[f"{prefix}{layer_num}.attention.attention.key.weight"] = val[dim : dim * 2, :]
|
||||||
|
orig_state_dict[f"{prefix}{layer_num}.attention.attention.value.weight"] = val[-dim:, :]
|
||||||
|
elif "bias" in key:
|
||||||
|
orig_state_dict[f"{prefix}{layer_num}.attention.attention.query.bias"] = val[:dim]
|
||||||
|
orig_state_dict[f"{prefix}{layer_num}.attention.attention.key.bias"] = val[dim : dim * 2]
|
||||||
|
orig_state_dict[f"{prefix}{layer_num}.attention.attention.value.bias"] = val[-dim:]
|
||||||
|
|
||||||
|
else:
|
||||||
|
orig_state_dict[rename_key(key)] = val
|
||||||
|
|
||||||
|
return orig_state_dict
|
||||||
|
|
||||||
|
|
||||||
|
def convert_vit_mae_checkpoint(checkpoint_url, pytorch_dump_folder_path):
|
||||||
|
config = ViTMAEConfig()
|
||||||
|
if "large" in checkpoint_url:
|
||||||
|
config.hidden_size = 1024
|
||||||
|
config.intermediate_size = 4096
|
||||||
|
config.num_hidden_layers = 24
|
||||||
|
config.num_attention_heads = 16
|
||||||
|
elif "huge" in checkpoint_url:
|
||||||
|
config.patch_size = 14
|
||||||
|
config.hidden_size = 1280
|
||||||
|
config.intermediate_size = 5120
|
||||||
|
config.num_hidden_layers = 32
|
||||||
|
config.num_attention_heads = 16
|
||||||
|
|
||||||
|
model = ViTMAEForPreTraining(config)
|
||||||
|
|
||||||
|
state_dict = torch.hub.load_state_dict_from_url(checkpoint_url, map_location="cpu")["model"]
|
||||||
|
|
||||||
|
feature_extractor = ViTMAEFeatureExtractor(size=config.image_size)
|
||||||
|
|
||||||
|
new_state_dict = convert_state_dict(state_dict, config)
|
||||||
|
|
||||||
|
model.load_state_dict(new_state_dict)
|
||||||
|
model.eval()
|
||||||
|
|
||||||
|
url = "https://user-images.githubusercontent.com/11435359/147738734-196fd92f-9260-48d5-ba7e-bf103d29364d.jpg"
|
||||||
|
|
||||||
|
image = Image.open(requests.get(url, stream=True).raw)
|
||||||
|
feature_extractor = ViTMAEFeatureExtractor(size=config.image_size)
|
||||||
|
inputs = feature_extractor(images=image, return_tensors="pt")
|
||||||
|
|
||||||
|
# forward pass
|
||||||
|
torch.manual_seed(2)
|
||||||
|
outputs = model(**inputs)
|
||||||
|
logits = outputs.logits
|
||||||
|
|
||||||
|
if "large" in checkpoint_url:
|
||||||
|
expected_slice = torch.tensor(
|
||||||
|
[[-0.7309, -0.7128, -1.0169], [-1.0161, -0.9058, -1.1878], [-1.0478, -0.9411, -1.1911]]
|
||||||
|
)
|
||||||
|
elif "huge" in checkpoint_url:
|
||||||
|
expected_slice = torch.tensor(
|
||||||
|
[[-1.1599, -0.9199, -1.2221], [-1.1952, -0.9269, -1.2307], [-1.2143, -0.9337, -1.2262]]
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
expected_slice = torch.tensor(
|
||||||
|
[[-0.9192, -0.8481, -1.1259], [-1.1349, -1.0034, -1.2599], [-1.1757, -1.0429, -1.2726]]
|
||||||
|
)
|
||||||
|
|
||||||
|
# verify logits
|
||||||
|
assert torch.allclose(logits[0, :3, :3], expected_slice, atol=1e-4)
|
||||||
|
|
||||||
|
print(f"Saving model to {pytorch_dump_folder_path}")
|
||||||
|
model.save_pretrained(pytorch_dump_folder_path)
|
||||||
|
|
||||||
|
print(f"Saving feature extractor to {pytorch_dump_folder_path}")
|
||||||
|
feature_extractor.save_pretrained(pytorch_dump_folder_path)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
parser = argparse.ArgumentParser()
|
||||||
|
# Required parameters
|
||||||
|
parser.add_argument(
|
||||||
|
"--checkpoint_url",
|
||||||
|
default="https://dl.fbaipublicfiles.com/mae/visualize/mae_visualize_vit_base.pth",
|
||||||
|
type=str,
|
||||||
|
help="URL of the checkpoint you'd like to convert.",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model directory."
|
||||||
|
)
|
||||||
|
|
||||||
|
args = parser.parse_args()
|
||||||
|
convert_vit_mae_checkpoint(args.checkpoint_url, args.pytorch_dump_folder_path)
|
||||||
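`convert_state_dict` splits each fused `qkv` tensor of the original checkpoint into separate query, key and value tensors by slicing along the first dimension. The sketch below reproduces only that slicing on plain Python lists so it runs without PyTorch; `split_qkv` is a hypothetical name, and the toy matrix stands in for a real weight of shape `(3 * dim, dim)`.

```python
def split_qkv(fused_rows, dim):
    """Split a fused list of 3 * dim rows into query, key and value blocks."""
    if len(fused_rows) != 3 * dim:
        raise ValueError("expected exactly 3 * dim rows")
    query = fused_rows[:dim]          # first dim rows
    key = fused_rows[dim : dim * 2]   # middle dim rows
    value = fused_rows[-dim:]         # last dim rows
    return query, key, value


dim = 2
# Toy fused weight: row r holds the values r*10 .. r*10 + dim - 1.
fused = [[r * 10 + c for c in range(dim)] for r in range(3 * dim)]
q, k, v = split_qkv(fused, dim)
print(q, k, v)
```

The same slices apply to the bias, which is one-dimensional, so the script indexes it with `val[:dim]`, `val[dim : dim * 2]` and `val[-dim:]`.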
956
src/transformers/models/vit_mae/modeling_vit_mae.py
Executable file
@@ -0,0 +1,956 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2022 Facebook AI and The HuggingFace Inc. team. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
""" PyTorch ViT MAE (masked autoencoder) model."""
|
||||||
|
|
||||||
|
|
||||||
|
import collections.abc
|
||||||
|
import math
|
||||||
|
from copy import deepcopy
|
||||||
|
from dataclasses import dataclass
|
||||||
|
from typing import Optional, Tuple
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
import torch
|
||||||
|
import torch.utils.checkpoint
|
||||||
|
from torch import nn
|
||||||
|
|
||||||
|
from ...activations import ACT2FN
|
||||||
|
from ...file_utils import (
|
||||||
|
ModelOutput,
|
||||||
|
add_start_docstrings,
|
||||||
|
add_start_docstrings_to_model_forward,
|
||||||
|
replace_return_docstrings,
|
||||||
|
)
|
||||||
|
from ...modeling_outputs import BaseModelOutput
|
||||||
|
from ...modeling_utils import PreTrainedModel, find_pruneable_heads_and_indices, prune_linear_layer
|
||||||
|
from ...utils import logging
|
||||||
|
from .configuration_vit_mae import ViTMAEConfig
|
||||||
|
|
||||||
|
|
||||||
|
logger = logging.get_logger(__name__)
|
||||||
|
|
||||||
|
_CONFIG_FOR_DOC = "ViTMAEConfig"
|
||||||
|
_CHECKPOINT_FOR_DOC = "facebook/vit-mae-base"
|
||||||
|
|
||||||
|
VIT_MAE_PRETRAINED_MODEL_ARCHIVE_LIST = [
|
||||||
|
"facebook/vit-mae-base",
|
||||||
|
# See all ViTMAE models at https://huggingface.co/models?filter=vit_mae
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class ViTMAEModelOutput(ModelOutput):
|
||||||
|
"""
|
||||||
|
Class for ViTMAEModel's outputs, with potential hidden states and attentions.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
|
||||||
|
Sequence of hidden-states at the output of the last layer of the model.
|
||||||
|
mask (`torch.FloatTensor` of shape `(batch_size, sequence_length)`):
|
||||||
|
Tensor indicating which patches are masked (1) and which are not (0).
|
||||||
|
ids_restore (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
|
||||||
|
Tensor containing the original index of the (shuffled) masked patches.
|
||||||
|
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
|
||||||
|
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
|
||||||
|
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer
|
||||||
|
plus the initial embedding outputs.
|
||||||
|
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
|
||||||
|
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
|
||||||
|
sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in
|
||||||
|
the self-attention heads.
|
||||||
|
"""
|
||||||
|
|
||||||
|
last_hidden_state: torch.FloatTensor = None
|
||||||
|
mask: torch.LongTensor = None
|
||||||
|
ids_restore: torch.LongTensor = None
|
||||||
|
hidden_states: Optional[Tuple[torch.FloatTensor]] = None
|
||||||
|
attentions: Optional[Tuple[torch.FloatTensor]] = None
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class ViTMAEDecoderOutput(ModelOutput):
|
||||||
|
"""
|
||||||
|
Class for ViTMAEDecoder's outputs, with potential hidden states and attentions.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, patch_size ** 2 * num_channels)`):
|
||||||
|
Pixel reconstruction logits.
|
||||||
|
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
|
||||||
|
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
|
||||||
|
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer
|
||||||
|
plus the initial embedding outputs.
|
||||||
|
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
|
||||||
|
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
|
||||||
|
sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in
|
||||||
|
the self-attention heads.
|
||||||
|
"""
|
||||||
|
|
||||||
|
logits: torch.FloatTensor = None
|
||||||
|
hidden_states: Optional[Tuple[torch.FloatTensor]] = None
|
||||||
|
attentions: Optional[Tuple[torch.FloatTensor]] = None
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class ViTMAEForPreTrainingOutput(ModelOutput):
|
||||||
|
"""
|
||||||
|
Class for ViTMAEForPreTraining's outputs, with potential hidden states and attentions.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
loss (`torch.FloatTensor` of shape `(1,)`):
|
||||||
|
Pixel reconstruction loss.
|
||||||
|
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, patch_size ** 2 * num_channels)`):
|
||||||
|
Pixel reconstruction logits.
|
||||||
|
mask (`torch.FloatTensor` of shape `(batch_size, sequence_length)`):
|
||||||
|
Tensor indicating which patches are masked (1) and which are not (0).
|
||||||
|
ids_restore (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
|
||||||
|
Tensor containing the original index of the (shuffled) masked patches.
|
||||||
|
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
|
||||||
|
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
|
||||||
|
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer
|
||||||
|
plus the initial embedding outputs.
|
||||||
|
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
|
||||||
|
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
|
||||||
|
sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in
|
||||||
|
the self-attention heads.
|
||||||
|
"""
|
||||||
|
|
||||||
|
loss: Optional[torch.FloatTensor] = None
|
||||||
|
logits: torch.FloatTensor = None
|
||||||
|
mask: torch.LongTensor = None
|
||||||
|
ids_restore: torch.LongTensor = None
|
||||||
|
hidden_states: Optional[Tuple[torch.FloatTensor]] = None
|
||||||
|
attentions: Optional[Tuple[torch.FloatTensor]] = None
|
||||||
|
|
||||||
|
|
||||||
|
# Inspired by
|
||||||
|
# https://github.com/rwightman/pytorch-image-models/blob/b9bd960a032c75ca6b808ddeed76bee5f3ed4972/timm/models/layers/helpers.py
|
||||||
|
# From PyTorch internals
|
||||||
|
def to_2tuple(x):
|
||||||
|
if isinstance(x, collections.abc.Iterable):
|
||||||
|
return x
|
||||||
|
return (x, x)
|
||||||
|
|
||||||
|
|
||||||
|
def get_2d_sincos_pos_embed(embed_dim, grid_size, add_cls_token=False):
|
||||||
|
"""
|
||||||
|
Create 2D sin/cos positional embeddings.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
embed_dim (`int`):
|
||||||
|
Embedding dimension.
|
||||||
|
grid_size (`int`):
|
||||||
|
The grid height and width.
|
||||||
|
add_cls_token (`bool`, *optional*, defaults to `False`):
|
||||||
|
Whether or not to add a classification (CLS) token.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
`np.ndarray` of shape `(grid_size*grid_size, embed_dim)` or `(1+grid_size*grid_size, embed_dim)`: the
|
||||||
|
position embeddings (with or without classification token)
|
||||||
|
"""
|
||||||
|
grid_h = np.arange(grid_size, dtype=np.float32)
|
||||||
|
grid_w = np.arange(grid_size, dtype=np.float32)
|
||||||
|
grid = np.meshgrid(grid_w, grid_h) # here w goes first
|
||||||
|
grid = np.stack(grid, axis=0)
|
||||||
|
|
||||||
|
grid = grid.reshape([2, 1, grid_size, grid_size])
|
||||||
|
pos_embed = get_2d_sincos_pos_embed_from_grid(embed_dim, grid)
|
||||||
|
if add_cls_token:
|
||||||
|
pos_embed = np.concatenate([np.zeros([1, embed_dim]), pos_embed], axis=0)
|
||||||
|
return pos_embed
|
||||||
|
|
||||||
|
|
||||||
|
def get_2d_sincos_pos_embed_from_grid(embed_dim, grid):
|
||||||
|
if embed_dim % 2 != 0:
|
||||||
|
raise ValueError("embed_dim must be even")
|
||||||
|
|
||||||
|
# use half of dimensions to encode grid_h
|
||||||
|
emb_h = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[0]) # (H*W, D/2)
|
||||||
|
emb_w = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[1]) # (H*W, D/2)
|
||||||
|
|
||||||
|
emb = np.concatenate([emb_h, emb_w], axis=1) # (H*W, D)
|
||||||
|
return emb
|
||||||
|
|
||||||
|
|
||||||
|
def get_1d_sincos_pos_embed_from_grid(embed_dim, pos):
|
||||||
|
"""
|
||||||
|
embed_dim: output dimension for each position; pos: array of positions to be encoded, of size (M,); out: (M, D)
|
||||||
|
"""
|
||||||
|
if embed_dim % 2 != 0:
|
||||||
|
raise ValueError("embed_dim must be even")
|
||||||
|
|
||||||
|
omega = np.arange(embed_dim // 2, dtype=float)
|
||||||
|
omega /= embed_dim / 2.0
|
||||||
|
omega = 1.0 / 10000 ** omega # (D/2,)
|
||||||
|
|
||||||
|
pos = pos.reshape(-1) # (M,)
|
||||||
|
out = np.einsum("m,d->md", pos, omega) # (M, D/2), outer product
|
||||||
|
|
||||||
|
emb_sin = np.sin(out) # (M, D/2)
|
||||||
|
emb_cos = np.cos(out) # (M, D/2)
|
||||||
|
|
||||||
|
emb = np.concatenate([emb_sin, emb_cos], axis=1) # (M, D)
|
||||||
|
return emb
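Review note: a minimal standalone sketch of the 1D sin-cos embedding computed above, so the frequency layout is easier to check by hand. The name `sincos_1d` and the toy inputs are illustrative assumptions, not part of this PR.

```python
import numpy as np


def sincos_1d(embed_dim, positions):
    """Illustrative re-implementation: returns an (M, embed_dim) array."""
    if embed_dim % 2 != 0:
        raise ValueError("embed_dim must be even")
    # Frequencies 1 / 10000^(i / (embed_dim / 2)) for i in [0, embed_dim / 2)
    omega = np.arange(embed_dim // 2, dtype=np.float64)
    omega /= embed_dim / 2.0
    omega = 1.0 / 10000**omega
    positions = np.asarray(positions, dtype=np.float64).reshape(-1)
    angles = np.einsum("m,d->md", positions, omega)  # outer product, (M, D/2)
    # First half of each row holds the sines, second half the cosines
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)


emb = sincos_1d(8, [0, 1, 2])
print(emb.shape)
```

Position 0 maps to zeros in the sine half and ones in the cosine half, which is a quick sanity check when debugging the 2D variant that concatenates two of these.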
|
||||||
|
|
||||||
|
|
||||||
|
class ViTMAEEmbeddings(nn.Module):
|
||||||
|
"""
|
||||||
|
Construct the CLS token, position and patch embeddings.
|
||||||
|
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, config):
|
||||||
|
super().__init__()
|
||||||
|
|
||||||
|
self.cls_token = nn.Parameter(torch.zeros(1, 1, config.hidden_size))
|
||||||
|
self.patch_embeddings = PatchEmbeddings(
|
||||||
|
image_size=config.image_size,
|
||||||
|
patch_size=config.patch_size,
|
||||||
|
num_channels=config.num_channels,
|
||||||
|
embed_dim=config.hidden_size,
|
||||||
|
)
|
||||||
|
self.num_patches = self.patch_embeddings.num_patches
|
||||||
|
# fixed sin-cos embedding
|
||||||
|
self.position_embeddings = nn.Parameter(
|
||||||
|
torch.zeros(1, self.num_patches + 1, config.hidden_size), requires_grad=False
|
||||||
|
)
|
||||||
|
self.config = config
|
||||||
|
self.initialize_weights()
|
||||||
|
|
||||||
|
def initialize_weights(self):
|
||||||
|
# initialize (and freeze) position embeddings by sin-cos embedding
|
||||||
|
pos_embed = get_2d_sincos_pos_embed(
|
||||||
|
self.position_embeddings.shape[-1], int(self.patch_embeddings.num_patches ** 0.5), add_cls_token=True
|
||||||
|
)
|
||||||
|
self.position_embeddings.data.copy_(torch.from_numpy(pos_embed).float().unsqueeze(0))
|
||||||
|
|
||||||
|
# initialize patch_embeddings like nn.Linear (instead of nn.Conv2d)
|
||||||
|
w = self.patch_embeddings.projection.weight.data
|
||||||
|
torch.nn.init.xavier_uniform_(w.view([w.shape[0], -1]))
|
||||||
|
|
||||||
|
# timm's trunc_normal_(std=.02) is effectively normal_(std=0.02) as cutoff is too big (2.)
|
||||||
|
torch.nn.init.normal_(self.cls_token, std=self.config.initializer_range)
|
||||||
|
|
||||||
|
def random_masking(self, sequence):
|
||||||
|
"""
|
||||||
|
Perform per-sample random masking by per-sample shuffling. Per-sample shuffling is done by argsort random
|
||||||
|
noise.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
sequence (`torch.FloatTensor` of shape `(batch_size, sequence_length, dim)`)
|
||||||
|
"""
|
||||||
|
batch_size, seq_length, dim = sequence.shape
|
||||||
|
len_keep = int(seq_length * (1 - self.config.mask_ratio))
|
||||||
|
|
||||||
|
noise = torch.rand(batch_size, seq_length, device=sequence.device) # noise in [0, 1]
|
||||||
|
|
||||||
|
# sort noise for each sample
|
||||||
|
ids_shuffle = torch.argsort(noise, dim=1) # ascend: small is keep, large is remove
|
||||||
|
ids_restore = torch.argsort(ids_shuffle, dim=1)
|
||||||
|
|
||||||
|
# keep the first subset
|
||||||
|
ids_keep = ids_shuffle[:, :len_keep]
|
||||||
|
sequence_masked = torch.gather(sequence, dim=1, index=ids_keep.unsqueeze(-1).repeat(1, 1, dim))
|
||||||
|
|
||||||
|
# generate the binary mask: 0 is keep, 1 is remove
|
||||||
|
mask = torch.ones([batch_size, seq_length], device=sequence.device)
|
||||||
|
mask[:, :len_keep] = 0
|
||||||
|
# unshuffle to get the binary mask
|
||||||
|
mask = torch.gather(mask, dim=1, index=ids_restore)
|
||||||
|
|
||||||
|
return sequence_masked, mask, ids_restore
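Review note: the argsort-based masking above can be hard to follow from the tensor ops alone. Below is a small NumPy sketch of the same idea, assuming a hypothetical free function `random_masking` that takes the mask ratio and a random generator explicitly (the method in this PR reads the ratio from `config` and uses `torch.rand`).

```python
import numpy as np


def random_masking(sequence, mask_ratio, rng):
    """NumPy sketch of argsort-based masking; `rng` is a np.random.Generator."""
    batch_size, seq_length, dim = sequence.shape
    len_keep = int(seq_length * (1 - mask_ratio))

    noise = rng.random((batch_size, seq_length))  # uniform noise in [0, 1)
    ids_shuffle = np.argsort(noise, axis=1)  # smallest noise values are kept
    ids_restore = np.argsort(ids_shuffle, axis=1)  # inverse permutation

    ids_keep = ids_shuffle[:, :len_keep]
    kept = np.take_along_axis(sequence, ids_keep[:, :, None], axis=1)

    # Binary mask in the original patch order: 0 = kept, 1 = removed
    mask = np.ones((batch_size, seq_length))
    mask[:, :len_keep] = 0
    mask = np.take_along_axis(mask, ids_restore, axis=1)
    return kept, mask, ids_restore


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 4))
kept, mask, ids_restore = random_masking(x, mask_ratio=0.75, rng=rng)
print(kept.shape, mask.sum(axis=1))
```

With 8 patches and a ratio of 0.75, each sample keeps 2 patches and the mask marks the remaining 6 as removed. `ids_restore` is what the decoder later uses to put mask tokens back into their original positions.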
|
||||||
|
|
||||||
|
def forward(self, pixel_values):
|
||||||
|
batch_size, num_channels, height, width = pixel_values.shape
|
||||||
|
embeddings = self.patch_embeddings(pixel_values)
|
||||||
|
|
||||||
|
# add position embeddings w/o cls token
|
||||||
|
embeddings = embeddings + self.position_embeddings[:, 1:, :]
|
||||||
|
|
||||||
|
# masking: length -> length * (1 - config.mask_ratio)
|
||||||
|
embeddings, mask, ids_restore = self.random_masking(embeddings)
|
||||||
|
|
||||||
|
# prepend cls token
|
||||||
|
cls_token = self.cls_token + self.position_embeddings[:, :1, :]
|
||||||
|
cls_tokens = cls_token.expand(embeddings.shape[0], -1, -1)
|
||||||
|
embeddings = torch.cat((cls_tokens, embeddings), dim=1)
|
||||||
|
|
||||||
|
return embeddings, mask, ids_restore
|
||||||
|
|
||||||
|
|
||||||
|
# Based on timm implementation, which can be found here:
|
||||||
|
# https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/vision_transformer.py
|
||||||
|
class PatchEmbeddings(nn.Module):
|
||||||
|
"""
|
||||||
|
Image to Patch Embedding.
|
||||||
|
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, image_size=224, patch_size=16, num_channels=3, embed_dim=768):
|
||||||
|
super().__init__()
|
||||||
|
image_size = to_2tuple(image_size)
|
||||||
|
patch_size = to_2tuple(patch_size)
|
||||||
|
num_patches = (image_size[1] // patch_size[1]) * (image_size[0] // patch_size[0])
|
||||||
|
self.image_size = image_size
|
||||||
|
self.patch_size = patch_size
|
||||||
|
self.num_patches = num_patches
|
||||||
|
|
||||||
|
self.projection = nn.Conv2d(num_channels, embed_dim, kernel_size=patch_size, stride=patch_size)
|
||||||
|
|
||||||
|
def forward(self, pixel_values):
|
||||||
|
batch_size, num_channels, height, width = pixel_values.shape
|
||||||
|
if height != self.image_size[0] or width != self.image_size[1]:
|
||||||
|
raise ValueError(
|
||||||
|
f"Input image size ({height}*{width}) doesn't match model ({self.image_size[0]}*{self.image_size[1]})."
|
||||||
|
)
|
||||||
|
x = self.projection(pixel_values).flatten(2).transpose(1, 2)
|
||||||
|
return x
|
||||||
|
|
||||||
|
|
||||||
|
class ViTMAESelfAttention(nn.Module):
|
||||||
|
def __init__(self, config):
|
||||||
|
super().__init__()
|
||||||
|
if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
|
||||||
|
raise ValueError(
|
||||||
|
f"The hidden size {config.hidden_size} is not a multiple of the number of attention "
|
||||||
|
f"heads {config.num_attention_heads}."
|
||||||
|
)
|
||||||
|
|
||||||
|
self.num_attention_heads = config.num_attention_heads
|
||||||
|
self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
|
||||||
|
self.all_head_size = self.num_attention_heads * self.attention_head_size
|
||||||
|
|
||||||
|
self.query = nn.Linear(config.hidden_size, self.all_head_size, bias=config.qkv_bias)
|
||||||
|
self.key = nn.Linear(config.hidden_size, self.all_head_size, bias=config.qkv_bias)
|
||||||
|
self.value = nn.Linear(config.hidden_size, self.all_head_size, bias=config.qkv_bias)
|
||||||
|
|
||||||
|
self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
|
||||||
|
|
||||||
|
def transpose_for_scores(self, x):
|
||||||
|
new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
|
||||||
|
x = x.view(*new_x_shape)
|
||||||
|
return x.permute(0, 2, 1, 3)
|
||||||
|
|
||||||
|
def forward(self, hidden_states, head_mask=None, output_attentions=False):
|
||||||
|
mixed_query_layer = self.query(hidden_states)
|
||||||
|
|
||||||
|
key_layer = self.transpose_for_scores(self.key(hidden_states))
|
||||||
|
value_layer = self.transpose_for_scores(self.value(hidden_states))
|
||||||
|
query_layer = self.transpose_for_scores(mixed_query_layer)
|
||||||
|
|
||||||
|
# Take the dot product between "query" and "key" to get the raw attention scores.
|
||||||
|
attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
|
||||||
|
|
||||||
|
attention_scores = attention_scores / math.sqrt(self.attention_head_size)
|
||||||
|
|
||||||
|
# Normalize the attention scores to probabilities.
|
||||||
|
attention_probs = nn.functional.softmax(attention_scores, dim=-1)
|
||||||
|
|
||||||
|
# This is actually dropping out entire tokens to attend to, which might
|
||||||
|
# seem a bit unusual, but is taken from the original Transformer paper.
|
||||||
|
attention_probs = self.dropout(attention_probs)
|
||||||
|
|
||||||
|
# Mask heads if we want to
|
||||||
|
if head_mask is not None:
|
||||||
|
attention_probs = attention_probs * head_mask
|
||||||
|
|
||||||
|
context_layer = torch.matmul(attention_probs, value_layer)
|
||||||
|
|
||||||
|
context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
|
||||||
|
new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
|
||||||
|
context_layer = context_layer.view(*new_context_layer_shape)
|
||||||
|
|
||||||
|
outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)
|
||||||
|
|
||||||
|
return outputs
|
||||||
|
|
||||||
|
|
||||||
|
class ViTMAESelfOutput(nn.Module):
|
||||||
|
"""
|
||||||
|
The residual connection is defined in ViTMAELayer instead of here (as is the case with other models), due to the
|
||||||
|
layernorm applied before each block.
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, config):
|
||||||
|
super().__init__()
|
||||||
|
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
|
||||||
|
self.dropout = nn.Dropout(config.hidden_dropout_prob)
|
||||||
|
|
||||||
|
def forward(self, hidden_states, input_tensor):
|
||||||
|
|
||||||
|
hidden_states = self.dense(hidden_states)
|
||||||
|
hidden_states = self.dropout(hidden_states)
|
||||||
|
|
||||||
|
return hidden_states
|
||||||
|
|
||||||
|
|
||||||
|
class ViTMAEAttention(nn.Module):
|
||||||
|
def __init__(self, config):
|
||||||
|
super().__init__()
|
||||||
|
self.attention = ViTMAESelfAttention(config)
|
||||||
|
self.output = ViTMAESelfOutput(config)
|
||||||
|
self.pruned_heads = set()
|
||||||
|
|
||||||
|
def prune_heads(self, heads):
|
||||||
|
if len(heads) == 0:
|
||||||
|
return
|
||||||
|
heads, index = find_pruneable_heads_and_indices(
|
||||||
|
heads, self.attention.num_attention_heads, self.attention.attention_head_size, self.pruned_heads
|
||||||
|
)
|
||||||
|
|
||||||
|
# Prune linear layers
|
||||||
|
self.attention.query = prune_linear_layer(self.attention.query, index)
|
||||||
|
self.attention.key = prune_linear_layer(self.attention.key, index)
|
||||||
|
self.attention.value = prune_linear_layer(self.attention.value, index)
|
||||||
|
self.output.dense = prune_linear_layer(self.output.dense, index, dim=1)
|
||||||
|
|
||||||
|
# Update hyper params and store pruned heads
|
||||||
|
self.attention.num_attention_heads = self.attention.num_attention_heads - len(heads)
|
||||||
|
self.attention.all_head_size = self.attention.attention_head_size * self.attention.num_attention_heads
|
||||||
|
self.pruned_heads = self.pruned_heads.union(heads)
|
||||||
|
|
||||||
|
def forward(self, hidden_states, head_mask=None, output_attentions=False):
|
||||||
|
self_outputs = self.attention(hidden_states, head_mask, output_attentions)
|
||||||
|
|
||||||
|
attention_output = self.output(self_outputs[0], hidden_states)
|
||||||
|
|
||||||
|
outputs = (attention_output,) + self_outputs[1:] # add attentions if we output them
|
||||||
|
return outputs
|
||||||
|
|
||||||
|
|
||||||
|
class ViTMAEIntermediate(nn.Module):
|
||||||
|
def __init__(self, config):
|
||||||
|
super().__init__()
|
||||||
|
self.dense = nn.Linear(config.hidden_size, config.intermediate_size)
|
||||||
|
if isinstance(config.hidden_act, str):
|
||||||
|
self.intermediate_act_fn = ACT2FN[config.hidden_act]
|
||||||
|
else:
|
||||||
|
self.intermediate_act_fn = config.hidden_act
|
||||||
|
|
||||||
|
def forward(self, hidden_states):
|
||||||
|
|
||||||
|
hidden_states = self.dense(hidden_states)
|
||||||
|
hidden_states = self.intermediate_act_fn(hidden_states)
|
||||||
|
|
||||||
|
return hidden_states
|
||||||
|
|
||||||
|
|
||||||
|
class ViTMAEOutput(nn.Module):
|
||||||
|
def __init__(self, config):
|
||||||
|
super().__init__()
|
||||||
|
self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
|
||||||
|
self.dropout = nn.Dropout(config.hidden_dropout_prob)
|
||||||
|
|
||||||
|
def forward(self, hidden_states, input_tensor):
|
||||||
|
hidden_states = self.dense(hidden_states)
|
||||||
|
hidden_states = self.dropout(hidden_states)
|
||||||
|
|
||||||
|
hidden_states = hidden_states + input_tensor
|
||||||
|
|
||||||
|
return hidden_states
|
||||||
|
|
||||||
|
|
||||||
|
class ViTMAELayer(nn.Module):
|
||||||
|
"""This corresponds to the Block class in the timm implementation."""
|
||||||
|
|
||||||
|
def __init__(self, config):
|
||||||
|
super().__init__()
|
||||||
|
self.chunk_size_feed_forward = config.chunk_size_feed_forward
|
||||||
|
self.seq_len_dim = 1
|
||||||
|
self.attention = ViTMAEAttention(config)
|
||||||
|
self.intermediate = ViTMAEIntermediate(config)
|
||||||
|
self.output = ViTMAEOutput(config)
|
||||||
|
self.layernorm_before = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
|
||||||
|
self.layernorm_after = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
|
||||||
|
|
||||||
|
def forward(self, hidden_states, head_mask=None, output_attentions=False):
|
||||||
|
self_attention_outputs = self.attention(
|
||||||
|
self.layernorm_before(hidden_states), # in ViTMAE, layernorm is applied before self-attention
|
||||||
|
head_mask,
|
||||||
|
output_attentions=output_attentions,
|
||||||
|
)
|
||||||
|
attention_output = self_attention_outputs[0]
|
||||||
|
outputs = self_attention_outputs[1:] # add self attentions if we output attention weights
|
||||||
|
|
||||||
|
# first residual connection
|
||||||
|
hidden_states = attention_output + hidden_states
|
||||||
|
|
||||||
|
# in ViTMAE, layernorm is also applied after self-attention
|
||||||
|
layer_output = self.layernorm_after(hidden_states)
|
||||||
|
|
||||||
|
layer_output = self.intermediate(layer_output)
|
||||||
|
|
||||||
|
# second residual connection is done here
|
||||||
|
layer_output = self.output(layer_output, hidden_states)
|
||||||
|
|
||||||
|
outputs = (layer_output,) + outputs
|
||||||
|
|
||||||
|
return outputs
|
||||||
|
|
||||||
|
def feed_forward_chunk(self, attention_output):
|
||||||
|
intermediate_output = self.intermediate(attention_output)
|
||||||
|
layer_output = self.output(intermediate_output)
|
||||||
|
return layer_output
|
||||||
|
|
||||||
|
|
||||||
|
class ViTMAEEncoder(nn.Module):
|
||||||
|
def __init__(self, config):
|
||||||
|
super().__init__()
|
||||||
|
self.config = config
|
||||||
|
self.layer = nn.ModuleList([ViTMAELayer(config) for _ in range(config.num_hidden_layers)])
|
||||||
|
self.gradient_checkpointing = False
|
||||||
|
|
||||||
|
def forward(
|
||||||
|
self,
|
||||||
|
hidden_states,
|
||||||
|
head_mask=None,
|
||||||
|
output_attentions=False,
|
||||||
|
output_hidden_states=False,
|
||||||
|
return_dict=True,
|
||||||
|
):
|
||||||
|
all_hidden_states = () if output_hidden_states else None
|
||||||
|
all_self_attentions = () if output_attentions else None
|
||||||
|
|
||||||
|
for i, layer_module in enumerate(self.layer):
|
||||||
|
if output_hidden_states:
|
||||||
|
all_hidden_states = all_hidden_states + (hidden_states,)
|
||||||
|
|
||||||
|
layer_head_mask = head_mask[i] if head_mask is not None else None
|
||||||
|
|
||||||
|
if self.gradient_checkpointing and self.training:
|
||||||
|
|
||||||
|
def create_custom_forward(module):
|
||||||
|
def custom_forward(*inputs):
|
||||||
|
return module(*inputs, output_attentions)
|
||||||
|
|
||||||
|
return custom_forward
|
||||||
|
|
||||||
|
layer_outputs = torch.utils.checkpoint.checkpoint(
|
||||||
|
create_custom_forward(layer_module),
|
||||||
|
hidden_states,
|
||||||
|
layer_head_mask,
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
layer_outputs = layer_module(hidden_states, layer_head_mask, output_attentions)
|
||||||
|
|
||||||
|
hidden_states = layer_outputs[0]
|
||||||
|
|
||||||
|
if output_attentions:
|
||||||
|
all_self_attentions = all_self_attentions + (layer_outputs[1],)
|
||||||
|
|
||||||
|
if output_hidden_states:
|
||||||
|
all_hidden_states = all_hidden_states + (hidden_states,)
|
||||||
|
|
||||||
|
if not return_dict:
|
||||||
|
return tuple(v for v in [hidden_states, all_hidden_states, all_self_attentions] if v is not None)
|
||||||
|
return BaseModelOutput(
|
||||||
|
last_hidden_state=hidden_states,
|
||||||
|
hidden_states=all_hidden_states,
|
||||||
|
attentions=all_self_attentions,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
class ViTMAEPreTrainedModel(PreTrainedModel):
|
||||||
|
"""
|
||||||
|
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
|
||||||
|
models.
|
||||||
|
"""
|
||||||
|
|
||||||
|
config_class = ViTMAEConfig
|
||||||
|
base_model_prefix = "vit_mae"
|
||||||
|
main_input_name = "pixel_values"
|
||||||
|
supports_gradient_checkpointing = True
|
||||||
|
|
||||||
|
def _init_weights(self, module):
|
||||||
|
"""Initialize the weights"""
|
||||||
|
if isinstance(module, (nn.Linear, nn.Conv2d)):
|
||||||
|
# Slightly different from the TF version which uses truncated_normal for initialization
|
||||||
|
# cf https://github.com/pytorch/pytorch/pull/5617
|
||||||
|
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
|
||||||
|
if module.bias is not None:
|
||||||
|
module.bias.data.zero_()
|
||||||
|
elif isinstance(module, nn.LayerNorm):
|
||||||
|
module.bias.data.zero_()
|
||||||
|
module.weight.data.fill_(1.0)
|
||||||
|
|
||||||
|
def _set_gradient_checkpointing(self, module, value=False):
|
||||||
|
if isinstance(module, ViTMAEEncoder):
|
||||||
|
module.gradient_checkpointing = value
|
||||||
|
|
||||||
|
|
||||||
|
VIT_MAE_START_DOCSTRING = r"""
|
||||||
|
This model is a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it
|
||||||
|
as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and
|
||||||
|
behavior.
|
||||||
|
|
||||||
|
Parameters:
|
||||||
|
config ([`ViTMAEConfig`]): Model configuration class with all the parameters of the model.
|
||||||
|
Initializing with a config file does not load the weights associated with the model, only the
|
||||||
|
configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
|
||||||
|
"""
|
||||||
|
|
||||||
|
VIT_MAE_INPUTS_DOCSTRING = r"""
|
||||||
|
Args:
|
||||||
|
pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
|
||||||
|
Pixel values. Pixel values can be obtained using [`ViTFeatureExtractor`]. See
|
||||||
|
[`ViTFeatureExtractor.__call__`] for details.
|
||||||
|
|
||||||
|
head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
|
||||||
|
Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`:
|
||||||
|
|
||||||
|
- 1 indicates the head is **not masked**,
|
||||||
|
- 0 indicates the head is **masked**.
|
||||||
|
|
||||||
|
output_attentions (`bool`, *optional*):
|
||||||
|
Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
|
||||||
|
tensors for more detail.
|
||||||
|
output_hidden_states (`bool`, *optional*):
|
||||||
|
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
|
||||||
|
more detail.
|
||||||
|
return_dict (`bool`, *optional*):
|
||||||
|
Whether or not to return a [`~file_utils.ModelOutput`] instead of a plain tuple.
|
||||||
|
"""
|
||||||
|
|
||||||
|
|
||||||
|
@add_start_docstrings(
|
||||||
|
"The bare ViTMAE Model transformer outputting raw hidden-states without any specific head on top.",
|
||||||
|
VIT_MAE_START_DOCSTRING,
|
||||||
|
)
|
||||||
|
class ViTMAEModel(ViTMAEPreTrainedModel):
|
||||||
|
def __init__(self, config):
|
||||||
|
super().__init__(config)
|
||||||
|
self.config = config
|
||||||
|
|
||||||
|
self.embeddings = ViTMAEEmbeddings(config)
|
||||||
|
self.encoder = ViTMAEEncoder(config)
|
||||||
|
|
||||||
|
self.layernorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
|
||||||
|
|
||||||
|
# Initialize weights and apply final processing
|
||||||
|
self.post_init()
|
||||||
|
|
||||||
|
def get_input_embeddings(self):
|
||||||
|
return self.embeddings.patch_embeddings
|
||||||
|
|
||||||
|
def _prune_heads(self, heads_to_prune):
|
||||||
|
"""
|
||||||
|
Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base
|
||||||
|
class PreTrainedModel
|
||||||
|
"""
|
||||||
|
for layer, heads in heads_to_prune.items():
|
||||||
|
self.encoder.layer[layer].attention.prune_heads(heads)
|
||||||
|
|
||||||
|
@add_start_docstrings_to_model_forward(VIT_MAE_INPUTS_DOCSTRING)
|
||||||
|
@replace_return_docstrings(output_type=ViTMAEModelOutput, config_class=_CONFIG_FOR_DOC)
|
||||||
|
def forward(
|
||||||
|
self,
|
||||||
|
pixel_values=None,
|
||||||
|
head_mask=None,
|
||||||
|
output_attentions=None,
|
||||||
|
output_hidden_states=None,
|
||||||
|
return_dict=None,
|
||||||
|
):
|
||||||
|
r"""
|
||||||
|
Returns:
|
||||||
|
|
||||||
|
Examples:
|
||||||
|
|
||||||
|
```python
|
||||||
|
>>> from transformers import ViTFeatureExtractor, ViTMAEModel
|
||||||
|
>>> from PIL import Image
|
||||||
|
>>> import requests
|
||||||
|
|
||||||
|
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
|
||||||
|
>>> image = Image.open(requests.get(url, stream=True).raw)
|
||||||
|
|
||||||
|
>>> feature_extractor = ViTFeatureExtractor.from_pretrained("facebook/vit-mae-base")
|
||||||
|
>>> model = ViTMAEModel.from_pretrained("facebook/vit-mae-base")
|
||||||
|
|
||||||
|
>>> inputs = feature_extractor(images=image, return_tensors="pt")
|
||||||
|
>>> outputs = model(**inputs)
|
||||||
|
>>> last_hidden_states = outputs.last_hidden_state
|
||||||
|
```"""
|
||||||
|
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
|
||||||
|
output_hidden_states = (
|
||||||
|
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
|
||||||
|
)
|
||||||
|
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
|
||||||
|
|
||||||
|
if pixel_values is None:
|
||||||
|
raise ValueError("You have to specify pixel_values")
|
||||||
|
|
||||||
|
# Prepare head mask if needed
|
||||||
|
# 1.0 in head_mask indicates we keep the head
|
||||||
|
# attention_probs has shape bsz x n_heads x N x N
|
||||||
|
# input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
|
||||||
|
# and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
|
||||||
|
head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
|
||||||
|
|
||||||
|
embedding_output, mask, ids_restore = self.embeddings(pixel_values)
|
||||||
|
|
||||||
|
encoder_outputs = self.encoder(
|
||||||
|
embedding_output,
|
||||||
|
head_mask=head_mask,
|
||||||
|
output_attentions=output_attentions,
|
||||||
|
output_hidden_states=output_hidden_states,
|
||||||
|
return_dict=return_dict,
|
||||||
|
)
|
||||||
|
sequence_output = encoder_outputs[0]
|
||||||
|
sequence_output = self.layernorm(sequence_output)
|
||||||
|
|
||||||
|
if not return_dict:
|
||||||
|
return (sequence_output, mask, ids_restore) + encoder_outputs[1:]
|
||||||
|
|
||||||
|
return ViTMAEModelOutput(
|
||||||
|
last_hidden_state=sequence_output,
|
||||||
|
mask=mask,
|
||||||
|
ids_restore=ids_restore,
|
||||||
|
hidden_states=encoder_outputs.hidden_states,
|
||||||
|
attentions=encoder_outputs.attentions,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
class ViTMAEDecoder(nn.Module):
|
||||||
|
def __init__(self, config, num_patches):
|
||||||
|
super().__init__()
|
||||||
|
self.decoder_embed = nn.Linear(config.hidden_size, config.decoder_hidden_size, bias=True)
|
||||||
|
self.mask_token = nn.Parameter(torch.zeros(1, 1, config.decoder_hidden_size))
|
||||||
|
self.decoder_pos_embed = nn.Parameter(
|
||||||
|
torch.zeros(1, num_patches + 1, config.decoder_hidden_size), requires_grad=False
|
||||||
|
) # fixed sin-cos embedding
|
||||||
|
|
||||||
|
decoder_config = deepcopy(config)
|
||||||
|
decoder_config.hidden_size = config.decoder_hidden_size
|
||||||
|
decoder_config.num_hidden_layers = config.decoder_num_hidden_layers
|
||||||
|
decoder_config.num_attention_heads = config.decoder_num_attention_heads
|
||||||
|
decoder_config.intermediate_size = config.decoder_intermediate_size
|
||||||
|
self.decoder_layers = nn.ModuleList(
|
||||||
|
[ViTMAELayer(decoder_config) for _ in range(config.decoder_num_hidden_layers)]
|
||||||
|
)
|
||||||
|
|
||||||
|
self.decoder_norm = nn.LayerNorm(config.decoder_hidden_size)
|
||||||
|
self.decoder_pred = nn.Linear(
|
||||||
|
config.decoder_hidden_size, config.patch_size ** 2 * config.num_channels, bias=True
|
||||||
|
) # decoder to patch
|
||||||
|
self.gradient_checkpointing = False
|
||||||
|
self.config = config
|
||||||
|
self.initialize_weights(num_patches)
|
||||||
|
|
||||||
|
def initialize_weights(self, num_patches):
|
||||||
|
# initialize (and freeze) position embeddings by sin-cos embedding
|
||||||
|
decoder_pos_embed = get_2d_sincos_pos_embed(
|
||||||
|
self.decoder_pos_embed.shape[-1], int(num_patches ** 0.5), add_cls_token=True
|
||||||
|
)
|
||||||
|
self.decoder_pos_embed.data.copy_(torch.from_numpy(decoder_pos_embed).float().unsqueeze(0))
|
||||||
|
|
||||||
|
# timm's trunc_normal_(std=.02) is effectively normal_(std=0.02) as cutoff is too big (2.)
|
||||||
|
torch.nn.init.normal_(self.mask_token, std=self.config.initializer_range)
|
||||||
|
|
||||||
|
def forward(
|
||||||
|
self,
|
||||||
|
hidden_states,
|
||||||
|
ids_restore,
|
||||||
|
output_attentions=False,
|
||||||
|
output_hidden_states=False,
|
||||||
|
return_dict=True,
|
||||||
|
):
|
||||||
|
# embed tokens
|
||||||
|
x = self.decoder_embed(hidden_states)
|
||||||
|
|
||||||
|
# append mask tokens to sequence
|
||||||
|
mask_tokens = self.mask_token.repeat(x.shape[0], ids_restore.shape[1] + 1 - x.shape[1], 1)
|
||||||
|
x_ = torch.cat([x[:, 1:, :], mask_tokens], dim=1) # no cls token
|
||||||
|
x_ = torch.gather(x_, dim=1, index=ids_restore.unsqueeze(-1).repeat(1, 1, x.shape[2])) # unshuffle
|
||||||
|
x = torch.cat([x[:, :1, :], x_], dim=1) # prepend cls token
|
||||||
|
|
||||||
|
# add pos embed
|
||||||
|
hidden_states = x + self.decoder_pos_embed
|
||||||
|
|
||||||
|
# apply Transformer layers (blocks)
|
||||||
|
all_hidden_states = () if output_hidden_states else None
|
||||||
|
all_self_attentions = () if output_attentions else None
|
||||||
|
for i, layer_module in enumerate(self.decoder_layers):
|
||||||
|
if output_hidden_states:
|
||||||
|
all_hidden_states = all_hidden_states + (hidden_states,)
|
||||||
|
|
||||||
|
if self.gradient_checkpointing and self.training:
|
||||||
|
|
||||||
|
def create_custom_forward(module):
|
||||||
|
def custom_forward(*inputs):
|
||||||
|
return module(*inputs, output_attentions)
|
||||||
|
|
||||||
|
return custom_forward
|
||||||
|
|
||||||
|
layer_outputs = torch.utils.checkpoint.checkpoint(
|
||||||
|
create_custom_forward(layer_module),
|
||||||
|
hidden_states,
|
||||||
|
None,
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
layer_outputs = layer_module(hidden_states, head_mask=None, output_attentions=output_attentions)
|
||||||
|
|
||||||
|
hidden_states = layer_outputs[0]
|
||||||
|
|
||||||
|
if output_attentions:
|
||||||
|
all_self_attentions = all_self_attentions + (layer_outputs[1],)
|
||||||
|
|
||||||
|
if output_hidden_states:
|
||||||
|
all_hidden_states = all_hidden_states + (hidden_states,)
|
||||||
|
|
||||||
|
hidden_states = self.decoder_norm(hidden_states)
|
||||||
|
|
||||||
|
# predictor projection
|
||||||
|
logits = self.decoder_pred(hidden_states)
|
||||||
|
|
||||||
|
# remove cls token
|
||||||
|
logits = logits[:, 1:, :]
|
||||||
|
|
||||||
|
if not return_dict:
|
||||||
|
return tuple(v for v in [logits, all_hidden_states, all_self_attentions] if v is not None)
|
||||||
|
return ViTMAEDecoderOutput(
|
||||||
|
logits=logits,
|
||||||
|
hidden_states=all_hidden_states,
|
||||||
|
attentions=all_self_attentions,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
@add_start_docstrings(
|
||||||
|
"The ViTMAE Model transformer with the decoder on top for self-supervised pre-training.",
|
||||||
|
VIT_MAE_START_DOCSTRING,
|
||||||
|
)
|
||||||
|
class ViTMAEForPreTraining(ViTMAEPreTrainedModel):
|
||||||
|
def __init__(self, config):
|
||||||
|
super().__init__(config)
|
||||||
|
self.config = config
|
||||||
|
|
||||||
|
self.vit = ViTMAEModel(config)
|
||||||
|
self.decoder = ViTMAEDecoder(config, num_patches=self.vit.embeddings.num_patches)
|
||||||
|
|
||||||
|
# Initialize weights and apply final processing
|
||||||
|
self.post_init()
|
||||||
|
|
||||||
|
def get_input_embeddings(self):
|
||||||
|
return self.vit.embeddings.patch_embeddings
|
||||||
|
|
||||||
|
def _prune_heads(self, heads_to_prune):
|
||||||
|
"""
|
||||||
|
Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base
|
||||||
|
class PreTrainedModel
|
||||||
|
"""
|
||||||
|
for layer, heads in heads_to_prune.items():
|
||||||
|
self.vit.encoder.layer[layer].attention.prune_heads(heads)
|
||||||
|
|
||||||
|
def patchify(self, imgs):
|
||||||
|
"""
|
||||||
|
imgs: (N, 3, H, W); x: (N, L, patch_size**2 * 3)
|
||||||
|
"""
|
||||||
|
p = self.vit.embeddings.patch_embeddings.patch_size[0]
|
||||||
|
assert imgs.shape[2] == imgs.shape[3] and imgs.shape[2] % p == 0
|
||||||
|
|
||||||
|
h = w = imgs.shape[2] // p
|
||||||
|
x = imgs.reshape(shape=(imgs.shape[0], 3, h, p, w, p))
|
||||||
|
x = torch.einsum("nchpwq->nhwpqc", x)
|
||||||
|
x = x.reshape(shape=(imgs.shape[0], h * w, p ** 2 * 3))
|
||||||
|
return x
|
||||||
|
|
||||||
|
def unpatchify(self, x):
|
||||||
|
"""
|
||||||
|
x: (N, L, patch_size**2 * 3); imgs: (N, 3, H, W)
|
||||||
|
"""
|
||||||
|
p = self.vit.embeddings.patch_embeddings.patch_size[0]
|
||||||
|
h = w = int(x.shape[1] ** 0.5)
|
||||||
|
assert h * w == x.shape[1]
|
||||||
|
|
||||||
|
x = x.reshape(shape=(x.shape[0], h, w, p, p, 3))
|
||||||
|
x = torch.einsum("nhwpqc->nchpwq", x)
|
||||||
|
imgs = x.reshape(shape=(x.shape[0], 3, h * p, h * p))
|
||||||
|
return imgs
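Review note: a quick way to convince yourself that `patchify` and `unpatchify` are inverses is a round trip on a tiny array. The sketch below mirrors the reshape/einsum pattern in NumPy; the free functions and the explicit channel argument `c` are assumptions made for illustration (the methods in this PR hardcode 3 channels and read the patch size from the embeddings).

```python
import numpy as np


def patchify(imgs, p):
    """(N, C, H, W) -> (N, L, p*p*C) with L = (H/p) * (W/p); square images assumed."""
    n, c, height, width = imgs.shape
    assert height == width and height % p == 0
    h = w = height // p
    x = imgs.reshape(n, c, h, p, w, p)
    x = np.einsum("nchpwq->nhwpqc", x)  # group pixels by patch, channels last
    return x.reshape(n, h * w, p * p * c)


def unpatchify(x, p, c):
    """(N, L, p*p*C) -> (N, C, H, W); L must be a perfect square."""
    n, num_patches, _ = x.shape
    h = w = int(num_patches**0.5)
    assert h * w == num_patches
    x = x.reshape(n, h, w, p, p, c)
    x = np.einsum("nhwpqc->nchpwq", x)
    return x.reshape(n, c, h * p, w * p)


imgs = np.arange(2 * 3 * 4 * 4, dtype=np.float32).reshape(2, 3, 4, 4)
patches = patchify(imgs, p=2)
restored = unpatchify(patches, p=2, c=3)
print(patches.shape, np.array_equal(restored, imgs))
```

Each row of `patches` flattens one patch in `(row, column, channel)` order, which is the layout the decoder's prediction head is trained to reproduce.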
|
||||||
|
|
||||||
|
def forward_loss(self, imgs, pred, mask):
|
||||||
|
"""
|
||||||
|
imgs: [N, 3, H, W]; pred: [N, L, p*p*3]; mask: [N, L], where 0 is keep and 1 is remove
|
||||||
|
"""
|
||||||
|
target = self.patchify(imgs)
|
||||||
|
if self.config.norm_pix_loss:
|
||||||
|
mean = target.mean(dim=-1, keepdim=True)
|
||||||
|
var = target.var(dim=-1, keepdim=True)
|
||||||
|
target = (target - mean) / (var + 1.0e-6) ** 0.5
|
||||||
|
|
||||||
|
loss = (pred - target) ** 2
|
||||||
|
loss = loss.mean(dim=-1) # [N, L], mean loss per patch
|
||||||
|
|
||||||
|
loss = (loss * mask).sum() / mask.sum() # mean loss on removed patches
|
||||||
|
return loss
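Review note: the loss is averaged over removed patches only, so errors on visible patches do not contribute. A small NumPy sketch under that reading follows; the function name `masked_mse` and the toy arrays are hypothetical, and `ddof=1` is used because `torch.var` is unbiased by default.

```python
import numpy as np


def masked_mse(target, pred, mask, norm_pix_loss=False):
    """target, pred: (N, L, D) patch arrays; mask: (N, L) with 1 = removed."""
    if norm_pix_loss:
        mean = target.mean(axis=-1, keepdims=True)
        var = target.var(axis=-1, keepdims=True, ddof=1)  # unbiased, like torch.var
        target = (target - mean) / (var + 1.0e-6) ** 0.5
    loss = ((pred - target) ** 2).mean(axis=-1)  # (N, L), mean loss per patch
    return (loss * mask).sum() / mask.sum()  # average over removed patches only


target = np.zeros((1, 4, 3))
pred = np.ones((1, 4, 3))
pred[0, 0] = 5.0  # large error on a patch that is kept (mask 0)
mask = np.array([[0.0, 1.0, 1.0, 0.0]])
loss = masked_mse(target, pred, mask)
print(loss)
```

In this toy case the kept patch with the large error is ignored, and the result is the mean of the per-patch losses on the two removed patches.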
|
||||||
|
|
||||||
|
    @add_start_docstrings_to_model_forward(VIT_MAE_INPUTS_DOCSTRING)
    @replace_return_docstrings(output_type=ViTMAEForPreTrainingOutput, config_class=_CONFIG_FOR_DOC)
    def forward(
        self,
        pixel_values=None,
        head_mask=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
    ):
        r"""
        Returns:

        Examples:

        ```python
        >>> from transformers import ViTFeatureExtractor, ViTMAEForPreTraining
        >>> from PIL import Image
        >>> import requests

        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
        >>> image = Image.open(requests.get(url, stream=True).raw)

        >>> feature_extractor = ViTFeatureExtractor.from_pretrained("facebook/vit-mae-base")
        >>> model = ViTMAEForPreTraining.from_pretrained("facebook/vit-mae-base")

        >>> inputs = feature_extractor(images=image, return_tensors="pt")
        >>> outputs = model(**inputs)
        >>> loss = outputs.loss
        >>> mask = outputs.mask
        >>> ids_restore = outputs.ids_restore
        ```"""
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        outputs = self.vit(
            pixel_values,
            head_mask=head_mask,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        latent = outputs.last_hidden_state
        ids_restore = outputs.ids_restore
        mask = outputs.mask

        decoder_outputs = self.decoder(latent, ids_restore)  # [N, L, p*p*3]
        logits = decoder_outputs.logits

        loss = self.forward_loss(pixel_values, logits, mask)

        if not return_dict:
            output = (logits, mask, ids_restore) + outputs[2:]
            return ((loss,) + output) if loss is not None else output

        return ViTMAEForPreTrainingOutput(
            loss=loss,
            logits=logits,
            mask=mask,
            ids_restore=ids_restore,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )
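The `patchify`/`unpatchify` reshapes and the masked reconstruction loss in `forward_loss` can be sanity-checked outside the model. The following is a minimal sketch, not the library code: it uses NumPy in place of PyTorch, takes the patch size `p` as an argument instead of reading it from the patch embeddings, and the helper name `masked_mse` is made up for illustration. `ddof=1` is passed to `var` on the assumption that it should mirror the unbiased default of `torch.Tensor.var`.

```python
import numpy as np


def patchify(imgs, p):
    """(N, 3, H, W) -> (N, L, p*p*3), with L = (H // p) * (W // p)."""
    n, c, height, width = imgs.shape
    assert c == 3 and height == width and height % p == 0
    h = w = height // p
    x = imgs.reshape(n, 3, h, p, w, p)
    x = np.einsum("nchpwq->nhwpqc", x)  # move the channel axis last
    return x.reshape(n, h * w, p * p * 3)


def unpatchify(x, p):
    """(N, L, p*p*3) -> (N, 3, H, W); assumes a square grid of patches."""
    n, num_patches, _ = x.shape
    h = w = int(num_patches ** 0.5)
    assert h * w == num_patches
    x = x.reshape(n, h, w, p, p, 3)
    x = np.einsum("nhwpqc->nchpwq", x)  # inverse permutation of patchify
    return x.reshape(n, 3, h * p, w * p)


def masked_mse(imgs, pred, mask, p, norm_pix_loss=False):
    """Mean squared error averaged over removed patches only (mask == 1)."""
    target = patchify(imgs, p)
    if norm_pix_loss:
        mean = target.mean(axis=-1, keepdims=True)
        var = target.var(axis=-1, keepdims=True, ddof=1)  # assumption: match torch's unbiased default
        target = (target - mean) / (var + 1.0e-6) ** 0.5
    loss = ((pred - target) ** 2).mean(axis=-1)  # (N, L), mean loss per patch
    return (loss * mask).sum() / mask.sum()


rng = np.random.default_rng(0)
imgs = rng.standard_normal((2, 3, 8, 8))
patches = patchify(imgs, p=4)
restored = unpatchify(patches, p=4)

# 1 marks a removed patch, 0 a kept patch (same convention as the docstring).
mask = np.array([[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 1.0, 0.0]])
perfect = masked_mse(imgs, patches, mask, p=4)

print(patches.shape, np.array_equal(restored, imgs), perfect)
```

Because the loss is multiplied by `mask` before summing, reconstruction errors on kept patches do not contribute; only the removed patches are scored.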
@@ -3637,6 +3637,37 @@ class ViTPreTrainedModel(metaclass=DummyObject):
        requires_backends(self, ["torch"])


VIT_MAE_PRETRAINED_MODEL_ARCHIVE_LIST = None


class ViTMAEForPreTraining(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class ViTMAELayer(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class ViTMAEModel(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class ViTMAEPreTrainedModel(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


WAV_2_VEC_2_PRETRAINED_MODEL_ARCHIVE_LIST = None
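The dummy classes in this hunk exist so that the names can be imported even when PyTorch is missing, with the failure deferred until the class is actually used. The sketch below illustrates that general pattern only. `DummyObjectSketch`, `require_backends_sketch`, and `PlaceholderModel` are hypothetical stand-ins written for this example and are not the library's `DummyObject` or `requires_backends`; the backend name is deliberately one that is not installed.

```python
import importlib.util


def require_backends_sketch(obj, backends):
    """Raise ImportError if any named backend module cannot be found."""
    name = getattr(obj, "__name__", type(obj).__name__)
    missing = [b for b in backends if importlib.util.find_spec(b) is None]
    if missing:
        raise ImportError(f"{name} requires the following backends: {', '.join(missing)}")


class DummyObjectSketch(type):
    """Metaclass so that class-level attribute access also runs the backend check."""

    def __getattr__(cls, key):
        # Only reached when normal attribute lookup on the class fails.
        if key.startswith("_"):
            raise AttributeError(key)
        require_backends_sketch(cls, cls._backends)
        raise AttributeError(key)


class PlaceholderModel(metaclass=DummyObjectSketch):
    # Hypothetical backend name, chosen so the check always fails here.
    _backends = ["a_backend_that_is_not_installed"]

    def __init__(self, *args, **kwargs):
        require_backends_sketch(self, self._backends)


# Importing or referencing the class is fine; instantiating it is not.
try:
    PlaceholderModel()
except ImportError as err:
    print(err)
```

With this shape, both `PlaceholderModel()` and a class-level call such as `PlaceholderModel.from_pretrained` fail with an `ImportError` that names the missing backend, rather than with an unrelated `NameError` at import time.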
427
tests/test_modeling_vit_mae.py
Normal file
@@ -0,0 +1,427 @@
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Testing suite for the PyTorch ViTMAE model. """


import inspect
import math
import tempfile
import unittest

import numpy as np

from transformers import ViTMAEConfig
from transformers.file_utils import cached_property, is_torch_available, is_vision_available
from transformers.testing_utils import require_torch, require_vision, slow, torch_device

from .test_configuration_common import ConfigTester
from .test_modeling_common import ModelTesterMixin, floats_tensor, ids_tensor


if is_torch_available():
    import torch
    from torch import nn

    from transformers import ViTMAEForPreTraining, ViTMAEModel
    from transformers.models.vit.modeling_vit import VIT_PRETRAINED_MODEL_ARCHIVE_LIST, to_2tuple


if is_vision_available():
    from PIL import Image

    from transformers import ViTFeatureExtractor


class ViTMAEModelTester:
    def __init__(
        self,
        parent,
        batch_size=13,
        image_size=30,
        patch_size=2,
        num_channels=3,
        is_training=True,
        use_labels=True,
        hidden_size=32,
        num_hidden_layers=5,
        num_attention_heads=4,
        intermediate_size=37,
        hidden_act="gelu",
        hidden_dropout_prob=0.1,
        attention_probs_dropout_prob=0.1,
        type_sequence_label_size=10,
        initializer_range=0.02,
        num_labels=3,
        scope=None,
    ):
        self.parent = parent
        self.batch_size = batch_size
        self.image_size = image_size
        self.patch_size = patch_size
        self.num_channels = num_channels
        self.is_training = is_training
        self.use_labels = use_labels
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.intermediate_size = intermediate_size
        self.hidden_act = hidden_act
        self.hidden_dropout_prob = hidden_dropout_prob
        self.attention_probs_dropout_prob = attention_probs_dropout_prob
        self.type_sequence_label_size = type_sequence_label_size
        self.initializer_range = initializer_range
        self.scope = scope

    def prepare_config_and_inputs(self):
        pixel_values = floats_tensor([self.batch_size, self.num_channels, self.image_size, self.image_size])

        labels = None
        if self.use_labels:
            labels = ids_tensor([self.batch_size], self.type_sequence_label_size)

        config = self.get_config()

        return config, pixel_values, labels

    def get_config(self):
        return ViTMAEConfig(
            image_size=self.image_size,
            patch_size=self.patch_size,
            num_channels=self.num_channels,
            hidden_size=self.hidden_size,
            num_hidden_layers=self.num_hidden_layers,
            num_attention_heads=self.num_attention_heads,
            intermediate_size=self.intermediate_size,
            hidden_act=self.hidden_act,
            hidden_dropout_prob=self.hidden_dropout_prob,
            attention_probs_dropout_prob=self.attention_probs_dropout_prob,
            is_decoder=False,
            initializer_range=self.initializer_range,
        )

    def create_and_check_model(self, config, pixel_values, labels):
        model = ViTMAEModel(config=config)
        model.to(torch_device)
        model.eval()
        result = model(pixel_values)
        # expected sequence length = (num_patches + 1) * (1 - config.mask_ratio), rounded up
        # (we add 1 for the [CLS] token)
        image_size = to_2tuple(self.image_size)
        patch_size = to_2tuple(self.patch_size)
        num_patches = (image_size[1] // patch_size[1]) * (image_size[0] // patch_size[0])
        expected_seq_len = int(math.ceil((1 - config.mask_ratio) * (num_patches + 1)))
        self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, expected_seq_len, self.hidden_size))

    def create_and_check_for_pretraining(self, config, pixel_values, labels):
        model = ViTMAEForPreTraining(config)
        model.to(torch_device)
        model.eval()
        result = model(pixel_values)
        # expected sequence length = num_patches
        image_size = to_2tuple(self.image_size)
        patch_size = to_2tuple(self.patch_size)
        num_patches = (image_size[1] // patch_size[1]) * (image_size[0] // patch_size[0])
        expected_seq_len = num_patches
        expected_num_channels = self.patch_size ** 2 * self.num_channels
        self.parent.assertEqual(result.logits.shape, (self.batch_size, expected_seq_len, expected_num_channels))

    def prepare_config_and_inputs_for_common(self):
        config_and_inputs = self.prepare_config_and_inputs()
        (
            config,
            pixel_values,
            labels,
        ) = config_and_inputs
        inputs_dict = {"pixel_values": pixel_values}
        return config, inputs_dict


@require_torch
class ViTMAEModelTest(ModelTesterMixin, unittest.TestCase):
    """
    Here we also overwrite some of the tests of test_modeling_common.py, as ViTMAE does not use input_ids, inputs_embeds,
    attention_mask and seq_length.
    """

    all_model_classes = (ViTMAEModel, ViTMAEForPreTraining) if is_torch_available() else ()

    test_pruning = False
    test_torchscript = False
    test_resize_embeddings = False
    test_head_masking = False

    def setUp(self):
        self.model_tester = ViTMAEModelTester(self)
        self.config_tester = ConfigTester(self, config_class=ViTMAEConfig, has_text_modality=False, hidden_size=37)

    def test_config(self):
        self.config_tester.run_common_tests()

    def test_inputs_embeds(self):
        # ViTMAE does not use inputs_embeds
        pass

    def test_model_common_attributes(self):
        config, _ = self.model_tester.prepare_config_and_inputs_for_common()

        for model_class in self.all_model_classes:
            model = model_class(config)
            self.assertIsInstance(model.get_input_embeddings(), (nn.Module))
            x = model.get_output_embeddings()
            self.assertTrue(x is None or isinstance(x, nn.Linear))

    def test_forward_signature(self):
        config, _ = self.model_tester.prepare_config_and_inputs_for_common()

        for model_class in self.all_model_classes:
            model = model_class(config)
            signature = inspect.signature(model.forward)
            # signature.parameters is an OrderedDict => so arg_names order is deterministic
            arg_names = [*signature.parameters.keys()]

            expected_arg_names = ["pixel_values"]
            self.assertListEqual(arg_names[:1], expected_arg_names)

    def test_model(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
        self.model_tester.create_and_check_model(*config_and_inputs)

    def test_for_pretraining(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
        self.model_tester.create_and_check_for_pretraining(*config_and_inputs)

    def test_attention_outputs(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
        config.return_dict = True

        # in ViTMAE, the seq_len equals (number of patches + 1) * (1 - mask_ratio), rounded up
        image_size = to_2tuple(self.model_tester.image_size)
        patch_size = to_2tuple(self.model_tester.patch_size)
        num_patches = (image_size[1] // patch_size[1]) * (image_size[0] // patch_size[0])
        seq_len = int(math.ceil((1 - config.mask_ratio) * (num_patches + 1)))
        encoder_seq_length = getattr(self.model_tester, "encoder_seq_length", seq_len)
        encoder_key_length = getattr(self.model_tester, "key_length", encoder_seq_length)
        chunk_length = getattr(self.model_tester, "chunk_length", None)
        if chunk_length is not None and hasattr(self.model_tester, "num_hashes"):
            encoder_seq_length = encoder_seq_length * self.model_tester.num_hashes

        for model_class in self.all_model_classes:
            inputs_dict["output_attentions"] = True
            inputs_dict["output_hidden_states"] = False
            config.return_dict = True
            model = model_class(config)
            model.to(torch_device)
            model.eval()
            with torch.no_grad():
                outputs = model(**self._prepare_for_class(inputs_dict, model_class))
            attentions = outputs.encoder_attentions if config.is_encoder_decoder else outputs.attentions
            self.assertEqual(len(attentions), self.model_tester.num_hidden_layers)

            # check that output_attentions also works using config
            del inputs_dict["output_attentions"]
            config.output_attentions = True
            model = model_class(config)
            model.to(torch_device)
            model.eval()
            with torch.no_grad():
                outputs = model(**self._prepare_for_class(inputs_dict, model_class))
            attentions = outputs.encoder_attentions if config.is_encoder_decoder else outputs.attentions
            self.assertEqual(len(attentions), self.model_tester.num_hidden_layers)

            if chunk_length is not None:
                self.assertListEqual(
                    list(attentions[0].shape[-4:]),
                    [self.model_tester.num_attention_heads, encoder_seq_length, chunk_length, encoder_key_length],
                )
            else:
                self.assertListEqual(
                    list(attentions[0].shape[-3:]),
                    [self.model_tester.num_attention_heads, encoder_seq_length, encoder_key_length],
                )
            out_len = len(outputs)

            # Check attention is always last and order is fine
            inputs_dict["output_attentions"] = True
            inputs_dict["output_hidden_states"] = True
            model = model_class(config)
            model.to(torch_device)
            model.eval()
            with torch.no_grad():
                outputs = model(**self._prepare_for_class(inputs_dict, model_class))

            if hasattr(self.model_tester, "num_hidden_states_types"):
                added_hidden_states = self.model_tester.num_hidden_states_types
            elif self.is_encoder_decoder:
                added_hidden_states = 2
            else:
                added_hidden_states = 1
            self.assertEqual(out_len + added_hidden_states, len(outputs))

            self_attentions = outputs.encoder_attentions if config.is_encoder_decoder else outputs.attentions

            self.assertEqual(len(self_attentions), self.model_tester.num_hidden_layers)
            if chunk_length is not None:
                self.assertListEqual(
                    list(self_attentions[0].shape[-4:]),
                    [self.model_tester.num_attention_heads, encoder_seq_length, chunk_length, encoder_key_length],
                )
            else:
                self.assertListEqual(
                    list(self_attentions[0].shape[-3:]),
                    [self.model_tester.num_attention_heads, encoder_seq_length, encoder_key_length],
                )

    def test_hidden_states_output(self):
        def check_hidden_states_output(inputs_dict, config, model_class):
            model = model_class(config)
            model.to(torch_device)
            model.eval()

            with torch.no_grad():
                outputs = model(**self._prepare_for_class(inputs_dict, model_class))

            hidden_states = outputs.encoder_hidden_states if config.is_encoder_decoder else outputs.hidden_states

            expected_num_layers = getattr(
                self.model_tester, "expected_num_hidden_layers", self.model_tester.num_hidden_layers + 1
            )
            self.assertEqual(len(hidden_states), expected_num_layers)

            # ViTMAE has a different seq_length
            image_size = to_2tuple(self.model_tester.image_size)
            patch_size = to_2tuple(self.model_tester.patch_size)
            num_patches = (image_size[1] // patch_size[1]) * (image_size[0] // patch_size[0])
            seq_length = int(math.ceil((1 - config.mask_ratio) * (num_patches + 1)))

            self.assertListEqual(
                list(hidden_states[0].shape[-2:]),
                [seq_length, self.model_tester.hidden_size],
            )

        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()

        for model_class in self.all_model_classes:
            inputs_dict["output_hidden_states"] = True
            check_hidden_states_output(inputs_dict, config, model_class)

            # check that output_hidden_states also works using config
            del inputs_dict["output_hidden_states"]
            config.output_hidden_states = True

            check_hidden_states_output(inputs_dict, config, model_class)

    def test_save_load(self):

        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()

        for model_class in self.all_model_classes:

            print("Model class:", model_class)

            model = model_class(config)
            model.to(torch_device)
            model.eval()
            # make random mask reproducible
            torch.manual_seed(2)
            with torch.no_grad():
                outputs = model(**self._prepare_for_class(inputs_dict, model_class))

            out_2 = outputs[0].cpu().numpy()
            out_2[np.isnan(out_2)] = 0

            with tempfile.TemporaryDirectory() as tmpdirname:
                model.save_pretrained(tmpdirname)
                model = model_class.from_pretrained(tmpdirname)
                model.to(torch_device)
                # make random mask reproducible
                torch.manual_seed(2)
                with torch.no_grad():
                    after_outputs = model(**self._prepare_for_class(inputs_dict, model_class))

                # Make sure we don't have nans
                out_1 = after_outputs[0].cpu().numpy()
                out_1[np.isnan(out_1)] = 0
                max_diff = np.amax(np.abs(out_1 - out_2))
                self.assertLessEqual(max_diff, 1e-5)

    @unittest.skip(
        reason="""ViTMAE returns a random mask + ids_restore in each forward pass. See test_save_load
    to get deterministic results."""
    )
    def test_determinism(self):
        pass

    @unittest.skip(
        reason="""ViTMAE returns a random mask + ids_restore in each forward pass. See test_save_load
    to get deterministic results."""
    )
    def test_save_load_fast_init_from_base(self):
        pass

    @unittest.skip(
        reason="""ViTMAE returns a random mask + ids_restore in each forward pass. See test_save_load
    to get deterministic results."""
    )
    def test_save_load_fast_init_to_base(self):
        pass

    @unittest.skip(reason="""ViTMAE returns a random mask + ids_restore in each forward pass. See test_save_load""")
    def test_model_outputs_equivalence(self):
        pass

    @slow
    def test_model_from_pretrained(self):
        for model_name in VIT_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
            model = ViTMAEModel.from_pretrained(model_name)
            self.assertIsNotNone(model)


# We will verify our results on an image of cute cats
def prepare_img():
    image = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
    return image


@require_torch
@require_vision
class ViTMAEModelIntegrationTest(unittest.TestCase):
    @cached_property
    def default_feature_extractor(self):
        return ViTFeatureExtractor.from_pretrained("facebook/vit-mae-base") if is_vision_available() else None

    @slow
    def test_inference_for_pretraining(self):
        # make random mask reproducible
        torch.manual_seed(2)

        model = ViTMAEForPreTraining.from_pretrained("facebook/vit-mae-base").to(torch_device)

        feature_extractor = self.default_feature_extractor
        image = prepare_img()
        inputs = feature_extractor(images=image, return_tensors="pt").to(torch_device)

        # forward pass
        with torch.no_grad():
            outputs = model(**inputs)

        # verify the logits
        expected_shape = torch.Size((1, 196, 768))
        self.assertEqual(outputs.logits.shape, expected_shape)

        expected_slice = torch.tensor(
            [[0.7366, -1.3663, -0.2844], [0.7919, -1.3839, -0.3241], [0.4313, -0.7168, -0.2878]]
        ).to(torch_device)

        self.assertTrue(torch.allclose(outputs.logits[0, :3, :3], expected_slice, atol=1e-4))
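The shape assertions in this test file follow from a small amount of arithmetic, worked through below as a standalone sketch. The helper names are made up for illustration. The `mask_ratio` of 0.75 and the 224-pixel image with 16-pixel patches used for the integration case are assumptions, since neither value appears in this diff; the formulas themselves are copied from the tests.

```python
import math


def num_patches(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    return (image_size // patch_size) ** 2


def expected_encoder_seq_len(image_size, patch_size, mask_ratio):
    """Formula used in the tests: visible tokens (patches plus [CLS]), rounded up."""
    return int(math.ceil((1 - mask_ratio) * (num_patches(image_size, patch_size) + 1)))


def expected_logits_shape(batch_size, image_size, patch_size, num_channels=3):
    """The decoder predicts every patch, each flattened to patch_size**2 * num_channels values."""
    return (batch_size, num_patches(image_size, patch_size), patch_size ** 2 * num_channels)


# Tester defaults from ViTMAEModelTester: batch_size=13, image_size=30, patch_size=2.
print(num_patches(30, 2))
print(expected_encoder_seq_len(30, 2, mask_ratio=0.75))  # mask_ratio is an assumed value
print(expected_logits_shape(13, 30, 2))

# Assumed ViT-Base/16 geometry at 224x224 for the integration test.
print(expected_logits_shape(1, 224, 16))
```

Under those assumptions the last call reproduces the `(1, 196, 768)` shape asserted in `test_inference_for_pretraining`: 14 x 14 patches, each holding 16 x 16 x 3 pixel values.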