Add Perceiver IO (#14487)

* First draft

* Style and remove mlm

* Make forward pass work

* More improvements

* More improvements

* Fix bug

* More improvements

* More improvements

* Add PerceiverTokenizer first draft

* Improve conversion script

* More improvements

* Make conversion script work for the encoder

* Make conversion script work with local pickle files

* Style & quality, fix-copies

* Add dummy input to conversion script

* Add absolute position embeddings to TextPreProcessor

* Make forward pass of encoder work

* More improvements

* Move text preprocessor to separate script

* More improvements

* More improvements

* Add post processor

* Make MLM model work

* Style

* Add PerceiverForMaskedLM

* Add PerceiverImagePreprocessor

* Make style

* Make PerceiverForImageClassification work

* More improvements

* More improvements

* Use tokenizer in conversion script

* Use PerceiverForMaskedLM in conversion script

* Define custom PerceiverModelOutput

* Improve PerceiverAttention to make it work for both MLM and image classification

* More improvements

* More improvements

* More improvements to the conversion script

* Make conversion script work for both MLM and image classification

* Add PerceiverFeatureExtractor

* More improvements

* Style and quality

* Add center cropping

* Fix bug

* Small fix

* Add print statement

* Fix bug in image preprocessor

* Fix bug with conversion script

* Make output position embeddings an nn.Parameter layer instead of nn.Embedding

* Comment out print statements

* Add position encoding classes

* More improvements

* Use position_encoding_kwargs

* Add PerceiverForImageClassificationFourier

* Make style & quality

* Add PerceiverForImageClassificationConvProcessing

* Style & quality

* Add flow model

* Move processors to modeling file

* Make position encodings modular

* Make basic decoder use modular position encodings

* Add PerceiverForOpticalFlow to conversion script

* Add AudioPreprocessor

* Make it possible for the basic decoder to use Fourier position embeddings

* Add PerceiverForMultimodalAutoencoding

* Improve model for optical flow

* Improve _build_network_inputs method

* Add print statement

* Fix device issue

* Fix device of Fourier embeddings

* Add print statements for debugging

* Add another print statement

* Add another print statement

* Add another print statement

* Add another print statement

* Improve PerceiverAudioPreprocessor

* Improve conversion script for multimodal modal

* More improvements

* More improvements

* Improve multimodal model

* Make forward pass multimodal model work

* More improvements

* Improve tests

* Fix some more tests

* Add output dataclasses

* Make more tests pass

* Add print statements for debuggin

* Add tests for image classification

* Add PerceiverClassifierOutput

* More improvements

* Make more tests pass for the optical flow model

* Make style & quality

* Small improvements

* Don't support training for optical flow model for now

* Fix _prepare_for_class for tests

* Make more tests pass, add some docs

* Add multimodal model to tests

* Minor fixes

* Fix tests

* Improve conversion script

* Make fixup

* Remove pos_dim argument

* Fix device issue

* Potential fix for OOM

* Revert previous commit

* Fix test_initialization

* Add print statements for debugging

* Fix print statement

* Add print statement

* Add print statement

* Add print statement

* Add print statement

* Add print statement

* Add print statement

* Remove need for output_shape

* Comment out output_shape

* Remove unnecessary code

* Improve docs

* Fix make fixup

* Remove PerceiverTextProcessor from init

* Improve docs

* Small improvement

* Apply first batch of suggestions from code review

* Apply more suggestions from code review

* Update docstrings

* Define dicts beforehand for readability

* Rename task to architecture in conversion script, include PerceiverModel in tests

* Add print statements for debugging

* Fix tests on GPU

* Remove preprocessors, postprocessors and decoders from main init

* Add integration test

* Fix docs

* Replace einops by torch

* Update for new docs frontend

* Rename PerceiverForImageClassification

* Improve docs

* Improve docs

* Improve docs of PerceiverModel

* Fix some more tests

* Improve center_crop

* Add PerceiverForSequenceClassification

* Small improvements

* Fix tests

* Add integration test for optical flow model

* Clean up

* Add tests for tokenizer

* Fix tokenizer by adding special tokens properly

* Fix CI
This commit is contained in:
NielsRogge
2021-12-08 14:20:34 +01:00
committed by GitHub
parent 961732c276
commit 65b20b739b
22 changed files with 6592 additions and 0 deletions

View File

@@ -286,6 +286,7 @@ Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
1. **[MPNet](https://huggingface.co/docs/transformers/model_doc/mpnet)** (from Microsoft Research) released with the paper [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu.
1. **[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (from Google AI) released with the paper [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
1. **[Pegasus](https://huggingface.co/docs/transformers/model_doc/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
1. **[Perceiver IO](https://huggingface.co/docs/transformers/model_doc/perceiver)** (from Deepmind) released with the paper [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira.
1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (from VinAI Research) released with the paper [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) by Dat Quoc Nguyen and Anh Tuan Nguyen.
1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius.

View File

@@ -265,6 +265,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[MPNet](https://huggingface.co/docs/transformers/model_doc/mpnet)** (from Microsoft Research) released with the paper [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu.
1. **[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (from Google AI) released with the paper [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
1. **[Pegasus](https://huggingface.co/docs/transformers/model_doc/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
1. **[Perceiver IO](https://huggingface.co/docs/transformers/model_doc/perceiver)** (from Deepmind) released with the paper [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira.
1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (from VinAI Research) released with the paper [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) by Dat Quoc Nguyen and Anh Tuan Nguyen.
1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius.

View File

@@ -289,6 +289,7 @@ conda install -c huggingface transformers
1. **[MPNet](https://huggingface.co/docs/transformers/model_doc/mpnet)** (来自 Microsoft Research) 伴随论文 [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) 由 Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu 发布。
1. **[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (来自 Google AI) 伴随论文 [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) 由 Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel 发布。
1. **[Pegasus](https://huggingface.co/docs/transformers/model_doc/pegasus)** (来自 Google) 伴随论文 [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) 由 Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu 发布。
1. **[Perceiver IO](https://huggingface.co/docs/transformers/model_doc/perceiver)** (来自 Deepmind) 伴随论文 [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) 由 Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira 发布。
1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (来自 VinAI Research) 伴随论文 [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) 由 Dat Quoc Nguyen and Anh Tuan Nguyen 发布。
1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (来自 Microsoft Research) 伴随论文 [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) 由 Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou 发布。
1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (来自 NVIDIA) 伴随论文 [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) 由 Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius 发布。

View File

@@ -301,6 +301,7 @@ conda install -c huggingface transformers
1. **[MPNet](https://huggingface.co/docs/transformers/model_doc/mpnet)** (from Microsoft Research) released with the paper [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu.
1. **[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (from Google AI) released with the paper [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
1. **[Pegasus](https://huggingface.co/docs/transformers/model_doc/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
1. **[Perceiver IO](https://huggingface.co/docs/transformers/model_doc/perceiver)** (from Deepmind) released with the paper [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira.
1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (from VinAI Research) released with the paper [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) by Dat Quoc Nguyen and Anh Tuan Nguyen.
1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius.

View File

@@ -146,6 +146,7 @@ conversion utilities for the following models.
1. **[MPNet](model_doc/mpnet)** (from Microsoft Research) released with the paper [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu.
1. **[MT5](model_doc/mt5)** (from Google AI) released with the paper [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
1. **[Pegasus](model_doc/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
1. **[Perceiver IO](model_doc/perceiver)** (from Deepmind) released with the paper [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira.
1. **[PhoBERT](model_doc/phobert)** (from VinAI Research) released with the paper [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) by Dat Quoc Nguyen and Anh Tuan Nguyen.
1. **[ProphetNet](model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
1. **[QDQBert](model_doc/qdqbert)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius.
@@ -234,6 +235,7 @@ Flax), PyTorch, and/or TensorFlow.
| OpenAI GPT | ✅ | ✅ | ✅ | ✅ | ❌ |
| OpenAI GPT-2 | ✅ | ✅ | ✅ | ✅ | ✅ |
| Pegasus | ✅ | ✅ | ✅ | ✅ | ✅ |
| Perceiver | ✅ | ❌ | ✅ | ❌ | ❌ |
| ProphetNet | ✅ | ❌ | ✅ | ❌ | ❌ |
| QDQBert | ❌ | ❌ | ✅ | ❌ | ❌ |
| RAG | ✅ | ❌ | ✅ | ✅ | ❌ |

718
docs/source/index.rst Normal file
View File

@@ -0,0 +1,718 @@
Transformers
=======================================================================================================================
State-of-the-art Natural Language Processing for Jax, Pytorch and TensorFlow
🤗 Transformers (formerly known as `pytorch-transformers` and `pytorch-pretrained-bert`) provides general-purpose
architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet...) for Natural Language Understanding (NLU) and Natural
Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between Jax,
PyTorch and TensorFlow.
This is the documentation of our repository `transformers <https://github.com/huggingface/transformers>`__. You can
also follow our `online course <https://huggingface.co/course>`__ that teaches how to use this library, as well as the
other libraries developed by Hugging Face and the Hub.
If you are looking for custom support from the Hugging Face team
-----------------------------------------------------------------------------------------------------------------------
.. raw:: html
<a target="_blank" href="https://huggingface.co/support">
<img alt="HuggingFace Expert Acceleration Program" src="https://huggingface.co/front/thumbnails/support.png" style="max-width: 600px; border: 1px solid #eee; border-radius: 4px; box-shadow: 0 1px 2px 0 rgba(0, 0, 0, 0.05);">
</a><br>
Features
-----------------------------------------------------------------------------------------------------------------------
- High performance on NLU and NLG tasks
- Low barrier to entry for educators and practitioners
State-of-the-art NLP for everyone:
- Deep learning researchers
- Hands-on practitioners
- AI/ML/NLP teachers and educators
..
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
Lower compute costs, smaller carbon footprint:
- Researchers can share trained models instead of always retraining
- Practitioners can reduce compute time and production costs
- 8 architectures with over 30 pretrained models, some in more than 100 languages
Choose the right framework for every part of a model's lifetime:
- Train state-of-the-art models in 3 lines of code
- Deep interoperability between Jax, Pytorch and TensorFlow models
- Move a single model between Jax/PyTorch/TensorFlow frameworks at will
- Seamlessly pick the right framework for training, evaluation, production
The support for Jax is still experimental (with a few models right now), expect to see it grow in the coming months!
`All the model checkpoints <https://huggingface.co/models>`__ are seamlessly integrated from the huggingface.co `model
hub <https://huggingface.co>`__ where they are uploaded directly by `users <https://huggingface.co/users>`__ and
`organizations <https://huggingface.co/organizations>`__.
Current number of checkpoints: |checkpoints|
.. |checkpoints| image:: https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/models&color=brightgreen
Contents
-----------------------------------------------------------------------------------------------------------------------
The documentation is organized in five parts:
- **GET STARTED** contains a quick tour, the installation instructions and some useful information about our philosophy
and a glossary.
- **USING 🤗 TRANSFORMERS** contains general tutorials on how to use the library.
- **ADVANCED GUIDES** contains more advanced guides that are more specific to a given script or part of the library.
- **RESEARCH** focuses on tutorials that have less to do with how to use the library but more about general research in
transformers model
- The three last section contain the documentation of each public class and function, grouped in:
- **MAIN CLASSES** for the main classes exposing the important APIs of the library.
- **MODELS** for the classes and functions related to each model implemented in the library.
- **INTERNAL HELPERS** for the classes and functions we use internally.
The library currently contains Jax, PyTorch and Tensorflow implementations, pretrained model weights, usage scripts and
conversion utilities for the following models.
Supported models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
..
This list is updated automatically from the README with `make fix-copies`. Do not update manually!
1. :doc:`ALBERT <model_doc/albert>` (from Google Research and the Toyota Technological Institute at Chicago) released
with the paper `ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
<https://arxiv.org/abs/1909.11942>`__, by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush
Sharma, Radu Soricut.
2. :doc:`BART <model_doc/bart>` (from Facebook) released with the paper `BART: Denoising Sequence-to-Sequence
Pre-training for Natural Language Generation, Translation, and Comprehension
<https://arxiv.org/pdf/1910.13461.pdf>`__ by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman
Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
3. :doc:`BARThez <model_doc/barthez>` (from École polytechnique) released with the paper `BARThez: a Skilled Pretrained
French Sequence-to-Sequence Model <https://arxiv.org/abs/2010.12321>`__ by Moussa Kamal Eddine, Antoine J.-P.
Tixier, Michalis Vazirgiannis.
4. :doc:`BARTpho <model_doc/bartpho>` (from VinAI Research) released with the paper `BARTpho: Pre-trained
Sequence-to-Sequence Models for Vietnamese <https://arxiv.org/abs/2109.09701>`__ by Nguyen Luong Tran, Duong Minh Le
and Dat Quoc Nguyen.
5. :doc:`BEiT <model_doc/beit>` (from Microsoft) released with the paper `BEiT: BERT Pre-Training of Image Transformers
<https://arxiv.org/abs/2106.08254>`__ by Hangbo Bao, Li Dong, Furu Wei.
6. :doc:`BERT <model_doc/bert>` (from Google) released with the paper `BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding <https://arxiv.org/abs/1810.04805>`__ by Jacob Devlin, Ming-Wei Chang,
Kenton Lee and Kristina Toutanova.
7. :doc:`BERTweet <model_doc/bertweet>` (from VinAI Research) released with the paper `BERTweet: A pre-trained language
model for English Tweets <https://aclanthology.org/2020.emnlp-demos.2/>`__ by Dat Quoc Nguyen, Thanh Vu and Anh Tuan
Nguyen.
8. :doc:`BERT For Sequence Generation <model_doc/bertgeneration>` (from Google) released with the paper `Leveraging
Pre-trained Checkpoints for Sequence Generation Tasks <https://arxiv.org/abs/1907.12461>`__ by Sascha Rothe, Shashi
Narayan, Aliaksei Severyn.
9. :doc:`BigBird-RoBERTa <model_doc/bigbird>` (from Google Research) released with the paper `Big Bird: Transformers
for Longer Sequences <https://arxiv.org/abs/2007.14062>`__ by Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua
Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed.
10. :doc:`BigBird-Pegasus <model_doc/bigbird_pegasus>` (from Google Research) released with the paper `Big Bird:
Transformers for Longer Sequences <https://arxiv.org/abs/2007.14062>`__ by Manzil Zaheer, Guru Guruganesh, Avinava
Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr
Ahmed.
11. :doc:`Blenderbot <model_doc/blenderbot>` (from Facebook) released with the paper `Recipes for building an
open-domain chatbot <https://arxiv.org/abs/2004.13637>`__ by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary
Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
12. :doc:`BlenderbotSmall <model_doc/blenderbot_small>` (from Facebook) released with the paper `Recipes for building
an open-domain chatbot <https://arxiv.org/abs/2004.13637>`__ by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju,
Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
13. :doc:`BORT <model_doc/bort>` (from Alexa) released with the paper `Optimal Subarchitecture Extraction For BERT
<https://arxiv.org/abs/2010.10499>`__ by Adrian de Wynter and Daniel J. Perry.
14. :doc:`ByT5 <model_doc/byt5>` (from Google Research) released with the paper `ByT5: Towards a token-free future with
pre-trained byte-to-byte models <https://arxiv.org/abs/2105.13626>`__ by Linting Xue, Aditya Barua, Noah Constant,
Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel.
15. :doc:`CamemBERT <model_doc/camembert>` (from Inria/Facebook/Sorbonne) released with the paper `CamemBERT: a Tasty
French Language Model <https://arxiv.org/abs/1911.03894>`__ by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz
Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
16. :doc:`CANINE <model_doc/canine>` (from Google Research) released with the paper `CANINE: Pre-training an Efficient
Tokenization-Free Encoder for Language Representation <https://arxiv.org/abs/2103.06874>`__ by Jonathan H. Clark,
Dan Garrette, Iulia Turc, John Wieting.
17. :doc:`CLIP <model_doc/clip>` (from OpenAI) released with the paper `Learning Transferable Visual Models From
Natural Language Supervision <https://arxiv.org/abs/2103.00020>`__ by Alec Radford, Jong Wook Kim, Chris Hallacy,
Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen
Krueger, Ilya Sutskever.
18. :doc:`ConvBERT <model_doc/convbert>` (from YituTech) released with the paper `ConvBERT: Improving BERT with
Span-based Dynamic Convolution <https://arxiv.org/abs/2008.02496>`__ by Zihang Jiang, Weihao Yu, Daquan Zhou,
Yunpeng Chen, Jiashi Feng, Shuicheng Yan.
19. :doc:`CPM <model_doc/cpm>` (from Tsinghua University) released with the paper `CPM: A Large-scale Generative
Chinese Pre-trained Language Model <https://arxiv.org/abs/2012.00413>`__ by Zhengyan Zhang, Xu Han, Hao Zhou, Pei
Ke, Yuxian Gu, Deming Ye, Yujia Qin, Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng,
Guoyang Zeng, Huanqi Cao, Shengqi Chen, Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang,
Juanzi Li, Xiaoyan Zhu, Maosong Sun.
20. :doc:`CTRL <model_doc/ctrl>` (from Salesforce) released with the paper `CTRL: A Conditional Transformer Language
Model for Controllable Generation <https://arxiv.org/abs/1909.05858>`__ by Nitish Shirish Keskar*, Bryan McCann*,
Lav R. Varshney, Caiming Xiong and Richard Socher.
21. :doc:`DeBERTa <model_doc/deberta>` (from Microsoft) released with the paper `DeBERTa: Decoding-enhanced BERT with
Disentangled Attention <https://arxiv.org/abs/2006.03654>`__ by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu
Chen.
22. :doc:`DeBERTa-v2 <model_doc/deberta_v2>` (from Microsoft) released with the paper `DeBERTa: Decoding-enhanced BERT
with Disentangled Attention <https://arxiv.org/abs/2006.03654>`__ by Pengcheng He, Xiaodong Liu, Jianfeng Gao,
Weizhu Chen.
23. :doc:`DeiT <model_doc/deit>` (from Facebook) released with the paper `Training data-efficient image transformers &
distillation through attention <https://arxiv.org/abs/2012.12877>`__ by Hugo Touvron, Matthieu Cord, Matthijs
Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou.
24. :doc:`DETR <model_doc/detr>` (from Facebook) released with the paper `End-to-End Object Detection with Transformers
<https://arxiv.org/abs/2005.12872>`__ by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier,
Alexander Kirillov, Sergey Zagoruyko.
25. :doc:`DialoGPT <model_doc/dialogpt>` (from Microsoft Research) released with the paper `DialoGPT: Large-Scale
Generative Pre-training for Conversational Response Generation <https://arxiv.org/abs/1911.00536>`__ by Yizhe
Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
26. :doc:`DistilBERT <model_doc/distilbert>` (from HuggingFace), released together with the paper `DistilBERT, a
distilled version of BERT: smaller, faster, cheaper and lighter <https://arxiv.org/abs/1910.01108>`__ by Victor
Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into `DistilGPT2
<https://github.com/huggingface/transformers/tree/master/examples/distillation>`__, RoBERTa into `DistilRoBERTa
<https://github.com/huggingface/transformers/tree/master/examples/distillation>`__, Multilingual BERT into
`DistilmBERT <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__ and a German
version of DistilBERT.
27. :doc:`DPR <model_doc/dpr>` (from Facebook) released with the paper `Dense Passage Retrieval for Open-Domain
Question Answering <https://arxiv.org/abs/2004.04906>`__ by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick
Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
28. :doc:`EncoderDecoder <model_doc/encoderdecoder>` (from Google Research) released with the paper `Leveraging
Pre-trained Checkpoints for Sequence Generation Tasks <https://arxiv.org/abs/1907.12461>`__ by Sascha Rothe, Shashi
Narayan, Aliaksei Severyn.
29. :doc:`ELECTRA <model_doc/electra>` (from Google Research/Stanford University) released with the paper `ELECTRA:
Pre-training text encoders as discriminators rather than generators <https://arxiv.org/abs/2003.10555>`__ by Kevin
Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
30. :doc:`FlauBERT <model_doc/flaubert>` (from CNRS) released with the paper `FlauBERT: Unsupervised Language Model
Pre-training for French <https://arxiv.org/abs/1912.05372>`__ by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne,
Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab.
31. :doc:`FNet <model_doc/fnet>` (from Google Research) released with the paper `FNet: Mixing Tokens with Fourier
Transforms <https://arxiv.org/abs/2105.03824>`__ by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago
Ontanon.
32. :doc:`Funnel Transformer <model_doc/funnel>` (from CMU/Google Brain) released with the paper `Funnel-Transformer:
Filtering out Sequential Redundancy for Efficient Language Processing <https://arxiv.org/abs/2006.03236>`__ by
Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
33. :doc:`GPT <model_doc/gpt>` (from OpenAI) released with the paper `Improving Language Understanding by Generative
Pre-Training <https://blog.openai.com/language-unsupervised/>`__ by Alec Radford, Karthik Narasimhan, Tim Salimans
and Ilya Sutskever.
34. :doc:`GPT-2 <model_doc/gpt2>` (from OpenAI) released with the paper `Language Models are Unsupervised Multitask
Learners <https://blog.openai.com/better-language-models/>`__ by Alec Radford*, Jeffrey Wu*, Rewon Child, David
Luan, Dario Amodei** and Ilya Sutskever**.
35. :doc:`GPT-J <model_doc/gptj>` (from EleutherAI) released in the repository `kingoflolz/mesh-transformer-jax
<https://github.com/kingoflolz/mesh-transformer-jax/>`__ by Ben Wang and Aran Komatsuzaki.
36. :doc:`GPT Neo <model_doc/gpt_neo>` (from EleutherAI) released in the repository `EleutherAI/gpt-neo
<https://github.com/EleutherAI/gpt-neo>`__ by Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy.
37. :doc:`Hubert <model_doc/hubert>` (from Facebook) released with the paper `HuBERT: Self-Supervised Speech
Representation Learning by Masked Prediction of Hidden Units <https://arxiv.org/abs/2106.07447>`__ by Wei-Ning Hsu,
Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
38. :doc:`I-BERT <model_doc/ibert>` (from Berkeley) released with the paper `I-BERT: Integer-only BERT Quantization
<https://arxiv.org/abs/2101.01321>`__ by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
39. `ImageGPT <https://huggingface.co/transformers/master/model_doc/imagegpt.html>`__ (from OpenAI) released with the
paper `Generative Pretraining from Pixels <https://openai.com/blog/image-gpt/>`__ by Mark Chen, Alec Radford, Rewon
Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever.
40. :doc:`LayoutLM <model_doc/layoutlm>` (from Microsoft Research Asia) released with the paper `LayoutLM: Pre-training
of Text and Layout for Document Image Understanding <https://arxiv.org/abs/1912.13318>`__ by Yiheng Xu, Minghao Li,
Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
41. :doc:`LayoutLMv2 <model_doc/layoutlmv2>` (from Microsoft Research Asia) released with the paper `LayoutLMv2:
Multi-modal Pre-training for Visually-Rich Document Understanding <https://arxiv.org/abs/2012.14740>`__ by Yang Xu,
Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min
Zhang, Lidong Zhou.
42. :doc:`LayoutXLM <model_doc/layoutlmv2>` (from Microsoft Research Asia) released with the paper `LayoutXLM:
Multimodal Pre-training for Multilingual Visually-rich Document Understanding <https://arxiv.org/abs/2104.08836>`__
by Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Furu Wei.
43. :doc:`LED <model_doc/led>` (from AllenAI) released with the paper `Longformer: The Long-Document Transformer
<https://arxiv.org/abs/2004.05150>`__ by Iz Beltagy, Matthew E. Peters, Arman Cohan.
44. :doc:`Longformer <model_doc/longformer>` (from AllenAI) released with the paper `Longformer: The Long-Document
Transformer <https://arxiv.org/abs/2004.05150>`__ by Iz Beltagy, Matthew E. Peters, Arman Cohan.
45. :doc:`LUKE <model_doc/luke>` (from Studio Ousia) released with the paper `LUKE: Deep Contextualized Entity
Representations with Entity-aware Self-attention <https://arxiv.org/abs/2010.01057>`__ by Ikuya Yamada, Akari Asai,
Hiroyuki Shindo, Hideaki Takeda, Yuji Matsumoto.
46. :doc:`LXMERT <model_doc/lxmert>` (from UNC Chapel Hill) released with the paper `LXMERT: Learning Cross-Modality
Encoder Representations from Transformers for Open-Domain Question Answering <https://arxiv.org/abs/1908.07490>`__
by Hao Tan and Mohit Bansal.
47. :doc:`M2M100 <model_doc/m2m_100>` (from Facebook) released with the paper `Beyond English-Centric Multilingual
Machine Translation <https://arxiv.org/abs/2010.11125>`__ by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma,
Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal,
Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.
48. :doc:`MarianMT <model_doc/marian>` Machine translation models trained using `OPUS <http://opus.nlpl.eu/>`__ data by
Jörg Tiedemann. The `Marian Framework <https://marian-nmt.github.io/>`__ is being developed by the Microsoft
Translator Team.
49. :doc:`MBart <model_doc/mbart>` (from Facebook) released with the paper `Multilingual Denoising Pre-training for
Neural Machine Translation <https://arxiv.org/abs/2001.08210>`__ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li,
Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
50. :doc:`MBart-50 <model_doc/mbart>` (from Facebook) released with the paper `Multilingual Translation with Extensible
Multilingual Pretraining and Finetuning <https://arxiv.org/abs/2008.00401>`__ by Yuqing Tang, Chau Tran, Xian Li,
Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan.
51. :doc:`Megatron-BERT <model_doc/megatron_bert>` (from NVIDIA) released with the paper `Megatron-LM: Training
Multi-Billion Parameter Language Models Using Model Parallelism <https://arxiv.org/abs/1909.08053>`__ by Mohammad
Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
52. :doc:`Megatron-GPT2 <model_doc/megatron_gpt2>` (from NVIDIA) released with the paper `Megatron-LM: Training
Multi-Billion Parameter Language Models Using Model Parallelism <https://arxiv.org/abs/1909.08053>`__ by Mohammad
Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
53. :doc:`MPNet <model_doc/mpnet>` (from Microsoft Research) released with the paper `MPNet: Masked and Permuted
Pre-training for Language Understanding <https://arxiv.org/abs/2004.09297>`__ by Kaitao Song, Xu Tan, Tao Qin,
Jianfeng Lu, Tie-Yan Liu.
54. :doc:`MT5 <model_doc/mt5>` (from Google AI) released with the paper `mT5: A massively multilingual pre-trained
text-to-text transformer <https://arxiv.org/abs/2010.11934>`__ by Linting Xue, Noah Constant, Adam Roberts, Mihir
Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
55. :doc:`Pegasus <model_doc/pegasus>` (from Google) released with the paper `PEGASUS: Pre-training with Extracted
Gap-sentences for Abstractive Summarization <https://arxiv.org/abs/1912.08777>`__ by Jingqing Zhang, Yao Zhao,
Mohammad Saleh and Peter J. Liu.
56. `Perceiver IO <https://huggingface.co/transformers/model_doc/master/perceiver.html>`__ (from Deepmind) released
with the paper `Perceiver IO: A General Architecture for Structured Inputs & Outputs
<https://arxiv.org/abs/2107.14795>`__ by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch,
Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M.
Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira.
57. :doc:`PhoBERT <model_doc/phobert>` (from VinAI Research) released with the paper `PhoBERT: Pre-trained language
models for Vietnamese <https://www.aclweb.org/anthology/2020.findings-emnlp.92/>`__ by Dat Quoc Nguyen and Anh Tuan
Nguyen.
58. :doc:`ProphetNet <model_doc/prophetnet>` (from Microsoft Research) released with the paper `ProphetNet: Predicting
Future N-gram for Sequence-to-Sequence Pre-training <https://arxiv.org/abs/2001.04063>`__ by Yu Yan, Weizhen Qi,
Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
59. :doc:`QDQBert <model_doc/qdqbert>` (from NVIDIA) released with the paper `Integer Quantization for Deep Learning
Inference: Principles and Empirical Evaluation <https://arxiv.org/abs/2004.09602>`__ by Hao Wu, Patrick Judd,
Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius.
60. :doc:`Reformer <model_doc/reformer>` (from Google Research) released with the paper `Reformer: The Efficient
Transformer <https://arxiv.org/abs/2001.04451>`__ by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
61. :doc:`RemBERT <model_doc/rembert>` (from Google Research) released with the paper `Rethinking embedding coupling in
pre-trained language models <https://arxiv.org/pdf/2010.12821.pdf>`__ by Hyung Won Chung, Thibault Févry, Henry
Tsai, M. Johnson, Sebastian Ruder.
62. :doc:`RoBERTa <model_doc/roberta>` (from Facebook), released together with the paper a `Robustly Optimized BERT
Pretraining Approach <https://arxiv.org/abs/1907.11692>`__ by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar
Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
63. :doc:`RoFormer <model_doc/roformer>` (from ZhuiyiTechnology), released together with the paper a `RoFormer:
Enhanced Transformer with Rotary Position Embedding <https://arxiv.org/pdf/2104.09864v1.pdf>`__ by Jianlin Su and
Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
64. :doc:`SegFormer <model_doc/segformer>` (from NVIDIA) released with the paper `SegFormer: Simple and Efficient
Design for Semantic Segmentation with Transformers <https://arxiv.org/abs/2105.15203>`__ by Enze Xie, Wenhai Wang,
Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
65. :doc:`SEW <model_doc/sew>` (from ASAPP) released with the paper `Performance-Efficiency Trade-offs in Unsupervised
Pre-training for Speech Recognition <https://arxiv.org/abs/2109.06870>`__ by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu
Han, Kilian Q. Weinberger, Yoav Artzi.
66. :doc:`SEW-D <model_doc/sew_d>` (from ASAPP) released with the paper `Performance-Efficiency Trade-offs in
Unsupervised Pre-training for Speech Recognition <https://arxiv.org/abs/2109.06870>`__ by Felix Wu, Kwangyoun Kim,
Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
67. :doc:`SpeechToTextTransformer <model_doc/speech_to_text>` (from Facebook), released together with the paper
`fairseq S2T: Fast Speech-to-Text Modeling with fairseq <https://arxiv.org/abs/2010.05171>`__ by Changhan Wang, Yun
Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino.
68. :doc:`SpeechToTextTransformer2 <model_doc/speech_to_text_2>` (from Facebook), released together with the paper
`Large-Scale Self- and Semi-Supervised Learning for Speech Translation <https://arxiv.org/abs/2104.06678>`__ by
Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau.
69. :doc:`Splinter <model_doc/splinter>` (from Tel Aviv University), released together with the paper `Few-Shot
Question Answering by Pretraining Span Selection <https://arxiv.org/abs/2101.00438>`__ by Ori Ram, Yuval Kirstain,
Jonathan Berant, Amir Globerson, Omer Levy.
70. :doc:`SqueezeBert <model_doc/squeezebert>` (from Berkeley) released with the paper `SqueezeBERT: What can computer
vision teach NLP about efficient neural networks? <https://arxiv.org/abs/2006.11316>`__ by Forrest N. Iandola,
Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
71. :doc:`T5 <model_doc/t5>` (from Google AI) released with the paper `Exploring the Limits of Transfer Learning with a
Unified Text-to-Text Transformer <https://arxiv.org/abs/1910.10683>`__ by Colin Raffel and Noam Shazeer and Adam
Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
72. :doc:`T5v1.1 <model_doc/t5v1.1>` (from Google AI) released in the repository
`google-research/text-to-text-transfer-transformer
<https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511>`__ by
Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi
Zhou and Wei Li and Peter J. Liu.
73. :doc:`TAPAS <model_doc/tapas>` (from Google AI) released with the paper `TAPAS: Weakly Supervised Table Parsing via
Pre-training <https://arxiv.org/abs/2004.02349>`__ by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller,
Francesco Piccinno and Julian Martin Eisenschlos.
74. :doc:`Transformer-XL <model_doc/transformerxl>` (from Google/CMU) released with the paper `Transformer-XL:
Attentive Language Models Beyond a Fixed-Length Context <https://arxiv.org/abs/1901.02860>`__ by Zihang Dai*,
Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
75. :doc:`TrOCR <model_doc/trocr>` (from Microsoft), released together with the paper `TrOCR: Transformer-based Optical
Character Recognition with Pre-trained Models <https://arxiv.org/abs/2109.10282>`__ by Minghao Li, Tengchao Lv, Lei
Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
76. :doc:`UniSpeech <model_doc/unispeech>` (from Microsoft Research) released with the paper `UniSpeech: Unified Speech
Representation Learning with Labeled and Unlabeled Data <https://arxiv.org/abs/2101.07597>`__ by Chengyi Wang, Yu
Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.
77. :doc:`UniSpeechSat <model_doc/unispeech_sat>` (from Microsoft Research) released with the paper `UNISPEECH-SAT:
UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING <https://arxiv.org/abs/2110.05752>`__ by
Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li,
Xiangzhan Yu.
78. :doc:`Vision Transformer (ViT) <model_doc/vit>` (from Google AI) released with the paper `An Image is Worth 16x16
Words: Transformers for Image Recognition at Scale <https://arxiv.org/abs/2010.11929>`__ by Alexey Dosovitskiy,
Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias
Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
79. :doc:`VisualBERT <model_doc/visual_bert>` (from UCLA NLP) released with the paper `VisualBERT: A Simple and
Performant Baseline for Vision and Language <https://arxiv.org/pdf/1908.03557>`__ by Liunian Harold Li, Mark
Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
80. :doc:`Wav2Vec2 <model_doc/wav2vec2>` (from Facebook AI) released with the paper `wav2vec 2.0: A Framework for
Self-Supervised Learning of Speech Representations <https://arxiv.org/abs/2006.11477>`__ by Alexei Baevski, Henry
Zhou, Abdelrahman Mohamed, Michael Auli.
81. :doc:`XLM <model_doc/xlm>` (from Facebook) released together with the paper `Cross-lingual Language Model
Pretraining <https://arxiv.org/abs/1901.07291>`__ by Guillaume Lample and Alexis Conneau.
82. :doc:`XLM-ProphetNet <model_doc/xlmprophetnet>` (from Microsoft Research) released with the paper `ProphetNet:
Predicting Future N-gram for Sequence-to-Sequence Pre-training <https://arxiv.org/abs/2001.04063>`__ by Yu Yan,
Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
83. :doc:`XLM-RoBERTa <model_doc/xlmroberta>` (from Facebook AI), released together with the paper `Unsupervised
Cross-lingual Representation Learning at Scale <https://arxiv.org/abs/1911.02116>`__ by Alexis Conneau*, Kartikay
Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke
Zettlemoyer and Veselin Stoyanov.
84. :doc:`XLNet <model_doc/xlnet>` (from Google/CMU) released with the paper `XLNet: Generalized Autoregressive
Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`__ by Zhilin Yang*, Zihang Dai*, Yiming
Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
85. :doc:`XLSR-Wav2Vec2 <model_doc/xlsr_wav2vec2>` (from Facebook AI) released with the paper `Unsupervised
Cross-Lingual Representation Learning For Speech Recognition <https://arxiv.org/abs/2006.13979>`__ by Alexis
Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli.
Supported frameworks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The table below represents the current support in the library for each of those models, whether they have a Python
tokenizer (called "slow"). A "fast" tokenizer backed by the 🤗 Tokenizers library, whether they have support in Jax (via
Flax), PyTorch, and/or TensorFlow.
..
This table is updated automatically from the auto modules with `make fix-copies`. Do not update manually!
.. rst-class:: center-aligned-table
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Model | Tokenizer slow | Tokenizer fast | PyTorch support | TensorFlow support | Flax Support |
+=============================+================+================+=================+====================+==============+
| ALBERT | ✅ | ✅ | ✅ | ✅ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| BART | ✅ | ✅ | ✅ | ✅ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| BEiT | ❌ | ❌ | ✅ | ❌ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| BERT | ✅ | ✅ | ✅ | ✅ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Bert Generation | ✅ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| BigBird | ✅ | ✅ | ✅ | ❌ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| BigBirdPegasus | ❌ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Blenderbot | ✅ | ✅ | ✅ | ✅ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| BlenderbotSmall | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| CamemBERT | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Canine | ✅ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| CLIP | ✅ | ✅ | ✅ | ❌ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| ConvBERT | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| CTRL | ✅ | ❌ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| DeBERTa | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| DeBERTa-v2 | ✅ | ❌ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| DeiT | ❌ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| DETR | ❌ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| DistilBERT | ✅ | ✅ | ✅ | ✅ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| DPR | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| ELECTRA | ✅ | ✅ | ✅ | ✅ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Encoder decoder | ❌ | ❌ | ✅ | ✅ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| FairSeq Machine-Translation | ✅ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| FlauBERT | ✅ | ❌ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| FNet | ✅ | ✅ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Funnel Transformer | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| GPT Neo | ❌ | ❌ | ✅ | ❌ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| GPT-J | ❌ | ❌ | ✅ | ❌ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Hubert | ❌ | ❌ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| I-BERT | ❌ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| ImageGPT | ❌ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| LayoutLM | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| LayoutLMv2 | ✅ | ✅ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| LED | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Longformer | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| LUKE | ✅ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| LXMERT | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| M2M100 | ✅ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Marian | ✅ | ❌ | ✅ | ✅ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| mBART | ✅ | ✅ | ✅ | ✅ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| MegatronBert | ❌ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| MobileBERT | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| MPNet | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| mT5 | ✅ | ✅ | ✅ | ✅ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| OpenAI GPT | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| OpenAI GPT-2 | ✅ | ✅ | ✅ | ✅ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Pegasus | ✅ | ✅ | ✅ | ✅ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Perceiver | ✅ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| ProphetNet | ✅ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| QDQBert | ❌ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| RAG | ✅ | ❌ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Reformer | ✅ | ✅ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| RemBERT | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| RetriBERT | ✅ | ✅ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| RoBERTa | ✅ | ✅ | ✅ | ✅ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| RoFormer | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| SegFormer | ❌ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| SEW | ❌ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| SEW-D | ❌ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Speech Encoder decoder | ❌ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Speech2Text | ✅ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Speech2Text2 | ✅ | ❌ | ❌ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Splinter | ✅ | ✅ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| SqueezeBERT | ✅ | ✅ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| T5 | ✅ | ✅ | ✅ | ✅ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| TAPAS | ✅ | ❌ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Transformer-XL | ✅ | ❌ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| TrOCR | ❌ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| UniSpeech | ❌ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| UniSpeechSat | ❌ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Vision Encoder decoder | ❌ | ❌ | ✅ | ❌ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| VisionTextDualEncoder | ❌ | ❌ | ✅ | ❌ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| VisualBert | ❌ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| ViT | ❌ | ❌ | ✅ | ✅ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Wav2Vec2 | ✅ | ❌ | ✅ | ✅ | ✅ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| XLM | ✅ | ❌ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| XLM-RoBERTa | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| XLMProphetNet | ✅ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| XLNet | ✅ | ✅ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
.. toctree::
:maxdepth: 2
:caption: Get started
quicktour
installation
philosophy
glossary
.. toctree::
:maxdepth: 2
:caption: Using 🤗 Transformers
task_summary
model_summary
preprocessing
training
model_sharing
tokenizer_summary
multilingual
.. toctree::
:maxdepth: 2
:caption: Advanced guides
pretrained_models
examples
troubleshooting
custom_datasets
notebooks
sagemaker
community
converting_tensorflow_models
migration
contributing
add_new_model
add_new_pipeline
fast_tokenizers
performance
parallelism
testing
debugging
serialization
pr_checks
.. toctree::
:maxdepth: 2
:caption: Research
bertology
perplexity
benchmarks
.. toctree::
:maxdepth: 2
:caption: Main Classes
main_classes/callback
main_classes/configuration
main_classes/data_collator
main_classes/keras_callbacks
main_classes/logging
main_classes/model
main_classes/optimizer_schedules
main_classes/output
main_classes/pipelines
main_classes/processors
main_classes/tokenizer
main_classes/trainer
main_classes/deepspeed
main_classes/feature_extractor
.. toctree::
:maxdepth: 2
:caption: Models
model_doc/albert
model_doc/auto
model_doc/bart
model_doc/barthez
model_doc/bartpho
model_doc/beit
model_doc/bert
model_doc/bertweet
model_doc/bertgeneration
model_doc/bert_japanese
model_doc/bigbird
model_doc/bigbird_pegasus
model_doc/blenderbot
model_doc/blenderbot_small
model_doc/bort
model_doc/byt5
model_doc/camembert
model_doc/canine
model_doc/clip
model_doc/convbert
model_doc/cpm
model_doc/ctrl
model_doc/deberta
model_doc/deberta_v2
model_doc/deit
model_doc/detr
model_doc/dialogpt
model_doc/distilbert
model_doc/dpr
model_doc/electra
model_doc/encoderdecoder
model_doc/flaubert
model_doc/fnet
model_doc/fsmt
model_doc/funnel
model_doc/herbert
model_doc/ibert
model_doc/imagegpt
model_doc/layoutlm
model_doc/layoutlmv2
model_doc/layoutxlm
model_doc/led
model_doc/longformer
model_doc/luke
model_doc/lxmert
model_doc/marian
model_doc/m2m_100
model_doc/mbart
model_doc/megatron_bert
model_doc/megatron_gpt2
model_doc/mobilebert
model_doc/mpnet
model_doc/mt5
model_doc/gpt
model_doc/gpt2
model_doc/gptj
model_doc/gpt_neo
model_doc/hubert
model_doc/pegasus
model_doc/perceiver
model_doc/phobert
model_doc/prophetnet
model_doc/qdqbert
model_doc/rag
model_doc/reformer
model_doc/rembert
model_doc/retribert
model_doc/roberta
model_doc/roformer
model_doc/segformer
model_doc/sew
model_doc/sew_d
model_doc/speechencoderdecoder
model_doc/speech_to_text
model_doc/speech_to_text_2
model_doc/splinter
model_doc/squeezebert
model_doc/t5
model_doc/t5v1.1
model_doc/tapas
model_doc/transformerxl
model_doc/trocr
model_doc/unispeech
model_doc/unispeech_sat
model_doc/visionencoderdecoder
model_doc/vision_text_dual_encoder
model_doc/vit
model_doc/visual_bert
model_doc/wav2vec2
model_doc/xlm
model_doc/xlmprophetnet
model_doc/xlmroberta
model_doc/xlnet
model_doc/xlsr_wav2vec2
.. toctree::
:maxdepth: 2
:caption: Internal Helpers
internal/modeling_utils
internal/pipelines_utils
internal/tokenization_utils
internal/trainer_utils
internal/generation_utils
internal/file_utils

View File

@@ -0,0 +1,234 @@
..
Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
Perceiver
-----------------------------------------------------------------------------------------------------------------------
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The Perceiver IO model was proposed in `Perceiver IO: A General Architecture for Structured Inputs & Outputs
<https://arxiv.org/abs/2107.14795>`__ by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch,
Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M.
Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira.
Perceiver IO is a generalization of `Perceiver <https://arxiv.org/abs/2103.03206>`__ to handle arbitrary outputs in
addition to arbitrary inputs. The original Perceiver only produced a single classification label. In addition to
classification labels, Perceiver IO can produce (for example) language, optical flow, and multimodal videos with audio.
This is done using the same building blocks as the original Perceiver. The computational complexity of Perceiver IO is
linear in the input and output size and the bulk of the processing occurs in the latent space, allowing us to process
inputs and outputs that are much larger than can be handled by standard Transformers. This means, for example,
Perceiver IO can do BERT-style masked language modeling directly using bytes instead of tokenized inputs.
The abstract from the paper is the following:
*The recently-proposed Perceiver model obtains good results on several domains (images, audio, multimodal, point
clouds) while scaling linearly in compute and memory with the input size. While the Perceiver supports many kinds of
inputs, it can only produce very simple outputs such as class scores. Perceiver IO overcomes this limitation without
sacrificing the original's appealing properties by learning to flexibly query the model's latent space to produce
outputs of arbitrary size and semantics. Perceiver IO still decouples model depth from data size and still scales
linearly with data size, but now with respect to both input and output sizes. The full Perceiver IO model achieves
strong results on tasks with highly structured output spaces, such as natural language and visual understanding,
StarCraft II, and multi-task and multi-modal domains. As highlights, Perceiver IO matches a Transformer-based BERT
baseline on the GLUE language benchmark without the need for input tokenization and achieves state-of-the-art
performance on Sintel optical flow estimation.*
Here's a TLDR explaining how Perceiver works:
The main problem with the self-attention mechanism of the Transformer is that the time and memory requirements scale
quadratically with the sequence length. Hence, models like BERT and RoBERTa are limited to a max sequence length of 512
tokens. Perceiver aims to solve this issue by, instead of performing self-attention on the inputs, perform it on a set
of latent variables, and only use the inputs for cross-attention. In this way, the time and memory requirements don't
depend on the length of the inputs anymore, as one uses a fixed amount of latent variables, like 256 or 512. These are
randomly initialized, after which they are trained end-to-end using backpropagation.
Internally, :class:`~transformers.PerceiverModel` will create the latents, which is a tensor of shape
:obj:`(batch_size, num_latents, d_latents)`. One must provide :obj:`inputs` (which could be text, images, audio, you
name it!) to the model, which it will use to perform cross-attention with the latents. The output of the Perceiver
encoder is a tensor of the same shape. One can then, similar to BERT, convert the last hidden states of the latents to
classification logits by averaging along the sequence dimension, and placing a linear layer on top of that to project
the :obj:`d_latents` to :obj:`num_labels`.
This was the idea of the original Perceiver paper. However, it could only output classification logits. In a follow-up
work, PerceiverIO, they generalized it to let the model also produce outputs of arbitrary size. How, you might ask? The
idea is actually relatively simple: one defines outputs of an arbitrary size, and then applies cross-attention with the
last hidden states of the latents, using the outputs as queries, and the latents as keys and values.
So let's say one wants to perform masked language modeling (BERT-style) with the Perceiver. As the Perceiver's input
length will not have an impact on the computation time of the self-attention layers, one can provide raw bytes,
providing :obj:`inputs` of length 2048 to the model. If one now masks out certain of these 2048 tokens, one can define
the :obj:`outputs` as being of shape: :obj:`(batch_size, 2048, 768)`. Next, one performs cross-attention with the final
hidden states of the latents to update the :obj:`outputs` tensor. After cross-attention, one still has a tensor of
shape :obj:`(batch_size, 2048, 768)`. One can then place a regular language modeling head on top, to project the last
dimension to the vocabulary size of the model, i.e. creating logits of shape :obj:`(batch_size, 2048, 262)` (as
Perceiver uses a vocabulary size of 262 byte IDs).
This model was contributed by `<nielsr> <https://huggingface.co/nielsr>`__. The original code can be found `here
<https://github.com/deepmind/deepmind-research/tree/master/perceiver>`__.
Perceiver specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverModelOutput
:members:
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverDecoderOutput
:members:
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverMaskedLMOutput
:members:
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverClassifierOutput
:members:
PerceiverConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverConfig
:members:
PerceiverTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverTokenizer
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
create_token_type_ids_from_sequences, save_vocabulary
PerceiverFeatureExtractor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverFeatureExtractor
:members:
PerceiverTextPreprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverTextPreprocessor
:members:
PerceiverImagePreprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor
:members:
PerceiverOneHotPreprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverOneHotPreprocessor
:members:
PerceiverAudioPreprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverAudioPreprocessor
:members:
PerceiverMultimodalPreprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalPreprocessor
:members:
PerceiverProjectionPostprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverProjectionPostprocessor
:members:
PerceiverAudioPostprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverAudioPostprocessor
:members:
PerceiverClassificationPostprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverClassificationPostprocessor
:members:
PerceiverMultimodalPostprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalPostprocessor
:members:
PerceiverModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverModel
:members: forward
PerceiverForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverForMaskedLM
:members: forward
PerceiverForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverForSequenceClassification
:members: forward
PerceiverForImageClassificationLearned
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverForImageClassificationLearned
:members: forward
PerceiverForImageClassificationFourier
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverForImageClassificationFourier
:members: forward
PerceiverForImageClassificationConvProcessing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverForImageClassificationConvProcessing
:members: forward
PerceiverForOpticalFlow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverForOpticalFlow
:members: forward
PerceiverForMultimodalAutoencoding
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PerceiverForMultimodalAutoencoding
:members: forward

View File

@@ -253,6 +253,7 @@ _import_structure = {
"models.mt5": ["MT5Config"],
"models.openai": ["OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP", "OpenAIGPTConfig", "OpenAIGPTTokenizer"],
"models.pegasus": ["PEGASUS_PRETRAINED_CONFIG_ARCHIVE_MAP", "PegasusConfig", "PegasusTokenizer"],
"models.perceiver": ["PERCEIVER_PRETRAINED_CONFIG_ARCHIVE_MAP", "PerceiverConfig", "PerceiverTokenizer"],
"models.phobert": ["PhobertTokenizer"],
"models.prophetnet": ["PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP", "ProphetNetConfig", "ProphetNetTokenizer"],
"models.qdqbert": ["QDQBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "QDQBertConfig"],
@@ -502,6 +503,7 @@ if is_vision_available():
_import_structure["models.layoutlmv2"].append("LayoutLMv2FeatureExtractor")
_import_structure["models.layoutlmv2"].append("LayoutLMv2Processor")
_import_structure["models.layoutxlm"].append("LayoutXLMProcessor")
_import_structure["models.perceiver"].append("PerceiverFeatureExtractor")
_import_structure["models.segformer"].append("SegformerFeatureExtractor")
_import_structure["models.vit"].append("ViTFeatureExtractor")
else:
@@ -1144,6 +1146,21 @@ if is_torch_available():
_import_structure["models.pegasus"].extend(
["PegasusForCausalLM", "PegasusForConditionalGeneration", "PegasusModel", "PegasusPreTrainedModel"]
)
_import_structure["models.perceiver"].extend(
[
"PERCEIVER_PRETRAINED_MODEL_ARCHIVE_LIST",
"PerceiverForImageClassificationConvProcessing",
"PerceiverForImageClassificationFourier",
"PerceiverForImageClassificationLearned",
"PerceiverForMaskedLM",
"PerceiverForMultimodalAutoencoding",
"PerceiverForOpticalFlow",
"PerceiverForSequenceClassification",
"PerceiverLayer",
"PerceiverModel",
"PerceiverPreTrainedModel",
]
)
_import_structure["models.prophetnet"].extend(
[
"PROPHETNET_PRETRAINED_MODEL_ARCHIVE_LIST",
@@ -2263,6 +2280,7 @@ if TYPE_CHECKING:
from .models.mt5 import MT5Config
from .models.openai import OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP, OpenAIGPTConfig, OpenAIGPTTokenizer
from .models.pegasus import PEGASUS_PRETRAINED_CONFIG_ARCHIVE_MAP, PegasusConfig, PegasusTokenizer
from .models.perceiver import PERCEIVER_PRETRAINED_CONFIG_ARCHIVE_MAP, PerceiverConfig, PerceiverTokenizer
from .models.phobert import PhobertTokenizer
from .models.prophetnet import PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP, ProphetNetConfig, ProphetNetTokenizer
from .models.qdqbert import QDQBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, QDQBertConfig
@@ -2470,6 +2488,7 @@ if TYPE_CHECKING:
from .models.imagegpt import ImageGPTFeatureExtractor
from .models.layoutlmv2 import LayoutLMv2FeatureExtractor, LayoutLMv2Processor
from .models.layoutxlm import LayoutXLMProcessor
from .models.perceiver import PerceiverFeatureExtractor
from .models.segformer import SegformerFeatureExtractor
from .models.vit import ViTFeatureExtractor
else:
@@ -3006,6 +3025,19 @@ if TYPE_CHECKING:
PegasusModel,
PegasusPreTrainedModel,
)
from .models.perceiver import (
PERCEIVER_PRETRAINED_MODEL_ARCHIVE_LIST,
PerceiverForImageClassificationConvProcessing,
PerceiverForImageClassificationFourier,
PerceiverForImageClassificationLearned,
PerceiverForMaskedLM,
PerceiverForMultimodalAutoencoding,
PerceiverForOpticalFlow,
PerceiverForSequenceClassification,
PerceiverLayer,
PerceiverModel,
PerceiverPreTrainedModel,
)
from .models.prophetnet import (
PROPHETNET_PRETRAINED_MODEL_ARCHIVE_LIST,
ProphetNetDecoder,

View File

@@ -78,6 +78,7 @@ from . import (
mt5,
openai,
pegasus,
perceiver,
phobert,
prophetnet,
qdqbert,

View File

@@ -37,6 +37,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
("fnet", "FNetConfig"),
("segformer", "SegformerConfig"),
("vision-text-dual-encoder", "VisionTextDualEncoderConfig"),
("perceiver", "PerceiverConfig"),
("gptj", "GPTJConfig"),
("layoutlmv2", "LayoutLMv2Config"),
("beit", "BeitConfig"),
@@ -119,6 +120,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
("fnet", "FNET_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("pegasus", "PEGASUS_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("segformer", "SEGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("perceiver", "PERCEIVER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("gptj", "GPTJ_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("layoutlmv2", "LAYOUTLMV2_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("beit", "BEIT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
@@ -194,6 +196,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
("fnet", "FNet"),
("segformer", "SegFormer"),
("vision-text-dual-encoder", "VisionTextDualEncoder"),
("perceiver", "Perceiver"),
("gptj", "GPT-J"),
("beit", "BEiT"),
("rembert", "RemBERT"),

View File

@@ -33,6 +33,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
("fnet", "FNetModel"),
("segformer", "SegformerModel"),
("vision-text-dual-encoder", "VisionTextDualEncoderModel"),
("perceiver", "PerceiverModel"),
("gptj", "GPTJModel"),
("layoutlmv2", "LayoutLMv2Model"),
("beit", "BeitModel"),
@@ -247,6 +248,14 @@ MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING_NAMES = OrderedDict(
("beit", "BeitForImageClassification"),
("segformer", "SegformerForImageClassification"),
("imagegpt", "ImageGPTForImageClassification"),
(
"perceiver",
(
"PerceiverForImageClassificationLearned",
"PerceiverForImageClassificationFourier",
"PerceiverForImageClassificationConvProcessing",
),
),
]
)
@@ -266,6 +275,7 @@ MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES = OrderedDict(
MODEL_FOR_MASKED_LM_MAPPING_NAMES = OrderedDict(
[
# Model for Masked LM mapping
("perceiver", "PerceiverForMaskedLM"),
("qdqbert", "QDQBertForMaskedLM"),
("fnet", "FNetForMaskedLM"),
("rembert", "RemBertForMaskedLM"),
@@ -337,6 +347,7 @@ MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING_NAMES = OrderedDict(
MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES = OrderedDict(
[
# Model for Sequence Classification mapping
("perceiver", "PerceiverForSequenceClassification"),
("qdqbert", "QDQBertForSequenceClassification"),
("fnet", "FNetForSequenceClassification"),
("gptj", "GPTJForSequenceClassification"),

View File

@@ -0,0 +1,72 @@
# flake8: noqa
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.
# Copyright 2021 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
from ...file_utils import _LazyModule, is_tokenizers_available, is_torch_available, is_vision_available
_import_structure = {
"configuration_perceiver": ["PERCEIVER_PRETRAINED_CONFIG_ARCHIVE_MAP", "PerceiverConfig"],
"tokenization_perceiver": ["PerceiverTokenizer"],
}
if is_vision_available():
_import_structure["feature_extraction_perceiver"] = ["PerceiverFeatureExtractor"]
if is_torch_available():
_import_structure["modeling_perceiver"] = [
"PERCEIVER_PRETRAINED_MODEL_ARCHIVE_LIST",
"PerceiverForImageClassificationConvProcessing",
"PerceiverForImageClassificationFourier",
"PerceiverForImageClassificationLearned",
"PerceiverForMaskedLM",
"PerceiverForMultimodalAutoencoding",
"PerceiverForOpticalFlow",
"PerceiverForSequenceClassification",
"PerceiverLayer",
"PerceiverModel",
"PerceiverPreTrainedModel",
]
if TYPE_CHECKING:
from .configuration_perceiver import PERCEIVER_PRETRAINED_CONFIG_ARCHIVE_MAP, PerceiverConfig
from .tokenization_perceiver import PerceiverTokenizer
if is_vision_available():
from .feature_extraction_perceiver import PerceiverFeatureExtractor
if is_torch_available():
from .modeling_perceiver import (
PERCEIVER_PRETRAINED_MODEL_ARCHIVE_LIST,
PerceiverForImageClassificationConvProcessing,
PerceiverForImageClassificationFourier,
PerceiverForImageClassificationLearned,
PerceiverForMaskedLM,
PerceiverForMultimodalAutoencoding,
PerceiverForOpticalFlow,
PerceiverForSequenceClassification,
PerceiverLayer,
PerceiverModel,
PerceiverPreTrainedModel,
)
else:
import sys
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure)

View File

@@ -0,0 +1,171 @@
# coding=utf-8
# Copyright Deepmind and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Perceiver model configuration """
from ...configuration_utils import PretrainedConfig
from ...utils import logging
logger = logging.get_logger(__name__)
PERCEIVER_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"deepmind/language-perceiver": "https://huggingface.co/deepmind/language-perceiver/resolve/main/config.json",
# See all Perceiver models at https://huggingface.co/models?filter=perceiver
}
class PerceiverConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.PerceiverModel`. It is used
to instantiate an Perceiver model according to the specified arguments, defining the model architecture.
Instantiating a configuration with the defaults will yield a similar configuration to that of the Perceiver
`deepmind/language-perceiver <https://huggingface.co/deepmind/language-perceiver>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
Args:
num_latents (:obj:`int`, `optional`, defaults to 256):
The number of latents.
d_latents (:obj:`int`, `optional`, defaults to 1280):
Dimension of the latent embeddings.
d_model (:obj:`int`, `optional`, defaults to 768):
Dimension of the inputs.
num_blocks (:obj:`int`, `optional`, defaults to 1):
Number of blocks in the Transformer encoder.
num_self_attends_per_block (:obj:`int`, `optional`, defaults to 26):
The number of self-attention layers per block.
num_self_attention_heads (:obj:`int`, `optional`, defaults to 8):
Number of attention heads for each self-attention layer in the Transformer encoder.
num_cross_attention_heads (:obj:`int`, `optional`, defaults to 8):
Number of attention heads for each cross-attention layer in the Transformer encoder.
qk_channels (:obj:`int`, `optional`):
Dimension to project the queries + keys before applying attention in the cross-attention and self-attention
layers of the encoder. Will default to preserving the dimension of the queries if not specified.
v_channels (:obj:`int`, `optional`):
Dimension to project the values before applying attention in the cross-attention and self-attention layers
of the encoder. Will default to preserving the dimension of the queries if not specified.
cross_attention_shape_for_attention (:obj:`str`, `optional`, defaults to :obj:`'kv'`):
Dimension to use when downsampling the queries and keys in the cross-attention layer of the encoder.
self_attention_widening_factor (:obj:`int`, `optional`, defaults to 1):
Dimension of the feed-forward layer in the cross-attention layer of the Transformer encoder.
cross_attention_widening_factor (:obj:`int`, `optional`, defaults to 1):
Dimension of the feed-forward layer in the self-attention layers of the Transformer encoder.
hidden_act (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string,
:obj:`"gelu"`, :obj:`"relu"`, :obj:`"selu"` and :obj:`"gelu_new"` are supported.
attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention probabilities.
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
The epsilon used by the layer normalization layers.
use_query_residual (:obj:`float`, `optional`, defaults to :obj:`True`):
Whether to add a query residual in the cross-attention layer of the encoder.
vocab_size (:obj:`int`, `optional`, defaults to 262):
Vocabulary size for the masked language modeling model.
max_position_embeddings (:obj:`int`, `optional`, defaults to 2048):
The maximum sequence length that the masked language modeling model might ever be used with. Typically set
this to something large just in case (e.g., 512 or 1024 or 2048).
image_size (:obj:`int`, `optional`, defaults to 56):
Size of the images after preprocessing, for :class:`~transformers.PerceiverForImageClassificationLearned`.
train_size (:obj:`List[int]`, `optional`, defaults to [368, 496]):
Training size of the images for the optical flow model.
num_frames (:obj:`int`, `optional`, defaults to 16):
Number of video frames used for the multimodal autoencoding model.
audio_samples_per_frame (:obj:`int`, `optional`, defaults to 1920):
Number of audio samples per frame for the multimodal autoencoding model.
samples_per_patch (:obj:`int`, `optional`, defaults to 16):
Number of audio samples per patch when preprocessing the audio for the multimodal autoencoding model.
output_shape (:obj:`List[int]`, `optional`, defaults to :obj:`[1, 16, 224, 224]`):
Shape of the output for the multimodal autoencoding model.
Example::
>>> from transformers import PerceiverModel, PerceiverConfig
>>> # Initializing a Perceiver deepmind/language-perceiver style configuration
>>> configuration = PerceiverConfig()
>>> # Initializing a model from the deepmind/language-perceiver style configuration
>>> model = PerceiverModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
"""
model_type = "perceiver"
def __init__(
self,
num_latents=256,
d_latents=1280,
d_model=768,
num_blocks=1,
num_self_attends_per_block=26,
num_self_attention_heads=8,
num_cross_attention_heads=8,
qk_channels=None,
v_channels=None,
cross_attention_shape_for_attention="kv",
self_attention_widening_factor=1,
cross_attention_widening_factor=1,
hidden_act="gelu",
attention_probs_dropout_prob=0.1,
position_embedding_init_scale=0.02,
initializer_range=0.02,
layer_norm_eps=1e-12,
is_encoder_decoder=False,
use_query_residual=True,
vocab_size=262,
max_position_embeddings=2048,
image_size=56,
train_size=[368, 496],
num_frames=16,
audio_samples_per_frame=1920,
samples_per_patch=16,
output_shape=[1, 16, 224, 224],
**kwargs
):
super().__init__(**kwargs)
self.num_latents = num_latents
self.d_latents = d_latents
self.d_model = d_model
self.num_blocks = num_blocks
self.num_self_attends_per_block = num_self_attends_per_block
self.num_self_attention_heads = num_self_attention_heads
self.num_cross_attention_heads = num_cross_attention_heads
self.qk_channels = qk_channels
self.v_channels = v_channels
self.cross_attention_shape_for_attention = cross_attention_shape_for_attention
self.self_attention_widening_factor = self_attention_widening_factor
self.cross_attention_widening_factor = cross_attention_widening_factor
self.hidden_act = hidden_act
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.initializer_range = initializer_range
self.layer_norm_eps = layer_norm_eps
self.use_query_residual = use_query_residual
# masked language modeling attributes
self.vocab_size = vocab_size
self.max_position_embeddings = max_position_embeddings
# image classification attributes
self.image_size = image_size
# flow attributes
self.train_size = train_size
# multimodal autoencoding attributes
self.num_frames = num_frames
self.audio_samples_per_frame = audio_samples_per_frame
self.samples_per_patch = samples_per_patch
self.output_shape = output_shape

View File

@@ -0,0 +1,468 @@
# coding=utf-8
# Copyright 2021 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Convert Perceiver checkpoints originally implemented in Haiku."""
import argparse
import json
import pickle
from pathlib import Path
import numpy as np
import torch
from PIL import Image
import haiku as hk
import requests
from huggingface_hub import cached_download, hf_hub_url
from transformers import (
PerceiverConfig,
PerceiverFeatureExtractor,
PerceiverForImageClassificationConvProcessing,
PerceiverForImageClassificationFourier,
PerceiverForImageClassificationLearned,
PerceiverForMaskedLM,
PerceiverForMultimodalAutoencoding,
PerceiverForOpticalFlow,
PerceiverTokenizer,
)
from transformers.utils import logging
logging.set_verbosity_info()
logger = logging.get_logger(__name__)
def prepare_img():
# We will verify our results on an image of a dog
url = "https://storage.googleapis.com/perceiver_io/dalmation.jpg"
im = Image.open(requests.get(url, stream=True).raw)
return im
def rename_keys(state_dict, architecture):
for name in list(state_dict):
param = state_dict.pop(name)
# PREPROCESSORS
# rename text preprocessor embeddings (for MLM model)
name = name.replace("embed/embeddings", "input_preprocessor.embeddings.weight")
if name.startswith("trainable_position_encoding/pos_embs"):
name = name.replace(
"trainable_position_encoding/pos_embs", "input_preprocessor.position_embeddings.weight"
)
# rename image preprocessor embeddings (for image classification model with learned position embeddings)
name = name.replace("image_preprocessor/~/conv2_d/w", "input_preprocessor.convnet_1x1.weight")
name = name.replace("image_preprocessor/~/conv2_d/b", "input_preprocessor.convnet_1x1.bias")
name = name.replace(
"image_preprocessor/~_build_network_inputs/trainable_position_encoding/pos_embs",
"input_preprocessor.position_embeddings.position_embeddings",
)
name = name.replace(
"image_preprocessor/~_build_network_inputs/position_encoding_projector/linear/w",
"input_preprocessor.positions_projection.weight",
)
name = name.replace(
"image_preprocessor/~_build_network_inputs/position_encoding_projector/linear/b",
"input_preprocessor.positions_projection.bias",
)
# rename image preprocessor embeddings (for image classification model with conv processing)
if "counter" in name or "hidden" in name:
continue
name = name.replace(
"image_preprocessor/~/conv2_d_downsample/~/conv/w", "input_preprocessor.convnet.conv.weight"
)
name = name.replace(
"image_preprocessor/~/conv2_d_downsample/~/batchnorm/offset", "input_preprocessor.convnet.batchnorm.bias"
)
name = name.replace(
"image_preprocessor/~/conv2_d_downsample/~/batchnorm/scale", "input_preprocessor.convnet.batchnorm.weight"
)
name = name.replace(
"image_preprocessor/~/conv2_d_downsample/~/batchnorm/~/mean_ema/average",
"input_preprocessor.convnet.batchnorm.running_mean",
)
name = name.replace(
"image_preprocessor/~/conv2_d_downsample/~/batchnorm/~/var_ema/average",
"input_preprocessor.convnet.batchnorm.running_var",
)
# rename image preprocessor embeddings (for optical flow model)
name = name.replace("image_preprocessor/patches_linear/b", "input_preprocessor.conv_after_patches.bias")
name = name.replace("image_preprocessor/patches_linear/w", "input_preprocessor.conv_after_patches.weight")
# rename multimodal preprocessor embeddings
name = name.replace("multimodal_preprocessor/audio_mask_token/pos_embs", "input_preprocessor.mask.audio")
name = name.replace("multimodal_preprocessor/audio_padding/pos_embs", "input_preprocessor.padding.audio")
name = name.replace("multimodal_preprocessor/image_mask_token/pos_embs", "input_preprocessor.mask.image")
name = name.replace("multimodal_preprocessor/image_padding/pos_embs", "input_preprocessor.padding.image")
name = name.replace("multimodal_preprocessor/label_mask_token/pos_embs", "input_preprocessor.mask.label")
name = name.replace("multimodal_preprocessor/label_padding/pos_embs", "input_preprocessor.padding.label")
# DECODERS
# rename prefix of decoders
# multimodal autoencoding model
name = name.replace(
"multimodal_decoder/~/basic_decoder/cross_attention/", "decoder.decoder.decoding_cross_attention."
)
name = name.replace("multimodal_decoder/~decoder_query/audio_padding/pos_embs", "decoder.padding.audio")
name = name.replace("multimodal_decoder/~decoder_query/image_padding/pos_embs", "decoder.padding.image")
name = name.replace("multimodal_decoder/~decoder_query/label_padding/pos_embs", "decoder.padding.label")
name = name.replace("multimodal_decoder/~/basic_decoder/output/b", "decoder.decoder.final_layer.bias")
name = name.replace("multimodal_decoder/~/basic_decoder/output/w", "decoder.decoder.final_layer.weight")
if architecture == "multimodal_autoencoding":
name = name.replace(
"classification_decoder/~/basic_decoder/~/trainable_position_encoding/pos_embs",
"decoder.modalities.label.decoder.output_position_encodings.position_embeddings",
)
# flow model
name = name.replace(
"flow_decoder/~/basic_decoder/cross_attention/", "decoder.decoder.decoding_cross_attention."
)
name = name.replace("flow_decoder/~/basic_decoder/output/w", "decoder.decoder.final_layer.weight")
name = name.replace("flow_decoder/~/basic_decoder/output/b", "decoder.decoder.final_layer.bias")
# image models
name = name.replace(
"classification_decoder/~/basic_decoder/~/trainable_position_encoding/pos_embs",
"decoder.decoder.output_position_encodings.position_embeddings",
)
name = name.replace(
"basic_decoder/~/trainable_position_encoding/pos_embs",
"decoder.output_position_encodings.position_embeddings",
)
name = name.replace(
"classification_decoder/~/basic_decoder/cross_attention/", "decoder.decoder.decoding_cross_attention."
)
name = name.replace("classification_decoder/~/basic_decoder/output/b", "decoder.decoder.final_layer.bias")
name = name.replace("classification_decoder/~/basic_decoder/output/w", "decoder.decoder.final_layer.weight")
name = name = name.replace("classification_decoder/~/basic_decoder/~/", "decoder.decoder.")
name = name.replace("basic_decoder/cross_attention/", "decoder.decoding_cross_attention.")
name = name.replace("basic_decoder/~/", "decoder.")
# POSTPROCESSORS
name = name.replace(
"projection_postprocessor/linear/b", "output_postprocessor.modalities.image.classifier.bias"
)
name = name.replace(
"projection_postprocessor/linear/w", "output_postprocessor.modalities.image.classifier.weight"
)
name = name.replace(
"classification_postprocessor/linear/b", "output_postprocessor.modalities.label.classifier.bias"
)
name = name.replace(
"classification_postprocessor/linear/w", "output_postprocessor.modalities.label.classifier.weight"
)
name = name.replace("audio_postprocessor/linear/b", "output_postprocessor.modalities.audio.classifier.bias")
name = name.replace("audio_postprocessor/linear/w", "output_postprocessor.modalities.audio.classifier.weight")
# PERCEIVER MODEL
# rename latent embeddings
name = name.replace("perceiver_encoder/~/trainable_position_encoding/pos_embs", "embeddings.latents")
# rename latent embeddings (for multimodal model)
name = name.replace("encoder/~/trainable_position_encoding/pos_embs", "embeddings.latents")
# rename prefixes
if name.startswith("perceiver_encoder/~/"):
if "self_attention" in name:
suffix = "self_attends."
else:
suffix = ""
name = name.replace("perceiver_encoder/~/", "encoder." + suffix)
if name.startswith("encoder/~/"):
if "self_attention" in name:
suffix = "self_attends."
else:
suffix = ""
name = name.replace("encoder/~/", "encoder." + suffix)
# rename layernorm parameters
if "offset" in name:
name = name.replace("offset", "bias")
if "scale" in name:
name = name.replace("scale", "weight")
# in HuggingFace, the layernorm in between attention + MLP is just called "layernorm"
# rename layernorm in between attention + MLP of cross-attention
if "cross_attention" in name and "layer_norm_2" in name:
name = name.replace("layer_norm_2", "layernorm")
# rename layernorm in between attention + MLP of self-attention
if "self_attention" in name and "layer_norm_1" in name:
name = name.replace("layer_norm_1", "layernorm")
# in HuggingFace, the layernorms for queries + keys are called "layernorm1" and "layernorm2"
if "cross_attention" in name and "layer_norm_1" in name:
name = name.replace("layer_norm_1", "attention.self.layernorm2")
if "cross_attention" in name and "layer_norm" in name:
name = name.replace("layer_norm", "attention.self.layernorm1")
if "self_attention" in name and "layer_norm" in name:
name = name.replace("layer_norm", "attention.self.layernorm1")
# rename special characters by dots
name = name.replace("-", ".")
name = name.replace("/", ".")
# rename keys, queries, values and output of attention layers
if ("cross_attention" in name or "self_attention" in name) and "mlp" not in name:
if "linear.b" in name:
name = name.replace("linear.b", "self.query.bias")
if "linear.w" in name:
name = name.replace("linear.w", "self.query.weight")
if "linear_1.b" in name:
name = name.replace("linear_1.b", "self.key.bias")
if "linear_1.w" in name:
name = name.replace("linear_1.w", "self.key.weight")
if "linear_2.b" in name:
name = name.replace("linear_2.b", "self.value.bias")
if "linear_2.w" in name:
name = name.replace("linear_2.w", "self.value.weight")
if "linear_3.b" in name:
name = name.replace("linear_3.b", "output.dense.bias")
if "linear_3.w" in name:
name = name.replace("linear_3.w", "output.dense.weight")
if "self_attention_" in name:
name = name.replace("self_attention_", "")
if "self_attention" in name:
name = name.replace("self_attention", "0")
# rename dense layers of 2-layer MLP
if "mlp" in name:
if "linear.b" in name:
name = name.replace("linear.b", "dense1.bias")
if "linear.w" in name:
name = name.replace("linear.w", "dense1.weight")
if "linear_1.b" in name:
name = name.replace("linear_1.b", "dense2.bias")
if "linear_1.w" in name:
name = name.replace("linear_1.w", "dense2.weight")
# finally, TRANSPOSE if kernel and not embedding layer, and set value
if name[-6:] == "weight" and "embeddings" not in name:
param = np.transpose(param)
# if batchnorm, we need to squeeze it
if "batchnorm" in name:
param = np.squeeze(param)
if "embedding_decoder" not in name:
state_dict["perceiver." + name] = torch.from_numpy(param)
else:
state_dict[name] = torch.from_numpy(param)
@torch.no_grad()
def convert_perceiver_checkpoint(pickle_file, pytorch_dump_folder_path, architecture="MLM"):
"""
Copy/paste/tweak model's weights to our Perceiver structure.
"""
# load parameters as FlatMapping data structure
with open(pickle_file, "rb") as f:
checkpoint = pickle.loads(f.read())
state = None
if isinstance(checkpoint, dict) and architecture in [
"image_classification",
"image_classification_fourier",
"image_classification_conv",
]:
# the image classification_conv checkpoint also has batchnorm states (running_mean and running_var)
params = checkpoint["params"]
state = checkpoint["state"]
else:
params = checkpoint
# turn into initial state dict
state_dict = dict()
for scope_name, parameters in hk.data_structures.to_mutable_dict(params).items():
for param_name, param in parameters.items():
state_dict[scope_name + "/" + param_name] = param
if state is not None:
# add state variables
for scope_name, parameters in hk.data_structures.to_mutable_dict(state).items():
for param_name, param in parameters.items():
state_dict[scope_name + "/" + param_name] = param
# rename keys
rename_keys(state_dict, architecture=architecture)
# load HuggingFace model
config = PerceiverConfig()
subsampling = None
repo_id = "datasets/huggingface/label-files"
if architecture == "MLM":
config.qk_channels = 8 * 32
config.v_channels = 1280
model = PerceiverForMaskedLM(config)
elif "image_classification" in architecture:
config.num_latents = 512
config.d_latents = 1024
config.d_model = 512
config.num_blocks = 8
config.num_self_attends_per_block = 6
config.num_cross_attention_heads = 1
config.num_self_attention_heads = 8
config.qk_channels = None
config.v_channels = None
# set labels
config.num_labels = 1000
filename = "imagenet-1k-id2label.json"
id2label = json.load(open(cached_download(hf_hub_url(repo_id, filename)), "r"))
id2label = {int(k): v for k, v in id2label.items()}
config.id2label = id2label
config.label2id = {v: k for k, v in id2label.items()}
if architecture == "image_classification":
config.image_size = 224
model = PerceiverForImageClassificationLearned(config)
elif architecture == "image_classification_fourier":
config.d_model = 261
model = PerceiverForImageClassificationFourier(config)
elif architecture == "image_classification_conv":
config.d_model = 322
model = PerceiverForImageClassificationConvProcessing(config)
else:
raise ValueError(f"Architecture {architecture} not supported")
elif architecture == "optical_flow":
config.num_latents = 2048
config.d_latents = 512
config.d_model = 322
config.num_blocks = 1
config.num_self_attends_per_block = 24
config.num_self_attention_heads = 16
config.num_cross_attention_heads = 1
model = PerceiverForOpticalFlow(config)
elif architecture == "multimodal_autoencoding":
config.num_latents = 28 * 28 * 1
config.d_latents = 512
config.d_model = 704
config.num_blocks = 1
config.num_self_attends_per_block = 8
config.num_self_attention_heads = 8
config.num_cross_attention_heads = 1
config.num_labels = 700
# define dummy inputs + subsampling (as each forward pass is only on a chunk of image + audio data)
images = torch.randn((1, 16, 3, 224, 224))
audio = torch.randn((1, 30720, 1))
nchunks = 128
image_chunk_size = np.prod((16, 224, 224)) // nchunks
audio_chunk_size = audio.shape[1] // config.samples_per_patch // nchunks
# process the first chunk
chunk_idx = 0
subsampling = {
"image": torch.arange(image_chunk_size * chunk_idx, image_chunk_size * (chunk_idx + 1)),
"audio": torch.arange(audio_chunk_size * chunk_idx, audio_chunk_size * (chunk_idx + 1)),
"label": None,
}
model = PerceiverForMultimodalAutoencoding(config)
# set labels
filename = "kinetics700-id2label.json"
id2label = json.load(open(cached_download(hf_hub_url(repo_id, filename)), "r"))
id2label = {int(k): v for k, v in id2label.items()}
config.id2label = id2label
config.label2id = {v: k for k, v in id2label.items()}
else:
raise ValueError(f"Architecture {architecture} not supported")
model.eval()
# load weights
model.load_state_dict(state_dict)
# prepare dummy input
input_mask = None
if architecture == "MLM":
tokenizer = PerceiverTokenizer.from_pretrained("/Users/NielsRogge/Documents/Perceiver/Tokenizer files")
text = "This is an incomplete sentence where some words are missing."
encoding = tokenizer(text, padding="max_length", return_tensors="pt")
# mask " missing.". Note that the model performs much better if the masked chunk starts with a space.
encoding.input_ids[0, 51:60] = tokenizer.mask_token_id
inputs = encoding.input_ids
input_mask = encoding.attention_mask
elif architecture in ["image_classification", "image_classification_fourier", "image_classification_conv"]:
feature_extractor = PerceiverFeatureExtractor()
image = prepare_img()
encoding = feature_extractor(image, return_tensors="pt")
inputs = encoding.pixel_values
elif architecture == "optical_flow":
inputs = torch.randn(1, 2, 27, 368, 496)
elif architecture == "multimodal_autoencoding":
images = torch.randn((1, 16, 3, 224, 224))
audio = torch.randn((1, 30720, 1))
inputs = dict(image=images, audio=audio, label=torch.zeros((images.shape[0], 700)))
# forward pass
if architecture == "multimodal_autoencoding":
outputs = model(inputs=inputs, attention_mask=input_mask, subsampled_output_points=subsampling)
else:
outputs = model(inputs=inputs, attention_mask=input_mask)
logits = outputs.logits
# verify logits
if not isinstance(logits, dict):
print("Shape of logits:", logits.shape)
else:
for k, v in logits.items():
print(f"Shape of logits of modality {k}", v.shape)
if architecture == "MLM":
expected_slice = torch.tensor(
[[-11.8336, -11.6850, -11.8483], [-12.8149, -12.5863, -12.7904], [-12.8440, -12.6410, -12.8646]]
)
assert torch.allclose(logits[0, :3, :3], expected_slice)
masked_tokens_predictions = logits[0, 51:60].argmax(dim=-1).tolist()
expected_list = [38, 115, 111, 121, 121, 111, 116, 109, 52]
assert masked_tokens_predictions == expected_list
print("Greedy predictions:")
print(masked_tokens_predictions)
print()
print("Predicted string:")
print(tokenizer.decode(masked_tokens_predictions))
elif architecture in ["image_classification", "image_classification_fourier", "image_classification_conv"]:
print("Predicted class:", model.config.id2label[logits.argmax(-1).item()])
# Finally, save files
Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
print(f"Saving model to {pytorch_dump_folder_path}")
model.save_pretrained(pytorch_dump_folder_path)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
# Required parameters
parser.add_argument(
"--pickle_file",
type=str,
default=None,
required=True,
help="Path to local pickle file of a Perceiver checkpoint you'd like to convert.",
)
parser.add_argument(
"--pytorch_dump_folder_path",
default=None,
type=str,
required=True,
help="Path to the output PyTorch model directory, provided as a string.",
)
parser.add_argument(
"--architecture",
default="MLM",
type=str,
help="""
Architecture, provided as a string. One of 'MLM', 'image_classification', image_classification_fourier',
image_classification_fourier', 'optical_flow' or 'multimodal_autoencoding'.
""",
)
args = parser.parse_args()
convert_perceiver_checkpoint(args.pickle_file, args.pytorch_dump_folder_path, args.architecture)

View File

@@ -0,0 +1,189 @@
# coding=utf-8
# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Feature extractor class for Perceiver."""
from typing import Optional, Union
import numpy as np
from PIL import Image
from ...feature_extraction_utils import BatchFeature, FeatureExtractionMixin
from ...file_utils import TensorType
from ...image_utils import (
IMAGENET_DEFAULT_MEAN,
IMAGENET_DEFAULT_STD,
ImageFeatureExtractionMixin,
ImageInput,
is_torch_tensor,
)
from ...utils import logging
logger = logging.get_logger(__name__)
class PerceiverFeatureExtractor(FeatureExtractionMixin, ImageFeatureExtractionMixin):
r"""
Constructs a Perceiver feature extractor.
This feature extractor inherits from :class:`~transformers.ImageFeatureExtractionMixin` which contains most of the
main methods. Users should refer to this superclass for more information regarding those methods.
Args:
do_center_crop (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether to crop the input at the center. If the input size is smaller than :obj:`crop_size` along any edge,
the image is padded with 0's and then center cropped.
crop_size (:obj:`int`, `optional`, defaults to 256):
Desired output size when applying center-cropping. Only has an effect if :obj:`do_center_crop` is set to
:obj:`True`.
do_resize (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether to resize the input to a certain :obj:`size`.
size (:obj:`int` or :obj:`Tuple(int)`, `optional`, defaults to 224):
Resize the input to the given size. If a tuple is provided, it should be (width, height). If only an
integer is provided, then the input will be resized to (size, size). Only has an effect if :obj:`do_resize`
is set to :obj:`True`.
resample (:obj:`int`, `optional`, defaults to :obj:`PIL.Image.BICUBIC`):
An optional resampling filter. This can be one of :obj:`PIL.Image.NEAREST`, :obj:`PIL.Image.BOX`,
:obj:`PIL.Image.BILINEAR`, :obj:`PIL.Image.HAMMING`, :obj:`PIL.Image.BICUBIC` or :obj:`PIL.Image.LANCZOS`.
Only has an effect if :obj:`do_resize` is set to :obj:`True`.
do_normalize (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to normalize the input with :obj:`image_mean` and :obj:`image_std`.
image_mean (:obj:`List[int]`, defaults to :obj:`[0.485, 0.456, 0.406]`):
The sequence of means for each channel, to be used when normalizing images.
image_std (:obj:`List[int]`, defaults to :obj:`[0.229, 0.224, 0.225]`):
The sequence of standard deviations for each channel, to be used when normalizing images.
"""
model_input_names = ["pixel_values"]
def __init__(
self,
do_center_crop=True,
crop_size=256,
do_resize=True,
size=224,
resample=Image.BICUBIC,
do_normalize=True,
image_mean=None,
image_std=None,
**kwargs
):
super().__init__(**kwargs)
self.do_center_crop = do_center_crop
self.crop_size = crop_size
self.do_resize = do_resize
self.size = size
self.resample = resample
self.do_normalize = do_normalize
self.image_mean = image_mean if image_mean is not None else IMAGENET_DEFAULT_MEAN
self.image_std = image_std if image_std is not None else IMAGENET_DEFAULT_STD
def center_crop(self, image):
"""
Crops :obj:`image` to `self.crop_size` using a center crop. Note that if the image is too small to be cropped
to the size given, it will be padded (so the returned result has the size asked).
Args:
image (:obj:`PIL.Image.Image` or :obj:`np.ndarray` or :obj:`torch.Tensor`):
The image to resize.
"""
if isinstance(image, Image.Image):
image = self.to_numpy_array(image)
image_height, image_width = image.shape[-2:]
padded_center_crop_size = (
(self.size / (self.crop_size)) * np.minimum(image_height, image_width).astype(np.float32)
).astype(np.int32)
offset_height = ((image_height - padded_center_crop_size) + 1) // 2
offset_width = ((image_width - padded_center_crop_size) + 1) // 2
crop_window = [offset_height, offset_width, padded_center_crop_size, padded_center_crop_size]
image = image[
:, crop_window[0] : crop_window[0] + crop_window[2], crop_window[1] : crop_window[1] + crop_window[3]
]
return image
def __call__(
self, images: ImageInput, return_tensors: Optional[Union[str, TensorType]] = None, **kwargs
) -> BatchFeature:
"""
Main method to prepare for the model one or several image(s).
.. warning::
NumPy arrays and PyTorch tensors are converted to PIL images when resizing, so the most efficient is to pass
PIL images.
Args:
images (:obj:`PIL.Image.Image`, :obj:`np.ndarray`, :obj:`torch.Tensor`, :obj:`List[PIL.Image.Image]`, :obj:`List[np.ndarray]`, :obj:`List[torch.Tensor]`):
The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is a
number of channels, H and W are image height and width.
return_tensors (:obj:`str` or :class:`~transformers.file_utils.TensorType`, `optional`, defaults to :obj:`'np'`):
If set, will return tensors of a particular framework. Acceptable values are:
* :obj:`'tf'`: Return TensorFlow :obj:`tf.constant` objects.
* :obj:`'pt'`: Return PyTorch :obj:`torch.Tensor` objects.
* :obj:`'np'`: Return NumPy :obj:`np.ndarray` objects.
* :obj:`'jax'`: Return JAX :obj:`jnp.ndarray` objects.
Returns:
:class:`~transformers.BatchFeature`: A :class:`~transformers.BatchFeature` with the following fields:
- **pixel_values** -- Pixel values to be fed to a model, of shape (batch_size, num_channels, height,
width).
"""
# Input type checking for clearer error
valid_images = False
# Check that images has a valid type
if isinstance(images, (Image.Image, np.ndarray)) or is_torch_tensor(images):
valid_images = True
elif isinstance(images, (list, tuple)):
if len(images) == 0 or isinstance(images[0], (Image.Image, np.ndarray)) or is_torch_tensor(images[0]):
valid_images = True
if not valid_images:
raise ValueError(
"Images must of type `PIL.Image.Image`, `np.ndarray` or `torch.Tensor` (single example),"
"`List[PIL.Image.Image]`, `List[np.ndarray]` or `List[torch.Tensor]` (batch of examples)."
)
is_batched = bool(
isinstance(images, (list, tuple))
and (isinstance(images[0], (Image.Image, np.ndarray)) or is_torch_tensor(images[0]))
)
if not is_batched:
images = [images]
# transformations (center cropping + resizing + normalization)
if self.do_center_crop and self.crop_size is not None:
images = [self.center_crop(image) for image in images]
if self.do_resize and self.size is not None and self.resample is not None:
images = [self.resize(image=image, size=self.size, resample=self.resample) for image in images]
if self.do_normalize:
images = [self.normalize(image=image, mean=self.image_mean, std=self.image_std) for image in images]
# return as BatchFeature
data = {"pixel_values": images}
encoded_inputs = BatchFeature(data=data, tensor_type=return_tensors)
return encoded_inputs

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,204 @@
# coding=utf-8
# Copyright 2021 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Tokenization class for Perceiver."""
from typing import Dict, List, Optional, Tuple
from ...tokenization_utils import AddedToken, PreTrainedTokenizer
from ...utils import logging
logger = logging.get_logger(__name__)
class PerceiverTokenizer(PreTrainedTokenizer):
"""
Construct a Perceiver tokenizer. The Perceiver simply uses raw bytes utf-8 encoding.
This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the main methods.
Users should refer to this superclass for more information regarding those methods.
Args:
pad_token (:obj:`str`, `optional`, defaults to :obj:`"[PAD]"`):
The token used for padding, for example when batching sequences of different lengths.
bos_token (:obj:`str`, `optional`, defaults to :obj:`"[BOS]"`):
The BOS token (reserved in the vocab, but not actually used).
eos_token (:obj:`str`, `optional`, defaults to :obj:`"[EOS]"`):
The end of sequence token (reserved in the vocab, but not actually used).
.. note::
When building a sequence using special tokens, this is not the token that is used for the end of
sequence. The token used is the :obj:`sep_token`.
mask_token (:obj:`str`, `optional`, defaults to :obj:`"[MASK]"`):
The MASK token, useful for masked language modeling.
cls_token (:obj:`str`, `optional`, defaults to :obj:`"[CLS]"`):
The CLS token (reserved in the vocab, but not actually used).
sep_token (:obj:`str`, `optional`, defaults to :obj:`"[SEP]"`):
The separator token, which is used when building a sequence from two sequences.
"""
model_input_names = ["input_ids", "attention_mask"]
def __init__(
self,
pad_token="[PAD]",
bos_token="[BOS]",
eos_token="[EOS]",
mask_token="[MASK]",
cls_token="[CLS]",
sep_token="[SEP]",
model_max_length=2048,
**kwargs
) -> None:
pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token
bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
mask_token = AddedToken(mask_token, lstrip=False, rstrip=False) if isinstance(mask_token, str) else mask_token
cls_token = AddedToken(cls_token, lstrip=False, rstrip=False) if isinstance(cls_token, str) else cls_token
sep_token = AddedToken(sep_token, lstrip=False, rstrip=False) if isinstance(sep_token, str) else sep_token
super().__init__(
pad_token=pad_token,
bos_token=bos_token,
eos_token=eos_token,
mask_token=mask_token,
cls_token=cls_token,
sep_token=sep_token,
model_max_length=model_max_length,
**kwargs,
)
self._utf_vocab_size = 2 ** 8 # utf is 8 bits
# define special tokens dict
self.special_tokens_encoder: Dict[int, str] = {
self.pad_token: 0,
self.bos_token: 1,
self.eos_token: 2,
self.mask_token: 3,
self.cls_token: 4,
self.sep_token: 5,
}
self._num_special_tokens = len(self.special_tokens_encoder)
self.special_tokens_decoder: Dict[str, int] = {v: k for k, v in self.special_tokens_encoder.items()}
@property
def vocab_size(self):
return self._utf_vocab_size + self._num_special_tokens
def get_special_tokens_mask(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
) -> List[int]:
"""
Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer ``prepare_for_model`` method.
Args:
token_ids_0 (:obj:`List[int]`):
List of IDs.
token_ids_1 (:obj:`List[int]`, `optional`):
Optional second list of IDs for sequence pairs.
already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not the token list is already formatted with special tokens for the model.
Returns:
:obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
"""
if already_has_special_tokens:
return super().get_special_tokens_mask(
token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
)
# normal case: some special tokens
if token_ids_1 is None:
return [1] + [0] * len(token_ids_0) + [1]
return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]
def build_inputs_with_special_tokens(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
"""
Build model inputs from a sequence or a pair of sequence for sequence classification tasks. A sequence has the
following format:
- single sequence: ``[CLS] X [SEP]``
- pair of sequences: ``[CLS] A [SEP] B [SEP]``
Args:
token_ids_0 (:obj:`List[int]`):
List of IDs to which the special tokens will be added.
token_ids_1 (:obj:`List[int]`, `optional`):
Optional second list of IDs for sequence pairs.
Returns:
:obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
"""
if token_ids_1 is None:
return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
else:
return [self.cls_token_id] + token_ids_0 + [self.sep_token_id] + token_ids_1 + [self.sep_token_id]
def _tokenize(self, text: str) -> List[str]:
"""Take as input a string and return a list of strings (tokens) for words/sub-words"""
tokens = [chr(i) for i in text.encode("utf-8")]
return tokens
def _convert_token_to_id(self, token):
"""Converts a token (str) in an id using the vocab."""
if token in self.special_tokens_encoder:
token_id = self.special_tokens_encoder[token]
elif token in self.added_tokens_encoder:
token_id = self.added_tokens_encoder[token]
elif len(token) != 1:
token_id = self.unk_token_id
else:
token_id = ord(token) + self._num_special_tokens
return token_id
def _convert_id_to_token(self, index):
"""Converts an index (integer) in a token (str) using the vocab."""
if index in self.special_tokens_decoder:
token = self.special_tokens_decoder[index]
elif index in self.added_tokens_decoder:
token = self.added_tokens_decoder[index]
else:
token = chr(index - self._num_special_tokens)
return token
def convert_tokens_to_string(self, tokens):
"""Converts a sequence of tokens (string) in a single string."""
bstring = b""
for token in tokens:
if token in self.special_tokens_decoder:
tok_string = self.special_tokens_decoder[token].encode("utf-8")
elif token in self.added_tokens_decoder:
tok_string = self.special_tokens_decoder[token].encode("utf-8")
elif token in self.special_tokens_encoder:
tok_string = token.encode("utf-8")
elif token in self.added_tokens_encoder:
tok_string = token.encode("utf-8")
else:
tok_string = bytes([ord(token)])
bstring += tok_string
string = bstring.decode("utf-8", errors="replace")
return string
# PerceiverTokenizer has no vocab file
def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
return ()

View File

@@ -3753,6 +3753,87 @@ class PegasusPreTrainedModel:
requires_backends(self, ["torch"])
PERCEIVER_PRETRAINED_MODEL_ARCHIVE_LIST = None
class PerceiverForImageClassificationConvProcessing:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class PerceiverForImageClassificationFourier:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class PerceiverForImageClassificationLearned:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class PerceiverForMaskedLM:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch"])
def forward(self, *args, **kwargs):
requires_backends(self, ["torch"])
class PerceiverForMultimodalAutoencoding:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class PerceiverForOpticalFlow:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class PerceiverForSequenceClassification:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch"])
def forward(self, *args, **kwargs):
requires_backends(self, ["torch"])
class PerceiverLayer:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class PerceiverModel:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch"])
def forward(self, *args, **kwargs):
requires_backends(self, ["torch"])
class PerceiverPreTrainedModel:
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch"])
def forward(self, *args, **kwargs):
requires_backends(self, ["torch"])
PROPHETNET_PRETRAINED_MODEL_ARCHIVE_LIST = None

View File

@@ -64,6 +64,11 @@ class LayoutXLMProcessor:
requires_backends(cls, ["vision"])
class PerceiverFeatureExtractor:
def __init__(self, *args, **kwargs):
requires_backends(self, ["vision"])
class SegformerFeatureExtractor:
def __init__(self, *args, **kwargs):
requires_backends(self, ["vision"])

View File

@@ -0,0 +1,989 @@
# coding=utf-8
# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Testing suite for the PyTorch Perceiver model. """
import copy
import inspect
import math
import tempfile
import unittest
import warnings
from typing import Dict, List, Tuple
import numpy as np
from datasets import load_dataset
from transformers import PerceiverConfig
from transformers.file_utils import is_torch_available, is_vision_available
from transformers.models.auto import get_values
from transformers.testing_utils import require_torch, require_vision, slow, torch_device
from .test_configuration_common import ConfigTester
from .test_modeling_common import ModelTesterMixin, floats_tensor, ids_tensor, random_attention_mask
if is_torch_available():
import torch
from torch import nn
from transformers import (
MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING,
MODEL_FOR_MASKED_LM_MAPPING,
MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING,
MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING,
MODEL_MAPPING,
PerceiverForImageClassificationConvProcessing,
PerceiverForImageClassificationFourier,
PerceiverForImageClassificationLearned,
PerceiverForMaskedLM,
PerceiverForMultimodalAutoencoding,
PerceiverForOpticalFlow,
PerceiverForSequenceClassification,
PerceiverModel,
PerceiverTokenizer,
)
from transformers.models.perceiver.modeling_perceiver import PERCEIVER_PRETRAINED_MODEL_ARCHIVE_LIST
if is_vision_available():
from PIL import Image
from transformers import PerceiverFeatureExtractor
class PerceiverModelTester:
def __init__(
self,
parent,
batch_size=13,
seq_length=7,
num_channels=3,
image_size=32,
train_size=[20, 20],
num_frames=5,
audio_samples_per_frame=200,
samples_per_patch=20,
nchunks=20,
num_latents=10,
d_latents=20,
num_blocks=1,
num_self_attends_per_block=2,
num_self_attention_heads=1,
num_cross_attention_heads=1,
is_training=True,
use_input_mask=True,
use_labels=True,
vocab_size=99,
hidden_act="gelu",
attention_probs_dropout_prob=0.1,
initializer_range=0.02,
max_position_embeddings=7,
num_labels=3,
scope=None,
):
self.parent = parent
self.batch_size = batch_size
self.seq_length = seq_length
self.num_channels = num_channels
self.image_size = image_size
self.train_size = train_size
self.num_frames = num_frames
self.audio_samples_per_frame = audio_samples_per_frame
self.samples_per_patch = samples_per_patch
self.nchunks = nchunks
self.num_latents = num_latents
self.d_latents = d_latents
self.num_blocks = num_blocks
self.num_self_attends_per_block = num_self_attends_per_block
self.num_self_attention_heads = num_self_attention_heads
self.num_cross_attention_heads = num_cross_attention_heads
self.is_training = is_training
self.use_input_mask = use_input_mask
self.use_labels = use_labels
self.vocab_size = vocab_size
self.hidden_act = hidden_act
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.max_position_embeddings = max_position_embeddings
self.initializer_range = initializer_range
self.num_labels = num_labels
self.scope = scope
# set subsampling for multimodal model (take first chunk)
image_chunk_size = np.prod((self.num_frames, self.image_size, self.image_size)) // self.nchunks
audio_chunk_size = self.num_frames * self.audio_samples_per_frame // self.samples_per_patch // self.nchunks
self.subsampling = {
"image": torch.arange(0, image_chunk_size),
"audio": torch.arange(0, audio_chunk_size),
"label": None,
}
def prepare_config_and_inputs(self, model_class=None):
config = self.get_config()
input_mask = None
sequence_labels = None
token_labels = None
if self.use_labels:
sequence_labels = ids_tensor([self.batch_size], self.num_labels)
token_labels = ids_tensor([self.batch_size, self.seq_length], self.num_labels)
if model_class is None or model_class.__name__ == "PerceiverModel":
inputs = floats_tensor([self.batch_size, self.seq_length, config.d_model], self.vocab_size)
return config, inputs, input_mask, sequence_labels, token_labels
elif model_class.__name__ in ["PerceiverForMaskedLM", "PerceiverForSequenceClassification"]:
inputs = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
# input mask is only relevant for text inputs
if self.use_input_mask:
input_mask = random_attention_mask([self.batch_size, self.seq_length])
elif model_class.__name__ == "PerceiverForImageClassificationLearned":
config.d_model = 512
inputs = floats_tensor([self.batch_size, self.num_channels, self.image_size, self.image_size])
elif model_class.__name__ == "PerceiverForImageClassificationFourier":
config.d_model = 261
inputs = floats_tensor([self.batch_size, self.num_channels, self.image_size, self.image_size])
elif model_class.__name__ == "PerceiverForImageClassificationConvProcessing":
config.d_model = 322
inputs = floats_tensor([self.batch_size, self.num_channels, self.image_size, self.image_size])
elif model_class.__name__ == "PerceiverForOpticalFlow":
config.d_model = 322
inputs = floats_tensor([self.batch_size, 2, 27, self.train_size[0], self.train_size[1]])
elif model_class.__name__ == "PerceiverForMultimodalAutoencoding":
config.d_model = 409
images = torch.randn(
(self.batch_size, self.num_frames, self.num_channels, self.image_size, self.image_size),
device=torch_device,
)
audio = torch.randn(
(self.batch_size, self.num_frames * self.audio_samples_per_frame, 1), device=torch_device
)
inputs = dict(
image=images, audio=audio, label=torch.zeros((self.batch_size, self.num_labels), device=torch_device)
)
else:
raise ValueError(f"Model class {model_class} not supported")
return config, inputs, input_mask, sequence_labels, token_labels
def get_config(self):
return PerceiverConfig(
num_latents=self.num_latents,
d_latents=self.d_latents,
num_blocks=self.num_blocks,
num_self_attends_per_block=self.num_self_attends_per_block,
num_self_attention_heads=self.num_self_attention_heads,
num_cross_attention_heads=self.num_cross_attention_heads,
vocab_size=self.vocab_size,
hidden_act=self.hidden_act,
attention_probs_dropout_prob=self.attention_probs_dropout_prob,
initializer_range=self.initializer_range,
max_position_embeddings=self.max_position_embeddings,
image_size=self.image_size,
train_size=self.train_size,
num_frames=self.num_frames,
audio_samples_per_frame=self.audio_samples_per_frame,
samples_per_patch=self.samples_per_patch,
num_labels=self.num_labels,
)
def create_and_check_for_masked_lm(self, config, inputs, input_mask, sequence_labels, token_labels):
model = PerceiverForMaskedLM(config=config)
model.to(torch_device)
model.eval()
result = model(inputs, attention_mask=input_mask, labels=token_labels)
self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length, self.vocab_size))
def create_and_check_for_sequence_classification(self, config, inputs, input_mask, sequence_labels, token_labels):
# set num_labels
config.num_labels = self.num_labels
model = PerceiverForSequenceClassification(config=config)
model.to(torch_device)
model.eval()
result = model(inputs, attention_mask=input_mask, labels=sequence_labels)
self.parent.assertEqual(result.logits.shape, (self.batch_size, self.num_labels))
def create_and_check_for_image_classification_learned(
self, config, inputs, input_mask, sequence_labels, token_labels
):
# set d_model and num_labels
config.d_model = 512
config.num_labels = self.num_labels
model = PerceiverForImageClassificationLearned(config=config)
model.to(torch_device)
model.eval()
result = model(inputs, attention_mask=input_mask, labels=sequence_labels)
self.parent.assertEqual(result.logits.shape, (self.batch_size, self.num_labels))
def create_and_check_for_image_classification_fourier(
self, config, inputs, input_mask, sequence_labels, token_labels
):
# set d_model and num_labels
config.d_model = 261
config.num_labels = self.num_labels
model = PerceiverForImageClassificationFourier(config=config)
model.to(torch_device)
model.eval()
result = model(inputs, attention_mask=input_mask, labels=sequence_labels)
self.parent.assertEqual(result.logits.shape, (self.batch_size, self.num_labels))
def create_and_check_for_image_classification_conv(
self, config, inputs, input_mask, sequence_labels, token_labels
):
# set d_model and num_labels
config.d_model = 322
config.num_labels = self.num_labels
model = PerceiverForImageClassificationConvProcessing(config=config)
model.to(torch_device)
model.eval()
result = model(inputs, attention_mask=input_mask, labels=sequence_labels)
self.parent.assertEqual(result.logits.shape, (self.batch_size, self.num_labels))
def prepare_config_and_inputs_for_common(self):
config_and_inputs = self.prepare_config_and_inputs()
config, inputs, input_mask, sequence_labels, token_labels = config_and_inputs
inputs_dict = {"inputs": inputs, "attention_mask": input_mask}
return config, inputs_dict
def prepare_config_and_inputs_for_model_class(self, model_class):
config_and_inputs = self.prepare_config_and_inputs(model_class)
config, inputs, input_mask, sequence_labels, token_labels = config_and_inputs
inputs_dict = {"inputs": inputs, "attention_mask": input_mask}
return config, inputs_dict
@require_torch
class PerceiverModelTest(ModelTesterMixin, unittest.TestCase):
all_model_classes = (
(
PerceiverModel,
PerceiverForMaskedLM,
PerceiverForImageClassificationLearned,
PerceiverForImageClassificationConvProcessing,
PerceiverForImageClassificationFourier,
PerceiverForOpticalFlow,
PerceiverForMultimodalAutoencoding,
PerceiverForSequenceClassification,
)
if is_torch_available()
else ()
)
test_pruning = False
test_head_masking = False
test_torchscript = False
maxDiff = None
def setUp(self):
self.model_tester = PerceiverModelTester(self)
self.config_tester = ConfigTester(self, config_class=PerceiverConfig, hidden_size=37)
def _prepare_for_class(self, inputs_dict, model_class, return_labels=False):
inputs_dict = copy.deepcopy(inputs_dict)
if model_class.__name__ == "PerceiverForMultimodalAutoencoding":
inputs_dict["subsampled_output_points"] = self.model_tester.subsampling
if return_labels:
if model_class in [
*get_values(MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING),
*get_values(MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING),
]:
inputs_dict["labels"] = torch.zeros(
self.model_tester.batch_size, dtype=torch.long, device=torch_device
)
elif model_class in [
*get_values(MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING),
*get_values(MODEL_FOR_MASKED_LM_MAPPING),
]:
inputs_dict["labels"] = torch.zeros(
(self.model_tester.batch_size, self.model_tester.seq_length), dtype=torch.long, device=torch_device
)
return inputs_dict
def test_config(self):
# we don't test common_properties and arguments_init as these don't apply for Perceiver
self.config_tester.create_and_test_config_to_json_string()
self.config_tester.create_and_test_config_to_json_file()
self.config_tester.create_and_test_config_from_and_save_pretrained()
self.config_tester.create_and_test_config_with_num_labels()
self.config_tester.check_config_can_be_init_without_params()
def test_for_masked_lm(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs(model_class=PerceiverForMaskedLM)
self.model_tester.create_and_check_for_masked_lm(*config_and_inputs)
def test_for_sequence_classification(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs(model_class=PerceiverForSequenceClassification)
self.model_tester.create_and_check_for_sequence_classification(*config_and_inputs)
def test_for_image_classification_learned(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs(
model_class=PerceiverForImageClassificationLearned
)
self.model_tester.create_and_check_for_image_classification_learned(*config_and_inputs)
def test_for_image_classification_fourier(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs(
model_class=PerceiverForImageClassificationFourier
)
self.model_tester.create_and_check_for_image_classification_fourier(*config_and_inputs)
def test_for_image_classification_conv(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs(
model_class=PerceiverForImageClassificationConvProcessing
)
self.model_tester.create_and_check_for_image_classification_conv(*config_and_inputs)
def test_model_common_attributes(self):
for model_class in self.all_model_classes:
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_model_class(model_class)
model = model_class(config)
# we overwrite this, as the embeddings of Perceiver are an instance of nn.Parameter
# and Perceiver doesn't support get_output_embeddings
self.assertIsInstance(model.get_input_embeddings(), (nn.Parameter))
def test_training(self):
if not self.model_tester.is_training:
return
for model_class in self.all_model_classes:
if model_class in [
*get_values(MODEL_MAPPING),
PerceiverForOpticalFlow,
PerceiverForMultimodalAutoencoding,
]:
continue
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_model_class(model_class)
config.return_dict = True
model = model_class(config)
model.to(torch_device)
model.train()
inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
loss = model(**inputs).loss
loss.backward()
def test_forward_signature(self):
for model_class in self.all_model_classes:
config, _ = self.model_tester.prepare_config_and_inputs_for_model_class(model_class)
model = model_class(config)
signature = inspect.signature(model.forward)
# signature.parameters is an OrderedDict => so arg_names order is deterministic
arg_names = [*signature.parameters.keys()]
expected_arg_names = ["inputs"]
self.assertListEqual(arg_names[:1], expected_arg_names)
def test_determinism(self):
for model_class in self.all_model_classes:
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_model_class(model_class)
model = model_class(config)
model.to(torch_device)
model.eval()
with torch.no_grad():
inputs_dict = self._prepare_for_class(inputs_dict, model_class)
first = model(**inputs_dict)[0]
second = model(**inputs_dict)[0]
if model_class.__name__ == "PerceiverForMultimodalAutoencoding":
# model outputs a dictionary with logits per modality, let's verify each modality
for modality in first.keys():
out_1 = first[modality].cpu().numpy()
out_2 = second[modality].cpu().numpy()
out_1 = out_1[~np.isnan(out_1)]
out_2 = out_2[~np.isnan(out_2)]
max_diff = np.amax(np.abs(out_1 - out_2))
self.assertLessEqual(max_diff, 1e-5)
else:
out_1 = first.cpu().numpy()
out_2 = second.cpu().numpy()
out_1 = out_1[~np.isnan(out_1)]
out_2 = out_2[~np.isnan(out_2)]
max_diff = np.amax(np.abs(out_1 - out_2))
self.assertLessEqual(max_diff, 1e-5)
def test_attention_outputs(self):
seq_len = getattr(self.model_tester, "num_latents", None)
for model_class in self.all_model_classes:
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_model_class(model_class)
config.return_dict = True
inputs_dict["output_attentions"] = True
inputs_dict["output_hidden_states"] = False
config.return_dict = True
model = model_class(config)
model.to(torch_device)
model.eval()
with torch.no_grad():
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
self_attentions = outputs.attentions
cross_attentions = outputs.cross_attentions
# check expected number of attentions depending on model class
expected_num_self_attentions = self.model_tester.num_blocks * self.model_tester.num_self_attends_per_block
if model.__class__.__name__ == "PerceiverModel":
# we expect to have 2 cross-attentions, namely one in the PerceiverEncoder, and one in PerceiverBasicDecoder
expected_num_cross_attentions = 1
else:
# we expect to have 2 cross-attentions, namely one in the PerceiverEncoder, and one in PerceiverBasicDecoder
expected_num_cross_attentions = 2
self.assertEqual(len(self_attentions), expected_num_self_attentions)
self.assertEqual(len(cross_attentions), expected_num_cross_attentions)
# check that output_attentions also work using config
del inputs_dict["output_attentions"]
config.output_attentions = True
model = model_class(config)
model.to(torch_device)
model.eval()
with torch.no_grad():
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
self_attentions = outputs.attentions
cross_attentions = outputs.cross_attentions
self.assertEqual(len(self_attentions), expected_num_self_attentions)
self.assertEqual(len(cross_attentions), expected_num_cross_attentions)
self.assertListEqual(
list(self_attentions[0].shape[-3:]),
[self.model_tester.num_self_attention_heads, seq_len, seq_len],
)
out_len = len(outputs)
# Check attention is always last and order is fine
inputs_dict["output_attentions"] = True
inputs_dict["output_hidden_states"] = True
model = model_class(config)
model.to(torch_device)
model.eval()
with torch.no_grad():
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
self.assertEqual(out_len + 1, len(outputs))
self_attentions = outputs.attentions
self.assertEqual(len(self_attentions), expected_num_self_attentions)
self.assertListEqual(
list(self_attentions[0].shape[-3:]),
[self.model_tester.num_self_attention_heads, seq_len, seq_len],
)
def test_hidden_states_output(self):
def check_hidden_states_output(inputs_dict, config, model_class):
model = model_class(config)
model.to(torch_device)
model.eval()
with torch.no_grad():
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
hidden_states = outputs.hidden_states
expected_num_layers = self.model_tester.num_blocks * self.model_tester.num_self_attends_per_block + 1
self.assertEqual(len(hidden_states), expected_num_layers)
seq_length = self.model_tester.num_latents
self.assertListEqual(
list(hidden_states[0].shape[-2:]),
[seq_length, self.model_tester.d_latents],
)
for model_class in self.all_model_classes:
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_model_class(model_class)
inputs_dict["output_hidden_states"] = True
check_hidden_states_output(inputs_dict, config, model_class)
# check that output_hidden_states also work using config
del inputs_dict["output_hidden_states"]
config.output_hidden_states = True
check_hidden_states_output(inputs_dict, config, model_class)
def test_model_outputs_equivalence(self):
def set_nan_tensor_to_zero(t):
t[t != t] = 0
return t
def check_equivalence(model, tuple_inputs, dict_inputs, additional_kwargs={}):
with torch.no_grad():
tuple_output = model(**tuple_inputs, return_dict=False, **additional_kwargs)
dict_output = model(**dict_inputs, return_dict=True, **additional_kwargs).to_tuple()
def recursive_check(tuple_object, dict_object):
if isinstance(tuple_object, (List, Tuple)):
for tuple_iterable_value, dict_iterable_value in zip(tuple_object, dict_object):
recursive_check(tuple_iterable_value, dict_iterable_value)
elif isinstance(tuple_object, Dict):
for tuple_iterable_value, dict_iterable_value in zip(
tuple_object.values(), dict_object.values()
):
recursive_check(tuple_iterable_value, dict_iterable_value)
elif tuple_object is None:
return
else:
self.assertTrue(
torch.allclose(
set_nan_tensor_to_zero(tuple_object), set_nan_tensor_to_zero(dict_object), atol=1e-5
),
msg=f"Tuple and dict output are not equal. Difference: {torch.max(torch.abs(tuple_object - dict_object))}. "
f"Tuple has `nan`: {torch.isnan(tuple_object).any()} and `inf`: {torch.isinf(tuple_object)}. "
f"Dict has `nan`: {torch.isnan(dict_object).any()} and `inf`: {torch.isinf(dict_object)}.",
)
recursive_check(tuple_output, dict_output)
for model_class in self.all_model_classes:
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_model_class(model_class)
model = model_class(config)
model.to(torch_device)
model.eval()
tuple_inputs = self._prepare_for_class(inputs_dict, model_class)
dict_inputs = self._prepare_for_class(inputs_dict, model_class)
check_equivalence(model, tuple_inputs, dict_inputs)
if model_class.__name__ not in ["PerceiverForOpticalFlow", "PerceiverForMultimodalAutoencoding"]:
# optical flow + multimodal models don't support training for now
tuple_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
dict_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
check_equivalence(model, tuple_inputs, dict_inputs)
tuple_inputs = self._prepare_for_class(inputs_dict, model_class)
dict_inputs = self._prepare_for_class(inputs_dict, model_class)
check_equivalence(model, tuple_inputs, dict_inputs, {"output_hidden_states": True})
tuple_inputs = self._prepare_for_class(inputs_dict, model_class)
dict_inputs = self._prepare_for_class(inputs_dict, model_class)
check_equivalence(model, tuple_inputs, dict_inputs, {"output_attentions": True})
if model_class.__name__ not in ["PerceiverForOpticalFlow", "PerceiverForMultimodalAutoencoding"]:
# optical flow + multimodal models don't support training for now
tuple_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
dict_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
check_equivalence(model, tuple_inputs, dict_inputs, {"output_hidden_states": True})
if model_class.__name__ not in ["PerceiverForOpticalFlow", "PerceiverForMultimodalAutoencoding"]:
# optical flow + multimodal models don't support training for now
tuple_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
dict_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
check_equivalence(model, tuple_inputs, dict_inputs, {"output_attentions": True})
if model_class.__name__ not in ["PerceiverForOpticalFlow", "PerceiverForMultimodalAutoencoding"]:
# optical flow + multimodal models don't support training for now
tuple_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
dict_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
check_equivalence(
model, tuple_inputs, dict_inputs, {"output_hidden_states": True, "output_attentions": True}
)
def test_retain_grad_hidden_states_attentions(self):
# no need to test all models as different heads yield the same functionality
model_class = PerceiverForMaskedLM
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_model_class(model_class)
config.output_hidden_states = True
config.output_attentions = True
model = model_class(config)
model.to(torch_device)
inputs = self._prepare_for_class(inputs_dict, model_class)
outputs = model(**inputs)
output = outputs[0]
# Encoder-only model
hidden_states = outputs.hidden_states[0]
attentions = outputs.attentions[0]
hidden_states.retain_grad()
attentions.retain_grad()
output.flatten()[0].backward(retain_graph=True)
self.assertIsNotNone(hidden_states.grad)
self.assertIsNotNone(attentions.grad)
def test_feed_forward_chunking(self):
for model_class in self.all_model_classes:
original_config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_model_class(model_class)
torch.manual_seed(0)
config = copy.deepcopy(original_config)
model = model_class(config)
model.to(torch_device)
model.eval()
hidden_states_no_chunk = model(**self._prepare_for_class(inputs_dict, model_class))[0]
torch.manual_seed(0)
config.chunk_size_feed_forward = 1
model = model_class(config)
model.to(torch_device)
model.eval()
hidden_states_with_chunk = model(**self._prepare_for_class(inputs_dict, model_class))[0]
if model_class.__name__ == "PerceiverForMultimodalAutoencoding":
# model outputs a dictionary with logits for each modality
for modality in hidden_states_no_chunk.keys():
self.assertTrue(
torch.allclose(hidden_states_no_chunk[modality], hidden_states_with_chunk[modality], atol=1e-3)
)
else:
self.assertTrue(torch.allclose(hidden_states_no_chunk, hidden_states_with_chunk, atol=1e-3))
def test_save_load(self):
for model_class in self.all_model_classes:
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_model_class(model_class)
model = model_class(config)
model.to(torch_device)
model.eval()
with torch.no_grad():
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
if model_class.__name__ == "PerceiverForMultimodalAutoencoding":
for modality in outputs[0].keys():
out_2 = outputs[0][modality].cpu().numpy()
out_2[np.isnan(out_2)] = 0
with tempfile.TemporaryDirectory() as tmpdirname:
model.save_pretrained(tmpdirname)
model = model_class.from_pretrained(tmpdirname)
model.to(torch_device)
with torch.no_grad():
after_outputs = model(**self._prepare_for_class(inputs_dict, model_class))
# Make sure we don't have nans
out_1 = after_outputs[0][modality].cpu().numpy()
out_1[np.isnan(out_1)] = 0
max_diff = np.amax(np.abs(out_1 - out_2))
self.assertLessEqual(max_diff, 1e-5)
else:
out_2 = outputs[0].cpu().numpy()
out_2[np.isnan(out_2)] = 0
with tempfile.TemporaryDirectory() as tmpdirname:
model.save_pretrained(tmpdirname)
model = model_class.from_pretrained(tmpdirname)
model.to(torch_device)
with torch.no_grad():
after_outputs = model(**self._prepare_for_class(inputs_dict, model_class))
# Make sure we don't have nans
out_1 = after_outputs[0].cpu().numpy()
out_1[np.isnan(out_1)] = 0
max_diff = np.amax(np.abs(out_1 - out_2))
self.assertLessEqual(max_diff, 1e-5)
def test_correct_missing_keys(self):
if not self.test_missing_keys:
return
config, _ = self.model_tester.prepare_config_and_inputs_for_common()
for model_class in self.all_model_classes:
# most Perceiver models don't have a typical head like is the case with BERT
if model_class in [
PerceiverForOpticalFlow,
PerceiverForMultimodalAutoencoding,
*get_values(MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING),
*get_values(MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING),
]:
continue
model = model_class(config)
base_model_prefix = model.base_model_prefix
if hasattr(model, base_model_prefix):
with tempfile.TemporaryDirectory() as temp_dir_name:
model.base_model.save_pretrained(temp_dir_name)
model, loading_info = model_class.from_pretrained(temp_dir_name, output_loading_info=True)
with self.subTest(msg=f"Missing keys for {model.__class__.__name__}"):
self.assertGreater(len(loading_info["missing_keys"]), 0)
def test_problem_types(self):
problem_types = [
{"title": "multi_label_classification", "num_labels": 2, "dtype": torch.float},
{"title": "single_label_classification", "num_labels": 1, "dtype": torch.long},
{"title": "regression", "num_labels": 1, "dtype": torch.float},
]
for model_class in self.all_model_classes:
if model_class not in get_values(MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING):
continue
config, inputs, input_mask, _, _ = self.model_tester.prepare_config_and_inputs(model_class=model_class)
inputs_dict = dict(inputs=inputs, attention_mask=input_mask)
for problem_type in problem_types:
with self.subTest(msg=f"Testing {model_class} with {problem_type['title']}"):
config.problem_type = problem_type["title"]
config.num_labels = problem_type["num_labels"]
model = model_class(config)
model.to(torch_device)
model.train()
inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
if problem_type["num_labels"] > 1:
inputs["labels"] = inputs["labels"].unsqueeze(1).repeat(1, problem_type["num_labels"])
inputs["labels"] = inputs["labels"].to(problem_type["dtype"])
# This tests that we do not trigger the warning form PyTorch "Using a target size that is different
# to the input size. This will likely lead to incorrect results due to broadcasting. Please ensure
# they have the same size." which is a symptom something in wrong for the regression problem.
# See https://github.com/huggingface/transformers/issues/11780
with warnings.catch_warnings(record=True) as warning_list:
loss = model(**inputs).loss
for w in warning_list:
if "Using a target size that is different to the input size" in str(w.message):
raise ValueError(
f"Something is going wrong in the regression problem: intercepted {w.message}"
)
loss.backward()
@unittest.skip(reason="Perceiver models don't have a typical head like is the case with BERT")
def test_save_load_fast_init_from_base(self):
pass
@unittest.skip(reason="Perceiver models don't have a typical head like is the case with BERT")
def test_save_load_fast_init_to_base(self):
pass
@unittest.skip(reason="Perceiver doesn't support resize_token_embeddings")
def test_resize_tokens_embeddings(self):
pass
@unittest.skip(reason="Perceiver doesn't support resize_token_embeddings")
def test_resize_embeddings_untied(self):
pass
@unittest.skip(reason="Perceiver doesn't support inputs_embeds")
def test_inputs_embeds(self):
pass
@unittest.skip(reason="Perceiver doesn't support the AutoModel API")
def test_load_with_mismatched_shapes(self):
pass
@slow
def test_model_from_pretrained(self):
for model_name in PERCEIVER_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
model = PerceiverModel.from_pretrained(model_name)
self.assertIsNotNone(model)
# We will verify our results on an image of cute cats
def prepare_img():
image = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
return image
# Helper functions for optical flow integration test
def prepare_optical_flow_images():
dataset = load_dataset("hf-internal-testing/fixtures_sintel", split="test")
image1 = Image.open(dataset[0]["file"]).convert("RGB")
image2 = Image.open(dataset[0]["file"]).convert("RGB")
return image1, image2
def normalize(img):
return img / 255.0 * 2 - 1
def extract_image_patches(x, kernel, stride=1, dilation=1):
# Do TF 'SAME' Padding
b, c, h, w = x.shape
h2 = math.ceil(h / stride)
w2 = math.ceil(w / stride)
pad_row = (h2 - 1) * stride + (kernel - 1) * dilation + 1 - h
pad_col = (w2 - 1) * stride + (kernel - 1) * dilation + 1 - w
x = torch.nn.functional.pad(x, (pad_row // 2, pad_row - pad_row // 2, pad_col // 2, pad_col - pad_col // 2))
# Extract patches
patches = x.unfold(2, kernel, stride).unfold(3, kernel, stride)
patches = patches.permute(0, 4, 5, 1, 2, 3).contiguous()
return patches.view(b, -1, patches.shape[-2], patches.shape[-1])
@require_torch
@require_vision
class PerceiverModelIntegrationTest(unittest.TestCase):
@slow
def test_inference_masked_lm(self):
tokenizer = PerceiverTokenizer.from_pretrained("deepmind/language-perceiver")
model = PerceiverForMaskedLM.from_pretrained("deepmind/language-perceiver")
model.to(torch_device)
# prepare inputs
text = "This is an incomplete sentence where some words are missing."
encoding = tokenizer(text, padding="max_length", return_tensors="pt")
# mask " missing.".
encoding.input_ids[0, 52:61] = tokenizer.mask_token_id
inputs, input_mask = encoding.input_ids.to(torch_device), encoding.attention_mask.to(torch_device)
# forward pass
with torch.no_grad():
outputs = model(inputs=inputs, attention_mask=input_mask)
logits = outputs.logits
# verify logits
expected_shape = torch.Size((1, tokenizer.model_max_length, tokenizer.vocab_size))
self.assertEqual(logits.shape, expected_shape)
expected_slice = torch.tensor(
[[-10.8609, -10.7651, -10.9187], [-12.1689, -11.9389, -12.1479], [-12.1518, -11.9707, -12.2073]]
)
self.assertTrue(torch.allclose(logits[0, :3, :3], expected_slice, atol=1e-4))
expected_greedy_predictions = [38, 115, 111, 121, 121, 111, 116, 109, 52]
masked_tokens_predictions = logits[0, 52:61].argmax(dim=-1).tolist()
self.assertListEqual(expected_greedy_predictions, masked_tokens_predictions)
@slow
def test_inference_image_classification(self):
feature_extractor = PerceiverFeatureExtractor()
model = PerceiverForImageClassificationLearned.from_pretrained("deepmind/vision-perceiver-learned")
model.to(torch_device)
# prepare inputs
image = prepare_img()
inputs = feature_extractor(image, return_tensors="pt").pixel_values.to(torch_device)
input_mask = None
# forward pass
with torch.no_grad():
outputs = model(inputs=inputs, attention_mask=input_mask)
logits = outputs.logits
# verify logits
expected_shape = torch.Size((1, model.config.num_labels))
self.assertEqual(logits.shape, expected_shape)
expected_slice = torch.tensor([-1.1653, -0.1993, -0.7521], device=torch_device)
self.assertTrue(torch.allclose(logits[0, :3], expected_slice, atol=1e-4))
@slow
def test_inference_image_classification_fourier(self):
feature_extractor = PerceiverFeatureExtractor()
model = PerceiverForImageClassificationFourier.from_pretrained("deepmind/vision-perceiver-fourier")
model.to(torch_device)
# prepare inputs
image = prepare_img()
inputs = feature_extractor(image, return_tensors="pt").pixel_values.to(torch_device)
input_mask = None
# forward pass
with torch.no_grad():
outputs = model(inputs=inputs, attention_mask=input_mask)
logits = outputs.logits
# verify logits
expected_shape = torch.Size((1, model.config.num_labels))
self.assertEqual(logits.shape, expected_shape)
expected_slice = torch.tensor([-1.1295, -0.2832, 0.3226], device=torch_device)
self.assertTrue(torch.allclose(logits[0, :3], expected_slice, atol=1e-4))
@slow
def test_inference_image_classification_conv(self):
feature_extractor = PerceiverFeatureExtractor()
model = PerceiverForImageClassificationConvProcessing.from_pretrained("deepmind/vision-perceiver-conv")
model.to(torch_device)
# prepare inputs
image = prepare_img()
inputs = feature_extractor(image, return_tensors="pt").pixel_values.to(torch_device)
input_mask = None
# forward pass
with torch.no_grad():
outputs = model(inputs=inputs, attention_mask=input_mask)
logits = outputs.logits
# verify logits
expected_shape = torch.Size((1, model.config.num_labels))
self.assertEqual(logits.shape, expected_shape)
expected_slice = torch.tensor([-1.1186, 0.0554, 0.0897], device=torch_device)
self.assertTrue(torch.allclose(logits[0, :3], expected_slice, atol=1e-4))
@slow
def test_inference_optical_flow(self):
model = PerceiverForOpticalFlow.from_pretrained("deepmind/optical-flow-perceiver")
model.to(torch_device)
# prepare inputs
image1, image2 = prepare_optical_flow_images()
img1 = normalize(np.array(image1))
img2 = normalize(np.array(image1))
# stack images
img1 = torch.tensor(np.moveaxis(img1, -1, 0))
img2 = torch.tensor(np.moveaxis(img2, -1, 0))
images = torch.stack([img1, img2], dim=0)
# extract 3x3 patches
patch_size = model.config.train_size
inputs = images[..., : patch_size[0], : patch_size[1]].unsqueeze(0)
batch_size, _, C, H, W = inputs.shape
patches = extract_image_patches(inputs.view(batch_size * 2, C, H, W), kernel=3)
_, C, H, W = patches.shape
patches = patches.view(batch_size, -1, C, H, W).float()
# forward pass
with torch.no_grad():
outputs = model(inputs=patches)
logits = outputs.logits
# verify logits
expected_shape = torch.Size((1, 368, 496, 2))
self.assertEqual(logits.shape, expected_shape)
expected_slice = torch.tensor(
[
[[0.0025, -0.0050], [0.0025, -0.0049], [0.0025, -0.0048]],
[[0.0026, -0.0049], [0.0026, -0.0048], [0.0026, -0.0047]],
[[0.0026, -0.0049], [0.0026, -0.0048], [0.0026, -0.0046]],
],
device=torch_device,
)
self.assertTrue(torch.allclose(logits[0, :3, :3, :3], expected_slice, atol=1e-4))

View File

@@ -0,0 +1,288 @@
# coding=utf-8
# Copyright 2021 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import os
import re
import shutil
import tempfile
import unittest
from typing import Tuple
from transformers import AddedToken, BatchEncoding, PerceiverTokenizer
from transformers.file_utils import cached_property, is_tf_available, is_torch_available
from .test_tokenization_common import TokenizerTesterMixin
if is_torch_available():
FRAMEWORK = "pt"
elif is_tf_available():
FRAMEWORK = "tf"
else:
FRAMEWORK = "jax"
class PerceiverTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
tokenizer_class = PerceiverTokenizer
test_rust_tokenizer = False
def setUp(self):
super().setUp()
tokenizer = PerceiverTokenizer()
tokenizer.save_pretrained(self.tmpdirname)
@cached_property
def perceiver_tokenizer(self):
return PerceiverTokenizer.from_pretrained("deepmind/language-perceiver")
def get_tokenizer(self, **kwargs) -> PerceiverTokenizer:
return self.tokenizer_class.from_pretrained(self.tmpdirname, **kwargs)
def get_clean_sequence(self, tokenizer, with_prefix_space=False, max_length=20, min_length=5) -> Tuple[str, list]:
# XXX The default common tokenizer tests assume that every ID is decodable on its own.
# This assumption is invalid for Perceiver because single bytes might not be
# valid utf-8 (byte 128 for instance).
# Here we're overriding the smallest possible method to provide
# a clean sequence without making the same assumption.
toks = []
for i in range(len(tokenizer)):
try:
tok = tokenizer.decode([i], clean_up_tokenization_spaces=False)
except UnicodeDecodeError:
pass
toks.append((i, tok))
toks = list(filter(lambda t: re.match(r"^[ a-zA-Z]+$", t[1]), toks))
toks = list(filter(lambda t: [t[0]] == tokenizer.encode(t[1], add_special_tokens=False), toks))
if max_length is not None and len(toks) > max_length:
toks = toks[:max_length]
if min_length is not None and len(toks) < min_length and len(toks) > 0:
while len(toks) < min_length:
toks = toks + toks
# toks_str = [t[1] for t in toks]
toks_ids = [t[0] for t in toks]
# Ensure consistency
output_txt = tokenizer.decode(toks_ids, clean_up_tokenization_spaces=False)
if " " not in output_txt and len(toks_ids) > 1:
output_txt = (
tokenizer.decode([toks_ids[0]], clean_up_tokenization_spaces=False)
+ " "
+ tokenizer.decode(toks_ids[1:], clean_up_tokenization_spaces=False)
)
if with_prefix_space:
output_txt = " " + output_txt
output_ids = tokenizer.encode(output_txt, add_special_tokens=False)
return output_txt, output_ids
def test_multibytes_char(self):
tokenizer = self.perceiver_tokenizer
src_text = "Unicode €."
encoded = tokenizer(src_text)
encoded_ids = [4, 91, 116, 111, 105, 117, 106, 107, 38, 232, 136, 178, 52, 5]
self.assertEqual(encoded["input_ids"], encoded_ids)
# decoding
decoded = tokenizer.decode(encoded_ids)
self.assertEqual(decoded, "<cls>Unicode €.<sep>")
encoded = tokenizer("e è é ê ë")
encoded_ids = [4, 107, 38, 201, 174, 38, 201, 175, 38, 201, 176, 38, 201, 177, 5]
self.assertEqual(encoded["input_ids"], encoded_ids)
# decoding
decoded = tokenizer.decode(encoded_ids)
self.assertEqual(decoded, "<cls>e è é ê ë<sep>")
# encode/decode, but with `encode` instead of `__call__`
self.assertEqual(tokenizer.decode(tokenizer.encode("e è é ê ë")), "<cls>e è é ê ë<sep>")
def test_prepare_batch_integration(self):
tokenizer = self.perceiver_tokenizer
src_text = ["A long paragraph for summarization.", "Another paragraph for summarization."]
# fmt: off
expected_src_tokens = [4, 71, 38, 114, 117, 116, 109, 38, 118, 103, 120, 103, 109, 120, 103, 118, 110, 38, 108, 117, 120, 38, 121, 123, 115, 115, 103, 120, 111, 128, 103, 122, 111, 117, 116, 52, 5, 0]
# fmt: on
batch = tokenizer(src_text, padding=True, return_tensors=FRAMEWORK)
self.assertIsInstance(batch, BatchEncoding)
if FRAMEWORK != "jax":
result = list(batch.input_ids.numpy()[0])
else:
result = list(batch.input_ids.tolist()[0])
self.assertListEqual(expected_src_tokens, result)
self.assertEqual((2, 38), batch.input_ids.shape)
self.assertEqual((2, 38), batch.attention_mask.shape)
def test_empty_target_text(self):
tokenizer = self.perceiver_tokenizer
src_text = ["A long paragraph for summarization.", "Another paragraph for summarization."]
batch = tokenizer(src_text, padding=True, return_tensors=FRAMEWORK)
# check if input_ids are returned and no decoder_input_ids
self.assertIn("input_ids", batch)
self.assertIn("attention_mask", batch)
self.assertNotIn("decoder_input_ids", batch)
self.assertNotIn("decoder_attention_mask", batch)
def test_max_length_integration(self):
tokenizer = self.perceiver_tokenizer
tgt_text = [
"Summary of the text.",
"Another summary.",
]
with tokenizer.as_target_tokenizer():
targets = tokenizer(
tgt_text, max_length=32, padding="max_length", truncation=True, return_tensors=FRAMEWORK
)
self.assertEqual(32, targets["input_ids"].shape[1])
# cannot use default save_and_load_tokenzier test method because tokenzier has no vocab
def test_save_and_load_tokenizer(self):
# safety check on max_len default value so we are sure the test works
tokenizers = self.get_tokenizers()
for tokenizer in tokenizers:
with self.subTest(f"{tokenizer.__class__.__name__}"):
self.assertNotEqual(tokenizer.model_max_length, 42)
# Now let's start the test
tokenizers = self.get_tokenizers()
for tokenizer in tokenizers:
with self.subTest(f"{tokenizer.__class__.__name__}"):
# Isolate this from the other tests because we save additional tokens/etc
tmpdirname = tempfile.mkdtemp()
sample_text = " He is very happy, UNwant\u00E9d,running"
before_tokens = tokenizer.encode(sample_text, add_special_tokens=False)
tokenizer.save_pretrained(tmpdirname)
after_tokenizer = tokenizer.__class__.from_pretrained(tmpdirname)
after_tokens = after_tokenizer.encode(sample_text, add_special_tokens=False)
self.assertListEqual(before_tokens, after_tokens)
shutil.rmtree(tmpdirname)
tokenizers = self.get_tokenizers(model_max_length=42)
for tokenizer in tokenizers:
with self.subTest(f"{tokenizer.__class__.__name__}"):
# Isolate this from the other tests because we save additional tokens/etc
tmpdirname = tempfile.mkdtemp()
sample_text = " He is very happy, UNwant\u00E9d,running"
tokenizer.add_tokens(["bim", "bambam"])
additional_special_tokens = tokenizer.additional_special_tokens
additional_special_tokens.append("new_additional_special_token")
tokenizer.add_special_tokens({"additional_special_tokens": additional_special_tokens})
before_tokens = tokenizer.encode(sample_text, add_special_tokens=False)
tokenizer.save_pretrained(tmpdirname)
after_tokenizer = tokenizer.__class__.from_pretrained(tmpdirname)
after_tokens = after_tokenizer.encode(sample_text, add_special_tokens=False)
self.assertListEqual(before_tokens, after_tokens)
self.assertIn("new_additional_special_token", after_tokenizer.additional_special_tokens)
self.assertEqual(after_tokenizer.model_max_length, 42)
tokenizer = tokenizer.__class__.from_pretrained(tmpdirname, model_max_length=43)
self.assertEqual(tokenizer.model_max_length, 43)
shutil.rmtree(tmpdirname)
# There is a conflict between the default value of extra_ids and adding a new special token through additional_special_tokens
# We need to add the extra_ids in the list of the arg additional_special_tokens
def test_special_tokens_initialization_with_non_empty_additional_special_tokens(self):
tokenizer_list = []
if self.test_slow_tokenizer:
tokenizer_list.append((self.tokenizer_class, self.get_tokenizer()))
if self.test_rust_tokenizer:
tokenizer_list.append((self.rust_tokenizer_class, self.get_rust_tokenizer()))
for tokenizer_class, tokenizer_utils in tokenizer_list:
with tempfile.TemporaryDirectory() as tmp_dir:
tokenizer_utils.save_pretrained(tmp_dir)
with open(os.path.join(tmp_dir, "special_tokens_map.json"), encoding="utf-8") as json_file:
special_tokens_map = json.load(json_file)
with open(os.path.join(tmp_dir, "tokenizer_config.json"), encoding="utf-8") as json_file:
tokenizer_config = json.load(json_file)
added_tokens_extra_ids = [f"<extra_id_{i}>" for i in range(125)]
special_tokens_map["additional_special_tokens"] = added_tokens_extra_ids + [
"an_additional_special_token"
]
tokenizer_config["additional_special_tokens"] = added_tokens_extra_ids + [
"an_additional_special_token"
]
with open(os.path.join(tmp_dir, "special_tokens_map.json"), "w", encoding="utf-8") as outfile:
json.dump(special_tokens_map, outfile)
with open(os.path.join(tmp_dir, "tokenizer_config.json"), "w", encoding="utf-8") as outfile:
json.dump(tokenizer_config, outfile)
# the following checks allow us to verify that our test works as expected, i.e. that the tokenizer takes
# into account the new value of additional_special_tokens given in the "tokenizer_config.json" and
# "special_tokens_map.json" files
tokenizer_without_change_in_init = tokenizer_class.from_pretrained(
tmp_dir,
)
self.assertIn(
"an_additional_special_token", tokenizer_without_change_in_init.additional_special_tokens
)
self.assertEqual(
["an_additional_special_token"],
tokenizer_without_change_in_init.convert_ids_to_tokens(
tokenizer_without_change_in_init.convert_tokens_to_ids(["an_additional_special_token"])
),
)
# Now we test that we can change the value of additional_special_tokens in the from_pretrained
new_added_tokens = added_tokens_extra_ids + [AddedToken("a_new_additional_special_token", lstrip=True)]
tokenizer = tokenizer_class.from_pretrained(
tmp_dir,
additional_special_tokens=new_added_tokens,
)
self.assertIn("a_new_additional_special_token", tokenizer.additional_special_tokens)
self.assertEqual(
["a_new_additional_special_token"],
tokenizer.convert_ids_to_tokens(
tokenizer.convert_tokens_to_ids(["a_new_additional_special_token"])
),
)
def test_decode_invalid_byte_id(self):
tokenizer = self.perceiver_tokenizer
self.assertEqual(tokenizer.decode([178]), "<EFBFBD>")
# tokenizer can be instantiated without any pretrained files, so no need for pretrained tokenizer list
def test_pretrained_model_lists(self):
pass
# tokenizer does not have vocabulary
def test_get_vocab(self):
pass
# inputs cannot be pretokenized since ids depend on whole input string and not just on single characters
def test_pretokenized_inputs(self):
pass
# tests all ids in vocab => vocab doesn't exist so unnecessary to test
def test_conversion_reversible(self):
pass

View File

@@ -102,6 +102,8 @@ TEST_FILES_WITH_NO_COMMON_TESTS = [
# should **not** be the rule.
IGNORE_NON_AUTO_CONFIGURED = PRIVATE_MODELS.copy() + [
# models to ignore for model xxx mapping
"PerceiverForMultimodalAutoencoding",
"PerceiverForOpticalFlow",
"SegformerDecodeHead",
"SegformerForSemanticSegmentation",
"BeitForSemanticSegmentation",