Convert model files from rst to mdx (#14865)

* First pass
* Apply suggestions from code review
* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-12-22 03:27:30 -05:00
parent d0422de563
commit ec3567fe20
94 changed files with 5373 additions and 6563 deletions
--- a/docs/source/model_doc/albert.mdx
+++ b/docs/source/model_doc/albert.mdx
@@ -0,0 +1,170 @@
 <!--Copyright 2020 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # ALBERT
 ## Overview
 The ALBERT model was proposed in [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942) by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma,
 Radu Soricut. It presents two parameter-reduction techniques to lower memory consumption and increase the training
 speed of BERT:
 - Splitting the embedding matrix into two smaller matrices.
 - Using repeating layers split among groups.
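As a rough illustration of the first technique, the sketch below compares the parameter count of a single embedding matrix with that of a factorized one. The sizes (vocabulary 30,000, hidden size 768, embedding size 128) and the helper names are assumptions chosen for illustration; this shows only the arithmetic, not the library's implementation.

```python
def full_embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Parameters of one vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size


def factorized_embedding_params(vocab_size: int, embedding_size: int, hidden_size: int) -> int:
    """Parameters of a vocab_size x embedding_size matrix followed by
    an embedding_size x hidden_size projection."""
    return vocab_size * embedding_size + embedding_size * hidden_size


# Illustrative sizes (assumptions, not read from any checkpoint).
VOCAB_SIZE, HIDDEN_SIZE, EMBEDDING_SIZE = 30_000, 768, 128

full = full_embedding_params(VOCAB_SIZE, HIDDEN_SIZE)
factorized = factorized_embedding_params(VOCAB_SIZE, EMBEDDING_SIZE, HIDDEN_SIZE)
print(f"full: {full:,}  factorized: {factorized:,}")
```

With these assumed sizes the factorized form needs roughly a sixth of the embedding parameters, because the large vocabulary dimension is only ever multiplied by the small embedding size.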
 The abstract from the paper is the following:
 *Increasing model size when pretraining natural language representations often results in improved performance on
 downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations,
 longer training times, and unexpected model degradation. To address these problems, we present two parameter-reduction
 techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows
 that our proposed methods lead to models that scale much better compared to the original BERT. We also use a
 self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks
 with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and
 SQuAD benchmarks while having fewer parameters compared to BERT-large.*
 Tips:
 - ALBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather
  than the left.
 - ALBERT uses repeating layers, which results in a small memory footprint. However, the computational cost remains
  similar to a BERT-like architecture with the same number of hidden layers, as it has to iterate through the same
  number of (repeating) layers.
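The right-padding tip can be sketched on plain Python lists. The helper `pad_right`, the token ids, and the padding id of 0 are hypothetical values used only for illustration; in practice the tokenizer handles padding. The point is that real tokens keep positions `0..n-1`, so their absolute position embeddings are unaffected by the padding.

```python
from typing import List, Tuple


def pad_right(sequences: List[List[int]], pad_id: int = 0) -> Tuple[List[List[int]], List[List[int]]]:
    """Pad every sequence on the right to the length of the longest one.

    Returns the padded ids and an attention mask (1 = real token, 0 = padding).
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Two made-up sequences of different lengths.
ids, mask = pad_right([[2, 10, 11, 3], [2, 12, 3]])
print(ids)
print(mask)
```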
 This model was contributed by [lysandre](https://huggingface.co/lysandre). The JAX version of this model was contributed by
 [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/google-research/ALBERT).
 ## AlbertConfig
 [[autodoc]] AlbertConfig
 ## AlbertTokenizer
 [[autodoc]] AlbertTokenizer
    - build_inputs_with_special_tokens
    - get_special_tokens_mask
    - create_token_type_ids_from_sequences
    - save_vocabulary
 ## AlbertTokenizerFast
 [[autodoc]] AlbertTokenizerFast
 ## Albert specific outputs
 [[autodoc]] models.albert.modeling_albert.AlbertForPreTrainingOutput
 [[autodoc]] models.albert.modeling_tf_albert.TFAlbertForPreTrainingOutput
 ## AlbertModel
 [[autodoc]] AlbertModel
    - forward
 ## AlbertForPreTraining
 [[autodoc]] AlbertForPreTraining
    - forward
 ## AlbertForMaskedLM
 [[autodoc]] AlbertForMaskedLM
    - forward
 ## AlbertForSequenceClassification
 [[autodoc]] AlbertForSequenceClassification
    - forward
 ## AlbertForMultipleChoice
 [[autodoc]] AlbertForMultipleChoice
 ## AlbertForTokenClassification
 [[autodoc]] AlbertForTokenClassification
    - forward
 ## AlbertForQuestionAnswering
 [[autodoc]] AlbertForQuestionAnswering
    - forward
 ## TFAlbertModel
 [[autodoc]] TFAlbertModel
    - call
 ## TFAlbertForPreTraining
 [[autodoc]] TFAlbertForPreTraining
    - call
 ## TFAlbertForMaskedLM
 [[autodoc]] TFAlbertForMaskedLM
    - call
 ## TFAlbertForSequenceClassification
 [[autodoc]] TFAlbertForSequenceClassification
    - call
 ## TFAlbertForMultipleChoice
 [[autodoc]] TFAlbertForMultipleChoice
    - call
 ## TFAlbertForTokenClassification
 [[autodoc]] TFAlbertForTokenClassification
    - call
 ## TFAlbertForQuestionAnswering
 [[autodoc]] TFAlbertForQuestionAnswering
    - call
 ## FlaxAlbertModel
 [[autodoc]] FlaxAlbertModel
    - __call__
 ## FlaxAlbertForPreTraining
 [[autodoc]] FlaxAlbertForPreTraining
    - __call__
 ## FlaxAlbertForMaskedLM
 [[autodoc]] FlaxAlbertForMaskedLM
    - __call__
 ## FlaxAlbertForSequenceClassification
 [[autodoc]] FlaxAlbertForSequenceClassification
    - __call__
 ## FlaxAlbertForMultipleChoice
 [[autodoc]] FlaxAlbertForMultipleChoice
    - __call__
 ## FlaxAlbertForTokenClassification
 [[autodoc]] FlaxAlbertForTokenClassification
    - __call__
 ## FlaxAlbertForQuestionAnswering
 [[autodoc]] FlaxAlbertForQuestionAnswering
    - __call__
--- a/docs/source/model_doc/albert.rst
+++ b/docs/source/model_doc/albert.rst
@@ -1,226 +0,0 @@
 .. 
    Copyright 2020 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 ALBERT
 -----------------------------------------------------------------------------------------------------------------------
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The ALBERT model was proposed in `ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
 <https://arxiv.org/abs/1909.11942>`__ by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma,
 Radu Soricut. It presents two parameter-reduction techniques to lower memory consumption and increase the training
 speed of BERT:
 - Splitting the embedding matrix into two smaller matrices.
 - Using repeating layers split among groups.
 The abstract from the paper is the following:
 *Increasing model size when pretraining natural language representations often results in improved performance on
 downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations,
 longer training times, and unexpected model degradation. To address these problems, we present two parameter-reduction
 techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows
 that our proposed methods lead to models that scale much better compared to the original BERT. We also use a
 self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks
 with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and
 SQuAD benchmarks while having fewer parameters compared to BERT-large.*
 Tips:
 - ALBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather
  than the left.
 - ALBERT uses repeating layers which results in a small memory footprint, however the computational cost remains
  similar to a BERT-like architecture with the same number of hidden layers as it has to iterate through the same
  number of (repeating) layers.
 This model was contributed by `lysandre <https://huggingface.co/lysandre>`__. This model jax version was contributed by
 `kamalkraj <https://huggingface.co/kamalkraj>`__. The original code can be found `here
 <https://github.com/google-research/ALBERT>`__.
 AlbertConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.AlbertConfig
    :members:
 AlbertTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.AlbertTokenizer
    :members: build_inputs_with_special_tokens, get_special_tokens_mask,
        create_token_type_ids_from_sequences, save_vocabulary
 AlbertTokenizerFast
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.AlbertTokenizerFast
    :members:
 Albert specific outputs
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.models.albert.modeling_albert.AlbertForPreTrainingOutput
    :members:
 .. autoclass:: transformers.models.albert.modeling_tf_albert.TFAlbertForPreTrainingOutput
    :members:
 AlbertModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.AlbertModel
    :members: forward
 AlbertForPreTraining
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.AlbertForPreTraining
    :members: forward
 AlbertForMaskedLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.AlbertForMaskedLM
    :members: forward
 AlbertForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.AlbertForSequenceClassification
    :members: forward
 AlbertForMultipleChoice
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.AlbertForMultipleChoice
    :members:
 AlbertForTokenClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.AlbertForTokenClassification
    :members: forward
 AlbertForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.AlbertForQuestionAnswering
    :members: forward
 TFAlbertModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFAlbertModel
    :members: call
 TFAlbertForPreTraining
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFAlbertForPreTraining
    :members: call
 TFAlbertForMaskedLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFAlbertForMaskedLM
    :members: call
 TFAlbertForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFAlbertForSequenceClassification
    :members: call
 TFAlbertForMultipleChoice
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFAlbertForMultipleChoice
    :members: call
 TFAlbertForTokenClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFAlbertForTokenClassification
    :members: call
 TFAlbertForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFAlbertForQuestionAnswering
    :members: call
 FlaxAlbertModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxAlbertModel
    :members: __call__
 FlaxAlbertForPreTraining
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxAlbertForPreTraining
    :members: __call__
 FlaxAlbertForMaskedLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxAlbertForMaskedLM
    :members: __call__
 FlaxAlbertForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxAlbertForSequenceClassification
    :members: __call__
 FlaxAlbertForMultipleChoice
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxAlbertForMultipleChoice
    :members: __call__
 FlaxAlbertForTokenClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxAlbertForTokenClassification
    :members: __call__
 FlaxAlbertForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxAlbertForQuestionAnswering
    :members: __call__
--- a/docs/source/model_doc/bart.mdx
+++ b/docs/source/model_doc/bart.mdx
@@ -0,0 +1,151 @@
 <!--Copyright 2020 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # BART
 **DISCLAIMER:** If you see something strange, file a [GitHub Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title) and assign
 @patrickvonplaten
 ## Overview
 The Bart model was proposed in [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation,
 Translation, and Comprehension](https://arxiv.org/abs/1910.13461) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan
 Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer on 29 Oct, 2019.
 According to the abstract,
 - Bart uses a standard seq2seq/machine translation architecture with a bidirectional encoder (like BERT) and a
  left-to-right decoder (like GPT).
 - The pretraining task involves randomly shuffling the order of the original sentences and a novel in-filling scheme,
  where spans of text are replaced with a single mask token.
 - BART is particularly effective when fine-tuned for text generation but also works well for comprehension tasks. It
  matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, achieves new
  state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains
  of up to 6 ROUGE.
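The two noising steps described above can be sketched on plain Python lists. This is a minimal illustration under simplifying assumptions: the function names are hypothetical, the span is chosen by the caller rather than sampled, and tokens are whole words rather than subword ids. It is not the original pretraining code.

```python
import random
from typing import List


def permute_sentences(sentences: List[str], rng: random.Random) -> List[str]:
    """Return the sentences in a random order (sentence shuffling)."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


def infill_span(tokens: List[str], start: int, length: int, mask_token: str = "<mask>") -> List[str]:
    """Replace tokens[start:start + length] with a single mask token (text infilling)."""
    return tokens[:start] + [mask_token] + tokens[start + length:]


sentences = ["The cat sat.", "It was tired.", "Then it slept."]
shuffled = permute_sentences(sentences, random.Random(0))

tokens = "the cat sat on the mat".split()
corrupted = infill_span(tokens, start=1, length=3)
print(shuffled)
print(corrupted)
```

Because a span of any length collapses into one mask token, the model has to predict how many tokens are missing as well as what they are.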
 This model was contributed by [sshleifer](https://huggingface.co/sshleifer). The Authors' code can be found [here](https://github.com/pytorch/fairseq/tree/master/examples/bart).
 ### Examples
 - Examples and scripts for fine-tuning BART and other models for sequence to sequence tasks can be found in
  [examples/pytorch/summarization/](https://github.com/huggingface/transformers/tree/master/examples/pytorch/summarization/README.md).
 - An example of how to train [`BartForConditionalGeneration`] with a Hugging Face `datasets`
  object can be found in this [forum discussion](https://discuss.huggingface.co/t/train-bart-for-conditional-generation-e-g-summarization/1904).
 - [Distilled checkpoints](https://huggingface.co/models?search=distilbart) are described in this [paper](https://arxiv.org/abs/2010.13002).
 ## Implementation Notes
 - Bart doesn't use `token_type_ids` for sequence classification. Use [`BartTokenizer`] or
  [`~BartTokenizer.encode`] to get the proper splitting.
 - The forward pass of [`BartModel`] will create the `decoder_input_ids` if they are not passed.
  This is different from some other modeling APIs. A typical use case of this feature is mask filling.
 - Model predictions are intended to be identical to the original implementation when
  `forced_bos_token_id=0`. This only works, however, if the string you pass to
  [`fairseq.encode`] starts with a space.
 - [`~generation_utils.GenerationMixin.generate`] should be used for conditional generation tasks like
  summarization; see the example in its docstring.
 - Models that load the *facebook/bart-large-cnn* weights will not have a `mask_token_id`, or be able to perform
  mask-filling tasks.
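One common way for a sequence-to-sequence model to build decoder inputs when none are passed is to shift the input ids one position to the right and prepend a start token. The sketch below shows that idea on plain lists; the function is a simplified stand-in, and the token ids and start id are assumptions for illustration rather than the library's actual code.

```python
from typing import List


def shift_tokens_right(input_ids: List[List[int]], decoder_start_token_id: int) -> List[List[int]]:
    """Shift each row one position to the right, dropping the last token
    and prepending the decoder start token."""
    return [[decoder_start_token_id] + row[:-1] for row in input_ids]


# Made-up ids: 0 = <s>, 2 = </s>, 50264 stands in for <mask>.
input_ids = [[0, 100, 50264, 200, 2]]
decoder_input_ids = shift_tokens_right(input_ids, decoder_start_token_id=2)
print(decoder_input_ids)
```

With inputs built this way, the decoder at each position sees the preceding input tokens, which is what allows it to reconstruct a masked span during mask filling.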
 ## Mask Filling
 The `facebook/bart-base` and `facebook/bart-large` checkpoints can be used to fill multi-token masks.
 ```python
 from transformers import BartForConditionalGeneration, BartTokenizer
 # forced_bos_token_id=0 forces <s> (id 0) to be the first generated token
 model = BartForConditionalGeneration.from_pretrained("facebook/bart-large", forced_bos_token_id=0)
 tok = BartTokenizer.from_pretrained("facebook/bart-large")
 # A single <mask> may be filled with several tokens
 example_english_phrase = "UN Chief Says There Is No <mask> in Syria"
 batch = tok(example_english_phrase, return_tensors='pt')
 generated_ids = model.generate(batch['input_ids'])
 # Decode the generated ids back to text, dropping special tokens such as <s> and </s>
 assert tok.batch_decode(generated_ids, skip_special_tokens=True) == ['UN Chief Says There Is No Plan to Stop Chemical Weapons in Syria']
 ```
 ## BartConfig
 [[autodoc]] BartConfig
    - all
 ## BartTokenizer
 [[autodoc]] BartTokenizer
    - all
 ## BartTokenizerFast
 [[autodoc]] BartTokenizerFast
    - all
 ## BartModel
 [[autodoc]] BartModel
    - forward
 ## BartForConditionalGeneration
 [[autodoc]] BartForConditionalGeneration
    - forward
 ## BartForSequenceClassification
 [[autodoc]] BartForSequenceClassification
    - forward
 ## BartForQuestionAnswering
 [[autodoc]] BartForQuestionAnswering
    - forward
 ## BartForCausalLM
 [[autodoc]] BartForCausalLM
    - forward
 ## TFBartModel
 [[autodoc]] TFBartModel
    - call
 ## TFBartForConditionalGeneration
 [[autodoc]] TFBartForConditionalGeneration
    - call
 ## FlaxBartModel
 [[autodoc]] FlaxBartModel
    - __call__
    - encode
    - decode
 ## FlaxBartForConditionalGeneration
 [[autodoc]] FlaxBartForConditionalGeneration
    - __call__
    - encode
    - decode
 ## FlaxBartForSequenceClassification
 [[autodoc]] FlaxBartForSequenceClassification
    - __call__
    - encode
    - decode
 ## FlaxBartForQuestionAnswering
 [[autodoc]] FlaxBartForQuestionAnswering
    - __call__
    - encode
    - decode
--- a/docs/source/model_doc/bart.rst
+++ b/docs/source/model_doc/bart.rst
@@ -1,182 +0,0 @@
 .. 
    Copyright 2020 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 BART
 -----------------------------------------------------------------------------------------------------------------------
 **DISCLAIMER:** If you see something strange, file a `Github Issue
 <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ and assign
@patrickvonplaten
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The Bart model was proposed in `BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation,
 Translation, and Comprehension <https://arxiv.org/abs/1910.13461>`__ by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan
 Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer on 29 Oct, 2019.
 According to the abstract,
 - Bart uses a standard seq2seq/machine translation architecture with a bidirectional encoder (like BERT) and a
  left-to-right decoder (like GPT).
 - The pretraining task involves randomly shuffling the order of the original sentences and a novel in-filling scheme,
  where spans of text are replaced with a single mask token.
 - BART is particularly effective when fine tuned for text generation but also works well for comprehension tasks. It
  matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, achieves new
  state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains
  of up to 6 ROUGE.
 This model was contributed by `sshleifer <https://huggingface.co/sshleifer>`__. The Authors' code can be found `here
 <https://github.com/pytorch/fairseq/tree/master/examples/bart>`__.
 Examples
 _______________________________________________________________________________________________________________________
 - Examples and scripts for fine-tuning BART and other models for sequence to sequence tasks can be found in
  :prefix_link:`examples/pytorch/summarization/ <examples/pytorch/summarization/README.md>`.
 - An example of how to train :class:`~transformers.BartForConditionalGeneration` with a Hugging Face :obj:`datasets`
  object can be found in this `forum discussion
  <https://discuss.huggingface.co/t/train-bart-for-conditional-generation-e-g-summarization/1904>`__.
 - `Distilled checkpoints <https://huggingface.co/models?search=distilbart>`__ are described in this `paper
  <https://arxiv.org/abs/2010.13002>`__.
 Implementation Notes
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 - Bart doesn't use :obj:`token_type_ids` for sequence classification. Use :class:`~transformers.BartTokenizer` or
  :meth:`~transformers.BartTokenizer.encode` to get the proper splitting.
 - The forward pass of :class:`~transformers.BartModel` will create the ``decoder_input_ids`` if they are not passed.
  This is different than some other modeling APIs. A typical use case of this feature is mask filling.
 - Model predictions are intended to be identical to the original implementation when
  :obj:`force_bos_token_to_be_generated=True`. This only works, however, if the string you pass to
  :func:`fairseq.encode` starts with a space.
 - :meth:`~transformers.generation_utils.GenerationMixin.generate` should be used for conditional generation tasks like
  summarization, see the example in that docstrings.
 - Models that load the `facebook/bart-large-cnn` weights will not have a :obj:`mask_token_id`, or be able to perform
  mask-filling tasks.
 Mask Filling
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The :obj:`facebook/bart-base` and :obj:`facebook/bart-large` checkpoints can be used to fill multi-token masks.
 .. code-block::
    from transformers import BartForConditionalGeneration, BartTokenizer
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-large", forced_bos_token_id=0)
    tok = BartTokenizer.from_pretrained("facebook/bart-large")
    example_english_phrase = "UN Chief Says There Is No <mask> in Syria"
    batch = tok(example_english_phrase, return_tensors='pt')
    generated_ids = model.generate(batch['input_ids'])
    assert tok.batch_decode(generated_ids, skip_special_tokens=True) == ['UN Chief Says There Is No Plan to Stop Chemical Weapons in Syria']
 BartConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BartConfig
    :members:
 BartTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BartTokenizer
    :members:
 BartTokenizerFast
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BartTokenizerFast
    :members:
 BartModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BartModel
    :members: forward
 BartForConditionalGeneration
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BartForConditionalGeneration
    :members: forward
 BartForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BartForSequenceClassification
    :members: forward
 BartForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BartForQuestionAnswering
    :members: forward
 BartForCausalLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BartForCausalLM
    :members: forward
 TFBartModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFBartModel
    :members: call
 TFBartForConditionalGeneration
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFBartForConditionalGeneration
    :members: call
 FlaxBartModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxBartModel
    :members: __call__, encode, decode
 FlaxBartForConditionalGeneration
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxBartForConditionalGeneration
    :members: __call__, encode, decode
 FlaxBartForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxBartForSequenceClassification
    :members: __call__, encode, decode
 FlaxBartForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxBartForQuestionAnswering
    :members: __call__, encode, decode
--- a/docs/source/model_doc/barthez.mdx
+++ b/docs/source/model_doc/barthez.mdx
@@ -0,0 +1,50 @@
 <!--Copyright 2020 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # BARThez
 ## Overview
 The BARThez model was proposed in [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis on 23 Oct,
 2020.
 The abstract of the paper:
 *Inductive transfer learning, enabled by self-supervised learning, have taken the entire Natural Language Processing
 (NLP) field by storm, with models such as BERT and BART setting new state of the art on countless natural language
 understanding tasks. While there are some notable exceptions, most of the available models and research have been
 conducted for the English language. In this work, we introduce BARThez, the first BART model for the French language
 (to the best of our knowledge). BARThez was pretrained on a very large monolingual French corpus from past research
 that we adapted to suit BART's perturbation schemes. Unlike already existing BERT-based French language models such as
 CamemBERT and FlauBERT, BARThez is particularly well-suited for generative tasks, since not only its encoder but also
 its decoder is pretrained. In addition to discriminative tasks from the FLUE benchmark, we evaluate BARThez on a novel
 summarization dataset, OrangeSum, that we release with this paper. We also continue the pretraining of an already
 pretrained multilingual BART on BARThez's corpus, and we show that the resulting model, which we call mBARTHez,
 provides a significant boost over vanilla BARThez, and is on par with or outperforms CamemBERT and FlauBERT.*
 This model was contributed by [moussakam](https://huggingface.co/moussakam). The Authors' code can be found [here](https://github.com/moussaKam/BARThez).
 ### Examples
 - BARThez can be fine-tuned on sequence-to-sequence tasks in the same way as BART; see
  [examples/pytorch/summarization/](https://github.com/huggingface/transformers/tree/master/examples/pytorch/summarization/README.md).
 ## BarthezTokenizer
 [[autodoc]] BarthezTokenizer
 ## BarthezTokenizerFast
 [[autodoc]] BarthezTokenizerFast
--- a/docs/source/model_doc/barthez.rst
+++ b/docs/source/model_doc/barthez.rst
@@ -1,60 +0,0 @@
 .. 
    Copyright 2020 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 BARThez
 -----------------------------------------------------------------------------------------------------------------------
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The BARThez model was proposed in `BARThez: a Skilled Pretrained French Sequence-to-Sequence Model
 <https://arxiv.org/abs/2010.12321>`__ by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis on 23 Oct,
 2020.
 The abstract of the paper:
 *Inductive transfer learning, enabled by self-supervised learning, have taken the entire Natural Language Processing
 (NLP) field by storm, with models such as BERT and BART setting new state of the art on countless natural language
 understanding tasks. While there are some notable exceptions, most of the available models and research have been
 conducted for the English language. In this work, we introduce BARThez, the first BART model for the French language
 (to the best of our knowledge). BARThez was pretrained on a very large monolingual French corpus from past research
 that we adapted to suit BART's perturbation schemes. Unlike already existing BERT-based French language models such as
 CamemBERT and FlauBERT, BARThez is particularly well-suited for generative tasks, since not only its encoder but also
 its decoder is pretrained. In addition to discriminative tasks from the FLUE benchmark, we evaluate BARThez on a novel
 summarization dataset, OrangeSum, that we release with this paper. We also continue the pretraining of an already
 pretrained multilingual BART on BARThez's corpus, and we show that the resulting model, which we call mBARTHez,
 provides a significant boost over vanilla BARThez, and is on par with or outperforms CamemBERT and FlauBERT.*
 This model was contributed by `moussakam <https://huggingface.co/moussakam>`__. The Authors' code can be found `here
 <https://github.com/moussaKam/BARThez>`__.
 Examples
 _______________________________________________________________________________________________________________________
 - BARThez can be fine-tuned on sequence-to-sequence tasks in a similar way as BART, check:
  :prefix_link:`examples/pytorch/summarization/ <examples/pytorch/summarization/README.md>`.
 BarthezTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BarthezTokenizer
    :members:
 BarthezTokenizerFast
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BarthezTokenizerFast
    :members:
--- a/docs/source/model_doc/bartpho.mdx
+++ b/docs/source/model_doc/bartpho.mdx
@@ -0,0 +1,80 @@
 <!--Copyright 2021 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # BARTpho
 ## Overview
 The BARTpho model was proposed in [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen.
 The abstract from the paper is the following:
 *We present BARTpho with two versions -- BARTpho_word and BARTpho_syllable -- the first public large-scale monolingual
 sequence-to-sequence models pre-trained for Vietnamese. Our BARTpho uses the "large" architecture and pre-training
 scheme of the sequence-to-sequence denoising model BART, thus especially suitable for generative NLP tasks. Experiments
 on a downstream task of Vietnamese text summarization show that in both automatic and human evaluations, our BARTpho
 outperforms the strong baseline mBART and improves the state-of-the-art. We release BARTpho to facilitate future
 research and applications of generative Vietnamese NLP tasks.*
 Example of use:
 ```python
 >>> import torch
 >>> from transformers import AutoModel, AutoTokenizer
 >>> bartpho = AutoModel.from_pretrained("vinai/bartpho-syllable")
 >>> tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable")
 >>> line = "Chúng tôi là những nghiên cứu viên."
 >>> input_ids = tokenizer(line, return_tensors="pt")
 >>> with torch.no_grad():
 ...     features = bartpho(**input_ids)  # Model outputs are now tuples
 >>> # With TensorFlow 2.0+:
 >>> from transformers import TFAutoModel
 >>> bartpho = TFAutoModel.from_pretrained("vinai/bartpho-syllable")
 >>> input_ids = tokenizer(line, return_tensors="tf")
 >>> features = bartpho(**input_ids)
 ```
 Tips:
 - Following mBART, BARTpho uses the "large" architecture of BART with an additional layer-normalization layer on top of
  both the encoder and decoder. Thus, when adapting the usage examples in the [documentation of BART](bart) to
  BARTpho, replace the BART-specialized classes with their mBART-specialized counterparts.
  For example:
 ```python
 >>> from transformers import AutoTokenizer, MBartForConditionalGeneration
 >>> bartpho = MBartForConditionalGeneration.from_pretrained("vinai/bartpho-syllable")
 >>> tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable")
 >>> TXT = 'Chúng tôi là <mask> nghiên cứu viên.'
 >>> input_ids = tokenizer([TXT], return_tensors='pt')['input_ids']
 >>> logits = bartpho(input_ids).logits
 >>> masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()
 >>> probs = logits[0, masked_index].softmax(dim=0)
 >>> values, predictions = probs.topk(5)
 >>> print(tokenizer.decode(predictions).split())
 ```
 - This implementation is only for tokenization: "monolingual_vocab_file" consists of Vietnamese-specialized types
  extracted from the pre-trained SentencePiece model "vocab_file" that is available from the multilingual XLM-RoBERTa.
  Other languages, if employing this pre-trained multilingual SentencePiece model "vocab_file" for subword
  segmentation, can reuse BartphoTokenizer with their own language-specialized "monolingual_vocab_file".
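The relationship between the two vocabulary files described in the last tip can be sketched in plain Python. This is only a conceptual illustration under stated assumptions, not the actual `BartphoTokenizer` code: the toy token lists and the helper names `build_reduced_vocab` and `convert_tokens_to_ids` are hypothetical. It shows subword pieces from a multilingual segmentation being mapped onto a smaller, language-specific id space, with anything outside the monolingual list falling back to the unknown token.

```python
# Conceptual sketch only: toy data and hypothetical helper names,
# not the real BartphoTokenizer implementation.

SPECIAL_TOKENS = ["<s>", "<pad>", "</s>", "<unk>"]

# Stand-in for the types listed in a "monolingual_vocab_file".
monolingual_types = ["▁Chúng", "▁tôi", "▁là"]


def build_reduced_vocab(specials, types):
    """Assign consecutive ids to the special tokens, then to the monolingual types."""
    token_to_id = {}
    for token in list(specials) + list(types):
        if token not in token_to_id:
            token_to_id[token] = len(token_to_id)
    return token_to_id


def convert_tokens_to_ids(tokens, token_to_id, unk_token="<unk>"):
    """Map subword pieces to reduced ids; unseen pieces become the unknown id."""
    unk_id = token_to_id[unk_token]
    return [token_to_id.get(token, unk_id) for token in tokens]


reduced = build_reduced_vocab(SPECIAL_TOKENS, monolingual_types)

# "▁the" is a piece the multilingual model could produce, but it is not
# in the monolingual list, so it maps to <unk>.
ids = convert_tokens_to_ids(["▁Chúng", "▁tôi", "▁the"], reduced)
print(ids)
```

Swapping in a different list of language-specific types is, in spirit, what reusing the tokenizer for another language would amount to.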
 This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen). The original code can be found [here](https://github.com/VinAIResearch/BARTpho).
 ## BartphoTokenizer
 [[autodoc]] BartphoTokenizer
--- a/docs/source/model_doc/bartpho.rst
+++ b/docs/source/model_doc/bartpho.rst
@@ -1,86 +0,0 @@
 ..
    Copyright 2021 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 BARTpho
 -----------------------------------------------------------------------------------------------------------------------
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The BARTpho model was proposed in `BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese
 <https://arxiv.org/abs/2109.09701>`__ by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen.
 The abstract from the paper is the following:
 *We present BARTpho with two versions -- BARTpho_word and BARTpho_syllable -- the first public large-scale monolingual
 sequence-to-sequence models pre-trained for Vietnamese. Our BARTpho uses the "large" architecture and pre-training
 scheme of the sequence-to-sequence denoising model BART, thus especially suitable for generative NLP tasks. Experiments
 on a downstream task of Vietnamese text summarization show that in both automatic and human evaluations, our BARTpho
 outperforms the strong baseline mBART and improves the state-of-the-art. We release BARTpho to facilitate future
 research and applications of generative Vietnamese NLP tasks.*
 Example of use:
 .. code-block::
    >>> import torch
    >>> from transformers import AutoModel, AutoTokenizer
    >>> bartpho = AutoModel.from_pretrained("vinai/bartpho-syllable")
    >>> tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable")
    >>> line = "Chúng tôi là những nghiên cứu viên."
    >>> input_ids = tokenizer(line, return_tensors="pt")
    >>> with torch.no_grad():
    ...     features = bartpho(**input_ids)  # Models outputs are now tuples
    >>> # With TensorFlow 2.0+:
    >>> from transformers import TFAutoModel
    >>> bartpho = TFAutoModel.from_pretrained("vinai/bartpho-syllable")
    >>> input_ids = tokenizer(line, return_tensors="tf")
    >>> features = bartpho(**input_ids)
 Tips:
 - Following mBART, BARTpho uses the "large" architecture of BART with an additional layer-normalization layer on top of
  both the encoder and decoder. Thus, usage examples in the :doc:`documentation of BART <bart>`, when adapting to use
  with BARTpho, should be adjusted by replacing the BART-specialized classes with the mBART-specialized counterparts.
  For example:
 .. code-block::
    >>> from transformers import MBartForConditionalGeneration
    >>> bartpho = MBartForConditionalGeneration.from_pretrained("vinai/bartpho-syllable")
    >>> TXT = 'Chúng tôi là <mask> nghiên cứu viên.'
    >>> input_ids = tokenizer([TXT], return_tensors='pt')['input_ids']
    >>> logits = bartpho(input_ids).logits
    >>> masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()
    >>> probs = logits[0, masked_index].softmax(dim=0)
    >>> values, predictions = probs.topk(5)
    >>> print(tokenizer.decode(predictions).split())
 - This implementation is only for tokenization: "monolingual_vocab_file" consists of Vietnamese-specialized types
  extracted from the pre-trained SentencePiece model "vocab_file" that is available from the multilingual XLM-RoBERTa.
  Other languages, if employing this pre-trained multilingual SentencePiece model "vocab_file" for subword
  segmentation, can reuse BartphoTokenizer with their own language-specialized "monolingual_vocab_file".
 This model was contributed by `dqnguyen <https://huggingface.co/dqnguyen>`__. The original code can be found `here
 <https://github.com/VinAIResearch/BARTpho>`__.
 BartphoTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BartphoTokenizer
    :members:
--- a/docs/source/model_doc/beit.mdx
+++ b/docs/source/model_doc/beit.mdx
@@ -0,0 +1,114 @@
 <!--Copyright 2021 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # BEiT
 ## Overview
 The BEiT model was proposed in [BEiT: BERT Pre-Training of Image Transformers](https://arxiv.org/abs/2106.08254) by
 Hangbo Bao, Li Dong and Furu Wei. Inspired by BERT, BEiT is the first paper that makes self-supervised pre-training of
 Vision Transformers (ViTs) outperform supervised pre-training. Rather than pre-training the model to predict the class
 of an image (as done in the [original ViT paper](https://arxiv.org/abs/2010.11929)), BEiT models are pre-trained to
 predict visual tokens from the codebook of OpenAI's [DALL-E model](https://arxiv.org/abs/2102.12092) given masked
 patches.
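To make the idea of masked patches concrete, here is a minimal sketch using only the standard library. It assumes a 224x224 image cut into 16x16 patches and masks a fixed fraction of them uniformly at random. The helper name `mask_patches`, the uniform sampling, and the 0.4 ratio are all assumptions chosen for illustration and may differ from the masking strategy used by the authors.

```python
import random


def mask_patches(image_size=224, patch_size=16, mask_ratio=0.4, seed=0):
    """Return one boolean per patch: True means the patch is masked.

    Hypothetical helper; uniform sampling and the default ratio are
    illustrative assumptions, not the paper's exact procedure.
    """
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side * patches_per_side
    num_masked = int(num_patches * mask_ratio)
    rng = random.Random(seed)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [index in masked for index in range(num_patches)]


mask = mask_patches()
print(len(mask), sum(mask))
```

During pre-training, the model would see the image with the masked patches hidden and be asked to predict the visual token at each masked position.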
 The abstract from the paper is the following:
 *We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation
 from Image Transformers. Following BERT developed in the natural language processing area, we propose a masked image
 modeling task to pretrain vision Transformers. Specifically, each image has two views in our pre-training, i.e, image
 patches (such as 16x16 pixels), and visual tokens (i.e., discrete tokens). We first "tokenize" the original image into
 visual tokens. Then we randomly mask some image patches and fed them into the backbone Transformer. The pre-training
 objective is to recover the original visual tokens based on the corrupted image patches. After pre-training BEiT, we
 directly fine-tune the model parameters on downstream tasks by appending task layers upon the pretrained encoder.
 Experimental results on image classification and semantic segmentation show that our model achieves competitive results
 with previous pre-training methods. For example, base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K,
 significantly outperforming from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size BEiT obtains
 86.3% only using ImageNet-1K, even outperforming ViT-L with supervised pre-training on ImageNet-22K (85.2%).*
 Tips:
 - BEiT models are regular Vision Transformers, but pre-trained in a self-supervised way rather than supervised. They
  outperform both the [original model (ViT)](vit) and [Data-efficient Image Transformers (DeiT)](deit) when fine-tuned on ImageNet-1K and CIFAR-100. You can check out demo notebooks regarding inference as well as
  fine-tuning on custom data [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer) (you can just replace
  [`ViTFeatureExtractor`] by [`BeitFeatureExtractor`] and
  [`ViTForImageClassification`] by [`BeitForImageClassification`]).
 - There's also a demo notebook available which showcases how to combine DALL-E's image tokenizer with BEiT for
  performing masked image modeling. You can find it [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/BEiT).
 - As the BEiT models expect each image to be of the same size (resolution), one can use
  [`BeitFeatureExtractor`] to resize (or rescale) and normalize images for the model.
 - Both the patch resolution and image resolution used during pre-training or fine-tuning are reflected in the name of
  each checkpoint. For example, `microsoft/beit-base-patch16-224` refers to a base-sized architecture with patch
  resolution of 16x16 and fine-tuning resolution of 224x224. All checkpoints can be found on the [hub](https://huggingface.co/models?search=microsoft/beit).
 - The available checkpoints are either (1) pre-trained on [ImageNet-22k](http://www.image-net.org/) (a collection of
  14 million images and 22k classes) only, (2) also fine-tuned on ImageNet-22k or (3) also fine-tuned on [ImageNet-1k](http://www.image-net.org/challenges/LSVRC/2012/) (also referred to as ILSVRC 2012, a collection of 1.3 million
  images and 1,000 classes).
 - BEiT uses relative position embeddings, inspired by the T5 model. During pre-training, the authors shared the
  relative position bias among the several self-attention layers. During fine-tuning, each layer's relative position
  bias is initialized with the shared relative position bias obtained after pre-training. Note that, if one wants to
  pre-train a model from scratch, one needs to either set the `use_relative_position_bias` or the
  `use_shared_relative_position_bias` attribute of [`BeitConfig`] to `True` in order to add
  position embeddings.
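The checkpoint naming convention mentioned in the tips can be read mechanically. The sketch below is a hypothetical helper, not part of the library: it assumes names shaped like `microsoft/beit-base-patch16-224` and extracts the model size, patch resolution, and image resolution, then derives how many patches such an image is split into.

```python
import re

# Hypothetical pattern for names such as "microsoft/beit-base-patch16-224".
_CHECKPOINT_PATTERN = re.compile(r"beit-(?P<size>[a-z]+)-patch(?P<patch>\d+)-(?P<res>\d+)")


def parse_beit_checkpoint(name):
    """Extract size, patch resolution, and image resolution from a checkpoint name."""
    match = _CHECKPOINT_PATTERN.search(name)
    if match is None:
        raise ValueError(f"Unrecognized checkpoint name: {name!r}")
    patch_size = int(match["patch"])
    image_size = int(match["res"])
    return {
        "size": match["size"],
        "patch_size": patch_size,
        "image_size": image_size,
        # Number of non-overlapping patches in a square image.
        "num_patches": (image_size // patch_size) ** 2,
    }


info = parse_beit_checkpoint("microsoft/beit-base-patch16-224")
print(info)
```

For the example name, a 224x224 image with 16x16 patches yields a 14x14 grid, that is, 196 patches.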
 This model was contributed by [nielsr](https://huggingface.co/nielsr). The JAX/FLAX version of this model was
 contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/microsoft/unilm/tree/master/beit).
 ## BEiT specific outputs
 [[autodoc]] models.beit.modeling_beit.BeitModelOutputWithPooling
 [[autodoc]] models.beit.modeling_flax_beit.FlaxBeitModelOutputWithPooling
 ## BeitConfig
 [[autodoc]] BeitConfig
 ## BeitFeatureExtractor
 [[autodoc]] BeitFeatureExtractor
    - __call__
 ## BeitModel
 [[autodoc]] BeitModel
    - forward
 ## BeitForMaskedImageModeling
 [[autodoc]] BeitForMaskedImageModeling
    - forward
 ## BeitForImageClassification
 [[autodoc]] BeitForImageClassification
    - forward
 ## BeitForSemanticSegmentation
 [[autodoc]] BeitForSemanticSegmentation
    - forward
 ## FlaxBeitModel
 [[autodoc]] FlaxBeitModel
    - __call__
 ## FlaxBeitForMaskedImageModeling
 [[autodoc]] FlaxBeitForMaskedImageModeling
    - __call__
 ## FlaxBeitForImageClassification
 [[autodoc]] FlaxBeitForImageClassification
    - __call__
--- a/docs/source/model_doc/beit.rst
+++ b/docs/source/model_doc/beit.rst
@@ -1,144 +0,0 @@
 .. 
    Copyright 2021 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 BEiT
 -----------------------------------------------------------------------------------------------------------------------
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The BEiT model was proposed in `BEiT: BERT Pre-Training of Image Transformers <https://arxiv.org/abs/2106.08254>`__ by
 Hangbo Bao, Li Dong and Furu Wei. Inspired by BERT, BEiT is the first paper that makes self-supervised pre-training of
 Vision Transformers (ViTs) outperform supervised pre-training. Rather than pre-training the model to predict the class
 of an image (as done in the `original ViT paper <https://arxiv.org/abs/2010.11929>`__), BEiT models are pre-trained to
 predict visual tokens from the codebook of OpenAI's `DALL-E model <https://arxiv.org/abs/2102.12092>`__ given masked
 patches.
 The abstract from the paper is the following:
 *We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation
 from Image Transformers. Following BERT developed in the natural language processing area, we propose a masked image
 modeling task to pretrain vision Transformers. Specifically, each image has two views in our pre-training, i.e, image
 patches (such as 16x16 pixels), and visual tokens (i.e., discrete tokens). We first "tokenize" the original image into
 visual tokens. Then we randomly mask some image patches and fed them into the backbone Transformer. The pre-training
 objective is to recover the original visual tokens based on the corrupted image patches. After pre-training BEiT, we
 directly fine-tune the model parameters on downstream tasks by appending task layers upon the pretrained encoder.
 Experimental results on image classification and semantic segmentation show that our model achieves competitive results
 with previous pre-training methods. For example, base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K,
 significantly outperforming from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size BEiT obtains
 86.3% only using ImageNet-1K, even outperforming ViT-L with supervised pre-training on ImageNet-22K (85.2%).*
 Tips:
 - BEiT models are regular Vision Transformers, but pre-trained in a self-supervised way rather than supervised. They
  outperform both the :doc:`original model (ViT) <vit>` as well as :doc:`Data-efficient Image Transformers (DeiT)
  <deit>` when fine-tuned on ImageNet-1K and CIFAR-100. You can check out demo notebooks regarding inference as well as
  fine-tuning on custom data `here
  <https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer>`__ (you can just replace
  :class:`~transformers.ViTFeatureExtractor` by :class:`~transformers.BeitFeatureExtractor` and
  :class:`~transformers.ViTForImageClassification` by :class:`~transformers.BeitForImageClassification`).
 - There's also a demo notebook available which showcases how to combine DALL-E's image tokenizer with BEiT for
  performing masked image modeling. You can find it `here
  <https://github.com/NielsRogge/Transformers-Tutorials/tree/master/BEiT>`__.
 - As the BEiT models expect each image to be of the same size (resolution), one can use
  :class:`~transformers.BeitFeatureExtractor` to resize (or rescale) and normalize images for the model.
 - Both the patch resolution and image resolution used during pre-training or fine-tuning are reflected in the name of
  each checkpoint. For example, :obj:`microsoft/beit-base-patch16-224` refers to a base-sized architecture with patch
  resolution of 16x16 and fine-tuning resolution of 224x224. All checkpoints can be found on the `hub
  <https://huggingface.co/models?search=microsoft/beit>`__.
 - The available checkpoints are either (1) pre-trained on `ImageNet-22k <http://www.image-net.org/>`__ (a collection of
  14 million images and 22k classes) only, (2) also fine-tuned on ImageNet-22k or (3) also fine-tuned on `ImageNet-1k
  <http://www.image-net.org/challenges/LSVRC/2012/>`__ (also referred to as ILSVRC 2012, a collection of 1.3 million
  images and 1,000 classes).
 - BEiT uses relative position embeddings, inspired by the T5 model. During pre-training, the authors shared the
  relative position bias among the several self-attention layers. During fine-tuning, each layer's relative position
  bias is initialized with the shared relative position bias obtained after pre-training. Note that, if one wants to
  pre-train a model from scratch, one needs to either set the :obj:`use_relative_position_bias` or the
  :obj:`use_relative_position_bias` attribute of :class:`~transformers.BeitConfig` to :obj:`True` in order to add
  position embeddings.
 This model was contributed by `nielsr <https://huggingface.co/nielsr>`__. The JAX/FLAX version of this model was
 contributed by `kamalkraj <https://huggingface.co/kamalkraj>`__. The original code can be found `here
 <https://github.com/microsoft/unilm/tree/master/beit>`__.
 BEiT specific outputs
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.models.beit.modeling_beit.BeitModelOutputWithPooling
    :members:
 .. autoclass:: transformers.models.beit.modeling_flax_beit.FlaxBeitModelOutputWithPooling
    :members:
 BeitConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BeitConfig
    :members:
 BeitFeatureExtractor
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BeitFeatureExtractor
    :members: __call__
 BeitModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BeitModel
    :members: forward
 BeitForMaskedImageModeling
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BeitForMaskedImageModeling
    :members: forward
 BeitForImageClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BeitForImageClassification
    :members: forward
 BeitForSemanticSegmentation
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BeitForSemanticSegmentation
    :members: forward
 FlaxBeitModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxBeitModel
    :members: __call__
 FlaxBeitForMaskedImageModeling
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxBeitForMaskedImageModeling
    :members: __call__
 FlaxBeitForImageClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxBeitForImageClassification
    :members: __call__
--- a/docs/source/model_doc/bert_japanese.mdx
+++ b/docs/source/model_doc/bert_japanese.mdx
@@ -0,0 +1,74 @@
 <!--Copyright 2020 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # BertJapanese
 ## Overview
 BertJapanese covers the BERT models trained on Japanese text.
 There are models with two different tokenization methods:
 - Tokenize with MeCab and WordPiece. This requires an extra dependency, [fugashi](https://github.com/polm/fugashi), which is a wrapper around [MeCab](https://taku910.github.io/mecab/).
 - Tokenize into characters.
 To use *MecabTokenizer*, you should `pip install transformers["ja"]` (or `pip install -e .["ja"]` if you install
 from source) to install dependencies.
 See [details on cl-tohoku repository](https://github.com/cl-tohoku/bert-japanese).
 Example of using a model with MeCab and WordPiece tokenization:
 ```python
 >>> import torch
 >>> from transformers import AutoModel, AutoTokenizer 
 >>> bertjapanese = AutoModel.from_pretrained("cl-tohoku/bert-base-japanese")
 >>> tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")
 >>> ## Input Japanese Text
 >>> line = "吾輩は猫である。"
 >>> inputs = tokenizer(line, return_tensors="pt")
 >>> print(tokenizer.decode(inputs['input_ids'][0]))
 [CLS] 吾輩 は 猫 で ある 。 [SEP]
 >>> outputs = bertjapanese(**inputs)
 ```
 Example of using a model with Character tokenization:
 ```python
 >>> bertjapanese = AutoModel.from_pretrained("cl-tohoku/bert-base-japanese-char")
 >>> tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-char")
 >>> ## Input Japanese Text
 >>> line = "吾輩は猫である。"
 >>> inputs = tokenizer(line, return_tensors="pt")
 >>> print(tokenizer.decode(inputs['input_ids'][0]))
 [CLS] 吾 輩 は 猫 で あ る 。 [SEP]
 >>> outputs = bertjapanese(**inputs)
 ```
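The character-level output above can be approximated in plain Python to show what "tokenize into characters" means. This is a simplified sketch with a hypothetical helper name, `char_tokenize`; the real tokenizer also normalizes the text and maps each character to a vocabulary id (or an unknown token), which is omitted here.

```python
def char_tokenize(text, cls_token="[CLS]", sep_token="[SEP]"):
    """Split text into single characters and wrap it in BERT-style special tokens.

    Simplified illustration: no normalization and no vocabulary lookup.
    """
    characters = [ch for ch in text if not ch.isspace()]
    return [cls_token] + characters + [sep_token]


tokens = char_tokenize("吾輩は猫である。")
print(" ".join(tokens))
```

Compared with the MeCab and WordPiece variant, this needs no extra dependency but produces longer sequences, since every character becomes its own token.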
 Tips:
 - This implementation is the same as BERT, except for the tokenization method. Refer to the [documentation of BERT](bert) for more usage examples.
 This model was contributed by [cl-tohoku](https://huggingface.co/cl-tohoku).
 ## BertJapaneseTokenizer
 [[autodoc]] BertJapaneseTokenizer
--- a/docs/source/model_doc/bert_japanese.rst
+++ b/docs/source/model_doc/bert_japanese.rst
@@ -1,80 +0,0 @@
 .. 
    Copyright 2020 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 BertJapanese
 -----------------------------------------------------------------------------------------------------------------------
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The BERT models trained on Japanese text.
 There are models with two different tokenization methods:
 - Tokenize with MeCab and WordPiece. This requires some extra dependencies, `fugashi
  <https://github.com/polm/fugashi>`__ which is a wrapper around `MeCab <https://taku910.github.io/mecab/>`__.
 - Tokenize into characters.
 To use `MecabTokenizer`, you should ``pip install transformers["ja"]`` (or ``pip install -e .["ja"]`` if you install
 from source) to install dependencies.
 See `details on cl-tohoku repository <https://github.com/cl-tohoku/bert-japanese>`__.
 Example of using a model with MeCab and WordPiece tokenization:
 .. code-block::
    >>> import torch
    >>> from transformers import AutoModel, AutoTokenizer 
    >>> bertjapanese = AutoModel.from_pretrained("cl-tohoku/bert-base-japanese")
    >>> tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")
    >>> ## Input Japanese Text
    >>> line = "吾輩は猫である。"
    >>> inputs = tokenizer(line, return_tensors="pt")
    >>> print(tokenizer.decode(inputs['input_ids'][0]))
    [CLS] 吾輩 は 猫 で ある 。 [SEP]
    >>> outputs = bertjapanese(**inputs)
 Example of using a model with Character tokenization:
 .. code-block::
    >>> bertjapanese = AutoModel.from_pretrained("cl-tohoku/bert-base-japanese-char")
    >>> tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-char")
    >>> ## Input Japanese Text
    >>> line = "吾輩は猫である。"
    >>> inputs = tokenizer(line, return_tensors="pt")
    >>> print(tokenizer.decode(inputs['input_ids'][0]))
    [CLS] 吾 輩 は 猫 で あ る 。 [SEP]
    >>> outputs = bertjapanese(**inputs)
 Tips:
 - This implementation is the same as BERT, except for tokenization method. Refer to the :doc:`documentation of BERT
  <bert>` for more usage examples.
 This model was contributed by `cl-tohoku <https://huggingface.co/cl-tohoku>`__.
 BertJapaneseTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BertJapaneseTokenizer
    :members: 
--- a/docs/source/model_doc/bertgeneration.mdx
+++ b/docs/source/model_doc/bertgeneration.mdx
@@ -0,0 +1,98 @@
 <!--Copyright 2020 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # BertGeneration
 ## Overview
 The BertGeneration model is a BERT model that can be leveraged for sequence-to-sequence tasks using
 [`EncoderDecoderModel`] as proposed in [Leveraging Pre-trained Checkpoints for Sequence Generation
 Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
 The abstract from the paper is the following:
 *Unsupervised pretraining of large neural models has recently revolutionized Natural Language Processing. By
 warm-starting from the publicly released checkpoints, NLP practitioners have pushed the state-of-the-art on multiple
 benchmarks while saving significant amounts of compute time. So far the focus has been mainly on the Natural Language
 Understanding tasks. In this paper, we demonstrate the efficacy of pre-trained checkpoints for Sequence Generation. We
 developed a Transformer-based sequence-to-sequence model that is compatible with publicly available pre-trained BERT,
 GPT-2 and RoBERTa checkpoints and conducted an extensive empirical study on the utility of initializing our model, both
 encoder and decoder, with these checkpoints. Our models result in new state-of-the-art results on Machine Translation,
 Text Summarization, Sentence Splitting, and Sentence Fusion.*
 Usage:
 - The model can be used in combination with the [`EncoderDecoderModel`] to leverage two pretrained
  BERT checkpoints for subsequent fine-tuning.
 ```python
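 >>> from transformers import BertGenerationDecoder, BertGenerationEncoder, BertTokenizer, EncoderDecoderModel
 >>> # leverage checkpoints for Bert2Bert model...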
 >>> # use BERT's cls token as BOS token and sep token as EOS token
 >>> encoder = BertGenerationEncoder.from_pretrained("bert-large-uncased", bos_token_id=101, eos_token_id=102)
 >>> # add cross attention layers and use BERT's cls token as BOS token and sep token as EOS token
 >>> decoder = BertGenerationDecoder.from_pretrained("bert-large-uncased", add_cross_attention=True, is_decoder=True, bos_token_id=101, eos_token_id=102)
 >>> bert2bert = EncoderDecoderModel(encoder=encoder, decoder=decoder)
 >>> # create tokenizer...
 >>> tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
 >>> input_ids = tokenizer('This is a long article to summarize', add_special_tokens=False, return_tensors="pt").input_ids
 >>> labels = tokenizer('This is a short summary', return_tensors="pt").input_ids
 >>> # train...
 >>> loss = bert2bert(input_ids=input_ids, decoder_input_ids=labels, labels=labels).loss
 >>> loss.backward()
 ```
 - Pretrained [`EncoderDecoderModel`] checkpoints are also directly available in the model hub, e.g.:
 ```python
 >>> from transformers import AutoTokenizer, EncoderDecoderModel
 >>> # instantiate sentence fusion model
 >>> sentence_fuser = EncoderDecoderModel.from_pretrained("google/roberta2roberta_L-24_discofuse")
 >>> tokenizer = AutoTokenizer.from_pretrained("google/roberta2roberta_L-24_discofuse")
 >>> input_ids = tokenizer('This is the first sentence. This is the second sentence.', add_special_tokens=False, return_tensors="pt").input_ids
 >>> outputs = sentence_fuser.generate(input_ids)
 >>> print(tokenizer.decode(outputs[0]))
 ```
 Tips:
 - [`BertGenerationEncoder`] and [`BertGenerationDecoder`] should be used in
  combination with [`EncoderDecoderModel`].
 - For summarization, sentence splitting, sentence fusion and translation, no special tokens are required for the input.
  Therefore, no EOS token should be added to the end of the input.
 This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The original code can be
 found [here](https://tfhub.dev/s?module-type=text-generation&subtype=module,placeholder).
 ## BertGenerationConfig
 [[autodoc]] BertGenerationConfig
 ## BertGenerationTokenizer
 [[autodoc]] BertGenerationTokenizer
    - save_vocabulary
 ## BertGenerationEncoder
 [[autodoc]] BertGenerationEncoder
    - forward
 ## BertGenerationDecoder
 [[autodoc]] BertGenerationDecoder
    - forward
--- a/docs/source/model_doc/bertgeneration.rst
+++ b/docs/source/model_doc/bertgeneration.rst
@@ -1,109 +0,0 @@
 .. 
    Copyright 2020 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 BertGeneration
 -----------------------------------------------------------------------------------------------------------------------
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The BertGeneration model is a BERT model that can be leveraged for sequence-to-sequence tasks using
 :class:`~transformers.EncoderDecoderModel` as proposed in `Leveraging Pre-trained Checkpoints for Sequence Generation
 Tasks <https://arxiv.org/abs/1907.12461>`__ by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
 The abstract from the paper is the following:
 *Unsupervised pretraining of large neural models has recently revolutionized Natural Language Processing. By
 warm-starting from the publicly released checkpoints, NLP practitioners have pushed the state-of-the-art on multiple
 benchmarks while saving significant amounts of compute time. So far the focus has been mainly on the Natural Language
 Understanding tasks. In this paper, we demonstrate the efficacy of pre-trained checkpoints for Sequence Generation. We
 developed a Transformer-based sequence-to-sequence model that is compatible with publicly available pre-trained BERT,
 GPT-2 and RoBERTa checkpoints and conducted an extensive empirical study on the utility of initializing our model, both
 encoder and decoder, with these checkpoints. Our models result in new state-of-the-art results on Machine Translation,
 Text Summarization, Sentence Splitting, and Sentence Fusion.*
 Usage:
 - The model can be used in combination with the :class:`~transformers.EncoderDecoderModel` to leverage two pretrained
  BERT checkpoints for subsequent fine-tuning.
 .. code-block::
    >>> # leverage checkpoints for Bert2Bert model...
    >>> # use BERT's cls token as BOS token and sep token as EOS token
    >>> encoder = BertGenerationEncoder.from_pretrained("bert-large-uncased", bos_token_id=101, eos_token_id=102)
    >>> # add cross attention layers and use BERT's cls token as BOS token and sep token as EOS token
    >>> decoder = BertGenerationDecoder.from_pretrained("bert-large-uncased", add_cross_attention=True, is_decoder=True, bos_token_id=101, eos_token_id=102)
    >>> bert2bert = EncoderDecoderModel(encoder=encoder, decoder=decoder)
    >>> # create tokenizer...
    >>> tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
    >>> input_ids = tokenizer('This is a long article to summarize', add_special_tokens=False, return_tensors="pt").input_ids
    >>> labels = tokenizer('This is a short summary', return_tensors="pt").input_ids
    >>> # train...
    >>> loss = bert2bert(input_ids=input_ids, decoder_input_ids=labels, labels=labels).loss
    >>> loss.backward()
 - Pretrained :class:`~transformers.EncoderDecoderModel` checkpoints are also directly available in the model hub, e.g.:
 .. code-block::
    >>> # instantiate sentence fusion model
    >>> sentence_fuser = EncoderDecoderModel.from_pretrained("google/roberta2roberta_L-24_discofuse")
    >>> tokenizer = AutoTokenizer.from_pretrained("google/roberta2roberta_L-24_discofuse")
    >>> input_ids = tokenizer('This is the first sentence. This is the second sentence.', add_special_tokens=False, return_tensors="pt").input_ids
    >>> outputs = sentence_fuser.generate(input_ids)
    >>> print(tokenizer.decode(outputs[0]))
 Tips:
 - :class:`~transformers.BertGenerationEncoder` and :class:`~transformers.BertGenerationDecoder` should be used in
  combination with :class:`~transformers.EncoderDecoderModel`.
 - For summarization, sentence splitting, sentence fusion and translation, no special tokens are required for the input.
  Therefore, no EOS token should be added to the end of the input.
 This model was contributed by `patrickvonplaten <https://huggingface.co/patrickvonplaten>`__. The original code can be
 found `here <https://tfhub.dev/s?module-type=text-generation&subtype=module,placeholder>`__.
 BertGenerationConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BertGenerationConfig
    :members:
 BertGenerationTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BertGenerationTokenizer
    :members: save_vocabulary
 BertGenerationEncoder
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BertGenerationEncoder
    :members: forward
 BertGenerationDecoder
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BertGenerationDecoder
    :members: forward
--- a/docs/source/model_doc/bertweet.mdx
+++ b/docs/source/model_doc/bertweet.mdx
@@ -0,0 +1,58 @@
 <!--Copyright 2020 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # BERTweet
 ## Overview
 The BERTweet model was proposed in [BERTweet: A pre-trained language model for English Tweets](https://www.aclweb.org/anthology/2020.emnlp-demos.2.pdf) by Dat Quoc Nguyen, Thanh Vu, Anh Tuan Nguyen.
 The abstract from the paper is the following:
 *We present BERTweet, the first public large-scale pre-trained language model for English Tweets. Our BERTweet, having
 the same architecture as BERT-base (Devlin et al., 2019), is trained using the RoBERTa pre-training procedure (Liu et
 al., 2019). Experiments show that BERTweet outperforms strong baselines RoBERTa-base and XLM-R-base (Conneau et al.,
 2020), producing better performance results than the previous state-of-the-art models on three Tweet NLP tasks:
 Part-of-speech tagging, Named-entity recognition and text classification.*
 Example of use:
 ```python
 >>> import torch
 >>> from transformers import AutoModel, AutoTokenizer 
 >>> bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
 >>> # For transformers v4.x+: 
 >>> tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
 >>> # For transformers v3.x: 
 >>> # tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
 >>> # INPUT TWEET IS ALREADY NORMALIZED!
 >>> line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"
 >>> input_ids = torch.tensor([tokenizer.encode(line)])
 >>> with torch.no_grad():
 ...     features = bertweet(input_ids)  # Model outputs are now tuples
 >>> # With TensorFlow 2.0+:
 >>> # from transformers import TFAutoModel
 >>> # bertweet = TFAutoModel.from_pretrained("vinai/bertweet-base")
 ```
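The example above passes a tweet that is already normalized: the URL appears as `HTTPURL` and the user mention as `@USER`. As a rough illustration of that preprocessing step, the sketch below replaces URLs and mentions with those placeholders using regular expressions. The helper `normalize_tweet` and its patterns are assumptions made for this illustration; they are not part of `transformers`, and the original BERTweet preprocessing does more (for example, it also converts emoji into text strings such as `:cry:`).

```python
import re

# Hypothetical helper for illustration only; not the official BERTweet normalizer.
_URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
_MENTION_PATTERN = re.compile(r"@\w+")


def normalize_tweet(text: str) -> str:
    """Replace URLs with HTTPURL and user mentions with @USER."""
    text = _URL_PATTERN.sub("HTTPURL", text)
    text = _MENTION_PATTERN.sub("@USER", text)
    # Collapse repeated whitespace left over from the raw tweet.
    return " ".join(text.split())


print(normalize_tweet("DHEC confirms https://example.com/news via @scdhec"))
# prints: DHEC confirms HTTPURL via @USER
```

The normalized string can then be passed to `tokenizer.encode` as in the example above.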
 This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen). The original code can be found [here](https://github.com/VinAIResearch/BERTweet).
 ## BertweetTokenizer
 [[autodoc]] BertweetTokenizer
--- a/docs/source/model_doc/bertweet.rst
+++ b/docs/source/model_doc/bertweet.rst
@@ -1,64 +0,0 @@
 .. 
    Copyright 2020 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 BERTweet
 -----------------------------------------------------------------------------------------------------------------------
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The BERTweet model was proposed in `BERTweet: A pre-trained language model for English Tweets
 <https://www.aclweb.org/anthology/2020.emnlp-demos.2.pdf>`__ by Dat Quoc Nguyen, Thanh Vu, Anh Tuan Nguyen.
 The abstract from the paper is the following:
 *We present BERTweet, the first public large-scale pre-trained language model for English Tweets. Our BERTweet, having
 the same architecture as BERT-base (Devlin et al., 2019), is trained using the RoBERTa pre-training procedure (Liu et
 al., 2019). Experiments show that BERTweet outperforms strong baselines RoBERTa-base and XLM-R-base (Conneau et al.,
 2020), producing better performance results than the previous state-of-the-art models on three Tweet NLP tasks:
 Part-of-speech tagging, Named-entity recognition and text classification.*
 Example of use:
 .. code-block::
    >>> import torch
    >>> from transformers import AutoModel, AutoTokenizer 
    >>> bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
    >>> # For transformers v4.x+: 
    >>> tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
    >>> # For transformers v3.x: 
    >>> # tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
    >>> # INPUT TWEET IS ALREADY NORMALIZED!
    >>> line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"
    >>> input_ids = torch.tensor([tokenizer.encode(line)])
    >>> with torch.no_grad():
    ...     features = bertweet(input_ids)  # Model outputs are now tuples
    >>> # With TensorFlow 2.0+:
    >>> # from transformers import TFAutoModel
    >>> # bertweet = TFAutoModel.from_pretrained("vinai/bertweet-base")
 This model was contributed by `dqnguyen <https://huggingface.co/dqnguyen>`__. The original code can be found `here
 <https://github.com/VinAIResearch/BERTweet>`__.
 BertweetTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BertweetTokenizer
    :members: 
--- a/docs/source/model_doc/bigbird.mdx
+++ b/docs/source/model_doc/bigbird.mdx
@@ -0,0 +1,146 @@
 <!--Copyright 2021 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # BigBird
 ## Overview
 The BigBird model was proposed in [Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062) by
 Zaheer, Manzil and Guruganesh, Guru and Dubey, Kumar Avinava and Ainslie, Joshua and Alberti, Chris and Ontanon,
 Santiago and Pham, Philip and Ravula, Anirudh and Wang, Qifan and Yang, Li and others. BigBird is a sparse-attention
 based transformer which extends Transformer-based models, such as BERT, to much longer sequences. In addition to sparse
 attention, BigBird also applies global attention as well as random attention to the input sequence. Theoretically, it
 has been shown that applying sparse, global, and random attention approximates full attention, while being
 computationally much more efficient for longer sequences. As a consequence of the capability to handle longer context,
 BigBird has shown improved performance on various long document NLP tasks, such as question answering and
 summarization, compared to BERT or RoBERTa.
 The abstract from the paper is the following:
 *Transformers-based models, such as BERT, have been one of the most successful deep learning models for NLP.
 Unfortunately, one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence
 length due to their full attention mechanism. To remedy this, we propose, BigBird, a sparse attention mechanism that
 reduces this quadratic dependency to linear. We show that BigBird is a universal approximator of sequence functions and
 is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our
 theoretical analysis reveals some of the benefits of having O(1) global tokens (such as CLS), that attend to the entire
 sequence as part of the sparse attention mechanism. The proposed sparse attention can handle sequences of length up to
 8x of what was previously possible using similar hardware. As a consequence of the capability to handle longer context,
 BigBird drastically improves performance on various NLP tasks such as question answering and summarization. We also
 propose novel applications to genomics data.*
 Tips:
 - For an in-depth explanation of how BigBird's attention works, see [this blog post](https://huggingface.co/blog/big-bird).
 - BigBird comes with 2 implementations: **original_full** & **block_sparse**. For sequence lengths < 1024, using
  **original_full** is advised as there is no benefit in using **block_sparse** attention.
 - The code currently uses a window size of 3 blocks and 2 global blocks.
 - Sequence length must be divisible by block size.
 - The current implementation supports only **ITC** (internal transformer construction).
 - The current implementation doesn't support **num_random_blocks = 0**.
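Two of the tips above (the divisibility constraint and the advice for sequence lengths below 1024) can be sketched as plain Python helpers. This is a minimal sketch under stated assumptions: the function names are hypothetical and not part of `transformers`, the block size of 64 is just an example value, and the 1024 threshold is taken from the tip above rather than from any library check.

```python
def pad_to_block_multiple(seq_len: int, block_size: int = 64) -> int:
    """Return the smallest length >= seq_len that is divisible by block_size."""
    if seq_len <= 0 or block_size <= 0:
        raise ValueError("seq_len and block_size must be positive")
    # Ceiling division without floating point.
    return -(-seq_len // block_size) * block_size


def choose_attention_type(seq_len: int, threshold: int = 1024) -> str:
    """Follow the tip: full attention for short inputs, block sparse otherwise."""
    return "original_full" if seq_len < threshold else "block_sparse"


print(pad_to_block_multiple(1000))   # prints: 1024
print(choose_attention_type(512))    # prints: original_full
print(choose_attention_type(4096))   # prints: block_sparse
```

The returned string matches the two implementation names listed above, and the padded length is the target to pad the tokenized input to before running block sparse attention.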
 This model was contributed by [vasudevgupta](https://huggingface.co/vasudevgupta). The original code can be found
 [here](https://github.com/google-research/bigbird).
 ## BigBirdConfig
 [[autodoc]] BigBirdConfig
 ## BigBirdTokenizer
 [[autodoc]] BigBirdTokenizer
    - build_inputs_with_special_tokens
    - get_special_tokens_mask
    - create_token_type_ids_from_sequences
    - save_vocabulary
 ## BigBirdTokenizerFast
 [[autodoc]] BigBirdTokenizerFast
 ## BigBird specific outputs
 [[autodoc]] models.big_bird.modeling_big_bird.BigBirdForPreTrainingOutput
 ## BigBirdModel
 [[autodoc]] BigBirdModel
    - forward
 ## BigBirdForPreTraining
 [[autodoc]] BigBirdForPreTraining
    - forward
 ## BigBirdForCausalLM
 [[autodoc]] BigBirdForCausalLM
    - forward
 ## BigBirdForMaskedLM
 [[autodoc]] BigBirdForMaskedLM
    - forward
 ## BigBirdForSequenceClassification
 [[autodoc]] BigBirdForSequenceClassification
    - forward
 ## BigBirdForMultipleChoice
 [[autodoc]] BigBirdForMultipleChoice
    - forward
 ## BigBirdForTokenClassification
 [[autodoc]] BigBirdForTokenClassification
    - forward
 ## BigBirdForQuestionAnswering
 [[autodoc]] BigBirdForQuestionAnswering
    - forward
 ## FlaxBigBirdModel
 [[autodoc]] FlaxBigBirdModel
    - __call__
 ## FlaxBigBirdForPreTraining
 [[autodoc]] FlaxBigBirdForPreTraining
    - __call__
 ## FlaxBigBirdForMaskedLM
 [[autodoc]] FlaxBigBirdForMaskedLM
    - __call__
 ## FlaxBigBirdForSequenceClassification
 [[autodoc]] FlaxBigBirdForSequenceClassification
    - __call__
 ## FlaxBigBirdForMultipleChoice
 [[autodoc]] FlaxBigBirdForMultipleChoice
    - __call__
 ## FlaxBigBirdForTokenClassification
 [[autodoc]] FlaxBigBirdForTokenClassification
    - __call__
 ## FlaxBigBirdForQuestionAnswering
 [[autodoc]] FlaxBigBirdForQuestionAnswering
    - __call__
--- a/docs/source/model_doc/bigbird.rst
+++ b/docs/source/model_doc/bigbird.rst
@@ -1,185 +0,0 @@
 .. 
    Copyright 2021 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 BigBird
 -----------------------------------------------------------------------------------------------------------------------
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The BigBird model was proposed in `Big Bird: Transformers for Longer Sequences <https://arxiv.org/abs/2007.14062>`__ by
 Zaheer, Manzil and Guruganesh, Guru and Dubey, Kumar Avinava and Ainslie, Joshua and Alberti, Chris and Ontanon,
 Santiago and Pham, Philip and Ravula, Anirudh and Wang, Qifan and Yang, Li and others. BigBird is a sparse-attention
 based transformer which extends Transformer-based models, such as BERT, to much longer sequences. In addition to sparse
 attention, BigBird also applies global attention as well as random attention to the input sequence. Theoretically, it
 has been shown that applying sparse, global, and random attention approximates full attention, while being
 computationally much more efficient for longer sequences. As a consequence of the capability to handle longer context,
 BigBird has shown improved performance on various long document NLP tasks, such as question answering and
 summarization, compared to BERT or RoBERTa.
 The abstract from the paper is the following:
 *Transformers-based models, such as BERT, have been one of the most successful deep learning models for NLP.
 Unfortunately, one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence
 length due to their full attention mechanism. To remedy this, we propose, BigBird, a sparse attention mechanism that
 reduces this quadratic dependency to linear. We show that BigBird is a universal approximator of sequence functions and
 is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our
 theoretical analysis reveals some of the benefits of having O(1) global tokens (such as CLS), that attend to the entire
 sequence as part of the sparse attention mechanism. The proposed sparse attention can handle sequences of length up to
 8x of what was previously possible using similar hardware. As a consequence of the capability to handle longer context,
 BigBird drastically improves performance on various NLP tasks such as question answering and summarization. We also
 propose novel applications to genomics data.*
 Tips:
 - For an in-depth explanation of how BigBird's attention works, see `this blog post
  <https://huggingface.co/blog/big-bird>`__.
 - BigBird comes with 2 implementations: **original_full** & **block_sparse**. For sequence lengths < 1024, using
  **original_full** is advised as there is no benefit in using **block_sparse** attention.
 - The code currently uses a window size of 3 blocks and 2 global blocks.
 - Sequence length must be divisible by block size.
 - The current implementation supports only **ITC** (internal transformer construction).
 - The current implementation doesn't support **num_random_blocks = 0**.
 This model was contributed by `vasudevgupta <https://huggingface.co/vasudevgupta>`__. The original code can be found
 `here <https://github.com/google-research/bigbird>`__.
 BigBirdConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BigBirdConfig
    :members:
 BigBirdTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BigBirdTokenizer
    :members: build_inputs_with_special_tokens, get_special_tokens_mask,
        create_token_type_ids_from_sequences, save_vocabulary
 BigBirdTokenizerFast
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BigBirdTokenizerFast
    :members:
 BigBird specific outputs
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.models.big_bird.modeling_big_bird.BigBirdForPreTrainingOutput
    :members:
 BigBirdModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BigBirdModel
    :members: forward
 BigBirdForPreTraining
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BigBirdForPreTraining
    :members: forward
 BigBirdForCausalLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BigBirdForCausalLM
    :members: forward
 BigBirdForMaskedLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BigBirdForMaskedLM
    :members: forward
 BigBirdForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BigBirdForSequenceClassification
    :members: forward
 BigBirdForMultipleChoice
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BigBirdForMultipleChoice
    :members: forward
 BigBirdForTokenClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BigBirdForTokenClassification
    :members: forward
 BigBirdForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BigBirdForQuestionAnswering
    :members: forward
 FlaxBigBirdModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxBigBirdModel
    :members: __call__
 FlaxBigBirdForPreTraining
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxBigBirdForPreTraining
    :members: __call__
 FlaxBigBirdForMaskedLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxBigBirdForMaskedLM
    :members: __call__
 FlaxBigBirdForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxBigBirdForSequenceClassification
    :members: __call__
 FlaxBigBirdForMultipleChoice
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxBigBirdForMultipleChoice
    :members: __call__
 FlaxBigBirdForTokenClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxBigBirdForTokenClassification
    :members: __call__
 FlaxBigBirdForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxBigBirdForQuestionAnswering
    :members: __call__
--- a/docs/source/model_doc/bigbird_pegasus.mdx
+++ b/docs/source/model_doc/bigbird_pegasus.mdx
@@ -0,0 +1,81 @@
 <!--Copyright 2021 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # BigBirdPegasus
 ## Overview
 The BigBird model was proposed in [Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062) by
 Zaheer, Manzil and Guruganesh, Guru and Dubey, Kumar Avinava and Ainslie, Joshua and Alberti, Chris and Ontanon,
 Santiago and Pham, Philip and Ravula, Anirudh and Wang, Qifan and Yang, Li and others. BigBird is a sparse-attention
 based transformer which extends Transformer-based models, such as BERT, to much longer sequences. In addition to sparse
 attention, BigBird also applies global attention as well as random attention to the input sequence. Theoretically, it
 has been shown that applying sparse, global, and random attention approximates full attention, while being
 computationally much more efficient for longer sequences. As a consequence of the capability to handle longer context,
 BigBird has shown improved performance on various long document NLP tasks, such as question answering and
 summarization, compared to BERT or RoBERTa.
 The abstract from the paper is the following:
 *Transformers-based models, such as BERT, have been one of the most successful deep learning models for NLP.
 Unfortunately, one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence
 length due to their full attention mechanism. To remedy this, we propose, BigBird, a sparse attention mechanism that
 reduces this quadratic dependency to linear. We show that BigBird is a universal approximator of sequence functions and
 is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our
 theoretical analysis reveals some of the benefits of having O(1) global tokens (such as CLS), that attend to the entire
 sequence as part of the sparse attention mechanism. The proposed sparse attention can handle sequences of length up to
 8x of what was previously possible using similar hardware. As a consequence of the capability to handle longer context,
 BigBird drastically improves performance on various NLP tasks such as question answering and summarization. We also
 propose novel applications to genomics data.*
 Tips:
 - For an in-depth explanation of how BigBird's attention works, see [this blog post](https://huggingface.co/blog/big-bird).
 - BigBird comes with 2 implementations: **original_full** & **block_sparse**. For sequence lengths < 1024, using
  **original_full** is advised as there is no benefit in using **block_sparse** attention.
 - The code currently uses a window size of 3 blocks and 2 global blocks.
 - Sequence length must be divisible by block size.
 - The current implementation supports only **ITC** (internal transformer construction).
 - The current implementation doesn't support **num_random_blocks = 0**.
 - BigBirdPegasus uses the [PegasusTokenizer](https://github.com/huggingface/transformers/blob/master/src/transformers/models/pegasus/tokenization_pegasus.py).
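To give a feel for why block sparse attention scales better than full attention, the sketch below counts query-key score entries for both variants, using the window size of 3 blocks and 2 global blocks mentioned in the tips. It is a back-of-the-envelope estimate, not the library's implementation: the function name is hypothetical, the block size and number of random blocks are example values, and the count ignores the fact that the global blocks themselves attend to the whole sequence.

```python
def attention_score_counts(seq_len: int, block_size: int, num_random_blocks: int) -> tuple:
    """Return (full, sparse) counts of query-key scores under a simplified model."""
    if seq_len % block_size != 0:
        raise ValueError("sequence length must be divisible by block size")
    num_blocks = seq_len // block_size
    # Each query block attends to 3 window blocks, 2 global blocks and the random blocks,
    # capped at the total number of blocks in the sequence.
    blocks_per_query = min(num_blocks, 3 + 2 + num_random_blocks)
    full = seq_len * seq_len
    sparse = num_blocks * blocks_per_query * block_size * block_size
    return full, sparse


full, sparse = attention_score_counts(seq_len=4096, block_size=64, num_random_blocks=3)
print(full, sparse, full // sparse)
# prints: 16777216 2097152 8
```

Under this simplified model the sparse count grows linearly with the sequence length, since each query block attends to a fixed number of key blocks, while the full count grows quadratically.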
 The original code can be found [here](https://github.com/google-research/bigbird).
 ## BigBirdPegasusConfig
 [[autodoc]] BigBirdPegasusConfig
    - all
 ## BigBirdPegasusModel
 [[autodoc]] BigBirdPegasusModel
    - forward
 ## BigBirdPegasusForConditionalGeneration
 [[autodoc]] BigBirdPegasusForConditionalGeneration
    - forward
 ## BigBirdPegasusForSequenceClassification
 [[autodoc]] BigBirdPegasusForSequenceClassification
    - forward
 ## BigBirdPegasusForQuestionAnswering
 [[autodoc]] BigBirdPegasusForQuestionAnswering
    - forward
 ## BigBirdPegasusForCausalLM
 [[autodoc]] BigBirdPegasusForCausalLM
    - forward
--- a/docs/source/model_doc/bigbird_pegasus.rst
+++ b/docs/source/model_doc/bigbird_pegasus.rst
@@ -1,98 +0,0 @@
 .. 
    Copyright 2021 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 BigBirdPegasus
 -----------------------------------------------------------------------------------------------------------------------
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The BigBird model was proposed in `Big Bird: Transformers for Longer Sequences <https://arxiv.org/abs/2007.14062>`__ by
 Zaheer, Manzil and Guruganesh, Guru and Dubey, Kumar Avinava and Ainslie, Joshua and Alberti, Chris and Ontanon,
 Santiago and Pham, Philip and Ravula, Anirudh and Wang, Qifan and Yang, Li and others. BigBird is a sparse-attention
 based transformer which extends Transformer-based models, such as BERT, to much longer sequences. In addition to sparse
 attention, BigBird also applies global attention as well as random attention to the input sequence. Theoretically, it
 has been shown that applying sparse, global, and random attention approximates full attention, while being
 computationally much more efficient for longer sequences. As a consequence of the capability to handle longer context,
 BigBird has shown improved performance on various long document NLP tasks, such as question answering and
 summarization, compared to BERT or RoBERTa.
 The abstract from the paper is the following:
 *Transformers-based models, such as BERT, have been one of the most successful deep learning models for NLP.
 Unfortunately, one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence
 length due to their full attention mechanism. To remedy this, we propose, BigBird, a sparse attention mechanism that
 reduces this quadratic dependency to linear. We show that BigBird is a universal approximator of sequence functions and
 is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our
 theoretical analysis reveals some of the benefits of having O(1) global tokens (such as CLS), that attend to the entire
 sequence as part of the sparse attention mechanism. The proposed sparse attention can handle sequences of length up to
 8x of what was previously possible using similar hardware. As a consequence of the capability to handle longer context,
 BigBird drastically improves performance on various NLP tasks such as question answering and summarization. We also
 propose novel applications to genomics data.*
 Tips:
 - For an in-depth explanation of how BigBird's attention works, see `this blog post
  <https://huggingface.co/blog/big-bird>`__.
 - BigBird comes with 2 implementations: **original_full** & **block_sparse**. For sequence lengths < 1024, using
  **original_full** is advised as there is no benefit in using **block_sparse** attention.
 - The code currently uses a window size of 3 blocks and 2 global blocks.
 - Sequence length must be divisible by block size.
 - Current implementation supports only **ITC**.
 - Current implementation doesn't support **num_random_blocks = 0**.
 - BigBirdPegasus uses the `PegasusTokenizer
  <https://github.com/huggingface/transformers/blob/master/src/transformers/models/pegasus/tokenization_pegasus.py>`__.
 The original code can be found `here <https://github.com/google-research/bigbird>`__.
 BigBirdPegasusConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BigBirdPegasusConfig
    :members:
 BigBirdPegasusModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BigBirdPegasusModel
    :members: forward
 BigBirdPegasusForConditionalGeneration
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BigBirdPegasusForConditionalGeneration
    :members: forward
 BigBirdPegasusForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BigBirdPegasusForSequenceClassification
    :members: forward
 BigBirdPegasusForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BigBirdPegasusForQuestionAnswering
    :members: forward
 BigBirdPegasusForCausalLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BigBirdPegasusForCausalLM
    :members: forward
--- a/docs/source/model_doc/blenderbot.mdx
+++ b/docs/source/model_doc/blenderbot.mdx
@@ -0,0 +1,118 @@
 <!--Copyright 2020 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # Blenderbot
 **DISCLAIMER:** If you see something strange, file a [GitHub Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title).
 ## Overview
 The Blender chatbot model was proposed in [Recipes for building an open-domain chatbot](https://arxiv.org/pdf/2004.13637.pdf) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu,
 Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston on 30 Apr 2020.
 The abstract of the paper is the following:
 *Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that
 scaling neural models in the number of parameters and the size of the data they are trained on gives improved results,
 we show that other ingredients are important for a high-performing chatbot. Good conversation requires a number of
 skills that an expert conversationalist blends in a seamless way: providing engaging talking points and listening to
 their partners, and displaying knowledge, empathy and personality appropriately, while maintaining a consistent
 persona. We show that large scale models can learn these skills when given appropriate training data and choice of
 generation strategy. We build variants of these recipes with 90M, 2.7B and 9.4B parameter models, and make our models
 and code publicly available. Human evaluations show our best models are superior to existing approaches in multi-turn
 dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing
 failure cases of our models.*
 This model was contributed by [sshleifer](https://huggingface.co/sshleifer). The authors' code can be found [here](https://github.com/facebookresearch/ParlAI).
 ## Implementation Notes
 - Blenderbot uses a standard [seq2seq model transformer](https://arxiv.org/pdf/1706.03762.pdf) based architecture.
 - Available checkpoints can be found in the [model hub](https://huggingface.co/models?search=blenderbot).
 - This is the *default* Blenderbot model class. However, some smaller checkpoints, such as
  `facebook/blenderbot_small-90M`, have a different architecture and consequently should be used with
  [BlenderbotSmall](blenderbot_small).
 ## Usage
 Here is an example of model usage:
 ```python
 >>> from transformers import BlenderbotTokenizer, BlenderbotForConditionalGeneration
 >>> mname = 'facebook/blenderbot-400M-distill'
 >>> model = BlenderbotForConditionalGeneration.from_pretrained(mname)
 >>> tokenizer = BlenderbotTokenizer.from_pretrained(mname)
 >>> UTTERANCE = "My friends are cool but they eat too many carbs."
 >>> inputs = tokenizer([UTTERANCE], return_tensors='pt')
 >>> reply_ids = model.generate(**inputs)
 >>> print(tokenizer.batch_decode(reply_ids))
 ["<s> That's unfortunate. Are they trying to lose weight or are they just trying to be healthier?</s>"]
 ```
 ## BlenderbotConfig
 [[autodoc]] BlenderbotConfig
 ## BlenderbotTokenizer
 [[autodoc]] BlenderbotTokenizer
    - build_inputs_with_special_tokens
 ## BlenderbotTokenizerFast
 [[autodoc]] BlenderbotTokenizerFast
    - build_inputs_with_special_tokens
 ## BlenderbotModel
 See [`~transformers.BartModel`] for arguments to *forward* and *generate*.
 [[autodoc]] BlenderbotModel
    - forward
 ## BlenderbotForConditionalGeneration
 See [`~transformers.BartForConditionalGeneration`] for arguments to *forward* and *generate*
 [[autodoc]] BlenderbotForConditionalGeneration
    - forward
 ## BlenderbotForCausalLM
 [[autodoc]] BlenderbotForCausalLM
    - forward
 ## TFBlenderbotModel
 [[autodoc]] TFBlenderbotModel
    - call
 ## TFBlenderbotForConditionalGeneration
 [[autodoc]] TFBlenderbotForConditionalGeneration
    - call
 ## FlaxBlenderbotModel
 [[autodoc]] FlaxBlenderbotModel
    - __call__
    - encode
    - decode
 ## FlaxBlenderbotForConditionalGeneration
 [[autodoc]] FlaxBlenderbotForConditionalGeneration
    - __call__
    - encode
    - decode
--- a/docs/source/model_doc/blenderbot.rst
+++ b/docs/source/model_doc/blenderbot.rst
@@ -1,141 +0,0 @@
 .. 
    Copyright 2020 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 Blenderbot
 -----------------------------------------------------------------------------------------------------------------------
 **DISCLAIMER:** If you see something strange, file a `Github Issue
 <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ .
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The Blender chatbot model was proposed in `Recipes for building an open-domain chatbot
 <https://arxiv.org/pdf/2004.13637.pdf>`__ by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu,
 Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston on 30 Apr 2020.
 The abstract of the paper is the following:
 *Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that
 scaling neural models in the number of parameters and the size of the data they are trained on gives improved results,
 we show that other ingredients are important for a high-performing chatbot. Good conversation requires a number of
 skills that an expert conversationalist blends in a seamless way: providing engaging talking points and listening to
 their partners, and displaying knowledge, empathy and personality appropriately, while maintaining a consistent
 persona. We show that large scale models can learn these skills when given appropriate training data and choice of
 generation strategy. We build variants of these recipes with 90M, 2.7B and 9.4B parameter models, and make our models
 and code publicly available. Human evaluations show our best models are superior to existing approaches in multi-turn
 dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing
 failure cases of our models.*
 This model was contributed by `sshleifer <https://huggingface.co/sshleifer>`__. The authors' code can be found `here
 <https://github.com/facebookresearch/ParlAI>`__ .
 Implementation Notes
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 - Blenderbot uses a standard `seq2seq model transformer <https://arxiv.org/pdf/1706.03762.pdf>`__ based architecture.
 - Available checkpoints can be found in the `model hub <https://huggingface.co/models?search=blenderbot>`__.
 - This is the `default` Blenderbot model class. However, some smaller checkpoints, such as
  ``facebook/blenderbot_small_90M``, have a different architecture and consequently should be used with
  `BlenderbotSmall <blenderbot_small>`__.
 Usage
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 Here is an example of model usage:
 .. code-block::
        >>> from transformers import BlenderbotTokenizer, BlenderbotForConditionalGeneration
        >>> mname = 'facebook/blenderbot-400M-distill'
        >>> model = BlenderbotForConditionalGeneration.from_pretrained(mname)
        >>> tokenizer = BlenderbotTokenizer.from_pretrained(mname)
        >>> UTTERANCE = "My friends are cool but they eat too many carbs."
        >>> inputs = tokenizer([UTTERANCE], return_tensors='pt')
        >>> reply_ids = model.generate(**inputs)
        >>> print(tokenizer.batch_decode(reply_ids))
        ["<s> That's unfortunate. Are they trying to lose weight or are they just trying to be healthier?</s>"]
 BlenderbotConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BlenderbotConfig
    :members:
 BlenderbotTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BlenderbotTokenizer
    :members: build_inputs_with_special_tokens
 BlenderbotTokenizerFast
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BlenderbotTokenizerFast
    :members: build_inputs_with_special_tokens
 BlenderbotModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 See :obj:`transformers.BartModel` for arguments to `forward` and `generate`
 .. autoclass:: transformers.BlenderbotModel
    :members: forward
 BlenderbotForConditionalGeneration
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 See :obj:`transformers.BartForConditionalGeneration` for arguments to `forward` and `generate`
 .. autoclass:: transformers.BlenderbotForConditionalGeneration
    :members: forward
 BlenderbotForCausalLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BlenderbotForCausalLM
    :members: forward
 TFBlenderbotModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFBlenderbotModel
    :members: call
 TFBlenderbotForConditionalGeneration
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFBlenderbotForConditionalGeneration
    :members: call
 FlaxBlenderbotModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxBlenderbotModel
    :members: __call__, encode, decode
 FlaxBlenderbotForConditionalGeneration
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxBlenderbotForConditionalGeneration
    :members: __call__, encode, decode
--- a/docs/source/model_doc/blenderbot_small.mdx
+++ b/docs/source/model_doc/blenderbot_small.mdx
@@ -0,0 +1,95 @@
 <!--Copyright 2020 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # Blenderbot Small
 Note that [`BlenderbotSmallModel`] and
 [`BlenderbotSmallForConditionalGeneration`] are only used in combination with the checkpoint
 [facebook/blenderbot-90M](https://huggingface.co/facebook/blenderbot-90M). Larger Blenderbot checkpoints should
 instead be used with [`BlenderbotModel`] and
 [`BlenderbotForConditionalGeneration`].
 ## Overview
 The Blender chatbot model was proposed in [Recipes for building an open-domain chatbot](https://arxiv.org/pdf/2004.13637.pdf) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu,
 Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston on 30 Apr 2020.
 The abstract of the paper is the following:
 *Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that
 scaling neural models in the number of parameters and the size of the data they are trained on gives improved results,
 we show that other ingredients are important for a high-performing chatbot. Good conversation requires a number of
 skills that an expert conversationalist blends in a seamless way: providing engaging talking points and listening to
 their partners, and displaying knowledge, empathy and personality appropriately, while maintaining a consistent
 persona. We show that large scale models can learn these skills when given appropriate training data and choice of
 generation strategy. We build variants of these recipes with 90M, 2.7B and 9.4B parameter models, and make our models
 and code publicly available. Human evaluations show our best models are superior to existing approaches in multi-turn
 dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing
 failure cases of our models.*
 This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The authors' code can be
 found [here](https://github.com/facebookresearch/ParlAI).
 ## BlenderbotSmallConfig
 [[autodoc]] BlenderbotSmallConfig
 ## BlenderbotSmallTokenizer
 [[autodoc]] BlenderbotSmallTokenizer
    - build_inputs_with_special_tokens
    - get_special_tokens_mask
    - create_token_type_ids_from_sequences
    - save_vocabulary
 ## BlenderbotSmallTokenizerFast
 [[autodoc]] BlenderbotSmallTokenizerFast
 ## BlenderbotSmallModel
 [[autodoc]] BlenderbotSmallModel
    - forward
 ## BlenderbotSmallForConditionalGeneration
 [[autodoc]] BlenderbotSmallForConditionalGeneration
    - forward
 ## BlenderbotSmallForCausalLM
 [[autodoc]] BlenderbotSmallForCausalLM
    - forward
 ## TFBlenderbotSmallModel
 [[autodoc]] TFBlenderbotSmallModel
    - call
 ## TFBlenderbotSmallForConditionalGeneration
 [[autodoc]] TFBlenderbotSmallForConditionalGeneration
    - call
 ## FlaxBlenderbotSmallModel
 [[autodoc]] FlaxBlenderbotSmallModel
    - __call__
    - encode
    - decode
 ## FlaxBlenderbotSmallForConditionalGeneration
 [[autodoc]] FlaxBlenderbotSmallForConditionalGeneration
    - __call__
    - encode
    - decode
--- a/docs/source/model_doc/blenderbot_small.rst
+++ b/docs/source/model_doc/blenderbot_small.rst
@@ -1,113 +0,0 @@
 .. 
    Copyright 2020 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 Blenderbot Small
 -----------------------------------------------------------------------------------------------------------------------
 Note that :class:`~transformers.BlenderbotSmallModel` and
 :class:`~transformers.BlenderbotSmallForConditionalGeneration` are only used in combination with the checkpoint
 `facebook/blenderbot-90M <https://huggingface.co/facebook/blenderbot-90M>`__. Larger Blenderbot checkpoints should
 instead be used with :class:`~transformers.BlenderbotModel` and
 :class:`~transformers.BlenderbotForConditionalGeneration`
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The Blender chatbot model was proposed in `Recipes for building an open-domain chatbot
 <https://arxiv.org/pdf/2004.13637.pdf>`__ by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu,
 Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston on 30 Apr 2020.
 The abstract of the paper is the following:
 *Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that
 scaling neural models in the number of parameters and the size of the data they are trained on gives improved results,
 we show that other ingredients are important for a high-performing chatbot. Good conversation requires a number of
 skills that an expert conversationalist blends in a seamless way: providing engaging talking points and listening to
 their partners, and displaying knowledge, empathy and personality appropriately, while maintaining a consistent
 persona. We show that large scale models can learn these skills when given appropriate training data and choice of
 generation strategy. We build variants of these recipes with 90M, 2.7B and 9.4B parameter models, and make our models
 and code publicly available. Human evaluations show our best models are superior to existing approaches in multi-turn
 dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing
 failure cases of our models.*
 This model was contributed by `patrickvonplaten <https://huggingface.co/patrickvonplaten>`__. The authors' code can be
 found `here <https://github.com/facebookresearch/ParlAI>`__ .
 BlenderbotSmallConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BlenderbotSmallConfig
    :members:
 BlenderbotSmallTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BlenderbotSmallTokenizer
    :members: build_inputs_with_special_tokens, get_special_tokens_mask,
        create_token_type_ids_from_sequences, save_vocabulary
 BlenderbotSmallTokenizerFast
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BlenderbotSmallTokenizerFast
    :members:
 BlenderbotSmallModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BlenderbotSmallModel
    :members: forward
 BlenderbotSmallForConditionalGeneration
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BlenderbotSmallForConditionalGeneration
    :members: forward
 BlenderbotSmallForCausalLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.BlenderbotSmallForCausalLM
    :members: forward
 TFBlenderbotSmallModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFBlenderbotSmallModel
    :members: call
 TFBlenderbotSmallForConditionalGeneration
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFBlenderbotSmallForConditionalGeneration
    :members: call
 FlaxBlenderbotSmallModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxBlenderbotSmallModel
    :members: __call__, encode, decode
 FlaxBlenderbotSmallForConditionalGeneration
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxBlenderbotSmallForConditionalGeneration
    :members: __call__, encode, decode
--- a/docs/source/model_doc/bort.mdx
+++ b/docs/source/model_doc/bort.mdx
@@ -1,5 +1,4 @@
-.. 
-   Copyright 2020 The HuggingFace Team. All rights reserved.
+<!--Copyright 2020 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
@@ -9,14 +8,13 @@
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
-BORT
------------------------------------------------------------------------------------------------------------------------
+# BORT
-Overview
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+## Overview
-The BORT model was proposed in `Optimal Subarchitecture Extraction for BERT <https://arxiv.org/abs/2010.10499>`__ by
+The BORT model was proposed in [Optimal Subarchitecture Extraction for BERT](https://arxiv.org/abs/2010.10499) by
 Adrian de Wynter and Daniel J. Perry. It is an optimal subset of architectural parameters for the BERT, which the
 authors refer to as "Bort".
@@ -34,14 +32,11 @@ absolute, with respect to BERT-large, on multiple public natural language unders
 Tips:
-- BORT's model architecture is based on BERT, so one can refer to :doc:`BERT's documentation page <bert>` for the
+- BORT's model architecture is based on BERT, so one can refer to [BERT's documentation page](bert) for the
  model's API as well as usage examples.
-- BORT uses the RoBERTa tokenizer instead of the BERT tokenizer, so one can refer to :doc:`RoBERTa's documentation page
-  <roberta>` for the tokenizer's API as well as usage examples.
+- BORT uses the RoBERTa tokenizer instead of the BERT tokenizer, so one can refer to [RoBERTa's documentation page](roberta) for the tokenizer's API as well as usage examples.
-- BORT requires a specific fine-tuning algorithm, called `Agora
-  <https://adewynter.github.io/notes/bort_algorithms_and_applications.html#fine-tuning-with-algebraic-topology>`__ ,
+- BORT requires a specific fine-tuning algorithm, called [Agora](https://adewynter.github.io/notes/bort_algorithms_and_applications.html#fine-tuning-with-algebraic-topology) ,
  that is sadly not open-sourced yet. It would be very useful for the community, if someone tries to implement the
  algorithm to make BORT fine-tuning work.
-This model was contributed by `stefan-it <https://huggingface.co/stefan-it>`__. The original code can be found `here
-<https://github.com/alexa/bort/>`__.
+This model was contributed by [stefan-it](https://huggingface.co/stefan-it). The original code can be found [here](https://github.com/alexa/bort/).
--- a/docs/source/model_doc/byt5.mdx
+++ b/docs/source/model_doc/byt5.mdx
@@ -0,0 +1,80 @@
 <!--Copyright 2021 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # ByT5
 ## Overview
 The ByT5 model was presented in [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/abs/2105.13626) by Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir
 Kale, Adam Roberts, Colin Raffel.
 The abstract from the paper is the following:
 *Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units.
 Encoding text as a sequence of tokens requires a tokenizer, which is typically created as an independent artifact from
 the model. Token-free models that instead operate directly on raw text (bytes or characters) have many benefits: they
 can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by
 removing complex and error-prone text preprocessing pipelines. Since byte or character sequences are longer than token
 sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of
 operating directly on raw text. In this paper, we show that a standard Transformer architecture can be used with
 minimal modifications to process byte sequences. We carefully characterize the trade-offs in terms of parameter count,
 training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level
 counterparts. We also demonstrate that byte-level models are significantly more robust to noise and perform better on
 tasks that are sensitive to spelling and pronunciation. As part of our contribution, we release a new set of
 pre-trained byte-level Transformer models based on the T5 architecture, as well as all code and data used in our
 experiments.*
 This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The original code can be
 found [here](https://github.com/google-research/byt5).
 ByT5's architecture is based on the T5v1.1 model, so one can refer to [T5v1.1's documentation page](t5v1.1). They
 only differ in how inputs should be prepared for the model, see the code examples below.
 Since ByT5 was pre-trained in an unsupervised fashion, there's no real advantage to using a task prefix during
 single-task fine-tuning. If you are doing multi-task fine-tuning, you should use a prefix.
 ### Example
 ByT5 works on raw UTF-8 bytes, so it can be used without a tokenizer:
 ```python
 from transformers import T5ForConditionalGeneration
 import torch
 model = T5ForConditionalGeneration.from_pretrained('google/byt5-small')
 input_ids = torch.tensor([list("Life is like a box of chocolates.".encode("utf-8"))]) + 3  # add 3 for special tokens
 labels = torch.tensor([list("La vie est comme une boîte de chocolat.".encode("utf-8"))]) + 3  # add 3 for special tokens
 loss = model(input_ids, labels=labels).loss # forward pass
 ```
 For batched inference and training, however, it is recommended to use the tokenizer, which pads the sequences to a common length:
 ```python
 from transformers import T5ForConditionalGeneration, AutoTokenizer
 model = T5ForConditionalGeneration.from_pretrained('google/byt5-small')
 tokenizer = AutoTokenizer.from_pretrained('google/byt5-small')
 model_inputs = tokenizer(["Life is like a box of chocolates.", "Today is Monday."], padding="longest", return_tensors="pt")
 labels = tokenizer(["La vie est comme une boîte de chocolat.", "Aujourd'hui c'est lundi."], padding="longest", return_tensors="pt").input_ids
 loss = model(**model_inputs, labels=labels).loss # forward pass
 ```
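The `+ 3` offset used in the first example above can be illustrated without any library. The sketch below assumes, as that example's comment states, that the first three ids are reserved for special tokens, so each UTF-8 byte value is shifted by 3. The helper names `text_to_byte_ids` and `byte_ids_to_text` are hypothetical and are not part of `transformers`.

```python
from typing import List

# Assumption taken from the example above: ids 0, 1 and 2 are reserved for
# special tokens, so raw byte values are shifted by this offset.
SPECIAL_TOKEN_OFFSET = 3


def text_to_byte_ids(text: str) -> List[int]:
    """Encode text as UTF-8 and shift every byte value by the offset."""
    return [byte + SPECIAL_TOKEN_OFFSET for byte in text.encode("utf-8")]


def byte_ids_to_text(ids: List[int]) -> str:
    """Invert text_to_byte_ids, dropping ids that fall in the reserved range."""
    raw = bytes(i - SPECIAL_TOKEN_OFFSET for i in ids if i >= SPECIAL_TOKEN_OFFSET)
    return raw.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    ids = text_to_byte_ids("boîte")
    # "î" is two bytes in UTF-8, so the id sequence is longer than the string.
    print(len("boîte"), len(ids), ids)
    print(byte_ids_to_text(ids))
```

Non-ASCII characters expand to several ids, which is one reason byte-level sequences are longer than token-level ones, as the abstract notes.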
 ## ByT5Tokenizer
 [[autodoc]] ByT5Tokenizer
 See [`ByT5Tokenizer`] for all details.
--- a/docs/source/model_doc/byt5.rst
+++ b/docs/source/model_doc/byt5.rst
@@ -1,86 +0,0 @@
 .. 
    Copyright 2021 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 ByT5
 -----------------------------------------------------------------------------------------------------------------------
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The ByT5 model was presented in `ByT5: Towards a token-free future with pre-trained byte-to-byte models
 <https://arxiv.org/abs/2105.13626>`_ by Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir
 Kale, Adam Roberts, Colin Raffel.
 The abstract from the paper is the following:
 *Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units.
 Encoding text as a sequence of tokens requires a tokenizer, which is typically created as an independent artifact from
 the model. Token-free models that instead operate directly on raw text (bytes or characters) have many benefits: they
 can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by
 removing complex and error-prone text preprocessing pipelines. Since byte or character sequences are longer than token
 sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of
 operating directly on raw text. In this paper, we show that a standard Transformer architecture can be used with
 minimal modifications to process byte sequences. We carefully characterize the trade-offs in terms of parameter count,
 training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level
 counterparts. We also demonstrate that byte-level models are significantly more robust to noise and perform better on
 tasks that are sensitive to spelling and pronunciation. As part of our contribution, we release a new set of
 pre-trained byte-level Transformer models based on the T5 architecture, as well as all code and data used in our
 experiments.*
 This model was contributed by `patrickvonplaten <https://huggingface.co/patrickvonplaten>`__. The original code can be
 found `here <https://github.com/google-research/byt5>`__.
 ByT5's architecture is based on the T5v1.1 model, so one can refer to :doc:`T5v1.1's documentation page <t5v1.1>`. They
 only differ in how inputs should be prepared for the model; see the code examples below.
 Since ByT5 was pre-trained without supervision, there's no real advantage to using a task prefix during single-task
 fine-tuning. If you are doing multi-task fine-tuning, you should use a prefix.
 Example
 _______________________________________________________________________________________________________________________
 ByT5 works on raw UTF-8 bytes, so it can be used without a tokenizer:
 .. code-block::
    from transformers import T5ForConditionalGeneration
    import torch
    model = T5ForConditionalGeneration.from_pretrained('google/byt5-small')
    input_ids = torch.tensor([list("Life is like a box of chocolates.".encode("utf-8"))]) + 3  # add 3 for special tokens
    labels = torch.tensor([list("La vie est comme une boîte de chocolat.".encode("utf-8"))]) + 3  # add 3 for special tokens
    loss = model(input_ids, labels=labels).loss # forward pass
 For batched inference and training, however, it is recommended to use the tokenizer:
 .. code-block::
    from transformers import T5ForConditionalGeneration, AutoTokenizer
    model = T5ForConditionalGeneration.from_pretrained('google/byt5-small')
    tokenizer = AutoTokenizer.from_pretrained('google/byt5-small')
    model_inputs = tokenizer(["Life is like a box of chocolates.", "Today is Monday."], padding="longest", return_tensors="pt")
    labels = tokenizer(["La vie est comme une boîte de chocolat.", "Aujourd'hui c'est lundi."], padding="longest", return_tensors="pt").input_ids
    loss = model(**model_inputs, labels=labels).loss # forward pass
 ByT5Tokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.ByT5Tokenizer
 See :class:`~transformers.ByT5Tokenizer` for all details.
--- a/docs/source/model_doc/camembert.mdx
+++ b/docs/source/model_doc/camembert.mdx
@@ -0,0 +1,106 @@
 <!--Copyright 2020 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # CamemBERT
 ## Overview
 The CamemBERT model was proposed in [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by
 Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la
 Clergerie, Djamé Seddah, and Benoît Sagot. It is based on Facebook's RoBERTa model released in 2019. It is a model
 trained on 138GB of French text.
 The abstract from the paper is the following:
 *Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available
 models have either been trained on English data or on the concatenation of data in multiple languages. This makes
 practical use of such models --in all languages except English-- very limited. Aiming to address this issue for French,
 we release CamemBERT, a French version of the Bi-directional Encoders for Transformers (BERT). We measure the
 performance of CamemBERT compared to multilingual models in multiple downstream tasks, namely part-of-speech tagging,
 dependency parsing, named-entity recognition, and natural language inference. CamemBERT improves the state of the art
 for most of the tasks considered. We release the pretrained model for CamemBERT hoping to foster research and
 downstream applications for French NLP.*
 Tips:
 - This implementation is the same as RoBERTa. Refer to the [documentation of RoBERTa](roberta) for usage examples
  as well as information on the inputs and outputs.
 This model was contributed by [camembert](https://huggingface.co/camembert). The original code can be found [here](https://camembert-model.fr/).
 ## CamembertConfig
 [[autodoc]] CamembertConfig
 ## CamembertTokenizer
 [[autodoc]] CamembertTokenizer
    - build_inputs_with_special_tokens
    - get_special_tokens_mask
    - create_token_type_ids_from_sequences
    - save_vocabulary
 ## CamembertTokenizerFast
 [[autodoc]] CamembertTokenizerFast
 ## CamembertModel
 [[autodoc]] CamembertModel
 ## CamembertForCausalLM
 [[autodoc]] CamembertForCausalLM
 ## CamembertForMaskedLM
 [[autodoc]] CamembertForMaskedLM
 ## CamembertForSequenceClassification
 [[autodoc]] CamembertForSequenceClassification
 ## CamembertForMultipleChoice
 [[autodoc]] CamembertForMultipleChoice
 ## CamembertForTokenClassification
 [[autodoc]] CamembertForTokenClassification
 ## CamembertForQuestionAnswering
 [[autodoc]] CamembertForQuestionAnswering
 ## TFCamembertModel
 [[autodoc]] TFCamembertModel
 ## TFCamembertForMaskedLM
 [[autodoc]] TFCamembertForMaskedLM
 ## TFCamembertForSequenceClassification
 [[autodoc]] TFCamembertForSequenceClassification
 ## TFCamembertForMultipleChoice
 [[autodoc]] TFCamembertForMultipleChoice
 ## TFCamembertForTokenClassification
 [[autodoc]] TFCamembertForTokenClassification
 ## TFCamembertForQuestionAnswering
 [[autodoc]] TFCamembertForQuestionAnswering
--- a/docs/source/model_doc/camembert.rst
+++ b/docs/source/model_doc/camembert.rst
@@ -1,153 +0,0 @@
 .. 
    Copyright 2020 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 CamemBERT
 -----------------------------------------------------------------------------------------------------------------------
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The CamemBERT model was proposed in `CamemBERT: a Tasty French Language Model <https://arxiv.org/abs/1911.03894>`__ by
 Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la
 Clergerie, Djamé Seddah, and Benoît Sagot. It is based on Facebook's RoBERTa model released in 2019. It is a model
 trained on 138GB of French text.
 The abstract from the paper is the following:
 *Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available
 models have either been trained on English data or on the concatenation of data in multiple languages. This makes
 practical use of such models --in all languages except English-- very limited. Aiming to address this issue for French,
 we release CamemBERT, a French version of the Bi-directional Encoders for Transformers (BERT). We measure the
 performance of CamemBERT compared to multilingual models in multiple downstream tasks, namely part-of-speech tagging,
 dependency parsing, named-entity recognition, and natural language inference. CamemBERT improves the state of the art
 for most of the tasks considered. We release the pretrained model for CamemBERT hoping to foster research and
 downstream applications for French NLP.*
 Tips:
 - This implementation is the same as RoBERTa. Refer to the :doc:`documentation of RoBERTa <roberta>` for usage examples
  as well as the information relative to the inputs and outputs.
 This model was contributed by `camembert <https://huggingface.co/camembert>`__. The original code can be found `here
 <https://camembert-model.fr/>`__.
 CamembertConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CamembertConfig
    :members:
 CamembertTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CamembertTokenizer
    :members: build_inputs_with_special_tokens, get_special_tokens_mask,
        create_token_type_ids_from_sequences, save_vocabulary
 CamembertTokenizerFast
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CamembertTokenizerFast
    :members:
 CamembertModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CamembertModel
    :members:
 CamembertForCausalLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CamembertForCausalLM
    :members:
 CamembertForMaskedLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CamembertForMaskedLM
    :members:
 CamembertForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CamembertForSequenceClassification
    :members:
 CamembertForMultipleChoice
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CamembertForMultipleChoice
    :members:
 CamembertForTokenClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CamembertForTokenClassification
    :members:
 CamembertForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CamembertForQuestionAnswering
    :members:
 TFCamembertModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFCamembertModel
    :members:
 TFCamembertForMaskedLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFCamembertForMaskedLM
    :members:
 TFCamembertForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFCamembertForSequenceClassification
    :members:
 TFCamembertForMultipleChoice
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFCamembertForMultipleChoice
    :members:
 TFCamembertForTokenClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFCamembertForTokenClassification
    :members:
 TFCamembertForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFCamembertForQuestionAnswering
    :members:
--- a/docs/source/model_doc/canine.mdx
+++ b/docs/source/model_doc/canine.mdx
@@ -0,0 +1,133 @@
 <!--Copyright 2021 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # CANINE
 ## Overview
 The CANINE model was proposed in [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language
 Representation](https://arxiv.org/abs/2103.06874) by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting. It's
 among the first papers to train a Transformer without using an explicit tokenization step (such as Byte Pair
 Encoding (BPE), WordPiece or SentencePiece). Instead, the model is trained directly at the Unicode character level.
 Training at the character level inevitably comes with a longer sequence length, which CANINE solves with an efficient
 downsampling strategy, before applying a deep Transformer encoder.
 The abstract from the paper is the following:
 *Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models
 still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword
 lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all
 languages, and the use of any fixed vocabulary may limit a model's ability to adapt. In this paper, we present CANINE,
 a neural encoder that operates directly on character sequences, without explicit tokenization or vocabulary, and a
 pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias.
 To use its finer-grained input effectively and efficiently, CANINE combines downsampling, which reduces the input
 sequence length, with a deep transformer stack, which encodes context. CANINE outperforms a comparable mBERT model by
 2.8 F1 on TyDi QA, a challenging multilingual benchmark, despite having 28% fewer model parameters.*
 Tips:
 - CANINE uses no fewer than 3 Transformer encoders internally: 2 "shallow" encoders (which only consist of a single
  layer) and 1 "deep" encoder (which is a regular BERT encoder). First, a "shallow" encoder is used to contextualize
  the character embeddings, using local attention. Next, after downsampling, a "deep" encoder is applied. Finally,
  after upsampling, a "shallow" encoder is used to create the final character embeddings. Details regarding up- and
  downsampling can be found in the paper.
 - CANINE uses a max sequence length of 2048 characters by default. One can use [`CanineTokenizer`]
  to prepare text for the model.
 - Classification can be done by placing a linear layer on top of the final hidden state of the special [CLS] token
  (which has a predefined Unicode code point). For token classification tasks, however, the downsampled sequence of
  tokens needs to be upsampled again to match the length of the original character sequence (which is 2048). The
  details for this can be found in the paper.
 -  Models:
  - [google/canine-c](https://huggingface.co/google/canine-c): Pre-trained with autoregressive character loss,
    12-layer, 768-hidden, 12-heads, 121M parameters (size ~500 MB).
  - [google/canine-s](https://huggingface.co/google/canine-s): Pre-trained with subword loss, 12-layer,
    768-hidden, 12-heads, 121M parameters (size ~500 MB).
 This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/google-research/language/tree/master/language/canine).
 ### Example
 CANINE works on raw characters, so it can be used without a tokenizer:
 ```python
 >>> from transformers import CanineModel
 >>> import torch
 >>> model = CanineModel.from_pretrained('google/canine-c') # model pre-trained with autoregressive character loss
 >>> text = "hello world"
 >>> # use Python's built-in ord() function to turn each character into its unicode code point id
 >>> input_ids = torch.tensor([[ord(char) for char in text]])
 >>> outputs = model(input_ids) # forward pass
 >>> pooled_output = outputs.pooler_output
 >>> sequence_output = outputs.last_hidden_state
 ```
 For batched inference and training, however, it is recommended to use the tokenizer (to pad/truncate all
 sequences to the same length):
 ```python
 >>> from transformers import CanineTokenizer, CanineModel
 >>> model = CanineModel.from_pretrained('google/canine-c')
 >>> tokenizer = CanineTokenizer.from_pretrained('google/canine-c')
 >>> inputs = ["Life is like a box of chocolates.", "You never know what you're gonna get."]
 >>> encoding = tokenizer(inputs, padding="longest", truncation=True, return_tensors="pt")
 >>> outputs = model(**encoding) # forward pass
 >>> pooled_output = outputs.pooler_output
 >>> sequence_output = outputs.last_hidden_state
 ```
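What padding a batch of raw code points involves can be sketched in plain Python. This is only an illustration under stated assumptions: `batch_code_points` is a hypothetical helper, the padding id of 0 is an assumed value, and the special tokens that [`CanineTokenizer`] adds (such as [CLS]) are left out.

```python
def batch_code_points(texts, pad_id=0):
    """Turn strings into equal-length lists of Unicode code points.

    Returns (input_ids, attention_mask); the mask is 1 for real
    characters and 0 for padding positions.
    """
    encoded = [[ord(ch) for ch in text] for text in texts]
    longest = max(len(seq) for seq in encoded)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in encoded]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in encoded]
    return input_ids, attention_mask


input_ids, attention_mask = batch_code_points(["hello world", "hi"])
print(input_ids[1])
print(attention_mask[1])
```

In practice the tokenizer remains the safer choice, since it also handles truncation and special tokens consistently with the pre-trained checkpoints.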
 ## CANINE specific outputs
 [[autodoc]] models.canine.modeling_canine.CanineModelOutputWithPooling
 ## CanineConfig
 [[autodoc]] CanineConfig
 ## CanineTokenizer
 [[autodoc]] CanineTokenizer
    - build_inputs_with_special_tokens
    - get_special_tokens_mask
    - create_token_type_ids_from_sequences
 ## CanineModel
 [[autodoc]] CanineModel
    - forward
 ## CanineForSequenceClassification
 [[autodoc]] CanineForSequenceClassification
    - forward
 ## CanineForMultipleChoice
 [[autodoc]] CanineForMultipleChoice
    - forward
 ## CanineForTokenClassification
 [[autodoc]] CanineForTokenClassification
    - forward
 ## CanineForQuestionAnswering
 [[autodoc]] CanineForQuestionAnswering
    - forward
--- a/docs/source/model_doc/canine.rst
+++ b/docs/source/model_doc/canine.rst
@@ -1,155 +0,0 @@
 .. 
    Copyright 2021 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 CANINE
 -----------------------------------------------------------------------------------------------------------------------
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The CANINE model was proposed in `CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language
 Representation <https://arxiv.org/abs/2103.06874>`__ by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting. It's
 among the first papers to train a Transformer without using an explicit tokenization step (such as Byte Pair
 Encoding (BPE), WordPiece or SentencePiece). Instead, the model is trained directly at the Unicode character level.
 Training at the character level inevitably comes with a longer sequence length, which CANINE solves with an efficient
 downsampling strategy, before applying a deep Transformer encoder.
 The abstract from the paper is the following:
 *Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models
 still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword
 lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all
 languages, and the use of any fixed vocabulary may limit a model's ability to adapt. In this paper, we present CANINE,
 a neural encoder that operates directly on character sequences, without explicit tokenization or vocabulary, and a
 pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias.
 To use its finer-grained input effectively and efficiently, CANINE combines downsampling, which reduces the input
 sequence length, with a deep transformer stack, which encodes context. CANINE outperforms a comparable mBERT model by
 2.8 F1 on TyDi QA, a challenging multilingual benchmark, despite having 28% fewer model parameters.*
 Tips:
 - CANINE uses no fewer than 3 Transformer encoders internally: 2 "shallow" encoders (which only consist of a single
  layer) and 1 "deep" encoder (which is a regular BERT encoder). First, a "shallow" encoder is used to contextualize
  the character embeddings, using local attention. Next, after downsampling, a "deep" encoder is applied. Finally,
  after upsampling, a "shallow" encoder is used to create the final character embeddings. Details regarding up- and
  downsampling can be found in the paper.
 - CANINE uses a max sequence length of 2048 characters by default. One can use :class:`~transformers.CanineTokenizer`
  to prepare text for the model.
 - Classification can be done by placing a linear layer on top of the final hidden state of the special [CLS] token
  (which has a predefined Unicode code point). For token classification tasks, however, the downsampled sequence of
  tokens needs to be upsampled again to match the length of the original character sequence (which is 2048). The
  details for this can be found in the paper.
 -  Models:
      - `google/canine-c <https://huggingface.co/google/canine-c>`__: Pre-trained with autoregressive character loss,
        12-layer, 768-hidden, 12-heads, 121M parameters (size ~500 MB).
      - `google/canine-s <https://huggingface.co/google/canine-s>`__: Pre-trained with subword loss, 12-layer,
        768-hidden, 12-heads, 121M parameters (size ~500 MB).
 This model was contributed by `nielsr <https://huggingface.co/nielsr>`__. The original code can be found `here
 <https://github.com/google-research/language/tree/master/language/canine>`__.
 Example
 _______________________________________________________________________________________________________________________
 CANINE works on raw characters, so it can be used without a tokenizer:
 .. code-block::
    from transformers import CanineModel
    import torch
    model = CanineModel.from_pretrained('google/canine-c') # model pre-trained with autoregressive character loss
    text = "hello world"
    # use Python's built-in ord() function to turn each character into its unicode code point id
    input_ids = torch.tensor([[ord(char) for char in text]])
    outputs = model(input_ids) # forward pass
    pooled_output = outputs.pooler_output
    sequence_output = outputs.last_hidden_state
 For batched inference and training, however, it is recommended to use the tokenizer (to pad/truncate all
 sequences to the same length):
 .. code-block::
    from transformers import CanineTokenizer, CanineModel
    model = CanineModel.from_pretrained('google/canine-c')
    tokenizer = CanineTokenizer.from_pretrained('google/canine-c')
    inputs = ["Life is like a box of chocolates.", "You never know what you're gonna get."]
    encoding = tokenizer(inputs, padding="longest", truncation=True, return_tensors="pt")
    outputs = model(**encoding) # forward pass
    pooled_output = outputs.pooler_output
    sequence_output = outputs.last_hidden_state
 CANINE specific outputs
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.models.canine.modeling_canine.CanineModelOutputWithPooling
    :members:
 CanineConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CanineConfig
    :members:
 CanineTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CanineTokenizer
    :members: build_inputs_with_special_tokens, get_special_tokens_mask,
        create_token_type_ids_from_sequences
 CanineModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CanineModel
    :members: forward
 CanineForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CanineForSequenceClassification
    :members: forward
 CanineForMultipleChoice
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CanineForMultipleChoice
    :members: forward
 CanineForTokenClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CanineForTokenClassification
    :members: forward
 CanineForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CanineForQuestionAnswering
    :members: forward
--- a/docs/source/model_doc/clip.mdx
+++ b/docs/source/model_doc/clip.mdx
@@ -0,0 +1,143 @@
 <!--Copyright 2021 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # CLIP
 ## Overview
 The CLIP model was proposed in [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh,
 Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever. CLIP
 (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be
 instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing
 for the task, similarly to the zero-shot capabilities of GPT-2 and GPT-3.
 The abstract from the paper is the following:
 *State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This
 restricted form of supervision limits their generality and usability since additional labeled data is needed to specify
 any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a
 much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes
 with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400
 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference
 learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study
 the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks
 such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The
 model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need
 for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot
 without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained
 model weights at this https URL.*
 ## Usage
 CLIP is a multi-modal vision and language model. It can be used for image-text similarity and for zero-shot image
 classification. CLIP uses a ViT-like transformer to get visual features and a causal language model to get the text
 features. Both the text and visual features are then projected to a latent space with identical dimension. The dot
 product between the projected image and text features is then used as a similarity score.
 To feed images to the Transformer encoder, each image is split into a sequence of fixed-size non-overlapping patches,
 which are then linearly embedded. A [CLS] token is added to serve as a representation of an entire image. The authors
 also add absolute position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder.
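The patch arithmetic described above can be made concrete with a small sketch. The helper `patch_sequence_length` is hypothetical, and the numbers are assumptions for illustration: a square 224x224 input and a patch size of 32 (suggested by the `patch32` suffix of the checkpoint name).

```python
def patch_sequence_length(image_size, patch_size):
    """Number of vectors fed to the encoder: one per patch plus one [CLS] token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1


# Assumed values: 224x224 image, 32x32 patches -> 7 * 7 patches + 1 [CLS].
print(patch_sequence_length(224, 32))
```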
 The [`CLIPFeatureExtractor`] can be used to resize (or rescale) and normalize images for the model.
 The [`CLIPTokenizer`] is used to encode the text. The [`CLIPProcessor`] wraps
 [`CLIPFeatureExtractor`] and [`CLIPTokenizer`] into a single instance to both
 encode the text and prepare the images. The following example shows how to get the image-text similarity scores using
 [`CLIPProcessor`] and [`CLIPModel`].
 ```python
 >>> from PIL import Image
 >>> import requests
 >>> from transformers import CLIPProcessor, CLIPModel
 >>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
 >>> processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
 >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
 >>> image = Image.open(requests.get(url, stream=True).raw)
 >>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
 >>> outputs = model(**inputs)
 >>> logits_per_image = outputs.logits_per_image # this is the image-text similarity score
 >>> probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
 ```
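To make the similarity score more concrete, here is a minimal pure-Python sketch of how scaled dot products between normalized embeddings turn into label probabilities. The embedding values, the `logit_scale` of 100.0, and the helper names are illustrative assumptions; the real model learns both the projections and the scale.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so the dot product becomes a cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def image_text_probs(image_embed, text_embeds, logit_scale=100.0):
    """Probability of each text for one image, from scaled cosine similarities."""
    image_embed = l2_normalize(image_embed)
    logits = []
    for text_embed in text_embeds:
        text_embed = l2_normalize(text_embed)
        similarity = sum(i * t for i, t in zip(image_embed, text_embed))
        logits.append(logit_scale * similarity)
    return softmax(logits)


# Made-up projected embeddings: one image and two candidate captions.
image_embed = [0.9, 0.1, 0.2]
text_embeds = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.3]]
probs = image_text_probs(image_embed, text_embeds)
print(probs)  # the first caption points in nearly the same direction, so it dominates
```

The softmax is taken over the candidate texts for a single image, which mirrors `logits_per_image.softmax(dim=1)` in the example above.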
 This model was contributed by [valhalla](https://huggingface.co/valhalla). The original code can be found [here](https://github.com/openai/CLIP).
 ## CLIPConfig
 [[autodoc]] CLIPConfig
    - from_text_vision_configs
 ## CLIPTextConfig
 [[autodoc]] CLIPTextConfig
 ## CLIPVisionConfig
 [[autodoc]] CLIPVisionConfig
 ## CLIPTokenizer
 [[autodoc]] CLIPTokenizer
    - build_inputs_with_special_tokens
    - get_special_tokens_mask
    - create_token_type_ids_from_sequences
    - save_vocabulary
 ## CLIPTokenizerFast
 [[autodoc]] CLIPTokenizerFast
 ## CLIPFeatureExtractor
 [[autodoc]] CLIPFeatureExtractor
 ## CLIPProcessor
 [[autodoc]] CLIPProcessor
 ## CLIPModel
 [[autodoc]] CLIPModel
    - forward
    - get_text_features
    - get_image_features
 ## CLIPTextModel
 [[autodoc]] CLIPTextModel
    - forward
 ## CLIPVisionModel
 [[autodoc]] CLIPVisionModel
    - forward
 ## FlaxCLIPModel
 [[autodoc]] FlaxCLIPModel
    - __call__
    - get_text_features
    - get_image_features
 ## FlaxCLIPTextModel
 [[autodoc]] FlaxCLIPTextModel
    - __call__
 ## FlaxCLIPVisionModel
 [[autodoc]] FlaxCLIPVisionModel
    - __call__
--- a/docs/source/model_doc/clip.rst
+++ b/docs/source/model_doc/clip.rst
@@ -1,174 +0,0 @@
 .. 
    Copyright 2021 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 CLIP
 -----------------------------------------------------------------------------------------------------------------------
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The CLIP model was proposed in `Learning Transferable Visual Models From Natural Language Supervision
 <https://arxiv.org/abs/2103.00020>`__ by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh,
 Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever. CLIP
 (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be
 instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing
 for the task, similarly to the zero-shot capabilities of GPT-2 and 3.
 The abstract from the paper is the following:
 *State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This
 restricted form of supervision limits their generality and usability since additional labeled data is needed to specify
 any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a
 much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes
 with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400
 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference
 learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study
 the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks
 such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The
 model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need
 for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot
 without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained
 model weights at this https URL.*
 Usage
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 CLIP is a multi-modal vision and language model. It can be used for image-text similarity and for zero-shot image
 classification. CLIP uses a ViT like transformer to get visual features and a causal language model to get the text
 features. Both the text and visual features are then projected to a latent space with identical dimension. The dot
 product between the projected image and text features is then used as a similar score.
 To feed images to the Transformer encoder, each image is split into a sequence of fixed-size non-overlapping patches,
 which are then linearly embedded. A [CLS] token is added to serve as representation of an entire image. The authors
 also add absolute position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder.
 The :class:`~transformers.CLIPFeatureExtractor` can be used to resize (or rescale) and normalize images for the model.
 The :class:`~transformers.CLIPTokenizer` is used to encode the text. The :class:`~transformers.CLIPProcessor` wraps
 :class:`~transformers.CLIPFeatureExtractor` and :class:`~transformers.CLIPTokenizer` into a single instance to both
 encode the text and prepare the images. The following example shows how to get the image-text similarity scores using
 :class:`~transformers.CLIPProcessor` and :class:`~transformers.CLIPModel`.
 .. code-block::
        >>> from PIL import Image
        >>> import requests
        >>> from transformers import CLIPProcessor, CLIPModel
        >>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        >>> processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
        >>> image = Image.open(requests.get(url, stream=True).raw)
        >>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
        >>> outputs = model(**inputs)
        >>> logits_per_image = outputs.logits_per_image # this is the image-text similarity score
        >>> probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
 This model was contributed by `valhalla <https://huggingface.co/valhalla>`__. The original code can be found `here
 <https://github.com/openai/CLIP>`__.
 CLIPConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CLIPConfig
    :members: from_text_vision_configs
 CLIPTextConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CLIPTextConfig
    :members:
 CLIPVisionConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CLIPVisionConfig
    :members:
 CLIPTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CLIPTokenizer
    :members: build_inputs_with_special_tokens, get_special_tokens_mask,
        create_token_type_ids_from_sequences, save_vocabulary
 CLIPTokenizerFast
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CLIPTokenizerFast
    :members:
 CLIPFeatureExtractor
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CLIPFeatureExtractor
    :members:
 CLIPProcessor
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CLIPProcessor
    :members:
 CLIPModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CLIPModel
    :members: forward, get_text_features, get_image_features
 CLIPTextModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CLIPTextModel
    :members: forward
 CLIPVisionModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CLIPVisionModel
    :members: forward
 FlaxCLIPModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxCLIPModel
    :members: __call__, get_text_features, get_image_features
 FlaxCLIPTextModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxCLIPTextModel
    :members: __call__
 FlaxCLIPVisionModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxCLIPVisionModel
    :members: __call__
--- a/docs/source/model_doc/convbert.mdx
+++ b/docs/source/model_doc/convbert.mdx
@@ -0,0 +1,113 @@
 <!--Copyright 2020 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # ConvBERT
 ## Overview
 The ConvBERT model was proposed in [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng
 Yan.
 The abstract from the paper is the following:
 *Pre-trained language models like BERT and its variants have recently achieved impressive performance in various
 natural language understanding tasks. However, BERT heavily relies on the global self-attention block and thus suffers
 large memory footprint and computation cost. Although all its attention heads query on the whole input sequence for
 generating the attention map from a global perspective, we observe some heads only need to learn local dependencies,
 which means the existence of computation redundancy. We therefore propose a novel span-based dynamic convolution to
 replace these self-attention heads to directly model local dependencies. The novel convolution heads, together with the
 rest self-attention heads, form a new mixed attention block that is more efficient at both global and local context
 learning. We equip BERT with this mixed attention design and build a ConvBERT model. Experiments have shown that
 ConvBERT significantly outperforms BERT and its variants in various downstream tasks, with lower training cost and
 fewer model parameters. Remarkably, ConvBERTbase model achieves 86.4 GLUE score, 0.7 higher than ELECTRAbase, while
 using less than 1/4 training cost. Code and pre-trained models will be released.*
 ConvBERT training tips are similar to those of BERT.
 This model was contributed by [abhishek](https://huggingface.co/abhishek). The original implementation can be found
 here: https://github.com/yitu-opensource/ConvBert
 ## ConvBertConfig
 [[autodoc]] ConvBertConfig
 ## ConvBertTokenizer
 [[autodoc]] ConvBertTokenizer
    - build_inputs_with_special_tokens
    - get_special_tokens_mask
    - create_token_type_ids_from_sequences
    - save_vocabulary
 ## ConvBertTokenizerFast
 [[autodoc]] ConvBertTokenizerFast
 ## ConvBertModel
 [[autodoc]] ConvBertModel
    - forward
 ## ConvBertForMaskedLM
 [[autodoc]] ConvBertForMaskedLM
    - forward
 ## ConvBertForSequenceClassification
 [[autodoc]] ConvBertForSequenceClassification
    - forward
 ## ConvBertForMultipleChoice
 [[autodoc]] ConvBertForMultipleChoice
    - forward
 ## ConvBertForTokenClassification
 [[autodoc]] ConvBertForTokenClassification
    - forward
 ## ConvBertForQuestionAnswering
 [[autodoc]] ConvBertForQuestionAnswering
    - forward
 ## TFConvBertModel
 [[autodoc]] TFConvBertModel
    - call
 ## TFConvBertForMaskedLM
 [[autodoc]] TFConvBertForMaskedLM
    - call
 ## TFConvBertForSequenceClassification
 [[autodoc]] TFConvBertForSequenceClassification
    - call
 ## TFConvBertForMultipleChoice
 [[autodoc]] TFConvBertForMultipleChoice
    - call
 ## TFConvBertForTokenClassification
 [[autodoc]] TFConvBertForTokenClassification
    - call
 ## TFConvBertForQuestionAnswering
 [[autodoc]] TFConvBertForQuestionAnswering
    - call
--- a/docs/source/model_doc/convbert.rst
+++ b/docs/source/model_doc/convbert.rst
@@ -1,145 +0,0 @@
 .. 
    Copyright 2020 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 ConvBERT
 -----------------------------------------------------------------------------------------------------------------------
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The ConvBERT model was proposed in `ConvBERT: Improving BERT with Span-based Dynamic Convolution
 <https://arxiv.org/abs/2008.02496>`__ by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng
 Yan.
 The abstract from the paper is the following:
 *Pre-trained language models like BERT and its variants have recently achieved impressive performance in various
 natural language understanding tasks. However, BERT heavily relies on the global self-attention block and thus suffers
 large memory footprint and computation cost. Although all its attention heads query on the whole input sequence for
 generating the attention map from a global perspective, we observe some heads only need to learn local dependencies,
 which means the existence of computation redundancy. We therefore propose a novel span-based dynamic convolution to
 replace these self-attention heads to directly model local dependencies. The novel convolution heads, together with the
 rest self-attention heads, form a new mixed attention block that is more efficient at both global and local context
 learning. We equip BERT with this mixed attention design and build a ConvBERT model. Experiments have shown that
 ConvBERT significantly outperforms BERT and its variants in various downstream tasks, with lower training cost and
 fewer model parameters. Remarkably, ConvBERTbase model achieves 86.4 GLUE score, 0.7 higher than ELECTRAbase, while
 using less than 1/4 training cost. Code and pre-trained models will be released.*
 ConvBERT training tips are similar to those of BERT.
 This model was contributed by `abhishek <https://huggingface.co/abhishek>`__. The original implementation can be found
 here: https://github.com/yitu-opensource/ConvBert
 ConvBertConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.ConvBertConfig
    :members:
 ConvBertTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.ConvBertTokenizer
    :members: build_inputs_with_special_tokens, get_special_tokens_mask,
        create_token_type_ids_from_sequences, save_vocabulary
 ConvBertTokenizerFast
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.ConvBertTokenizerFast
    :members:
 ConvBertModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.ConvBertModel
    :members: forward
 ConvBertForMaskedLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.ConvBertForMaskedLM
    :members: forward
 ConvBertForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.ConvBertForSequenceClassification
    :members: forward
 ConvBertForMultipleChoice
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.ConvBertForMultipleChoice
    :members: forward
 ConvBertForTokenClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.ConvBertForTokenClassification
    :members: forward
 ConvBertForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.ConvBertForQuestionAnswering
    :members: forward
 TFConvBertModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFConvBertModel
    :members: call
 TFConvBertForMaskedLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFConvBertForMaskedLM
    :members: call
 TFConvBertForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFConvBertForSequenceClassification
    :members: call
 TFConvBertForMultipleChoice
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFConvBertForMultipleChoice
    :members: call
 TFConvBertForTokenClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFConvBertForTokenClassification
    :members: call
 TFConvBertForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFConvBertForQuestionAnswering
    :members: call
--- a/docs/source/model_doc/cpm.mdx
+++ b/docs/source/model_doc/cpm.mdx
@@ -1,5 +1,4 @@
-..
+<!--Copyright 2020 The HuggingFace Team. All rights reserved.
    Copyright 2020 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
@@ -9,15 +8,13 @@
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
-CPM
+# CPM
 -----------------------------------------------------------------------------------------------------------------------
-Overview
+## Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-The CPM model was proposed in `CPM: A Large-scale Generative Chinese Pre-trained Language Model
+The CPM model was proposed in [CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://arxiv.org/abs/2012.00413) by Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin,
 <https://arxiv.org/abs/2012.00413>`__ by Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin,
 Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen,
 Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, Maosong Sun.
@@ -33,13 +30,11 @@ language model, which could facilitate several downstream Chinese NLP tasks, suc
 cloze test, and language understanding. Extensive experiments demonstrate that CPM achieves strong performance on many
 NLP tasks in the settings of few-shot (even zero-shot) learning.*
-This model was contributed by `canwenxu <https://huggingface.co/canwenxu>`__. The original implementation can be found
+This model was contributed by [canwenxu](https://huggingface.co/canwenxu). The original implementation can be found
 here: https://github.com/TsinghuaAI/CPM-Generate
 Note: We only have a tokenizer here, since the model architecture is the same as GPT-2.
-CpmTokenizer
+## CpmTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: transformers.CpmTokenizer
+[[autodoc]] CpmTokenizer
    :members:
--- a/docs/source/model_doc/ctrl.mdx
+++ b/docs/source/model_doc/ctrl.mdx
@@ -0,0 +1,87 @@
 <!--Copyright 2020 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # CTRL
 ## Overview
 The CTRL model was proposed in [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and
 Richard Socher. It's a causal (unidirectional) transformer pre-trained using language modeling on a very large corpus
 of ~140 GB of text data with the first token reserved as a control code (such as Links, Books, Wikipedia, etc.).
 The abstract from the paper is the following:
 *Large-scale language models show promising text generation capabilities, but users cannot easily control particular
 aspects of the generated text. We release CTRL, a 1.63 billion-parameter conditional transformer language model,
 trained to condition on control codes that govern style, content, and task-specific behavior. Control codes were
 derived from structure that naturally co-occurs with raw text, preserving the advantages of unsupervised learning while
 providing more explicit control over text generation. These codes also allow CTRL to predict which parts of the
 training data are most likely given a sequence. This provides a potential method for analyzing large amounts of data
 via model-based source attribution.*
 Tips:
 - CTRL makes use of control codes to generate text: it requires generations to be started by certain words, sentences
  or links to generate coherent text. Refer to the [original implementation](https://github.com/salesforce/ctrl) for
  more information.
 - CTRL is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
  the left.
 - CTRL was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next
  token in a sequence. Leveraging this feature allows CTRL to generate syntactically coherent text, as can be
  observed in the *run_generation.py* example script.
 - The PyTorch models can take the *past* as input, which is the previously computed key/value attention pairs. Using
  this *past* value prevents the model from re-computing pre-computed values in the context of text generation. See
  [reusing the past in generative models](../quickstart#using-the-past) for more information on the usage of
  this argument.
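The padding tip above can be sketched without the library. The helper `pad_right` and the pad id `0` are hypothetical choices for illustration; in practice the tokenizer handles padding. With absolute position embeddings, padding on the right keeps every real token at the position index it would have in an unpadded sequence.

```python
def pad_right(sequences, pad_id=0):
    """Pad token id lists on the right to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Made-up token ids for a batch of two sequences of different lengths.
batch = [[11, 12, 13, 14], [21, 22]]
input_ids, attention_mask = pad_right(batch)
print(input_ids)       # [[11, 12, 13, 14], [21, 22, 0, 0]]
print(attention_mask)  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

Had the second sequence been padded on the left, its real tokens would sit at positions 2 and 3 instead of 0 and 1, which is the mismatch the tip warns about.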
 This model was contributed by [keskarnitishr](https://huggingface.co/keskarnitishr). The original code can be found
 [here](https://github.com/salesforce/ctrl).
 ## CTRLConfig
 [[autodoc]] CTRLConfig
 ## CTRLTokenizer
 [[autodoc]] CTRLTokenizer
    - save_vocabulary
 ## CTRLModel
 [[autodoc]] CTRLModel
    - forward
 ## CTRLLMHeadModel
 [[autodoc]] CTRLLMHeadModel
    - forward
 ## CTRLForSequenceClassification
 [[autodoc]] CTRLForSequenceClassification
    - forward
 ## TFCTRLModel
 [[autodoc]] TFCTRLModel
    - call
 ## TFCTRLLMHeadModel
 [[autodoc]] TFCTRLLMHeadModel
    - call
 ## TFCTRLForSequenceClassification
 [[autodoc]] TFCTRLForSequenceClassification
    - call
--- a/docs/source/model_doc/ctrl.rst
+++ b/docs/source/model_doc/ctrl.rst
@@ -1,105 +0,0 @@
 .. 
    Copyright 2020 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 CTRL
 -----------------------------------------------------------------------------------------------------------------------
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 CTRL model was proposed in `CTRL: A Conditional Transformer Language Model for Controllable Generation
 <https://arxiv.org/abs/1909.05858>`_ by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and
 Richard Socher. It's a causal (unidirectional) transformer pre-trained using language modeling on a very large corpus
 of ~140 GB of text data with the first token reserved as a control code (such as Links, Books, Wikipedia etc.).
 The abstract from the paper is the following:
 *Large-scale language models show promising text generation capabilities, but users cannot easily control particular
 aspects of the generated text. We release CTRL, a 1.63 billion-parameter conditional transformer language model,
 trained to condition on control codes that govern style, content, and task-specific behavior. Control codes were
 derived from structure that naturally co-occurs with raw text, preserving the advantages of unsupervised learning while
 providing more explicit control over text generation. These codes also allow CTRL to predict which parts of the
 training data are most likely given a sequence. This provides a potential method for analyzing large amounts of data
 via model-based source attribution.*
 Tips:
 - CTRL makes use of control codes to generate text: it requires generations to be started by certain words, sentences
  or links to generate coherent text. Refer to the `original implementation <https://github.com/salesforce/ctrl>`__ for
  more information.
 - CTRL is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
  the left.
 - CTRL was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next
  token in a sequence. Leveraging this feature allows CTRL to generate syntactically coherent text as it can be
  observed in the `run_generation.py` example script.
 - The PyTorch models can take the `past` as input, which is the previously computed key/value attention pairs. Using
  this `past` value prevents the model from re-computing pre-computed values in the context of text generation. See
  `reusing the past in generative models <../quickstart.html#using-the-past>`__ for more information on the usage of
  this argument.
 This model was contributed by `keskarnitishr <https://huggingface.co/keskarnitishr>`__. The original code can be found
 `here <https://github.com/salesforce/ctrl>`__.
 CTRLConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CTRLConfig
    :members:
 CTRLTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CTRLTokenizer
    :members: save_vocabulary
 CTRLModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CTRLModel
    :members: forward
 CTRLLMHeadModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CTRLLMHeadModel
    :members: forward
 CTRLForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.CTRLForSequenceClassification
    :members: forward
 TFCTRLModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFCTRLModel
    :members: call
 TFCTRLLMHeadModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFCTRLLMHeadModel
    :members: call
 TFCTRLForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFCTRLForSequenceClassification
    :members: call
--- a/docs/source/model_doc/deberta.mdx
+++ b/docs/source/model_doc/deberta.mdx
@@ -0,0 +1,117 @@
 <!--Copyright 2020 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # DeBERTa
 ## Overview
 The DeBERTa model was proposed in [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen. It is based on Google's
 BERT model released in 2018 and Facebook's RoBERTa model released in 2019.
 It builds on RoBERTa with disentangled attention and enhanced mask decoder training with half of the data used in
 RoBERTa.
 The abstract from the paper is the following:
 *Recent progress in pre-trained neural language models has significantly improved the performance of many natural
 language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with
 disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the
 disentangled attention mechanism, where each word is represented using two vectors that encode its content and
 position, respectively, and the attention weights among words are computed using disentangled matrices on their
 contents and relative positions. Second, an enhanced mask decoder is used to replace the output softmax layer to
 predict the masked tokens for model pretraining. We show that these two techniques significantly improve the efficiency
 of model pretraining and performance of downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of
 the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9%
 (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). The DeBERTa code and
 pre-trained models will be made publicly available at https://github.com/microsoft/DeBERTa.*
 This model was contributed by [DeBERTa](https://huggingface.co/DeBERTa). The TF 2.0 implementation of this model was
 contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/microsoft/DeBERTa).
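The disentangled attention described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration of the score formula from the paper (content-to-content, content-to-position, and position-to-content terms, scaled by `sqrt(3d)`), not the code used by the library. The function names `clamp_relative_position` and `disentangled_score`, and the toy list-of-lists inputs, are assumptions made for this example.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def clamp_relative_position(i, j, k):
    """Map the offset i - j to an index in [0, 2k), clamping far offsets."""
    delta = i - j
    if delta <= -k:
        return 0
    if delta >= k:
        return 2 * k - 1
    return delta + k


def disentangled_score(i, j, q_content, k_content, q_position, k_position, k):
    """Unnormalized attention score between query token i and key token j.

    q_content / k_content hold one projected content vector per token;
    q_position / k_position hold one projected vector per relative-position
    index (2k entries each).
    """
    d = len(q_content[i])
    content_to_content = dot(q_content[i], k_content[j])
    content_to_position = dot(q_content[i], k_position[clamp_relative_position(i, j, k)])
    position_to_content = dot(k_content[j], q_position[clamp_relative_position(j, i, k)])
    return (content_to_content + content_to_position + position_to_content) / math.sqrt(3 * d)
```

In a full model these scores would go through a softmax over `j`; the sketch stops at the raw score so that the three terms stay visible.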
 ## DebertaConfig
 [[autodoc]] DebertaConfig
 ## DebertaTokenizer
 [[autodoc]] DebertaTokenizer
    - build_inputs_with_special_tokens
    - get_special_tokens_mask
    - create_token_type_ids_from_sequences
    - save_vocabulary
 ## DebertaTokenizerFast
 [[autodoc]] DebertaTokenizerFast
    - build_inputs_with_special_tokens
    - create_token_type_ids_from_sequences
 ## DebertaModel
 [[autodoc]] DebertaModel
    - forward
 ## DebertaPreTrainedModel
 [[autodoc]] DebertaPreTrainedModel
 ## DebertaForMaskedLM
 [[autodoc]] DebertaForMaskedLM
    - forward
 ## DebertaForSequenceClassification
 [[autodoc]] DebertaForSequenceClassification
    - forward
 ## DebertaForTokenClassification
 [[autodoc]] DebertaForTokenClassification
    - forward
 ## DebertaForQuestionAnswering
 [[autodoc]] DebertaForQuestionAnswering
    - forward
 ## TFDebertaModel
 [[autodoc]] TFDebertaModel
    - call
 ## TFDebertaPreTrainedModel
 [[autodoc]] TFDebertaPreTrainedModel
    - call
 ## TFDebertaForMaskedLM
 [[autodoc]] TFDebertaForMaskedLM
    - call
 ## TFDebertaForSequenceClassification
 [[autodoc]] TFDebertaForSequenceClassification
    - call
 ## TFDebertaForTokenClassification
 [[autodoc]] TFDebertaForTokenClassification
    - call
 ## TFDebertaForQuestionAnswering
 [[autodoc]] TFDebertaForQuestionAnswering
    - call
--- a/docs/source/model_doc/deberta.rst
+++ b/docs/source/model_doc/deberta.rst
@@ -1,148 +0,0 @@
 .. 
    Copyright 2020 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 DeBERTa
 -----------------------------------------------------------------------------------------------------------------------
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The DeBERTa model was proposed in `DeBERTa: Decoding-enhanced BERT with Disentangled Attention
 <https://arxiv.org/abs/2006.03654>`__ by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen It is based on Google's
 BERT model released in 2018 and Facebook's RoBERTa model released in 2019.
 It builds on RoBERTa with disentangled attention and enhanced mask decoder training with half of the data used in
 RoBERTa.
 The abstract from the paper is the following:
 *Recent progress in pre-trained neural language models has significantly improved the performance of many natural
 language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with
 disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the
 disentangled attention mechanism, where each word is represented using two vectors that encode its content and
 position, respectively, and the attention weights among words are computed using disentangled matrices on their
 contents and relative positions. Second, an enhanced mask decoder is used to replace the output softmax layer to
 predict the masked tokens for model pretraining. We show that these two techniques significantly improve the efficiency
 of model pretraining and performance of downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of
 the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9%
 (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). The DeBERTa code and
 pre-trained models will be made publicly available at https://github.com/microsoft/DeBERTa.*
 This model was contributed by `DeBERTa <https://huggingface.co/DeBERTa>`__. This model TF 2.0 implementation was
 contributed by `kamalkraj <https://huggingface.co/kamalkraj>`__ . The original code can be found `here
 <https://github.com/microsoft/DeBERTa>`__.
 DebertaConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DebertaConfig
    :members:
 DebertaTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DebertaTokenizer
    :members: build_inputs_with_special_tokens, get_special_tokens_mask,
        create_token_type_ids_from_sequences, save_vocabulary
 DebertaTokenizerFast
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DebertaTokenizerFast
    :members: build_inputs_with_special_tokens, create_token_type_ids_from_sequences
 DebertaModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DebertaModel
    :members: forward
 DebertaPreTrainedModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DebertaPreTrainedModel
    :members:
 DebertaForMaskedLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DebertaForMaskedLM
    :members: forward
 DebertaForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DebertaForSequenceClassification
    :members: forward
 DebertaForTokenClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DebertaForTokenClassification
    :members: forward
 DebertaForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DebertaForQuestionAnswering
    :members: forward
 TFDebertaModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFDebertaModel
    :members: call
 TFDebertaPreTrainedModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFDebertaPreTrainedModel
    :members: call
 TFDebertaForMaskedLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFDebertaForMaskedLM
    :members: call
 TFDebertaForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFDebertaForSequenceClassification
    :members: call
 TFDebertaForTokenClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFDebertaForTokenClassification
    :members: call
 TFDebertaForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFDebertaForQuestionAnswering
    :members: call
--- a/docs/source/model_doc/deberta_v2.mdx
+++ b/docs/source/model_doc/deberta_v2.mdx
@@ -0,0 +1,132 @@
 <!--Copyright 2020 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # DeBERTa-v2
 ## Overview
 The DeBERTa model was proposed in [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen. It is based on Google's
 BERT model released in 2018 and Facebook's RoBERTa model released in 2019.
 It builds on RoBERTa with disentangled attention and enhanced mask decoder training with half of the data used in
 RoBERTa.
 The abstract from the paper is the following:
 *Recent progress in pre-trained neural language models has significantly improved the performance of many natural
 language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with
 disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the
 disentangled attention mechanism, where each word is represented using two vectors that encode its content and
 position, respectively, and the attention weights among words are computed using disentangled matrices on their
 contents and relative positions. Second, an enhanced mask decoder is used to replace the output softmax layer to
 predict the masked tokens for model pretraining. We show that these two techniques significantly improve the efficiency
 of model pretraining and performance of downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of
 the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9%
 (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). The DeBERTa code and
 pre-trained models will be made publicly available at https://github.com/microsoft/DeBERTa.*
 The following information is visible directly on the [original implementation
 repository](https://github.com/microsoft/DeBERTa). DeBERTa v2 is the second version of the DeBERTa model. It includes
 the 1.5B model used for the SuperGLUE single-model submission, which achieved 89.9 versus the human baseline of 89.8.
 You can find more details about this submission in the authors'
 [blog](https://www.microsoft.com/en-us/research/blog/microsoft-deberta-surpasses-human-performance-on-the-superglue-benchmark/).
 New in v2:
 - **Vocabulary** In v2 the tokenizer is changed to use a new vocabulary of size 128K built from the training data.
  Instead of a GPT2-based tokenizer, the tokenizer is now a
  [sentencepiece-based](https://github.com/google/sentencepiece) tokenizer.
 - **nGiE (nGram Induced Input Encoding)** The DeBERTa-v2 model uses an additional convolution layer alongside the first
  transformer layer to better learn the local dependency of input tokens.
 - **Sharing position projection matrix with content projection matrix in attention layer** Based on previous
  experiments, this can save parameters without affecting the performance.
 - **Apply bucket to encode relative positions** The DeBERTa-v2 model uses log buckets to encode relative positions,
  similar to T5.
 - **900M model & 1.5B model** Two additional model sizes are available: 900M and 1.5B, which significantly improve the
  performance of downstream tasks.
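To make the "log bucket" idea above concrete, here is a small sketch of T5-style logarithmic bucketing of relative positions. It is an illustration only: the function name `log_bucket`, its parameters, and the default values are assumptions for this example and do not reproduce the exact formula used in the DeBERTa-v2 implementation. Nearby offsets keep their own bucket, while distant offsets share buckets whose width grows logarithmically.

```python
import math


def log_bucket(offset, exact=4, log_buckets=4, max_distance=64):
    """Map a signed relative offset to a signed bucket index.

    Offsets with |offset| < exact keep their own bucket. Larger distances
    are clipped to max_distance and spread over `log_buckets` buckets on a
    logarithmic scale. All parameter values here are illustrative.
    """
    distance = abs(offset)
    if distance < exact:
        return offset
    distance = min(distance, max_distance)
    # Fraction in [0, 1] of the way from `exact` to `max_distance` in log space.
    fraction = math.log(distance / exact) / math.log(max_distance / exact)
    bucket = exact + min(int(fraction * log_buckets), log_buckets - 1)
    return bucket if offset > 0 else -bucket
```

With the defaults, offsets 0 to 3 map to themselves, while every offset of 64 or more lands in the last bucket, so the number of position embeddings stays small even for long sequences.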
 This model was contributed by [DeBERTa](https://huggingface.co/DeBERTa). The TF 2.0 implementation of this model was
 contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/microsoft/DeBERTa).
 ## DebertaV2Config
 [[autodoc]] DebertaV2Config
 ## DebertaV2Tokenizer
 [[autodoc]] DebertaV2Tokenizer
    - build_inputs_with_special_tokens
    - get_special_tokens_mask
    - create_token_type_ids_from_sequences
    - save_vocabulary
 ## DebertaV2Model
 [[autodoc]] DebertaV2Model
    - forward
 ## DebertaV2PreTrainedModel
 [[autodoc]] DebertaV2PreTrainedModel
    - forward
 ## DebertaV2ForMaskedLM
 [[autodoc]] DebertaV2ForMaskedLM
    - forward
 ## DebertaV2ForSequenceClassification
 [[autodoc]] DebertaV2ForSequenceClassification
    - forward
 ## DebertaV2ForTokenClassification
 [[autodoc]] DebertaV2ForTokenClassification
    - forward
 ## DebertaV2ForQuestionAnswering
 [[autodoc]] DebertaV2ForQuestionAnswering
    - forward
 ## TFDebertaV2Model
 [[autodoc]] TFDebertaV2Model
    - call
 ## TFDebertaV2PreTrainedModel
 [[autodoc]] TFDebertaV2PreTrainedModel
    - call
 ## TFDebertaV2ForMaskedLM
 [[autodoc]] TFDebertaV2ForMaskedLM
    - call
 ## TFDebertaV2ForSequenceClassification
 [[autodoc]] TFDebertaV2ForSequenceClassification
    - call
 ## TFDebertaV2ForTokenClassification
 [[autodoc]] TFDebertaV2ForTokenClassification
    - call
 ## TFDebertaV2ForQuestionAnswering
 [[autodoc]] TFDebertaV2ForQuestionAnswering
    - call
--- a/docs/source/model_doc/deberta_v2.rst
+++ b/docs/source/model_doc/deberta_v2.rst
@@ -1,162 +0,0 @@
 .. 
    Copyright 2020 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 DeBERTa-v2
 -----------------------------------------------------------------------------------------------------------------------
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The DeBERTa model was proposed in `DeBERTa: Decoding-enhanced BERT with Disentangled Attention
 <https://arxiv.org/abs/2006.03654>`__ by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen It is based on Google's
 BERT model released in 2018 and Facebook's RoBERTa model released in 2019.
 It builds on RoBERTa with disentangled attention and enhanced mask decoder training with half of the data used in
 RoBERTa.
 The abstract from the paper is the following:
 *Recent progress in pre-trained neural language models has significantly improved the performance of many natural
 language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with
 disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the
 disentangled attention mechanism, where each word is represented using two vectors that encode its content and
 position, respectively, and the attention weights among words are computed using disentangled matrices on their
 contents and relative positions. Second, an enhanced mask decoder is used to replace the output softmax layer to
 predict the masked tokens for model pretraining. We show that these two techniques significantly improve the efficiency
 of model pretraining and performance of downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of
 the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9%
 (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). The DeBERTa code and
 pre-trained models will be made publicly available at https://github.com/microsoft/DeBERTa.*
 The following information is visible directly on the [original implementation
 repository](https://github.com/microsoft/DeBERTa). DeBERTa v2 is the second version of the DeBERTa model. It includes
 the 1.5B model used for the SuperGLUE single-model submission and achieving 89.9, versus human baseline 89.8. You can
 find more details about this submission in the authors'
 [blog](https://www.microsoft.com/en-us/research/blog/microsoft-deberta-surpasses-human-performance-on-the-superglue-benchmark/)
 New in v2:
 - **Vocabulary** In v2 the tokenizer is changed to use a new vocabulary of size 128K built from the training data.
  Instead of a GPT2-based tokenizer, the tokenizer is now
  [sentencepiece-based](https://github.com/google/sentencepiece) tokenizer.
 - **nGiE(nGram Induced Input Encoding)** The DeBERTa-v2 model uses an additional convolution layer aside with the first
  transformer layer to better learn the local dependency of input tokens.
 - **Sharing position projection matrix with content projection matrix in attention layer** Based on previous
  experiments, this can save parameters without affecting the performance.
 - **Apply bucket to encode relative positions** The DeBERTa-v2 model uses log bucket to encode relative positions
  similar to T5.
 - **900M model & 1.5B model** Two additional model sizes are available: 900M and 1.5B, which significantly improves the
  performance of downstream tasks.
 This model was contributed by `DeBERTa <https://huggingface.co/DeBERTa>`__. This model TF 2.0 implementation was
 contributed by `kamalkraj <https://huggingface.co/kamalkraj>`__. The original code can be found `here
 <https://github.com/microsoft/DeBERTa>`__.
 DebertaV2Config
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DebertaV2Config
    :members:
 DebertaV2Tokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DebertaV2Tokenizer
    :members: build_inputs_with_special_tokens, get_special_tokens_mask,
        create_token_type_ids_from_sequences, save_vocabulary
 DebertaV2Model
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DebertaV2Model
    :members: forward
 DebertaV2PreTrainedModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DebertaV2PreTrainedModel
    :members: forward
 DebertaV2ForMaskedLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DebertaV2ForMaskedLM
    :members: forward
 DebertaV2ForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DebertaV2ForSequenceClassification
    :members: forward
 DebertaV2ForTokenClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DebertaV2ForTokenClassification
    :members: forward
 DebertaV2ForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DebertaV2ForQuestionAnswering
    :members: forward
 TFDebertaV2Model
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFDebertaV2Model
    :members: call
 TFDebertaV2PreTrainedModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFDebertaV2PreTrainedModel
    :members: call
 TFDebertaV2ForMaskedLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFDebertaV2ForMaskedLM
    :members: call
 TFDebertaV2ForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFDebertaV2ForSequenceClassification
    :members: call
 TFDebertaV2ForTokenClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFDebertaV2ForTokenClassification
    :members: call
 TFDebertaV2ForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFDebertaV2ForQuestionAnswering
    :members: call
--- a/docs/source/model_doc/deit.mdx
+++ b/docs/source/model_doc/deit.mdx
@@ -1,5 +1,4 @@
-.. 
+<!--Copyright 2021 The HuggingFace Team. All rights reserved.
    Copyright 2021 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
@@ -9,24 +8,21 @@
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
-DeiT
+# DeiT
 -----------------------------------------------------------------------------------------------------------------------
-.. note::
+<Tip>
 This is a recently introduced model so the API hasn't been tested extensively. There may be some bugs or slight
-    breaking changes to fix it in the future. If you see something strange, file a `Github Issue
+breaking changes to fix it in the future. If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title).
    <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__.
 </Tip>
-Overview
+## Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-The DeiT model was proposed in `Training data-efficient image transformers & distillation through attention
+The DeiT model was proposed in [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre
-<https://arxiv.org/abs/2012.12877>`__ by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre
+Sablayrolles, Hervé Jégou. The [Vision Transformer (ViT)](vit) introduced in [Dosovitskiy et al., 2020](https://arxiv.org/abs/2010.11929) has shown that one can match or even outperform existing convolutional neural
 Sablayrolles, Hervé Jégou. The `Vision Transformer (ViT) <vit>`__ introduced in `Dosovitskiy et al., 2020
 <https://arxiv.org/abs/2010.11929>`__ has shown that one can match or even outperform existing convolutional neural
 networks using a Transformer encoder (BERT-like). However, the ViT models introduced in that paper required training on
 expensive infrastructure for multiple weeks, using external data. DeiT (data-efficient image transformers) are more
 efficiently trained transformers for image classification, requiring far less data and far less computing resources
@@ -58,54 +54,44 @@ Tips:
  distillation head and the label predicted by the teacher). At inference time, one takes the average prediction
  between both heads as final prediction. (2) is also called "fine-tuning with distillation", because one relies on a
  teacher that has already been fine-tuned on the downstream dataset. In terms of models, (1) corresponds to
-  :class:`~transformers.DeiTForImageClassification` and (2) corresponds to
+  [`DeiTForImageClassification`] and (2) corresponds to
-  :class:`~transformers.DeiTForImageClassificationWithTeacher`.
+  [`DeiTForImageClassificationWithTeacher`].
 - Note that the authors also did try soft distillation for (2) (in which case the distillation prediction head is
  trained using KL divergence to match the softmax output of the teacher), but hard distillation gave the best results.
 - All released checkpoints were pre-trained and fine-tuned on ImageNet-1k only. No external data was used. This is in
  contrast with the original ViT model, which used external data like the JFT-300M dataset/Imagenet-21k for
  pre-training.
 - The authors of DeiT also released more efficiently trained ViT models, which you can directly plug into
-  :class:`~transformers.ViTModel` or :class:`~transformers.ViTForImageClassification`. Techniques like data
+  [`ViTModel`] or [`ViTForImageClassification`]. Techniques like data
  augmentation, optimization, and regularization were used in order to simulate training on a much larger dataset
  (while only using ImageNet-1k for pre-training). There are 4 variants available (in 3 different sizes):
-  `facebook/deit-tiny-patch16-224`, `facebook/deit-small-patch16-224`, `facebook/deit-base-patch16-224` and
+  *facebook/deit-tiny-patch16-224*, *facebook/deit-small-patch16-224*, *facebook/deit-base-patch16-224* and
-  `facebook/deit-base-patch16-384`. Note that one should use :class:`~transformers.DeiTFeatureExtractor` in order to
+  *facebook/deit-base-patch16-384*. Note that one should use [`DeiTFeatureExtractor`] in order to
  prepare images for the model.
-This model was contributed by `nielsr <https://huggingface.co/nielsr>`__.
+This model was contributed by [nielsr](https://huggingface.co/nielsr).
-DeiTConfig
+## DeiTConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: transformers.DeiTConfig
+[[autodoc]] DeiTConfig
    :members:
 ## DeiTFeatureExtractor
-DeiTFeatureExtractor
+[[autodoc]] DeiTFeatureExtractor
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+    - __call__
-.. autoclass:: transformers.DeiTFeatureExtractor
+## DeiTModel
    :members: __call__
 [[autodoc]] DeiTModel
    - forward
-DeiTModel
+## DeiTForImageClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: transformers.DeiTModel
+[[autodoc]] DeiTForImageClassification
-    :members: forward
+    - forward
 ## DeiTForImageClassificationWithTeacher
-DeiTForImageClassification
+[[autodoc]] DeiTForImageClassificationWithTeacher
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+    - forward
 .. autoclass:: transformers.DeiTForImageClassification
    :members: forward
 DeiTForImageClassificationWithTeacher
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DeiTForImageClassificationWithTeacher
    :members: forward
--- a/docs/source/model_doc/dialogpt.mdx
+++ b/docs/source/model_doc/dialogpt.mdx
@@ -1,5 +1,4 @@
-.. 
+<!--Copyright 2020 The HuggingFace Team. All rights reserved.
    Copyright 2020 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
@@ -9,15 +8,13 @@
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
-DialoGPT
+# DialoGPT
 -----------------------------------------------------------------------------------------------------------------------
-Overview
+## Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-DialoGPT was proposed in `DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation
+DialoGPT was proposed in [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://arxiv.org/abs/1911.00536) by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao,
 <https://arxiv.org/abs/1911.00536>`_ by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao,
 Jianfeng Gao, Jingjing Liu, Bill Dolan. It's a GPT2 Model trained on 147M conversation-like exchanges extracted from
 Reddit.
@@ -37,8 +34,7 @@ Tips:
  than the left.
 - DialoGPT was trained with a causal language modeling (CLM) objective on conversational data and is therefore powerful
  at response generation in open-domain dialogue systems.
- DialoGPT enables the user to create a chat bot in just 10 lines of code as shown on `DialoGPT's model card
+- DialoGPT enables the user to create a chat bot in just 10 lines of code as shown on [DialoGPT's model card](https://huggingface.co/microsoft/DialoGPT-medium).
  <https://huggingface.co/microsoft/DialoGPT-medium>`_.
 Training:
@@ -48,6 +44,6 @@ modeling. We first concatenate all dialog turns within a dialogue session into a
 sequence length), ended by the end-of-text token.* For more information please confer to the original paper.
-DialoGPT's architecture is based on the GPT2 model, so one can refer to :doc:`GPT2's documentation page <gpt2>`.
+DialoGPT's architecture is based on the GPT2 model, so one can refer to [GPT2's documentation page](gpt2).
-The original code can be found `here <https://github.com/microsoft/DialoGPT>`_.
+The original code can be found [here](https://github.com/microsoft/DialoGPT).
--- a/docs/source/model_doc/distilbert.mdx
+++ b/docs/source/model_doc/distilbert.mdx
@@ -0,0 +1,149 @@
 <!--Copyright 2020 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # DistilBERT
 ## Overview
 The DistilBERT model was proposed in the blog post [Smaller, faster, cheaper, lighter: Introducing DistilBERT, a
 distilled version of BERT](https://medium.com/huggingface/distilbert-8cf3380435b5), and the paper [DistilBERT, a
 distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108). DistilBERT is a
 small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than
 *bert-base-uncased* and runs 60% faster while preserving over 95% of BERT's performance as measured on the GLUE language
 understanding benchmark.
 The abstract from the paper is the following:
 *As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP),
 operating these large models in on-the-edge and/or under constrained computational training or inference budgets
 remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation
 model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger
 counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage
 knowledge distillation during the pretraining phase and show that it is possible to reduce the size of a BERT model by
 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive
 biases learned by larger models during pretraining, we introduce a triple loss combining language modeling,
 distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we
 demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device
 study.*
 Tips:
 - DistilBERT doesn't have `token_type_ids`, so you don't need to indicate which token belongs to which segment. Just
  separate your segments with the separation token `tokenizer.sep_token` (or `[SEP]`).
 - DistilBERT doesn't have options to select the input positions (`position_ids` input). This could be added if
  necessary, though; just let us know if you need this option.
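As a rough illustration of the first tip, the sketch below joins two segments with the separation token instead of relying on `token_type_ids`. The helper `join_segments` and the literal `"[SEP]"` default are assumptions made for this example and are not part of the library; in practice the token would be read from `tokenizer.sep_token`.

```python
def join_segments(segments, sep_token="[SEP]"):
    """Join text segments with the separation token (hypothetical helper)."""
    # Strip each segment so the separator is surrounded by exactly one space.
    return f" {sep_token} ".join(segment.strip() for segment in segments)


# The resulting string can be passed to the tokenizer as a single input.
text = join_segments(["How old are you?", "I am six years old."])
print(text)
```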
 This model was contributed by [victorsanh](https://huggingface.co/victorsanh). The JAX version of this model was
 contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/huggingface/transformers/tree/master/examples/research_projects/distillation).
 ## DistilBertConfig
 [[autodoc]] DistilBertConfig
 ## DistilBertTokenizer
 [[autodoc]] DistilBertTokenizer
 ## DistilBertTokenizerFast
 [[autodoc]] DistilBertTokenizerFast
 ## DistilBertModel
 [[autodoc]] DistilBertModel
    - forward
 ## DistilBertForMaskedLM
 [[autodoc]] DistilBertForMaskedLM
    - forward
 ## DistilBertForSequenceClassification
 [[autodoc]] DistilBertForSequenceClassification
    - forward
 ## DistilBertForMultipleChoice
 [[autodoc]] DistilBertForMultipleChoice
    - forward
 ## DistilBertForTokenClassification
 [[autodoc]] DistilBertForTokenClassification
    - forward
 ## DistilBertForQuestionAnswering
 [[autodoc]] DistilBertForQuestionAnswering
    - forward
 ## TFDistilBertModel
 [[autodoc]] TFDistilBertModel
    - call
 ## TFDistilBertForMaskedLM
 [[autodoc]] TFDistilBertForMaskedLM
    - call
 ## TFDistilBertForSequenceClassification
 [[autodoc]] TFDistilBertForSequenceClassification
    - call
 ## TFDistilBertForMultipleChoice
 [[autodoc]] TFDistilBertForMultipleChoice
    - call
 ## TFDistilBertForTokenClassification
 [[autodoc]] TFDistilBertForTokenClassification
    - call
 ## TFDistilBertForQuestionAnswering
 [[autodoc]] TFDistilBertForQuestionAnswering
    - call
 ## FlaxDistilBertModel
 [[autodoc]] FlaxDistilBertModel
    - __call__
 ## FlaxDistilBertForMaskedLM
 [[autodoc]] FlaxDistilBertForMaskedLM
    - __call__
 ## FlaxDistilBertForSequenceClassification
 [[autodoc]] FlaxDistilBertForSequenceClassification
    - __call__
 ## FlaxDistilBertForMultipleChoice
 [[autodoc]] FlaxDistilBertForMultipleChoice
    - __call__
 ## FlaxDistilBertForTokenClassification
 [[autodoc]] FlaxDistilBertForTokenClassification
    - __call__
 ## FlaxDistilBertForQuestionAnswering
 [[autodoc]] FlaxDistilBertForQuestionAnswering
    - __call__
--- a/docs/source/model_doc/distilbert.rst
+++ b/docs/source/model_doc/distilbert.rst
@@ -1,197 +0,0 @@
 .. 
    Copyright 2020 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 DistilBERT
 -----------------------------------------------------------------------------------------------------------------------
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The DistilBERT model was proposed in the blog post `Smaller, faster, cheaper, lighter: Introducing DistilBERT, a
 distilled version of BERT <https://medium.com/huggingface/distilbert-8cf3380435b5>`__, and the paper `DistilBERT, a
 distilled version of BERT: smaller, faster, cheaper and lighter <https://arxiv.org/abs/1910.01108>`__. DistilBERT is a
 small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than
 `bert-base-uncased` and runs 60% faster while preserving over 95% of BERT's performance as measured on the GLUE language
 understanding benchmark.
 The abstract from the paper is the following:
 *As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP),
 operating these large models in on-the-edge and/or under constrained computational training or inference budgets
 remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation
 model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger
 counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage
 knowledge distillation during the pretraining phase and show that it is possible to reduce the size of a BERT model by
 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive
 biases learned by larger models during pretraining, we introduce a triple loss combining language modeling,
 distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we
 demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device
 study.*
 Tips:
 - DistilBERT doesn't have :obj:`token_type_ids`, so you don't need to indicate which token belongs to which segment. Just
  separate your segments with the separation token :obj:`tokenizer.sep_token` (or :obj:`[SEP]`).
 - DistilBERT doesn't have options to select the input positions (:obj:`position_ids` input). This could be added if
  necessary, though; just let us know if you need this option.
 This model was contributed by `victorsanh <https://huggingface.co/victorsanh>`__. The JAX version of this model was
 contributed by `kamalkraj <https://huggingface.co/kamalkraj>`__. The original code can be found :prefix_link:`here
 <examples/research_projects/distillation>`.
 DistilBertConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DistilBertConfig
    :members:
 DistilBertTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DistilBertTokenizer
    :members:
 DistilBertTokenizerFast
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DistilBertTokenizerFast
    :members:
 DistilBertModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DistilBertModel
    :members: forward
 DistilBertForMaskedLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DistilBertForMaskedLM
    :members: forward
 DistilBertForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DistilBertForSequenceClassification
    :members: forward
 DistilBertForMultipleChoice
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DistilBertForMultipleChoice
    :members: forward
 DistilBertForTokenClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DistilBertForTokenClassification
    :members: forward
 DistilBertForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DistilBertForQuestionAnswering
    :members: forward
 TFDistilBertModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFDistilBertModel
    :members: call
 TFDistilBertForMaskedLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFDistilBertForMaskedLM
    :members: call
 TFDistilBertForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFDistilBertForSequenceClassification
    :members: call
 TFDistilBertForMultipleChoice
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFDistilBertForMultipleChoice
    :members: call
 TFDistilBertForTokenClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFDistilBertForTokenClassification
    :members: call
 TFDistilBertForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFDistilBertForQuestionAnswering
    :members: call
 FlaxDistilBertModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxDistilBertModel
    :members: __call__
 FlaxDistilBertForMaskedLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxDistilBertForMaskedLM
    :members: __call__
 FlaxDistilBertForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxDistilBertForSequenceClassification
    :members: __call__
 FlaxDistilBertForMultipleChoice
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxDistilBertForMultipleChoice
    :members: __call__
 FlaxDistilBertForTokenClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxDistilBertForTokenClassification
    :members: __call__
 FlaxDistilBertForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxDistilBertForQuestionAnswering
    :members: __call__
--- a/docs/source/model_doc/dpr.mdx
+++ b/docs/source/model_doc/dpr.mdx
@@ -0,0 +1,98 @@
 <!--Copyright 2020 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # DPR
 ## Overview
 Dense Passage Retrieval (DPR) is a set of tools and models for state-of-the-art open-domain Q&A research. It was
 introduced in [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906) by
 Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih.
 The abstract from the paper is the following:
 *Open-domain question answering relies on efficient passage retrieval to select candidate contexts, where traditional
 sparse vector space models, such as TF-IDF or BM25, are the de facto method. In this work, we show that retrieval can
 be practically implemented using dense representations alone, where embeddings are learned from a small number of
 questions and passages by a simple dual-encoder framework. When evaluated on a wide range of open-domain QA datasets,
 our dense retriever outperforms a strong Lucene-BM25 system largely by 9%-19% absolute in terms of top-20 passage
 retrieval accuracy, and helps our end-to-end QA system establish new state-of-the-art on multiple open-domain QA
 benchmarks.*
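The dual-encoder retrieval idea from the abstract can be sketched with toy vectors: questions and passages are embedded separately, and passages are ranked by a similarity score against the question. The sketch assumes dot-product similarity and uses hand-written three-dimensional vectors in place of real encoder outputs; `rank_passages` is a hypothetical helper, not a library function.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_passages(question_vec, passage_vecs, top_k=2):
    """Return the indices of the top_k passages by dot-product score."""
    scores = [dot(question_vec, p) for p in passage_vecs]
    # Sort passage indices from highest to lowest score.
    order = sorted(range(len(passage_vecs)), key=lambda i: scores[i], reverse=True)
    return order[:top_k]


# Toy embeddings standing in for the question and context encoder outputs.
question = [0.9, 0.1, 0.0]
passages = [
    [0.1, 0.8, 0.1],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
]
top = rank_passages(question, passages)
print(top)
```

Because the two encoders are independent, passage vectors can be computed once ahead of time and only the question has to be encoded at query time.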
 This model was contributed by [lhoestq](https://huggingface.co/lhoestq). The original code can be found [here](https://github.com/facebookresearch/DPR).
 ## DPRConfig
 [[autodoc]] DPRConfig
 ## DPRContextEncoderTokenizer
 [[autodoc]] DPRContextEncoderTokenizer
 ## DPRContextEncoderTokenizerFast
 [[autodoc]] DPRContextEncoderTokenizerFast
 ## DPRQuestionEncoderTokenizer
 [[autodoc]] DPRQuestionEncoderTokenizer
 ## DPRQuestionEncoderTokenizerFast
 [[autodoc]] DPRQuestionEncoderTokenizerFast
 ## DPRReaderTokenizer
 [[autodoc]] DPRReaderTokenizer
 ## DPRReaderTokenizerFast
 [[autodoc]] DPRReaderTokenizerFast
 ## DPR specific outputs
 [[autodoc]] models.dpr.modeling_dpr.DPRContextEncoderOutput
 [[autodoc]] models.dpr.modeling_dpr.DPRQuestionEncoderOutput
 [[autodoc]] models.dpr.modeling_dpr.DPRReaderOutput
 ## DPRContextEncoder
 [[autodoc]] DPRContextEncoder
    - forward
 ## DPRQuestionEncoder
 [[autodoc]] DPRQuestionEncoder
    - forward
 ## DPRReader
 [[autodoc]] DPRReader
    - forward
 ## TFDPRContextEncoder
 [[autodoc]] TFDPRContextEncoder
    - call
 ## TFDPRQuestionEncoder
 [[autodoc]] TFDPRQuestionEncoder
    - call
 ## TFDPRReader
 [[autodoc]] TFDPRReader
    - call
--- a/docs/source/model_doc/dpr.rst
+++ b/docs/source/model_doc/dpr.rst
@@ -1,133 +0,0 @@
 .. 
    Copyright 2020 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 DPR
 -----------------------------------------------------------------------------------------------------------------------
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 Dense Passage Retrieval (DPR) is a set of tools and models for state-of-the-art open-domain Q&A research. It was
 introduced in `Dense Passage Retrieval for Open-Domain Question Answering <https://arxiv.org/abs/2004.04906>`__ by
 Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih.
 The abstract from the paper is the following:
 *Open-domain question answering relies on efficient passage retrieval to select candidate contexts, where traditional
 sparse vector space models, such as TF-IDF or BM25, are the de facto method. In this work, we show that retrieval can
 be practically implemented using dense representations alone, where embeddings are learned from a small number of
 questions and passages by a simple dual-encoder framework. When evaluated on a wide range of open-domain QA datasets,
 our dense retriever outperforms a strong Lucene-BM25 system largely by 9%-19% absolute in terms of top-20 passage
 retrieval accuracy, and helps our end-to-end QA system establish new state-of-the-art on multiple open-domain QA
 benchmarks.*
 This model was contributed by `lhoestq <https://huggingface.co/lhoestq>`__. The original code can be found `here
 <https://github.com/facebookresearch/DPR>`__.
 DPRConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DPRConfig
    :members:
 DPRContextEncoderTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DPRContextEncoderTokenizer
    :members:
 DPRContextEncoderTokenizerFast
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DPRContextEncoderTokenizerFast
    :members:
 DPRQuestionEncoderTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DPRQuestionEncoderTokenizer
    :members:
 DPRQuestionEncoderTokenizerFast
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DPRQuestionEncoderTokenizerFast
    :members:
 DPRReaderTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DPRReaderTokenizer
    :members:
 DPRReaderTokenizerFast
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DPRReaderTokenizerFast
    :members:
 DPR specific outputs
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.models.dpr.modeling_dpr.DPRContextEncoderOutput
    :members:
 .. autoclass:: transformers.models.dpr.modeling_dpr.DPRQuestionEncoderOutput
    :members:
 .. autoclass:: transformers.models.dpr.modeling_dpr.DPRReaderOutput
    :members:
 DPRContextEncoder
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DPRContextEncoder
    :members: forward
 DPRQuestionEncoder
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DPRQuestionEncoder
    :members: forward
 DPRReader
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.DPRReader
    :members: forward
 TFDPRContextEncoder
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFDPRContextEncoder
    :members: call
 TFDPRQuestionEncoder
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFDPRQuestionEncoder
    :members: call
 TFDPRReader
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFDPRReader
    :members: call
--- a/docs/source/model_doc/electra.mdx
+++ b/docs/source/model_doc/electra.mdx
@@ -0,0 +1,179 @@
 <!--Copyright 2020 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # ELECTRA
 ## Overview
 The ELECTRA model was proposed in the paper [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than
 Generators](https://openreview.net/pdf?id=r1xMH1BtvB). ELECTRA is a new pretraining approach which trains two
 transformer models: the generator and the discriminator. The generator's role is to replace tokens in a sequence, and
 is therefore trained as a masked language model. The discriminator, which is the model we're interested in, tries to
 identify which tokens were replaced by the generator in the sequence.
 The abstract from the paper is the following:
 *Masked language modeling (MLM) pretraining methods such as BERT corrupt the input by replacing some tokens with [MASK]
 and then train a model to reconstruct the original tokens. While they produce good results when transferred to
 downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a
 more sample-efficient pretraining task called replaced token detection. Instead of masking the input, our approach
 corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead
 of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that
 predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments
 demonstrate this new pretraining task is more efficient than MLM because the task is defined over all input tokens
 rather than just the small subset that was masked out. As a result, the contextual representations learned by our
 approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are
 particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained
 using 30x more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale,
 where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when
 using the same amount of compute.*
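The replaced token detection task described above can be sketched as building one binary label per token: 1 where the generator's sample differs from the original token, 0 otherwise. This is a minimal illustration on plain token strings; `replaced_token_labels` is a hypothetical helper and the example sentence is made up for the sketch.

```python
def replaced_token_labels(original_tokens, corrupted_tokens):
    """Label each position 1 if the token was replaced, else 0."""
    if len(original_tokens) != len(corrupted_tokens):
        raise ValueError("both sequences must have the same length")
    # A position where the generator happens to sample the original token
    # compares equal and is therefore labeled 0 (not replaced).
    return [int(o != c) for o, c in zip(original_tokens, corrupted_tokens)]


original = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate", "the", "meal"]
labels = replaced_token_labels(original, corrupted)
print(labels)  # [0, 0, 1, 0, 0]
```

The labels cover every position in the sequence, which is the property the abstract credits for the task's sample efficiency.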
 Tips:
 - ELECTRA is a pretraining approach, so there are almost no changes to the underlying model, BERT. The
  only change is the separation of the embedding size and the hidden size: the embedding size is generally smaller,
  while the hidden size is larger. An additional projection layer (linear) is used to project the embeddings from their
  embedding size to the hidden size. In the case where the embedding size is the same as the hidden size, no projection
  layer is used.
 - The ELECTRA checkpoints saved using [Google Research's implementation](https://github.com/google-research/electra)
  contain both the generator and discriminator. The conversion script requires the user to name which model to export
  into the correct architecture. Once converted to the HuggingFace format, these checkpoints may be loaded into all
  available ELECTRA models, however. This means that the discriminator may be loaded in the
  [`ElectraForMaskedLM`] model, and the generator may be loaded in the
  [`ElectraForPreTraining`] model (the classification head will be randomly initialized as it
  doesn't exist in the generator).
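The first tip can be illustrated with a small pure-Python sketch: a linear projection maps embeddings to the hidden size only when the two sizes differ. The function `make_projection`, the random weight initialization, and the sizes used here are assumptions for illustration; this is not the library's implementation.

```python
import random


def make_projection(embedding_size, hidden_size, seed=0):
    """Return a function mapping an embedding vector to hidden_size dimensions."""
    if embedding_size == hidden_size:
        # Sizes match: no projection layer is needed.
        return lambda embedding: list(embedding)

    rng = random.Random(seed)
    # Weight matrix of shape (hidden_size, embedding_size) plus a zero bias.
    weight = [
        [rng.uniform(-0.1, 0.1) for _ in range(embedding_size)]
        for _ in range(hidden_size)
    ]
    bias = [0.0] * hidden_size

    def project(embedding):
        # Standard linear layer: one weighted sum plus bias per output unit.
        return [
            sum(w * x for w, x in zip(row, embedding)) + b
            for row, b in zip(weight, bias)
        ]

    return project


project = make_projection(embedding_size=4, hidden_size=8)
hidden = project([0.5, -0.2, 0.1, 0.3])
print(len(hidden))  # 8
```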
 This model was contributed by [lysandre](https://huggingface.co/lysandre). The original code can be found [here](https://github.com/google-research/electra).
 ## ElectraConfig
 [[autodoc]] ElectraConfig
 ## ElectraTokenizer
 [[autodoc]] ElectraTokenizer
 ## ElectraTokenizerFast
 [[autodoc]] ElectraTokenizerFast
 ## Electra specific outputs
 [[autodoc]] models.electra.modeling_electra.ElectraForPreTrainingOutput
 [[autodoc]] models.electra.modeling_tf_electra.TFElectraForPreTrainingOutput
 ## ElectraModel
 [[autodoc]] ElectraModel
    - forward
 ## ElectraForPreTraining
 [[autodoc]] ElectraForPreTraining
    - forward
 ## ElectraForMaskedLM
 [[autodoc]] ElectraForMaskedLM
    - forward
 ## ElectraForSequenceClassification
 [[autodoc]] ElectraForSequenceClassification
    - forward
 ## ElectraForMultipleChoice
 [[autodoc]] ElectraForMultipleChoice
    - forward
 ## ElectraForTokenClassification
 [[autodoc]] ElectraForTokenClassification
    - forward
 ## ElectraForQuestionAnswering
 [[autodoc]] ElectraForQuestionAnswering
    - forward
 ## TFElectraModel
 [[autodoc]] TFElectraModel
    - call
 ## TFElectraForPreTraining
 [[autodoc]] TFElectraForPreTraining
    - call
 ## TFElectraForMaskedLM
 [[autodoc]] TFElectraForMaskedLM
    - call
 ## TFElectraForSequenceClassification
 [[autodoc]] TFElectraForSequenceClassification
    - call
 ## TFElectraForMultipleChoice
 [[autodoc]] TFElectraForMultipleChoice
    - call
 ## TFElectraForTokenClassification
 [[autodoc]] TFElectraForTokenClassification
    - call
 ## TFElectraForQuestionAnswering
 [[autodoc]] TFElectraForQuestionAnswering
    - call
 ## FlaxElectraModel
 [[autodoc]] FlaxElectraModel
    - __call__
 ## FlaxElectraForPreTraining
 [[autodoc]] FlaxElectraForPreTraining
    - __call__
 ## FlaxElectraForMaskedLM
 [[autodoc]] FlaxElectraForMaskedLM
    - __call__
 ## FlaxElectraForSequenceClassification
 [[autodoc]] FlaxElectraForSequenceClassification
    - __call__
 ## FlaxElectraForMultipleChoice
 [[autodoc]] FlaxElectraForMultipleChoice
    - __call__
 ## FlaxElectraForTokenClassification
 [[autodoc]] FlaxElectraForTokenClassification
    - __call__
 ## FlaxElectraForQuestionAnswering
 [[autodoc]] FlaxElectraForQuestionAnswering
    - __call__
--- a/docs/source/model_doc/electra.rst
+++ b/docs/source/model_doc/electra.rst
@@ -1,236 +0,0 @@
 .. 
    Copyright 2020 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 ELECTRA
 -----------------------------------------------------------------------------------------------------------------------
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The ELECTRA model was proposed in the paper `ELECTRA: Pre-training Text Encoders as Discriminators Rather Than
 Generators <https://openreview.net/pdf?id=r1xMH1BtvB>`__. ELECTRA is a new pretraining approach which trains two
 transformer models: the generator and the discriminator. The generator's role is to replace tokens in a sequence, and
 is therefore trained as a masked language model. The discriminator, which is the model we're interested in, tries to
 identify which tokens were replaced by the generator in the sequence.
 The abstract from the paper is the following:
 *Masked language modeling (MLM) pretraining methods such as BERT corrupt the input by replacing some tokens with [MASK]
 and then train a model to reconstruct the original tokens. While they produce good results when transferred to
 downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a
 more sample-efficient pretraining task called replaced token detection. Instead of masking the input, our approach
 corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead
 of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that
 predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments
 demonstrate this new pretraining task is more efficient than MLM because the task is defined over all input tokens
 rather than just the small subset that was masked out. As a result, the contextual representations learned by our
 approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are
 particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained
 using 30x more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale,
 where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when
 using the same amount of compute.*
 Tips:
 - ELECTRA is a pretraining approach, so there are almost no changes to the underlying model, BERT. The
  only change is the separation of the embedding size and the hidden size: the embedding size is generally smaller,
  while the hidden size is larger. An additional projection layer (linear) is used to project the embeddings from their
  embedding size to the hidden size. In the case where the embedding size is the same as the hidden size, no projection
  layer is used.
 - The ELECTRA checkpoints saved using `Google Research's implementation <https://github.com/google-research/electra>`__
  contain both the generator and the discriminator. The conversion script requires the user to specify which of the two
  models to export into the correct architecture. Once converted to the HuggingFace format, however, these checkpoints
  may be loaded into any available ELECTRA model. This means that the discriminator may be loaded into the
  :class:`~transformers.ElectraForMaskedLM` model, and the generator may be loaded into the
  :class:`~transformers.ElectraForPreTraining` model (the classification head will be randomly initialized as it
  doesn't exist in the generator).
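The first tip (a projection that exists only when the embedding size differs from the hidden size) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: `make_embedding_projection`, the random weights, and the sizes are all hypothetical.

```python
import random


def make_embedding_projection(embedding_size, hidden_size, seed=0):
    """Return a function that maps an embedding vector to the hidden size."""
    if embedding_size == hidden_size:
        # Sizes match, so no projection layer is needed.
        return lambda vector: list(vector)

    rng = random.Random(seed)
    # Hypothetical randomly initialized linear layer: hidden_size x embedding_size.
    weights = [
        [rng.uniform(-0.1, 0.1) for _ in range(embedding_size)]
        for _ in range(hidden_size)
    ]
    bias = [0.0] * hidden_size

    def project(vector):
        return [
            sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)
        ]

    return project


embedding = [0.1, 0.2, 0.3, 0.4]

project = make_embedding_projection(embedding_size=4, hidden_size=8)
hidden = project(embedding)            # projected from 4 to 8 dimensions

passthrough = make_embedding_projection(embedding_size=4, hidden_size=4)
unchanged = passthrough(embedding)     # sizes match, vector is left as-is
```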
 This model was contributed by `lysandre <https://huggingface.co/lysandre>`__. The original code can be found `here
 <https://github.com/google-research/electra>`__.
 ElectraConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.ElectraConfig
    :members:
 ElectraTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.ElectraTokenizer
    :members:
 ElectraTokenizerFast
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.ElectraTokenizerFast
    :members:
 Electra specific outputs
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.models.electra.modeling_electra.ElectraForPreTrainingOutput
    :members:
 .. autoclass:: transformers.models.electra.modeling_tf_electra.TFElectraForPreTrainingOutput
    :members:
 ElectraModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.ElectraModel
    :members: forward
 ElectraForPreTraining
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.ElectraForPreTraining
    :members: forward
 ElectraForMaskedLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.ElectraForMaskedLM
    :members: forward
 ElectraForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.ElectraForSequenceClassification
    :members: forward
 ElectraForMultipleChoice
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.ElectraForMultipleChoice
    :members: forward
 ElectraForTokenClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.ElectraForTokenClassification
    :members: forward
 ElectraForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.ElectraForQuestionAnswering
    :members: forward
 TFElectraModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFElectraModel
    :members: call
 TFElectraForPreTraining
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFElectraForPreTraining
    :members: call
 TFElectraForMaskedLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFElectraForMaskedLM
    :members: call
 TFElectraForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFElectraForSequenceClassification
    :members: call
 TFElectraForMultipleChoice
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFElectraForMultipleChoice
    :members: call
 TFElectraForTokenClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFElectraForTokenClassification
    :members: call
 TFElectraForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFElectraForQuestionAnswering
    :members: call
 FlaxElectraModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxElectraModel
    :members: __call__
 FlaxElectraForPreTraining
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxElectraForPreTraining
    :members: __call__
 FlaxElectraForMaskedLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxElectraForMaskedLM
    :members: __call__
 FlaxElectraForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxElectraForSequenceClassification
    :members: __call__
 FlaxElectraForMultipleChoice
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxElectraForMultipleChoice
    :members: __call__
 FlaxElectraForTokenClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxElectraForTokenClassification
    :members: __call__
 FlaxElectraForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxElectraForQuestionAnswering
    :members: __call__
--- a/docs/source/model_doc/encoderdecoder.mdx
+++ b/docs/source/model_doc/encoderdecoder.mdx
@@ -0,0 +1,68 @@
 <!--Copyright 2020 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # Encoder Decoder Models
 The [`EncoderDecoderModel`] can be used to initialize a sequence-to-sequence model with any
 pretrained autoencoding model as the encoder and any pretrained autoregressive model as the decoder.
 The effectiveness of initializing sequence-to-sequence models with pretrained checkpoints for sequence generation tasks
 was shown in [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by
 Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
 After such an [`EncoderDecoderModel`] has been trained/fine-tuned, it can be saved/loaded just like
 any other model (see the examples for more information).
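As a rough sketch of the save/load round trip, the snippet below builds a tiny, randomly initialized encoder-decoder from two BERT configurations and reloads it from a temporary directory. The configuration values are illustrative assumptions chosen to keep the model small; they do not correspond to any released checkpoint, and a real workflow would usually start from pretrained weights instead.

```python
import tempfile

from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel

# Tiny illustrative configurations (assumed values, not a real checkpoint).
encoder_config = BertConfig(
    vocab_size=100, hidden_size=32, num_hidden_layers=1,
    num_attention_heads=2, intermediate_size=64,
)
decoder_config = BertConfig(
    vocab_size=100, hidden_size=32, num_hidden_layers=1,
    num_attention_heads=2, intermediate_size=64,
)

# Combine both configurations; the decoder is marked as a decoder with cross-attention.
config = EncoderDecoderConfig.from_encoder_decoder_configs(encoder_config, decoder_config)
model = EncoderDecoderModel(config=config)

# Save and reload like any other model.
with tempfile.TemporaryDirectory() as tmp_dir:
    model.save_pretrained(tmp_dir)
    reloaded = EncoderDecoderModel.from_pretrained(tmp_dir)

num_params = sum(p.numel() for p in model.parameters())
num_params_reloaded = sum(p.numel() for p in reloaded.parameters())
```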
 An application of this architecture could be to leverage two pretrained [`BertModel`] as the encoder
 and decoder for a summarization model as was shown in: [Text Summarization with Pretrained Encoders](https://arxiv.org/abs/1908.08345) by Yang Liu and Mirella Lapata.
 [`~TFEncoderDecoderModel.from_pretrained`] currently doesn't support initializing the model from a
 PyTorch checkpoint. Passing `from_pt=True` to this method will throw an exception. If only PyTorch
 checkpoints exist for a particular encoder-decoder model, a workaround is:
 ```python
 >>> from transformers import EncoderDecoderModel, TFEncoderDecoderModel

 >>> # a workaround to load from a PyTorch checkpoint
 >>> _model = EncoderDecoderModel.from_pretrained("patrickvonplaten/bert2bert-cnn_dailymail-fp16")
 >>> _model.encoder.save_pretrained("./encoder")
 >>> _model.decoder.save_pretrained("./decoder")
 >>> model = TFEncoderDecoderModel.from_encoder_decoder_pretrained(
 ...     "./encoder", "./decoder", encoder_from_pt=True, decoder_from_pt=True
 ... )
 >>> # This is only for copying some specific attributes of this particular model.
 >>> model.config = _model.config
 ```
 This model was contributed by [thomwolf](https://github.com/thomwolf). This model's TensorFlow and Flax versions
 were contributed by [ydshieh](https://github.com/ydshieh).
 ## EncoderDecoderConfig
 [[autodoc]] EncoderDecoderConfig
 ## EncoderDecoderModel
 [[autodoc]] EncoderDecoderModel
    - forward
    - from_encoder_decoder_pretrained
 ## TFEncoderDecoderModel
 [[autodoc]] TFEncoderDecoderModel
    - call
    - from_encoder_decoder_pretrained
 ## FlaxEncoderDecoderModel
 [[autodoc]] FlaxEncoderDecoderModel
    - __call__
    - from_encoder_decoder_pretrained
--- a/docs/source/model_doc/encoderdecoder.rst
+++ b/docs/source/model_doc/encoderdecoder.rst
@@ -1,75 +0,0 @@
 .. 
    Copyright 2020 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 Encoder Decoder Models
 -----------------------------------------------------------------------------------------------------------------------
 The :class:`~transformers.EncoderDecoderModel` can be used to initialize a sequence-to-sequence model with any
 pretrained autoencoding model as the encoder and any pretrained autoregressive model as the decoder.
 The effectiveness of initializing sequence-to-sequence models with pretrained checkpoints for sequence generation tasks
 was shown in `Leveraging Pre-trained Checkpoints for Sequence Generation Tasks <https://arxiv.org/abs/1907.12461>`__ by
 Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
 After such an :class:`~transformers.EncoderDecoderModel` has been trained/fine-tuned, it can be saved/loaded just like
 any other model (see the examples for more information).
 An application of this architecture could be to leverage two pretrained :class:`~transformers.BertModel` as the encoder
 and decoder for a summarization model as was shown in: `Text Summarization with Pretrained Encoders
 <https://arxiv.org/abs/1908.08345>`__ by Yang Liu and Mirella Lapata.
 :meth:`~transformers.TFEncoderDecoderModel.from_pretrained` currently doesn't support initializing the model from a
 PyTorch checkpoint. Passing ``from_pt=True`` to this method will throw an exception. If only PyTorch
 checkpoints exist for a particular encoder-decoder model, a workaround is:
 .. code-block::
    >>> from transformers import EncoderDecoderModel, TFEncoderDecoderModel

    >>> # a workaround to load from a PyTorch checkpoint
    >>> _model = EncoderDecoderModel.from_pretrained("patrickvonplaten/bert2bert-cnn_dailymail-fp16")
    >>> _model.encoder.save_pretrained("./encoder")
    >>> _model.decoder.save_pretrained("./decoder")
    >>> model = TFEncoderDecoderModel.from_encoder_decoder_pretrained(
    ...     "./encoder", "./decoder", encoder_from_pt=True, decoder_from_pt=True
    ... )
    >>> # This is only for copying some specific attributes of this particular model.
    >>> model.config = _model.config
 This model was contributed by `thomwolf <https://github.com/thomwolf>`__. This model's TensorFlow and Flax versions
 were contributed by `ydshieh <https://github.com/ydshieh>`__.
 EncoderDecoderConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.EncoderDecoderConfig
    :members:
 EncoderDecoderModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.EncoderDecoderModel
    :members: forward, from_encoder_decoder_pretrained
 TFEncoderDecoderModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFEncoderDecoderModel
    :members: call, from_encoder_decoder_pretrained
 FlaxEncoderDecoderModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxEncoderDecoderModel
    :members: __call__, from_encoder_decoder_pretrained
--- a/docs/source/model_doc/flaubert.mdx
+++ b/docs/source/model_doc/flaubert.mdx
@@ -0,0 +1,109 @@
 <!--Copyright 2020 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # FlauBERT
 ## Overview
 The FlauBERT model was proposed in the paper [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) by Hang Le et al. It's a transformer model pretrained using a masked language
 modeling (MLM) objective (like BERT).
 The abstract from the paper is the following:
 *Language models have become a key step to achieve state-of-the art results in many different Natural Language
 Processing (NLP) tasks. Leveraging the huge amount of unlabeled texts nowadays available, they provide an efficient way
 to pre-train continuous word representations that can be fine-tuned for a downstream task, along with their
 contextualization at the sentence level. This has been widely demonstrated for English using contextualized
 representations (Dai and Le, 2015; Peters et al., 2018; Howard and Ruder, 2018; Radford et al., 2018; Devlin et al.,
 2019; Yang et al., 2019b). In this paper, we introduce and share FlauBERT, a model learned on a very large and
 heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre for
 Scientific Research) Jean Zay supercomputer. We apply our French language models to diverse NLP tasks (text
 classification, paraphrasing, natural language inference, parsing, word sense disambiguation) and show that most of the
 time they outperform other pretraining approaches. Different versions of FlauBERT as well as a unified evaluation
 protocol for the downstream tasks, called FLUE (French Language Understanding Evaluation), are shared to the research
 community for further reproducible experiments in French NLP.*
 This model was contributed by [formiel](https://huggingface.co/formiel). The original code can be found [here](https://github.com/getalp/Flaubert).
 ## FlaubertConfig
 [[autodoc]] FlaubertConfig
 ## FlaubertTokenizer
 [[autodoc]] FlaubertTokenizer
 ## FlaubertModel
 [[autodoc]] FlaubertModel
    - forward
 ## FlaubertWithLMHeadModel
 [[autodoc]] FlaubertWithLMHeadModel
    - forward
 ## FlaubertForSequenceClassification
 [[autodoc]] FlaubertForSequenceClassification
    - forward
 ## FlaubertForMultipleChoice
 [[autodoc]] FlaubertForMultipleChoice
    - forward
 ## FlaubertForTokenClassification
 [[autodoc]] FlaubertForTokenClassification
    - forward
 ## FlaubertForQuestionAnsweringSimple
 [[autodoc]] FlaubertForQuestionAnsweringSimple
    - forward
 ## FlaubertForQuestionAnswering
 [[autodoc]] FlaubertForQuestionAnswering
    - forward
 ## TFFlaubertModel
 [[autodoc]] TFFlaubertModel
    - call
 ## TFFlaubertWithLMHeadModel
 [[autodoc]] TFFlaubertWithLMHeadModel
    - call
 ## TFFlaubertForSequenceClassification
 [[autodoc]] TFFlaubertForSequenceClassification
    - call
 ## TFFlaubertForMultipleChoice
 [[autodoc]] TFFlaubertForMultipleChoice
    - call
 ## TFFlaubertForTokenClassification
 [[autodoc]] TFFlaubertForTokenClassification
    - call
 ## TFFlaubertForQuestionAnsweringSimple
 [[autodoc]] TFFlaubertForQuestionAnsweringSimple
    - call
--- a/docs/source/model_doc/flaubert.rst
+++ b/docs/source/model_doc/flaubert.rst
@@ -1,144 +0,0 @@
 .. 
    Copyright 2020 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 FlauBERT
 -----------------------------------------------------------------------------------------------------------------------
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The FlauBERT model was proposed in the paper `FlauBERT: Unsupervised Language Model Pre-training for French
 <https://arxiv.org/abs/1912.05372>`__ by Hang Le et al. It's a transformer model pretrained using a masked language
 modeling (MLM) objective (like BERT).
 The abstract from the paper is the following:
 *Language models have become a key step to achieve state-of-the art results in many different Natural Language
 Processing (NLP) tasks. Leveraging the huge amount of unlabeled texts nowadays available, they provide an efficient way
 to pre-train continuous word representations that can be fine-tuned for a downstream task, along with their
 contextualization at the sentence level. This has been widely demonstrated for English using contextualized
 representations (Dai and Le, 2015; Peters et al., 2018; Howard and Ruder, 2018; Radford et al., 2018; Devlin et al.,
 2019; Yang et al., 2019b). In this paper, we introduce and share FlauBERT, a model learned on a very large and
 heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre for
 Scientific Research) Jean Zay supercomputer. We apply our French language models to diverse NLP tasks (text
 classification, paraphrasing, natural language inference, parsing, word sense disambiguation) and show that most of the
 time they outperform other pretraining approaches. Different versions of FlauBERT as well as a unified evaluation
 protocol for the downstream tasks, called FLUE (French Language Understanding Evaluation), are shared to the research
 community for further reproducible experiments in French NLP.*
 This model was contributed by `formiel <https://huggingface.co/formiel>`__. The original code can be found `here
 <https://github.com/getalp/Flaubert>`__.
 FlaubertConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaubertConfig
    :members:
 FlaubertTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaubertTokenizer
    :members:
 FlaubertModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaubertModel
    :members: forward
 FlaubertWithLMHeadModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaubertWithLMHeadModel
    :members: forward
 FlaubertForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaubertForSequenceClassification
    :members: forward
 FlaubertForMultipleChoice
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaubertForMultipleChoice
    :members: forward
 FlaubertForTokenClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaubertForTokenClassification
    :members: forward
 FlaubertForQuestionAnsweringSimple
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaubertForQuestionAnsweringSimple
    :members: forward
 FlaubertForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaubertForQuestionAnswering
    :members: forward
 TFFlaubertModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFFlaubertModel
    :members: call
 TFFlaubertWithLMHeadModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFFlaubertWithLMHeadModel
    :members: call
 TFFlaubertForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFFlaubertForSequenceClassification
    :members: call
 TFFlaubertForMultipleChoice
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFFlaubertForMultipleChoice
    :members: call
 TFFlaubertForTokenClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFFlaubertForTokenClassification
    :members: call
 TFFlaubertForQuestionAnsweringSimple
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFFlaubertForQuestionAnsweringSimple
    :members: call
--- a/docs/source/model_doc/fnet.mdx
+++ b/docs/source/model_doc/fnet.mdx
@@ -0,0 +1,98 @@
 <!--Copyright 2021 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # FNet
 ## Overview
 The FNet model was proposed in [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) by
 James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon. The model replaces the self-attention layer in a BERT
 model with a Fourier transform and keeps only the real part of the result. The model is significantly faster
 than BERT because it has fewer parameters and is more memory efficient. It achieves about 92-97% of the
 accuracy of its BERT counterparts on the GLUE benchmark and trains much faster. The abstract from the
 paper is the following:
 *We show that Transformer encoder architectures can be sped up, with limited accuracy costs, by replacing the
 self-attention sublayers with simple linear transformations that "mix" input tokens. These linear mixers, along with
 standard nonlinearities in feed-forward layers, prove competent at modeling semantic relationships in several text
 classification tasks. Most surprisingly, we find that replacing the self-attention sublayer in a Transformer encoder
 with a standard, unparameterized Fourier Transform achieves 92-97% of the accuracy of BERT counterparts on the GLUE
 benchmark, but trains 80% faster on GPUs and 70% faster on TPUs at standard 512 input lengths. At longer input lengths,
 our FNet model is significantly faster: when compared to the "efficient" Transformers on the Long Range Arena
 benchmark, FNet matches the accuracy of the most accurate models, while outpacing the fastest models across all
 sequence lengths on GPUs (and across relatively shorter lengths on TPUs). Finally, FNet has a light memory footprint
 and is particularly efficient at smaller model sizes; for a fixed speed and accuracy budget, small FNet models
 outperform Transformer counterparts.*
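The token mixing described above can be sketched without any dependencies. This is a minimal illustration, not the library's implementation: it applies a naive discrete Fourier transform along the hidden dimension and then along the sequence dimension, and keeps only the real part. The helper names `dft` and `fourier_mix` are hypothetical, and a real implementation would use a fast Fourier transform instead of this quadratic-time loop.

```python
import cmath


def dft(values):
    """Naive discrete Fourier transform of a 1-D list of (complex) numbers."""
    n = len(values)
    return [
        sum(values[pos] * cmath.exp(-2j * cmath.pi * freq * pos / n) for pos in range(n))
        for freq in range(n)
    ]


def fourier_mix(hidden_states):
    """Mix a seq_len x hidden_size matrix with a 2-D DFT and keep the real part."""
    # Transform along the hidden dimension (each row).
    along_hidden = [dft(row) for row in hidden_states]
    seq_len = len(along_hidden)
    hidden_size = len(along_hidden[0])

    mixed = [[0.0] * hidden_size for _ in range(seq_len)]
    # Transform along the sequence dimension (each column), then drop the imaginary part.
    for h in range(hidden_size):
        column = dft([along_hidden[s][h] for s in range(seq_len)])
        for s in range(seq_len):
            mixed[s][h] = column[s].real
    return mixed


x = [[1.0, 2.0], [3.0, 4.0]]  # toy input: 2 tokens, hidden size 2
y = fourier_mix(x)
```

The transform has no learnable parameters, and the output has the same shape as the input, so it can stand in for a self-attention sublayer.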
 Tips on usage:
 - The model was trained without an attention mask because it is based on the Fourier transform. The model was trained
  with a maximum sequence length of 512, which includes pad tokens. Hence, it is highly recommended to use the same
  maximum sequence length for fine-tuning and inference.
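The tip above amounts to always feeding fixed-length inputs. A minimal sketch of that padding step follows; `pad_to_max_length` is a hypothetical helper and `pad_token_id=0` is a placeholder assumption, since the real pad id comes from the tokenizer. With a tokenizer from the library, the same effect is usually obtained by passing `padding="max_length"`, `truncation=True`, and `max_length=512` when encoding.

```python
def pad_to_max_length(token_ids, max_length=512, pad_token_id=0):
    """Truncate or right-pad a list of token ids to exactly max_length entries."""
    if len(token_ids) >= max_length:
        # Simple truncation; a real tokenizer would also preserve special tokens.
        return token_ids[:max_length]
    return token_ids + [pad_token_id] * (max_length - len(token_ids))


short_input = [5, 6, 7]              # hypothetical token ids
padded = pad_to_max_length(short_input)

long_input = list(range(1, 601))     # 600 hypothetical token ids
truncated = pad_to_max_length(long_input)
```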
 This model was contributed by [gchhablani](https://huggingface.co/gchhablani). The original code can be found [here](https://github.com/google-research/google-research/tree/master/f_net).
 ## FNetConfig
 [[autodoc]] FNetConfig
 ## FNetTokenizer
 [[autodoc]] FNetTokenizer
    - build_inputs_with_special_tokens
    - get_special_tokens_mask
    - create_token_type_ids_from_sequences
    - save_vocabulary
 ## FNetTokenizerFast
 [[autodoc]] FNetTokenizerFast
 ## FNetModel
 [[autodoc]] FNetModel
    - forward
 ## FNetForPreTraining
 [[autodoc]] FNetForPreTraining
    - forward
 ## FNetForMaskedLM
 [[autodoc]] FNetForMaskedLM
    - forward
 ## FNetForNextSentencePrediction
 [[autodoc]] FNetForNextSentencePrediction
    - forward
 ## FNetForSequenceClassification
 [[autodoc]] FNetForSequenceClassification
    - forward
 ## FNetForMultipleChoice
 [[autodoc]] FNetForMultipleChoice
    - forward
 ## FNetForTokenClassification
 [[autodoc]] FNetForTokenClassification
    - forward
 ## FNetForQuestionAnswering
 [[autodoc]] FNetForQuestionAnswering
    - forward
--- a/docs/source/model_doc/fnet.rst
+++ b/docs/source/model_doc/fnet.rst
@@ -1,121 +0,0 @@
 .. 
    Copyright 2021 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 FNet
 -----------------------------------------------------------------------------------------------------------------------
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The FNet model was proposed in `FNet: Mixing Tokens with Fourier Transforms <https://arxiv.org/abs/2105.03824>`__ by
 James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon. The model replaces the self-attention layer in a BERT
 model with a Fourier transform and keeps only the real part of the result. The model is significantly faster
 than BERT because it has fewer parameters and is more memory efficient. It achieves about 92-97% of the
 accuracy of its BERT counterparts on the GLUE benchmark and trains much faster. The abstract from the
 paper is the following:
 *We show that Transformer encoder architectures can be sped up, with limited accuracy costs, by replacing the
 self-attention sublayers with simple linear transformations that "mix" input tokens. These linear mixers, along with
 standard nonlinearities in feed-forward layers, prove competent at modeling semantic relationships in several text
 classification tasks. Most surprisingly, we find that replacing the self-attention sublayer in a Transformer encoder
 with a standard, unparameterized Fourier Transform achieves 92-97% of the accuracy of BERT counterparts on the GLUE
 benchmark, but trains 80% faster on GPUs and 70% faster on TPUs at standard 512 input lengths. At longer input lengths,
 our FNet model is significantly faster: when compared to the "efficient" Transformers on the Long Range Arena
 benchmark, FNet matches the accuracy of the most accurate models, while outpacing the fastest models across all
 sequence lengths on GPUs (and across relatively shorter lengths on TPUs). Finally, FNet has a light memory footprint
 and is particularly efficient at smaller model sizes; for a fixed speed and accuracy budget, small FNet models
 outperform Transformer counterparts.*
 Tips on usage:
 - The model was trained without an attention mask because it is based on the Fourier transform. The model was trained
  with a maximum sequence length of 512, which includes pad tokens. Hence, it is highly recommended to use the same
  maximum sequence length for fine-tuning and inference.
 This model was contributed by `gchhablani <https://huggingface.co/gchhablani>`__. The original code can be found `here
 <https://github.com/google-research/google-research/tree/master/f_net>`__.
 FNetConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FNetConfig
    :members:
 FNetTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FNetTokenizer
    :members: build_inputs_with_special_tokens, get_special_tokens_mask,
        create_token_type_ids_from_sequences, save_vocabulary
 FNetTokenizerFast
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FNetTokenizerFast
    :members:
 FNetModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FNetModel
    :members: forward
 FNetForPreTraining
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FNetForPreTraining
    :members: forward
 FNetForMaskedLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FNetForMaskedLM
    :members: forward
 FNetForNextSentencePrediction
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FNetForNextSentencePrediction
    :members: forward
 FNetForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FNetForSequenceClassification
    :members: forward
 FNetForMultipleChoice
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FNetForMultipleChoice
    :members: forward
 FNetForTokenClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FNetForTokenClassification
    :members: forward
 FNetForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FNetForQuestionAnswering
    :members: forward
--- a/docs/source/model_doc/fsmt.mdx
+++ b/docs/source/model_doc/fsmt.mdx
@@ -0,0 +1,63 @@
 <!--Copyright 2020 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # FSMT
 **DISCLAIMER:** If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title) and assign
@stas00.
 ## Overview
 FSMT (FairSeq MachineTranslation) models were introduced in [Facebook FAIR's WMT19 News Translation Task Submission](https://arxiv.org/abs/1907.06616) by Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, Sergey Edunov.
 The abstract of the paper is the following:
 *This paper describes Facebook FAIR's submission to the WMT19 shared news translation task. We participate in two
 language pairs and four language directions, English <-> German and English <-> Russian. Following our submission from
 last year, our baseline systems are large BPE-based transformer models trained with the Fairseq sequence modeling
 toolkit which rely on sampled back-translations. This year we experiment with different bitext data filtering schemes,
 as well as with adding filtered back-translated data. We also ensemble and fine-tune our models on domain-specific
 data, then decode using noisy channel model reranking. Our submissions are ranked first in all four directions of the
 human evaluation campaign. On En->De, our system significantly outperforms other systems as well as human translations.
 This system improves upon our WMT'18 submission by 4.5 BLEU points.*
 This model was contributed by [stas](https://huggingface.co/stas). The original code can be found
 [here](https://github.com/pytorch/fairseq/tree/master/examples/wmt19).
 ## Implementation Notes
 - FSMT uses source and target vocabulary pairs that aren't combined into one. It doesn't share embedding tokens
  either. Its tokenizer is very similar to [`XLMTokenizer`] and the main model is derived from
  [`BartModel`].
 ## FSMTConfig
 [[autodoc]] FSMTConfig
 ## FSMTTokenizer
 [[autodoc]] FSMTTokenizer
    - build_inputs_with_special_tokens
    - get_special_tokens_mask
    - create_token_type_ids_from_sequences
    - save_vocabulary
 ## FSMTModel
 [[autodoc]] FSMTModel
    - forward
 ## FSMTForConditionalGeneration
 [[autodoc]] FSMTForConditionalGeneration
    - forward
--- a/docs/source/model_doc/fsmt.rst
+++ b/docs/source/model_doc/fsmt.rst
@@ -1,74 +0,0 @@
 .. 
    Copyright 2020 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 FSMT
 -----------------------------------------------------------------------------------------------------------------------
 **DISCLAIMER:** If you see something strange, file a `Github Issue
 <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ and assign
@stas00.
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 FSMT (FairSeq MachineTranslation) models were introduced in `Facebook FAIR's WMT19 News Translation Task Submission
 <https://arxiv.org/abs/1907.06616>`__ by Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, Sergey Edunov.
 The abstract of the paper is the following:
 *This paper describes Facebook FAIR's submission to the WMT19 shared news translation task. We participate in two
 language pairs and four language directions, English <-> German and English <-> Russian. Following our submission from
 last year, our baseline systems are large BPE-based transformer models trained with the Fairseq sequence modeling
 toolkit which rely on sampled back-translations. This year we experiment with different bitext data filtering schemes,
 as well as with adding filtered back-translated data. We also ensemble and fine-tune our models on domain-specific
 data, then decode using noisy channel model reranking. Our submissions are ranked first in all four directions of the
 human evaluation campaign. On En->De, our system significantly outperforms other systems as well as human translations.
 This system improves upon our WMT'18 submission by 4.5 BLEU points.*
 This model was contributed by `stas <https://huggingface.co/stas>`__. The original code can be found `here
 <https://github.com/pytorch/fairseq/tree/master/examples/wmt19>`__.
 Implementation Notes
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 - FSMT uses source and target vocabulary pairs that aren't combined into one. It doesn't share embeddings tokens
  either. Its tokenizer is very similar to :class:`~transformers.XLMTokenizer` and the main model is derived from
  :class:`~transformers.BartModel`.
 FSMTConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FSMTConfig
    :members:
 FSMTTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FSMTTokenizer
    :members: build_inputs_with_special_tokens, get_special_tokens_mask,
        create_token_type_ids_from_sequences, save_vocabulary
 FSMTModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FSMTModel
    :members: forward
 FSMTForConditionalGeneration
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FSMTForConditionalGeneration
    :members: forward
--- a/docs/source/model_doc/funnel.mdx
+++ b/docs/source/model_doc/funnel.mdx
@@ -0,0 +1,153 @@
 <!--Copyright 2020 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # Funnel Transformer
 ## Overview
 The Funnel Transformer model was proposed in the paper [Funnel-Transformer: Filtering out Sequential Redundancy for
 Efficient Language Processing](https://arxiv.org/abs/2006.03236). It is a bidirectional transformer model, like
 BERT, but with a pooling operation after each block of layers, a bit like in traditional convolutional neural networks
 (CNN) in computer vision.
 The abstract from the paper is the following:
 *With the success of language pretraining, it is highly desirable to develop more efficient architectures of good
 scalability that can exploit the abundant unlabeled data at a lower cost. To improve the efficiency, we examine the
 much-overlooked redundancy in maintaining a full-length token-level presentation, especially for tasks that only
 require a single-vector presentation of the sequence. With this intuition, we propose Funnel-Transformer which
 gradually compresses the sequence of hidden states to a shorter one and hence reduces the computation cost. More
 importantly, by re-investing the saved FLOPs from length reduction in constructing a deeper or wider model, we further
 improve the model capacity. In addition, to perform token-level predictions as required by common pretraining
 objectives, Funnel-Transformer is able to recover a deep representation for each token from the reduced hidden sequence
 via a decoder. Empirically, with comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on
 a wide variety of sequence-level prediction tasks, including text classification, language understanding, and reading
 comprehension.*
 Tips:
 - Since Funnel Transformer uses pooling, the sequence length of the hidden states changes after each block of layers.
  The base model therefore has a final sequence length that is a quarter of the original one. This model can be used
  directly for tasks that just require a sentence summary (like sequence classification or multiple choice). For other
  tasks, the full model is used; this full model has a decoder that upsamples the final hidden states to the same
  sequence length as the input.
 - The Funnel Transformer checkpoints are all available in a full version and a base version. The full version should be
  used for [`FunnelModel`], [`FunnelForPreTraining`],
  [`FunnelForMaskedLM`], [`FunnelForTokenClassification`] and
  [`FunnelForQuestionAnswering`]. The base version should be used for
  [`FunnelBaseModel`], [`FunnelForSequenceClassification`] and
  [`FunnelForMultipleChoice`].
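
The length reduction described in the first tip can be sketched in plain Python. This is only a toy illustration, assuming mean pooling with stride 2 applied between three blocks; the helper names `mean_pool` and `funnel_lengths` are hypothetical, and the actual model differs in its details.

```python
def mean_pool(hidden_states, stride=2):
    """Average consecutive groups of `stride` vectors (the last group may be shorter)."""
    pooled = []
    for start in range(0, len(hidden_states), stride):
        group = hidden_states[start:start + stride]
        dim = len(group[0])
        pooled.append([sum(vec[i] for vec in group) / len(group) for i in range(dim)])
    return pooled


def funnel_lengths(seq_len, num_blocks=3, stride=2):
    """Sequence length seen by each block when pooling happens between blocks."""
    lengths = [seq_len]
    for _ in range(num_blocks - 1):
        lengths.append(-(-lengths[-1] // stride))  # ceiling division
    return lengths


# 8 tokens with a hidden size of 2 (toy values, not real model activations)
hidden = [[float(t), float(t) * 2.0] for t in range(8)]
after_block_1 = mean_pool(hidden)         # pooled before the second block
after_block_2 = mean_pool(after_block_1)  # pooled before the third block

print(len(hidden), len(after_block_1), len(after_block_2))
print(funnel_lengths(128))
```

With two pooling steps of stride 2, the final sequence is a quarter of the input length, which is why the base model suits tasks that only need a summary of the sequence.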
 This model was contributed by [sgugger](https://huggingface.co/sgugger). The original code can be found [here](https://github.com/laiguokun/Funnel-Transformer).
 ## FunnelConfig
 [[autodoc]] FunnelConfig
 ## FunnelTokenizer
 [[autodoc]] FunnelTokenizer
    - build_inputs_with_special_tokens
    - get_special_tokens_mask
    - create_token_type_ids_from_sequences
    - save_vocabulary
 ## FunnelTokenizerFast
 [[autodoc]] FunnelTokenizerFast
 ## Funnel specific outputs
 [[autodoc]] models.funnel.modeling_funnel.FunnelForPreTrainingOutput
 [[autodoc]] models.funnel.modeling_tf_funnel.TFFunnelForPreTrainingOutput
 ## FunnelBaseModel
 [[autodoc]] FunnelBaseModel
    - forward
 ## FunnelModel
 [[autodoc]] FunnelModel
    - forward
 ## FunnelForPreTraining
 [[autodoc]] FunnelForPreTraining
    - forward
 ## FunnelForMaskedLM
 [[autodoc]] FunnelForMaskedLM
    - forward
 ## FunnelForSequenceClassification
 [[autodoc]] FunnelForSequenceClassification
    - forward
 ## FunnelForMultipleChoice
 [[autodoc]] FunnelForMultipleChoice
    - forward
 ## FunnelForTokenClassification
 [[autodoc]] FunnelForTokenClassification
    - forward
 ## FunnelForQuestionAnswering
 [[autodoc]] FunnelForQuestionAnswering
    - forward
 ## TFFunnelBaseModel
 [[autodoc]] TFFunnelBaseModel
    - call
 ## TFFunnelModel
 [[autodoc]] TFFunnelModel
    - call
 ## TFFunnelForPreTraining
 [[autodoc]] TFFunnelForPreTraining
    - call
 ## TFFunnelForMaskedLM
 [[autodoc]] TFFunnelForMaskedLM
    - call
 ## TFFunnelForSequenceClassification
 [[autodoc]] TFFunnelForSequenceClassification
    - call
 ## TFFunnelForMultipleChoice
 [[autodoc]] TFFunnelForMultipleChoice
    - call
 ## TFFunnelForTokenClassification
 [[autodoc]] TFFunnelForTokenClassification
    - call
 ## TFFunnelForQuestionAnswering
 [[autodoc]] TFFunnelForQuestionAnswering
    - call
--- a/docs/source/model_doc/funnel.rst
+++ b/docs/source/model_doc/funnel.rst
@@ -1,197 +0,0 @@
 .. 
    Copyright 2020 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 Funnel Transformer
 -----------------------------------------------------------------------------------------------------------------------
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The Funnel Transformer model was proposed in the paper `Funnel-Transformer: Filtering out Sequential Redundancy for
 Efficient Language Processing <https://arxiv.org/abs/2006.03236>`__. It is a bidirectional transformer model, like
 BERT, but with a pooling operation after each block of layers, a bit like in traditional convolutional neural networks
 (CNN) in computer vision.
 The abstract from the paper is the following:
 *With the success of language pretraining, it is highly desirable to develop more efficient architectures of good
 scalability that can exploit the abundant unlabeled data at a lower cost. To improve the efficiency, we examine the
 much-overlooked redundancy in maintaining a full-length token-level presentation, especially for tasks that only
 require a single-vector presentation of the sequence. With this intuition, we propose Funnel-Transformer which
 gradually compresses the sequence of hidden states to a shorter one and hence reduces the computation cost. More
 importantly, by re-investing the saved FLOPs from length reduction in constructing a deeper or wider model, we further
 improve the model capacity. In addition, to perform token-level predictions as required by common pretraining
 objectives, Funnel-Transformer is able to recover a deep representation for each token from the reduced hidden sequence
 via a decoder. Empirically, with comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on
 a wide variety of sequence-level prediction tasks, including text classification, language understanding, and reading
 comprehension.*
 Tips:
 - Since Funnel Transformer uses pooling, the sequence length of the hidden states changes after each block of layers.
  The base model therefore has a final sequence length that is a quarter of the original one. This model can be used
  directly for tasks that just require a sentence summary (like sequence classification or multiple choice). For other
  tasks, the full model is used; this full model has a decoder that upsamples the final hidden states to the same
  sequence length as the input.
 - The Funnel Transformer checkpoints are all available with a full version and a base version. The first ones should be
  used for :class:`~transformers.FunnelModel`, :class:`~transformers.FunnelForPreTraining`,
  :class:`~transformers.FunnelForMaskedLM`, :class:`~transformers.FunnelForTokenClassification` and
  :class:`~transformers.FunnelForQuestionAnswering`. The second ones should be used for
  :class:`~transformers.FunnelBaseModel`, :class:`~transformers.FunnelForSequenceClassification` and
  :class:`~transformers.FunnelForMultipleChoice`.
 This model was contributed by `sgugger <https://huggingface.co/sgugger>`__. The original code can be found `here
 <https://github.com/laiguokun/Funnel-Transformer>`__.
 FunnelConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FunnelConfig
    :members:
 FunnelTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FunnelTokenizer
    :members: build_inputs_with_special_tokens, get_special_tokens_mask,
        create_token_type_ids_from_sequences, save_vocabulary
 FunnelTokenizerFast
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FunnelTokenizerFast
    :members:
 Funnel specific outputs
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.models.funnel.modeling_funnel.FunnelForPreTrainingOutput
    :members:
 .. autoclass:: transformers.models.funnel.modeling_tf_funnel.TFFunnelForPreTrainingOutput
    :members:
 FunnelBaseModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FunnelBaseModel
    :members: forward
 FunnelModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FunnelModel
    :members: forward
 FunnelModelForPreTraining
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FunnelForPreTraining
    :members: forward
 FunnelForMaskedLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FunnelForMaskedLM
    :members: forward
 FunnelForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FunnelForSequenceClassification
    :members: forward
 FunnelForMultipleChoice
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FunnelForMultipleChoice
    :members: forward
 FunnelForTokenClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FunnelForTokenClassification
    :members: forward
 FunnelForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FunnelForQuestionAnswering
    :members: forward
 TFFunnelBaseModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFFunnelBaseModel
    :members: call
 TFFunnelModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFFunnelModel
    :members: call
 TFFunnelModelForPreTraining
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFFunnelForPreTraining
    :members: call
 TFFunnelForMaskedLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFFunnelForMaskedLM
    :members: call
 TFFunnelForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFFunnelForSequenceClassification
    :members: call
 TFFunnelForMultipleChoice
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFFunnelForMultipleChoice
    :members: call
 TFFunnelForTokenClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFFunnelForTokenClassification
    :members: call
 TFFunnelForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFFunnelForQuestionAnswering
    :members: call
--- a/docs/source/model_doc/gpt.mdx
+++ b/docs/source/model_doc/gpt.mdx
@@ -0,0 +1,117 @@
 <!--Copyright 2020 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # OpenAI GPT
 ## Overview
 The OpenAI GPT model was proposed in [Improving Language Understanding by Generative Pre-Training](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)
 by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. It's a causal (unidirectional) transformer
 pre-trained using language modeling on a large corpus with long range dependencies, the Toronto Book Corpus.
 The abstract from the paper is the following:
 *Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering,
 semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant,
 labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to
 perform adequately. We demonstrate that large gains on these tasks can be realized by generative pretraining of a
 language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In
 contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to achieve
 effective transfer while requiring minimal changes to the model architecture. We demonstrate the effectiveness of our
 approach on a wide range of benchmarks for natural language understanding. Our general task-agnostic model outperforms
 discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon
 the state of the art in 9 out of the 12 tasks studied.*
 Tips:
 - GPT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
  the left.
 - GPT was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next
  token in a sequence. Leveraging this feature allows GPT to generate syntactically coherent text, as can be
  observed in the `run_generation.py` example script.
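
The padding advice in the first tip can be sketched without any library. This is a minimal illustration, assuming a hypothetical `pad_right` helper and a padding id of `0`; in practice the tokenizer handles padding. With right padding, real tokens keep positions `0..n-1`, which matches the absolute positions the model saw during pretraining.

```python
def pad_right(batch, pad_id=0):
    """Pad every sequence on the right to the longest length in the batch."""
    max_len = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that should be ignored
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Toy token ids, not produced by a real tokenizer
batch = [[11, 12, 13], [21, 22, 23, 24, 25]]
input_ids, attention_mask = pad_right(batch)

# Absolute positions: real tokens always start at position 0
position_ids = [list(range(len(row))) for row in input_ids]

print(input_ids)
print(attention_mask)
```

Padding on the left would instead shift the real tokens to later positions, which the absolute position embeddings were not trained to expect at the start of a sequence.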
 [Write With Transformer](https://transformer.huggingface.co/doc/gpt) is a webapp created and hosted by Hugging Face
 showcasing the generative capabilities of several models. GPT is one of them.
 This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://github.com/openai/finetune-transformer-lm).
 Note:
 If you want to reproduce the original tokenization process of the *OpenAI GPT* paper, you will need to install `ftfy`
 and `SpaCy`:
 ```bash
 pip install spacy ftfy==4.4.3
 python -m spacy download en
 ```
 If you don't install `ftfy` and `SpaCy`, the [`OpenAIGPTTokenizer`] will default to tokenizing
 using BERT's `BasicTokenizer` followed by Byte-Pair Encoding (which should be fine for most use cases).
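
To see in advance which path applies in a given environment, one can check whether the optional packages are importable. The sketch below uses only the standard library; `missing_packages` is a hypothetical helper, not part of the library, and it does not reproduce how the tokenizer itself performs the check.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of top-level package names that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(["ftfy", "spacy"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("ftfy and spacy are both available")
```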
 ## OpenAIGPTConfig
 [[autodoc]] OpenAIGPTConfig
 ## OpenAIGPTTokenizer
 [[autodoc]] OpenAIGPTTokenizer
    - save_vocabulary
 ## OpenAIGPTTokenizerFast
 [[autodoc]] OpenAIGPTTokenizerFast
 ## OpenAI specific outputs
 [[autodoc]] models.openai.modeling_openai.OpenAIGPTDoubleHeadsModelOutput
 [[autodoc]] models.openai.modeling_tf_openai.TFOpenAIGPTDoubleHeadsModelOutput
 ## OpenAIGPTModel
 [[autodoc]] OpenAIGPTModel
    - forward
 ## OpenAIGPTLMHeadModel
 [[autodoc]] OpenAIGPTLMHeadModel
    - forward
 ## OpenAIGPTDoubleHeadsModel
 [[autodoc]] OpenAIGPTDoubleHeadsModel
    - forward
 ## OpenAIGPTForSequenceClassification
 [[autodoc]] OpenAIGPTForSequenceClassification
    - forward
 ## TFOpenAIGPTModel
 [[autodoc]] TFOpenAIGPTModel
    - call
 ## TFOpenAIGPTLMHeadModel
 [[autodoc]] TFOpenAIGPTLMHeadModel
    - call
 ## TFOpenAIGPTDoubleHeadsModel
 [[autodoc]] TFOpenAIGPTDoubleHeadsModel
    - call
 ## TFOpenAIGPTForSequenceClassification
 [[autodoc]] TFOpenAIGPTForSequenceClassification
    - call
--- a/docs/source/model_doc/gpt.rst
+++ b/docs/source/model_doc/gpt.rst
@@ -1,147 +0,0 @@
 .. 
    Copyright 2020 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 OpenAI GPT
 -----------------------------------------------------------------------------------------------------------------------
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 OpenAI GPT model was proposed in `Improving Language Understanding by Generative Pre-Training
 <https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf>`__
 by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. It's a causal (unidirectional) transformer
 pre-trained using language modeling on a large corpus with long range dependencies, the Toronto Book Corpus.
 The abstract from the paper is the following:
 *Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering,
 semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant,
 labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to
 perform adequately. We demonstrate that large gains on these tasks can be realized by generative pretraining of a
 language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In
 contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to achieve
 effective transfer while requiring minimal changes to the model architecture. We demonstrate the effectiveness of our
 approach on a wide range of benchmarks for natural language understanding. Our general task-agnostic model outperforms
 discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon
 the state of the art in 9 out of the 12 tasks studied.*
 Tips:
 - GPT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
  the left.
 - GPT was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next
  token in a sequence. Leveraging this feature allows GPT to generate syntactically coherent text as it can be
  observed in the `run_generation.py` example script.
 `Write With Transformer <https://transformer.huggingface.co/doc/gpt>`__ is a webapp created and hosted by Hugging Face
 showcasing the generative capabilities of several models. GPT is one of them.
 This model was contributed by `thomwolf <https://huggingface.co/thomwolf>`__. The original code can be found `here
 <https://github.com/openai/finetune-transformer-lm>`__.
 Note:
 If you want to reproduce the original tokenization process of the `OpenAI GPT` paper, you will need to install ``ftfy``
 and ``SpaCy``:
 .. code-block:: bash
    pip install spacy ftfy==4.4.3
    python -m spacy download en
 If you don't install ``ftfy`` and ``SpaCy``, the :class:`~transformers.OpenAIGPTTokenizer` will default to tokenize
 using BERT's :obj:`BasicTokenizer` followed by Byte-Pair Encoding (which should be fine for most usage, don't worry).
 OpenAIGPTConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.OpenAIGPTConfig
    :members:
 OpenAIGPTTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.OpenAIGPTTokenizer
    :members: save_vocabulary
 OpenAIGPTTokenizerFast
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.OpenAIGPTTokenizerFast
    :members:
 OpenAI specific outputs
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.models.openai.modeling_openai.OpenAIGPTDoubleHeadsModelOutput
    :members:
 .. autoclass:: transformers.models.openai.modeling_tf_openai.TFOpenAIGPTDoubleHeadsModelOutput
    :members:
 OpenAIGPTModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.OpenAIGPTModel
    :members: forward
 OpenAIGPTLMHeadModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.OpenAIGPTLMHeadModel
    :members: forward
 OpenAIGPTDoubleHeadsModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.OpenAIGPTDoubleHeadsModel
    :members: forward
 OpenAIGPTForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.OpenAIGPTForSequenceClassification
    :members: forward
 TFOpenAIGPTModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFOpenAIGPTModel
    :members: call
 TFOpenAIGPTLMHeadModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFOpenAIGPTLMHeadModel
    :members: call
 TFOpenAIGPTDoubleHeadsModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFOpenAIGPTDoubleHeadsModel
    :members: call
 TFOpenAIGPTForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFOpenAIGPTForSequenceClassification
    :members: call
--- a/docs/source/model_doc/gpt2.mdx
+++ b/docs/source/model_doc/gpt2.mdx
@@ -0,0 +1,131 @@
 <!--Copyright 2020 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # OpenAI GPT2
 ## Overview
 The OpenAI GPT-2 model was proposed in [Language Models are Unsupervised Multitask Learners](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) by Alec
 Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever. It's a causal (unidirectional)
 transformer pretrained using language modeling on a very large corpus of ~40 GB of text data.
 The abstract from the paper is the following:
 *GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset[1] of 8 million
 web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some
 text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks
 across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than
 10X the amount of data.*
 Tips:
 - GPT-2 is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
  the left.
 - GPT-2 was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next
  token in a sequence. Leveraging this feature allows GPT-2 to generate syntactically coherent text, as can be
  observed in the *run_generation.py* example script.
 - The model can take *past_key_values* (for PyTorch) or *past* (for TF) as input, which are the previously computed
  key/value attention pairs. Passing this value prevents the model from re-computing values it has already computed
  during text generation. For PyTorch, see the *past_key_values* argument of the
  [`GPT2Model.forward`] method, or for TF the *past* argument of the
  [`TFGPT2Model.call`] method for more information on its usage.
 - Enabling the *scale_attn_by_inverse_layer_idx* and *reorder_and_upcast_attn* flags will apply the training stability
  improvements from [Mistral](https://github.com/stanford-crfm/mistral/) (for PyTorch only).
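The caching tip above can be sketched with a tiny, randomly initialized model so that nothing has to be downloaded. The configuration values below are arbitrary assumptions chosen for speed, not a released checkpoint. The sketch only illustrates that feeding the last token together with the cached `past_key_values` should give the same next-token logits as re-running the full sequence.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

torch.manual_seed(0)

# Hypothetical toy configuration (not a real checkpoint); small so it runs quickly on CPU.
config = GPT2Config(vocab_size=100, n_positions=32, n_embd=32, n_layer=2, n_head=2)
model = GPT2LMHeadModel(config).eval()  # eval() disables dropout

input_ids = torch.randint(0, config.vocab_size, (1, 6))

with torch.no_grad():
    # Reference: run the whole sequence at once and keep the logits of the last position.
    full_logits = model(input_ids).logits[:, -1]

    # Cached: run the prefix once, then only the final token plus the cache.
    prefix_out = model(input_ids[:, :-1], use_cache=True)
    step_out = model(input_ids[:, -1:], past_key_values=prefix_out.past_key_values)
    cached_logits = step_out.logits[:, -1]

# Both paths should agree up to floating-point noise.
same = torch.allclose(full_logits, cached_logits, atol=1e-4)
print(same)
```

In practice `generate()` handles this cache automatically; passing it by hand is mainly useful when writing a custom decoding loop.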
 [Write With Transformer](https://transformer.huggingface.co/doc/gpt2-large) is a webapp created and hosted by
 Hugging Face showcasing the generative capabilities of several models. GPT-2 is one of them and is available in five
 different sizes: small, medium, large, xl and a distilled version of the small checkpoint: *distilgpt-2*.
 This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://openai.com/blog/better-language-models/).
 ## GPT2Config
 [[autodoc]] GPT2Config
 ## GPT2Tokenizer
 [[autodoc]] GPT2Tokenizer
    - save_vocabulary
 ## GPT2TokenizerFast
 [[autodoc]] GPT2TokenizerFast
 ## GPT2 specific outputs
 [[autodoc]] models.gpt2.modeling_gpt2.GPT2DoubleHeadsModelOutput
 [[autodoc]] models.gpt2.modeling_tf_gpt2.TFGPT2DoubleHeadsModelOutput
 ## GPT2Model
 [[autodoc]] GPT2Model
    - forward
    - parallelize
    - deparallelize
 ## GPT2LMHeadModel
 [[autodoc]] GPT2LMHeadModel
    - forward
    - parallelize
    - deparallelize
 ## GPT2DoubleHeadsModel
 [[autodoc]] GPT2DoubleHeadsModel
    - forward
 ## GPT2ForSequenceClassification
 [[autodoc]] GPT2ForSequenceClassification
    - forward
 ## GPT2ForTokenClassification
 [[autodoc]] GPT2ForTokenClassification
    - forward
 ## TFGPT2Model
 [[autodoc]] TFGPT2Model
    - call
 ## TFGPT2LMHeadModel
 [[autodoc]] TFGPT2LMHeadModel
    - call
 ## TFGPT2DoubleHeadsModel
 [[autodoc]] TFGPT2DoubleHeadsModel
    - call
 ## TFGPT2ForSequenceClassification
 [[autodoc]] TFGPT2ForSequenceClassification
    - call
 ## TFSequenceClassifierOutputWithPast
 [[autodoc]] modeling_tf_outputs.TFSequenceClassifierOutputWithPast
 ## FlaxGPT2Model
 [[autodoc]] FlaxGPT2Model
    - __call__
 ## FlaxGPT2LMHeadModel
 [[autodoc]] FlaxGPT2LMHeadModel
    - __call__
--- a/docs/source/model_doc/gpt2.rst
+++ b/docs/source/model_doc/gpt2.rst
@@ -1,165 +0,0 @@
 .. 
    Copyright 2020 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 OpenAI GPT2
 -----------------------------------------------------------------------------------------------------------------------
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 OpenAI GPT-2 model was proposed in `Language Models are Unsupervised Multitask Learners
 <https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`_ by Alec
 Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever. It's a causal (unidirectional)
 transformer pretrained using language modeling on a very large corpus of ~40 GB of text data.
 The abstract from the paper is the following:
 *GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset[1] of 8 million
 web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some
 text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks
 across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than
 10X the amount of data.*
 Tips:
 - GPT-2 is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
  the left.
 - GPT-2 was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next
  token in a sequence. Leveraging this feature allows GPT-2 to generate syntactically coherent text as it can be
  observed in the `run_generation.py` example script.
 - The model can take the `past_key_values` (for PyTorch) or `past` (for TF) as input, which is the previously computed
  key/value attention pairs. Using this (`past_key_values` or `past`) value prevents the model from re-computing
  pre-computed values in the context of text generation. For PyTorch, see `past_key_values` argument of the
  :meth:`~transformers.GPT2Model.forward` method, or for TF the `past` argument of the
  :meth:`~transformers.TFGPT2Model.call` method for more information on its usage.
 - Enabling the `scale_attn_by_inverse_layer_idx` and `reorder_and_upcast_attn` flags will apply the training stability
  improvements from `Mistral <https://github.com/stanford-crfm/mistral/>`__ (for PyTorch only).
 `Write With Transformer <https://transformer.huggingface.co/doc/gpt2-large>`__ is a webapp created and hosted by
 Hugging Face showcasing the generative capabilities of several models. GPT-2 is one of them and is available in five
 different sizes: small, medium, large, xl and a distilled version of the small checkpoint: `distilgpt-2`.
 This model was contributed by `thomwolf <https://huggingface.co/thomwolf>`__. The original code can be found `here
 <https://openai.com/blog/better-language-models/>`__.
 GPT2Config
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.GPT2Config
    :members:
 GPT2Tokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.GPT2Tokenizer
    :members: save_vocabulary
 GPT2TokenizerFast
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.GPT2TokenizerFast
    :members:
 GPT2 specific outputs
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.models.gpt2.modeling_gpt2.GPT2DoubleHeadsModelOutput
    :members:
 .. autoclass:: transformers.models.gpt2.modeling_tf_gpt2.TFGPT2DoubleHeadsModelOutput
    :members:
 GPT2Model
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.GPT2Model
    :members: forward, parallelize, deparallelize
 GPT2LMHeadModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.GPT2LMHeadModel
    :members: forward, parallelize, deparallelize
 GPT2DoubleHeadsModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.GPT2DoubleHeadsModel
    :members: forward
 GPT2ForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.GPT2ForSequenceClassification
    :members: forward
 GPT2ForTokenClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.GPT2ForTokenClassification
    :members: forward
 TFGPT2Model
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFGPT2Model
    :members: call
 TFGPT2LMHeadModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFGPT2LMHeadModel
    :members: call
 TFGPT2DoubleHeadsModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFGPT2DoubleHeadsModel
    :members: call
 TFGPT2ForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFGPT2ForSequenceClassification
    :members: call
 TFSequenceClassifierOutputWithPast
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.modeling_tf_outputs.TFSequenceClassifierOutputWithPast
    :members:
 FlaxGPT2Model
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxGPT2Model
    :members: __call__
 FlaxGPT2LMHeadModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxGPT2LMHeadModel
    :members: __call__
--- a/docs/source/model_doc/gpt_neo.mdx
+++ b/docs/source/model_doc/gpt_neo.mdx
@@ -0,0 +1,72 @@
 <!--Copyright 2021 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # GPT Neo
 ## Overview
 The GPTNeo model was released in the [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo) repository by Sid
 Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy. It is a GPT2-like causal language model trained on the
 [Pile](https://pile.eleuther.ai/) dataset.
 The architecture is similar to GPT2 except that GPT Neo uses local attention in every other layer with a window size of
 256 tokens.
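As a rough illustration of the local attention pattern described above, the following sketch builds a causal mask restricted to a sliding window. It assumes the window counts the current token and that local layers stay causal. The helper `local_causal_mask` is hypothetical and written in plain Python for readability; it is not the library's implementation.

```python
def local_causal_mask(seq_len, window_size):
    """Return mask[i][j] == True when query position i may attend to key position j."""
    return [
        [0 <= i - j < window_size for j in range(seq_len)]
        for i in range(seq_len)
    ]


# A tiny window keeps the pattern readable; the text above states 256 tokens for GPT Neo.
mask = local_causal_mask(seq_len=6, window_size=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each row marks the positions one query token can see: at most `window_size` tokens, ending at the token itself. A global layer would instead allow every earlier position.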
 This model was contributed by [valhalla](https://huggingface.co/valhalla).
 ### Generation
 The `generate()` method can be used to generate text using the GPT Neo model.
 ```python
 >>> from transformers import GPTNeoForCausalLM, GPT2Tokenizer
 >>> model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")
 >>> tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
 >>> prompt = "In a shocking finding, scientists discovered a herd of unicorns living in a remote, " \
 ...          "previously unexplored valley, in the Andes Mountains. Even more surprising to the " \
 ...          "researchers was the fact that the unicorns spoke perfect English."
 >>> input_ids = tokenizer(prompt, return_tensors="pt").input_ids
 >>> gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9, max_length=100,)
 >>> gen_text = tokenizer.batch_decode(gen_tokens)[0]
 ```
 ## GPTNeoConfig
 [[autodoc]] GPTNeoConfig
 ## GPTNeoModel
 [[autodoc]] GPTNeoModel
    - forward
 ## GPTNeoForCausalLM
 [[autodoc]] GPTNeoForCausalLM
    - forward
 ## GPTNeoForSequenceClassification
 [[autodoc]] GPTNeoForSequenceClassification
    - forward
 ## FlaxGPTNeoModel
 [[autodoc]] FlaxGPTNeoModel
    - __call__
 ## FlaxGPTNeoForCausalLM
 [[autodoc]] FlaxGPTNeoForCausalLM
    - __call__
--- a/docs/source/model_doc/gpt_neo.rst
+++ b/docs/source/model_doc/gpt_neo.rst
@@ -1,86 +0,0 @@
 .. 
    Copyright 2021 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 GPT Neo
 -----------------------------------------------------------------------------------------------------------------------
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The GPTNeo model was released in the `EleutherAI/gpt-neo <https://github.com/EleutherAI/gpt-neo>`__ repository by Sid
 Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy. It is a GPT2 like causal language model trained on the
 `Pile <https://pile.eleuther.ai/>`__ dataset.
 The architecture is similar to GPT2 except that GPT Neo uses local attention in every other layer with a window size of
 256 tokens.
 This model was contributed by `valhalla <https://huggingface.co/valhalla>`__.
 Generation
 _______________________________________________________________________________________________________________________
 The :obj:`generate()` method can be used to generate text using GPT Neo model.
 .. code-block::
    >>> from transformers import GPTNeoForCausalLM, GPT2Tokenizer
    >>> model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")
    >>> tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
    >>> prompt = "In a shocking finding, scientists discovered a herd of unicorns living in a remote, " \
    ...          "previously unexplored valley, in the Andes Mountains. Even more surprising to the " \
    ...          "researchers was the fact that the unicorns spoke perfect English."
    >>> input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    >>> gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9, max_length=100,)
    >>> gen_text = tokenizer.batch_decode(gen_tokens)[0]
 GPTNeoConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.GPTNeoConfig
    :members:
 GPTNeoModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.GPTNeoModel
    :members: forward
 GPTNeoForCausalLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.GPTNeoForCausalLM
    :members: forward
 GPTNeoForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.GPTNeoForSequenceClassification
    :members: forward
 FlaxGPTNeoModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxGPTNeoModel
    :members: __call__
 FlaxGPTNeoForCausalLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxGPTNeoForCausalLM
    :members: __call__
--- a/docs/source/model_doc/gptj.mdx
+++ b/docs/source/model_doc/gptj.mdx
@@ -0,0 +1,124 @@
 <!--Copyright 2021 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # GPT-J
 ## Overview
 The GPT-J model was released in the [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax) repository by Ben Wang and Aran Komatsuzaki. It is a GPT-2-like
 causal language model trained on [the Pile](https://pile.eleuther.ai/) dataset.
 This model was contributed by [Stella Biderman](https://huggingface.co/stellaathena).
 Tips:
 - To load [GPT-J](https://huggingface.co/EleutherAI/gpt-j-6B) in float32 one would need at least 2x the model size in
  CPU RAM: 1x for the initial weights and another 1x to load the checkpoint. So for GPT-J it would take at least 48GB
  of CPU RAM just to load the model. There are a few options to reduce the CPU RAM usage. The `torch_dtype` argument
  can be used to initialize the model in half-precision, and the `low_cpu_mem_usage` argument can be used to keep the
  RAM usage to 1x. There is also a [fp16 branch](https://huggingface.co/EleutherAI/gpt-j-6B/tree/float16) which stores
  the fp16 weights, which could be used to further minimize the RAM usage. Combining all this, it should take roughly
  12.1GB of CPU RAM to load the model.
 ```python
 >>> from transformers import GPTJForCausalLM
 >>> import torch
 >>> model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", revision="float16", torch_dtype=torch.float16, low_cpu_mem_usage=True)
 ```
 - The model should fit on a 16GB GPU for inference. Training/fine-tuning takes much more GPU RAM. The Adam
  optimizer, for example, keeps four copies of the model: the model, the gradients, and the average and squared
  average of the gradients. So it would need at least 4x model size GPU memory, even with mixed precision, as gradient
  updates are in fp32. This does not include the activations and data batches, which require additional GPU RAM. One
  should therefore explore solutions such as DeepSpeed to train/fine-tune the model. Another option is to use the
  original codebase to train/fine-tune the model on TPU and then convert the model to Transformers format for
  inference. Instructions for that can be found [here](https://github.com/kingoflolz/mesh-transformer-jax/blob/master/howto_finetune.md).
 - Although the embedding matrix has a size of 50400, only 50257 entries are used by the GPT-2 tokenizer. The extra
  entries are added for the sake of efficiency on TPUs. To avoid a mismatch between the embedding matrix size and the
  vocab size, the tokenizer for [GPT-J](https://huggingface.co/EleutherAI/gpt-j-6B) contains 143 extra tokens
  `<|extratoken_1|>... <|extratoken_143|>`, so the `vocab_size` of the tokenizer also becomes 50400.
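The memory figures in the tips above follow from simple arithmetic. The sketch below reproduces them under the assumption that GPT-J has roughly 6.05 billion parameters (an approximate count, not stated in the text) and that a gigabyte is 10^9 bytes. It also checks the vocabulary padding.

```python
# Assumed approximate parameter count for GPT-J 6B (not stated in the text above).
NUM_PARAMS = 6.05e9
GB = 1e9  # decimal gigabytes


def weights_gb(bytes_per_param):
    """Size of one copy of the weights, in GB, for the given precision."""
    return NUM_PARAMS * bytes_per_param / GB


fp32_1x = weights_gb(4)        # one float32 copy of the weights
fp32_load = 2 * fp32_1x        # default loading: initial weights + checkpoint
fp16_1x = weights_gb(2)        # float16 weights with low_cpu_mem_usage=True
adam_training = 4 * fp32_1x    # weights, gradients, and two Adam moment estimates

print(f"float32 load (2x): {fp32_load:.1f} GB")
print(f"float16 load (1x): {fp16_1x:.1f} GB")
print(f"Adam training (4x, fp32): {adam_training:.1f} GB")

# Vocabulary padding: GPT-2 tokens plus the extra tokens fill the embedding matrix.
gpt2_vocab, extra_tokens = 50257, 143
embedding_rows = gpt2_vocab + extra_tokens
print(embedding_rows)
```

Under these assumptions the float32 load comes out slightly above 48GB and the float16 load at about 12.1GB, in line with the figures quoted above. The training estimate excludes activations and data batches.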
 ### Generation
 The [`~generation_utils.GenerationMixin.generate`] method can be used to generate text using the GPT-J
 model.
 ```python
 >>> from transformers import AutoModelForCausalLM, AutoTokenizer
 >>> model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
 >>> tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
 >>> prompt = "In a shocking finding, scientists discovered a herd of unicorns living in a remote, " \
 ...          "previously unexplored valley, in the Andes Mountains. Even more surprising to the " \
 ...          "researchers was the fact that the unicorns spoke perfect English."
 >>> input_ids = tokenizer(prompt, return_tensors="pt").input_ids
 >>> gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9, max_length=100,)
 >>> gen_text = tokenizer.batch_decode(gen_tokens)[0]
 ```
 ...or in float16 precision:
 ```python
 >>> from transformers import GPTJForCausalLM, AutoTokenizer
 >>> import torch
 >>> model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.float16)
 >>> tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
 >>> prompt = "In a shocking finding, scientists discovered a herd of unicorns living in a remote, " \
 ...          "previously unexplored valley, in the Andes Mountains. Even more surprising to the " \
 ...          "researchers was the fact that the unicorns spoke perfect English."
 >>> input_ids = tokenizer(prompt, return_tensors="pt").input_ids
 >>> gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9, max_length=100,)
 >>> gen_text = tokenizer.batch_decode(gen_tokens)[0]
 ```
 ## GPTJConfig
 [[autodoc]] GPTJConfig
    - all
 ## GPTJModel
 [[autodoc]] GPTJModel
    - forward
 ## GPTJForCausalLM
 [[autodoc]] GPTJForCausalLM
    - forward
 ## GPTJForSequenceClassification
 [[autodoc]] GPTJForSequenceClassification
    - forward
 ## GPTJForQuestionAnswering
 [[autodoc]] GPTJForQuestionAnswering
    - forward
 ## FlaxGPTJModel
 [[autodoc]] FlaxGPTJModel
    - __call__
 ## FlaxGPTJForCausalLM
 [[autodoc]] FlaxGPTJForCausalLM
    - __call__
--- a/docs/source/model_doc/gptj.rst
+++ b/docs/source/model_doc/gptj.rst
@@ -1,142 +0,0 @@
 .. 
    Copyright 2021 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 GPT-J
 -----------------------------------------------------------------------------------------------------------------------
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The GPT-J model was released in the `kingoflolz/mesh-transformer-jax
 <https://github.com/kingoflolz/mesh-transformer-jax>`__ repository by Ben Wang and Aran Komatsuzaki. It is a GPT-2-like
 causal language model trained on `the Pile <https://pile.eleuther.ai/>`__ dataset.
 This model was contributed by `Stella Biderman <https://huggingface.co/stellaathena>`__.
 Tips:
 - To load `GPT-J <https://huggingface.co/EleutherAI/gpt-j-6B>`__ in float32 one would need at least 2x model size CPU
  RAM: 1x for initial weights and another 1x to load the checkpoint. So for GPT-J it would take at least 48GB of CPU
  RAM to just load the model. To reduce the CPU RAM usage there are a few options. The ``torch_dtype`` argument can be
  used to initialize the model in half-precision. And the ``low_cpu_mem_usage`` argument can be used to keep the RAM
  usage to 1x. There is also a `fp16 branch <https://huggingface.co/EleutherAI/gpt-j-6B/tree/float16>`__ which stores
  the fp16 weights, which could be used to further minimize the RAM usage. Combining all this it should take roughly
  12.1GB of CPU RAM to load the model.
 .. code-block::
    >>> from transformers import GPTJForCausalLM
    >>> import torch
    >>> model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", revision="float16", torch_dtype=torch.float16, low_cpu_mem_usage=True)
 - The model should fit on 16GB GPU for inference. For training/fine-tuning it would take much more GPU RAM. Adam
  optimizer for example makes four copies of the model: model, gradients, average and squared average of the gradients.
  So it would need at least 4x model size GPU memory, even with mixed precision as gradient updates are in fp32. This
  is not including the activations and data batches, which would again require some more GPU RAM. So one should explore
  solutions such as DeepSpeed, to train/fine-tune the model. Another option is to use the original codebase to
  train/fine-tune the model on TPU and then convert the model to Transformers format for inference. Instructions for
  that could be found `here <https://github.com/kingoflolz/mesh-transformer-jax/blob/master/howto_finetune.md>`__
 - Although the embedding matrix has a size of 50400, only 50257 entries are used by the GPT-2 tokenizer. These extra
  tokens are added for the sake of efficiency on TPUs. To avoid the mis-match between embedding matrix size and vocab
  size, the tokenizer for `GPT-J <https://huggingface.co/EleutherAI/gpt-j-6B>`__ contains 143 extra tokens
  ``<|extratoken_1|>... <|extratoken_143|>``, so the ``vocab_size`` of tokenizer also becomes 50400.
 Generation
 _______________________________________________________________________________________________________________________
 The :meth:`~transformers.generation_utils.GenerationMixin.generate` method can be used to generate text using GPT-J
 model.
 .. code-block::
    >>> from transformers import AutoModelForCausalLM, AutoTokenizer
    >>> model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
    >>> tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
    >>> prompt = "In a shocking finding, scientists discovered a herd of unicorns living in a remote, " \
    ...          "previously unexplored valley, in the Andes Mountains. Even more surprising to the " \
    ...          "researchers was the fact that the unicorns spoke perfect English."
    >>> input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    >>> gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9, max_length=100,)
    >>> gen_text = tokenizer.batch_decode(gen_tokens)[0]
 ...or in float16 precision:
 .. code-block::
    >>> from transformers import GPTJForCausalLM, AutoTokenizer
    >>> import torch
    >>> model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.float16)
    >>> tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
    >>> prompt = "In a shocking finding, scientists discovered a herd of unicorns living in a remote, " \
    ...          "previously unexplored valley, in the Andes Mountains. Even more surprising to the " \
    ...          "researchers was the fact that the unicorns spoke perfect English."
    >>> input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    >>> gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9, max_length=100,)
    >>> gen_text = tokenizer.batch_decode(gen_tokens)[0]
 GPTJConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.GPTJConfig
    :members:
 GPTJModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.GPTJModel
    :members: forward
 GPTJForCausalLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.GPTJForCausalLM
    :members: forward
 GPTJForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.GPTJForSequenceClassification
    :members: forward
 GPTJForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.GPTJForQuestionAnswering
    :members: forward
 FlaxGPTJModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxGPTJModel
    :members: __call__
 FlaxGPTJForCausalLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxGPTJForCausalLM
    :members: __call__
--- a/docs/source/model_doc/herbert.mdx
+++ b/docs/source/model_doc/herbert.mdx
@@ -0,0 +1,65 @@
 <!--Copyright 2020 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # HerBERT
 ## Overview
 The HerBERT model was proposed in [KLEJ: Comprehensive Benchmark for Polish Language Understanding](https://www.aclweb.org/anthology/2020.acl-main.111.pdf) by Piotr Rybak, Robert Mroczkowski, Janusz Tracz, and
 Ireneusz Gawlik. It is a BERT-based language model trained on Polish corpora using only the MLM objective with dynamic
 masking of whole words.
 The abstract from the paper is the following:
 *In recent years, a series of Transformer-based models unlocked major improvements in general natural language
 understanding (NLU) tasks. Such a fast pace of research would not be possible without general NLU benchmarks, which
 allow for a fair comparison of the proposed methods. However, such benchmarks are available only for a handful of
 languages. To alleviate this issue, we introduce a comprehensive multi-task benchmark for the Polish language
 understanding, accompanied by an online leaderboard. It consists of a diverse set of tasks, adopted from existing
 datasets for named entity recognition, question-answering, textual entailment, and others. We also introduce a new
 sentiment analysis task for the e-commerce domain, named Allegro Reviews (AR). To ensure a common evaluation scheme and
 promote models that generalize to different NLU tasks, the benchmark includes datasets from varying domains and
 applications. Additionally, we release HerBERT, a Transformer-based model trained specifically for the Polish language,
 which has the best average performance and obtains the best results for three out of nine tasks. Finally, we provide an
 extensive evaluation, including several standard baselines and recently proposed, multilingual Transformer-based
 models.*
 Examples of use:
 ```python
 >>> from transformers import HerbertTokenizer, RobertaModel
 >>> tokenizer = HerbertTokenizer.from_pretrained("allegro/herbert-klej-cased-tokenizer-v1")
 >>> model = RobertaModel.from_pretrained("allegro/herbert-klej-cased-v1")
 >>> encoded_input = tokenizer.encode("Kto ma lepszą sztukę, ma lepszy rząd – to jasne.", return_tensors='pt')
 >>> outputs = model(encoded_input)
 >>> # HerBERT can also be loaded using AutoTokenizer and AutoModel:
 >>> from transformers import AutoModel, AutoTokenizer
 >>> tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-klej-cased-tokenizer-v1")
 >>> model = AutoModel.from_pretrained("allegro/herbert-klej-cased-v1")
 ```
 This model was contributed by [rmroczkowski](https://huggingface.co/rmroczkowski). The original code can be found
 [here](https://github.com/allegro/HerBERT).
 ## HerbertTokenizer
 [[autodoc]] HerbertTokenizer
 ## HerbertTokenizerFast
 [[autodoc]] HerbertTokenizerFast
--- a/docs/source/model_doc/herbert.rst
+++ b/docs/source/model_doc/herbert.rst
@@ -1,73 +0,0 @@
 .. 
    Copyright 2020 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 HerBERT
 -----------------------------------------------------------------------------------------------------------------------
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The HerBERT model was proposed in `KLEJ: Comprehensive Benchmark for Polish Language Understanding
 <https://www.aclweb.org/anthology/2020.acl-main.111.pdf>`__ by Piotr Rybak, Robert Mroczkowski, Janusz Tracz, and
 Ireneusz Gawlik. It is a BERT-based Language Model trained on Polish Corpora using only MLM objective with dynamic
 masking of whole words.
 The abstract from the paper is the following:
 *In recent years, a series of Transformer-based models unlocked major improvements in general natural language
 understanding (NLU) tasks. Such a fast pace of research would not be possible without general NLU benchmarks, which
 allow for a fair comparison of the proposed methods. However, such benchmarks are available only for a handful of
 languages. To alleviate this issue, we introduce a comprehensive multi-task benchmark for the Polish language
 understanding, accompanied by an online leaderboard. It consists of a diverse set of tasks, adopted from existing
 datasets for named entity recognition, question-answering, textual entailment, and others. We also introduce a new
 sentiment analysis task for the e-commerce domain, named Allegro Reviews (AR). To ensure a common evaluation scheme and
 promote models that generalize to different NLU tasks, the benchmark includes datasets from varying domains and
 applications. Additionally, we release HerBERT, a Transformer-based model trained specifically for the Polish language,
 which has the best average performance and obtains the best results for three out of nine tasks. Finally, we provide an
 extensive evaluation, including several standard baselines and recently proposed, multilingual Transformer-based
 models.*
 Examples of use:
 .. code-block::
    >>> from transformers import HerbertTokenizer, RobertaModel
    >>> tokenizer = HerbertTokenizer.from_pretrained("allegro/herbert-klej-cased-tokenizer-v1")
    >>> model = RobertaModel.from_pretrained("allegro/herbert-klej-cased-v1")
    >>> encoded_input = tokenizer.encode("Kto ma lepszą sztukę, ma lepszy rząd – to jasne.", return_tensors='pt')
    >>> outputs = model(encoded_input)
    >>> # HerBERT can also be loaded using AutoTokenizer and AutoModel:
    >>> import torch
    >>> from transformers import AutoModel, AutoTokenizer
    >>> tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-klej-cased-tokenizer-v1")
    >>> model = AutoModel.from_pretrained("allegro/herbert-klej-cased-v1")
 This model was contributed by `rmroczkowski <https://huggingface.co/rmroczkowski>`__. The original code can be found
 `here <https://github.com/allegro/HerBERT>`__.
 HerbertTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.HerbertTokenizer
    :members: 
 HerbertTokenizerFast
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.HerbertTokenizerFast
    :members: 
--- a/docs/source/model_doc/hubert.mdx
+++ b/docs/source/model_doc/hubert.mdx
@@ -0,0 +1,71 @@
 <!--Copyright 2021 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # Hubert
 ## Overview
 Hubert was proposed in [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan
 Salakhutdinov, Abdelrahman Mohamed.
 The abstract from the paper is the following:
 *Self-supervised approaches for speech representation learning are challenged by three unique problems: (1) there are
 multiple sound units in each input utterance, (2) there is no lexicon of input sound units during the pre-training
 phase, and (3) sound units have variable lengths with no explicit segmentation. To deal with these three problems, we
 propose the Hidden-Unit BERT (HuBERT) approach for self-supervised speech representation learning, which utilizes an
 offline clustering step to provide aligned target labels for a BERT-like prediction loss. A key ingredient of our
 approach is applying the prediction loss over the masked regions only, which forces the model to learn a combined
 acoustic and language model over the continuous inputs. HuBERT relies primarily on the consistency of the unsupervised
 clustering step rather than the intrinsic quality of the assigned cluster labels. Starting with a simple k-means
 teacher of 100 clusters, and using two iterations of clustering, the HuBERT model either matches or improves upon the
 state-of-the-art wav2vec 2.0 performance on the Librispeech (960h) and Libri-light (60,000h) benchmarks with 10min, 1h,
 10h, 100h, and 960h fine-tuning subsets. Using a 1B parameter model, HuBERT shows up to 19% and 13% relative WER
 reduction on the more challenging dev-other and test-other evaluation subsets.*
 Tips:
 - Hubert is a speech model that accepts a float array corresponding to the raw waveform of the speech signal.
 - The Hubert model was fine-tuned using connectionist temporal classification (CTC), so the model output has to be
  decoded using [`Wav2Vec2CTCTokenizer`].
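To give an intuition for the CTC decoding step mentioned in the tips, here is a minimal greedy-decoding sketch in plain Python. It is not the implementation of `Wav2Vec2CTCTokenizer`; the function name `greedy_ctc_decode`, the toy vocabulary, and the choice of id 0 as the blank token are assumptions made for illustration.

```python
from itertools import groupby


def greedy_ctc_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse consecutive repeated frame predictions, then drop CTC blank tokens."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(id_to_char[i] for i in collapsed if i != blank_id)


# Hypothetical vocabulary: id 0 plays the role of the CTC blank token.
vocab = {0: "<pad>", 1: "h", 2: "e", 3: "l", 4: "o"}

# One predicted id per audio frame (e.g. the argmax over the model's logits).
frames = [1, 1, 0, 2, 2, 3, 0, 3, 3, 4, 0]
decoded = greedy_ctc_decode(frames, vocab)
```

The blank between the two runs of `3` is what lets the double "l" survive the collapsing step.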
 This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten).
 ## HubertConfig
 [[autodoc]] HubertConfig
 ## HubertModel
 [[autodoc]] HubertModel
    - forward
 ## HubertForCTC
 [[autodoc]] HubertForCTC
    - forward
 ## HubertForSequenceClassification
 [[autodoc]] HubertForSequenceClassification
    - forward
 ## TFHubertModel
 [[autodoc]] TFHubertModel
    - call
 ## TFHubertForCTC
 [[autodoc]] TFHubertForCTC
    - call
--- a/docs/source/model_doc/hubert.rst
+++ b/docs/source/model_doc/hubert.rst
@@ -1,86 +0,0 @@
 .. 
    Copyright 2021 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 Hubert
 -----------------------------------------------------------------------------------------------------------------------
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 Hubert was proposed in `HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
 <https://arxiv.org/abs/2106.07447>`__ by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan
 Salakhutdinov, Abdelrahman Mohamed.
 The abstract from the paper is the following:
 *Self-supervised approaches for speech representation learning are challenged by three unique problems: (1) there are
 multiple sound units in each input utterance, (2) there is no lexicon of input sound units during the pre-training
 phase, and (3) sound units have variable lengths with no explicit segmentation. To deal with these three problems, we
 propose the Hidden-Unit BERT (HuBERT) approach for self-supervised speech representation learning, which utilizes an
 offline clustering step to provide aligned target labels for a BERT-like prediction loss. A key ingredient of our
 approach is applying the prediction loss over the masked regions only, which forces the model to learn a combined
 acoustic and language model over the continuous inputs. HuBERT relies primarily on the consistency of the unsupervised
 clustering step rather than the intrinsic quality of the assigned cluster labels. Starting with a simple k-means
 teacher of 100 clusters, and using two iterations of clustering, the HuBERT model either matches or improves upon the
 state-of-the-art wav2vec 2.0 performance on the Librispeech (960h) and Libri-light (60,000h) benchmarks with 10min, 1h,
 10h, 100h, and 960h fine-tuning subsets. Using a 1B parameter model, HuBERT shows up to 19% and 13% relative WER
 reduction on the more challenging dev-other and test-other evaluation subsets.*
 Tips:
 - Hubert is a speech model that accepts a float array corresponding to the raw waveform of the speech signal.
 - Hubert model was fine-tuned using connectionist temporal classification (CTC) so the model output has to be decoded
  using :class:`~transformers.Wav2Vec2CTCTokenizer`.
 This model was contributed by `patrickvonplaten <https://huggingface.co/patrickvonplaten>`__.
 HubertConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.HubertConfig
    :members:
 HubertModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.HubertModel
    :members: forward
 HubertForCTC
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.HubertForCTC
    :members: forward
 HubertForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.HubertForSequenceClassification
    :members: forward
 TFHubertModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFHubertModel
    :members: call
 TFHubertForCTC
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFHubertForCTC
    :members: call
--- a/docs/source/model_doc/ibert.mdx
+++ b/docs/source/model_doc/ibert.mdx
@@ -0,0 +1,72 @@
 <!--Copyright 2020 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # I-BERT
 ## Overview
 The I-BERT model was proposed in [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by
 Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney and Kurt Keutzer. It's a quantized version of RoBERTa running
 inference up to four times faster.
 The abstract from the paper is the following:
 *Transformer based models, like BERT and RoBERTa, have achieved state-of-the-art results in many Natural Language
 Processing tasks. However, their memory footprint, inference latency, and power consumption are prohibitive for
 efficient inference at the edge, and even at the data center. While quantization can be a viable solution for this,
 previous work on quantizing Transformer based models use floating-point arithmetic during inference, which cannot
 efficiently utilize integer-only logical units such as the recent Turing Tensor Cores, or traditional integer-only ARM
 processors. In this work, we propose I-BERT, a novel quantization scheme for Transformer based models that quantizes
 the entire inference with integer-only arithmetic. Based on lightweight integer-only approximation methods for
 nonlinear operations, e.g., GELU, Softmax, and Layer Normalization, I-BERT performs an end-to-end integer-only BERT
 inference without any floating point calculation. We evaluate our approach on GLUE downstream tasks using
 RoBERTa-Base/Large. We show that for both cases, I-BERT achieves similar (and slightly higher) accuracy as compared to
 the full-precision baseline. Furthermore, our preliminary implementation of I-BERT shows a speedup of 2.4 - 4.0x for
 INT8 inference on a T4 GPU system as compared to FP32 inference. The framework has been developed in PyTorch and has
 been open-sourced.*
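For readers unfamiliar with quantization, the following sketch shows generic symmetric INT8 quantization of a list of floats. It is only an illustration of the general idea and does not reproduce I-BERT's scheme, which additionally replaces nonlinear operations such as GELU, Softmax, and Layer Normalization with integer-only approximations. The function names and the example values are assumptions.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using a single shared scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]  # hypothetical weights, for illustration
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)

# The round trip is lossy: each value is off by at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```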
 This model was contributed by [kssteven](https://huggingface.co/kssteven). The original code can be found [here](https://github.com/kssteven418/I-BERT).
 ## IBertConfig
 [[autodoc]] IBertConfig
 ## IBertModel
 [[autodoc]] IBertModel
    - forward
 ## IBertForMaskedLM
 [[autodoc]] IBertForMaskedLM
    - forward
 ## IBertForSequenceClassification
 [[autodoc]] IBertForSequenceClassification
    - forward
 ## IBertForMultipleChoice
 [[autodoc]] IBertForMultipleChoice
    - forward
 ## IBertForTokenClassification
 [[autodoc]] IBertForTokenClassification
    - forward
 ## IBertForQuestionAnswering
 [[autodoc]] IBertForQuestionAnswering
    - forward
--- a/docs/source/model_doc/ibert.rst
+++ b/docs/source/model_doc/ibert.rst
@@ -1,89 +0,0 @@
 .. 
    Copyright 2020 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 I-BERT
 -----------------------------------------------------------------------------------------------------------------------
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The I-BERT model was proposed in `I-BERT: Integer-only BERT Quantization <https://arxiv.org/abs/2101.01321>`__ by
 Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney and Kurt Keutzer. It's a quantized version of RoBERTa running
 inference up to four times faster.
 The abstract from the paper is the following:
 *Transformer based models, like BERT and RoBERTa, have achieved state-of-the-art results in many Natural Language
 Processing tasks. However, their memory footprint, inference latency, and power consumption are prohibitive for
 efficient inference at the edge, and even at the data center. While quantization can be a viable solution for this,
 previous work on quantizing Transformer based models use floating-point arithmetic during inference, which cannot
 efficiently utilize integer-only logical units such as the recent Turing Tensor Cores, or traditional integer-only ARM
 processors. In this work, we propose I-BERT, a novel quantization scheme for Transformer based models that quantizes
 the entire inference with integer-only arithmetic. Based on lightweight integer-only approximation methods for
 nonlinear operations, e.g., GELU, Softmax, and Layer Normalization, I-BERT performs an end-to-end integer-only BERT
 inference without any floating point calculation. We evaluate our approach on GLUE downstream tasks using
 RoBERTa-Base/Large. We show that for both cases, I-BERT achieves similar (and slightly higher) accuracy as compared to
 the full-precision baseline. Furthermore, our preliminary implementation of I-BERT shows a speedup of 2.4 - 4.0x for
 INT8 inference on a T4 GPU system as compared to FP32 inference. The framework has been developed in PyTorch and has
 been open-sourced.*
 This model was contributed by `kssteven <https://huggingface.co/kssteven>`__. The original code can be found `here
 <https://github.com/kssteven418/I-BERT>`__.
 IBertConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.IBertConfig
    :members:
 IBertModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.IBertModel
    :members: forward
 IBertForMaskedLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.IBertForMaskedLM
    :members: forward
 IBertForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.IBertForSequenceClassification
    :members: forward
 IBertForMultipleChoice
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.IBertForMultipleChoice
    :members: forward
 IBertForTokenClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.IBertForTokenClassification
    :members: forward
 IBertForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.IBertForQuestionAnswering
    :members: forward
--- a/docs/source/model_doc/layoutlm.mdx
+++ b/docs/source/model_doc/layoutlm.mdx
@@ -0,0 +1,124 @@
 <!--Copyright 2020 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # LayoutLM
 <a id='Overview'></a>
 ## Overview
 The LayoutLM model was proposed in the paper [LayoutLM: Pre-training of Text and Layout for Document Image
 Understanding](https://arxiv.org/abs/1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and
 Ming Zhou. It's a simple but effective pretraining method of text and layout for document image understanding and
 information extraction tasks, such as form understanding and receipt understanding. It obtains state-of-the-art results
 on several downstream tasks:
 - form understanding: the [FUNSD](https://guillaumejaume.github.io/FUNSD/) dataset (a collection of 199 annotated
  forms comprising more than 30,000 words).
 - receipt understanding: the [SROIE](https://rrc.cvc.uab.es/?ch=13) dataset (a collection of 626 receipts for
  training and 347 receipts for testing).
 - document image classification: the [RVL-CDIP](https://www.cs.cmu.edu/~aharley/rvl-cdip/) dataset (a collection of
  400,000 images belonging to one of 16 classes).
 The abstract from the paper is the following:
 *Pre-training techniques have been verified successfully in a variety of NLP tasks in recent years. Despite the
 widespread use of pretraining models for NLP applications, they almost exclusively focus on text-level manipulation,
 while neglecting layout and style information that is vital for document image understanding. In this paper, we propose
 the LayoutLM to jointly model interactions between text and layout information across scanned document images, which is
 beneficial for a great number of real-world document image understanding tasks such as information extraction from
 scanned documents. Furthermore, we also leverage image features to incorporate words' visual information into LayoutLM.
 To the best of our knowledge, this is the first time that text and layout are jointly learned in a single framework for
 document-level pretraining. It achieves new state-of-the-art results in several downstream tasks, including form
 understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24) and document image classification
 (from 93.07 to 94.42).*
 Tips:
 - In addition to `input_ids`, [`~transformers.LayoutLMModel.forward`] also expects the input `bbox`, which are
  the bounding boxes (i.e. 2D-positions) of the input tokens. These can be obtained using an external OCR engine such
  as Google's [Tesseract](https://github.com/tesseract-ocr/tesseract) (there's a [Python wrapper](https://pypi.org/project/pytesseract/) available). Each bounding box should be in (x0, y0, x1, y1) format, where
  (x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1, y1) represents the
  position of the lower right corner. Note that one first needs to normalize the bounding boxes to be on a 0-1000
  scale. To normalize, you can use the following function:
 ```python
 def normalize_bbox(bbox, width, height):
     return [
         int(1000 * (bbox[0] / width)),
         int(1000 * (bbox[1] / height)),
         int(1000 * (bbox[2] / width)),
         int(1000 * (bbox[3] / height)),
     ]
 ```
 Here, `width` and `height` correspond to the width and height of the original document in which the token
 occurs. These can be obtained using the Python Imaging Library (PIL), for example, as follows:
 ```python
 from PIL import Image
 image = Image.open("name_of_your_document.png")  # any image format that Pillow can read
 width, height = image.size
 ```
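As a worked example of the normalization described above, the sketch below applies `normalize_bbox` to one made-up OCR box. The page size and pixel coordinates are hypothetical; in practice they would come from the image and the OCR engine.

```python
def normalize_bbox(bbox, width, height):
    """Scale a pixel-space (x0, y0, x1, y1) box to the 0-1000 range."""
    return [
        int(1000 * (bbox[0] / width)),
        int(1000 * (bbox[1] / height)),
        int(1000 * (bbox[2] / width)),
        int(1000 * (bbox[3] / height)),
    ]


# Hypothetical page of 800 x 1000 pixels and one word box in pixel coordinates.
width, height = 800, 1000
pixel_box = (200, 125, 600, 500)  # (x0, y0, x1, y1): upper left, lower right

normalized = normalize_bbox(pixel_box, width, height)
print(normalized)  # [250, 125, 750, 500]
```

Because `int` truncates, boxes that lie inside the page always stay within the 0-1000 range.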
 - For a demo which shows how to fine-tune [`LayoutLMForTokenClassification`] on the [FUNSD dataset](https://guillaumejaume.github.io/FUNSD/) (a collection of annotated forms), see [this notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLM/Fine_tuning_LayoutLMForTokenClassification_on_FUNSD.ipynb).
  It includes an inference part, which shows how to use Google's Tesseract on a new document.
 This model was contributed by [liminghao1630](https://huggingface.co/liminghao1630). The original code can be found
 [here](https://github.com/microsoft/unilm/tree/master/layoutlm).
 ## LayoutLMConfig
 [[autodoc]] LayoutLMConfig
 ## LayoutLMTokenizer
 [[autodoc]] LayoutLMTokenizer
 ## LayoutLMTokenizerFast
 [[autodoc]] LayoutLMTokenizerFast
 ## LayoutLMModel
 [[autodoc]] LayoutLMModel
 ## LayoutLMForMaskedLM
 [[autodoc]] LayoutLMForMaskedLM
 ## LayoutLMForSequenceClassification
 [[autodoc]] LayoutLMForSequenceClassification
 ## LayoutLMForTokenClassification
 [[autodoc]] LayoutLMForTokenClassification
 ## TFLayoutLMModel
 [[autodoc]] TFLayoutLMModel
 ## TFLayoutLMForMaskedLM
 [[autodoc]] TFLayoutLMForMaskedLM
 ## TFLayoutLMForSequenceClassification
 [[autodoc]] TFLayoutLMForSequenceClassification
 ## TFLayoutLMForTokenClassification
 [[autodoc]] TFLayoutLMForTokenClassification
--- a/docs/source/model_doc/layoutlm.rst
+++ b/docs/source/model_doc/layoutlm.rst
@@ -1,161 +0,0 @@
 .. 
    Copyright 2020 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 LayoutLM
 -----------------------------------------------------------------------------------------------------------------------
 .. _Overview:
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The LayoutLM model was proposed in the paper `LayoutLM: Pre-training of Text and Layout for Document Image
 Understanding <https://arxiv.org/abs/1912.13318>`__ by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and
 Ming Zhou. It's a simple but effective pretraining method of text and layout for document image understanding and
 information extraction tasks, such as form understanding and receipt understanding. It obtains state-of-the-art results
 on several downstream tasks:
 - form understanding: the `FUNSD <https://guillaumejaume.github.io/FUNSD/>`__ dataset (a collection of 199 annotated
  forms comprising more than 30,000 words).
 - receipt understanding: the `SROIE <https://rrc.cvc.uab.es/?ch=13>`__ dataset (a collection of 626 receipts for
  training and 347 receipts for testing).
 - document image classification: the `RVL-CDIP <https://www.cs.cmu.edu/~aharley/rvl-cdip/>`__ dataset (a collection of
  400,000 images belonging to one of 16 classes).
 The abstract from the paper is the following:
 *Pre-training techniques have been verified successfully in a variety of NLP tasks in recent years. Despite the
 widespread use of pretraining models for NLP applications, they almost exclusively focus on text-level manipulation,
 while neglecting layout and style information that is vital for document image understanding. In this paper, we propose
 the LayoutLM to jointly model interactions between text and layout information across scanned document images, which is
 beneficial for a great number of real-world document image understanding tasks such as information extraction from
 scanned documents. Furthermore, we also leverage image features to incorporate words' visual information into LayoutLM.
 To the best of our knowledge, this is the first time that text and layout are jointly learned in a single framework for
 document-level pretraining. It achieves new state-of-the-art results in several downstream tasks, including form
 understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24) and document image classification
 (from 93.07 to 94.42).*
 Tips:
 - In addition to `input_ids`, :meth:`~transformer.LayoutLMModel.forward` also expects the input :obj:`bbox`, which are
  the bounding boxes (i.e. 2D-positions) of the input tokens. These can be obtained using an external OCR engine such
  as Google's `Tesseract <https://github.com/tesseract-ocr/tesseract>`__ (there's a `Python wrapper
  <https://pypi.org/project/pytesseract/>`__ available). Each bounding box should be in (x0, y0, x1, y1) format, where
  (x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1, y1) represents the
  position of the lower right corner. Note that one first needs to normalize the bounding boxes to be on a 0-1000
  scale. To normalize, you can use the following function:
 .. code-block::
    def normalize_bbox(bbox, width, height):
         return [
             int(1000 * (bbox[0] / width)),
             int(1000 * (bbox[1] / height)),
             int(1000 * (bbox[2] / width)),
             int(1000 * (bbox[3] / height)),
         ]
 Here, :obj:`width` and :obj:`height` correspond to the width and height of the original document in which the token
 occurs. Those can be obtained using the Python Image Library (PIL) library for example, as follows:
 .. code-block::
    from PIL import Image
    image = Image.open("name_of_your_document - can be a png file, pdf, etc.")
    width, height = image.size
 - For a demo which shows how to fine-tune :class:`LayoutLMForTokenClassification` on the `FUNSD dataset
  <https://guillaumejaume.github.io/FUNSD/>`__ (a collection of annotated forms), see `this notebook
  <https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLM/Fine_tuning_LayoutLMForTokenClassification_on_FUNSD.ipynb>`__.
  It includes an inference part, which shows how to use Google's Tesseract on a new document.
 This model was contributed by `liminghao1630 <https://huggingface.co/liminghao1630>`__. The original code can be found
 `here <https://github.com/microsoft/unilm/tree/master/layoutlm>`_.
 LayoutLMConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LayoutLMConfig
    :members:
 LayoutLMTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LayoutLMTokenizer
    :members:
 LayoutLMTokenizerFast
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LayoutLMTokenizerFast
    :members:
 LayoutLMModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LayoutLMModel
    :members:
 LayoutLMForMaskedLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LayoutLMForMaskedLM
    :members:
 LayoutLMForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LayoutLMForSequenceClassification
    :members:
 LayoutLMForTokenClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LayoutLMForTokenClassification
    :members:
 TFLayoutLMModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFLayoutLMModel
    :members:
 TFLayoutLMForMaskedLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFLayoutLMForMaskedLM
    :members:
 TFLayoutLMForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFLayoutLMForSequenceClassification
    :members:
 TFLayoutLMForTokenClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFLayoutLMForTokenClassification
    :members:
--- a/docs/source/model_doc/layoutlmv2.mdx
+++ b/docs/source/model_doc/layoutlmv2.mdx
@@ -0,0 +1,287 @@
 <!--Copyright 2021 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # LayoutLMV2
 ## Overview
 The LayoutLMV2 model was proposed in [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740) by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu,
 Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou. LayoutLMV2 improves [LayoutLM](layoutlm) to obtain
 state-of-the-art results across several document image understanding benchmarks:
 - information extraction from scanned documents: the [FUNSD](https://guillaumejaume.github.io/FUNSD/) dataset (a
  collection of 199 annotated forms comprising more than 30,000 words), the [CORD](https://github.com/clovaai/cord)
  dataset (a collection of 800 receipts for training, 100 for validation and 100 for testing), the [SROIE](https://rrc.cvc.uab.es/?ch=13) dataset (a collection of 626 receipts for training and 347 receipts for testing)
  and the [Kleister-NDA](https://github.com/applicaai/kleister-nda) dataset (a collection of non-disclosure
  agreements from the EDGAR database, including 254 documents for training, 83 documents for validation, and 203
  documents for testing).
 - document image classification: the [RVL-CDIP](https://www.cs.cmu.edu/~aharley/rvl-cdip/) dataset (a collection of
  400,000 images belonging to one of 16 classes).
 - document visual question answering: the [DocVQA](https://arxiv.org/abs/2007.00398) dataset (a collection of 50,000
  questions defined on 12,000+ document images).
 The abstract from the paper is the following:
 *Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to
 its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. In this
 paper, we present LayoutLMv2 by pre-training text, layout and image in a multi-modal framework, where new model
 architectures and pre-training tasks are leveraged. Specifically, LayoutLMv2 not only uses the existing masked
 visual-language modeling task but also the new text-image alignment and text-image matching tasks in the pre-training
 stage, where cross-modality interaction is better learned. Meanwhile, it also integrates a spatial-aware self-attention
 mechanism into the Transformer architecture, so that the model can fully understand the relative positional
 relationship among different text blocks. Experiment results show that LayoutLMv2 outperforms strong baselines and
 achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks,
 including FUNSD (0.7895 -> 0.8420), CORD (0.9493 -> 0.9601), SROIE (0.9524 -> 0.9781), Kleister-NDA (0.834 -> 0.852),
 RVL-CDIP (0.9443 -> 0.9564), and DocVQA (0.7295 -> 0.8672). The pre-trained LayoutLMv2 model is publicly available at
 this https URL.*
 Tips:
 - The main difference between LayoutLMv1 and LayoutLMv2 is that the latter incorporates visual embeddings during
  pre-training (while LayoutLMv1 only adds visual embeddings during fine-tuning).
 - LayoutLMv2 adds both a relative 1D attention bias as well as a spatial 2D attention bias to the attention scores in
  the self-attention layers. Details can be found on page 5 of the [paper](https://arxiv.org/abs/2012.14740).
 - Demo notebooks on how to use the LayoutLMv2 model on RVL-CDIP, FUNSD, DocVQA, CORD can be found [here](https://github.com/NielsRogge/Transformers-Tutorials).
 - LayoutLMv2 uses Facebook AI's [Detectron2](https://github.com/facebookresearch/detectron2/) package for its visual
  backbone. See [this link](https://detectron2.readthedocs.io/en/latest/tutorials/install.html) for installation
  instructions.
 - In addition to `input_ids`, [`~LayoutLMv2Model.forward`] expects 2 additional inputs, namely
  `image` and `bbox`. The `image` input corresponds to the original document image in which the text
  tokens occur. The model expects each document image to be of size 224x224. This means that if you have a batch of
  document images, `image` should be a tensor of shape (batch_size, 3, 224, 224). This can be either a
  `torch.Tensor` or a `detectron2.structures.ImageList`. You don't need to normalize the channels, as this is
  done by the model. Note that the visual backbone expects BGR channels instead of RGB, as all models
  in Detectron2 are pre-trained using the BGR format. The `bbox` input contains the bounding boxes (i.e. 2D positions)
  of the input text tokens, in the same format as for [`LayoutLMModel`]. These can be obtained using an
  external OCR engine such as Google's [Tesseract](https://github.com/tesseract-ocr/tesseract) (there's a [Python
  wrapper](https://pypi.org/project/pytesseract/) available). Each bounding box should be in (x0, y0, x1, y1)
  format, where (x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1, y1)
  represents the position of the lower right corner. Note that one first needs to normalize the bounding boxes to be on
  a 0-1000 scale. To normalize, you can use the following function:
 ```python
 def normalize_bbox(bbox, width, height):
     return [
         int(1000 * (bbox[0] / width)),
         int(1000 * (bbox[1] / height)),
         int(1000 * (bbox[2] / width)),
         int(1000 * (bbox[3] / height)),
     ]
 ```
 Here, `width` and `height` correspond to the width and height of the original document in which the token
 occurs (before resizing the image). These can be obtained using, for example, the Python Imaging Library (PIL), as
 follows:
 ```python
 from PIL import Image
 image = Image.open("name_of_your_document - can be a png file, pdf, etc.")
 width, height = image.size
 ```
 However, this model includes a brand new [`~transformers.LayoutLMv2Processor`] which can be used to directly
 prepare data for the model (including applying OCR under the hood). More information can be found in the "Usage"
 section below.
 - Internally, [`~transformers.LayoutLMv2Model`] will send the `image` input through its visual backbone to
  obtain a lower-resolution feature map, whose shape is equal to the `image_feature_pool_shape` attribute of
  [`~transformers.LayoutLMv2Config`]. This feature map is then flattened to obtain a sequence of image tokens. As
  the size of the feature map is 7x7 by default, one obtains 49 image tokens. These are then concatenated with the text
  tokens, and sent through the Transformer encoder. This means that the last hidden states of the model will have a
  length of 512 + 49 = 561, if you pad the text tokens up to the max length. More generally, the last hidden states
  will have a sequence length of `seq_length` + `config.image_feature_pool_shape[0]` *
  `config.image_feature_pool_shape[1]`.
 - When calling [`~transformers.LayoutLMv2Model.from_pretrained`], a warning will be printed with a long list of
  parameter names that are not initialized. This is not a problem, as these parameters are batch normalization
  statistics, which will be populated when fine-tuning on a custom dataset.
 - If you want to train the model in a distributed environment, make sure to call [`synchronize_batch_norm`] on the
  model in order to properly synchronize the batch normalization layers of the visual backbone.
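Two of the tips above can be checked with plain Python. The sketch below reuses `normalize_bbox` as given and adds a small helper, `encoder_sequence_length`, which is a hypothetical name introduced here only to illustrate the `seq_length` + pool height * pool width arithmetic. The page size and pixel box are made-up values.

```python
def normalize_bbox(bbox, width, height):
    """Scale a pixel-space (x0, y0, x1, y1) box to the 0-1000 range."""
    return [
        int(1000 * (bbox[0] / width)),
        int(1000 * (bbox[1] / height)),
        int(1000 * (bbox[2] / width)),
        int(1000 * (bbox[3] / height)),
    ]


def encoder_sequence_length(text_length, image_feature_pool_shape=(7, 7)):
    """Hypothetical helper: text tokens plus the flattened image tokens."""
    pool_height, pool_width = image_feature_pool_shape[0], image_feature_pool_shape[1]
    return text_length + pool_height * pool_width


# Made-up word box on a made-up 1600x800 pixel page.
box = normalize_bbox([400, 100, 800, 300], width=1600, height=800)
print(box)  # [250, 125, 500, 375]

# 512 padded text tokens plus 7 * 7 image tokens.
print(encoder_sequence_length(512))  # 561
```

Because the boxes are normalized against the original page size, they stay valid after the image itself is resized to 224x224.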
 In addition, there's LayoutXLM, which is a multilingual version of LayoutLMv2. More information can be found on
 [LayoutXLM's documentation page](layoutxlm).
 ## Usage: LayoutLMv2Processor
 The easiest way to prepare data for the model is to use [`LayoutLMv2Processor`], which internally
 combines a feature extractor ([`LayoutLMv2FeatureExtractor`]) and a tokenizer
 ([`LayoutLMv2Tokenizer`] or [`LayoutLMv2TokenizerFast`]). The feature extractor
 handles the image modality, while the tokenizer handles the text modality. A processor combines both, which is ideal
 for a multi-modal model like LayoutLMv2. Note that you can still use both separately, if you only want to handle one
 modality.
 ```python
 from transformers import LayoutLMv2FeatureExtractor, LayoutLMv2TokenizerFast, LayoutLMv2Processor
 feature_extractor = LayoutLMv2FeatureExtractor() # apply_ocr is set to True by default
 tokenizer = LayoutLMv2TokenizerFast.from_pretrained("microsoft/layoutlmv2-base-uncased")
 processor = LayoutLMv2Processor(feature_extractor, tokenizer)
 ```
 In short, one can provide a document image (and possibly additional data) to [`LayoutLMv2Processor`],
 and it will create the inputs expected by the model. Internally, the processor first uses
 [`LayoutLMv2FeatureExtractor`] to apply OCR on the image to get a list of words and normalized
 bounding boxes, as well as to resize the image to a given size in order to get the `image` input. The words and
 normalized bounding boxes are then provided to [`LayoutLMv2Tokenizer`] or
 [`LayoutLMv2TokenizerFast`], which converts them to token-level `input_ids`,
 `attention_mask`, `token_type_ids`, `bbox`. Optionally, one can provide word labels to the processor,
 which are turned into token-level `labels`.
 [`LayoutLMv2Processor`] uses [PyTesseract](https://pypi.org/project/pytesseract/), a Python
 wrapper around Google's Tesseract OCR engine, under the hood. Note that you can still use your own OCR engine of
 choice, and provide the words and normalized boxes yourself. This requires initializing
 [`LayoutLMv2FeatureExtractor`] with `apply_ocr` set to `False`.
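When bringing your own OCR engine, the main work is turning its output into a list of words and a matching list of normalized boxes. The following is a minimal sketch, not the processor's actual implementation. It assumes OCR results arrive as a dictionary of parallel lists with `text`, `left`, `top`, `width` and `height` keys, which is the layout we understand `pytesseract.image_to_data(..., output_type=Output.DICT)` to produce. The helper name `words_and_boxes_from_ocr` and the sample values are hypothetical.

```python
def words_and_boxes_from_ocr(ocr, page_width, page_height):
    """Convert parallel OCR lists into words and 0-1000 normalized boxes."""
    words, boxes = [], []
    columns = zip(ocr["text"], ocr["left"], ocr["top"], ocr["width"], ocr["height"])
    for text, left, top, box_width, box_height in columns:
        if not text.strip():
            continue  # skip empty entries (assumed to be non-word layout rows)
        # Convert (left, top, width, height) to corner format (x0, y0, x1, y1).
        x0, y0, x1, y1 = left, top, left + box_width, top + box_height
        words.append(text)
        boxes.append(
            [
                int(1000 * (x0 / page_width)),
                int(1000 * (y0 / page_height)),
                int(1000 * (x1 / page_width)),
                int(1000 * (y1 / page_height)),
            ]
        )
    return words, boxes


# Made-up OCR output for a made-up 1600x800 pixel page.
ocr_output = {
    "text": ["hello", "", "world"],
    "left": [400, 0, 800],
    "top": [100, 0, 400],
    "width": [400, 0, 400],
    "height": [200, 0, 200],
}
words, boxes = words_and_boxes_from_ocr(ocr_output, page_width=1600, page_height=800)
print(words)  # ['hello', 'world']
print(boxes)  # [[250, 125, 500, 375], [500, 500, 750, 750]]
```

The resulting `words` and `boxes` can then be passed to a processor whose feature extractor was initialized with `apply_ocr` set to `False`, as in use cases 2, 3 and 5.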
 In total, there are 5 use cases that are supported by the processor. Below, we list them all. Note that each of these
 use cases works for both batched and non-batched inputs (we illustrate them for non-batched inputs).
 **Use case 1: document image classification (training, inference) + token classification (inference), apply_ocr=True**
 This is the simplest case, in which the processor (actually the feature extractor) will perform OCR on the image to get
 the words and normalized bounding boxes.
 ```python
 from transformers import LayoutLMv2Processor
 from PIL import Image
 processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
 image = Image.open("name_of_your_document - can be a png file, pdf, etc.").convert("RGB")
 encoding = processor(image, return_tensors="pt") # you can also add all tokenizer parameters here such as padding, truncation
 print(encoding.keys())
 # dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'image'])
 ```
 **Use case 2: document image classification (training, inference) + token classification (inference), apply_ocr=False**
 In case one wants to do OCR themselves, one can initialize the feature extractor with `apply_ocr` set to
 `False`. In that case, one should provide the words and corresponding (normalized) bounding boxes themselves to
 the processor.
 ```python
 from transformers import LayoutLMv2Processor
 from PIL import Image
 processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", revision="no_ocr")
 image = Image.open("name_of_your_document - can be a png file, pdf, etc.").convert("RGB")
 words = ["hello", "world"]
 boxes = [[1, 2, 3, 4], [5, 6, 7, 8]] # make sure to normalize your bounding boxes
 encoding = processor(image, words, boxes=boxes, return_tensors="pt")
 print(encoding.keys())
 # dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'image'])
 ```
 **Use case 3: token classification (training), apply_ocr=False**
 For token classification tasks (such as FUNSD, CORD, SROIE, Kleister-NDA), one can also provide the corresponding word
 labels in order to train a model. The processor will then convert these into token-level `labels`. By default, it
 will only label the first wordpiece of a word, and label the remaining wordpieces with -100, which is the
 `ignore_index` of PyTorch's CrossEntropyLoss. In case you want all wordpieces of a word to be labeled, you can
 initialize the tokenizer with `only_label_first_subword` set to `False`.
 ```python
 from transformers import LayoutLMv2Processor
 from PIL import Image
 processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", revision="no_ocr")
 image = Image.open("name_of_your_document - can be a png file, pdf, etc.").convert("RGB")
 words = ["hello", "world"]
 boxes = [[1, 2, 3, 4], [5, 6, 7, 8]] # make sure to normalize your bounding boxes
 word_labels = [1, 2]
 encoding = processor(image, words, boxes=boxes, word_labels=word_labels, return_tensors="pt")
 print(encoding.keys())
 # dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'labels', 'image'])
 ```
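The word-to-token label conversion described for use case 3 can be sketched in plain Python. This is an illustration only, not the tokenizer's code: `align_labels` is a hypothetical helper, the wordpiece split is made up, and special tokens and padding are left out.

```python
IGNORE_INDEX = -100  # the ignore_index of PyTorch's CrossEntropyLoss


def align_labels(word_pieces, word_labels, only_label_first_subword=True):
    """Expand one label per word into one label per wordpiece."""
    labels = []
    for pieces, label in zip(word_pieces, word_labels):
        for position, _ in enumerate(pieces):
            if position == 0 or not only_label_first_subword:
                labels.append(label)  # labeled wordpiece
            else:
                labels.append(IGNORE_INDEX)  # ignored by the loss
    return labels


# Made-up split: the second word is broken into two wordpieces.
word_pieces = [["hello"], ["world", "##wide"]]
word_labels = [1, 2]

print(align_labels(word_pieces, word_labels))
# [1, 2, -100]
print(align_labels(word_pieces, word_labels, only_label_first_subword=False))
# [1, 2, 2]
```

Labeling only the first wordpiece means each word contributes to the loss once, regardless of how many pieces it is split into.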
 **Use case 4: visual question answering (inference), apply_ocr=True**
 For visual question answering tasks (such as DocVQA), you can provide a question to the processor. By default, the
 processor will apply OCR on the image, and create [CLS] question tokens [SEP] word tokens [SEP].
 ```python
 from transformers import LayoutLMv2Processor
 from PIL import Image
 processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
 image = Image.open("name_of_your_document - can be a png file, pdf, etc.").convert("RGB")
 question = "What's his name?"
 encoding = processor(image, question, return_tensors="pt")
 print(encoding.keys())
 # dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'image'])
 ```
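The `[CLS] question tokens [SEP] word tokens [SEP]` layout mentioned above can be pictured with a short sketch. The helper `build_qa_sequence` is hypothetical and the tokens are made up. The segment ids follow the usual BERT-style pair convention (0 for the first segment, 1 for the second), which is an assumption here rather than something stated in this page.

```python
def build_qa_sequence(question_tokens, word_tokens):
    """Lay out a question and OCR word tokens as a single token sequence."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + word_tokens + ["[SEP]"]
    # Assumed BERT-style segment ids: question segment is 0, document segment is 1.
    token_type_ids = [0] * (len(question_tokens) + 2) + [1] * (len(word_tokens) + 1)
    return tokens, token_type_ids


tokens, token_type_ids = build_qa_sequence(
    ["what", "'", "s", "his", "name", "?"],  # made-up question tokens
    ["hello", "world"],  # made-up OCR word tokens
)
print(tokens)
# ['[CLS]', 'what', "'", 's', 'his', 'name', '?', '[SEP]', 'hello', 'world', '[SEP]']
print(token_type_ids)
# [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
```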
 **Use case 5: visual question answering (inference), apply_ocr=False**
 For visual question answering tasks (such as DocVQA), you can provide a question to the processor. If you want to
 perform OCR yourself, you can provide your own words and (normalized) bounding boxes to the processor.
 ```python
 from transformers import LayoutLMv2Processor
 from PIL import Image
 processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", revision="no_ocr")
 image = Image.open("name_of_your_document - can be a png file, pdf, etc.").convert("RGB")
 question = "What's his name?"
 words = ["hello", "world"]
 boxes = [[1, 2, 3, 4], [5, 6, 7, 8]] # make sure to normalize your bounding boxes
 encoding = processor(image, question, words, boxes=boxes, return_tensors="pt")
 print(encoding.keys())
 # dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'image'])
 ```
 ## LayoutLMv2Config
 [[autodoc]] LayoutLMv2Config
 ## LayoutLMv2FeatureExtractor
 [[autodoc]] LayoutLMv2FeatureExtractor
    - __call__
 ## LayoutLMv2Tokenizer
 [[autodoc]] LayoutLMv2Tokenizer
    - __call__
    - save_vocabulary
 ## LayoutLMv2TokenizerFast
 [[autodoc]] LayoutLMv2TokenizerFast
    - __call__
 ## LayoutLMv2Processor
 [[autodoc]] LayoutLMv2Processor
    - __call__
 ## LayoutLMv2Model
 [[autodoc]] LayoutLMv2Model
    - forward
 ## LayoutLMv2ForSequenceClassification
 [[autodoc]] LayoutLMv2ForSequenceClassification
 ## LayoutLMv2ForTokenClassification
 [[autodoc]] LayoutLMv2ForTokenClassification
 ## LayoutLMv2ForQuestionAnswering
 [[autodoc]] LayoutLMv2ForQuestionAnswering
--- a/docs/source/model_doc/layoutlmv2.rst
+++ b/docs/source/model_doc/layoutlmv2.rst
@@ -1,313 +0,0 @@
 .. 
    Copyright 2021 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 LayoutLMV2
 -----------------------------------------------------------------------------------------------------------------------
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The LayoutLMV2 model was proposed in `LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding
 <https://arxiv.org/abs/2012.14740>`__ by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu,
 Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou. LayoutLMV2 improves `LayoutLM <layoutlm>`__ to obtain
 state-of-the-art results across several document image understanding benchmarks:
 - information extraction from scanned documents: the `FUNSD <https://guillaumejaume.github.io/FUNSD/>`__ dataset (a
  collection of 199 annotated forms comprising more than 30,000 words), the `CORD <https://github.com/clovaai/cord>`__
  dataset (a collection of 800 receipts for training, 100 for validation and 100 for testing), the `SROIE
  <https://rrc.cvc.uab.es/?ch=13>`__ dataset (a collection of 626 receipts for training and 347 receipts for testing)
  and the `Kleister-NDA <https://github.com/applicaai/kleister-nda>`__ dataset (a collection of non-disclosure
  agreements from the EDGAR database, including 254 documents for training, 83 documents for validation, and 203
  documents for testing).
 - document image classification: the `RVL-CDIP <https://www.cs.cmu.edu/~aharley/rvl-cdip/>`__ dataset (a collection of
  400,000 images belonging to one of 16 classes).
 - document visual question answering: the `DocVQA <https://arxiv.org/abs/2007.00398>`__ dataset (a collection of 50,000
  questions defined on 12,000+ document images).
 The abstract from the paper is the following:
 *Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to
 its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. In this
 paper, we present LayoutLMv2 by pre-training text, layout and image in a multi-modal framework, where new model
 architectures and pre-training tasks are leveraged. Specifically, LayoutLMv2 not only uses the existing masked
 visual-language modeling task but also the new text-image alignment and text-image matching tasks in the pre-training
 stage, where cross-modality interaction is better learned. Meanwhile, it also integrates a spatial-aware self-attention
 mechanism into the Transformer architecture, so that the model can fully understand the relative positional
 relationship among different text blocks. Experiment results show that LayoutLMv2 outperforms strong baselines and
 achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks,
 including FUNSD (0.7895 -> 0.8420), CORD (0.9493 -> 0.9601), SROIE (0.9524 -> 0.9781), Kleister-NDA (0.834 -> 0.852),
 RVL-CDIP (0.9443 -> 0.9564), and DocVQA (0.7295 -> 0.8672). The pre-trained LayoutLMv2 model is publicly available at
 this https URL.*
 Tips:
 - The main difference between LayoutLMv1 and LayoutLMv2 is that the latter incorporates visual embeddings during
  pre-training (while LayoutLMv1 only adds visual embeddings during fine-tuning).
 - LayoutLMv2 adds both a relative 1D attention bias as well as a spatial 2D attention bias to the attention scores in
  the self-attention layers. Details can be found on page 5 of the `paper <https://arxiv.org/abs/2012.14740>`__.
 - Demo notebooks on how to use the LayoutLMv2 model on RVL-CDIP, FUNSD, DocVQA, CORD can be found `here
  <https://github.com/NielsRogge/Transformers-Tutorials>`__.
 - LayoutLMv2 uses Facebook AI's `Detectron2 <https://github.com/facebookresearch/detectron2/>`__ package for its visual
  backbone. See `this link <https://detectron2.readthedocs.io/en/latest/tutorials/install.html>`__ for installation
  instructions.
 - In addition to :obj:`input_ids`, :meth:`~transformer.LayoutLMv2Model.forward` expects 2 additional inputs, namely
  :obj:`image` and :obj:`bbox`. The :obj:`image` input corresponds to the original document image in which the text
  tokens occur. The model expects each document image to be of size 224x224. This means that if you have a batch of
  document images, :obj:`image` should be a tensor of shape (batch_size, 3, 224, 224). This can be either a
  :obj:`torch.Tensor` or a :obj:`detectron2.structures.ImageList`. You don't need to normalize the channels, as this is
  done by the model. Note that the visual backbone expects BGR channels instead of RGB, as all models
  in Detectron2 are pre-trained using the BGR format. The :obj:`bbox` input contains the bounding boxes (i.e. 2D positions)
  of the input text tokens, in the same format as for :class:`~transformer.LayoutLMModel`. These can be obtained using an
  external OCR engine such as Google's `Tesseract <https://github.com/tesseract-ocr/tesseract>`__ (there's a `Python
  wrapper <https://pypi.org/project/pytesseract/>`__ available). Each bounding box should be in (x0, y0, x1, y1)
  format, where (x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1, y1)
  represents the position of the lower right corner. Note that one first needs to normalize the bounding boxes to be on
  a 0-1000 scale. To normalize, you can use the following function:
 .. code-block::
    def normalize_bbox(bbox, width, height):
         return [
             int(1000 * (bbox[0] / width)),
             int(1000 * (bbox[1] / height)),
             int(1000 * (bbox[2] / width)),
             int(1000 * (bbox[3] / height)),
         ]
 Here, :obj:`width` and :obj:`height` correspond to the width and height of the original document in which the token
 occurs (before resizing the image). These can be obtained using, for example, the Python Imaging Library (PIL), as
 follows:
 .. code-block::
    from PIL import Image
    image = Image.open("name_of_your_document - can be a png file, pdf, etc.")
    width, height = image.size
 However, this model includes a brand new :class:`~transformer.LayoutLMv2Processor` which can be used to directly
 prepare data for the model (including applying OCR under the hood). More information can be found in the "Usage"
 section below.
 - Internally, :class:`~transformer.LayoutLMv2Model` will send the :obj:`image` input through its visual backbone to
  obtain a lower-resolution feature map, whose shape is equal to the :obj:`image_feature_pool_shape` attribute of
  :class:`~transformer.LayoutLMv2Config`. This feature map is then flattened to obtain a sequence of image tokens. As
  the size of the feature map is 7x7 by default, one obtains 49 image tokens. These are then concatenated with the text
  tokens, and sent through the Transformer encoder. This means that the last hidden states of the model will have a
  length of 512 + 49 = 561, if you pad the text tokens up to the max length. More generally, the last hidden states
  will have a sequence length of :obj:`seq_length` + :obj:`config.image_feature_pool_shape[0]` *
  :obj:`config.image_feature_pool_shape[1]`.
 - When calling :meth:`~transformer.LayoutLMv2Model.from_pretrained`, a warning will be printed with a long list of
  parameter names that are not initialized. This is not a problem, as these parameters are batch normalization
  statistics, which are going to have values when fine-tuning on a custom dataset.
 - If you want to train the model in a distributed environment, make sure to call :meth:`synchronize_batch_norm` on the
  model in order to properly synchronize the batch normalization layers of the visual backbone.
 In addition, there's LayoutXLM, which is a multilingual version of LayoutLMv2. More information can be found on
 :doc:`LayoutXLM's documentation page <layoutxlm>`.
 Usage: LayoutLMv2Processor
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The easiest way to prepare data for the model is to use :class:`~transformer.LayoutLMv2Processor`, which internally
 combines a feature extractor (:class:`~transformer.LayoutLMv2FeatureExtractor`) and a tokenizer
 (:class:`~transformer.LayoutLMv2Tokenizer` or :class:`~transformer.LayoutLMv2TokenizerFast`). The feature extractor
 handles the image modality, while the tokenizer handles the text modality. A processor combines both, which is ideal
 for a multi-modal model like LayoutLMv2. Note that you can still use both separately, if you only want to handle one
 modality.
 .. code-block::
    from transformers import LayoutLMv2FeatureExtractor, LayoutLMv2TokenizerFast, LayoutLMv2Processor
    feature_extractor = LayoutLMv2FeatureExtractor() # apply_ocr is set to True by default
    tokenizer = LayoutLMv2TokenizerFast.from_pretrained("microsoft/layoutlmv2-base-uncased")
    processor = LayoutLMv2Processor(feature_extractor, tokenizer)
 In short, one can provide a document image (and possibly additional data) to :class:`~transformer.LayoutLMv2Processor`,
 and it will create the inputs expected by the model. Internally, the processor first uses
 :class:`~transformer.LayoutLMv2FeatureExtractor` to apply OCR on the image to get a list of words and normalized
 bounding boxes, as well as to resize the image to a given size in order to get the :obj:`image` input. The words and
 normalized bounding boxes are then provided to :class:`~transformer.LayoutLMv2Tokenizer` or
 :class:`~transformer.LayoutLMv2TokenizerFast`, which converts them to token-level :obj:`input_ids`,
 :obj:`attention_mask`, :obj:`token_type_ids`, :obj:`bbox`. Optionally, one can provide word labels to the processor,
 which are turned into token-level :obj:`labels`.
 :class:`~transformer.LayoutLMv2Processor` uses `PyTesseract <https://pypi.org/project/pytesseract/>`__, a Python
 wrapper around Google's Tesseract OCR engine, under the hood. Note that you can still use your own OCR engine of
 choice, and provide the words and normalized boxes yourself. This requires initializing
 :class:`~transformer.LayoutLMv2FeatureExtractor` with :obj:`apply_ocr` set to :obj:`False`.
 In total, there are 5 use cases that are supported by the processor. Below, we list them all. Note that each of these
 use cases works for both batched and non-batched inputs (we illustrate them for non-batched inputs).
 **Use case 1: document image classification (training, inference) + token classification (inference), apply_ocr =
 True**
 This is the simplest case, in which the processor (actually the feature extractor) will perform OCR on the image to get
 the words and normalized bounding boxes.
 .. code-block::
    from transformers import LayoutLMv2Processor
    from PIL import Image
    processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
    image = Image.open("name_of_your_document - can be a png file, pdf, etc.").convert("RGB")
    encoding = processor(image, return_tensors="pt") # you can also add all tokenizer parameters here such as padding, truncation
    print(encoding.keys())
    # dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'image'])
 **Use case 2: document image classification (training, inference) + token classification (inference), apply_ocr=False**
 In case one wants to do OCR themselves, one can initialize the feature extractor with :obj:`apply_ocr` set to
 :obj:`False`. In that case, one should provide the words and corresponding (normalized) bounding boxes themselves to
 the processor.
 .. code-block::
    from transformers import LayoutLMv2Processor
    from PIL import Image
    processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", revision="no_ocr")
    image = Image.open("name_of_your_document - can be a png file, pdf, etc.").convert("RGB")
    words = ["hello", "world"]
    boxes = [[1, 2, 3, 4], [5, 6, 7, 8]] # make sure to normalize your bounding boxes
    encoding = processor(image, words, boxes=boxes, return_tensors="pt")
    print(encoding.keys())
    # dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'image'])
 **Use case 3: token classification (training), apply_ocr=False**
 For token classification tasks (such as FUNSD, CORD, SROIE, Kleister-NDA), one can also provide the corresponding word
 labels in order to train a model. The processor will then convert these into token-level :obj:`labels`. By default, it
 will only label the first wordpiece of a word, and label the remaining wordpieces with -100, which is the
 :obj:`ignore_index` of PyTorch's CrossEntropyLoss. In case you want all wordpieces of a word to be labeled, you can
 initialize the tokenizer with :obj:`only_label_first_subword` set to :obj:`False`.
 .. code-block::
    from transformers import LayoutLMv2Processor
    from PIL import Image
    processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", revision="no_ocr")
    image = Image.open("name_of_your_document - can be a png file, pdf, etc.").convert("RGB")
    words = ["hello", "world"]
    boxes = [[1, 2, 3, 4], [5, 6, 7, 8]] # make sure to normalize your bounding boxes
    word_labels = [1, 2]
    encoding = processor(image, words, boxes=boxes, word_labels=word_labels, return_tensors="pt")
    print(encoding.keys())
    # dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'labels', 'image'])
 **Use case 4: visual question answering (inference), apply_ocr=True**
 For visual question answering tasks (such as DocVQA), you can provide a question to the processor. By default, the
 processor will apply OCR on the image, and create [CLS] question tokens [SEP] word tokens [SEP].
 .. code-block::
    from transformers import LayoutLMv2Processor
    from PIL import Image
    processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
    image = Image.open("name_of_your_document - can be a png file, pdf, etc.").convert("RGB")
    question = "What's his name?"
    encoding = processor(image, question, return_tensors="pt") 
    print(encoding.keys())
    # dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'image'])
 **Use case 5: visual question answering (inference), apply_ocr=False**
 For visual question answering tasks (such as DocVQA), you can provide a question to the processor. If you want to
 perform OCR yourself, you can provide your own words and (normalized) bounding boxes to the processor.
 .. code-block::
    from transformers import LayoutLMv2Processor
    from PIL import Image
    processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", revision="no_ocr")
    image = Image.open("name_of_your_document - can be a png file, pdf, etc.").convert("RGB")
    question = "What's his name?"
    words = ["hello", "world"]
    boxes = [[1, 2, 3, 4], [5, 6, 7, 8]] # make sure to normalize your bounding boxes
    encoding = processor(image, question, words, boxes=boxes, return_tensors="pt")  
    print(encoding.keys())
    # dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'image'])
 LayoutLMv2Config
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LayoutLMv2Config
    :members:
 LayoutLMv2FeatureExtractor
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LayoutLMv2FeatureExtractor
    :members: __call__
 LayoutLMv2Tokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LayoutLMv2Tokenizer
    :members: __call__, save_vocabulary
 LayoutLMv2TokenizerFast
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LayoutLMv2TokenizerFast
    :members: __call__
 LayoutLMv2Processor
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LayoutLMv2Processor
    :members: __call__
 LayoutLMv2Model
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LayoutLMv2Model
    :members: forward
 LayoutLMv2ForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LayoutLMv2ForSequenceClassification
    :members:
 LayoutLMv2ForTokenClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LayoutLMv2ForTokenClassification
    :members:
 LayoutLMv2ForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LayoutLMv2ForQuestionAnswering
    :members:
--- a/docs/source/model_doc/layoutxlm.mdx
+++ b/docs/source/model_doc/layoutxlm.mdx
@@ -0,0 +1,77 @@
 <!--Copyright 2021 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # LayoutXLM
 ## Overview
 LayoutXLM was proposed in [LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding](https://arxiv.org/abs/2104.08836) by Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha
 Zhang, Furu Wei. It's a multilingual extension of the [LayoutLMv2 model](https://arxiv.org/abs/2012.14740) trained
 on 53 languages.
 The abstract from the paper is the following:
 *Multimodal pre-training with text, layout, and image has achieved SOTA performance for visually-rich document
 understanding tasks recently, which demonstrates the great potential for joint learning across different modalities. In
 this paper, we present LayoutXLM, a multimodal pre-trained model for multilingual document understanding, which aims to
 bridge the language barriers for visually-rich document understanding. To accurately evaluate LayoutXLM, we also
 introduce a multilingual form understanding benchmark dataset named XFUN, which includes form understanding samples in
 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese), and key-value pairs are manually labeled
 for each language. Experiment results show that the LayoutXLM model has significantly outperformed the existing SOTA
 cross-lingual pre-trained models on the XFUN dataset.*
 One can directly plug in the weights of LayoutXLM into a LayoutLMv2 model, like so:
 ```python
 from transformers import LayoutLMv2Model
 model = LayoutLMv2Model.from_pretrained('microsoft/layoutxlm-base')
 ```
 Note that LayoutXLM has its own tokenizer, based on
 [`LayoutXLMTokenizer`]/[`LayoutXLMTokenizerFast`]. You can initialize it as
 follows:
 ```python
 from transformers import LayoutXLMTokenizer
 tokenizer = LayoutXLMTokenizer.from_pretrained('microsoft/layoutxlm-base')
 ```
 Similar to LayoutLMv2, you can use [`LayoutXLMProcessor`] (which internally applies
 [`LayoutLMv2FeatureExtractor`] and
 [`LayoutXLMTokenizer`]/[`LayoutXLMTokenizerFast`] in sequence) to prepare all
 data for the model.
 As LayoutXLM's architecture is equivalent to that of LayoutLMv2, one can refer to [LayoutLMv2's documentation page](layoutlmv2) for all tips, code examples and notebooks.
 This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/microsoft/unilm).
 ## LayoutXLMTokenizer
 [[autodoc]] LayoutXLMTokenizer
    - __call__
    - build_inputs_with_special_tokens
    - get_special_tokens_mask
    - create_token_type_ids_from_sequences
    - save_vocabulary
 ## LayoutXLMTokenizerFast
 [[autodoc]] LayoutXLMTokenizerFast
    - __call__
 ## LayoutXLMProcessor
 [[autodoc]] LayoutXLMProcessor
    - __call__
--- a/docs/source/model_doc/layoutxlm.rst
+++ b/docs/source/model_doc/layoutxlm.rst
@@ -1,84 +0,0 @@
 .. 
    Copyright 2021 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 LayoutXLM
 -----------------------------------------------------------------------------------------------------------------------
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 LayoutXLM was proposed in `LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding
 <https://arxiv.org/abs/2104.08836>`__ by Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha
 Zhang, Furu Wei. It's a multilingual extension of the `LayoutLMv2 model <https://arxiv.org/abs/2012.14740>`__ trained
 on 53 languages.
 The abstract from the paper is the following:
 *Multimodal pre-training with text, layout, and image has achieved SOTA performance for visually-rich document
 understanding tasks recently, which demonstrates the great potential for joint learning across different modalities. In
 this paper, we present LayoutXLM, a multimodal pre-trained model for multilingual document understanding, which aims to
 bridge the language barriers for visually-rich document understanding. To accurately evaluate LayoutXLM, we also
 introduce a multilingual form understanding benchmark dataset named XFUN, which includes form understanding samples in
 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese), and key-value pairs are manually labeled
 for each language. Experiment results show that the LayoutXLM model has significantly outperformed the existing SOTA
 cross-lingual pre-trained models on the XFUN dataset.*
 One can directly plug in the weights of LayoutXLM into a LayoutLMv2 model, like so:
 .. code-block::
    from transformers import LayoutLMv2Model
    model = LayoutLMv2Model.from_pretrained('microsoft/layoutxlm-base') 
 Note that LayoutXLM has its own tokenizer, based on
 :class:`~transformers.LayoutXLMTokenizer`/:class:`~transformers.LayoutXLMTokenizerFast`. You can initialize it as
 follows:
 .. code-block::
    from transformers import LayoutXLMTokenizer
    tokenizer = LayoutXLMTokenizer.from_pretrained('microsoft/layoutxlm-base') 
 Similar to LayoutLMv2, you can use :class:`~transformers.LayoutXLMProcessor` (which internally applies
 :class:`~transformers.LayoutLMv2FeatureExtractor` and
 :class:`~transformers.LayoutXLMTokenizer`/:class:`~transformers.LayoutXLMTokenizerFast` in sequence) to prepare all
 data for the model.
 As LayoutXLM's architecture is equivalent to that of LayoutLMv2, one can refer to :doc:`LayoutLMv2's documentation page
 <layoutlmv2>` for all tips, code examples and notebooks.
 This model was contributed by `nielsr <https://huggingface.co/nielsr>`__. The original code can be found `here
 <https://github.com/microsoft/unilm>`__.
 LayoutXLMTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LayoutXLMTokenizer
    :members: __call__, build_inputs_with_special_tokens, get_special_tokens_mask,
        create_token_type_ids_from_sequences, save_vocabulary
 LayoutXLMTokenizerFast
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LayoutXLMTokenizerFast
    :members: __call__
 LayoutXLMProcessor
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LayoutXLMProcessor
    :members: __call__
--- a/docs/source/model_doc/led.mdx
+++ b/docs/source/model_doc/led.mdx
@@ -0,0 +1,117 @@
 <!--Copyright 2020 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # LED
 ## Overview
 The LED model was proposed in [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz
 Beltagy, Matthew E. Peters, Arman Cohan.
 The abstract from the paper is the following:
 *Transformer-based models are unable to process long sequences due to their self-attention operation, which scales
 quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention
 mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or
 longer. Longformer's attention mechanism is a drop-in replacement for the standard self-attention and combines a local
 windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we
 evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In
 contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our
 pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on
 WikiHop and TriviaQA. We finally introduce the Longformer-Encoder-Decoder (LED), a Longformer variant for supporting
 long document generative sequence-to-sequence tasks, and demonstrate its effectiveness on the arXiv summarization
 dataset.*
 Tips:
 - [`LEDForConditionalGeneration`] is an extension of
  [`BartForConditionalGeneration`] exchanging the traditional *self-attention* layer with
  *Longformer*'s *chunked self-attention* layer. [`LEDTokenizer`] is an alias of
  [`BartTokenizer`].
 - LED works very well on long-range *sequence-to-sequence* tasks where the `input_ids` largely exceed a length of
  1024 tokens.
 - LED pads the `input_ids` to be a multiple of `config.attention_window` if required. Therefore, a small speed-up is
  gained when [`LEDTokenizer`] is used with the `pad_to_multiple_of` argument.
 - LED makes use of *global attention* by means of the `global_attention_mask` (see
  [`LongformerModel`]). For summarization, it is advised to put *global attention* only on the first
  `<s>` token. For question answering, it is advised to put *global attention* on all tokens of the question.
 - To fine-tune LED on inputs of all 16384 tokens, it is necessary to enable *gradient checkpointing* by executing
  `model.gradient_checkpointing_enable()`.
 - A notebook showing how to evaluate LED can be accessed [here](https://colab.research.google.com/drive/12INTTR6n64TzS4RrXZxMSXfrOd9Xzamo?usp=sharing).
 - A notebook showing how to fine-tune LED can be accessed [here](https://colab.research.google.com/drive/12LjJazBl7Gam0XBPy_y0CTOJZeZ34c2v?usp=sharing).
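The padding and global attention tips above can be illustrated with a minimal, library-free sketch. The helper names `pad_to_window_multiple` and `summarization_global_attention_mask`, the token ids, and the window size of 4 are assumptions chosen for illustration; they are not part of the `transformers` API, and real inputs would come from [`LEDTokenizer`].

```python
from typing import List, Tuple


def pad_to_window_multiple(
    input_ids: List[int], attention_window: int, pad_token_id: int
) -> Tuple[List[int], List[int]]:
    """Right-pad input_ids so that its length is a multiple of attention_window."""
    padding_len = (-len(input_ids)) % attention_window
    padded_ids = input_ids + [pad_token_id] * padding_len
    # 1 marks real tokens, 0 marks padding.
    attention_mask = [1] * len(input_ids) + [0] * padding_len
    return padded_ids, attention_mask


def summarization_global_attention_mask(seq_len: int) -> List[int]:
    """Global attention (1) on the first token only, local attention (0) elsewhere."""
    mask = [0] * seq_len
    if seq_len > 0:
        mask[0] = 1
    return mask


# Hypothetical token ids: 0 stands in for <s>, 2 for </s>, 1 for the padding token.
input_ids = [0, 100, 200, 300, 400, 2]
padded_ids, attention_mask = pad_to_window_multiple(input_ids, attention_window=4, pad_token_id=1)
global_attention_mask = summarization_global_attention_mask(len(padded_ids))

print(padded_ids)             # [0, 100, 200, 300, 400, 2, 1, 1]
print(attention_mask)         # [1, 1, 1, 1, 1, 1, 0, 0]
print(global_attention_mask)  # [1, 0, 0, 0, 0, 0, 0, 0]
```

Padding up front, as in this sketch, means the model does not have to pad the batch internally, which is where the small speed-up mentioned above would come from.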
 This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten).
 ## LEDConfig
 [[autodoc]] LEDConfig
 ## LEDTokenizer
 [[autodoc]] LEDTokenizer
    - build_inputs_with_special_tokens
    - get_special_tokens_mask
    - create_token_type_ids_from_sequences
    - save_vocabulary
 ## LEDTokenizerFast
 [[autodoc]] LEDTokenizerFast
 ## LED specific outputs
 [[autodoc]] models.led.modeling_led.LEDEncoderBaseModelOutput
 [[autodoc]] models.led.modeling_led.LEDSeq2SeqModelOutput
 [[autodoc]] models.led.modeling_led.LEDSeq2SeqLMOutput
 [[autodoc]] models.led.modeling_led.LEDSeq2SeqSequenceClassifierOutput
 [[autodoc]] models.led.modeling_led.LEDSeq2SeqQuestionAnsweringModelOutput
 [[autodoc]] models.led.modeling_tf_led.TFLEDEncoderBaseModelOutput
 [[autodoc]] models.led.modeling_tf_led.TFLEDSeq2SeqModelOutput
 [[autodoc]] models.led.modeling_tf_led.TFLEDSeq2SeqLMOutput
 ## LEDModel
 [[autodoc]] LEDModel
    - forward
 ## LEDForConditionalGeneration
 [[autodoc]] LEDForConditionalGeneration
    - forward
 ## LEDForSequenceClassification
 [[autodoc]] LEDForSequenceClassification
    - forward
 ## LEDForQuestionAnswering
 [[autodoc]] LEDForQuestionAnswering
    - forward
 ## TFLEDModel
 [[autodoc]] TFLEDModel
    - call
 ## TFLEDForConditionalGeneration
 [[autodoc]] TFLEDForConditionalGeneration
    - call
--- a/docs/source/model_doc/led.rst
+++ b/docs/source/model_doc/led.rst
@@ -1,150 +0,0 @@
 .. 
    Copyright 2020 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 LED
 -----------------------------------------------------------------------------------------------------------------------
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The LED model was proposed in `Longformer: The Long-Document Transformer <https://arxiv.org/abs/2004.05150>`__ by Iz
 Beltagy, Matthew E. Peters, Arman Cohan.
 The abstract from the paper is the following:
 *Transformer-based models are unable to process long sequences due to their self-attention operation, which scales
 quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention
 mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or
 longer. Longformer's attention mechanism is a drop-in replacement for the standard self-attention and combines a local
 windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we
 evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In
 contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our
 pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on
 WikiHop and TriviaQA. We finally introduce the Longformer-Encoder-Decoder (LED), a Longformer variant for supporting
 long document generative sequence-to-sequence tasks, and demonstrate its effectiveness on the arXiv summarization
 dataset.*
 Tips:
 - :class:`~transformers.LEDForConditionalGeneration` is an extension of
  :class:`~transformers.BartForConditionalGeneration` exchanging the traditional *self-attention* layer with
  *Longformer*'s *chunked self-attention* layer. :class:`~transformers.LEDTokenizer` is an alias of
  :class:`~transformers.BartTokenizer`.
 - LED works very well on long-range *sequence-to-sequence* tasks where the ``input_ids`` largely exceed a length of
  1024 tokens.
 - LED pads the ``input_ids`` to be a multiple of ``config.attention_window`` if required. Therefore, a small speed-up is
  gained when :class:`~transformers.LEDTokenizer` is used with the ``pad_to_multiple_of`` argument.
 - LED makes use of *global attention* by means of the ``global_attention_mask`` (see
  :class:`~transformers.LongformerModel`). For summarization, it is advised to put *global attention* only on the first
  ``<s>`` token. For question answering, it is advised to put *global attention* on all tokens of the question.
 - To fine-tune LED on inputs of all 16384 tokens, it is necessary to enable *gradient checkpointing* by executing
  ``model.gradient_checkpointing_enable()``.
 - A notebook showing how to evaluate LED can be accessed `here
  <https://colab.research.google.com/drive/12INTTR6n64TzS4RrXZxMSXfrOd9Xzamo?usp=sharing>`__.
 - A notebook showing how to fine-tune LED can be accessed `here
  <https://colab.research.google.com/drive/12LjJazBl7Gam0XBPy_y0CTOJZeZ34c2v?usp=sharing>`__.
 This model was contributed by `patrickvonplaten <https://huggingface.co/patrickvonplaten>`__.
 LEDConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LEDConfig
    :members:
 LEDTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LEDTokenizer
    :members: build_inputs_with_special_tokens, get_special_tokens_mask,
        create_token_type_ids_from_sequences, save_vocabulary
 LEDTokenizerFast
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LEDTokenizerFast
    :members:
 LED specific outputs
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.models.led.modeling_led.LEDEncoderBaseModelOutput
    :members: 
 .. autoclass:: transformers.models.led.modeling_led.LEDSeq2SeqModelOutput
    :members: 
 .. autoclass:: transformers.models.led.modeling_led.LEDSeq2SeqLMOutput
    :members: 
 .. autoclass:: transformers.models.led.modeling_led.LEDSeq2SeqSequenceClassifierOutput
    :members: 
 .. autoclass:: transformers.models.led.modeling_led.LEDSeq2SeqQuestionAnsweringModelOutput
    :members: 
 .. autoclass:: transformers.models.led.modeling_tf_led.TFLEDEncoderBaseModelOutput
    :members: 
 .. autoclass:: transformers.models.led.modeling_tf_led.TFLEDSeq2SeqModelOutput
    :members: 
 .. autoclass:: transformers.models.led.modeling_tf_led.TFLEDSeq2SeqLMOutput
    :members: 
 LEDModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LEDModel
    :members: forward
 LEDForConditionalGeneration
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LEDForConditionalGeneration
    :members: forward
 LEDForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LEDForSequenceClassification
    :members: forward
 LEDForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LEDForQuestionAnswering
    :members: forward
 TFLEDModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFLEDModel
    :members: call
 TFLEDForConditionalGeneration
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFLEDForConditionalGeneration
    :members: call
--- a/docs/source/model_doc/longformer.mdx
+++ b/docs/source/model_doc/longformer.mdx
@@ -0,0 +1,184 @@
 <!--Copyright 2020 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # Longformer
 **DISCLAIMER:** This model is still a work in progress. If you see something strange, file a [GitHub Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title).
 ## Overview
 The Longformer model was presented in [Longformer: The Long-Document Transformer](https://arxiv.org/pdf/2004.05150.pdf) by Iz Beltagy, Matthew E. Peters, Arman Cohan.
 The abstract from the paper is the following:
 *Transformer-based models are unable to process long sequences due to their self-attention operation, which scales
 quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention
 mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or
 longer. Longformer's attention mechanism is a drop-in replacement for the standard self-attention and combines a local
 windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we
 evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In
 contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our
 pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on
 WikiHop and TriviaQA.*
 Tips:
 - Since the Longformer is based on RoBERTa, it doesn't have `token_type_ids`. You don't need to indicate which
  token belongs to which segment. Just separate your segments with the separation token `tokenizer.sep_token` (or
  `</s>`).
 This model was contributed by [beltagy](https://huggingface.co/beltagy). The authors' code can be found [here](https://github.com/allenai/longformer).
 ## Longformer Self Attention
 Longformer self attention employs self attention on both a "local" context and a "global" context. Most tokens only
 attend "locally" to each other, meaning that each token attends to its \\(\frac{1}{2} w\\) previous tokens and
 \\(\frac{1}{2} w\\) succeeding tokens, with \\(w\\) being the window length as defined in
 `config.attention_window`. Note that `config.attention_window` can be of type `List` to define a
 different \\(w\\) for each layer. A selected few tokens attend "globally" to all other tokens, as is
 conventionally done for all tokens in `BertSelfAttention`.
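As a rough illustration of the local window described above, the following sketch lists the positions a single "locally" attending token can see. The function name `local_attention_positions` is hypothetical, and including the token's own position and clipping the window at the sequence boundaries are assumptions of this sketch rather than a description of the library's internals.

```python
from typing import List


def local_attention_positions(position: int, seq_len: int, window: int) -> List[int]:
    """Positions within window // 2 tokens on either side of `position`."""
    half = window // 2
    start = max(0, position - half)           # clip at the start of the sequence
    end = min(seq_len - 1, position + half)   # clip at the end of the sequence
    return list(range(start, end + 1))


print(local_attention_positions(position=5, seq_len=10, window=4))  # [3, 4, 5, 6, 7]
print(local_attention_positions(position=0, seq_len=10, window=4))  # [0, 1, 2]
```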
 Note that "locally" and "globally" attending tokens are projected by different query, key and value matrices. Also note
 that every "locally" attending token not only attends to tokens within its window \\(w\\), but also to all "globally"
 attending tokens so that global attention is *symmetric*.
 The user can define which tokens attend "locally" and which tokens attend "globally" by setting the tensor
 `global_attention_mask` at run-time appropriately. All Longformer models employ the following logic for
 `global_attention_mask`:
 - 0: the token attends "locally",
 - 1: the token attends "globally".
 For more information, please also refer to the [`~LongformerModel.forward`] method.
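The 0/1 convention above can be sketched without any library code. The helper `build_global_attention_mask` and the example positions are hypothetical; in practice the mask would be a tensor with the same shape as `input_ids`.

```python
from typing import Iterable, List


def build_global_attention_mask(seq_len: int, global_positions: Iterable[int]) -> List[int]:
    """Return a mask with 1 at globally attending positions and 0 elsewhere."""
    mask = [0] * seq_len  # every token attends "locally" by default
    for pos in global_positions:
        mask[pos] = 1     # this token attends "globally"
    return mask


# Assumed layout: the first four positions hold a question, the rest hold the context.
mask = build_global_attention_mask(seq_len=10, global_positions=range(4))
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
```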
 Using Longformer self attention, the memory and time complexity of the query-key matmul operation, which usually
 represents the memory and time bottleneck, can be reduced from \\(\mathcal{O}(n_s \times n_s)\\) to
 \\(\mathcal{O}(n_s \times w)\\), with \\(n_s\\) being the sequence length and \\(w\\) being the average window
 size. It is assumed that the number of "globally" attending tokens is insignificant as compared to the number of
 "locally" attending tokens.
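To make the reduction concrete, the sketch below counts query-key score entries for an assumed sequence length of 4096 and an assumed window size of 512. Both values and the function name `attention_score_entries` are illustrative, and global tokens are ignored, as in the paragraph above.

```python
from typing import Tuple


def attention_score_entries(seq_len: int, window: int) -> Tuple[int, int]:
    """Number of query-key scores for full attention and for windowed attention."""
    full = seq_len * seq_len    # O(n_s x n_s)
    windowed = seq_len * window  # O(n_s x w)
    return full, windowed


full, windowed = attention_score_entries(seq_len=4096, window=512)
print(full)              # 16777216
print(windowed)          # 2097152
print(full // windowed)  # 8
```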
 For more information, please refer to the official [paper](https://arxiv.org/pdf/2004.05150.pdf).
 ## Training
 [`LongformerForMaskedLM`] is trained the exact same way [`RobertaForMaskedLM`] is
 trained and should be used as follows:
 ```python
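 from transformers import LongformerForMaskedLM, LongformerTokenizer
 tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
 model = LongformerForMaskedLM.from_pretrained('allenai/longformer-base-4096')
 input_ids = tokenizer.encode('This is a sentence from <mask> training data', return_tensors='pt')  # <mask> is the RoBERTa-style mask token
 mlm_labels = tokenizer.encode('This is a sentence from the training data', return_tensors='pt')
 loss = model(input_ids, labels=mlm_labels)[0]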
 ```
 ## LongformerConfig
 [[autodoc]] LongformerConfig
 ## LongformerTokenizer
 [[autodoc]] LongformerTokenizer
 ## LongformerTokenizerFast
 [[autodoc]] LongformerTokenizerFast
 ## Longformer specific outputs
 [[autodoc]] models.longformer.modeling_longformer.LongformerBaseModelOutput
 [[autodoc]] models.longformer.modeling_longformer.LongformerBaseModelOutputWithPooling
 [[autodoc]] models.longformer.modeling_longformer.LongformerMaskedLMOutput
 [[autodoc]] models.longformer.modeling_longformer.LongformerQuestionAnsweringModelOutput
 [[autodoc]] models.longformer.modeling_longformer.LongformerSequenceClassifierOutput
 [[autodoc]] models.longformer.modeling_longformer.LongformerMultipleChoiceModelOutput
 [[autodoc]] models.longformer.modeling_longformer.LongformerTokenClassifierOutput
 [[autodoc]] models.longformer.modeling_tf_longformer.TFLongformerBaseModelOutput
 [[autodoc]] models.longformer.modeling_tf_longformer.TFLongformerBaseModelOutputWithPooling
 [[autodoc]] models.longformer.modeling_tf_longformer.TFLongformerMaskedLMOutput
 [[autodoc]] models.longformer.modeling_tf_longformer.TFLongformerQuestionAnsweringModelOutput
 [[autodoc]] models.longformer.modeling_tf_longformer.TFLongformerSequenceClassifierOutput
 [[autodoc]] models.longformer.modeling_tf_longformer.TFLongformerMultipleChoiceModelOutput
 [[autodoc]] models.longformer.modeling_tf_longformer.TFLongformerTokenClassifierOutput
 ## LongformerModel
 [[autodoc]] LongformerModel
    - forward
 ## LongformerForMaskedLM
 [[autodoc]] LongformerForMaskedLM
    - forward
 ## LongformerForSequenceClassification
 [[autodoc]] LongformerForSequenceClassification
    - forward
 ## LongformerForMultipleChoice
 [[autodoc]] LongformerForMultipleChoice
    - forward
 ## LongformerForTokenClassification
 [[autodoc]] LongformerForTokenClassification
    - forward
 ## LongformerForQuestionAnswering
 [[autodoc]] LongformerForQuestionAnswering
    - forward
 ## TFLongformerModel
 [[autodoc]] TFLongformerModel
    - call
 ## TFLongformerForMaskedLM
 [[autodoc]] TFLongformerForMaskedLM
    - call
 ## TFLongformerForQuestionAnswering
 [[autodoc]] TFLongformerForQuestionAnswering
    - call
 ## TFLongformerForSequenceClassification
 [[autodoc]] TFLongformerForSequenceClassification
    - call
 ## TFLongformerForTokenClassification
 [[autodoc]] TFLongformerForTokenClassification
    - call
 ## TFLongformerForMultipleChoice
 [[autodoc]] TFLongformerForMultipleChoice
    - call
--- a/docs/source/model_doc/longformer.rst
+++ b/docs/source/model_doc/longformer.rst
@@ -1,239 +0,0 @@
 .. 
    Copyright 2020 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 Longformer
 -----------------------------------------------------------------------------------------------------------------------
 **DISCLAIMER:** This model is still a work in progress. If you see something strange, file a `GitHub Issue
 <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__.
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The Longformer model was presented in `Longformer: The Long-Document Transformer
 <https://arxiv.org/pdf/2004.05150.pdf>`__ by Iz Beltagy, Matthew E. Peters, Arman Cohan.
 The abstract from the paper is the following:
 *Transformer-based models are unable to process long sequences due to their self-attention operation, which scales
 quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention
 mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or
 longer. Longformer's attention mechanism is a drop-in replacement for the standard self-attention and combines a local
 windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we
 evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In
 contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our
 pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on
 WikiHop and TriviaQA.*
 Tips:
 - Since the Longformer is based on RoBERTa, it doesn't have :obj:`token_type_ids`. You don't need to indicate which
  token belongs to which segment. Just separate your segments with the separation token :obj:`tokenizer.sep_token` (or
  :obj:`</s>`).
 This model was contributed by `beltagy <https://huggingface.co/beltagy>`__. The authors' code can be found `here
 <https://github.com/allenai/longformer>`__.
 Longformer Self Attention
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 Longformer self attention employs self attention on both a "local" context and a "global" context. Most tokens only
 attend "locally" to each other, meaning that each token attends to its :math:`\frac{1}{2} w` previous tokens and
 :math:`\frac{1}{2} w` succeeding tokens, with :math:`w` being the window length as defined in
 :obj:`config.attention_window`. Note that :obj:`config.attention_window` can be of type :obj:`List` to define a
 different :math:`w` for each layer. A selected few tokens attend "globally" to all other tokens, as is
 conventionally done for all tokens in :obj:`BertSelfAttention`.
 Note that "locally" and "globally" attending tokens are projected by different query, key and value matrices. Also note
 that every "locally" attending token not only attends to tokens within its window :math:`w`, but also to all "globally"
 attending tokens so that global attention is *symmetric*.
 The user can define which tokens attend "locally" and which tokens attend "globally" by setting the tensor
 :obj:`global_attention_mask` at run-time appropriately. All Longformer models employ the following logic for
 :obj:`global_attention_mask`:
 - 0: the token attends "locally",
 - 1: the token attends "globally".
 For more information, please also refer to the :meth:`~transformers.LongformerModel.forward` method.
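The local/global rule described above can be sketched in plain Python. This is only an illustration of which positions a token attends to, not the library's implementation; the helper names ``build_global_attention_mask`` and ``attended_positions`` are hypothetical, and in practice the 0/1 mask would be converted to a tensor and passed as ``global_attention_mask``.

```python
def build_global_attention_mask(seq_len, global_positions):
    """Return a 0/1 list: 1 marks a "globally" attending token, 0 a "locally" attending one."""
    global_set = set(global_positions)
    return [1 if i in global_set else 0 for i in range(seq_len)]


def attended_positions(i, window, global_attention_mask):
    """List the positions token i attends to under the local/global scheme."""
    seq_len = len(global_attention_mask)
    # A "globally" attending token attends to every position.
    if global_attention_mask[i] == 1:
        return list(range(seq_len))
    # A "locally" attending token sees w/2 tokens on each side (and itself) ...
    half = window // 2
    local = set(range(max(0, i - half), min(seq_len, i + half + 1)))
    # ... plus every "globally" attending token, which keeps global attention symmetric.
    global_tokens = {j for j, flag in enumerate(global_attention_mask) if flag == 1}
    return sorted(local | global_tokens)


mask = build_global_attention_mask(seq_len=8, global_positions=[0])
print(mask)
print(attended_positions(4, window=2, global_attention_mask=mask))
print(attended_positions(0, window=2, global_attention_mask=mask))
```

Marking only the first token as global is a common choice for classification-style tasks, but which tokens to mark is task-dependent and left to the user.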
 Using Longformer self attention, the memory and time complexity of the query-key matmul operation, which usually
 represents the memory and time bottleneck, can be reduced from :math:`\mathcal{O}(n_s \times n_s)` to
 :math:`\mathcal{O}(n_s \times w)`, with :math:`n_s` being the sequence length and :math:`w` being the average window
 size. It is assumed that the number of "globally" attending tokens is insignificant as compared to the number of
 "locally" attending tokens.
 For more information, please refer to the official `paper <https://arxiv.org/pdf/2004.05150.pdf>`__.
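As a rough worked example of the complexity reduction, the snippet below counts query-key scores for full attention versus windowed attention. The sequence length and window size are values chosen for illustration, and the count ignores "globally" attending tokens and sequence boundaries, in line with the assumption stated above.

```python
def attention_score_counts(seq_len, window):
    """Count query-key scores per head for full and for windowed self-attention."""
    full = seq_len * seq_len      # O(n_s x n_s): every token scores every token
    windowed = seq_len * window   # O(n_s x w): every token scores about w neighbours
    return full, windowed


full, windowed = attention_score_counts(seq_len=4096, window=512)
print(full, windowed, full / windowed)
```

With these illustrative numbers the windowed variant computes one eighth as many scores, and the gap grows linearly with the sequence length for a fixed window.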
 Training
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 :class:`~transformers.LongformerForMaskedLM` is trained the exact same way :class:`~transformers.RobertaForMaskedLM` is
 trained and should be used as follows:
 .. code-block::
    # Longformer uses the RoBERTa tokenizer, whose mask token is <mask>
    input_ids = tokenizer.encode('This is a sentence from <mask> training data', return_tensors='pt')
    mlm_labels = tokenizer.encode('This is a sentence from the training data', return_tensors='pt')
    loss = model(input_ids, labels=mlm_labels)[0]
 LongformerConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LongformerConfig
    :members:
 LongformerTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LongformerTokenizer
    :members: 
 LongformerTokenizerFast
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LongformerTokenizerFast
    :members: 
 Longformer specific outputs
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.models.longformer.modeling_longformer.LongformerBaseModelOutput
    :members: 
 .. autoclass:: transformers.models.longformer.modeling_longformer.LongformerBaseModelOutputWithPooling
    :members: 
 .. autoclass:: transformers.models.longformer.modeling_longformer.LongformerMaskedLMOutput
    :members: 
 .. autoclass:: transformers.models.longformer.modeling_longformer.LongformerQuestionAnsweringModelOutput
    :members: 
 .. autoclass:: transformers.models.longformer.modeling_longformer.LongformerSequenceClassifierOutput
    :members: 
 .. autoclass:: transformers.models.longformer.modeling_longformer.LongformerMultipleChoiceModelOutput
    :members: 
 .. autoclass:: transformers.models.longformer.modeling_longformer.LongformerTokenClassifierOutput
    :members: 
 .. autoclass:: transformers.models.longformer.modeling_tf_longformer.TFLongformerBaseModelOutput
    :members: 
 .. autoclass:: transformers.models.longformer.modeling_tf_longformer.TFLongformerBaseModelOutputWithPooling
    :members: 
 .. autoclass:: transformers.models.longformer.modeling_tf_longformer.TFLongformerMaskedLMOutput
    :members: 
 .. autoclass:: transformers.models.longformer.modeling_tf_longformer.TFLongformerQuestionAnsweringModelOutput
    :members: 
 .. autoclass:: transformers.models.longformer.modeling_tf_longformer.TFLongformerSequenceClassifierOutput
    :members: 
 .. autoclass:: transformers.models.longformer.modeling_tf_longformer.TFLongformerMultipleChoiceModelOutput
    :members: 
 .. autoclass:: transformers.models.longformer.modeling_tf_longformer.TFLongformerTokenClassifierOutput
    :members: 
 LongformerModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LongformerModel
    :members: forward
 LongformerForMaskedLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LongformerForMaskedLM
    :members: forward
 LongformerForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LongformerForSequenceClassification
    :members: forward
 LongformerForMultipleChoice
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LongformerForMultipleChoice
    :members: forward
 LongformerForTokenClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LongformerForTokenClassification
    :members: forward
 LongformerForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LongformerForQuestionAnswering
    :members: forward
 TFLongformerModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFLongformerModel
    :members: call
 TFLongformerForMaskedLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFLongformerForMaskedLM
    :members: call
 TFLongformerForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFLongformerForQuestionAnswering
    :members: call
 TFLongformerForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFLongformerForSequenceClassification
    :members: call
 TFLongformerForTokenClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFLongformerForTokenClassification
    :members: call
 TFLongformerForMultipleChoice
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFLongformerForMultipleChoice
    :members: call
--- a/docs/source/model_doc/luke.mdx
+++ b/docs/source/model_doc/luke.mdx
@@ -0,0 +1,151 @@
 <!--Copyright 2021 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # LUKE
 ## Overview
 The LUKE model was proposed in [LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention](https://arxiv.org/abs/2010.01057) by Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda and Yuji Matsumoto.
 It is based on RoBERTa and adds entity embeddings as well as an entity-aware self-attention mechanism, which helps
 improve performance on various downstream tasks involving reasoning about entities such as named entity recognition,
 extractive and cloze-style question answering, entity typing, and relation classification.
 The abstract from the paper is the following:
 *Entity representations are useful in natural language tasks involving entities. In this paper, we propose new
 pretrained contextualized representations of words and entities based on the bidirectional transformer. The proposed
 model treats words and entities in a given text as independent tokens, and outputs contextualized representations of
 them. Our model is trained using a new pretraining task based on the masked language model of BERT. The task involves
 predicting randomly masked words and entities in a large entity-annotated corpus retrieved from Wikipedia. We also
 propose an entity-aware self-attention mechanism that is an extension of the self-attention mechanism of the
 transformer, and considers the types of tokens (words or entities) when computing attention scores. The proposed model
 achieves impressive empirical performance on a wide range of entity-related tasks. In particular, it obtains
 state-of-the-art results on five well-known datasets: Open Entity (entity typing), TACRED (relation classification),
 CoNLL-2003 (named entity recognition), ReCoRD (cloze-style question answering), and SQuAD 1.1 (extractive question
 answering).*
 Tips:
 - This implementation is the same as [`RobertaModel`] with the addition of entity embeddings as well
  as an entity-aware self-attention mechanism, which improves performance on tasks involving reasoning about entities.
 - LUKE treats entities as input tokens; therefore, it takes `entity_ids`, `entity_attention_mask`,
  `entity_token_type_ids` and `entity_position_ids` as extra input. You can obtain those using
  [`LukeTokenizer`].
 - [`LukeTokenizer`] takes `entities` and `entity_spans` (character-based start and end
  positions of the entities in the input text) as extra input. `entities` typically consist of [MASK] entities or
  Wikipedia entities. A brief description of each kind of input follows:
  - *Inputting [MASK] entities to compute entity representations*: The [MASK] entity is used to mask entities to be
    predicted during pretraining. When LUKE receives the [MASK] entity, it tries to predict the original entity by
    gathering the information about the entity from the input text. Therefore, the [MASK] entity can be used to address
    downstream tasks requiring the information of entities in text such as entity typing, relation classification, and
    named entity recognition.
  - *Inputting Wikipedia entities to compute knowledge-enhanced token representations*: LUKE learns rich information
    (or knowledge) about Wikipedia entities during pretraining and stores the information in its entity embedding. By
    using Wikipedia entities as input tokens, LUKE outputs token representations enriched by the information stored in
    the embeddings of these entities. This is particularly effective for tasks requiring real-world knowledge, such as
    question answering.
 - There are three head models for the former use case:
  - [`LukeForEntityClassification`], for tasks to classify a single entity in an input text such as
    entity typing, e.g. the [Open Entity dataset](https://www.cs.utexas.edu/~eunsol/html_pages/open_entity.html).
    This model places a linear head on top of the output entity representation.
  - [`LukeForEntityPairClassification`], for tasks to classify the relationship between two entities
    such as relation classification, e.g. the [TACRED dataset](https://nlp.stanford.edu/projects/tacred/). This
    model places a linear head on top of the concatenated output representation of the pair of given entities.
  - [`LukeForEntitySpanClassification`], for tasks to classify the sequence of entity spans, such as
    named entity recognition (NER). This model places a linear head on top of the output entity representations. You
    can address NER using this model by inputting all possible entity spans in the text to the model.
  [`LukeTokenizer`] has a `task` argument, which enables you to easily create an input to these
  head models by specifying `task="entity_classification"`, `task="entity_pair_classification"`, or
  `task="entity_span_classification"`. Please refer to the example code of each head model.
  A demo notebook on how to fine-tune [`LukeForEntityPairClassification`] for relation
  classification can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LUKE).
  There are also 3 notebooks available, which showcase how you can reproduce the results as reported in the paper with
  the HuggingFace implementation of LUKE. They can be found [here](https://github.com/studio-ousia/luke/tree/master/notebooks).
 Example:
 ```python
 >>> from transformers import LukeTokenizer, LukeModel, LukeForEntityPairClassification
 >>> model = LukeModel.from_pretrained("studio-ousia/luke-base")
 >>> tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-base")
 # Example 1: Computing the contextualized entity representation corresponding to the entity mention "Beyoncé"
 >>> text = "Beyoncé lives in Los Angeles."
 >>> entity_spans = [(0, 7)]  # character-based entity span corresponding to "Beyoncé"
 >>> inputs = tokenizer(text, entity_spans=entity_spans, add_prefix_space=True, return_tensors="pt")
 >>> outputs = model(**inputs)
 >>> word_last_hidden_state = outputs.last_hidden_state
 >>> entity_last_hidden_state = outputs.entity_last_hidden_state
 # Example 2: Inputting Wikipedia entities to obtain enriched contextualized representations
 >>> entities = ["Beyoncé", "Los Angeles"]  # Wikipedia entity titles corresponding to the entity mentions "Beyoncé" and "Los Angeles"
 >>> entity_spans = [(0, 7), (17, 28)]  # character-based entity spans corresponding to "Beyoncé" and "Los Angeles"
 >>> inputs = tokenizer(text, entities=entities, entity_spans=entity_spans, add_prefix_space=True, return_tensors="pt")
 >>> outputs = model(**inputs)
 >>> word_last_hidden_state = outputs.last_hidden_state
 >>> entity_last_hidden_state = outputs.entity_last_hidden_state
 # Example 3: Classifying the relationship between two entities using LukeForEntityPairClassification head model
 >>> model = LukeForEntityPairClassification.from_pretrained("studio-ousia/luke-large-finetuned-tacred")
 >>> tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-large-finetuned-tacred")
 >>> entity_spans = [(0, 7), (17, 28)]  # character-based entity spans corresponding to "Beyoncé" and "Los Angeles"
 >>> inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
 >>> outputs = model(**inputs)
 >>> logits = outputs.logits
 >>> predicted_class_idx = int(logits[0].argmax())
 >>> print("Predicted class:", model.config.id2label[predicted_class_idx])
 ```
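The `entity_spans` in the example above are character offsets into the input text. A small helper can derive them from the mention strings instead of counting by hand. This is a minimal sketch: `find_entity_spans` is a hypothetical helper, not part of the library, and it assumes the mentions are listed in the order they appear in the text.

```python
def find_entity_spans(text, mentions):
    """Return character-based (start, end) spans for mentions, searched left to right."""
    spans = []
    search_from = 0
    for mention in mentions:
        start = text.find(mention, search_from)
        if start == -1:
            raise ValueError(f"mention {mention!r} not found in text")
        end = start + len(mention)  # end offset is exclusive
        spans.append((start, end))
        search_from = end           # later mentions are looked up after this one
    return spans


text = "Beyoncé lives in Los Angeles."
entity_spans = find_entity_spans(text, ["Beyoncé", "Los Angeles"])
print(entity_spans)
```

The resulting list can be passed as `entity_spans` to the tokenizer in the same way as the hand-written spans.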
 This model was contributed by [ikuyamada](https://huggingface.co/ikuyamada) and [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/studio-ousia/luke).
 ## LukeConfig
 [[autodoc]] LukeConfig
 ## LukeTokenizer
 [[autodoc]] LukeTokenizer
    - __call__
    - save_vocabulary
 ## LukeModel
 [[autodoc]] LukeModel
    - forward
 ## LukeForMaskedLM
 [[autodoc]] LukeForMaskedLM
    - forward
 ## LukeForEntityClassification
 [[autodoc]] LukeForEntityClassification
    - forward
 ## LukeForEntityPairClassification
 [[autodoc]] LukeForEntityPairClassification
    - forward
 ## LukeForEntitySpanClassification
 [[autodoc]] LukeForEntitySpanClassification
    - forward
--- a/docs/source/model_doc/luke.rst
+++ b/docs/source/model_doc/luke.rst
@@ -1,168 +0,0 @@
 ..
    Copyright 2021 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 LUKE
 -----------------------------------------------------------------------------------------------------------------------
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The LUKE model was proposed in `LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention
 <https://arxiv.org/abs/2010.01057>`_ by Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda and Yuji Matsumoto.
 It is based on RoBERTa and adds entity embeddings as well as an entity-aware self-attention mechanism, which helps
 improve performance on various downstream tasks involving reasoning about entities such as named entity recognition,
 extractive and cloze-style question answering, entity typing, and relation classification.
 The abstract from the paper is the following:
 *Entity representations are useful in natural language tasks involving entities. In this paper, we propose new
 pretrained contextualized representations of words and entities based on the bidirectional transformer. The proposed
 model treats words and entities in a given text as independent tokens, and outputs contextualized representations of
 them. Our model is trained using a new pretraining task based on the masked language model of BERT. The task involves
 predicting randomly masked words and entities in a large entity-annotated corpus retrieved from Wikipedia. We also
 propose an entity-aware self-attention mechanism that is an extension of the self-attention mechanism of the
 transformer, and considers the types of tokens (words or entities) when computing attention scores. The proposed model
 achieves impressive empirical performance on a wide range of entity-related tasks. In particular, it obtains
 state-of-the-art results on five well-known datasets: Open Entity (entity typing), TACRED (relation classification),
 CoNLL-2003 (named entity recognition), ReCoRD (cloze-style question answering), and SQuAD 1.1 (extractive question
 answering).*
 Tips:
 - This implementation is the same as :class:`~transformers.RobertaModel` with the addition of entity embeddings as well
  as an entity-aware self-attention mechanism, which improves performance on tasks involving reasoning about entities.
 - LUKE treats entities as input tokens; therefore, it takes :obj:`entity_ids`, :obj:`entity_attention_mask`,
  :obj:`entity_token_type_ids` and :obj:`entity_position_ids` as extra input. You can obtain those using
  :class:`~transformers.LukeTokenizer`.
 - :class:`~transformers.LukeTokenizer` takes :obj:`entities` and :obj:`entity_spans` (character-based start and end
  positions of the entities in the input text) as extra input. :obj:`entities` typically consist of [MASK] entities or
  Wikipedia entities. A brief description of each kind of input follows:
  - *Inputting [MASK] entities to compute entity representations*: The [MASK] entity is used to mask entities to be
    predicted during pretraining. When LUKE receives the [MASK] entity, it tries to predict the original entity by
    gathering the information about the entity from the input text. Therefore, the [MASK] entity can be used to address
    downstream tasks requiring the information of entities in text such as entity typing, relation classification, and
    named entity recognition.
  - *Inputting Wikipedia entities to compute knowledge-enhanced token representations*: LUKE learns rich information
    (or knowledge) about Wikipedia entities during pretraining and stores the information in its entity embedding. By
    using Wikipedia entities as input tokens, LUKE outputs token representations enriched by the information stored in
    the embeddings of these entities. This is particularly effective for tasks requiring real-world knowledge, such as
    question answering.
 - There are three head models for the former use case:
  - :class:`~transformers.LukeForEntityClassification`, for tasks to classify a single entity in an input text such as
    entity typing, e.g. the `Open Entity dataset <https://www.cs.utexas.edu/~eunsol/html_pages/open_entity.html>`__.
    This model places a linear head on top of the output entity representation.
  - :class:`~transformers.LukeForEntityPairClassification`, for tasks to classify the relationship between two entities
    such as relation classification, e.g. the `TACRED dataset <https://nlp.stanford.edu/projects/tacred/>`__. This
    model places a linear head on top of the concatenated output representation of the pair of given entities.
  - :class:`~transformers.LukeForEntitySpanClassification`, for tasks to classify the sequence of entity spans, such as
    named entity recognition (NER). This model places a linear head on top of the output entity representations. You
    can address NER using this model by inputting all possible entity spans in the text to the model.
  :class:`~transformers.LukeTokenizer` has a ``task`` argument, which enables you to easily create an input to these
  head models by specifying ``task="entity_classification"``, ``task="entity_pair_classification"``, or
  ``task="entity_span_classification"``. Please refer to the example code of each head model.
  A demo notebook on how to fine-tune :class:`~transformers.LukeForEntityPairClassification` for relation
  classification can be found `here <https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LUKE>`__.
  There are also 3 notebooks available, which showcase how you can reproduce the results as reported in the paper with
  the HuggingFace implementation of LUKE. They can be found `here
  <https://github.com/studio-ousia/luke/tree/master/notebooks>`__.
 Example:
 .. code-block::
    >>> from transformers import LukeTokenizer, LukeModel, LukeForEntityPairClassification
    >>> model = LukeModel.from_pretrained("studio-ousia/luke-base")
    >>> tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-base")
    # Example 1: Computing the contextualized entity representation corresponding to the entity mention "Beyoncé"
    >>> text = "Beyoncé lives in Los Angeles."
    >>> entity_spans = [(0, 7)]  # character-based entity span corresponding to "Beyoncé"
    >>> inputs = tokenizer(text, entity_spans=entity_spans, add_prefix_space=True, return_tensors="pt")
    >>> outputs = model(**inputs)
    >>> word_last_hidden_state = outputs.last_hidden_state
    >>> entity_last_hidden_state = outputs.entity_last_hidden_state
    # Example 2: Inputting Wikipedia entities to obtain enriched contextualized representations
    >>> entities = ["Beyoncé", "Los Angeles"]  # Wikipedia entity titles corresponding to the entity mentions "Beyoncé" and "Los Angeles"
    >>> entity_spans = [(0, 7), (17, 28)]  # character-based entity spans corresponding to "Beyoncé" and "Los Angeles"
    >>> inputs = tokenizer(text, entities=entities, entity_spans=entity_spans, add_prefix_space=True, return_tensors="pt")
    >>> outputs = model(**inputs)
    >>> word_last_hidden_state = outputs.last_hidden_state
    >>> entity_last_hidden_state = outputs.entity_last_hidden_state
    # Example 3: Classifying the relationship between two entities using LukeForEntityPairClassification head model
    >>> model = LukeForEntityPairClassification.from_pretrained("studio-ousia/luke-large-finetuned-tacred")
    >>> tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-large-finetuned-tacred")
    >>> entity_spans = [(0, 7), (17, 28)]  # character-based entity spans corresponding to "Beyoncé" and "Los Angeles"
    >>> inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
    >>> outputs = model(**inputs)
    >>> logits = outputs.logits
    >>> predicted_class_idx = int(logits[0].argmax())
    >>> print("Predicted class:", model.config.id2label[predicted_class_idx])
 This model was contributed by `ikuyamada <https://huggingface.co/ikuyamada>`__ and `nielsr
 <https://huggingface.co/nielsr>`__. The original code can be found `here <https://github.com/studio-ousia/luke>`__.
 LukeConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LukeConfig
    :members:
 LukeTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LukeTokenizer
    :members: __call__, save_vocabulary
 LukeModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LukeModel
    :members: forward
 LukeForMaskedLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LukeForMaskedLM
    :members: forward
 LukeForEntityClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LukeForEntityClassification
    :members: forward
 LukeForEntityPairClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LukeForEntityPairClassification
    :members: forward
 LukeForEntitySpanClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LukeForEntitySpanClassification
    :members: forward
--- a/docs/source/model_doc/lxmert.mdx
+++ b/docs/source/model_doc/lxmert.mdx
@@ -0,0 +1,102 @@
 <!--Copyright 2020 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # LXMERT
 ## Overview
 The LXMERT model was proposed in [LXMERT: Learning Cross-Modality Encoder Representations from Transformers](https://arxiv.org/abs/1908.07490) by Hao Tan & Mohit Bansal. It is a series of bidirectional transformer encoders
 (one for the vision modality, one for the language modality, and then one to fuse both modalities) pretrained using a
 combination of masked language modeling, visual-language text alignment, ROI-feature regression, masked
 visual-attribute modeling, masked visual-object modeling, and visual-question answering objectives. The pretraining
 data consists of multiple multi-modal datasets: MSCOCO, Visual-Genome + Visual-Genome Question Answering, VQA 2.0, and GQA.
 The abstract from the paper is the following:
 *Vision-and-language reasoning requires an understanding of visual concepts, language semantics, and, most importantly,
 the alignment and relationships between these two modalities. We thus propose the LXMERT (Learning Cross-Modality
 Encoder Representations from Transformers) framework to learn these vision-and-language connections. In LXMERT, we
 build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language
 encoder, and a cross-modality encoder. Next, to endow our model with the capability of connecting vision and language
 semantics, we pre-train the model with large amounts of image-and-sentence pairs, via five diverse representative
 pretraining tasks: masked language modeling, masked object prediction (feature regression and label classification),
 cross-modality matching, and image question answering. These tasks help in learning both intra-modality and
 cross-modality relationships. After fine-tuning from our pretrained parameters, our model achieves the state-of-the-art
 results on two visual question answering datasets (i.e., VQA and GQA). We also show the generalizability of our
 pretrained cross-modality model by adapting it to a challenging visual-reasoning task, NLVR, and improve the previous
 best result by 22% absolute (54% to 76%). Lastly, we demonstrate detailed ablation studies to prove that both our novel
 model components and pretraining strategies significantly contribute to our strong results; and also present several
 attention visualizations for the different encoders*
 Tips:
 - Bounding boxes are not required in the visual feature embeddings; any kind of visual-spatial features
  will work.
 - Both the language hidden states and the visual hidden states that LXMERT outputs are passed through the
  cross-modality layer, so they contain information from both modalities. To access a modality that only attends to
  itself, select the vision/language hidden states from the first input in the tuple.
 - The bidirectional cross-modality encoder attention only returns attention values when the language modality is used
  as the input and the vision modality is used as the context vector. Further, while the cross-modality encoder
  contains self-attention for each respective modality and cross-attention, only the cross attention is returned and
  both self attention outputs are disregarded.
 This model was contributed by [eltoto1219](https://huggingface.co/eltoto1219). The original code can be found [here](https://github.com/airsplay/lxmert).
 ## LxmertConfig
 [[autodoc]] LxmertConfig
 ## LxmertTokenizer
 [[autodoc]] LxmertTokenizer
 ## LxmertTokenizerFast
 [[autodoc]] LxmertTokenizerFast
 ## Lxmert specific outputs
 [[autodoc]] models.lxmert.modeling_lxmert.LxmertModelOutput
 [[autodoc]] models.lxmert.modeling_lxmert.LxmertForPreTrainingOutput
 [[autodoc]] models.lxmert.modeling_lxmert.LxmertForQuestionAnsweringOutput
 [[autodoc]] models.lxmert.modeling_tf_lxmert.TFLxmertModelOutput
 [[autodoc]] models.lxmert.modeling_tf_lxmert.TFLxmertForPreTrainingOutput
 ## LxmertModel
 [[autodoc]] LxmertModel
    - forward
 ## LxmertForPreTraining
 [[autodoc]] LxmertForPreTraining
    - forward
 ## LxmertForQuestionAnswering
 [[autodoc]] LxmertForQuestionAnswering
    - forward
 ## TFLxmertModel
 [[autodoc]] TFLxmertModel
    - call
 ## TFLxmertForPreTraining
 [[autodoc]] TFLxmertForPreTraining
    - call
--- a/docs/source/model_doc/lxmert.rst
+++ b/docs/source/model_doc/lxmert.rst
@@ -1,128 +0,0 @@
 .. 
    Copyright 2020 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 LXMERT
 -----------------------------------------------------------------------------------------------------------------------
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The LXMERT model was proposed in `LXMERT: Learning Cross-Modality Encoder Representations from Transformers
 <https://arxiv.org/abs/1908.07490>`__ by Hao Tan & Mohit Bansal. It is a series of bidirectional transformer encoders
 (one for the vision modality, one for the language modality, and then one to fuse both modalities) pretrained using a
 combination of masked language modeling, visual-language text alignment, ROI-feature regression, masked
 visual-attribute modeling, masked visual-object modeling, and visual-question answering objectives. The pretraining
 consists of multiple multi-modal datasets: MSCOCO, Visual-Genome + Visual-Genome Question Answering, VQA 2.0, and GQA.
 The abstract from the paper is the following:
 *Vision-and-language reasoning requires an understanding of visual concepts, language semantics, and, most importantly,
 the alignment and relationships between these two modalities. We thus propose the LXMERT (Learning Cross-Modality
 Encoder Representations from Transformers) framework to learn these vision-and-language connections. In LXMERT, we
 build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language
 encoder, and a cross-modality encoder. Next, to endow our model with the capability of connecting vision and language
 semantics, we pre-train the model with large amounts of image-and-sentence pairs, via five diverse representative
 pretraining tasks: masked language modeling, masked object prediction (feature regression and label classification),
 cross-modality matching, and image question answering. These tasks help in learning both intra-modality and
 cross-modality relationships. After fine-tuning from our pretrained parameters, our model achieves the state-of-the-art
 results on two visual question answering datasets (i.e., VQA and GQA). We also show the generalizability of our
 pretrained cross-modality model by adapting it to a challenging visual-reasoning task, NLVR, and improve the previous
 best result by 22% absolute (54% to 76%). Lastly, we demonstrate detailed ablation studies to prove that both our novel
 model components and pretraining strategies significantly contribute to our strong results; and also present several
 attention visualizations for the different encoders*
 Tips:
 - Bounding boxes are not required for the visual feature embeddings; any kind of visual-spatial features
  will work.
 - Both the language hidden states and the visual hidden states that LXMERT outputs are passed through the
  cross-modality layer, so they contain information from both modalities. To access a modality that only attends to
  itself, select the vision/language hidden states from the first input in the tuple.
 - The bidirectional cross-modality encoder attention only returns attention values when the language modality is used
  as the input and the vision modality is used as the context vector. Further, while the cross-modality encoder
  contains self-attention for each respective modality and cross-attention, only the cross attention is returned and
  both self attention outputs are disregarded.
 This model was contributed by `eltoto1219 <https://huggingface.co/eltoto1219>`__. The original code can be found `here
 <https://github.com/airsplay/lxmert>`__.
 LxmertConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LxmertConfig
    :members:
 LxmertTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LxmertTokenizer
    :members:
 LxmertTokenizerFast
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LxmertTokenizerFast
    :members:
 Lxmert specific outputs
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.models.lxmert.modeling_lxmert.LxmertModelOutput
    :members:
 .. autoclass:: transformers.models.lxmert.modeling_lxmert.LxmertForPreTrainingOutput
    :members:
 .. autoclass:: transformers.models.lxmert.modeling_lxmert.LxmertForQuestionAnsweringOutput
    :members:
 .. autoclass:: transformers.models.lxmert.modeling_tf_lxmert.TFLxmertModelOutput
    :members:
 .. autoclass:: transformers.models.lxmert.modeling_tf_lxmert.TFLxmertForPreTrainingOutput
    :members:
 LxmertModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LxmertModel
    :members: forward
 LxmertForPreTraining
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LxmertForPreTraining
    :members: forward
 LxmertForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.LxmertForQuestionAnswering
    :members: forward
 TFLxmertModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFLxmertModel
    :members: call
 TFLxmertForPreTraining
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFLxmertForPreTraining
    :members: call
--- a/docs/source/model_doc/m2m_100.mdx
+++ b/docs/source/model_doc/m2m_100.mdx
@@ -0,0 +1,116 @@
 <!--Copyright 2020 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # M2M100
 ## Overview
 The M2M100 model was proposed in [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky,
 Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy
 Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.
 The abstract from the paper is the following:
 *Existing work in translation demonstrated the potential of massively multilingual machine translation by training a
 single model able to translate between any pair of languages. However, much of this work is English-Centric by training
 only on data which was translated from or to English. While this is supported by large sources of training data, it
 does not reflect translation needs worldwide. In this work, we create a true Many-to-Many multilingual translation
 model that can translate directly between any pair of 100 languages. We build and open source a training dataset that
 covers thousands of language directions with supervised data, created through large-scale mining. Then, we explore how
 to effectively increase model capacity through a combination of dense scaling and language-specific sparse parameters
 to create high quality models. Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly
 translating between non-English directions while performing competitively to the best single systems of WMT. We
 open-source our scripts so that others may reproduce the data, evaluation, and final M2M-100 model.*
 This model was contributed by [valhalla](https://huggingface.co/valhalla).
 ### Training and Generation
 M2M100 is a multilingual encoder-decoder (seq-to-seq) model primarily intended for translation tasks. Because the model
 is multilingual, it expects sequences in a specific format: a special language id token is used as a prefix in both the
 source and target text. The text format is `[lang_code] X [eos]`, where `lang_code` is the source language
 id for source text and the target language id for target text, and `X` is the source or target text.
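The sequence layout described above can be illustrated with a small, library-free sketch. The helper `format_m2m100_sequence` is hypothetical and not part of `transformers`; the `__en__`-style spelling of the language token and the `</s>` end-of-sequence token are assumptions made for illustration. In practice the tokenizer adds these special tokens for you.

```python
from typing import List

EOS_TOKEN = "</s>"  # assumed spelling of the end-of-sequence token


def format_m2m100_sequence(tokens: List[str], lang_code: str) -> List[str]:
    """Lay out tokens as `[lang_code] X [eos]` (hypothetical helper)."""
    lang_token = f"__{lang_code}__"  # assumed spelling of the language id token
    return [lang_token] + list(tokens) + [EOS_TOKEN]


# Source text is prefixed with the source language id,
# target text with the target language id.
source = format_m2m100_sequence(["Life", "is", "good"], "en")
target = format_m2m100_sequence(["La", "vie", "est", "belle"], "fr")
print(source)
print(target)
```

Both sequences share the same layout; only the language id prefix differs between the source and target side.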
 The [`M2M100Tokenizer`] depends on `sentencepiece` so be sure to install it before running the
 examples. To install `sentencepiece` run `pip install sentencepiece`.
 - Supervised Training
 ```python
 from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
 model = M2M100ForConditionalGeneration.from_pretrained('facebook/m2m100_418M')
 tokenizer = M2M100Tokenizer.from_pretrained('facebook/m2m100_418M', src_lang="en", tgt_lang="fr")
 src_text = "Life is like a box of chocolates."
 tgt_text = "La vie est comme une boîte de chocolat."
 model_inputs = tokenizer(src_text, return_tensors="pt")
 with tokenizer.as_target_tokenizer():
    labels = tokenizer(tgt_text, return_tensors="pt").input_ids
 loss = model(**model_inputs, labels=labels).loss  # forward pass; the loss is an attribute of the returned output
 ```
 - Generation
  M2M100 uses the `eos_token_id` as the `decoder_start_token_id` for generation with the target language id
  being forced as the first generated token. To force the target language id as the first generated token, pass the
  *forced_bos_token_id* parameter to the *generate* method. The following example shows how to translate from
  Hindi to French and from Chinese to English using the *facebook/m2m100_418M* checkpoint.
 ```python
 >>> from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
 >>> hi_text = "जीवन एक चॉकलेट बॉक्स की तरह है।"
 >>> chinese_text = "生活就像一盒巧克力。"
 >>> model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
 >>> tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
 >>> # translate Hindi to French
 >>> tokenizer.src_lang = "hi"
 >>> encoded_hi = tokenizer(hi_text, return_tensors="pt")
 >>> generated_tokens = model.generate(**encoded_hi, forced_bos_token_id=tokenizer.get_lang_id("fr"))
 >>> tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
 ['La vie est comme une boîte de chocolat.']
 >>> # translate Chinese to English
 >>> tokenizer.src_lang = "zh"
 >>> encoded_zh = tokenizer(chinese_text, return_tensors="pt")
 >>> generated_tokens = model.generate(**encoded_zh, forced_bos_token_id=tokenizer.get_lang_id("en"))
 >>> tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
 ['Life is like a box of chocolate.']
 ```
 ## M2M100Config
 [[autodoc]] M2M100Config
 ## M2M100Tokenizer
 [[autodoc]] M2M100Tokenizer
    - build_inputs_with_special_tokens
    - get_special_tokens_mask
    - create_token_type_ids_from_sequences
    - save_vocabulary
 ## M2M100Model
 [[autodoc]] M2M100Model
    - forward
 ## M2M100ForConditionalGeneration
 [[autodoc]] M2M100ForConditionalGeneration
    - forward
--- a/docs/source/model_doc/m2m_100.rst
+++ b/docs/source/model_doc/m2m_100.rst
@@ -1,130 +0,0 @@
 .. 
    Copyright 2020 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 M2M100
 -----------------------------------------------------------------------------------------------------------------------
 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The M2M100 model was proposed in `Beyond English-Centric Multilingual Machine Translation
 <https://arxiv.org/abs/2010.11125>`__ by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky,
 Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy
 Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.
 The abstract from the paper is the following:
 *Existing work in translation demonstrated the potential of massively multilingual machine translation by training a
 single model able to translate between any pair of languages. However, much of this work is English-Centric by training
 only on data which was translated from or to English. While this is supported by large sources of training data, it
 does not reflect translation needs worldwide. In this work, we create a true Many-to-Many multilingual translation
 model that can translate directly between any pair of 100 languages. We build and open source a training dataset that
 covers thousands of language directions with supervised data, created through large-scale mining. Then, we explore how
 to effectively increase model capacity through a combination of dense scaling and language-specific sparse parameters
 to create high quality models. Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly
 translating between non-English directions while performing competitively to the best single systems of WMT. We
 open-source our scripts so that others may reproduce the data, evaluation, and final M2M-100 model.*
 This model was contributed by `valhalla <https://huggingface.co/valhalla>`__.
 Training and Generation
 _______________________________________________________________________________________________________________________
 M2M100 is a multilingual encoder-decoder (seq-to-seq) model primarily intended for translation tasks. As the model is
 multilingual it expects the sequences in a certain format: A special language id token is used as prefix in both the
 source and target text. The source text format is :obj:`[lang_code] X [eos]`, where :obj:`lang_code` is source language
 id for source text and target language id for target text, with :obj:`X` being the source or target text.
 The :class:`~transformers.M2M100Tokenizer` depends on :obj:`sentencepiece` so be sure to install it before running the
 examples. To install :obj:`sentencepiece` run ``pip install sentencepiece``.
 - Supervised Training
 .. code-block::
    from transformers import M2M100Config, M2M100ForConditionalGeneration, M2M100Tokenizer
    model = M2M100ForConditionalGeneration.from_pretrained('facebook/m2m100_418M')
    tokenizer = M2M100Tokenizer.from_pretrained('facebook/m2m100_418M', src_lang="en", tgt_lang="fr")
    src_text = "Life is like a box of chocolates."
    tgt_text = "La vie est comme une boîte de chocolat."
    model_inputs = tokenizer(src_text, return_tensors="pt")
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(tgt_text, return_tensors="pt").input_ids
    loss = model(**model_inputs, labels=labels).loss  # forward pass; the loss is an attribute of the returned output
 - Generation
    M2M100 uses the :obj:`eos_token_id` as the :obj:`decoder_start_token_id` for generation with the target language id
    being forced as the first generated token. To force the target language id as the first generated token, pass the
    `forced_bos_token_id` parameter to the `generate` method. The following example shows how to translate from
    Hindi to French and from Chinese to English using the `facebook/m2m100_418M` checkpoint.
 .. code-block::
    >>> from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
    >>> hi_text = "जीवन एक चॉकलेट बॉक्स की तरह है।"
    >>> chinese_text = "生活就像一盒巧克力。"
    >>> model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
    >>> tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
    >>> # translate Hindi to French
    >>> tokenizer.src_lang = "hi"
    >>> encoded_hi = tokenizer(hi_text, return_tensors="pt")
    >>> generated_tokens = model.generate(**encoded_hi, forced_bos_token_id=tokenizer.get_lang_id("fr"))
    >>> tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
    "La vie est comme une boîte de chocolat."
    >>> # translate Chinese to English
    >>> tokenizer.src_lang = "zh"
    >>> encoded_zh = tokenizer(chinese_text, return_tensors="pt")
    >>> generated_tokens = model.generate(**encoded_zh, forced_bos_token_id=tokenizer.get_lang_id("en"))
    >>> tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
    "Life is like a box of chocolate."
 M2M100Config
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.M2M100Config
    :members:
 M2M100Tokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.M2M100Tokenizer
    :members: build_inputs_with_special_tokens, get_special_tokens_mask,
        create_token_type_ids_from_sequences, save_vocabulary
 M2M100Model
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.M2M100Model
    :members: forward
 M2M100ForConditionalGeneration
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.M2M100ForConditionalGeneration
    :members: forward
--- a/docs/source/model_doc/marian.mdx
+++ b/docs/source/model_doc/marian.mdx
@@ -0,0 +1,191 @@
 <!--Copyright 2020 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # MarianMT
 **Bugs:** If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=sshleifer&labels=&template=bug-report.md&title)
 and assign @patrickvonplaten.
 Translations should be similar, but not identical, to the output in the test set linked to in each model card.
 ## Implementation Notes
 - Each model is about 298 MB on disk, and there are more than 1,000 models.
 - The list of supported language pairs can be found [here](https://huggingface.co/Helsinki-NLP).
 - Models were originally trained by [Jörg Tiedemann](https://researchportal.helsinki.fi/en/persons/j%C3%B6rg-tiedemann) using the [Marian](https://marian-nmt.github.io/) C++ library, which supports fast training and translation.
 - All models are transformer encoder-decoders with 6 layers in each component. Each model's performance is documented
  in a model card.
 - The 80 opus models that require BPE preprocessing are not supported.
 - The modeling code is the same as [`BartForConditionalGeneration`] with a few minor modifications:
  - static (sinusoid) positional embeddings (`MarianConfig.static_position_embeddings=True`)
  - no layernorm_embedding (`MarianConfig.normalize_embedding=False`)
  - the model starts generating with `pad_token_id` (which has 0 as a token_embedding) as the prefix (Bart uses
    `<s/>`),
 - Code to bulk convert models can be found in `convert_marian_to_pytorch.py`.
 - This model was contributed by [sshleifer](https://huggingface.co/sshleifer).
 ## Naming
 - All model names use the following format: `Helsinki-NLP/opus-mt-{src}-{tgt}`
 - The language codes used to name models are inconsistent. Two-letter codes can usually be found [here](https://developers.google.com/admin-sdk/directory/v1/languages); three-letter codes require searching for "language
  code {code}".
 - Codes formatted like `es_AR` are usually `code_{region}`. That one is Spanish from Argentina.
 - The models were converted in two stages. The first 1000 models use ISO-639-2 codes to identify languages; the second
  group uses a combination of ISO-639-5 codes and ISO-639-2 codes.
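As a rough illustration of the naming scheme above, the following sketch builds a model name from a source and target code and splits a `code_{region}` style code. Both helpers (`marian_model_name`, `split_region`) are hypothetical and not part of `transformers`. Since the codes are described as inconsistent, the split is only a heuristic: not every code containing an underscore follows the `code_{region}` pattern.

```python
from typing import Optional, Tuple

ORG = "Helsinki-NLP"
PREFIX = "opus-mt-"


def marian_model_name(src: str, tgt: str) -> str:
    """Build a name of the form Helsinki-NLP/opus-mt-{src}-{tgt}."""
    return f"{ORG}/{PREFIX}{src}-{tgt}"


def split_region(code: str) -> Tuple[str, Optional[str]]:
    """Split a code such as 'es_AR' into ('es', 'AR'); region is None if absent."""
    language, _, region = code.partition("_")
    return language, (region or None)


print(marian_model_name("en", "ROMANCE"))
print(split_region("es_AR"))
print(split_region("fr"))
```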
 ## Examples
 - Since Marian models are smaller than many other translation models available in the library, they can be useful for
  fine-tuning experiments and integration tests.
 - [Fine-tune on GPU](https://github.com/huggingface/transformers/blob/master/examples/research_projects/seq2seq-distillation/train_distil_marian_enro_teacher.sh)
 - [Fine-tune on GPU with pytorch-lightning](https://github.com/huggingface/transformers/blob/master/examples/research_projects/seq2seq-distillation/train_distil_marian_no_teacher.sh)
 ## Multilingual Models
 - All model names use the following format: `Helsinki-NLP/opus-mt-{src}-{tgt}`.
 - If a model can output multiple languages, you should specify a language code by prepending the desired output
  language to the `src_text`.
 - You can see a model's supported language codes in its model card, under target constituents, like in [opus-mt-en-roa](https://huggingface.co/Helsinki-NLP/opus-mt-en-roa).
 - Note that if a model is only multilingual on the source side, like `Helsinki-NLP/opus-mt-roa-en`, no language
  codes are required.
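The prefixing rule above can be sketched without loading a model. The helper `add_target_language` is hypothetical; it only shows how a language token is prepended to each source sentence and how an unsupported code could be caught early. The `supported` list is a short excerpt chosen for illustration, whereas a real tokenizer exposes the full list through `supported_language_codes`.

```python
from typing import Iterable, List


def add_target_language(
    texts: Iterable[str], lang_code: str, supported_codes: Iterable[str]
) -> List[str]:
    """Prepend a '>>code<<' token to every text (hypothetical helper)."""
    token = f">>{lang_code}<<"
    if token not in set(supported_codes):
        raise ValueError(f"{token} is not a supported language code")
    return [f"{token} {text}" for text in texts]


# Short excerpt of supported codes, used only for this illustration.
supported = [">>fra<<", ">>por<<", ">>spa<<"]
batch = add_target_language(["This should go to Portuguese"], "por", supported)
print(batch)
```

Checking the code before generation helps because an unsupported prefix is not treated as a language token by the model.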
 New multi-lingual models from the [Tatoeba-Challenge repo](https://github.com/Helsinki-NLP/Tatoeba-Challenge)
 require 3 character language codes:
 ```python
 >>> from transformers import MarianMTModel, MarianTokenizer
 >>> src_text = [
 ...     '>>fra<< this is a sentence in english that we want to translate to french',
 ...     '>>por<< This should go to portuguese',
 ...     '>>spa<< And this to Spanish'
 ... ]
 >>> model_name = 'Helsinki-NLP/opus-mt-en-roa'
 >>> tokenizer = MarianTokenizer.from_pretrained(model_name)
 >>> print(tokenizer.supported_language_codes)
 ['>>zlm_Latn<<', '>>mfe<<', '>>hat<<', '>>pap<<', '>>ast<<', '>>cat<<', '>>ind<<', '>>glg<<', '>>wln<<', '>>spa<<', '>>fra<<', '>>ron<<', '>>por<<', '>>ita<<', '>>oci<<', '>>arg<<', '>>min<<']
 >>> model = MarianMTModel.from_pretrained(model_name)
 >>> translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
 >>> [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
 ["c'est une phrase en anglais que nous voulons traduire en français",
 'Isto deve ir para o português.',
 'Y esto al español']
 ```
 Here is the code to see all available pretrained models on the hub:
 ```python
 from huggingface_hub import list_models
 model_list = list_models()
 org = "Helsinki-NLP"
 model_ids = [x.modelId for x in model_list if x.modelId.startswith(org)]
 suffix = [x.split('/')[1] for x in model_ids]
 old_style_multi_models = [f'{org}/{s}' for s in suffix if s != s.lower()]
 ```
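To see what the `s != s.lower()` filter in the snippet above selects, here is the same logic applied to a small hard-coded list instead of the result of `list_models()`. The list is a hand-picked sample for illustration; the filter keeps only names whose suffix contains uppercase characters, which is how the old-style group names such as `ROMANCE` or `ZH` are written.

```python
org = "Helsinki-NLP"

# Hand-picked sample of model ids, standing in for the hub listing.
model_ids = [
    "Helsinki-NLP/opus-mt-en-ROMANCE",
    "Helsinki-NLP/opus-mt-en-roa",
    "Helsinki-NLP/opus-mt-fi-ZH",
    "Helsinki-NLP/opus-mt-roa-en",
]

# Keep the part after the organization name.
suffix = [x.split("/")[1] for x in model_ids]

# A suffix that differs from its lowercase form contains an uppercase group name.
old_style_multi_models = [f"{org}/{s}" for s in suffix if s != s.lower()]
print(old_style_multi_models)
```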
 ## Old Style Multi-Lingual Models
 These are the old-style multilingual models ported from the OPUS-MT-Train repo, followed by the members of each
 language group:
 ```python
 ['Helsinki-NLP/opus-mt-NORTH_EU-NORTH_EU',
 'Helsinki-NLP/opus-mt-ROMANCE-en',
 'Helsinki-NLP/opus-mt-SCANDINAVIA-SCANDINAVIA',
 'Helsinki-NLP/opus-mt-de-ZH',
 'Helsinki-NLP/opus-mt-en-CELTIC',
 'Helsinki-NLP/opus-mt-en-ROMANCE',
 'Helsinki-NLP/opus-mt-es-NORWAY',
 'Helsinki-NLP/opus-mt-fi-NORWAY',
 'Helsinki-NLP/opus-mt-fi-ZH',
 'Helsinki-NLP/opus-mt-fi_nb_no_nn_ru_sv_en-SAMI',
 'Helsinki-NLP/opus-mt-sv-NORWAY',
 'Helsinki-NLP/opus-mt-sv-ZH']
 GROUP_MEMBERS = {
 'ZH': ['cmn', 'cn', 'yue', 'ze_zh', 'zh_cn', 'zh_CN', 'zh_HK', 'zh_tw', 'zh_TW', 'zh_yue', 'zhs', 'zht', 'zh'],
 'ROMANCE': ['fr', 'fr_BE', 'fr_CA', 'fr_FR', 'wa', 'frp', 'oc', 'ca', 'rm', 'lld', 'fur', 'lij', 'lmo', 'es', 'es_AR', 'es_CL', 'es_CO', 'es_CR', 'es_DO', 'es_EC', 'es_ES', 'es_GT', 'es_HN', 'es_MX', 'es_NI', 'es_PA', 'es_PE', 'es_PR', 'es_SV', 'es_UY', 'es_VE', 'pt', 'pt_br', 'pt_BR', 'pt_PT', 'gl', 'lad', 'an', 'mwl', 'it', 'it_IT', 'co', 'nap', 'scn', 'vec', 'sc', 'ro', 'la'],
 'NORTH_EU': ['de', 'nl', 'fy', 'af', 'da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
 'SCANDINAVIA': ['da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
 'SAMI': ['se', 'sma', 'smj', 'smn', 'sms'],
 'NORWAY': ['nb_NO', 'nb', 'nn_NO', 'nn', 'nog', 'no_nb', 'no'],
 'CELTIC': ['ga', 'cy', 'br', 'gd', 'kw', 'gv']
 }
 ```
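One way to use a mapping like `GROUP_MEMBERS` is to look up which old-style groups cover a given language code. The sketch below copies a subset of the mapping above; the helper `groups_for` is hypothetical and exists only for this illustration.

```python
from typing import Dict, List

# Subset of the GROUP_MEMBERS mapping shown above.
GROUP_MEMBERS: Dict[str, List[str]] = {
    "NORTH_EU": ["de", "nl", "fy", "af", "da", "fo", "is", "no", "nb", "nn", "sv"],
    "SCANDINAVIA": ["da", "fo", "is", "no", "nb", "nn", "sv"],
    "NORWAY": ["nb_NO", "nb", "nn_NO", "nn", "nog", "no_nb", "no"],
    "CELTIC": ["ga", "cy", "br", "gd", "kw", "gv"],
}


def groups_for(code: str) -> List[str]:
    """Return the sorted names of all groups that list the given code."""
    return sorted(group for group, members in GROUP_MEMBERS.items() if code in members)


print(groups_for("nb"))
print(groups_for("cy"))
```

A code can belong to several groups, so the lookup returns a list rather than a single name.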
 Example of translating English to many Romance languages, using old-style 2-character language codes:
 ```python
 >>> from transformers import MarianMTModel, MarianTokenizer
 >>> src_text = [
 ...     '>>fr<< this is a sentence in english that we want to translate to french',
 ...     '>>pt<< This should go to portuguese',
 ...     '>>es<< And this to Spanish'
 ... ]
 >>> model_name = 'Helsinki-NLP/opus-mt-en-ROMANCE'
 >>> tokenizer = MarianTokenizer.from_pretrained(model_name)
 >>> model = MarianMTModel.from_pretrained(model_name)
 >>> translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
 >>> [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
 ["c'est une phrase en anglais que nous voulons traduire en français",
 'Isto deve ir para o português.',
 'Y esto al español']
 ```
 ## MarianConfig
 [[autodoc]] MarianConfig
 ## MarianTokenizer
 [[autodoc]] MarianTokenizer
    - as_target_tokenizer
 ## MarianModel
 [[autodoc]] MarianModel
    - forward
 ## MarianMTModel
 [[autodoc]] MarianMTModel
    - forward
 ## MarianForCausalLM
 [[autodoc]] MarianForCausalLM
    - forward
 ## TFMarianModel
 [[autodoc]] TFMarianModel
    - call
 ## TFMarianMTModel
 [[autodoc]] TFMarianMTModel
    - call
 ## FlaxMarianModel
 [[autodoc]] FlaxMarianModel
    - __call__
 ## FlaxMarianMTModel
 [[autodoc]] FlaxMarianMTModel
    - __call__
--- a/docs/source/model_doc/marian.rst
+++ b/docs/source/model_doc/marian.rst
@@ -1,232 +0,0 @@
 .. 
    Copyright 2020 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 MarianMT
 -----------------------------------------------------------------------------------------------------------------------
 **Bugs:** If you see something strange, file a `Github Issue
 <https://github.com/huggingface/transformers/issues/new?assignees=sshleifer&labels=&template=bug-report.md&title>`__
 and assign @patrickvonplaten.
 Translations should be similar, but not identical to output in the test set linked to in each model card.
 Implementation Notes
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 - Each model is about 298 MB on disk, and there are more than 1,000 models.
 - The list of supported language pairs can be found `here <https://huggingface.co/Helsinki-NLP>`__.
 - Models were originally trained by `Jörg Tiedemann
  <https://researchportal.helsinki.fi/en/persons/j%C3%B6rg-tiedemann>`__ using the `Marian
  <https://marian-nmt.github.io/>`__ C++ library, which supports fast training and translation.
 - All models are transformer encoder-decoders with 6 layers in each component. Each model's performance is documented
  in a model card.
 - The 80 opus models that require BPE preprocessing are not supported.
 - The modeling code is the same as :class:`~transformers.BartForConditionalGeneration` with a few minor modifications:
    - static (sinusoid) positional embeddings (:obj:`MarianConfig.static_position_embeddings=True`)
    - no layernorm_embedding (:obj:`MarianConfig.normalize_embedding=False`)
    - the model starts generating with :obj:`pad_token_id` (which has 0 as a token_embedding) as the prefix (Bart uses
      :obj:`<s/>`),
 - Code to bulk convert models can be found in ``convert_marian_to_pytorch.py``.
 - This model was contributed by `sshleifer <https://huggingface.co/sshleifer>`__.
 Naming
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 - All model names use the following format: :obj:`Helsinki-NLP/opus-mt-{src}-{tgt}`
 - The language codes used to name models are inconsistent. Two digit codes can usually be found `here
  <https://developers.google.com/admin-sdk/directory/v1/languages>`__, three digit codes require googling "language
  code {code}".
 - Codes formatted like :obj:`es_AR` are usually :obj:`code_{region}`. That one is Spanish from Argentina.
 - The models were converted in two stages. The first 1000 models use ISO-639-2 codes to identify languages, the second
  group use a combination of ISO-639-5 codes and ISO-639-2 codes.
 Examples
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 - Since Marian models are smaller than many other translation models available in the library, they can be useful for
  fine-tuning experiments and integration tests.
 - `Fine-tune on GPU
  <https://github.com/huggingface/transformers/blob/master/examples/research_projects/seq2seq-distillation/train_distil_marian_enro_teacher.sh>`__
 - `Fine-tune on GPU with pytorch-lightning
  <https://github.com/huggingface/transformers/blob/master/examples/research_projects/seq2seq-distillation/train_distil_marian_no_teacher.sh>`__
 Multilingual Models
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 - All model names use the following format: :obj:`Helsinki-NLP/opus-mt-{src}-{tgt}`.
 - If a model can output multiple languages, you should specify a language code by prepending the desired output
  language to the :obj:`src_text`.
 - You can see a model's supported language codes in its model card, under target constituents, like in `opus-mt-en-roa
  <https://huggingface.co/Helsinki-NLP/opus-mt-en-roa>`__.
 - Note that if a model is only multilingual on the source side, like :obj:`Helsinki-NLP/opus-mt-roa-en`, no language
  codes are required.
 New multi-lingual models from the `Tatoeba-Challenge repo <https://github.com/Helsinki-NLP/Tatoeba-Challenge>`__
 require 3 character language codes:
 .. code-block:: python
    >>> from transformers import MarianMTModel, MarianTokenizer
    >>> src_text = [
    ...     '>>fra<< this is a sentence in english that we want to translate to french',
    ...     '>>por<< This should go to portuguese',
    ...     '>>spa<< And this to Spanish'
    ... ]
    >>> model_name = 'Helsinki-NLP/opus-mt-en-roa'
    >>> tokenizer = MarianTokenizer.from_pretrained(model_name)
    >>> print(tokenizer.supported_language_codes)
    ['>>zlm_Latn<<', '>>mfe<<', '>>hat<<', '>>pap<<', '>>ast<<', '>>cat<<', '>>ind<<', '>>glg<<', '>>wln<<', '>>spa<<', '>>fra<<', '>>ron<<', '>>por<<', '>>ita<<', '>>oci<<', '>>arg<<', '>>min<<']
    >>> model = MarianMTModel.from_pretrained(model_name)
    >>> translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
    >>> [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
    ["c'est une phrase en anglais que nous voulons traduire en français",
     'Isto deve ir para o português.',
     'Y esto al español']
 Here is the code to see all available pretrained models on the hub:
 .. code-block:: python
    from huggingface_hub import list_models
    model_list = list_models()
    org = "Helsinki-NLP"
    model_ids = [x.modelId for x in model_list if x.modelId.startswith(org)]
    suffix = [x.split('/')[1] for x in model_ids]
    old_style_multi_models = [f'{org}/{s}' for s in suffix if s != s.lower()]
 Old Style Multi-Lingual Models
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 These are the old-style multilingual models ported from the OPUS-MT-Train repo, followed by the members of each
 language group:
 .. code-block:: python
    ['Helsinki-NLP/opus-mt-NORTH_EU-NORTH_EU',
     'Helsinki-NLP/opus-mt-ROMANCE-en',
     'Helsinki-NLP/opus-mt-SCANDINAVIA-SCANDINAVIA',
     'Helsinki-NLP/opus-mt-de-ZH',
     'Helsinki-NLP/opus-mt-en-CELTIC',
     'Helsinki-NLP/opus-mt-en-ROMANCE',
     'Helsinki-NLP/opus-mt-es-NORWAY',
     'Helsinki-NLP/opus-mt-fi-NORWAY',
     'Helsinki-NLP/opus-mt-fi-ZH',
     'Helsinki-NLP/opus-mt-fi_nb_no_nn_ru_sv_en-SAMI',
     'Helsinki-NLP/opus-mt-sv-NORWAY',
     'Helsinki-NLP/opus-mt-sv-ZH']
    GROUP_MEMBERS = {
     'ZH': ['cmn', 'cn', 'yue', 'ze_zh', 'zh_cn', 'zh_CN', 'zh_HK', 'zh_tw', 'zh_TW', 'zh_yue', 'zhs', 'zht', 'zh'],
     'ROMANCE': ['fr', 'fr_BE', 'fr_CA', 'fr_FR', 'wa', 'frp', 'oc', 'ca', 'rm', 'lld', 'fur', 'lij', 'lmo', 'es', 'es_AR', 'es_CL', 'es_CO', 'es_CR', 'es_DO', 'es_EC', 'es_ES', 'es_GT', 'es_HN', 'es_MX', 'es_NI', 'es_PA', 'es_PE', 'es_PR', 'es_SV', 'es_UY', 'es_VE', 'pt', 'pt_br', 'pt_BR', 'pt_PT', 'gl', 'lad', 'an', 'mwl', 'it', 'it_IT', 'co', 'nap', 'scn', 'vec', 'sc', 'ro', 'la'],
     'NORTH_EU': ['de', 'nl', 'fy', 'af', 'da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
     'SCANDINAVIA': ['da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
     'SAMI': ['se', 'sma', 'smj', 'smn', 'sms'],
     'NORWAY': ['nb_NO', 'nb', 'nn_NO', 'nn', 'nog', 'no_nb', 'no'],
     'CELTIC': ['ga', 'cy', 'br', 'gd', 'kw', 'gv']
    }
 Example of translating English to many Romance languages, using old-style 2 character language codes:
 .. code-block:: python
    >>> from transformers import MarianMTModel, MarianTokenizer
    >>> src_text = [
    ...     '>>fr<< this is a sentence in english that we want to translate to french',
    ...     '>>pt<< This should go to portuguese',
    ...     '>>es<< And this to Spanish'
    ... ]
    >>> model_name = 'Helsinki-NLP/opus-mt-en-ROMANCE'
    >>> tokenizer = MarianTokenizer.from_pretrained(model_name)
    >>> model = MarianMTModel.from_pretrained(model_name)
    >>> translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
    >>> [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
    ["c'est une phrase en anglais que nous voulons traduire en français",
     'Isto deve ir para o português.',
     'Y esto al español']
 MarianConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.MarianConfig
    :members:
 MarianTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.MarianTokenizer
    :members: as_target_tokenizer
 MarianModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.MarianModel
    :members: forward
 MarianMTModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.MarianMTModel
    :members: forward
 MarianForCausalLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.MarianForCausalLM
    :members: forward
 TFMarianModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFMarianModel
    :members: call
 TFMarianMTModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFMarianMTModel
    :members: call
 FlaxMarianModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxMarianModel
    :members: __call__
 FlaxMarianMTModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxMarianMTModel
    :members: __call__
--- a/docs/source/model_doc/mbart.mdx
+++ b/docs/source/model_doc/mbart.mdx
@@ -0,0 +1,230 @@
 <!--Copyright 2020 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # MBart and MBart-50
 **DISCLAIMER:** If you see something strange, file a [GitHub Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title) and assign
@patrickvonplaten.
 ## Overview of MBart
 The MBart model was presented in [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan
 Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
 According to the abstract, mBART is a sequence-to-sequence denoising auto-encoder pretrained on large-scale monolingual
 corpora in many languages using the BART objective. mBART is one of the first methods for pretraining a complete
 sequence-to-sequence model by denoising full texts in multiple languages, while previous approaches have focused only
 on the encoder, decoder, or reconstructing parts of the text.
 This model was contributed by [valhalla](https://huggingface.co/valhalla). The authors' code can be found [here](https://github.com/pytorch/fairseq/tree/master/examples/mbart).
 ### Training of MBart
 MBart is a multilingual encoder-decoder (sequence-to-sequence) model primarily intended for translation tasks. As the
 model is multilingual, it expects the sequences in a different format. A special language id token is added in both the
 source and target text. The source text format is `X [eos, src_lang_code]` where `X` is the source text. The
 target text format is `[tgt_lang_code] X [eos]`. `bos` is never used.
 The regular [`~MBartTokenizer.__call__`] encodes text in the source format. To encode text in the target format, wrap
 the call inside the context manager [`~MBartTokenizer.as_target_tokenizer`].
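The two layouts described above can be illustrated without loading a tokenizer. The following is a minimal sketch that works on plain string tokens. The helper names `mbart_source_tokens` and `mbart_target_tokens` are hypothetical and not part of the library, and the literal `</s>` marker is assumed here only to stand in for `eos`.

```python
# Illustrative sketch only: these helpers are hypothetical, not part of transformers.
EOS = "</s>"  # assumed textual stand-in for the end-of-sequence token


def mbart_source_tokens(tokens, src_lang_code):
    """Lay out source tokens as `X [eos, src_lang_code]`."""
    return list(tokens) + [EOS, src_lang_code]


def mbart_target_tokens(tokens, tgt_lang_code):
    """Lay out target tokens as `[tgt_lang_code] X [eos]`."""
    return [tgt_lang_code] + list(tokens) + [EOS]


source = mbart_source_tokens(["UN", "Chief"], "en_XX")
target = mbart_target_tokens(["Şeful", "ONU"], "ro_RO")
print(source)  # language code comes last, after eos
print(target)  # language code comes first, eos closes the sequence
```

In practice the tokenizer adds these special tokens itself; the sketch only makes the ordering visible.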
 - Supervised training
 ```python
 >>> from transformers import MBartForConditionalGeneration, MBartTokenizer
 >>> tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-en-ro", src_lang="en_XX", tgt_lang="ro_RO")
 >>> example_english_phrase = "UN Chief Says There Is No Military Solution in Syria"
 >>> expected_translation_romanian = "Şeful ONU declară că nu există o soluţie militară în Siria"
 >>> inputs = tokenizer(example_english_phrase, return_tensors="pt")
 >>> with tokenizer.as_target_tokenizer():
 ...     labels = tokenizer(expected_translation_romanian, return_tensors="pt")
 >>> model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-en-ro")
 >>> # forward pass
 >>> model(**inputs, labels=labels["input_ids"])
 ```
 - Generation
  While generating the target text, set the `decoder_start_token_id` to the target language id. The following
  example shows how to translate English to Romanian using the *facebook/mbart-large-en-ro* model.
 ```python
 >>> from transformers import MBartForConditionalGeneration, MBartTokenizer
 >>> tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-en-ro", src_lang="en_XX")
 >>> model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-en-ro")
 >>> article = "UN Chief Says There Is No Military Solution in Syria"
 >>> inputs = tokenizer(article, return_tensors="pt")
 >>> translated_tokens = model.generate(**inputs, decoder_start_token_id=tokenizer.lang_code_to_id["ro_RO"])
 >>> tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
 "Şeful ONU declară că nu există o soluţie militară în Siria"
 ```
 ## Overview of MBart-50
 MBart-50 was introduced in the [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401)
 paper by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav
 Chaudhary, Jiatao Gu, Angela Fan. MBart-50 is created using the original *mbart-large-cc25* checkpoint by extending
 its embedding layers with randomly initialized vectors for an extra set of 25 language tokens and then pretrained on 50
 languages.
 According to the abstract:
 *Multilingual translation models can be created through multilingual finetuning. Instead of finetuning on one
 direction, a pretrained model is finetuned on many directions at the same time. It demonstrates that pretrained models
 can be extended to incorporate additional languages without loss of performance. Multilingual finetuning improves on
 average 1 BLEU over the strongest baselines (being either multilingual from scratch or bilingual finetuning) while
 improving 9.3 BLEU on average over bilingual baselines from scratch.*
 ### Training of MBart-50
 The text format for MBart-50 is slightly different from mBART. For MBart-50 the language id token is used as a prefix
 for both source and target text, i.e., the text format is `[lang_code] X [eos]`, where `lang_code` is the source
 language id for source text and the target language id for target text, with `X` being the source or target text
 respectively.
 MBart-50 has its own tokenizer [`MBart50Tokenizer`].
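As a rough illustration of the prefix layout described above, the sketch below builds both sides of a pair from plain string tokens. The helper `mbart50_tokens` is hypothetical and not part of the library, and `</s>` is assumed here only as a stand-in for `eos`.

```python
# Illustrative sketch only: this helper is hypothetical, not part of transformers.
EOS = "</s>"  # assumed textual stand-in for the end-of-sequence token


def mbart50_tokens(tokens, lang_code):
    """Lay out tokens as `[lang_code] X [eos]`, used for source and target alike."""
    return [lang_code] + list(tokens) + [EOS]


source = mbart50_tokens(["UN", "Chief"], "en_XX")  # source side uses the source language id
target = mbart50_tokens(["Şeful", "ONU"], "ro_RO")  # target side uses the target language id
print(source)
print(target)
```

Unlike the original MBart layout, the same function serves both sides because only the language code differs.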
 -  Supervised training
 ```python
 from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
 model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")
 tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50", src_lang="en_XX", tgt_lang="ro_RO")
 src_text = "UN Chief Says There Is No Military Solution in Syria"
 tgt_text = "Şeful ONU declară că nu există o soluţie militară în Siria"
 model_inputs = tokenizer(src_text, return_tensors="pt")
 with tokenizer.as_target_tokenizer():
    labels = tokenizer(tgt_text, return_tensors="pt").input_ids
 model(**model_inputs, labels=labels) # forward pass
 ```
 - Generation
  To generate using the mBART-50 multilingual translation models, `eos_token_id` is used as the
  `decoder_start_token_id` and the target language id is forced as the first generated token. To force the
  target language id as the first generated token, pass the *forced_bos_token_id* parameter to the *generate* method.
  The following example shows how to translate from Hindi to French and from Arabic to English using the
  *facebook/mbart-large-50-many-to-many-mmt* checkpoint.
 ```python
 from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
 article_hi = "संयुक्त राष्ट्र के प्रमुख का कहना है कि सीरिया में कोई सैन्य समाधान नहीं है"
 article_ar = "الأمين العام للأمم المتحدة يقول إنه لا يوجد حل عسكري في سوريا."
 model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
 tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
 # translate Hindi to French
 tokenizer.src_lang = "hi_IN"
 encoded_hi = tokenizer(article_hi, return_tensors="pt")
 generated_tokens = model.generate(**encoded_hi, forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"])
 tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
 # => "Le chef de l 'ONU affirme qu 'il n 'y a pas de solution militaire en Syria."
 # translate Arabic to English
 tokenizer.src_lang = "ar_AR"
 encoded_ar = tokenizer(article_ar, return_tensors="pt")
 generated_tokens = model.generate(**encoded_ar, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
 tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
 # => "The Secretary-General of the United Nations says there is no military solution in Syria."
 ```
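The idea of forcing the target language id as the first generated token can be sketched with a toy greedy decoder. This is only an illustration under stated assumptions: the token ids, the scores, and the helpers `force_first_token` and `greedy_pick` are made up for the example, and the library's actual implementation may differ in detail.

```python
import math

# Illustrative sketch only: ids, scores, and helper names are made up.


def force_first_token(scores, generated_length, forced_token_id):
    """Allow only `forced_token_id` when the decoder holds just its start token."""
    if generated_length != 1:
        return list(scores)  # later positions are left unconstrained
    return [0.0 if i == forced_token_id else -math.inf for i in range(len(scores))]


def greedy_pick(scores):
    """Return the index of the highest score."""
    return max(range(len(scores)), key=scores.__getitem__)


EOS_ID, FR_ID = 2, 5  # hypothetical ids for eos and the French language code
raw_scores = [0.1, 0.9, 0.3, 0.2, 0.0, 0.4]  # toy scores, reused at every step

sequence = [EOS_ID]  # eos plays the role of the decoder start token
for _ in range(2):
    step_scores = force_first_token(raw_scores, len(sequence), FR_ID)
    sequence.append(greedy_pick(step_scores))

print(sequence)
```

The first generated position is pinned to the language id even though another token has a higher raw score; the second position falls back to the ordinary argmax.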
 ## MBartConfig
 [[autodoc]] MBartConfig
 ## MBartTokenizer
 [[autodoc]] MBartTokenizer
    - as_target_tokenizer
    - build_inputs_with_special_tokens
 ## MBartTokenizerFast
 [[autodoc]] MBartTokenizerFast
 ## MBart50Tokenizer
 [[autodoc]] MBart50Tokenizer
 ## MBart50TokenizerFast
 [[autodoc]] MBart50TokenizerFast
 ## MBartModel
 [[autodoc]] MBartModel
 ## MBartForConditionalGeneration
 [[autodoc]] MBartForConditionalGeneration
 ## MBartForQuestionAnswering
 [[autodoc]] MBartForQuestionAnswering
 ## MBartForSequenceClassification
 [[autodoc]] MBartForSequenceClassification
 ## MBartForCausalLM
 [[autodoc]] MBartForCausalLM
    - forward
 ## TFMBartModel
 [[autodoc]] TFMBartModel
    - call
 ## TFMBartForConditionalGeneration
 [[autodoc]] TFMBartForConditionalGeneration
    - call
 ## FlaxMBartModel
 [[autodoc]] FlaxMBartModel
    - __call__
    - encode
    - decode
 ## FlaxMBartForConditionalGeneration
 [[autodoc]] FlaxMBartForConditionalGeneration
    - __call__
    - encode
    - decode
 ## FlaxMBartForSequenceClassification
 [[autodoc]] FlaxMBartForSequenceClassification
    - __call__
    - encode
    - decode
 ## FlaxMBartForQuestionAnswering
 [[autodoc]] FlaxMBartForQuestionAnswering
    - __call__
    - encode
    - decode
--- a/docs/source/model_doc/mbart.rst
+++ b/docs/source/model_doc/mbart.rst
@@ -1,270 +0,0 @@
 .. 
    Copyright 2020 The HuggingFace Team. All rights reserved.
    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.
 MBart and MBart-50
 -----------------------------------------------------------------------------------------------------------------------
 **DISCLAIMER:** If you see something strange, file a `Github Issue
 <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ and assign
@patrickvonplaten
 Overview of MBart
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The MBart model was presented in `Multilingual Denoising Pre-training for Neural Machine Translation
 <https://arxiv.org/abs/2001.08210>`_ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan
 Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
 According to the abstract, mBART is a sequence-to-sequence denoising auto-encoder pretrained on large-scale monolingual
 corpora in many languages using the BART objective. mBART is one of the first methods for pretraining a complete
 sequence-to-sequence model by denoising full texts in multiple languages, while previous approaches have focused only
 on the encoder, decoder, or reconstructing parts of the text.
 This model was contributed by `valhalla <https://huggingface.co/valhalla>`__. The authors' code can be found `here
 <https://github.com/pytorch/fairseq/tree/master/examples/mbart>`__.
 Training of MBart
 _______________________________________________________________________________________________________________________
 MBart is a multilingual encoder-decoder (sequence-to-sequence) model primarily intended for translation tasks. As the
 model is multilingual, it expects the sequences in a different format. A special language id token is added in both the
 source and target text. The source text format is :obj:`X [eos, src_lang_code]` where :obj:`X` is the source text. The
 target text format is :obj:`[tgt_lang_code] X [eos]`. :obj:`bos` is never used.
 The regular :meth:`~transformers.MBartTokenizer.__call__` will encode source text format, and it should be wrapped
 inside the context manager :meth:`~transformers.MBartTokenizer.as_target_tokenizer` to encode target text format.
 - Supervised training
 .. code-block::
    >>> from transformers import MBartForConditionalGeneration, MBartTokenizer
    >>> tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-en-ro", src_lang="en_XX", tgt_lang="ro_RO")
    >>> example_english_phrase = "UN Chief Says There Is No Military Solution in Syria"
    >>> expected_translation_romanian = "Şeful ONU declară că nu există o soluţie militară în Siria"
    >>> inputs = tokenizer(example_english_phrase, return_tensors="pt")
    >>> with tokenizer.as_target_tokenizer():
    ...     labels = tokenizer(expected_translation_romanian, return_tensors="pt")
    >>> model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-en-ro")
    >>> # forward pass
    >>> model(**inputs, labels=labels["input_ids"])
 - Generation
    While generating the target text, set the :obj:`decoder_start_token_id` to the target language id. The following
    example shows how to translate English to Romanian using the `facebook/mbart-large-en-ro` model.
 .. code-block::
    >>> from transformers import MBartForConditionalGeneration, MBartTokenizer
    >>> tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-en-ro", src_lang="en_XX")
    >>> model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-en-ro")
    >>> article = "UN Chief Says There Is No Military Solution in Syria"
    >>> inputs = tokenizer(article, return_tensors="pt")
    >>> translated_tokens = model.generate(**inputs, decoder_start_token_id=tokenizer.lang_code_to_id["ro_RO"])
    >>> tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
    "Şeful ONU declară că nu există o soluţie militară în Siria"
 Overview of MBart-50
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 MBart-50 was introduced in the `Multilingual Translation with Extensible Multilingual Pretraining and Finetuning
 <https://arxiv.org/abs/2008.00401>`__ paper by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav
 Chaudhary, Jiatao Gu, Angela Fan. MBart-50 is created using the original `mbart-large-cc25` checkpoint by extending
 its embedding layers with randomly initialized vectors for an extra set of 25 language tokens and then pretrained on 50
 languages.
 According to the abstract:
 *Multilingual translation models can be created through multilingual finetuning. Instead of finetuning on one
 direction, a pretrained model is finetuned on many directions at the same time. It demonstrates that pretrained models
 can be extended to incorporate additional languages without loss of performance. Multilingual finetuning improves on
 average 1 BLEU over the strongest baselines (being either multilingual from scratch or bilingual finetuning) while
 improving 9.3 BLEU on average over bilingual baselines from scratch.*
 Training of MBart-50
 _______________________________________________________________________________________________________________________
 The text format for MBart-50 is slightly different from mBART. For MBart-50 the language id token is used as a prefix
 for both source and target text, i.e., the text format is :obj:`[lang_code] X [eos]`, where :obj:`lang_code` is the source
 language id for source text and the target language id for target text, with :obj:`X` being the source or target text
 respectively.
 MBart-50 has its own tokenizer :class:`~transformers.MBart50Tokenizer`.
 -  Supervised training
 .. code-block::
    from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
    model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")
    tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50", src_lang="en_XX", tgt_lang="ro_RO")
    src_text = "UN Chief Says There Is No Military Solution in Syria"
    tgt_text = "Şeful ONU declară că nu există o soluţie militară în Siria"
    model_inputs = tokenizer(src_text, return_tensors="pt")
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(tgt_text, return_tensors="pt").input_ids
    model(**model_inputs, labels=labels) # forward pass
 - Generation
    To generate using the mBART-50 multilingual translation models, :obj:`eos_token_id` is used as the
    :obj:`decoder_start_token_id` and the target language id is forced as the first generated token. To force the
    target language id as the first generated token, pass the `forced_bos_token_id` parameter to the `generate` method.
    The following example shows how to translate from Hindi to French and from Arabic to English using the
    `facebook/mbart-large-50-many-to-many-mmt` checkpoint.
 .. code-block::
    from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
    article_hi = "संयुक्त राष्ट्र के प्रमुख का कहना है कि सीरिया में कोई सैन्य समाधान नहीं है"
    article_ar = "الأمين العام للأمم المتحدة يقول إنه لا يوجد حل عسكري في سوريا."
    model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
    tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
    # translate Hindi to French
    tokenizer.src_lang = "hi_IN"
    encoded_hi = tokenizer(article_hi, return_tensors="pt")
    generated_tokens = model.generate(**encoded_hi, forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"])
    tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
    # => "Le chef de l 'ONU affirme qu 'il n 'y a pas de solution militaire en Syria."
    # translate Arabic to English
    tokenizer.src_lang = "ar_AR"
    encoded_ar = tokenizer(article_ar, return_tensors="pt")
    generated_tokens = model.generate(**encoded_ar, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
    tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
    # => "The Secretary-General of the United Nations says there is no military solution in Syria."
 MBartConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.MBartConfig
    :members:
 MBartTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.MBartTokenizer
    :members: as_target_tokenizer, build_inputs_with_special_tokens
 MBartTokenizerFast
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.MBartTokenizerFast
    :members:
 MBart50Tokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.MBart50Tokenizer
    :members:
 MBart50TokenizerFast
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.MBart50TokenizerFast
    :members:
 MBartModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.MBartModel
    :members:
 MBartForConditionalGeneration
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.MBartForConditionalGeneration
    :members:
 MBartForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.MBartForQuestionAnswering
    :members:
 MBartForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.MBartForSequenceClassification
 MBartForCausalLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.MBartForCausalLM
    :members: forward
 TFMBartModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFMBartModel
    :members: call
 TFMBartForConditionalGeneration
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.TFMBartForConditionalGeneration
    :members: call
 FlaxMBartModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxMBartModel
    :members: __call__, encode, decode
 FlaxMBartForConditionalGeneration
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxMBartForConditionalGeneration
    :members: __call__, encode, decode
 FlaxMBartForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxMBartForSequenceClassification
    :members: __call__, encode, decode
 FlaxMBartForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.FlaxMBartForQuestionAnswering
    :members: __call__, encode, decode