diff --git a/docs/source/benchmarks.rst b/docs/source/benchmarks.rst
index 38afce66fb..4e17c75e19 100644
--- a/docs/source/benchmarks.rst
+++ b/docs/source/benchmarks.rst
@@ -1,12 +1,12 @@
 Benchmarks
-==========
+=======================================================================================================================
 
 Let's take a look at how 🤗 Transformer models can be benchmarked, best practices, and already available benchmarks.
 
 A notebook explaining in more detail how to benchmark 🤗 Transformer models can be found `here <https://github.com/huggingface/transformers/blob/master/notebooks/05-benchmark.ipynb>`__.
 
 How to benchmark 🤗 Transformer models
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 The classes :class:`~transformers.PyTorchBenchmark` and :class:`~transformers.TensorFlowBenchmark` allow to flexibly benchmark 🤗 Transformer models.
 The benchmark classes allow us to measure the `peak memory usage` and `required time` for both 
@@ -300,7 +300,7 @@ deciding for which configuration the model should be trained.
 
 
 Benchmark best practices
-~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 This section lists a couple of best practices one should be aware of when benchmarking a model.
 
@@ -311,7 +311,7 @@ This section lists a couple of best practices one should be aware of when benchm
 
 
 Sharing your benchmark
-~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Previously all available core models (10 at the time) have been benchmarked for `inference time`, across many different settings: using PyTorch, with
 and without TorchScript, using TensorFlow, with and without XLA. All of those tests were done across CPUs (except for
diff --git a/docs/source/bertology.rst b/docs/source/bertology.rst
index e1ebda78d6..8bb8c46e7d 100644
--- a/docs/source/bertology.rst
+++ b/docs/source/bertology.rst
@@ -1,5 +1,5 @@
 BERTology
----------
+-----------------------------------------------------------------------------------------------------------------------
 
 There is a growing field of study concerned with investigating the inner working of large-scale transformers like BERT (that some call "BERTology"). Some good examples of this field are:
 
diff --git a/docs/source/converting_tensorflow_models.rst b/docs/source/converting_tensorflow_models.rst
index 4151f8cf5c..a96af984dd 100644
--- a/docs/source/converting_tensorflow_models.rst
+++ b/docs/source/converting_tensorflow_models.rst
@@ -1,5 +1,5 @@
 Converting Tensorflow Checkpoints
-================================================
+=======================================================================================================================
 
 A command-line interface is provided to convert original Bert/GPT/GPT-2/Transformer-XL/XLNet/XLM checkpoints in models than be loaded using the ``from_pretrained`` methods of the library.
 
@@ -10,7 +10,7 @@ A command-line interface is provided to convert original Bert/GPT/GPT-2/Transfor
     The documentation below reflects the **transformers-cli convert** command format.
 
 BERT
-^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 You can convert any TensorFlow checkpoint for BERT (in particular `the pre-trained models released by Google <https://github.com/google-research/bert#pre-trained-models>`_\ ) in a PyTorch save file by using the `convert_bert_original_tf_checkpoint_to_pytorch.py <https://github.com/huggingface/transformers/blob/master/src/transformers/convert_bert_original_tf_checkpoint_to_pytorch.py>`_ script.
 
@@ -34,7 +34,7 @@ Here is an example of the conversion process for a pre-trained ``BERT-Base Uncas
 You can download Google's pre-trained models for the conversion `here <https://github.com/google-research/bert#pre-trained-models>`__.
 
 ALBERT
-^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Convert TensorFlow model checkpoints of ALBERT to PyTorch using the `convert_albert_original_tf_checkpoint_to_pytorch.py <https://github.com/huggingface/transformers/blob/master/src/transformers/convert_bert_original_tf_checkpoint_to_pytorch.py>`_ script.
 
@@ -54,7 +54,7 @@ Here is an example of the conversion process for the pre-trained ``ALBERT Base``
 You can download Google's pre-trained models for the conversion `here <https://github.com/google-research/albert#pre-trained-models>`__.
 
 OpenAI GPT
-^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Here is an example of the conversion process for a pre-trained OpenAI GPT model, assuming that your NumPy checkpoint save as the same format than OpenAI pretrained model (see `here <https://github.com/openai/finetune-transformer-lm>`__\ )
 
@@ -70,7 +70,7 @@ Here is an example of the conversion process for a pre-trained OpenAI GPT model,
 
 
 OpenAI GPT-2
-^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Here is an example of the conversion process for a pre-trained OpenAI GPT-2 model (see `here <https://github.com/openai/gpt-2>`__\ )
 
@@ -85,7 +85,7 @@ Here is an example of the conversion process for a pre-trained OpenAI GPT-2 mode
      [--finetuning_task_name OPENAI_GPT2_FINETUNED_TASK]
 
 Transformer-XL
-^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Here is an example of the conversion process for a pre-trained Transformer-XL model (see `here <https://github.com/kimiyoung/transformer-xl/tree/master/tf#obtain-and-evaluate-pretrained-sota-models>`__\ )
 
@@ -101,7 +101,7 @@ Here is an example of the conversion process for a pre-trained Transformer-XL mo
 
 
 XLNet
-^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Here is an example of the conversion process for a pre-trained XLNet model:
 
@@ -118,7 +118,7 @@ Here is an example of the conversion process for a pre-trained XLNet model:
 
 
 XLM
-^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Here is an example of the conversion process for a pre-trained XLM model:
 
diff --git a/docs/source/custom_datasets.rst b/docs/source/custom_datasets.rst
index c9fe228792..9e93c56978 100644
--- a/docs/source/custom_datasets.rst
+++ b/docs/source/custom_datasets.rst
@@ -1,5 +1,5 @@
 Fine-tuning with custom datasets
-================================
+=======================================================================================================================
 
 .. note::
 
@@ -24,7 +24,7 @@ We include several examples, each of which demonstrates a different type of comm
 .. _seq_imdb:
 
 Sequence Classification with IMDb Reviews
------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 .. note::
 
@@ -139,7 +139,7 @@ Now that our datasets our ready, we can fine-tune a model either with the 🤗
 .. _ft_trainer:
 
 Fine-tuning with Trainer
-~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 The steps above prepared the datasets in the way that the trainer is expected. Now all we need to do is create a
 model to fine-tune, define the :class:`~transformers.TrainingArguments`/:class:`~transformers.TFTrainingArguments`
@@ -200,7 +200,7 @@ and instantiate a :class:`~transformers.Trainer`/:class:`~transformers.TFTrainer
 .. _ft_native:
 
 Fine-tuning with native PyTorch/TensorFlow
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 We can also train use native PyTorch or TensorFlow:
 
@@ -244,7 +244,7 @@ We can also train use native PyTorch or TensorFlow:
 .. _tok_ner:
 
 Token Classification with W-NUT Emerging Entities
--------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 .. note::
 
@@ -443,7 +443,7 @@ sequence classification example above.
 .. _qa_squad:
 
 Question Answering with SQuAD 2.0
----------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 .. note::
 
@@ -655,7 +655,7 @@ multiple model outputs.
 .. _resources:
 
 Additional Resources
---------------------
+-----------------------------------------------------------------------------------------------------------------------
 
   - `How to train a new language model from scratch using Transformers and Tokenizers
     <https://huggingface.co/blog/how-to-train>`_. Blog post showing the steps to load in Esperanto data and train a
@@ -666,7 +666,7 @@ Additional Resources
 .. _nlplib:
 
 Using the 🤗 NLP Datasets & Metrics library
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 This tutorial demonstrates how to read in datasets from various raw text formats and prepare them for training with
 🤗 Transformers so that you can do the same thing with your own custom datasets. However, we recommend users use the
diff --git a/docs/source/glossary.rst b/docs/source/glossary.rst
index f2f15755a5..406d9f30e4 100644
--- a/docs/source/glossary.rst
+++ b/docs/source/glossary.rst
@@ -1,8 +1,8 @@
 Glossary
-^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 General terms
--------------
+-----------------------------------------------------------------------------------------------------------------------
 
 - autoencoding models: see MLM
 - autoregressive models: see CLM
@@ -27,7 +27,7 @@ General terms
   or a punctuation symbol.
 
 Model inputs
-------------
+-----------------------------------------------------------------------------------------------------------------------
 
 Every model is different yet bears similarities with the others. Therefore most models use the same inputs, which are
 detailed here alongside usage examples.
@@ -35,7 +35,7 @@ detailed here alongside usage examples.
 .. _input-ids:
 
 Input IDs
-~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 The input ids are often the only required parameters to be passed to the model as input. *They are token indices,
 numerical representations of tokens building the sequences that will be used as input by the model*.
@@ -43,7 +43,7 @@ numerical representations of tokens building the sequences that will be used as
 Each tokenizer works differently but the underlying mechanism remains the same. Here's an example using the BERT
 tokenizer, which is a `WordPiece <https://arxiv.org/pdf/1609.08144.pdf>`__ tokenizer:
 
-::
+.. code-block::
 
     >>> from transformers import BertTokenizer
     >>> tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
@@ -52,7 +52,7 @@ tokenizer, which is a `WordPiece <https://arxiv.org/pdf/1609.08144.pdf>`__ token
 
 The tokenizer takes care of splitting the sequence into tokens available in the tokenizer vocabulary.
 
-::
+.. code-block::
 
     >>> tokenized_sequence = tokenizer.tokenize(sequence)
 
@@ -60,7 +60,7 @@ The tokens are either words or subwords. Here for instance, "VRAM" wasn't in the
 in "V", "RA" and "M". To indicate those tokens are not separate words but parts of the same word, a double-hash prefix is
 added for "RA" and "M":
 
-::
+.. code-block::
 
     >>> print(tokenized_sequence)
     ['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']
@@ -69,14 +69,14 @@ These tokens can then be converted into IDs which are understandable by the mode
 the sentence to the tokenizer, which leverages the Rust implementation of
 `huggingface/tokenizers <https://github.com/huggingface/tokenizers>`__ for peak performance.
 
-::
+.. code-block::
 
     >>> inputs = tokenizer(sequence)
 
 The tokenizer returns a dictionary with all the arguments necessary for its corresponding model to work properly. The
 token indices are under the key "input_ids":
 
-::
+.. code-block::
 
     >>> encoded_sequence = inputs["input_ids"]
     >>> print(encoded_sequence)
@@ -87,13 +87,13 @@ IDs the model sometimes uses.
 
 If we decode the previous sequence of ids,
 
-::
+.. code-block::
 
     >>> decoded_sequence = tokenizer.decode(encoded_sequence)
 
 we will see
 
-::
+.. code-block::
 
     >>> print(decoded_sequence)
     [CLS] A Titan RTX has 24GB of VRAM [SEP]
@@ -103,14 +103,14 @@ because this is the way a :class:`~transformers.BertModel` is going to expect it
 .. _attention-mask:
 
 Attention mask
-~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 The attention mask is an optional argument used when batching sequences together. This argument indicates to the
 model which tokens should be attended to, and which should not.
 
 For example, consider these two sequences:
 
-::
+.. code-block::
 
     >>> from transformers import BertTokenizer
     >>> tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
@@ -123,7 +123,7 @@ For example, consider these two sequences:
 
 The encoded versions have different lengths:
 
-::
+.. code-block::
 
     >>> len(encoded_sequence_a), len(encoded_sequence_b)
     (8, 19)
@@ -134,13 +134,13 @@ of the second one, or the second one needs to be truncated down to the length of
 In the first case, the list of IDs will be extended by the padding indices. We can pass a list to the tokenizer and ask
 it to pad like this:
 
-::
+.. code-block::
 
     >>> padded_sequences = tokenizer([sequence_a, sequence_b], padding=True)
 
 We can see that 0s have been added on the right of the first sentence to make it the same length as the second one:
 
-::
+.. code-block::
 
     >>> padded_sequences["input_ids"]
     [[101, 1188, 1110, 170, 1603, 4954, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110, 1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]]
@@ -150,7 +150,7 @@ the position of the padded indices so that the model does not attend to them. Fo
 :class:`~transformers.BertTokenizer`, :obj:`1` indicates a value that should be attended to, while :obj:`0` indicates
 a padded value. This attention mask is in the dictionary returned by the tokenizer under the key "attention_mask":
 
-::
+.. code-block::
 
     >>> padded_sequences["attention_mask"]
     [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
@@ -158,20 +158,20 @@ a padded value. This attention mask is in the dictionary returned by the tokeniz
 .. _token-type-ids:
 
 Token Type IDs
-~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Some models' purpose is to do sequence classification or question answering. These require two different sequences to
 be joined in a single "input_ids" entry, which usually is performed with the help of special tokens, such as the classifier (``[CLS]``) and separator (``[SEP]``)
 tokens. For example, the BERT model builds its two sequence input as such:
 
-::
+.. code-block::
 
    >>> # [CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP]
 
 We can use our tokenizer to automatically generate such a sentence by passing the two sequences to ``tokenizer`` as two arguments (and
 not a list, like before) like this:
 
-::
+.. code-block::
 
     >>> from transformers import BertTokenizer
     >>> tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
@@ -183,7 +183,7 @@ not a list, like before) like this:
 
 which will return:
 
-::
+.. code-block::
 
     >>> print(decoded)
     [CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP]
@@ -194,7 +194,7 @@ mask identifying the two types of sequence in the model.
 
 The tokenizer returns this mask as the "token_type_ids" entry:
 
-::
+.. code-block::
 
     >>> encoded_dict['token_type_ids']
     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
@@ -207,7 +207,7 @@ Some models, like :class:`~transformers.XLNetModel` use an additional token repr
 .. _position-ids:
 
 Position IDs
-~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Contrary to RNNs that have the position of each token embedded within them,
 transformers are unaware of the position of each token. Therefore, the position IDs (``position_ids``) are used by the model to identify each token's position in the list of tokens.
@@ -221,7 +221,7 @@ use other types of positional embeddings, such as sinusoidal position embeddings
 .. _feed-forward-chunking:
 
 Feed Forward Chunking
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 In each residual attention block in transformers the self-attention layer is usually followed by 2 feed forward layers.
 The intermediate embedding size of the feed forward layers is often bigger than the hidden size of the model (e.g.,
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 8ff8d206b3..1b9d882c71 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -1,5 +1,5 @@
 Transformers
-================================================================================================================================================
+=======================================================================================================================
 
 State-of-the-art Natural Language Processing for Pytorch and TensorFlow 2.0.
 
@@ -11,7 +11,7 @@ TensorFlow 2.0 and PyTorch.
 This is the documentation of our repository `transformers <https://github.com/huggingface/transformers>`_.
 
 Features
----------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 - High performance on NLU and NLG tasks
 - Low barrier to entry for educators and practitioners
@@ -36,7 +36,7 @@ Choose the right framework for every part of a model's lifetime:
 - Seamlessly pick the right framework for training, evaluation, production
 
 Contents
----------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 The documentation is organized in five parts:
 
diff --git a/docs/source/internal/modeling_utils.rst b/docs/source/internal/modeling_utils.rst
index 9e7fb6b11c..a5d59ffa10 100644
--- a/docs/source/internal/modeling_utils.rst
+++ b/docs/source/internal/modeling_utils.rst
@@ -1,13 +1,13 @@
 Custom Layers and Utilities
----------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 This page lists all the custom layers used by the library, as well as the utility functions it provides for modeling.
 
 Most of those are only useful if you are studying the code of the models in the library.
 
 
-``Pytorch custom modules``
-~~~~~~~~~~~~~~~~~~~~~~~~~~
+Pytorch custom modules
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.modeling_utils.Conv1D
 
@@ -29,8 +29,8 @@ Most of those are only useful if you are studying the code of the models in the
     :members: forward
 
 
-``PyTorch Helper Functions``
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+PyTorch Helper Functions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autofunction:: transformers.apply_chunking_to_forward
 
@@ -42,8 +42,8 @@ Most of those are only useful if you are studying the code of the models in the
 
 .. autofunction:: transformers.modeling_utils.prune_linear_layer
 
-``TensorFlow custom layers``
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+TensorFlow custom layers
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.modeling_tf_utils.TFConv1D
 
@@ -54,8 +54,8 @@ Most of those are only useful if you are studying the code of the models in the
     :members: call
 
 
-``TensorFlow loss functions``
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+TensorFlow loss functions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.modeling_tf_utils.TFCausalLanguageModelingLoss
     :members:
@@ -76,8 +76,8 @@ Most of those are only useful if you are studying the code of the models in the
     :members:
 
 
-``TensorFlow Helper Functions``
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+TensorFlow Helper Functions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autofunction:: transformers.modeling_tf_utils.cast_bool_to_primitive
 
diff --git a/docs/source/internal/pipelines_utils.rst b/docs/source/internal/pipelines_utils.rst
index c6fda75803..1e33551af7 100644
--- a/docs/source/internal/pipelines_utils.rst
+++ b/docs/source/internal/pipelines_utils.rst
@@ -1,40 +1,40 @@
-Utilities for pipelines
------------------------
-
-This page lists all the utility functions the library provides for pipelines.
-
-Most of those are only useful if you are studying the code of the models in the library.
-
-
-Argument handling
-~~~~~~~~~~~~~~~~~
-
-.. autoclass:: transformers.pipelines.ArgumentHandler
-
-.. autoclass:: transformers.pipelines.ZeroShotClassificationArgumentHandler
-
-.. autoclass:: transformers.pipelines.QuestionAnsweringArgumentHandler
-
-
-Data format
-~~~~~~~~~~~
-
-.. autoclass:: transformers.pipelines.PipelineDataFormat
-    :members:
-
-.. autoclass:: transformers.pipelines.CsvPipelineDataFormat
-    :members:
-
-.. autoclass:: transformers.pipelines.JsonPipelineDataFormat
-    :members:
-
-.. autoclass:: transformers.pipelines.PipedPipelineDataFormat
-    :members:
-
-
-Utilities
-~~~~~~~~~
-
-.. autofunction:: transformers.pipelines.get_framework
-
-.. autoclass:: transformers.pipelines.PipelineException
+Utilities for pipelines
+-----------------------------------------------------------------------------------------------------------------------
+
+This page lists all the utility functions the library provides for pipelines.
+
+Most of those are only useful if you are studying the code of the models in the library.
+
+
+Argument handling
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.pipelines.ArgumentHandler
+
+.. autoclass:: transformers.pipelines.ZeroShotClassificationArgumentHandler
+
+.. autoclass:: transformers.pipelines.QuestionAnsweringArgumentHandler
+
+
+Data format
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.pipelines.PipelineDataFormat
+    :members:
+
+.. autoclass:: transformers.pipelines.CsvPipelineDataFormat
+    :members:
+
+.. autoclass:: transformers.pipelines.JsonPipelineDataFormat
+    :members:
+
+.. autoclass:: transformers.pipelines.PipedPipelineDataFormat
+    :members:
+
+
+Utilities
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autofunction:: transformers.pipelines.get_framework
+
+.. autoclass:: transformers.pipelines.PipelineException
diff --git a/docs/source/internal/tokenization_utils.rst b/docs/source/internal/tokenization_utils.rst
index 48752c8de2..ccd557404c 100644
--- a/docs/source/internal/tokenization_utils.rst
+++ b/docs/source/internal/tokenization_utils.rst
@@ -1,38 +1,38 @@
-Utilities for Tokenizers
-------------------------
-
-This page lists all the utility functions used by the tokenizers, mainly the class
-:class:`~transformers.tokenization_utils_base.PreTrainedTokenizerBase` that implements the common methods between
-:class:`~transformers.PreTrainedTokenizer` and :class:`~transformers.PreTrainedTokenizerFast` and the mixin
-:class:`~transformers.tokenization_utils_base.SpecialTokensMixin`.
-
-Most of those are only useful if you are studying the code of the tokenizers in the library.
-
-``PreTrainedTokenizerBase``
-~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-.. autoclass:: transformers.tokenization_utils_base.PreTrainedTokenizerBase
-    :special-members: __call__
-    :members:
-
-
-``SpecialTokensMixin``
-~~~~~~~~~~~~~~~~~~~~~~
-
-.. autoclass:: transformers.tokenization_utils_base.SpecialTokensMixin
-    :members:
-
-
-Enums and namedtuples
-~~~~~~~~~~~~~~~~~~~~~
-.. autoclass:: transformers.tokenization_utils_base.ExplicitEnum
-
-.. autoclass:: transformers.tokenization_utils_base.PaddingStrategy
-
-.. autoclass:: transformers.tokenization_utils_base.TensorType
-
-.. autoclass:: transformers.tokenization_utils_base.TruncationStrategy
-
-.. autoclass:: transformers.tokenization_utils_base.CharSpan
-
-.. autoclass:: transformers.tokenization_utils_base.TokenSpan
+Utilities for Tokenizers
+-----------------------------------------------------------------------------------------------------------------------
+
+This page lists all the utility functions used by the tokenizers, mainly the class
+:class:`~transformers.tokenization_utils_base.PreTrainedTokenizerBase` that implements the common methods between
+:class:`~transformers.PreTrainedTokenizer` and :class:`~transformers.PreTrainedTokenizerFast` and the mixin
+:class:`~transformers.tokenization_utils_base.SpecialTokensMixin`.
+
+Most of those are only useful if you are studying the code of the tokenizers in the library.
+
+PreTrainedTokenizerBase
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.tokenization_utils_base.PreTrainedTokenizerBase
+    :special-members: __call__
+    :members:
+
+
+SpecialTokensMixin
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.tokenization_utils_base.SpecialTokensMixin
+    :members:
+
+
+Enums and namedtuples
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+.. autoclass:: transformers.tokenization_utils_base.ExplicitEnum
+
+.. autoclass:: transformers.tokenization_utils_base.PaddingStrategy
+
+.. autoclass:: transformers.tokenization_utils_base.TensorType
+
+.. autoclass:: transformers.tokenization_utils_base.TruncationStrategy
+
+.. autoclass:: transformers.tokenization_utils_base.CharSpan
+
+.. autoclass:: transformers.tokenization_utils_base.TokenSpan
diff --git a/docs/source/main_classes/configuration.rst b/docs/source/main_classes/configuration.rst
index eb48f0b15e..04db915c06 100644
--- a/docs/source/main_classes/configuration.rst
+++ b/docs/source/main_classes/configuration.rst
@@ -1,5 +1,5 @@
 Configuration
-----------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 The base class :class:`~transformers.PretrainedConfig` implements the common methods for loading/saving a configuration
 either from a local file or directory, or from a pretrained model configuration provided by the library (downloaded
@@ -7,7 +7,7 @@ from HuggingFace's AWS S3 repository).
 
 
 PretrainedConfig
-~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.PretrainedConfig
     :members:
diff --git a/docs/source/main_classes/logging.rst b/docs/source/main_classes/logging.rst
index e73f1b84f5..2642726fec 100644
--- a/docs/source/main_classes/logging.rst
+++ b/docs/source/main_classes/logging.rst
@@ -1,18 +1,20 @@
 Logging
--------
+-----------------------------------------------------------------------------------------------------------------------
 
 🤗 Transformers has a centralized logging system, so that you can setup the verbosity of the library easily.
 
 Currently the default verbosity of the library is ``WARNING``.
 
-To change the level of verbosity, just use one of the direct setters. For instance, here is how to change the verbosity to the INFO level.
+To change the level of verbosity, just use one of the direct setters. For instance, here is how to change the verbosity
+to the INFO level.
 
 .. code-block:: python
 
     import transformers
     transformers.logging.set_verbosity_info()
 
-You can also use the environment variable ``TRANSFORMERS_VERBOSITY`` to override the default verbosity. You can set it to one of the following: ``debug``, ``info``, ``warning``, ``error``, ``critical``. For example:
+You can also use the environment variable ``TRANSFORMERS_VERBOSITY`` to override the default verbosity. You can set it
+to one of the following: ``debug``, ``info``, ``warning``, ``error``, ``critical``. For example:
 
 .. code-block:: bash
                
@@ -32,7 +34,7 @@ verbose to the most verbose), those levels (with their corresponding int values
 - :obj:`transformers.logging.DEBUG` (int value, 10): report all information.
 
 Base setters
-~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autofunction:: transformers.logging.set_verbosity_error
 
@@ -43,7 +45,7 @@ Base setters
 .. autofunction:: transformers.logging.set_verbosity_debug
 
 Other functions
-~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autofunction:: transformers.logging.get_verbosity
 
diff --git a/docs/source/main_classes/model.rst b/docs/source/main_classes/model.rst
index f6500438b1..ce988456ba 100644
--- a/docs/source/main_classes/model.rst
+++ b/docs/source/main_classes/model.rst
@@ -1,5 +1,5 @@
 Models
-----------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 The base classes :class:`~transformers.PreTrainedModel` and :class:`~transformers.TFPreTrainedModel` implement the
 common methods for loading/saving a model either from a local file or directory, or from a pretrained model
@@ -17,36 +17,36 @@ for text generation, :class:`~transformers.generation_utils.GenerationMixin` (fo
 :class:`~transformers.generation_tf_utils.TFGenerationMixin` (for the TensorFlow models)
 
 
-``PreTrainedModel``
-~~~~~~~~~~~~~~~~~~~~~
+PreTrainedModel
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.PreTrainedModel
     :members:
 
 
-``ModuleUtilsMixin``
-~~~~~~~~~~~~~~~~~~~~
+ModuleUtilsMixin
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.modeling_utils.ModuleUtilsMixin
     :members:
 
 
-``TFPreTrainedModel``
-~~~~~~~~~~~~~~~~~~~~~
+TFPreTrainedModel
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFPreTrainedModel
     :members:
 
 
-``TFModelUtilsMixin``
-~~~~~~~~~~~~~~~~~~~~~
+TFModelUtilsMixin
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.modeling_tf_utils.TFModelUtilsMixin
     :members:
 
 
 Generative models
-~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.generation_utils.GenerationMixin
     :members:
diff --git a/docs/source/main_classes/optimizer_schedules.rst b/docs/source/main_classes/optimizer_schedules.rst
index 998100075e..149e745425 100644
--- a/docs/source/main_classes/optimizer_schedules.rst
+++ b/docs/source/main_classes/optimizer_schedules.rst
@@ -1,5 +1,5 @@
 Optimization
-----------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 The ``.optimization`` module provides:
 
@@ -7,29 +7,29 @@ The ``.optimization`` module provides:
 - several schedules in the form of schedule objects that inherit from ``_LRSchedule``:
 - a gradient accumulation class to accumulate the gradients of multiple batches
 
-``AdamW`` (PyTorch)
-~~~~~~~~~~~~~~~~~~~
+AdamW (PyTorch)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.AdamW
     :members:
 
-``AdaFactor`` (PyTorch)
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+AdaFactor (PyTorch)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.Adafactor
 
-``AdamWeightDecay`` (TensorFlow)
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+AdamWeightDecay (TensorFlow)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.AdamWeightDecay
 
 .. autofunction:: transformers.create_optimizer
 
 Schedules
-~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Learning Rate Schedules (Pytorch)
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. autofunction:: transformers.get_constant_schedule
 
@@ -62,16 +62,16 @@ Learning Rate Schedules (Pytorch)
     :target: /imgs/warmup_linear_schedule.png
     :alt:
 
-``Warmup`` (TensorFlow)
-^^^^^^^^^^^^^^^^^^^^^^^
+Warmup (TensorFlow)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: transformers.WarmUp
     :members:
 
 Gradient Strategies
-~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-``GradientAccumulator`` (TensorFlow)
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+GradientAccumulator (TensorFlow)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: transformers.GradientAccumulator
diff --git a/docs/source/main_classes/output.rst b/docs/source/main_classes/output.rst
index fe43c8e59b..f1e8e01b0d 100644
--- a/docs/source/main_classes/output.rst
+++ b/docs/source/main_classes/output.rst
@@ -1,5 +1,5 @@
 Model outputs
--------------
+-----------------------------------------------------------------------------------------------------------------------
 
 PyTorch models have outputs that are instances of subclasses of :class:`~transformers.file_utils.ModelOutput`. Those
 are data structures containing all the information returned by the model, but that can also be used as tuples or
@@ -44,98 +44,217 @@ values. Here for instance, it has two keys that are ``loss`` and ``logits``.
 We document here the generic model outputs that are used by more than one model type. Specific output types are
 documented on their corresponding model page.
 
-``ModelOutput``
-~~~~~~~~~~~~~~~
+ModelOutput
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.file_utils.ModelOutput
     :members:
 
-``BaseModelOutput``
-~~~~~~~~~~~~~~~~~~~
+
+BaseModelOutput
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.modeling_outputs.BaseModelOutput
     :members:
 
-``BaseModelOutputWithPooling``
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+BaseModelOutputWithPooling
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.modeling_outputs.BaseModelOutputWithPooling
     :members:
 
-``BaseModelOutputWithPast``
-~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+BaseModelOutputWithPast
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.modeling_outputs.BaseModelOutputWithPast
     :members:
 
-``Seq2SeqModelOutput``
-~~~~~~~~~~~~~~~~~~~~~~
+Seq2SeqModelOutput
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.modeling_outputs.Seq2SeqModelOutput
     :members:
 
-``CausalLMOutput``
-~~~~~~~~~~~~~~~~~~
+
+CausalLMOutput
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.modeling_outputs.CausalLMOutput
     :members:
 
-``CausalLMOutputWithPast``
-~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+CausalLMOutputWithPast
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.modeling_outputs.CausalLMOutputWithPast
     :members:
 
-``MaskedLMOutput``
-~~~~~~~~~~~~~~~~~~
+
+MaskedLMOutput
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.modeling_outputs.MaskedLMOutput
     :members:
 
-``Seq2SeqLMOutput``
-~~~~~~~~~~~~~~~~~~~
+
+Seq2SeqLMOutput
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.modeling_outputs.Seq2SeqLMOutput
     :members:
 
-``NextSentencePredictorOutput``
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+NextSentencePredictorOutput
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.modeling_outputs.NextSentencePredictorOutput
     :members:
 
-``SequenceClassifierOutput``
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+SequenceClassifierOutput
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.modeling_outputs.SequenceClassifierOutput
     :members:
 
-``Seq2SeqSequenceClassifierOutput``
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Seq2SeqSequenceClassifierOutput
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.modeling_outputs.Seq2SeqSequenceClassifierOutput
     :members:
 
-``MultipleChoiceModelOutput``
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+MultipleChoiceModelOutput
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.modeling_outputs.MultipleChoiceModelOutput
     :members:
 
-``TokenClassifierOutput``
-~~~~~~~~~~~~~~~~~~~~~~~~~
+
+TokenClassifierOutput
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.modeling_outputs.TokenClassifierOutput
     :members:
 
-``QuestionAnsweringModelOutput``
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+QuestionAnsweringModelOutput
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.modeling_outputs.QuestionAnsweringModelOutput
     :members:
 
-``Seq2SeqQuestionAnsweringModelOutput``
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Seq2SeqQuestionAnsweringModelOutput
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.modeling_outputs.Seq2SeqQuestionAnsweringModelOutput
     :members:
+
+
+TFBaseModelOutput
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.modeling_tf_outputs.TFBaseModelOutput
+    :members:
+
+
+TFBaseModelOutputWithPooling
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.modeling_tf_outputs.TFBaseModelOutputWithPooling
+    :members:
+
+
+TFBaseModelOutputWithPast
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.modeling_tf_outputs.TFBaseModelOutputWithPast
+    :members:
+
+
+TFSeq2SeqModelOutput
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.modeling_tf_outputs.TFSeq2SeqModelOutput
+    :members:
+
+
+TFCausalLMOutput
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.modeling_tf_outputs.TFCausalLMOutput
+    :members:
+
+
+TFCausalLMOutputWithPast
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.modeling_tf_outputs.TFCausalLMOutputWithPast
+    :members:
+
+
+TFMaskedLMOutput
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.modeling_tf_outputs.TFMaskedLMOutput
+    :members:
+
+
+TFSeq2SeqLMOutput
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.modeling_tf_outputs.TFSeq2SeqLMOutput
+    :members:
+
+
+TFNextSentencePredictorOutput
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.modeling_tf_outputs.TFNextSentencePredictorOutput
+    :members:
+
+
+TFSequenceClassifierOutput
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.modeling_tf_outputs.TFSequenceClassifierOutput
+    :members:
+
+
+TFSeq2SeqSequenceClassifierOutput
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.modeling_tf_outputs.TFSeq2SeqSequenceClassifierOutput
+    :members:
+
+
+TFMultipleChoiceModelOutput
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.modeling_tf_outputs.TFMultipleChoiceModelOutput
+    :members:
+
+
+TFTokenClassifierOutput
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.modeling_tf_outputs.TFTokenClassifierOutput
+    :members:
+
+
+TFQuestionAnsweringModelOutput
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.modeling_tf_outputs.TFQuestionAnsweringModelOutput
+    :members:
+
+
+TFSeq2SeqQuestionAnsweringModelOutput
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.modeling_tf_outputs.TFSeq2SeqQuestionAnsweringModelOutput
+    :members:
diff --git a/docs/source/main_classes/pipelines.rst b/docs/source/main_classes/pipelines.rst
index 1ca9dd59f6..4f28a49ec0 100644
--- a/docs/source/main_classes/pipelines.rst
+++ b/docs/source/main_classes/pipelines.rst
@@ -1,5 +1,5 @@
 Pipelines
-----------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 The pipelines are a great and easy way to use models for inference. These pipelines are objects that abstract most
 of the complex code from the library, offering a simple API dedicated to several tasks, including Named Entity
@@ -24,7 +24,7 @@ There are two categories of pipeline abstractions to be aware about:
     - :class:`~transformers.Text2TextGenerationPipeline`
 
 The pipeline abstraction
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 The `pipeline` abstraction is a wrapper around all the other available pipelines. It is instantiated as any
 other pipeline but requires an additional argument which is the `task`.
@@ -33,10 +33,10 @@ other pipeline but requires an additional argument which is the `task`.
 
 
 The task specific pipelines
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 ConversationalPipeline
-==========================================
+=======================================================================================================================
 
 .. autoclass:: transformers.Conversation
 
@@ -45,76 +45,76 @@ ConversationalPipeline
     :members:
 
 FeatureExtractionPipeline
-==========================================
+=======================================================================================================================
 
 .. autoclass:: transformers.FeatureExtractionPipeline
     :special-members: __call__
     :members:
 
 FillMaskPipeline
-==========================================
+=======================================================================================================================
 
 .. autoclass:: transformers.FillMaskPipeline
     :special-members: __call__
     :members:
 
 NerPipeline
-==========================================
+=======================================================================================================================
 
 This class is an alias of the :class:`~transformers.TokenClassificationPipeline` defined below. Please refer to that
 pipeline for documentation and usage examples.
 
 QuestionAnsweringPipeline
-==========================================
+=======================================================================================================================
 
 .. autoclass:: transformers.QuestionAnsweringPipeline
     :special-members: __call__
     :members:
 
 SummarizationPipeline
-==========================================
+=======================================================================================================================
 
 .. autoclass:: transformers.SummarizationPipeline
     :special-members: __call__
     :members:
 
 TextClassificationPipeline
-==========================================
+=======================================================================================================================
 
 .. autoclass:: transformers.TextClassificationPipeline
     :special-members: __call__
     :members:
 
 TextGenerationPipeline
-==========================================
+=======================================================================================================================
 
 .. autoclass:: transformers.TextGenerationPipeline
     :special-members: __call__
     :members:
 
 Text2TextGenerationPipeline
-==========================================
+=======================================================================================================================
 
 .. autoclass:: transformers.Text2TextGenerationPipeline
     :special-members: __call__
     :members:
 
 TokenClassificationPipeline
-==========================================
+=======================================================================================================================
 
 .. autoclass:: transformers.TokenClassificationPipeline
     :special-members: __call__
     :members:
 
 ZeroShotClassificationPipeline
-==========================================
+=======================================================================================================================
 
 .. autoclass:: transformers.ZeroShotClassificationPipeline
     :special-members: __call__
     :members:
 
 Parent class: :obj:`Pipeline`
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.Pipeline
     :members:
diff --git a/docs/source/main_classes/processors.rst b/docs/source/main_classes/processors.rst
index 0e318eff07..9167a43ef3 100644
--- a/docs/source/main_classes/processors.rst
+++ b/docs/source/main_classes/processors.rst
@@ -1,11 +1,11 @@
 Processors
-----------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 This library includes processors for several traditional tasks. These processors can be used to process a dataset into
 examples that can be fed to a model.
 
 Processors
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 All processors follow the same architecture which is that of the
 :class:`~transformers.data.processors.utils.DataProcessor`. The processor returns a list
@@ -26,7 +26,7 @@ of :class:`~transformers.data.processors.utils.InputExample`. These
 
 
 GLUE
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 `General Language Understanding Evaluation (GLUE) <https://gluebenchmark.com/>`__ is a benchmark that evaluates
 the performance of models across a diverse set of existing NLU tasks. It was released together with the paper
@@ -52,13 +52,13 @@ Additionally, the following method  can be used to load values from a data file
 .. automethod:: transformers.data.processors.glue.glue_convert_examples_to_features
 
 Example usage
-^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 An example using these processors is given in the `run_glue.py <https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_glue.py>`__ script.
 
 
 XNLI
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 `The Cross-Lingual NLI Corpus (XNLI) <https://www.nyu.edu/projects/bowman/xnli/>`__ is a benchmark that evaluates
 the quality of cross-lingual text representations. 
@@ -78,7 +78,7 @@ An example using these processors is given in the
 
 
 SQuAD
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 `The Stanford Question Answering Dataset (SQuAD) <https://rajpurkar.github.io/SQuAD-explorer//>`__ is a benchmark that evaluates
 the performance of models on question answering. Two versions are available, v1.1 and v2.0. The first version (v1.1) was released together with the paper
@@ -88,7 +88,7 @@ the paper `Know What You Don't Know: Unanswerable Questions for SQuAD <https://a
 This library hosts a processor for each of the two versions:
 
 Processors
-^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Those processors are:
     - :class:`~transformers.data.processors.utils.SquadV1Processor`
@@ -109,7 +109,7 @@ Examples are given below.
 
 
 Example usage
-^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 Here is an example using the processors as well as the conversion method using data files:
 
 Example::
diff --git a/docs/source/main_classes/tokenizer.rst b/docs/source/main_classes/tokenizer.rst
index a15e516df3..30d74ab457 100644
--- a/docs/source/main_classes/tokenizer.rst
+++ b/docs/source/main_classes/tokenizer.rst
@@ -1,5 +1,5 @@
 Tokenizer
-----------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 A tokenizer is in charge of preparing the inputs for a model. The library contains tokenizers for all the models. Most
 of the tokenizers are available in two flavors: a full python implementation and a "Fast" implementation based on the
@@ -36,24 +36,24 @@ alignment methods which can be used to map between the original string (characte
 getting the index of the token comprising a given character or the span of characters corresponding to a given token).
 
 
-``PreTrainedTokenizer``
-~~~~~~~~~~~~~~~~~~~~~~~~
+PreTrainedTokenizer
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.PreTrainedTokenizer
     :special-members: __call__
     :members:
 
 
-``PreTrainedTokenizerFast``
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+PreTrainedTokenizerFast
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.PreTrainedTokenizerFast
     :special-members: __call__
     :members:
 
 
-``BatchEncoding``
-~~~~~~~~~~~~~~~~~~~~~~~~
+BatchEncoding
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.BatchEncoding
     :members:
diff --git a/docs/source/main_classes/trainer.rst b/docs/source/main_classes/trainer.rst
index e6c2c27222..76cd1f34d7 100644
--- a/docs/source/main_classes/trainer.rst
+++ b/docs/source/main_classes/trainer.rst
@@ -1,75 +1,75 @@
-Trainer
-----------
-
-The :class:`~transformers.Trainer` and :class:`~transformers.TFTrainer` classes provide an API for feature-complete
-training in most standard use cases. It's used in most of the :doc:`example scripts <../examples>`.
-
-Before instantiating your :class:`~transformers.Trainer`/:class:`~transformers.TFTrainer`, create a 
-:class:`~transformers.TrainingArguments`/:class:`~transformers.TFTrainingArguments` to access all the points of
-customization during training.
-
-The API supports distributed training on multiple GPUs/TPUs, mixed precision through `NVIDIA Apex
-<https://github.com/NVIDIA/apex>`__ for PyTorch and :obj:`tf.keras.mixed_precision` for TensorFlow.
-
-Both :class:`~transformers.Trainer` and :class:`~transformers.TFTrainer` contain the basic training loop supporting the
-previous features. To inject custom behavior you can subclass them and override the following methods:
-
-- **get_train_dataloader**/**get_train_tfdataset** -- Creates the training DataLoader (PyTorch) or TF Dataset.
-- **get_eval_dataloader**/**get_eval_tfdataset** -- Creates the evaulation DataLoader (PyTorch) or TF Dataset.
-- **get_test_dataloader**/**get_test_tfdataset** -- Creates the test DataLoader (PyTorch) or TF Dataset.
-- **log** -- Logs information on the various objects watching training.
-- **setup_wandb** -- Setups wandb (see `here <https://docs.wandb.com/huggingface>`__ for more information).
-- **create_optimizer_and_scheduler** -- Setups the optimizer and learning rate scheduler if they were not passed at
-  init.
-- **compute_loss** - Computes the loss on a batch of training inputs.
-- **training_step** -- Performs a training step.
-- **prediction_step** -- Performs an evaluation/test step.
-- **run_model** (TensorFlow only) -- Basic pass through the model.
-- **evaluate** -- Runs an evaluation loop and returns metrics.
-- **predict** -- Returns predictions (with metrics if labels are available) on a test set.
-
-Here is an example of how to customize :class:`~transformers.Trainer` using a custom loss function:
-
-.. code-block:: python
-
-    from transformers import Trainer
-    class MyTrainer(Trainer):
-        def compute_loss(self, model, inputs):
-            labels = inputs.pop("labels")
-            outputs = models(**inputs)
-            logits = outputs[0]
-            return my_custom_loss(logits, labels)
-
-
-``Trainer`` 
-~~~~~~~~~~~
-
-.. autoclass:: transformers.Trainer
-    :members:
-
-``TFTrainer`` 
-~~~~~~~~~~~~~
-
-.. autoclass:: transformers.TFTrainer
-    :members:
-
-``TrainingArguments``
-~~~~~~~~~~~~~~~~~~~~~
-
-.. autoclass:: transformers.TrainingArguments
-    :members:
-
-``TFTrainingArguments``
-~~~~~~~~~~~~~~~~~~~~~~~
-
-.. autoclass:: transformers.TFTrainingArguments
-    :members:
-
-Utilities
-~~~~~~~~~
-
-.. autoclass:: transformers.EvalPrediction
-
-.. autofunction:: transformers.set_seed
-
-.. autofunction:: transformers.torch_distributed_zero_first
+Trainer
+-----------------------------------------------------------------------------------------------------------------------
+
+The :class:`~transformers.Trainer` and :class:`~transformers.TFTrainer` classes provide an API for feature-complete
+training in most standard use cases. It's used in most of the :doc:`example scripts <../examples>`.
+
+Before instantiating your :class:`~transformers.Trainer`/:class:`~transformers.TFTrainer`, create a 
+:class:`~transformers.TrainingArguments`/:class:`~transformers.TFTrainingArguments` to access all the points of
+customization during training.
+
+The API supports distributed training on multiple GPUs/TPUs, mixed precision through `NVIDIA Apex
+<https://github.com/NVIDIA/apex>`__ for PyTorch and :obj:`tf.keras.mixed_precision` for TensorFlow.
+
+Both :class:`~transformers.Trainer` and :class:`~transformers.TFTrainer` contain the basic training loop supporting the
+previous features. To inject custom behavior you can subclass them and override the following methods:
+
+- **get_train_dataloader**/**get_train_tfdataset** -- Creates the training DataLoader (PyTorch) or TF Dataset.
+- **get_eval_dataloader**/**get_eval_tfdataset** -- Creates the evaulation DataLoader (PyTorch) or TF Dataset.
+- **get_test_dataloader**/**get_test_tfdataset** -- Creates the test DataLoader (PyTorch) or TF Dataset.
+- **log** -- Logs information on the various objects watching training.
+- **setup_wandb** -- Setups wandb (see `here <https://docs.wandb.com/huggingface>`__ for more information).
+- **create_optimizer_and_scheduler** -- Setups the optimizer and learning rate scheduler if they were not passed at
+  init.
+- **compute_loss** - Computes the loss on a batch of training inputs.
+- **training_step** -- Performs a training step.
+- **prediction_step** -- Performs an evaluation/test step.
+- **run_model** (TensorFlow only) -- Basic pass through the model.
+- **evaluate** -- Runs an evaluation loop and returns metrics.
+- **predict** -- Returns predictions (with metrics if labels are available) on a test set.
+
+Here is an example of how to customize :class:`~transformers.Trainer` using a custom loss function:
+
+.. code-block:: python
+
+    from transformers import Trainer
+    class MyTrainer(Trainer):
+        def compute_loss(self, model, inputs):
+            labels = inputs.pop("labels")
+            outputs = models(**inputs)
+            logits = outputs[0]
+            return my_custom_loss(logits, labels)
+
+
+Trainer
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.Trainer
+    :members:
+
+TFTrainer
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.TFTrainer
+    :members:
+
+TrainingArguments
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.TrainingArguments
+    :members:
+
+TFTrainingArguments
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.TFTrainingArguments
+    :members:
+
+Utilities
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.EvalPrediction
+
+.. autofunction:: transformers.set_seed
+
+.. autofunction:: transformers.torch_distributed_zero_first
diff --git a/docs/source/model_doc/albert.rst b/docs/source/model_doc/albert.rst
index c78426d0c7..4f5b9c4338 100644
--- a/docs/source/model_doc/albert.rst
+++ b/docs/source/model_doc/albert.rst
@@ -1,15 +1,16 @@
 ALBERT
-----------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 Overview
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-The ALBERT model was proposed in `ALBERT: A Lite BERT for Self-supervised Learning of Language Representations <https://arxiv.org/abs/1909.11942>`_
-by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut. It presents
-two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT:
+The ALBERT model was proposed in `ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
+<https://arxiv.org/abs/1909.11942>`__ by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma,
+Radu Soricut. It presents two parameter-reduction techniques to lower memory consumption and increase the training
+speed of BERT:
 
-- Splitting the embedding matrix into two smaller matrices
-- Using repeating layers split among groups
+- Splitting the embedding matrix into two smaller matrices.
+- Using repeating layers split among groups.
 
 The abstract from the paper is the following:
 
@@ -30,17 +31,17 @@ Tips:
   similar to a BERT-like architecture with the same number of hidden layers as it has to iterate through the same
   number of (repeating) layers.
 
-The original code can be found `here <https://github.com/google-research/ALBERT>`_.
+The original code can be found `here <https://github.com/google-research/ALBERT>`__.
 
 AlbertConfig
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.AlbertConfig
     :members:
 
 
 AlbertTokenizer
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.AlbertTokenizer
     :members: build_inputs_with_special_tokens, get_special_tokens_mask,
@@ -48,7 +49,7 @@ AlbertTokenizer
 
 
 Albert specific outputs
-~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.modeling_albert.AlbertForPreTrainingOutput
     :members:
@@ -58,98 +59,98 @@ Albert specific outputs
 
 
 AlbertModel
-~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.AlbertModel
-    :members:
+    :members: forward
 
 
 AlbertForPreTraining
-~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.AlbertForPreTraining
-    :members:
+    :members: forward
 
 
 AlbertForMaskedLM
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.AlbertForMaskedLM
-    :members:
+    :members: forward
 
 
 AlbertForSequenceClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.AlbertForSequenceClassification
-    :members:
+    :members: forward
 
 
 AlbertForMultipleChoice
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.AlbertForMultipleChoice
     :members:
 
 
 AlbertForTokenClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.AlbertForTokenClassification
-    :members:
+    :members: forward
 
 
 AlbertForQuestionAnswering
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.AlbertForQuestionAnswering
-    :members:
+    :members: forward
 
 
 TFAlbertModel
-~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFAlbertModel
-    :members:
+    :members: call
 
 
 TFAlbertForPreTraining
-~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFAlbertForPreTraining
-    :members:
+    :members: call
 
 
 TFAlbertForMaskedLM
-~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFAlbertForMaskedLM
-    :members:
+    :members: call
 
 
 TFAlbertForSequenceClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFAlbertForSequenceClassification
-    :members:
+    :members: call
 
 
 TFAlbertForMultipleChoice
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFAlbertForMultipleChoice
-    :members:
+    :members: call
 
 
 TFAlbertForTokenClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFAlbertForTokenClassification
-    :members:
+    :members: call
 
 
 TFAlbertForQuestionAnswering
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFAlbertForQuestionAnswering
-    :members:
+    :members: call
diff --git a/docs/source/model_doc/auto.rst b/docs/source/model_doc/auto.rst
index e87191dd73..d70fdd78dc 100644
--- a/docs/source/model_doc/auto.rst
+++ b/docs/source/model_doc/auto.rst
@@ -1,5 +1,5 @@
 AutoClasses
------------
+-----------------------------------------------------------------------------------------------------------------------
 
 In many cases, the architecture you want to use can be guessed from the name or the path of the pretrained model you
 are supplying to the :obj:`from_pretrained()` method.
@@ -20,112 +20,112 @@ There is one class of :obj:`AutoModel` for each task, and for each backend (PyTo
 
 
 AutoConfig
-~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.AutoConfig
     :members:
 
 
 AutoTokenizer
-~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.AutoTokenizer
     :members:
 
 
 AutoModel
-~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.AutoModel
     :members:
 
 
 AutoModelForPreTraining
-~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.AutoModelForPreTraining
     :members:
 
 
 AutoModelWithLMHead
-~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.AutoModelWithLMHead
     :members:
 
 
 AutoModelForSequenceClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.AutoModelForSequenceClassification
     :members:
 
 
 AutoModelForMultipleChoice
-~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.AutoModelForMultipleChoice
     :members:
 
 
 AutoModelForTokenClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.AutoModelForTokenClassification
     :members:
 
 
 AutoModelForQuestionAnswering
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.AutoModelForQuestionAnswering
     :members:
 
 
 TFAutoModel
-~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFAutoModel
     :members:
 
 
 TFAutoModelForPreTraining
-~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFAutoModelForPreTraining
     :members:
 
 
 TFAutoModelWithLMHead
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFAutoModelWithLMHead
     :members:
 
 
 TFAutoModelForSequenceClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFAutoModelForSequenceClassification
     :members:
 
 
 TFAutoModelForMultipleChoice
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFAutoModelForMultipleChoice
     :members:
 
 
 TFAutoModelForTokenClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFAutoModelForTokenClassification
     :members:
 
 
 TFAutoModelForQuestionAnswering
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFAutoModelForQuestionAnswering
     :members:
diff --git a/docs/source/model_doc/bart.rst b/docs/source/model_doc/bart.rst
index 69a502cf01..f79cc29012 100644
--- a/docs/source/model_doc/bart.rst
+++ b/docs/source/model_doc/bart.rst
@@ -1,11 +1,11 @@
 Bart
-----------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 **DISCLAIMER:** If you see something strange,
 file a `Github Issue <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ and assign
 @sshleifer
 
 Overview
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 The Bart model was `proposed <https://arxiv.org/abs/1910.13461>`_ by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer on 29 Oct, 2019.
 According to the abstract,
@@ -18,7 +18,7 @@ The Authors' code can be found `here <https://github.com/pytorch/fairseq/tree/ma
 
 
 Implementation Notes
-~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 - Bart doesn't use :obj:`token_type_ids` for sequence classification. Use BartTokenizer.encode to get the proper splitting.
 - The forward pass of ``BartModel`` will create decoder inputs (using the helper function ``transformers.modeling_bart._prepare_bart_decoder_inputs``)  if they are not passed. This is different than some other modeling APIs.
@@ -29,21 +29,21 @@ Implementation Notes
 
 
 BartForConditionalGeneration
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.BartForConditionalGeneration
     :members: forward
 
 
 BartConfig
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.BartConfig
     :members:
 
 
 BartTokenizer
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.BartTokenizer
     :members:
@@ -51,7 +51,7 @@ BartTokenizer
 
 
 BartModel
-~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.BartModel
     :members: forward
@@ -60,14 +60,14 @@ BartModel
 
 
 BartForSequenceClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.BartForSequenceClassification
     :members: forward
 
 
 BartForQuestionAnswering
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.BartForQuestionAnswering
     :members: forward
diff --git a/docs/source/model_doc/bert.rst b/docs/source/model_doc/bert.rst
index 13bc47e260..6a5f380684 100644
--- a/docs/source/model_doc/bert.rst
+++ b/docs/source/model_doc/bert.rst
@@ -1,13 +1,13 @@
 BERT
-----------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 Overview
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-The BERT model was proposed in `BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding <https://arxiv.org/abs/1810.04805>`__
-by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. It's a bidirectional transformer
-pre-trained using a combination of masked language modeling objective and next sentence prediction
-on a large corpus comprising the Toronto Book Corpus and Wikipedia.
+The BERT model was proposed in `BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
+<https://arxiv.org/abs/1810.04805>`__ by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. It's a
+bidirectional transformer pretrained using a combination of masked language modeling objective and next sentence
+prediction on a large corpus comprising the Toronto Book Corpus and Wikipedia.
 
 The abstract from the paper is the following:
 
@@ -27,20 +27,20 @@ Tips:
 
 - BERT is a model with absolute position embeddings so it's usually advised to pad the inputs on
   the right rather than the left.
-- BERT was trained with the masked language modeling (MLM) and next sentence prediction (NSP) objectives. It is efficient at predicting masked
-  tokens and at NLU in general, but is not optimal for text generation.
+- BERT was trained with the masked language modeling (MLM) and next sentence prediction (NSP) objectives. It is
+  efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation.
 
-The original code can be found `here <https://github.com/google-research/bert>`_.
+The original code can be found `here <https://github.com/google-research/bert>`__.
 
 BertConfig
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.BertConfig
     :members:
 
 
 BertTokenizer
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.BertTokenizer
     :members: build_inputs_with_special_tokens, get_special_tokens_mask,
@@ -48,14 +48,14 @@ BertTokenizer
 
 
 BertTokenizerFast
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.BertTokenizerFast
     :members:
 
 
 Bert specific outputs
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.modeling_bert.BertForPreTrainingOutput
     :members:
@@ -65,127 +65,126 @@ Bert specific outputs
 
 
 BertModel
-~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.BertModel
-    :members:
+    :members: forward
 
 
 BertForPreTraining
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.BertForPreTraining
-    :members:
+    :members: forward
 
 
 BertModelLMHeadModel
-~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.BertLMHeadModel
-    :members:
+    :members: forward
 
 
 BertForMaskedLM
-~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.BertForMaskedLM
-    :members:
+    :members: forward
 
 
 BertForNextSentencePrediction
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.BertForNextSentencePrediction
-    :members:
+    :members: forward
 
 
 BertForSequenceClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.BertForSequenceClassification
-    :members:
+    :members: forward
 
 
 BertForMultipleChoice
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.BertForMultipleChoice
-    :members:
+    :members: forward
 
 
 BertForTokenClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.BertForTokenClassification
-    :members:
+    :members: forward
 
 
 BertForQuestionAnswering
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.BertForQuestionAnswering
-    :members:
+    :members: forward
 
 
 TFBertModel
-~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFBertModel
-    :members:
+    :members: call
 
 
 TFBertForPreTraining
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFBertForPreTraining
-    :members:
+    :members: call
 
 
 TFBertModelLMHeadModel
-~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFBertLMHeadModel
-    :members:
+    :members: call
 
 
 TFBertForMaskedLM
-~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFBertForMaskedLM
-    :members:
+    :members: call
 
 
 TFBertForNextSentencePrediction
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFBertForNextSentencePrediction
-    :members:
+    :members: call
 
 
 TFBertForSequenceClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFBertForSequenceClassification
-    :members:
+    :members: call
 
 
 TFBertForMultipleChoice
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFBertForMultipleChoice
-    :members:
+    :members: call
 
 
 TFBertForTokenClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFBertForTokenClassification
-    :members:
+    :members: call
 
 
 TFBertForQuestionAnswering
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFBertForQuestionAnswering
-    :members:
-
+    :members: call
diff --git a/docs/source/model_doc/bertgeneration.rst b/docs/source/model_doc/bertgeneration.rst
index b9b40d90ef..ee2591e1b6 100644
--- a/docs/source/model_doc/bertgeneration.rst
+++ b/docs/source/model_doc/bertgeneration.rst
@@ -1,24 +1,36 @@
 BertGeneration
-----------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 Overview
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-The BertGeneration model is a BERT model that can be leveraged for sequence-to-sequence tasks using :class:`~transformers.EncoderDecoderModel` as proposed in `Leveraging Pre-trained Checkpoints for Sequence Generation Tasks <https://arxiv.org/abs/1907.12461>`__ by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
+The BertGeneration model is a BERT model that can be leveraged for sequence-to-sequence tasks using
+:class:`~transformers.EncoderDecoderModel` as proposed in `Leveraging Pre-trained Checkpoints for Sequence Generation
+Tasks <https://arxiv.org/abs/1907.12461>`__ by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
 
 The abstract from the paper is the following:
 
-*Unsupervised pre-training of large neural models has recently revolutionized Natural Language Processing. By warm-starting from the publicly released checkpoints, NLP practitioners have pushed the state-of-the-art on multiple benchmarks while saving significant amounts of compute time. So far the focus has been mainly on the Natural Language Understanding tasks. In this paper, we demonstrate the efficacy of pre-trained checkpoints for Sequence Generation. We developed a Transformer-based sequence-to-sequence model that is compatible with publicly available pre-trained BERT, GPT-2 and RoBERTa checkpoints and conducted an extensive empirical study on the utility of initializing our model, both encoder and decoder, with these checkpoints. Our models result in new state-of-the-art results on Machine Translation, Text Summarization, Sentence Splitting, and Sentence Fusion.*
+*Unsupervised pre-training of large neural models has recently revolutionized Natural Language Processing. By
+warm-starting from the publicly released checkpoints, NLP practitioners have pushed the state-of-the-art on multiple
+benchmarks while saving significant amounts of compute time. So far the focus has been mainly on the Natural Language
+Understanding tasks. In this paper, we demonstrate the efficacy of pre-trained checkpoints for Sequence Generation. We
+developed a Transformer-based sequence-to-sequence model that is compatible with publicly available pre-trained BERT,
+GPT-2 and RoBERTa checkpoints and conducted an extensive empirical study on the utility of initializing our model, both
+encoder and decoder, with these checkpoints. Our models result in new state-of-the-art results on Machine Translation,
+Text Summarization, Sentence Splitting, and Sentence Fusion.*
 
 Usage:
 
-- The model can be used in combination with the :class:`~transformers.EncoderDecoderModel` to leverage two bert pretrained bert checkpoints for subsequent fine-tuning.
+- The model can be used in combination with the :class:`~transformers.EncoderDecoderModel` to leverage two pretrained
+  BERT checkpoints for subsequent fine-tuning.
 
-::
+:: code-block
   
   # leverage checkpoints for Bert2Bert model...
-  encoder = BertGenerationEncoder.from_pretrained("bert-large-uncased", bos_token_id=101, eos_token_id=102)  # use BERT's cls token as BOS token and sep token as EOS token
-  decoder = BertGenerationDecoder.from_pretrained("bert-large-uncased", add_cross_attention=True, is_decoder=True, bos_token_id=101, eos_token_id=102)  # add cross attention layers and use BERT's cls token as BOS token and sep token as EOS token
+  # use BERT's cls token as BOS token and sep token as EOS token
+  encoder = BertGenerationEncoder.from_pretrained("bert-large-uncased", bos_token_id=101, eos_token_id=102)
+  # add cross attention layers and use BERT's cls token as BOS token and sep token as EOS token
+  decoder = BertGenerationDecoder.from_pretrained("bert-large-uncased", add_cross_attention=True, is_decoder=True, bos_token_id=101, eos_token_id=102)
   bert2bert = EncoderDecoderModel(encoder=encoder, decoder=decoder)
   
   # create tokenizer...
@@ -32,10 +44,10 @@ Usage:
   loss.backward()
 
 
-- Pretrained :class:`~transformers.EncoderDecoderModel` are also directly available in the model hub, *e.g.*:
+- Pretrained :class:`~transformers.EncoderDecoderModel` are also directly available in the model hub, e.g.,
 
 
-::
+:: code-block
 
   # instantiate sentence fusion model
   sentence_fuser = EncoderDecoderModel.from_pretrained("google/roberta2roberta_L-24_discofuse")
@@ -50,33 +62,35 @@ Usage:
 
 Tips:
 
-- :class:`~transformers.BertGenerationEncoder` and :class:`~transformers.BertGenerationDecoder`  should be used in combination with :class:`~transformers.EncoderDecoder`.
-- For summarization, sentence splitting, sentence fusion and translation, no special tokens are required for the input. Therefore, no EOS token should be added to the end of the input.
+- :class:`~transformers.BertGenerationEncoder` and :class:`~transformers.BertGenerationDecoder` should be used in
+  combination with :class:`~transformers.EncoderDecoder`.
+- For summarization, sentence splitting, sentence fusion and translation, no special tokens are required for the input.
+  Therefore, no EOS token should be added to the end of the input.
 
 The original code can be found `here <https://tfhub.dev/s?module-type=text-generation&subtype=module,placeholder>`__.
 
 BertGenerationConfig
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.BertGenerationConfig
     :members:
 
 
 BertGenerationTokenizer
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.BertGenerationTokenizer
-    :members: 
+    :members: save_vocabulary
 
 BertGenerationEncoder
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.BertGenerationEncoder
-    :members:
+    :members: forward
 
 
 BertGenerationDecoder
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.BertGenerationDecoder
-    :members:
+    :members: forward
diff --git a/docs/source/model_doc/camembert.rst b/docs/source/model_doc/camembert.rst
index 8f0d578848..83b9e47561 100644
--- a/docs/source/model_doc/camembert.rst
+++ b/docs/source/model_doc/camembert.rst
@@ -1,8 +1,8 @@
 CamemBERT
-----------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 Overview
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 The CamemBERT model was proposed in `CamemBERT: a Tasty French Language Model <https://arxiv.org/abs/1911.03894>`__
 by Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la
@@ -22,20 +22,20 @@ pretrained model for CamemBERT hoping to foster research and downstream applicat
 
 Tips:
 
-- This implementation is the same as RoBERTa. Refer to the `documentation of RoBERTa <./roberta.html>`__ for usage
+- This implementation is the same as RoBERTa. Refer to the :doc:`documentation of RoBERTa <roberta>` for usage
   examples as well as the information relative to the inputs and outputs.
 
-The original code can be found `here <https://camembert-model.fr/>`_.
+The original code can be found `here <https://camembert-model.fr/>`__.
 
 CamembertConfig
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.CamembertConfig
     :members:
 
 
 CamembertTokenizer
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.CamembertTokenizer
     :members: build_inputs_with_special_tokens, get_special_tokens_mask,
@@ -43,91 +43,91 @@ CamembertTokenizer
 
 
 CamembertModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.CamembertModel
     :members:
 
 
 CamembertForCausalLM
-~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.CamembertForCausalLM
     :members:
 
 
 CamembertForMaskedLM
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.CamembertForMaskedLM
     :members:
 
 
 CamembertForSequenceClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.CamembertForSequenceClassification
     :members:
 
 
 CamembertForMultipleChoice
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.CamembertForMultipleChoice
     :members:
 
 
 CamembertForTokenClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.CamembertForTokenClassification
     :members:
 
 
 CamembertForQuestionAnswering
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.CamembertForQuestionAnswering
     :members:
 
 
 TFCamembertModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFCamembertModel
     :members:
 
 
 TFCamembertForMaskedLM
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFCamembertForMaskedLM
     :members:
 
 
 TFCamembertForSequenceClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFCamembertForSequenceClassification
     :members:
 
 
 TFCamembertForMultipleChoice
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFCamembertForMultipleChoice
     :members:
 
 
 TFCamembertForTokenClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFCamembertForTokenClassification
     :members:
 
 
 TFCamembertForQuestionAnswering
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFCamembertForQuestionAnswering
     :members:
\ No newline at end of file
diff --git a/docs/source/model_doc/ctrl.rst b/docs/source/model_doc/ctrl.rst
index 2683320eb3..370f1d2ed8 100644
--- a/docs/source/model_doc/ctrl.rst
+++ b/docs/source/model_doc/ctrl.rst
@@ -1,12 +1,12 @@
 CTRL
-----------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 Overview
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-CTRL model was proposed in `CTRL: A Conditional Transformer Language Model for Controllable Generation <https://arxiv.org/abs/1909.05858>`_
-by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher.
-It's a causal (unidirectional) transformer pre-trained using language modeling on a very large
+CTRL model was proposed in `CTRL: A Conditional Transformer Language Model for Controllable Generation
+<https://arxiv.org/abs/1909.05858>`_ by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and
+Richard Socher. It's a causal (unidirectional) transformer pre-trained using language modeling on a very large
 corpus of ~140 GB of text data with the first token reserved as a control code (such as Links, Books, Wikipedia etc.).
 
 The abstract from the paper is the following:
@@ -31,50 +31,50 @@ Tips:
   it can be observed in the `run_generation.py` example script.
 - The PyTorch models can take the `past` as input, which is the previously computed key/value attention pairs. Using
   this `past` value prevents the model from re-computing pre-computed values in the context of text generation.
-  See `reusing the past in generative models <../quickstart.html#using-the-past>`_ for more information on the usage
+  See `reusing the past in generative models <../quickstart.html#using-the-past>`__ for more information on the usage
   of this argument.
 
-The original code can be found `here <https://github.com/salesforce/ctrl>`_.
+The original code can be found `here <https://github.com/salesforce/ctrl>`__.
 
 
 CTRLConfig
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.CTRLConfig
     :members:
 
 
 CTRLTokenizer
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.CTRLTokenizer
     :members: save_vocabulary
 
 
 CTRLModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.CTRLModel
-    :members:
+    :members: forward
 
 
 CTRLLMHeadModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.CTRLLMHeadModel
-    :members:
+    :members: forward
 
 
 TFCTRLModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFCTRLModel
-    :members:
+    :members: call
 
 
 TFCTRLLMHeadModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFCTRLLMHeadModel
-    :members:
+    :members: call
 
diff --git a/docs/source/model_doc/dialogpt.rst b/docs/source/model_doc/dialogpt.rst
index 4381698829..ca7ed9f0ea 100644
--- a/docs/source/model_doc/dialogpt.rst
+++ b/docs/source/model_doc/dialogpt.rst
@@ -1,8 +1,8 @@
 DialoGPT
-----------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 Overview
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 DialoGPT was proposed in
 `DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation <https://arxiv.org/abs/1911.00536>`_
diff --git a/docs/source/model_doc/distilbert.rst b/docs/source/model_doc/distilbert.rst
index 67f27495bd..c3013bb944 100644
--- a/docs/source/model_doc/distilbert.rst
+++ b/docs/source/model_doc/distilbert.rst
@@ -1,14 +1,15 @@
 DistilBERT
-----------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 Overview
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 The DistilBERT model was proposed in the blog post
-`Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT <https://medium.com/huggingface/distilbert-8cf3380435b5>`__,
-and the paper `DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter <https://arxiv.org/abs/1910.01108>`__.
-DistilBERT is a small, fast, cheap and light Transformer model trained by distilling Bert base. It has 40% less
-parameters than `bert-base-uncased`, runs 60% faster while preserving over 95% of Bert's performances as measured on
+`Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT
+<https://medium.com/huggingface/distilbert-8cf3380435b5>`__, and the paper `DistilBERT, a distilled version of BERT:
+smaller, faster, cheaper and lighter <https://arxiv.org/abs/1910.01108>`__.
+DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% less
+parameters than `bert-base-uncased`, runs 60% faster while preserving over 95% of BERT's performances as measured on
 the GLUE language understanding benchmark.
 
 The abstract from the paper is the following:
@@ -27,113 +28,115 @@ on-device study.*
 
 Tips:
 
-- DistilBert doesn't have `token_type_ids`, you don't need to indicate which token belongs to which segment. Just separate your segments with the separation token `tokenizer.sep_token` (or `[SEP]`)
-- DistilBert doesn't have options to select the input positions (`position_ids` input). This could be added if necessary though, just let's us know if you need this option.
+- DistilBERT doesn't have :obj:`token_type_ids`, you don't need to indicate which token belongs to which segment. Just
+  separate your segments with the separation token :obj:`tokenizer.sep_token` (or :obj:`[SEP]`).
+- DistilBERT doesn't have options to select the input positions (:obj:`position_ids` input). This could be added if
+  necessary though, just let us know if you need this option.
 
-The original code can be found `here <https://github.com/huggingface/transformers/tree/master/examples/distillation>`_.
+The original code can be found `here <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__.
 
 
 DistilBertConfig
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.DistilBertConfig
     :members:
 
 
 DistilBertTokenizer
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.DistilBertTokenizer
     :members:
 
 
 DistilBertTokenizerFast
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.DistilBertTokenizerFast
     :members:
 
 
 DistilBertModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.DistilBertModel
-    :members:
+    :members: forward
 
 
 DistilBertForMaskedLM
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.DistilBertForMaskedLM
-    :members:
+    :members: forward
 
 
 DistilBertForSequenceClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.DistilBertForSequenceClassification
-    :members:
+    :members: forward
 
 
 DistilBertForMultipleChoice
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.DistilBertForMultipleChoice
-    :members:
+    :members: forward
 
 
 DistilBertForTokenClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.DistilBertForTokenClassification
-    :members:
+    :members: forward
 
 
 DistilBertForQuestionAnswering
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.DistilBertForQuestionAnswering
-    :members:
+    :members: forward
 
 TFDistilBertModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFDistilBertModel
-    :members:
+    :members: call
 
 
 TFDistilBertForMaskedLM
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFDistilBertForMaskedLM
-    :members:
+    :members: call
 
 
 TFDistilBertForSequenceClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFDistilBertForSequenceClassification
-    :members:
+    :members: call
 
 
 
 TFDistilBertForMultipleChoice
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFDistilBertForMultipleChoice
-    :members:
+    :members: call
 
 
 
 TFDistilBertForTokenClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFDistilBertForTokenClassification
-    :members:
+    :members: call
 
 
 TFDistilBertForQuestionAnswering
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFDistilBertForQuestionAnswering
-    :members:
+    :members: call
diff --git a/docs/source/model_doc/dpr.rst b/docs/source/model_doc/dpr.rst
index a77d3868bf..2273739c10 100644
--- a/docs/source/model_doc/dpr.rst
+++ b/docs/source/model_doc/dpr.rst
@@ -1,13 +1,12 @@
 DPR
-----------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 Overview
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Dense Passage Retrieval (DPR) - is a set of tools and models for state-of-the-art open-domain Q&A research.
-It is based on the following paper:
-
-Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih, Dense Passage Retrieval for Open-Domain Question Answering.
+Dense Passage Retrieval (DPR) is a set of tools and models for state-of-the-art open-domain Q&A research.
+It was intorduced in `Dense Passage Retrieval for Open-Domain Question Answering <https://arxiv.org/abs/2004.04906>`__
+by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih.
 
 The abstract from the paper is the following:
 
@@ -19,58 +18,58 @@ our dense retriever outperforms a strong Lucene-BM25 system largely by 9%-19% ab
 retrieval accuracy, and helps our end-to-end QA system establish new state-of-the-art on multiple open-domain QA
 benchmarks.*
 
-The original code can be found `here <https://github.com/facebookresearch/DPR>`_.
+The original code can be found `here <https://github.com/facebookresearch/DPR>`__.
 
 
 DPRConfig
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.DPRConfig
     :members:
 
 
 DPRContextEncoderTokenizer
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.DPRContextEncoderTokenizer
     :members:
 
 
 DPRContextEncoderTokenizerFast
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.DPRContextEncoderTokenizerFast
     :members:
 
 DPRQuestionEncoderTokenizer
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.DPRQuestionEncoderTokenizer
     :members:
 
 
 DPRQuestionEncoderTokenizerFast
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.DPRQuestionEncoderTokenizerFast
     :members:
 
 DPRReaderTokenizer
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.DPRReaderTokenizer
     :members:
 
 
 DPRReaderTokenizerFast
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.DPRReaderTokenizerFast
     :members:
 
 
 DPR specific outputs
-~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.modeling_dpr.DPRContextEncoderOutput
     :members:
@@ -83,20 +82,20 @@ DPR specific outputs
 
 
 DPRContextEncoder
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.DPRContextEncoder
-    :members:
+    :members: forward
 
 DPRQuestionEncoder
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.DPRQuestionEncoder
-    :members:
+    :members: forward
 
 
 DPRReader
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.DPRReader
-    :members:
+    :members: forward
diff --git a/docs/source/model_doc/electra.rst b/docs/source/model_doc/electra.rst
index 993ed4d2b5..8752ceabda 100644
--- a/docs/source/model_doc/electra.rst
+++ b/docs/source/model_doc/electra.rst
@@ -1,14 +1,14 @@
 ELECTRA
-----------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 Overview
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-The ELECTRA model was proposed in the paper.
-`ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators <https://openreview.net/pdf?id=r1xMH1BtvB>`__.
-ELECTRA is a new pre-training approach which trains two transformer models: the generator and the discriminator. The
-generator's role is to replace tokens in a sequence, and is therefore trained as a masked language model. The discriminator,
-which is the model we're interested in, tries to identify which tokens were replaced by the generator in the sequence.
+The ELECTRA model was proposed in the paper `ELECTRA: Pre-training Text Encoders as Discriminators Rather Than
+Generators <https://openreview.net/pdf?id=r1xMH1BtvB>`__. ELECTRA is a new pretraining approach which trains two
+transformer models: the generator and the discriminator. The generator's role is to replace tokens in a sequence, and
+is therefore trained as a masked language model. The discriminator, which is the model we're interested in, tries to
+identify which tokens were replaced by the generator in the sequence.
 
 The abstract from the paper is the following:
 
@@ -35,44 +35,45 @@ compute and outperforms them when using the same amount of compute.*
 
 Tips:
 
-- ELECTRA is the pre-training approach, therefore there is nearly no changes done to the underlying model: BERT. The
-  only change is the separation of the embedding size and the hidden size -> The embedding size is generally smaller,
+- ELECTRA is the pretraining approach, therefore there is nearly no changes done to the underlying model: BERT. The
+  only change is the separation of the embedding size and the hidden size: the embedding size is generally smaller,
   while the hidden size is larger. An additional projection layer (linear) is used to project the embeddings from
   their embedding size to the hidden size. In the case where the embedding size is the same as the hidden size, no
   projection layer is used.
 - The ELECTRA checkpoints saved using `Google Research's implementation <https://github.com/google-research/electra>`__
   contain both the generator and discriminator. The conversion script requires the user to name which model to export
   into the correct architecture. Once converted to the HuggingFace format, these checkpoints may be loaded into all
-  available ELECTRA models, however. This means that the discriminator may be loaded in the `ElectraForMaskedLM` model,
-  and the generator may be loaded in the `ElectraForPreTraining` model (the classification head will be randomly
-  initialized as it doesn't exist in the generator).
+  available ELECTRA models, however. This means that the discriminator may be loaded in the
+  :class:`~transformers.ElectraForMaskedLM` model, and the generator may be loaded in the
+  :class:`~transformers.ElectraForPreTraining` model (the classification head will be randomly initialized as it
+  doesn't exist in the generator).
 
-The original code can be found `here <https://github.com/google-research/electra>`_.
+The original code can be found `here <https://github.com/google-research/electra>`__.
 
 
 ElectraConfig
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.ElectraConfig
     :members:
 
 
 ElectraTokenizer
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.ElectraTokenizer
     :members:
 
 
 ElectraTokenizerFast
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.ElectraTokenizerFast
     :members:
 
 
 Electra specific outputs
-~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.modeling_electra.ElectraForPreTrainingOutput
     :members:
@@ -82,98 +83,98 @@ Electra specific outputs
 
 
 ElectraModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.ElectraModel
-    :members:
+    :members: forward
 
 
 ElectraForPreTraining
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.ElectraForPreTraining
-    :members:
+    :members: forward
 
 
 ElectraForMaskedLM
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.ElectraForMaskedLM
-    :members:
+    :members: forward
 
 
 ElectraForSequenceClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.ElectraForSequenceClassification
-    :members:
+    :members: forward
 
 
 ElectraForMultipleChoice
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.ElectraForMultipleChoice
-    :members:
+    :members: forward
 
 
 ElectraForTokenClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.ElectraForTokenClassification
-    :members:
+    :members: forward
 
 
 ElectraForQuestionAnswering
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.ElectraForQuestionAnswering
-    :members:
+    :members: forward
 
 
 TFElectraModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFElectraModel
-    :members:
+    :members: call
 
 
 TFElectraForPreTraining
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFElectraForPreTraining
-    :members:
+    :members: call
 
 
 TFElectraForMaskedLM
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFElectraForMaskedLM
-    :members:
+    :members: call
 
 
 TFElectraForSequenceClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFElectraForSequenceClassification
-    :members:
+    :members: call
 
 
 TFElectraForMultipleChoice
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFElectraForMultipleChoice
-    :members:
+    :members: call
 
 
 TFElectraForTokenClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFElectraForTokenClassification
-    :members:
+    :members: call
 
 
 TFElectraForQuestionAnswering
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFElectraForQuestionAnswering
-    :members:
\ No newline at end of file
+    :members: call
diff --git a/docs/source/model_doc/encoderdecoder.rst b/docs/source/model_doc/encoderdecoder.rst
index 54f74320ab..a32b6b1350 100644
--- a/docs/source/model_doc/encoderdecoder.rst
+++ b/docs/source/model_doc/encoderdecoder.rst
@@ -1,24 +1,30 @@
 Encoder Decoder Models
-------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
-The :class:`~transformers.EncoderDecoderModel` can be used to initialize a sequence-to-sequence model with any pre-trained autoencoding model as the encoder and any pre-trained autoregressive model as the decoder.
+The :class:`~transformers.EncoderDecoderModel` can be used to initialize a sequence-to-sequence model with any
+pretrained autoencoding model as the encoder and any pretrained autoregressive model as the decoder.
 
-The effectiveness of initializing sequence-to-sequence models with pre-trained checkpoints for sequence generation tasks was shown in `Leveraging Pre-trained Checkpoints for Sequence Generation Tasks <https://arxiv.org/abs/1907.12461>`__ by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
+The effectiveness of initializing sequence-to-sequence models with pretrained checkpoints for sequence generation tasks
+was shown in `Leveraging Pre-trained Checkpoints for Sequence Generation Tasks <https://arxiv.org/abs/1907.12461>`__ by
+Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
 
-After such an :class:`~transformers.EncoderDecoderModel` has been trained / fine-tuned, it can be saved / loaded just like any other models (see Examples for more information).
+After such an :class:`~transformers.EncoderDecoderModel` has been trained/fine-tuned, it can be saved/loaded just like
+any other models (see the examples for more information).
 
-An application of this architecture could be to leverage two pre-trained :obj:`transformers.BertModel` models as the encoder and decoder for a summarization model as was shown in: `Text Summarization with Pretrained Encoders <https://arxiv.org/abs/1908.08345>`_ by Yang Liu and Mirella Lapata. 
+An application of this architecture could be to leverage two pretrained :class:`~transformers.BertModel` as the encoder
+and decoder for a summarization model as was shown in: `Text Summarization with Pretrained Encoders
+<https://arxiv.org/abs/1908.08345>`__ by Yang Liu and Mirella Lapata. 
 
 
-``EncoderDecoderConfig``
-~~~~~~~~~~~~~~~~~~~~~~~~~
+EncoderDecoderConfig
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.EncoderDecoderConfig
     :members:
 
 
-``EncoderDecoderModel``
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+EncoderDecoderModel
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.EncoderDecoderModel
-    :members:
+    :members: forward
diff --git a/docs/source/model_doc/flaubert.rst b/docs/source/model_doc/flaubert.rst
index e454f96cba..be03c76102 100644
--- a/docs/source/model_doc/flaubert.rst
+++ b/docs/source/model_doc/flaubert.rst
@@ -1,12 +1,12 @@
 FlauBERT
-----------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 Overview
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-The FlauBERT model was proposed in the paper
-`FlauBERT: Unsupervised Language Model Pre-training for French <https://arxiv.org/abs/1912.05372>`__ by Hang Le et al.
-It's a transformer pre-trained using a masked language modeling (MLM) objective (BERT-like).
+The FlauBERT model was proposed in the paper `FlauBERT: Unsupervised Language Model Pre-training for French
+<https://arxiv.org/abs/1912.05372>`__ by Hang Le et al. It's a transformer model pretrained using a masked language
+modeling (MLM) objective (like BERT).
 
 The abstract from the paper is the following:
 
@@ -23,109 +23,109 @@ of the time they outperform other pre-training approaches. Different versions of
 evaluation protocol for the downstream tasks, called FLUE (French Language Understanding Evaluation), are shared
 to the research community for further reproducible experiments in French NLP.*
 
-The original code can be found `here <https://github.com/getalp/Flaubert>`_.
+The original code can be found `here <https://github.com/getalp/Flaubert>`__.
 
 
 FlaubertConfig
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.FlaubertConfig
     :members:
 
 
 FlaubertTokenizer
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.FlaubertTokenizer
     :members:
 
 
 FlaubertModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.FlaubertModel
-    :members:
+    :members: forward
 
 
 FlaubertWithLMHeadModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.FlaubertWithLMHeadModel
-    :members:
+    :members: forward
 
 
 FlaubertForSequenceClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.FlaubertForSequenceClassification
-    :members:
+    :members: forward
 
 
 FlaubertForMultipleChoice
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.FlaubertForMultipleChoice
-    :members:
+    :members: forward
 
 
 FlaubertForTokenClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.FlaubertForTokenClassification
-    :members:
+    :members: forward
 
 
 FlaubertForQuestionAnsweringSimple
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.FlaubertForQuestionAnsweringSimple
-    :members:
+    :members: forward
 
 
 FlaubertForQuestionAnswering
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.FlaubertForQuestionAnswering
-    :members:
+    :members: forward
 
 
 TFFlaubertModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFFlaubertModel
-    :members:
+    :members: call
 
 
 TFFlaubertWithLMHeadModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFFlaubertWithLMHeadModel
-    :members:
+    :members: call
 
 
 TFFlaubertForSequenceClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFFlaubertForSequenceClassification
-    :members:
+    :members: call
 
 
 TFFlaubertForMultipleChoice
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFFlaubertForMultipleChoice
-    :members:
+    :members: call
 
 
 TFFlaubertForTokenClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFFlaubertForTokenClassification
-    :members:
+    :members: call
 
 
 TFFlaubertForQuestionAnsweringSimple
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFFlaubertForQuestionAnsweringSimple
-    :members:
+    :members: call
diff --git a/docs/source/model_doc/fsmt.rst b/docs/source/model_doc/fsmt.rst
index ca808ef21f..83f03a42ff 100644
--- a/docs/source/model_doc/fsmt.rst
+++ b/docs/source/model_doc/fsmt.rst
@@ -1,49 +1,61 @@
 FSMT
-----------------------------------------------------
-**DISCLAIMER:** If you see something strange,
-file a `Github Issue <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ and assign
+-----------------------------------------------------------------------------------------------------------------------
+
+**DISCLAIMER:** If you see something strange, file a `Github Issue
+<https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ and assign
 @stas00.
 
 Overview
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-FSMT (FairSeq MachineTranslation) models were introduced in "Facebook FAIR's WMT19 News Translation Task Submission" <this paper <https://arxiv.org/abs/1907.06616>__ by Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, Sergey Edunov.
+FSMT (FairSeq MachineTranslation) models were introduced in `Facebook FAIR's WMT19 News Translation Task Submission
+<https://arxiv.org/abs/1907.06616>`__ by Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, Sergey Edunov.
 
 The abstract of the paper is the following:
 
-    This paper describes Facebook FAIR's submission to the WMT19 shared news translation task. We participate in two language pairs and four language directions, English <-> German and English <-> Russian. Following our submission from last year, our baseline systems are large BPE-based transformer models trained with the Fairseq sequence modeling toolkit which rely on sampled back-translations. This year we experiment with different bitext data filtering schemes, as well as with adding filtered back-translated data. We also ensemble and fine-tune our models on domain-specific data, then decode using noisy channel model reranking. Our submissions are ranked first in all four directions of the human evaluation campaign. On En->De, our system significantly outperforms other systems as well as human translations. This system improves upon our WMT'18 submission by 4.5 BLEU points.
+*This paper describes Facebook FAIR's submission to the WMT19 shared news translation task. We participate in two
+language pairs and four language directions, English <-> German and English <-> Russian. Following our submission from
+last year, our baseline systems are large BPE-based transformer models trained with the Fairseq sequence modeling
+toolkit which rely on sampled back-translations. This year we experiment with different bitext data filtering schemes,
+as well as with adding filtered back-translated data. We also ensemble and fine-tune our models on domain-specific
+data, then decode using noisy channel model reranking. Our submissions are ranked first in all four directions of the
+human evaluation campaign. On En->De, our system significantly outperforms other systems as well as human translations.
+This system improves upon our WMT'18 submission by 4.5 BLEU points.*
 
 The original code can be found here <https://github.com/pytorch/fairseq/tree/master/examples/wmt19>__.
 
 Implementation Notes
-~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-- FSMT uses source and target vocab pair, that aren't combined into one. It doesn't share embed tokens either. Its tokenizer is very similar to `XLMTokenizer` and the main model is derived from `BartModel`.
-
-
-FSMTForConditionalGeneration
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-.. autoclass:: transformers.FSMTForConditionalGeneration
-    :members: forward
+- FSMT uses source and target vocabulary pairs that aren't combined into one. It doesn't share embeddings tokens
+  either. Its tokenizer is very similar to :class:`~transformers.XLMTokenizer` and the main model is derived from
+  :class:`~transformers.BartModel`.
 
 
 FSMTConfig
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.FSMTConfig
     :members:
 
 
 FSMTTokenizer
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.FSMTTokenizer
-    :members:
+    :members: build_inputs_with_special_tokens, get_special_tokens_mask,
+        create_token_type_ids_from_sequences, prepare_seq2seq_batch, save_vocabulary
 
 
 FSMTModel
-~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.FSMTModel
     :members: forward
+
+
+FSMTForConditionalGeneration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.FSMTForConditionalGeneration
+    :members: forward
\ No newline at end of file
diff --git a/docs/source/model_doc/funnel.rst b/docs/source/model_doc/funnel.rst
index 20e39e6483..bfdc3f24b3 100644
--- a/docs/source/model_doc/funnel.rst
+++ b/docs/source/model_doc/funnel.rst
@@ -1,14 +1,13 @@
 Funnel Transformer
-------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 Overview
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-The Funnel Transformer model was proposed in the paper
-`Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing
-<https://arxiv.org/abs/2006.03236>`__.
-It is a bidirectional transformer model, like BERT, but with a pooling operation after each block of layers, a bit
-like in traditional convolutional neural networks (CNN) in computer vision.
+The Funnel Transformer model was proposed in the paper `Funnel-Transformer: Filtering out Sequential Redundancy for
+Efficient Language Processing <https://arxiv.org/abs/2006.03236>`__. It is a bidirectional transformer model, like
+BERT, but with a pooling operation after each block of layers, a bit like in traditional convolutional neural networks
+(CNN) in computer vision.
 
 The abstract from the paper is the following:
 
@@ -38,18 +37,18 @@ Tips:
   :class:`~transformers.FunnelBaseModel`, :class:`~transformers.FunnelForSequenceClassification` and
   :class:`~transformers.FunnelForMultipleChoice`.
 
-The original code can be found `here <https://github.com/laiguokun/Funnel-Transformer>`_.
+The original code can be found `here <https://github.com/laiguokun/Funnel-Transformer>`__.
 
 
 FunnelConfig
-~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.FunnelConfig
     :members:
 
 
 FunnelTokenizer
-~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.FunnelTokenizer
     :members: build_inputs_with_special_tokens, get_special_tokens_mask,
@@ -57,14 +56,14 @@ FunnelTokenizer
 
 
 FunnelTokenizerFast
-~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.FunnelTokenizerFast
     :members:
 
 
 Funnel specific outputs
-~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.modeling_funnel.FunnelForPreTrainingOutput
     :members:
@@ -74,112 +73,112 @@ Funnel specific outputs
 
 
 FunnelBaseModel
-~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.FunnelBaseModel
-    :members:
+    :members: forward
 
 
 FunnelModel
-~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.FunnelModel
-    :members:
+    :members: forward
 
 
 FunnelModelForPreTraining
-~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.FunnelForPreTraining
-    :members:
+    :members: forward
 
 
 FunnelForMaskedLM
-~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.FunnelForMaskedLM
-    :members:
+    :members: forward
 
 
 FunnelForSequenceClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.FunnelForSequenceClassification
-    :members:
+    :members: forward
 
 
 FunnelForMultipleChoice
-~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.FunnelForMultipleChoice
-    :members:
+    :members: forward
 
 
 FunnelForTokenClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.FunnelForTokenClassification
-    :members:
+    :members: forward
 
 
 FunnelForQuestionAnswering
-~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.FunnelForQuestionAnswering
-    :members:
+    :members: forward
 
 
 TFFunnelBaseModel
-~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFFunnelBaseModel
-    :members:
+    :members: call
 
 
 TFFunnelModel
-~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFFunnelModel
-    :members:
+    :members: call
 
 
 TFFunnelModelForPreTraining
-~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFFunnelForPreTraining
-    :members:
+    :members: call
 
 
 TFFunnelForMaskedLM
-~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFFunnelForMaskedLM
-    :members:
+    :members: call
 
 
 TFFunnelForSequenceClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFFunnelForSequenceClassification
-    :members:
+    :members: call
 
 
 TFFunnelForMultipleChoice
-~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFFunnelForMultipleChoice
-    :members:
+    :members: call
 
 
 TFFunnelForTokenClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFFunnelForTokenClassification
-    :members:
+    :members: call
 
 
 TFFunnelForQuestionAnswering
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFFunnelForQuestionAnswering
-    :members:
\ No newline at end of file
+    :members: call
diff --git a/docs/source/model_doc/gpt.rst b/docs/source/model_doc/gpt.rst
index 99772b30fd..07deb89b66 100644
--- a/docs/source/model_doc/gpt.rst
+++ b/docs/source/model_doc/gpt.rst
@@ -1,12 +1,14 @@
 OpenAI GPT
-----------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 Overview
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-OpenAI GPT model was proposed in `Improving Language Understanding by Generative Pre-Training <https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf>`__
+OpenAI GPT model was proposed in `Improving Language Understanding by Generative Pre-Training
+<https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf>`__
 by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. It's a causal (unidirectional)
-transformer pre-trained using language modeling on a large corpus will long range dependencies, the Toronto Book Corpus.
+transformer pre-trained using language modeling on a large corpus will long range dependencies, the Toronto Book
+Corpus.
 
 The abstract from the paper is the following:
 
@@ -36,7 +38,7 @@ Tips:
 `Write With Transformer <https://transformer.huggingface.co/doc/gpt>`__ is a webapp created and hosted by
 Hugging Face showcasing the generative capabilities of several models. GPT is one of them.
 
-The original code can be found `here <https://github.com/openai/finetune-transformer-lm>`_.
+The original code can be found `here <https://github.com/openai/finetune-transformer-lm>`__.
 
 Note:
 
@@ -46,33 +48,33 @@ If you want to reproduce the original tokenization process of the `OpenAI GPT` p
     pip install spacy ftfy==4.4.3
     python -m spacy download en
 
-If you don't install ``ftfy`` and ``SpaCy``, the :class:`transformers.OpenAIGPTTokenizer` will default to tokenize using 
-BERT's :obj:`BasicTokenizer` followed by Byte-Pair Encoding (which should be fine for most usage, don't 
+If you don't install ``ftfy`` and ``SpaCy``, the :class:`~transformers.OpenAIGPTTokenizer` will default to tokenize
+using BERT's :obj:`BasicTokenizer` followed by Byte-Pair Encoding (which should be fine for most usage, don't 
 worry).
 
 OpenAIGPTConfig
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.OpenAIGPTConfig
     :members:
 
 
 OpenAIGPTTokenizer
-~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.OpenAIGPTTokenizer
     :members: save_vocabulary
 
 
 OpenAIGPTTokenizerFast
-~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.OpenAIGPTTokenizerFast
     :members:
 
 
 OpenAI specific outputs
-~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.modeling_openai.OpenAIGPTDoubleHeadsModelOutput
     :members:
@@ -82,42 +84,42 @@ OpenAI specific outputs
 
 
 OpenAIGPTModel
-~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.OpenAIGPTModel
-    :members:
+    :members: forward
 
 
 OpenAIGPTLMHeadModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.OpenAIGPTLMHeadModel
-    :members:
+    :members: forward
 
 
 OpenAIGPTDoubleHeadsModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.OpenAIGPTDoubleHeadsModel
-    :members:
+    :members: forward
 
 
 TFOpenAIGPTModel
-~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFOpenAIGPTModel
-    :members:
+    :members: call
 
 
 TFOpenAIGPTLMHeadModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFOpenAIGPTLMHeadModel
-    :members:
+    :members: call
 
 
 TFOpenAIGPTDoubleHeadsModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFOpenAIGPTDoubleHeadsModel
-    :members:
+    :members: call
diff --git a/docs/source/model_doc/gpt2.rst b/docs/source/model_doc/gpt2.rst
index b1ae24d98e..d55e106637 100644
--- a/docs/source/model_doc/gpt2.rst
+++ b/docs/source/model_doc/gpt2.rst
@@ -1,14 +1,13 @@
 OpenAI GPT2
-----------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 Overview
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-OpenAI GPT-2 model was proposed in
-`Language Models are Unsupervised Multitask Learners <https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`_
-by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
-It's a causal (unidirectional) transformer pre-trained using  language modeling on a very large
-corpus of ~40 GB of text data.
+OpenAI GPT-2 model was proposed in `Language Models are Unsupervised Multitask Learners
+<https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`_
+by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever. It's a causal (unidirectional)
+transformer pretrained using  language modeling on a very large corpus of ~40 GB of text data.
 
 The abstract from the paper is the following:
 
@@ -27,39 +26,39 @@ Tips:
   it can be observed in the `run_generation.py` example script.
 - The PyTorch models can take the `past` as input, which is the previously computed key/value attention pairs. Using
   this `past` value prevents the model from re-computing pre-computed values in the context of text generation.
-  See `reusing the past in generative models <../quickstart.html#using-the-past>`_ for more information on the usage
+  See `reusing the past in generative models <../quickstart.html#using-the-past>`__ for more information on the usage
   of this argument.
 
 `Write With Transformer <https://transformer.huggingface.co/doc/gpt2-large>`__ is a webapp created and hosted by
 Hugging Face showcasing the generative capabilities of several models. GPT-2 is one of them and is available in five
-different sizes: small, medium, large, xl and a distilled version of the small checkpoint: distilgpt-2.
+different sizes: small, medium, large, xl and a distilled version of the small checkpoint: `distilgpt-2`.
 
-The original code can be found `here <https://openai.com/blog/better-language-models/>`_.
+The original code can be found `here <https://openai.com/blog/better-language-models/>`__.
 
 
 GPT2Config
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.GPT2Config
     :members:
 
 
 GPT2Tokenizer
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.GPT2Tokenizer
     :members: save_vocabulary
 
 
 GPT2TokenizerFast
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.GPT2TokenizerFast
     :members:
 
 
 GPT2 specific outputs
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.modeling_gpt2.GPT2DoubleHeadsModelOutput
     :members:
@@ -69,42 +68,42 @@ GPT2 specific outputs
 
 
 GPT2Model
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.GPT2Model
-    :members:
+    :members: forward
 
 
 GPT2LMHeadModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.GPT2LMHeadModel
-    :members:
+    :members: forward
 
 
 GPT2DoubleHeadsModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.GPT2DoubleHeadsModel
-    :members:
+    :members: forward
 
 
 TFGPT2Model
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFGPT2Model
-    :members:
+    :members: call
 
 
 TFGPT2LMHeadModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFGPT2LMHeadModel
-    :members:
+    :members: call
 
 
 TFGPT2DoubleHeadsModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFGPT2DoubleHeadsModel
-    :members:
+    :members: call
diff --git a/docs/source/model_doc/longformer.rst b/docs/source/model_doc/longformer.rst
index ea03b798ce..648292fade 100644
--- a/docs/source/model_doc/longformer.rst
+++ b/docs/source/model_doc/longformer.rst
@@ -1,126 +1,155 @@
 Longformer
-----------------------------------------------------
-**DISCLAIMER:** This model is still a work in progress, if you see something strange,
-file a `Github Issue <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`_
+-----------------------------------------------------------------------------------------------------------------------
+
+**DISCLAIMER:** This model is still a work in progress, if you see something strange, file a `Github Issue
+<https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__.
 
 Overview
-~~~~~~~~~
-The Longformer model was presented in `Longformer: The Long-Document Transformer <https://arxiv.org/pdf/2004.05150.pdf>`_ by Iz Beltagy, Matthew E. Peters, Arman Cohan.
-Here the abstract: 
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-*Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. Longformer's attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA.*
+The Longformer model was presented in `Longformer: The Long-Document Transformer
+<https://arxiv.org/pdf/2004.05150.pdf>`__ by Iz Beltagy, Matthew E. Peters, Arman Cohan.
 
-The Authors' code can be found `here <https://github.com/allenai/longformer>`_ .
+The abstract from the paper is the following:
+
+*Transformer-based models are unable to process long sequences due to their self-attention operation, which scales
+quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention
+mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or
+longer. Longformer's attention mechanism is a drop-in replacement for the standard self-attention and combines a local
+windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we
+evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In
+contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our
+pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on
+WikiHop and TriviaQA.*
+
+The Authors' code can be found `here <https://github.com/allenai/longformer>`__.
 
 Longformer Self Attention
-~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
 Longformer self attention employs self attention on both a "local" context and a "global" context.
-Most tokens only attend "locally" to each other meaning that each token attends to its :math:`\frac{1}{2} w` previous tokens and :math:`\frac{1}{2} w` succeding tokens with :math:`w` being the window length as defined in `config.attention_window`. Note that `config.attention_window` can be of type ``list`` to define a different :math:`w` for each layer. 
-A selected few tokens attend "globally" to all other tokens, as it is conventionally done for all tokens in *e.g.* `BertSelfAttention`.
+Most tokens only attend "locally" to each other meaning that each token attends to its :math:`\frac{1}{2} w` previous
+tokens and :math:`\frac{1}{2} w` succeding tokens with :math:`w` being the window length as defined in
+:obj:`config.attention_window`. Note that :obj:`config.attention_window` can be of type :obj:`List` to define a
+different :math:`w` for each layer. A selected few tokens attend "globally" to all other tokens, as it is
+conventionally done for all tokens in :obj:`BertSelfAttention`.
 
 Note that "locally" and "globally" attending tokens are projected by different query, key and value matrices.
-Also note that every "locally" attending token not only attends to tokens within its window :math:`w`, but also to all "globally" attending tokens so that global attention is *symmetric*.
+Also note that every "locally" attending token not only attends to tokens within its window :math:`w`, but also to all
+"globally" attending tokens so that global attention is *symmetric*.
 
-The user can define which tokens attend "locally" and which tokens attend "globally" by setting the tensor `global_attention_mask` at run-time appropriately. `Longformer` employs the following logic for `global_attention_mask`: `0` - the token attends "locally", `1` - token attends "globally". For more information please also refer to :func:`~transformers.LongformerModel.forward` method.
+The user can define which tokens attend "locally" and which tokens attend "globally" by setting the tensor
+:obj:`global_attention_mask` at run-time appropriately. All Longformer models employ the following logic for
+:obj:`global_attention_mask`: 
 
-Using Longformer self attention, the memory and time complexity of the query-key matmul operation, which usually represents the memory and time bottleneck, can be reduced from :math:`\mathcal{O}(n_s \times n_s)` to :math:`\mathcal{O}(n_s \times w)`, with :math:`n_s` being the sequence length and :math:`w` being the average window size. It is assumed that the number of "globally" attending tokens is insignificant as compared to the number of "locally" attending tokens.
+- 0: the token attends "locally",
+- 1: the token attends "globally".
 
-For more information, please refer to the official `paper <https://arxiv.org/pdf/2004.05150.pdf>`_ .
+For more information please also refer to :meth:`~transformers.LongformerModel.forward` method.
+
+Using Longformer self attention, the memory and time complexity of the query-key matmul operation, which usually
+represents the memory and time bottleneck, can be reduced from :math:`\mathcal{O}(n_s \times n_s)` to
+:math:`\mathcal{O}(n_s \times w)`, with :math:`n_s` being the sequence length and :math:`w` being the average window
+size. It is assumed that the number of "globally" attending tokens is insignificant as compared to the number of
+"locally" attending tokens.
+
+For more information, please refer to the official `paper <https://arxiv.org/pdf/2004.05150.pdf>`__.
 
 
 Training
-~~~~~~~~~~~~~~~~~~~~
-``LongformerForMaskedLM`` is trained the exact same way, ``RobertaForMaskedLM`` is trained and 
-should be used as follows:
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-::
+:class:`~transformers.LongformerForMaskedLM` is trained the exact same way :class:`~transformers.RobertaForMaskedLM` is
+trained and should be used as follows:
 
-  input_ids = tokenizer.encode('This is a sentence from [MASK] training data', return_tensors='pt')
-  mlm_labels = tokenizer.encode('This is a sentence from the training data', return_tensors='pt')
+.. code-block::
 
-  loss = model(input_ids, labels=input_ids, masked_lm_labels=mlm_labels)[0]
+    input_ids = tokenizer.encode('This is a sentence from [MASK] training data', return_tensors='pt')
+    mlm_labels = tokenizer.encode('This is a sentence from the training data', return_tensors='pt')
+
+    loss = model(input_ids, labels=input_ids, masked_lm_labels=mlm_labels)[0]
 
 
 LongformerConfig
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.LongformerConfig
     :members:
 
 
 LongformerTokenizer
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.LongformerTokenizer
     :members: 
 
 
 LongformerTokenizerFast
-~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.LongformerTokenizerFast
     :members: 
 
 
 LongformerModel
-~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.LongformerModel
-    :members:
+    :members: forward
 
 
 LongformerForMaskedLM
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.LongformerForMaskedLM
-    :members:
+    :members: forward
 
 
 LongformerForSequenceClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.LongformerForSequenceClassification
-    :members:
+    :members: forward
 
 
 LongformerForMultipleChoice
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.LongformerForMultipleChoice
-    :members:
+    :members: forward
 
 
 LongformerForTokenClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.LongformerForTokenClassification
-    :members:
+    :members: forward
 
 
 LongformerForQuestionAnswering
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.LongformerForQuestionAnswering
-    :members:
+    :members: forward
 
 
 TFLongformerModel
-~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFLongformerModel
-    :members:
+    :members: call
 
 
 TFLongformerForMaskedLM
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFLongformerForMaskedLM
-    :members:
+    :members: call
 
 
 TFLongformerForQuestionAnswering
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFLongformerForQuestionAnswering
-    :members:
+    :members: call
 
diff --git a/docs/source/model_doc/lxmert.rst b/docs/source/model_doc/lxmert.rst
index e9c7ad339f..37de33c8b5 100644
--- a/docs/source/model_doc/lxmert.rst
+++ b/docs/source/model_doc/lxmert.rst
@@ -1,65 +1,72 @@
 LXMERT
-----------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 Overview
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-The LXMERT model was proposed in `LXMERT: Learning Cross-Modality Encoder Representations from Transformers <https://arxiv.org/abs/1908.07490>`__
-by Hao Tan & Mohit Bansal. It is a series of bidirectional transformer encoders (one for the vision modality, one for the language modality, and then one to fuse both modalities)
-pre-trained using a combination of masked language modeling, visual-language text alignment, ROI-feature regression, masked visual-attribute modeling, masked visual-object modeling, and visual-question answering objectives.
-The pretraining consists of multiple multi-modal datasets: MSCOCO, Visual-Genome + Visual-Genome Question Answering, VQA 2.0, and GQA.
+The LXMERT model was proposed in `LXMERT: Learning Cross-Modality Encoder Representations from Transformers
+<https://arxiv.org/abs/1908.07490>`__ by Hao Tan & Mohit Bansal. It is a series of bidirectional transformer encoders
+(one for the vision modality, one for the language modality, and then one to fuse both modalities) pretrained using a
+combination of masked language modeling, visual-language text alignment, ROI-feature regression, masked
+visual-attribute modeling, masked visual-object modeling, and visual-question answering objectives.
+The pretraining consists of multiple multi-modal datasets: MSCOCO, Visual-Genome + Visual-Genome Question Answering,
+VQA 2.0, and GQA.
 
 The abstract from the paper is the following:
 
-*Vision-and-language reasoning requires an understanding of visual concepts, language semantics, and, most importantly, the alignment and relationships between these two
-modalities. We thus propose the LXMERT
-(Learning Cross-Modality Encoder Representations from Transformers) framework to learn
-these vision-and-language connections. In
-LXMERT, we build a large-scale Transformer
-model that consists of three encoders: an object relationship encoder, a language encoder,
-and a cross-modality encoder. Next, to endow our model with the capability of connecting vision and language semantics, we
-pre-train the model with large amounts of
-image-and-sentence pairs, via five diverse representative pre-training tasks: masked language modeling, masked object prediction
-(feature regression and label classification),
-cross-modality matching, and image question answering. These tasks help in learning both intra-modality and cross-modality relationships. After fine-tuning from our pretrained parameters, our model achieves the
-state-of-the-art results on two visual question answering datasets (i.e., VQA and GQA).
-We also show the generalizability of our pretrained cross-modality model by adapting it to
-a challenging visual-reasoning task, NLVR
-,
-and improve the previous best result by 22%
-absolute (54% to 76%). Lastly, we demonstrate detailed ablation studies to prove that
-both our novel model components and pretraining strategies significantly contribute to
-our strong results; and also present several
+*Vision-and-language reasoning requires an understanding of visual concepts, language semantics, and, most importantly,
+the alignment and relationships between these two modalities. We thus propose the LXMERT (Learning Cross-Modality
+Encoder Representations from Transformers) framework to learn these vision-and-language connections. In LXMERT, we
+build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language
+encoder, and a cross-modality encoder. Next, to endow our model with the capability of connecting vision and language
+semantics, we pre-train the model with large amounts of image-and-sentence pairs, via five diverse representative
+pre-training tasks: masked language modeling, masked object prediction (feature regression and label classification),
+cross-modality matching, and image question answering. These tasks help in learning both intra-modality and
+cross-modality relationships. After fine-tuning from our pretrained parameters, our model achieves the state-of-the-art
+results on two visual question answering datasets (i.e., VQA and GQA). We also show the generalizability of our
+pretrained cross-modality model by adapting it to a challenging visual-reasoning task, NLVR, and improve the previous
+best result by 22% absolute (54% to 76%). Lastly, we demonstrate detailed ablation studies to prove that both our novel
+model components and pretraining strategies significantly contribute to our strong results; and also present several
 attention visualizations for the different encoders*
 
 Tips:
 
-- Bounding boxes are not necessary to be used in the visual feature embeddings, any kind of visual-spacial features will work.
-- Both the language hidden states and the visual hidden states that LXMERT outputs are passed through the cross-modality layer, so they
-  contain information from both modalities. To access a modality that only attends to itself, select the vision/language hidden states from the first input in the tuple.
-- The bi-directional cross-modality encoder attention only returns attention values when the language modality is used as the input and the vision modality is used as the context vector. Further,
-  while the cross-modality encoder contains self-attention for each respective modality and cross-attention, only the cross attention is returned and both self attention outputs are disregarded.
+- Bounding boxes are not necessary to be used in the visual feature embeddings, any kind of visual-spacial features
+  will work.
+- Both the language hidden states and the visual hidden states that LXMERT outputs are passed through the
+  cross-modality layer, so they contain information from both modalities. To access a modality that only attends to
+  itself, select the vision/language hidden states from the first input in the tuple.
+- The bidirectional cross-modality encoder attention only returns attention values when the language modality is used
+  as the input and the vision modality is used as the context vector. Further, while the cross-modality encoder
+  contains self-attention for each respective modality and cross-attention, only the cross attention is returned and
+  both self attention outputs are disregarded.
 
-The code can be found `here <https://github.com/airsplay/lxmert>`__
+The original code can be found `here <https://github.com/airsplay/lxmert>`__.
 
 
 LxmertConfig
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.LxmertConfig
     :members:
 
 
 LxmertTokenizer
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.LxmertTokenizer
-    :members: build_inputs_with_special_tokens, get_special_tokens_mask,
-        create_token_type_ids_from_sequences, save_vocabulary
+    :members:
+
+
+LxmertTokenizerFast
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.LxmertTokenizerFast
+    :members:
 
 
 Lxmert specific outputs
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.modeling_lxmert.LxmertModelOutput
     :members:
@@ -78,32 +85,32 @@ Lxmert specific outputs
 
 
 LxmertModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.LxmertModel
-    :members:
+    :members: forward
 
 LxmertForPreTraining
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.LxmertForPreTraining
-    :members:
+    :members: forward
 
 LxmertForQuestionAnswering
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.LxmertForQuestionAnswering
-    :members:
+    :members: forward
 
 
 TFLxmertModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFLxmertModel
-    :members:
+    :members: call
 
 TFLxmertForPreTraining
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFLxmertForPreTraining
-    :members:
+    :members: call
diff --git a/docs/source/model_doc/marian.rst b/docs/source/model_doc/marian.rst
index df7d56c90b..50276f8836 100644
--- a/docs/source/model_doc/marian.rst
+++ b/docs/source/model_doc/marian.rst
@@ -1,11 +1,11 @@
 MarianMT
-----------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 **Bugs:** If you see something strange,
 file a `Github Issue <https://github.com/huggingface/transformers/issues/new?assignees=sshleifer&labels=&template=bug-report.md&title>`__ and assign
 @sshleifer. Translations should be similar, but not identical to, output in the test set linked to in each model card.
 
 Implementation Notes
-~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 - Each model is about 298 MB on disk, there are 1,000+ models.
 - The list of supported language pairs can be found `here <https://huggingface.co/Helsinki-NLP>`__.
 - models were originally trained by `Jörg Tiedemann <https://researchportal.helsinki.fi/en/persons/j%C3%B6rg-tiedemann>`__ using the `Marian <https://marian-nmt.github.io/>`_ C++ library, which supports fast training and translation.
@@ -19,14 +19,14 @@ Implementation Notes
 - Code to bulk convert models can be found in ``convert_marian_to_pytorch.py``
 
 Naming
-~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 - All  model names use the following format: ``Helsinki-NLP/opus-mt-{src}-{tgt}``
 - The language codes used to name models are inconsistent. Two digit codes can usually be found `here <https://developers.google.com/admin-sdk/directory/v1/languages>`_, three digit codes require googling "language code {code}".
 - Codes formatted like ``es_AR`` are usually ``code_{region}``. That one is spanish documents from Argentina.
 
 
 Multilingual Models
-~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 All  model names use the following format: ``Helsinki-NLP/opus-mt-{src}-{tgt}``:
     - if ``src`` is in all caps, the model supports multiple input languages, you can figure out which ones by looking at the model card, or the Group Members `mapping <https://gist.github.com/sshleifer/6d20e7761931b08e73c3219027b97b8a>`_ .
@@ -87,7 +87,7 @@ Code to see available pretrained models:
     multi_models = [f'{org}/{s}' for s in suffix if s != s.lower()]
 
 MarianMTModel
-~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Pytorch version of marian-nmt's transformer.h (c++). Designed for the OPUS-NMT translation checkpoints.
 Model API is identical to BartForConditionalGeneration.
@@ -95,13 +95,13 @@ Available models are listed at `Model List <https://huggingface.co/models?search
 This class inherits nearly all functionality from ``BartForConditionalGeneration``, see that page for method signatures.
 
 MarianConfig
-~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: transformers.MarianConfig
     :members:
 
 
 MarianTokenizer
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.MarianTokenizer
     :members: prepare_seq2seq_batch
diff --git a/docs/source/model_doc/mbart.rst b/docs/source/model_doc/mbart.rst
index 1cfc65e663..202179d71d 100644
--- a/docs/source/model_doc/mbart.rst
+++ b/docs/source/model_doc/mbart.rst
@@ -1,11 +1,11 @@
 MBart
-----------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 **DISCLAIMER:** If you see something strange,
 file a `Github Issue <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ and assign
 @sshleifer
 
 Overview
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 The MBart model was presented in `Multilingual Denoising Pre-training for Neural Machine Translation <https://arxiv.org/abs/2001.08210>`_ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov
 Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer. According to the abstract,
 
@@ -15,7 +15,7 @@ The Authors' code can be found `here <https://github.com/pytorch/fairseq/tree/ma
 
 
 Training
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 MBart is a multilingual encoder-decoder (seq-to-seq) model primarily intended for translation task. 
 As the model is multilingual it expects the sequences in a different format. A special language id token 
 is added in both the source and target text. The source text format is ``X [eos, src_lang_code]`` 
@@ -25,7 +25,7 @@ the sequences for seq-2-seq fine-tuning.
 
 - Supervised training
 
-::
+.. code-block::
 
     example_english_phrase = "UN Chief Says There Is No Military Solution in Syria"
     expected_translation_romanian = "Şeful ONU declară că nu există o soluţie militară în Siria"
@@ -41,7 +41,7 @@ the sequences for seq-2-seq fine-tuning.
     While generating the target text set the `decoder_start_token_id` to the target language id. 
     The following example shows how to translate English to Romanian using the ```facebook/mbart-large-en-ro``` model.
 
-::
+.. code-block::
 
     from transformers import MBartForConditionalGeneration, MBartTokenizer
     model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-en-ro")
@@ -54,21 +54,21 @@ the sequences for seq-2-seq fine-tuning.
 
 
 MBartConfig
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.MBartConfig
     :members:
 
 
 MBartTokenizer
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.MBartTokenizer
     :members: build_inputs_with_special_tokens, prepare_seq2seq_batch
 
 
 MBartForConditionalGeneration
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.MBartForConditionalGeneration
     :members: generate, forward
diff --git a/docs/source/model_doc/mobilebert.rst b/docs/source/model_doc/mobilebert.rst
index 038adc6b66..47fd1858ad 100644
--- a/docs/source/model_doc/mobilebert.rst
+++ b/docs/source/model_doc/mobilebert.rst
@@ -1,13 +1,13 @@
 MobileBERT
-----------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 Overview
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-The MobileBERT model was proposed in `MobileBERT: a Compact Task-Agnostic BERT
-for Resource-Limited Devices <https://arxiv.org/abs/2004.02984>`__
-by Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. It's a bidirectional transformer
-based on the BERT model, which is compressed and accelerated using several approaches.
+The MobileBERT model was proposed in `MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices
+<https://arxiv.org/abs/2004.02984>`__ by Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny
+Zhou. It's a bidirectional transformer based on the BERT model, which is compressed and accelerated using several
+approaches.
 
 The abstract from the paper is the following:
 
@@ -32,32 +32,31 @@ Tips:
   It is therefore efficient at predicting masked tokens and at NLU in general, but is not optimal for
   text generation. Models trained with a causal language modeling (CLM) objective are better in that regard.
 
-The original code can be found `here <https://github.com/google-research/mobilebert>`_.
+The original code can be found `here <https://github.com/google-research/mobilebert>`__.
 
 MobileBertConfig
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.MobileBertConfig
     :members:
 
 
 MobileBertTokenizer
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.MobileBertTokenizer
-    :members: build_inputs_with_special_tokens, get_special_tokens_mask,
-        create_token_type_ids_from_sequences, save_vocabulary
+    :members:
 
 
 MobileBertTokenizerFast
-~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.MobileBertTokenizerFast
     :members:
 
 
 MobileBert specific outputs
-~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.modeling_mobilebert.MobileBertForPreTrainingOutput
     :members:
@@ -67,113 +66,112 @@ MobileBert specific outputs
 
 
 MobileBertModel
-~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.MobileBertModel
-    :members:
+    :members: forward
 
 
 MobileBertForPreTraining
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.MobileBertForPreTraining
-    :members:
+    :members: forward
 
 
 MobileBertForMaskedLM
-~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.MobileBertForMaskedLM
-    :members:
+    :members: forward
 
 
 MobileBertForNextSentencePrediction
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.MobileBertForNextSentencePrediction
-    :members:
+    :members: forward
 
 
 MobileBertForSequenceClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.MobileBertForSequenceClassification
-    :members:
+    :members: forward
 
 
 MobileBertForMultipleChoice
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.MobileBertForMultipleChoice
-    :members:
+    :members: forward
 
 
 MobileBertForTokenClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.MobileBertForTokenClassification
-    :members:
+    :members: forward
 
 
 MobileBertForQuestionAnswering
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.MobileBertForQuestionAnswering
-    :members:
+    :members: forward
 
 
 TFMobileBertModel
-~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFMobileBertModel
-    :members:
+    :members: call
 
 
 TFMobileBertForPreTraining
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFMobileBertForPreTraining
-    :members:
+    :members: call
 
 
 TFMobileBertForMaskedLM
-~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFMobileBertForMaskedLM
-    :members:
+    :members: call
 
 
 TFMobileBertForNextSentencePrediction
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFMobileBertForNextSentencePrediction
-    :members:
+    :members: call
 
 
 TFMobileBertForSequenceClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFMobileBertForSequenceClassification
-    :members:
+    :members: call
 
 
 TFMobileBertForMultipleChoice
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFMobileBertForMultipleChoice
-    :members:
+    :members: call
 
 
 TFMobileBertForTokenClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFMobileBertForTokenClassification
-    :members:
+    :members: call
 
 
 TFMobileBertForQuestionAnswering
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFMobileBertForQuestionAnswering
-    :members:
-
+    :members: call
diff --git a/docs/source/model_doc/pegasus.rst b/docs/source/model_doc/pegasus.rst
index ca48f18105..419d904488 100644
--- a/docs/source/model_doc/pegasus.rst
+++ b/docs/source/model_doc/pegasus.rst
@@ -1,12 +1,12 @@
 Pegasus
-----------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 **DISCLAIMER:** If you see something strange,
 file a `Github Issue <https://github.com/huggingface/transformers/issues/new?assignees=sshleifer&labels=&template=bug-report.md&title>`__ and assign
 @sshleifer.
 
 
 Overview
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 The Pegasus model was proposed in `PEGASUS: Pre-training with Extracted Gap-sentences for
 Abstractive Summarization <https://arxiv.org/pdf/1912.08777.pdf>`_ by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu on Dec 18, 2019.
@@ -19,7 +19,7 @@ The Authors' code can be found `here <https://github.com/google-research/pegasus
 
 
 Checkpoints
-~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 All the `checkpoints <https://huggingface.co/models?search=pegasus>`_ are finetuned for summarization, besides ``pegasus-large``, whence the other checkpoints are finetuned.
 - Each checkpoint is 2.2 GB on disk and 568M parameters.
 - FP16 is not supported (help/ideas on this appreciated!).
@@ -29,7 +29,7 @@ The gap is likely because of different alpha/length_penalty implementations in b
 
 
 Implementation Notes
-~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 - All models are transformer encoder-decoders with 16 layers in each component.
 - The implementation is completely inherited from ``BartForConditionalGeneration``
@@ -43,7 +43,7 @@ Implementation Notes
 
 
 Usage Example
-~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. code-block:: python
 
@@ -63,7 +63,7 @@ Usage Example
     assert tgt_text[0] == "California's largest electricity provider has turned off power to hundreds of thousands of customers."
 
 PegasusForConditionalGeneration
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 This class inherits all functionality from ``BartForConditionalGeneration``, see that page for method signatures.
 Available models are listed at `Model List <https://huggingface.co/models?search=pegasus>`__
@@ -73,7 +73,7 @@ Available models are listed at `Model List <https://huggingface.co/models?search
 
 
 PegasusConfig
-~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 This config fully inherits from ``BartConfig``, but pegasus uses different default values:
 Up to date parameter values can be seen in `S3 <https://s3.amazonaws.com/models.huggingface.co/bert/google/pegasus-xsum/config.json>`_.
 As of Aug 10, 2020, they are:
@@ -107,7 +107,7 @@ As of Aug 10, 2020, they are:
 
 
 PegasusTokenizer
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 warning: ``add_tokens`` does not work at the moment.
 
 .. autoclass:: transformers.PegasusTokenizer
diff --git a/docs/source/model_doc/reformer.rst b/docs/source/model_doc/reformer.rst
index 370187dfb8..d1eca9cf71 100644
--- a/docs/source/model_doc/reformer.rst
+++ b/docs/source/model_doc/reformer.rst
@@ -1,20 +1,37 @@
 Reformer
-----------------------------------------------------
-**DISCLAIMER:** This model is still a work in progress, if you see something strange,
-file a `Github Issue <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`_
+-----------------------------------------------------------------------------------------------------------------------
+
+**DISCLAIMER:** This model is still a work in progress, if you see something strange, file a `Github Issue
+<https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__.
 
 Overview
-~~~~~~~~~~
-The Reformer model was presented in `Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451.pdf>`_ by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
-Here the abstract: 
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-*Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from O(L^2) to O(Llog(L)), where L is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of N times, where N is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.*
+The Reformer model was proposed in the paper `Reformer: The Efficient Transformer
+<https://arxiv.org/abs/2001.04451.pdf>`__ by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
 
-The Authors' code can be found `here <https://github.com/google/trax/tree/master/trax/models/reformer>`_ .
+The abstract from the paper is the following: 
+
+*Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can
+be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of
+Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its
+complexity from O(L^2) to O(Llog(L)), where L is the length of the sequence. Furthermore, we use reversible residual
+layers instead of the standard residuals, which allows storing activations only once in the training process instead of
+N times, where N is the number of layers. The resulting model, the Reformer, performs on par with Transformer models
+while being much more memory-efficient and much faster on long sequences.*
+
+The Authors' code can be found `here <https://github.com/google/trax/tree/master/trax/models/reformer>`__.
 
 Axial Positional Encodings
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Axial Positional Encodings were first implemented in Google's `trax library <https://github.com/google/trax/blob/4d99ad4965bab1deba227539758d59f0df0fef48/trax/layers/research/position_encodings.py#L29>`_ and developed by the authors of this model's paper. In models that are treating very long input sequences, the conventional position id encodings store an embedings vector of size :math:`d` being the ``config.hidden_size`` for every position :math:`i, \ldots, n_s`, with :math:`n_s` being ``config.max_embedding_size``. *E.g.*, having a sequence length of :math:`n_s = 2^{19} \approx 0.5M` and a ``config.hidden_size`` of :math:`d = 2^{10} \approx 1000` would result in a position encoding matrix:
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Axial Positional Encodings were first implemented in Google's `trax library
+<https://github.com/google/trax/blob/4d99ad4965bab1deba227539758d59f0df0fef48/trax/layers/research/position_encodings.py#L29>`__
+and developed by the authors of this model's paper. In models that are treating very long input sequences, the
+conventional position id encodings store an embedings vector of size :math:`d` being the :obj:`config.hidden_size` for
+every position :math:`i, \ldots, n_s`, with :math:`n_s` being :obj:`config.max_embedding_size`. This means that having
+a sequence length of :math:`n_s = 2^{19} \approx 0.5M` and a ``config.hidden_size`` of :math:`d = 2^{10} \approx 1000`
+would result in a position encoding matrix:
 
 .. math::
     X_{i,j}, \text{ with } i \in \left[1,\ldots, d\right] \text{ and } j \in \left[1,\ldots, n_s\right] 
@@ -42,94 +59,127 @@ Therefore the following holds:
                 X^{2}_{i - d^1, l}, & \text{if } i \ge d^1 \text{ with } l = \lfloor\frac{j}{n_s^1}\rfloor
               \end{cases}
 
-Intuitively, this means that a position embedding vector :math:`x_j \in \mathbb{R}^{d}` is now the composition of two factorized embedding vectors: :math:`x^1_{k, l} + x^2_{l, k}`, where as the ``config.max_embedding_size`` dimension :math:`j` is factorized into :math:`k \text{ and } l`.
-This design ensures that each position embedding vector :math:`x_j` is unique.
+Intuitively, this means that a position embedding vector :math:`x_j \in \mathbb{R}^{d}` is now the composition of two
+factorized embedding vectors: :math:`x^1_{k, l} + x^2_{l, k}`, where as the :obj:`config.max_embedding_size` dimension
+:math:`j` is factorized into :math:`k \text{ and } l`. This design ensures that each position embedding vector
+:math:`x_j` is unique.
 
-Using the above example again, axial position encoding with :math:`d^1 = 2^5, d^2 = 2^5, n_s^1 = 2^9, n_s^2 = 2^{10}` can drastically reduced the number of parameters to :math:`2^{14} + 2^{15} \approx 49000` parameters.
-
-In practice, the parameter ``config.axial_pos_embds_dim`` is set to ``list``:math:`(d^1, d^2)` which sum has to be equal to ``config.hidden_size`` and ``config.axial_pos_shape`` is set to ``list``:math:`(n_s^1, n_s^2)` and which product has to be equal to ``config.max_embedding_size`` which during training has to be equal to the ``sequence length`` of the ``input_ids``.
+Using the above example again, axial position encoding with :math:`d^1 = 2^5, d^2 = 2^5, n_s^1 = 2^9, n_s^2 = 2^{10}`
+can drastically reduced the number of parameters to :math:`2^{14} + 2^{15} \approx 49000` parameters.
 
+In practice, the parameter :obj:`config.axial_pos_embds_dim` is set to a tuple :math:`(d^1, d^2)` which sum has to
+be equal to :obj:`config.hidden_size` and :obj:`config.axial_pos_shape` is set to a tuple :math:`(n_s^1, n_s^2)` which
+product has to be equal to :obj:`config.max_embedding_size`, which during training has to be equal to the
+`sequence length` of the :obj:`input_ids`.
 
 
 LSH Self Attention
-~~~~~~~~~~~~~~~~~~~~
-In Locality sensitive hashing (LSH) self attention the key and query projection weights are tied. Therefore, the key query embedding vectors are also tied.
-LSH self attention uses the locality sensitive 
-hashing mechanism proposed in `Practical and Optimal LSH for Angular Distance <https://arxiv.org/abs/1509.02897>`_ to assign each of the tied key query embedding vectors to one of ``config.num_buckets`` possible buckets. The premise is that the more "similar" key query embedding vectors (in terms of *cosine similarity*) are to each other, the more likely they are assigned to the same bucket. 
-The accuracy of the LSH mechanism can be improved by increasing ``config.num_hashes`` or directly the argument ``num_hashes`` of the forward function so that the output of the LSH self attention better approximates the output of the "normal" full self attention.
-The buckets are then sorted and chunked into query key embedding vector chunks each of length ``config.lsh_chunk_length``. For each chunk, the query embedding vectors attend to its key vectors (which are tied to themselves) and to the key embedding vectors of ``config.lsh_num_chunks_before`` previous neighboring chunks and ``config.lsh_num_chunks_after`` following neighboring chunks.
-For more information, see the `original Paper <https://arxiv.org/abs/2001.04451>`_ or this great `blog post <https://www.pragmatic.ml/reformer-deep-dive/>`_.
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+In Locality sensitive hashing (LSH) self attention the key and query projection weights are tied. Therefore, the key
+query embedding vectors are also tied. LSH self attention uses the locality sensitive hashing mechanism proposed in
+`Practical and Optimal LSH for Angular Distance <https://arxiv.org/abs/1509.02897>`__ to assign each of the tied key
+query embedding vectors to one of :obj:`config.num_buckets` possible buckets. The premise is that the more "similar"
+key query embedding vectors (in terms of *cosine similarity*) are to each other, the more likely they are assigned to
+the same bucket. 
 
-Note that ``config.num_buckets`` can also be factorized into a ``list``:math:`(n_{\text{buckets}}^1, n_{\text{buckets}}^2)`. This way instead of assigning the query key embedding vectors to one of :math:`(1,\ldots, n_{\text{buckets}})` they are assigned to one of :math:`(1-1,\ldots, n_{\text{buckets}}^1-1, \ldots, 1-n_{\text{buckets}}^2, \ldots, n_{\text{buckets}}^1-n_{\text{buckets}}^2)`. This is crucial for very long sequences to save memory.
+The accuracy of the LSH mechanism can be improved by increasing :obj:`config.num_hashes` or directly the argument 
+:obj:`num_hashes` of the forward function so that the output of the LSH self attention better approximates the output
+of the "normal" full self attention. The buckets are then sorted and chunked into query key embedding vector chunks
+each of length :obj:`config.lsh_chunk_length`. For each chunk, the query embedding vectors attend to its key vectors
+(which are tied to themselves) and to the key embedding vectors of :obj:`config.lsh_num_chunks_before` previous
+neighboring chunks and :obj:`config.lsh_num_chunks_after` following neighboring chunks.
 
-When training a model from scratch, it is recommended to leave ``config.num_buckets=None``, so that depending on the sequence length a good value for ``num_buckets`` is calculated on the fly. This value will then automatically be saved in the config and should be reused for inference.
+For more information, see the `original Paper <https://arxiv.org/abs/2001.04451>`__ or this great `blog post
+<https://www.pragmatic.ml/reformer-deep-dive/>`__.
 
-Using LSH self attention, the memory and time complexity of the query-key matmul operation can be reduced from :math:`\mathcal{O}(n_s \times n_s)` to :math:`\mathcal{O}(n_s \times \log(n_s))`, which usually represents the memory and time bottleneck in a transformer model, with :math:`n_s` being the sequence length.
+Note that :obj:`config.num_buckets` can also be factorized into a list
+:math:`(n_{\text{buckets}}^1, n_{\text{buckets}}^2)`. This way instead of assigning the query key embedding vectors to
+one of :math:`(1,\ldots, n_{\text{buckets}})` they are assigned to one of
+:math:`(1-1,\ldots, n_{\text{buckets}}^1-1, \ldots, 1-n_{\text{buckets}}^2, \ldots, n_{\text{buckets}}^1-n_{\text{buckets}}^2)`.
+This is crucial for very long sequences to save memory.
+
+When training a model from scratch, it is recommended to leave :obj:`config.num_buckets=None`, so that depending on the
+sequence length a good value for :obj:`num_buckets` is calculated on the fly. This value will then automatically be
+saved in the config and should be reused for inference.
+
+Using LSH self attention, the memory and time complexity of the query-key matmul operation can be reduced from
+:math:`\mathcal{O}(n_s \times n_s)` to :math:`\mathcal{O}(n_s \times \log(n_s))`, which usually represents the memory
+and time bottleneck in a transformer model, with :math:`n_s` being the sequence length.
 
 
 Local Self Attention
-~~~~~~~~~~~~~~~~~~~~
-Local self attention is essentially a "normal" self attention layer with 
-key, query and value projections, but is chunked so that in each chunk of length ``config.local_chunk_length`` the query embedding vectors only attends to the key embedding vectors in its chunk and to the key embedding vectors of ``config.local_num_chunks_before`` previous neighboring chunks and ``config.local_num_chunks_after`` following neighboring chunks.
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Using Local self attention, the memory and time complexity of the query-key matmul operation can be reduced from :math:`\mathcal{O}(n_s \times n_s)` to :math:`\mathcal{O}(n_s \times \log(n_s))`, which usually represents the memory and time bottleneck in a transformer model, with :math:`n_s` being the sequence length.
+Local self attention is essentially a "normal" self attention layer with key, query and value projections, but is
+chunked so that in each chunk of length :obj:`config.local_chunk_length` the query embedding vectors only attends to
+the key embedding vectors in its chunk and to the key embedding vectors of :obj:`config.local_num_chunks_before`
+previous neighboring chunks and :obj:`config.local_num_chunks_after` following neighboring chunks.
+
+Using Local self attention, the memory and time complexity of the query-key matmul operation can be reduced from
+:math:`\mathcal{O}(n_s \times n_s)` to :math:`\mathcal{O}(n_s \times \log(n_s))`, which usually represents the memory
+and time bottleneck in a transformer model, with :math:`n_s` being the sequence length.
 
 
 Training
-~~~~~~~~~~~~~~~~~~~~
-During training, we must ensure that the sequence length is set to a value that can be divided by the least common multiple of ``config.lsh_chunk_length`` and ``config.local_chunk_length`` and that the parameters of the Axial Positional Encodings are correctly set as described above. Reformer is very memory efficient so that the model can easily be trained on sequences as long as 64000 tokens.
-For training, the ``ReformerModelWithLMHead`` should be used as follows: 
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-::
+During training, we must ensure that the sequence length is set to a value that can be divided by the least common
+multiple of :obj:`config.lsh_chunk_length` and :obj:`config.local_chunk_length` and that the parameters of the Axial
+Positional Encodings are correctly set as described above. Reformer is very memory efficient so that the model can
+easily be trained on sequences as long as 64000 tokens.
+
+For training, the :class:`~transformers.ReformerModelWithLMHead` should be used as follows: 
+
+.. code-block::
 
   input_ids = tokenizer.encode('This is a sentence from the training data', return_tensors='pt')
   loss = model(input_ids, labels=input_ids)[0]
 
 
 ReformerConfig
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.ReformerConfig
     :members:
 
 
 ReformerTokenizer
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.ReformerTokenizer
-    :members: 
+    :members: save_vocabulary
 
 
 ReformerModel
-~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.ReformerModel
-    :members:
+    :members: forward
 
 
 ReformerModelWithLMHead
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.ReformerModelWithLMHead
-    :members:
+    :members: forward
 
 
 ReformerForMaskedLM
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.ReformerForMaskedLM
-    :members:
+    :members: forward
 
 
 ReformerForSequenceClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.ReformerForSequenceClassification
-    :members:
+    :members: forward
 
 
 ReformerForQuestionAnswering
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.ReformerForQuestionAnswering
-    :members:
+    :members: forward
diff --git a/docs/source/model_doc/retribert.rst b/docs/source/model_doc/retribert.rst
index c26f61dc08..5208c0fe38 100644
--- a/docs/source/model_doc/retribert.rst
+++ b/docs/source/model_doc/retribert.rst
@@ -1,39 +1,40 @@
 RetriBERT
-----------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 Overview
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-The RetriBERT model was proposed in the blog post
-`Explain Anything Like I'm Five: A Model for Open Domain Long Form Question Answering <https://yjernite.github.io/lfqa.html>`__,
-RetriBERT is a small model that uses either a single or pair of Bert encoders with lower-dimension projection for dense semantic indexing of text.
+The RetriBERT model was proposed in the blog post `Explain Anything Like I'm Five: A Model for Open Domain Long Form
+Question Answering <https://yjernite.github.io/lfqa.html>`__. RetriBERT is a small model that uses either a single or
+pair of BERT encoders with lower-dimension projection for dense semantic indexing of text.
 
-Code to train and use the model can be found `here <https://github.com/huggingface/transformers/tree/master/examples/distillation>`_.
+Code to train and use the model can be found `here
+<https://github.com/huggingface/transformers/tree/master/examples/distillation>`__.
 
 
 RetriBertConfig
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.RetriBertConfig
     :members:
 
 
 RetriBertTokenizer
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.RetriBertTokenizer
     :members:
 
 
 RetriBertTokenizerFast
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.RetriBertTokenizerFast
     :members:
 
 
 RetriBertModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.RetriBertModel
-    :members:
+    :members: forward
diff --git a/docs/source/model_doc/roberta.rst b/docs/source/model_doc/roberta.rst
index ac83dde4fc..0e1e3b6d1f 100644
--- a/docs/source/model_doc/roberta.rst
+++ b/docs/source/model_doc/roberta.rst
@@ -1,12 +1,12 @@
 RoBERTa
-----------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 Overview
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-The RoBERTa model was proposed in `RoBERTa: A Robustly Optimized BERT Pretraining Approach <https://arxiv.org/abs/1907.11692>`_
-by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer,
-Veselin Stoyanov. It is based on Google's BERT model released in 2018.
+The RoBERTa model was proposed in `RoBERTa: A Robustly Optimized BERT Pretraining Approach
+<https://arxiv.org/abs/1907.11692>`_ by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer
+Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. It is based on Google's BERT model released in 2018.
 
 It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining
 objective and training with much larger mini-batches and learning rates.
@@ -27,22 +27,23 @@ Tips:
 - This implementation is the same as :class:`~transformers.BertModel` with a tiny embeddings tweak as well as a
   setup for Roberta pretrained models.
 - RoBERTa has the same architecture as BERT, but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a
-  different pre-training scheme.
-- RoBERTa doesn't have `token_type_ids`, you don't need to indicate which token belongs to which segment. Just separate your segments with the separation token `tokenizer.sep_token` (or `</s>`)
-- `Camembert <./camembert.html>`__ is a wrapper around RoBERTa. Refer to this page for usage examples.
+  different pretraining scheme.
+- RoBERTa doesn't have :obj:`token_type_ids`, you don't need to indicate which token belongs to which segment. Just
+  separate your segments with the separation token :obj:`tokenizer.sep_token` (or :obj:`</s>`)
+- :doc:`CamemBERT <camembert>` is a wrapper around RoBERTa. Refer to this page for usage examples.
 
 The original code can be found `here <https://github.com/pytorch/fairseq/tree/master/examples/roberta>`_.
 
 
 RobertaConfig
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.RobertaConfig
     :members:
 
 
 RobertaTokenizer
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.RobertaTokenizer
     :members: build_inputs_with_special_tokens, get_special_tokens_mask,
@@ -50,98 +51,98 @@ RobertaTokenizer
 
 
 RobertaTokenizerFast
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.RobertaTokenizerFast
     :members: build_inputs_with_special_tokens
 
 
 RobertaModel
-~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.RobertaModel
-    :members:
+    :members: forward
 
 
 RobertaForCausalLM
-~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.RobertaForCausalLM
-    :members:
+    :members: forward
 
 
 RobertaForMaskedLM
-~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.RobertaForMaskedLM
-    :members:
+    :members: forward
 
 
 RobertaForSequenceClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.RobertaForSequenceClassification
-    :members:
+    :members: forward
 
 
 RobertaForMultipleChoice
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.RobertaForMultipleChoice
-    :members:
+    :members: forward
 
 
 RobertaForTokenClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.RobertaForTokenClassification
-    :members:
+    :members: forward
 
 
 RobertaForQuestionAnswering
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.RobertaForQuestionAnswering
-    :members:
+    :members: forward
 
 
 TFRobertaModel
-~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFRobertaModel
-    :members:
+    :members: call
 
 
 TFRobertaForMaskedLM
-~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFRobertaForMaskedLM
-    :members:
+    :members: call
 
 
 TFRobertaForSequenceClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFRobertaForSequenceClassification
-    :members:
+    :members: call
 
 
 TFRobertaForMultipleChoice
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFRobertaForMultipleChoice
-    :members:
+    :members: call
 
 
 TFRobertaForTokenClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFRobertaForTokenClassification
-    :members:
+    :members: call
 
 
 TFRobertaForQuestionAnswering
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFRobertaForQuestionAnswering
-    :members:
+    :members: call
diff --git a/docs/source/model_doc/t5.rst b/docs/source/model_doc/t5.rst
index f7451300c8..43aef7a198 100644
--- a/docs/source/model_doc/t5.rst
+++ b/docs/source/model_doc/t5.rst
@@ -1,47 +1,66 @@
 T5
-----------------------------------------------------
-**DISCLAIMER:** This model is still a work in progress, if you see something strange,
-file a `Github Issue <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`_
+-----------------------------------------------------------------------------------------------------------------------
+
+**DISCLAIMER:** This model is still a work in progress, if you see something strange, file a `Github Issue
+<https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__.
 
 Overview
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-The T5 model was presented in `Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer <https://arxiv.org/pdf/1910.10683.pdf>`_ by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu in 
-Here the abstract: 
+The T5 model was presented in `Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
+<https://arxiv.org/pdf/1910.10683.pdf>`_ by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang,
+Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu.
 
-*Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. 
-In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. 
-Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. 
-By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. 
-To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.*
+The abstract from the paper is the following:
+
+*Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream
+task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning
+has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of
+transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a
+text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer
+approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration
+with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering
+summarization, question answering, text classification, and more. To facilitate future work on transfer learning for
+NLP, we release our dataset, pre-trained models, and code.*
 
 Tips:
 
-- T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised 
-  and supervised tasks and for which each task is converted into a text-to-text format.
-  T5 works well on a variety of tasks out-of-the-box by prepending a different prefix to the input corresponding to each task, e.g.: for translation: *translate English to German: ..., summarize: ...*.
-  For more information about which prefix to use, it is easiest to look into Appendix D of the `paper <https://arxiv.org/pdf/1910.10683.pdf>`_ .
-- For sequence to sequence generation, it is recommended to use ``T5ForConditionalGeneration.generate()``. The method takes care of feeding the encoded input via cross-attention layers to the decoder and auto-regressively generates the decoder output.
+- T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks and for which
+  each task is converted into a text-to-text format. T5 works well on a variety of tasks out-of-the-box by prepending a
+  different prefix to the input corresponding to each task, e.g., for translation: *translate English to German: ...*,
+  for summarization: *summarize: ...*.
+  
+  For more information about which prefix to use, it is easiest to look into Appendix D of the `paper
+  <https://arxiv.org/pdf/1910.10683.pdf>`__.
+- For sequence-to-sequence generation, it is recommended to use :obj:`T5ForConditionalGeneration.generate()``. This
+  method takes care of feeding the encoded input via cross-attention layers to the decoder and auto-regressively
+  generates the decoder output.
 - T5 uses relative scalar embeddings. Encoder input padding can be done on the left and on the right.
 
-The original code can be found `here <https://github.com/google-research/text-to-text-transfer-transformer>`_.
+The original code can be found `here <https://github.com/google-research/text-to-text-transfer-transformer>`__.
 
 Training
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-T5 is an encoder-decoder model and converts all NLP problems into a text-to-text format. It is trained using teacher forcing.
-This means that for training we always need an input sequence and a target sequence. 
-The input sequence is fed to the model using ``input_ids``. The target sequence is shifted to the right, *i.e.* prepended by a start-sequence token and fed to the decoder using the `decoder_input_ids`. In teacher-forcing style, the target sequence is then appended by the EOS token and corresponds to the ``labels``. The PAD token is hereby used as the start-sequence token.
-T5 can be trained / fine-tuned both in a supervised and unsupervised fashion.
+T5 is an encoder-decoder model and converts all NLP problems into a text-to-text format. It is trained using teacher
+forcing. This means that for training we always need an input sequence and a target sequence. The input sequence is fed
+to the model using :obj:`input_ids``. The target sequence is shifted to the right, i.e., prepended by a start-sequence
+token and fed to the decoder using the :obj:`decoder_input_ids`. In teacher-forcing style, the target sequence is then
+appended by the EOS token and corresponds to the :obj:`labels`. The PAD token is hereby used as the start-sequence
+token. T5 can be trained / fine-tuned both in a supervised and unsupervised fashion.
 
 - Unsupervised denoising training
 
   In this setup spans of the input sequence are masked by so-called sentinel tokens (*a.k.a* unique mask tokens) 
   and the output sequence is formed as a concatenation of the same sentinel tokens and the *real* masked tokens. 
-  Each sentinel token represents a unique mask token for this sentence and should start with ``<extra_id_0>``, ``<extra_id_1>``, ... up to ``<extra_id_99>``. As a default 100 sentinel tokens are available in ``T5Tokenizer``.
-  *E.g.* the sentence "The cute dog walks in the park" with the masks put on "cute dog" and "the" should be processed as follows: 
+  Each sentinel token represents a unique mask token for this sentence and should start with :obj:`<extra_id_0>`, 
+  :obj:`<extra_id_1>`, ... up to :obj:`<extra_id_99>`. As a default, 100 sentinel tokens are available in
+  :class:`~transformers.T5Tokenizer`.
+  
+  For instance, the sentence "The cute dog walks in the park" with the masks put on "cute dog" and "the" should be
+  processed as follows: 
 
-::
+.. code-block::
 
   input_ids = tokenizer.encode('The <extra_id_0> walks in <extra_id_1> park', return_tensors='pt')
   labels = tokenizer.encode('<extra_id_0> cute dog <extra_id_1> the <extra_id_2> </s>', return_tensors='pt')
@@ -50,11 +69,11 @@ T5 can be trained / fine-tuned both in a supervised and unsupervised fashion.
 
 - Supervised training
 
-  In this setup the input sequence and output sequence are standard sequence to sequence input output mapping.
-  In translation, *e.g.* the input sequence "The house is wonderful." and output sequence "Das Haus ist wunderbar." should 
-  be processed as follows:
+  In this setup the input sequence and output sequence are standard sequence-to-sequence input output mapping.
+  In translation, for instance with the input sequence "The house is wonderful." and output sequence "Das Haus ist
+  wunderbar.", the sentences should be processed as follows:
   
-::
+.. code-block::
 
   input_ids = tokenizer.encode('translate English to German: The house is wonderful. </s>', return_tensors='pt')
   labels = tokenizer.encode('Das Haus ist wunderbar. </s>', return_tensors='pt')
@@ -63,43 +82,43 @@ T5 can be trained / fine-tuned both in a supervised and unsupervised fashion.
 
 
 T5Config
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.T5Config
     :members:
 
 
 T5Tokenizer
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.T5Tokenizer
     :members: build_inputs_with_special_tokens, get_special_tokens_mask,
-        create_token_type_ids_from_sequences, save_vocabulary
+        create_token_type_ids_from_sequences, prepare_seq2seq_batch, save_vocabulary
 
 
 T5Model
-~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.T5Model
-    :members:
+    :members: forward
 
 
 T5ForConditionalGeneration
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.T5ForConditionalGeneration
-    :members:
+    :members: forward
 
 
 TFT5Model
-~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFT5Model
-    :members:
+    :members: call
 
 
 TFT5ForConditionalGeneration
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFT5ForConditionalGeneration
-    :members:
+    :members: call
diff --git a/docs/source/model_doc/transformerxl.rst b/docs/source/model_doc/transformerxl.rst
index c9c9807a0d..f5f6b22c3e 100644
--- a/docs/source/model_doc/transformerxl.rst
+++ b/docs/source/model_doc/transformerxl.rst
@@ -1,15 +1,14 @@
 Transformer XL
-----------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 Overview
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-The Transformer-XL model was proposed in
-`Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context <https://arxiv.org/abs/1901.02860>`__
-by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
-It's a causal (uni-directional) transformer with relative positioning (sinusoïdal) embeddings which can reuse
-previously computed hidden-states to attend to longer context (memory).
-This model also uses adaptive softmax inputs and outputs (tied).
+The Transformer-XL model was proposed in `Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
+<https://arxiv.org/abs/1901.02860>`__ by Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan
+Salakhutdinov. It's a causal (uni-directional) transformer with relative positioning (sinusoïdal) embeddings which can
+reuse previously computed hidden-states to attend to longer context (memory). This model also uses adaptive softmax
+inputs and outputs (tied).
 
 The abstract from the paper is the following:
 
@@ -30,32 +29,32 @@ Tips:
   The original implementation trains on SQuAD with padding on the left, therefore the padding defaults are set to left.
 - Transformer-XL is one of the few models that has no sequence length limit.
 
-The original code can be found `here <https://github.com/kimiyoung/transformer-xl>`_.
+The original code can be found `here <https://github.com/kimiyoung/transformer-xl>`__.
 
 
 TransfoXLConfig
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TransfoXLConfig
     :members:
 
 
 TransfoXLTokenizer
-~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TransfoXLTokenizer
     :members: save_vocabulary
 
 
 TransfoXLTokenizerFast
-~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TransfoXLTokenizerFast
     :members:
 
 
 TransfoXL specific outputs
-~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.modeling_transfo_xl.TransfoXLModelOutput
     :members:
@@ -71,28 +70,28 @@ TransfoXL specific outputs
 
 
 TransfoXLModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TransfoXLModel
-    :members:
+    :members: forward
 
 
 TransfoXLLMHeadModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TransfoXLLMHeadModel
-    :members:
+    :members: forward
 
 
 TFTransfoXLModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFTransfoXLModel
-    :members:
+    :members: call
 
 
 TFTransfoXLLMHeadModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFTransfoXLLMHeadModel
-    :members:
+    :members: call
diff --git a/docs/source/model_doc/xlm.rst b/docs/source/model_doc/xlm.rst
index 9f6254d01c..37f306c845 100644
--- a/docs/source/model_doc/xlm.rst
+++ b/docs/source/model_doc/xlm.rst
@@ -1,15 +1,15 @@
 XLM
-----------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 Overview
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-The XLM model was proposed in `Cross-lingual Language Model Pretraining <https://arxiv.org/abs/1901.07291>`_
-by Guillaume Lample*, Alexis Conneau*. It's a transformer pre-trained using one of the following objectives:
+The XLM model was proposed in `Cross-lingual Language Model Pretraining <https://arxiv.org/abs/1901.07291>`__ by
+Guillaume Lample, Alexis Conneau. It's a transformer pretrained using one of the following objectives:
 
 - a causal language modeling (CLM) objective (next token prediction),
-- a masked language modeling (MLM) objective (Bert-like), or
-- a Translation Language Modeling (TLM) object (extension of Bert's MLM to multiple language inputs)
+- a masked language modeling (MLM) objective (BERT-like), or
+- a Translation Language Modeling (TLM) object (extension of BERT's MLM to multiple language inputs)
 
 The abstract from the paper is the following:
 
@@ -27,20 +27,20 @@ Tips:
 
 - XLM has many different checkpoints, which were trained using different objectives: CLM, MLM or TLM. Make sure to
   select the correct objective for your task (e.g. MLM checkpoints are not suitable for generation).
-- XLM has multilingual checkpoints which leverage a specific `lang` parameter. Check out the
-  `multi-lingual <../multilingual.html>`__ page for more information.
+- XLM has multilingual checkpoints which leverage a specific :obj:`lang` parameter. Check out the
+  :doc:`multi-lingual <../multilingual>` page for more information.
 
-The original code can be found `here <https://github.com/facebookresearch/XLM/>`_.
+The original code can be found `here <https://github.com/facebookresearch/XLM/>`__.
 
 
 XLMConfig
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.XLMConfig
     :members:
 
 XLMTokenizer
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.XLMTokenizer
     :members: build_inputs_with_special_tokens, get_special_tokens_mask,
@@ -48,99 +48,99 @@ XLMTokenizer
 
 
 XLM specific outputs
-~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.modeling_xlm.XLMForQuestionAnsweringOutput
     :members:
 
 
 XLMModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.XLMModel
-    :members:
+    :members: forward
 
 
 XLMWithLMHeadModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.XLMWithLMHeadModel
-    :members:
+    :members: forward
 
 
 XLMForSequenceClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.XLMForSequenceClassification
-    :members:
+    :members: forward
 
 
 XLMForMultipleChoice
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.XLMForMultipleChoice
-    :members:
+    :members: forward
 
 
 XLMForTokenClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.XLMForTokenClassification
-    :members:
+    :members: forward
 
 
 XLMForQuestionAnsweringSimple
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.XLMForQuestionAnsweringSimple
-    :members:
+    :members: forward
 
 
 XLMForQuestionAnswering
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.XLMForQuestionAnswering
-    :members:
+    :members: forward
 
 
 TFXLMModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFXLMModel
-    :members:
+    :members: call
 
 
 TFXLMWithLMHeadModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFXLMWithLMHeadModel
-    :members:
+    :members: call
 
 
 TFXLMForSequenceClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFXLMForSequenceClassification
-    :members:
+    :members: call
 
 
 TFXLMForMultipleChoice
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFXLMForMultipleChoice
-    :members:
+    :members: call
 
 
 TFXLMForTokenClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFXLMForTokenClassification
-    :members:
+    :members: call
 
 
 
 TFXLMForQuestionAnsweringSimple
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFXLMForQuestionAnsweringSimple
-    :members:
+    :members: call
diff --git a/docs/source/model_doc/xlmroberta.rst b/docs/source/model_doc/xlmroberta.rst
index e7ccecfdbd..953adf2a20 100644
--- a/docs/source/model_doc/xlmroberta.rst
+++ b/docs/source/model_doc/xlmroberta.rst
@@ -1,13 +1,14 @@
 XLM-RoBERTa
-------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 Overview
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-The XLM-RoBERTa model was proposed in `Unsupervised Cross-lingual Representation Learning at Scale <https://arxiv.org/abs/1911.02116>`__
-by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán,
-Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov. It is based on Facebook's RoBERTa model released in 2019.
-It is a large multi-lingual language model, trained on 2.5TB of filtered CommonCrawl data.
+The XLM-RoBERTa model was proposed in `Unsupervised Cross-lingual Representation Learning at Scale
+<https://arxiv.org/abs/1911.02116>`__ by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume
+Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov. It is based on Facebook's
+RoBERTa model released in 2019. It is a large multi-lingual language model, trained on 2.5TB of filtered CommonCrawl
+data.
 
 The abstract from the paper is the following:
 
@@ -25,24 +26,24 @@ and XNLI benchmarks. We will make XLM-R code, data, and models publicly availabl
 
 Tips:
 
-- XLM-R is a multilingual model trained on 100 different languages. Unlike some XLM multilingual models, it does
-  not require `lang` tensors to understand which language is used, and should be able to determine the correct
+- XLM-RoBERTa is a multilingual model trained on 100 different languages. Unlike some XLM multilingual models, it does
+  not require :obj:`lang` tensors to understand which language is used, and should be able to determine the correct
   language from the input ids.
-- This implementation is the same as RoBERTa. Refer to the `documentation of RoBERTa <./roberta.html>`__ for usage
+- This implementation is the same as RoBERTa. Refer to the :doc:`documentation of RoBERTa <roberta>` for usage
   examples as well as the information relative to the inputs and outputs.
 
-The original code can be found `here <https://github.com/pytorch/fairseq/tree/master/examples/xlmr>`_.
+The original code can be found `here <https://github.com/pytorch/fairseq/tree/master/examples/xlmr>`__.
 
 
 XLMRobertaConfig
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.XLMRobertaConfig
     :members:
 
 
 XLMRobertaTokenizer
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.XLMRobertaTokenizer
     :members: build_inputs_with_special_tokens, get_special_tokens_mask,
@@ -50,91 +51,91 @@ XLMRobertaTokenizer
 
 
 XLMRobertaModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.XLMRobertaModel
-    :members:
+    :members: forward
 
 
 XLMRobertaForCausalLM
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.XLMRobertaForCausalLM
-    :members:
+    :members: forward
 
 
 XLMRobertaForMaskedLM
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.XLMRobertaForMaskedLM
-    :members:
+    :members: forward
 
 
 XLMRobertaForSequenceClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.XLMRobertaForSequenceClassification
-    :members:
+    :members: forward
 
 
 XLMRobertaForMultipleChoice
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.XLMRobertaForMultipleChoice
-    :members:
+    :members: forward
 
 
 XLMRobertaForTokenClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.XLMRobertaForTokenClassification
-    :members:
+    :members: forward
 
 
 XLMRobertaForQuestionAnswering
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.XLMRobertaForQuestionAnswering
-    :members:
+    :members: forward
 
 
 TFXLMRobertaModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFXLMRobertaModel
-    :members:
+    :members: call
 
 
 TFXLMRobertaForMaskedLM
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFXLMRobertaForMaskedLM
-    :members:
+    :members: call
 
 
 TFXLMRobertaForSequenceClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFXLMRobertaForSequenceClassification
-    :members:
+    :members: call
 
 
 TFXLMRobertaForMultipleChoice
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFXLMRobertaForMultipleChoice
-    :members:
+    :members: call
 
 
 TFXLMRobertaForTokenClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFXLMRobertaForTokenClassification
-    :members:
+    :members: call
 
 
 TFXLMRobertaForQuestionAnswering
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFXLMRobertaForQuestionAnswering
-    :members:
+    :members: call
diff --git a/docs/source/model_doc/xlnet.rst b/docs/source/model_doc/xlnet.rst
index d424aecc18..894d094df2 100644
--- a/docs/source/model_doc/xlnet.rst
+++ b/docs/source/model_doc/xlnet.rst
@@ -1,14 +1,14 @@
 XLNet
-----------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 Overview
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-The XLNet model was proposed in `XLNet: Generalized Autoregressive Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`_
-by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
-XLnet is an extension of the Transformer-XL model pre-trained using an autoregressive method
-to learn bidirectional contexts by maximizing the expected likelihood over all permutations
-of the input sequence factorization order.
+The XLNet model was proposed in `XLNet: Generalized Autoregressive Pretraining for Language Understanding
+<https://arxiv.org/abs/1906.08237>`_ by Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov,
+Quoc V. Le. XLnet is an extension of the Transformer-XL model pre-trained using an autoregressive method to learn
+bidirectional contexts by maximizing the expected likelihood over all permutations of the input sequence factorization
+order.
 
 The abstract from the paper is the following:
 
@@ -24,26 +24,26 @@ a large margin, including question answering, natural language inference, sentim
 
 Tips:
 
-- The specific attention pattern can be controlled at training and test time using the `perm_mask` input.
-- Due to the difficulty of training a fully auto-regressive model over various factorization order,
-  XLNet is pretrained using only a sub-set of the output tokens as target which are selected
-  with the `target_mapping` input.
-- To use XLNet for sequential decoding (i.e. not in fully bi-directional setting), use the `perm_mask` and
-  `target_mapping` inputs to control the attention span and outputs (see examples in `examples/text-generation/run_generation.py`)
+- The specific attention pattern can be controlled at training and test time using the :obj:`perm_mask` input.
+- Due to the difficulty of training a fully auto-regressive model over various factorization order, XLNet is pretrained
+  using only a sub-set of the output tokens as target which are selected with the :obj:`target_mapping` input.
+- To use XLNet for sequential decoding (i.e. not in fully bi-directional setting), use the :obj:`perm_mask` and
+  :obj:`target_mapping` inputs to control the attention span and outputs (see examples in
+  `examples/text-generation/run_generation.py`)
 - XLNet is one of the few models that has no sequence length limit.
 
-The original code can be found `here <https://github.com/zihangdai/xlnet/>`_.
+The original code can be found `here <https://github.com/zihangdai/xlnet/>`__.
 
 
 XLNetConfig
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.XLNetConfig
     :members:
 
 
 XLNetTokenizer
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.XLNetTokenizer
     :members: build_inputs_with_special_tokens, get_special_tokens_mask,
@@ -51,7 +51,7 @@ XLNetTokenizer
 
 
 XLNet specific outputs
-~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.modeling_xlnet.XLNetModelOutput
     :members:
@@ -94,91 +94,91 @@ XLNet specific outputs
 
 
 XLNetModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.XLNetModel
-    :members:
+    :members: forward
 
 
 XLNetLMHeadModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.XLNetLMHeadModel
-    :members:
+    :members: forward
 
 
 XLNetForSequenceClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.XLNetForSequenceClassification
-    :members:
+    :members: forward
 
 
 XLNetForMultipleChoice
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.XLNetForMultipleChoice
-    :members:
+    :members: forward
 
 
 XLNetForTokenClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.XLNetForTokenClassification
-    :members:
+    :members: forward
 
 
 XLNetForQuestionAnsweringSimple
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.XLNetForQuestionAnsweringSimple
-    :members:
+    :members: forward
 
 
 XLNetForQuestionAnswering
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.XLNetForQuestionAnswering
-    :members:
+    :members: forward
 
 
 TFXLNetModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFXLNetModel
-    :members:
+    :members: call
 
 
 TFXLNetLMHeadModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFXLNetLMHeadModel
-    :members:
+    :members: call
 
 
 TFXLNetForSequenceClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFXLNetForSequenceClassification
-    :members:
+    :members: call
 
 
 TFLNetForMultipleChoice
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFXLNetForMultipleChoice
-    :members:
+    :members: call
 
 
 TFXLNetForTokenClassification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFXLNetForTokenClassification
-    :members:
+    :members: call
 
 
 TFXLNetForQuestionAnsweringSimple
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. autoclass:: transformers.TFXLNetForQuestionAnsweringSimple
-    :members:
+    :members: call
diff --git a/docs/source/model_sharing.rst b/docs/source/model_sharing.rst
index 96ad1d813b..0176bfd5e4 100644
--- a/docs/source/model_sharing.rst
+++ b/docs/source/model_sharing.rst
@@ -1,224 +1,224 @@
-Model sharing and uploading
-===========================
-
-In this page, we will show you how to share a model you have trained or fine-tuned on new data with the community on
-the `model hub <https://huggingface.co/models>`__.
-
-.. note::
-
-    You will need to create an account on `huggingface.co <https://huggingface.co/join>`__ for this.
-
-    Optionally, you can join an existing organization or create a new one.
-
-Prepare your model for uploading
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-We have seen in the :doc:`training tutorial <training>`: how to fine-tune a model on a given task. You have probably
-done something similar on your task, either using the model directly in your own training loop or using the
-:class:`~.transformers.Trainer`/:class:`~.transformers.TFTrainer` class. Let's see how you can share the result on
-the `model hub <https://huggingface.co/models>`__.
-
-Basic steps
-^^^^^^^^^^^
-
-.. 
-    When #5258 is merged, we can remove the need to create the directory.
-
-First, pick a directory with the name you want your model to have on the model hub (its full name will then be
-`username/awesome-name-you-picked` or `organization/awesome-name-you-picked`) and create it with either
-
-::
-
-    mkdir path/to/awesome-name-you-picked
-
-or in python
-
-::
-
-    import os
-    os.makedirs("path/to/awesome-name-you-picked")
-
-then you can save your model and tokenizer with:
-
-::
-
-    model.save_pretrained("path/to/awesome-name-you-picked")
-    tokenizer.save_pretrained("path/to/awesome-name-you-picked")
-
-Or, if you're using the Trainer API
-
-::
-
-    trainer.save_model("path/to/awesome-name-you-picked")
-    tokenizer.save_pretrained("path/to/awesome-name-you-picked")
-
-Make your model work on all frameworks
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-.. 
-    TODO Sylvain: make this automatic during the upload
-
-You probably have your favorite framework, but so will other users! That's why it's best to upload your model with both
-PyTorch `and` TensorFlow checkpoints to make it easier to use (if you skip this step, users will still be able to load
-your model in another framework, but it will be slower, as it will have to be converted on the fly). Don't worry, it's super easy to do (and in a future version,
-it will all be automatic). You will need to install both PyTorch and TensorFlow for this step, but you don't need to
-worry about the GPU, so it should be very easy. Check the
-`TensorFlow installation page <https://www.tensorflow.org/install/pip#tensorflow-2.0-rc-is-available>`__ 
-and/or the `PyTorch installation page <https://pytorch.org/get-started/locally/#start-locally>`__ to see how.
-
-First check that your model class exists in the other framework, that is try to import the same model by either adding
-or removing TF. For instance, if you trained a :class:`~transformers.DistilBertForSequenceClassification`, try to
-type
-
-::
-
-    from transformers import TFDistilBertForSequenceClassification
-
-and if you trained a :class:`~transformers.TFDistilBertForSequenceClassification`, try to
-type
-
-::
-
-    from transformers import DistilBertForSequenceClassification
-
-This will give back an error if your model does not exist in the other framework (something that should be pretty rare
-since we're aiming for full parity between the two frameworks). In this case, skip this and go to the next step.
-
-Now, if you trained your model in PyTorch and have to create a TensorFlow version, adapt the following code to your
-model class:
-
-::
-
-    tf_model = TFDistilBertForSequenceClassification.from_pretrained("path/to/awesome-name-you-picked", from_pt=True)
-    tf_model.save_pretrained("path/to/awesome-name-you-picked")
-
-and if you trained your model in TensorFlow and have to create a PyTorch version, adapt the following code to your
-model class:
-
-::
-
-    pt_model = DistilBertForSequenceClassification.from_pretrained("path/to/awesome-name-you-picked", from_tf=True)
-    pt_model.save_pretrained("path/to/awesome-name-you-picked")
-
-That's all there is to it!
-
-Check the directory before uploading
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Make sure there are no garbage files in the directory you'll upload. It should only have:
-
-- a `config.json` file, which saves the :doc:`configuration <main_classes/configuration>` of your model ;
-- a `pytorch_model.bin` file, which is the PyTorch checkpoint (unless you can't have it for some reason) ;
-- a `tf_model.h5` file, which is the TensorFlow checkpoint (unless you can't have it for some reason) ;
-- a `special_tokens_map.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save;
-- a `tokenizer_config.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save;
-- a `vocab.txt`, which is the vocabulary of your tokenizer, part of your :doc:`tokenizer <main_classes/tokenizer>`
-  save;
-- maybe a `added_tokens.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save.
-
-Other files can safely be deleted.
-
-Upload your model with the CLI
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Now go in a terminal and run the following command. It should be in the virtual enviromnent where you installed 🤗
-Transformers, since that command :obj:`transformers-cli` comes from the library.
-
-::
-
-    transformers-cli login
-
-Then log in using the same credentials as on huggingface.co. To upload your model, just type
-
-::
-
-    transformers-cli upload path/to/awesome-name-you-picked/
-
-This will upload the folder containing the weights, tokenizer and configuration we prepared in the previous section.
-
-By default you will be prompted to confirm that you want these files to be uploaded. If you are uploading multiple models and need to script that process, you can add `-y` to bypass the prompt. For example:
-
-::
-
-    transformers-cli upload -y path/to/awesome-name-you-picked/
-
-
-If you want to upload a single file (a new version of your model, or the other framework checkpoint you want to add),
-just type:
-
-::
-
-    transformers-cli upload path/to/awesome-name-you-picked/that-file 
-
-or
-
-::
-
-   transformers-cli upload path/to/awesome-name-you-picked/that-file --filename awesome-name-you-picked/new_name
-
-if you want to change its filename.
-
-This uploads the model to your personal account. If you want your model to be namespaced by your organization name
-rather than your username, add the following flag to any command:
-
-::
-
-    --organization organization_name
-
-so for instance:
-
-::
-
-    transformers-cli upload path/to/awesome-name-you-picked/ --organization organization_name
-
-Your model will then be accessible through its identifier, which is, as we saw above,
-`username/awesome-name-you-picked` or `organization/awesome-name-you-picked`.
-
-Add a model card
-^^^^^^^^^^^^^^^^
-
-To make sure everyone knows what your model can do, what its limitations and potential bias or ethetical
-considerations, please add a README.md model card to the 🤗 Transformers repo under `model_cards/`. It should then be
-placed in a subfolder with your username or organization, then another subfolder named like your model
-(`awesome-name-you-picked`). Or just click on the "Create a model card on GitHub" button on the model page, it will
-get you directly to the right location. If you need one, `here <https://github.com/huggingface/model_card>`__ is a
-model card template (meta-suggestions are welcome).
-
-If your model is fine-tuned from another model coming from the model hub (all 🤗 Transformers pretrained models do),
-don't forget to link to its model card so that people can fully trace how your model was built.
-
-If you have never made a pull request to the 🤗 Transformers repo, look at the
-:doc:`contributing guide <contributing>` to see the steps to follow.
-
-.. Note::
-
-    You can also send your model card in the folder you uploaded with the CLI by placing it in a `README.md` file
-    inside `path/to/awesome-name-you-picked/`.
-
-Using your model
-^^^^^^^^^^^^^^^^
-
-Your model now has a page on huggingface.co/models 🔥
-
-Anyone can load it from code:
-
-::
-
-    tokenizer = AutoTokenizer.from_pretrained("namespace/awesome-name-you-picked")
-    model = AutoModel.from_pretrained("namespace/awesome-name-you-picked")
-
-Additional commands
-^^^^^^^^^^^^^^^^^^^
-
-You can list all the files you uploaded on the hub like this:
-
-::
-
-    transformers-cli s3 ls
-
-You can also delete unneeded files with
-
-::
-
-    transformers-cli s3 rm awesome-name-you-picked/filename
-
+Model sharing and uploading
+=======================================================================================================================
+
+In this page, we will show you how to share a model you have trained or fine-tuned on new data with the community on
+the `model hub <https://huggingface.co/models>`__.
+
+.. note::
+
+    You will need to create an account on `huggingface.co <https://huggingface.co/join>`__ for this.
+
+    Optionally, you can join an existing organization or create a new one.
+
+Prepare your model for uploading
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We have seen in the :doc:`training tutorial <training>`: how to fine-tune a model on a given task. You have probably
+done something similar on your task, either using the model directly in your own training loop or using the
+:class:`~.transformers.Trainer`/:class:`~.transformers.TFTrainer` class. Let's see how you can share the result on
+the `model hub <https://huggingface.co/models>`__.
+
+Basic steps
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. 
+    When #5258 is merged, we can remove the need to create the directory.
+
+First, pick a directory with the name you want your model to have on the model hub (its full name will then be
+`username/awesome-name-you-picked` or `organization/awesome-name-you-picked`) and create it with either
+
+.. code-block::
+
+    mkdir path/to/awesome-name-you-picked
+
+or in python
+
+.. code-block::
+
+    import os
+    os.makedirs("path/to/awesome-name-you-picked")
+
+then you can save your model and tokenizer with:
+
+.. code-block::
+
+    model.save_pretrained("path/to/awesome-name-you-picked")
+    tokenizer.save_pretrained("path/to/awesome-name-you-picked")
+
+Or, if you're using the Trainer API
+
+.. code-block::
+
+    trainer.save_model("path/to/awesome-name-you-picked")
+    tokenizer.save_pretrained("path/to/awesome-name-you-picked")
+
+Make your model work on all frameworks
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. 
+    TODO Sylvain: make this automatic during the upload
+
+You probably have your favorite framework, but so will other users! That's why it's best to upload your model with both
+PyTorch `and` TensorFlow checkpoints to make it easier to use (if you skip this step, users will still be able to load
+your model in another framework, but it will be slower, as it will have to be converted on the fly). Don't worry, it's super easy to do (and in a future version,
+it will all be automatic). You will need to install both PyTorch and TensorFlow for this step, but you don't need to
+worry about the GPU, so it should be very easy. Check the
+`TensorFlow installation page <https://www.tensorflow.org/install/pip#tensorflow-2.0-rc-is-available>`__ 
+and/or the `PyTorch installation page <https://pytorch.org/get-started/locally/#start-locally>`__ to see how.
+
+First check that your model class exists in the other framework, that is try to import the same model by either adding
+or removing TF. For instance, if you trained a :class:`~transformers.DistilBertForSequenceClassification`, try to
+type
+
+.. code-block::
+
+    from transformers import TFDistilBertForSequenceClassification
+
+and if you trained a :class:`~transformers.TFDistilBertForSequenceClassification`, try to
+type
+
+.. code-block::
+
+    from transformers import DistilBertForSequenceClassification
+
+This will give back an error if your model does not exist in the other framework (something that should be pretty rare
+since we're aiming for full parity between the two frameworks). In this case, skip this and go to the next step.
+
+Now, if you trained your model in PyTorch and have to create a TensorFlow version, adapt the following code to your
+model class:
+
+.. code-block::
+
+    tf_model = TFDistilBertForSequenceClassification.from_pretrained("path/to/awesome-name-you-picked", from_pt=True)
+    tf_model.save_pretrained("path/to/awesome-name-you-picked")
+
+and if you trained your model in TensorFlow and have to create a PyTorch version, adapt the following code to your
+model class:
+
+.. code-block::
+
+    pt_model = DistilBertForSequenceClassification.from_pretrained("path/to/awesome-name-you-picked", from_tf=True)
+    pt_model.save_pretrained("path/to/awesome-name-you-picked")
+
+That's all there is to it!
+
+Check the directory before uploading
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Make sure there are no garbage files in the directory you'll upload. It should only have:
+
+- a `config.json` file, which saves the :doc:`configuration <main_classes/configuration>` of your model ;
+- a `pytorch_model.bin` file, which is the PyTorch checkpoint (unless you can't have it for some reason) ;
+- a `tf_model.h5` file, which is the TensorFlow checkpoint (unless you can't have it for some reason) ;
+- a `special_tokens_map.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save;
+- a `tokenizer_config.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save;
+- a `vocab.txt`, which is the vocabulary of your tokenizer, part of your :doc:`tokenizer <main_classes/tokenizer>`
+  save;
+- maybe a `added_tokens.json`, which is part of your :doc:`tokenizer <main_classes/tokenizer>` save.
+
+Other files can safely be deleted.
+
+Upload your model with the CLI
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Now go in a terminal and run the following command. It should be in the virtual enviromnent where you installed 🤗
+Transformers, since that command :obj:`transformers-cli` comes from the library.
+
+.. code-block::
+
+    transformers-cli login
+
+Then log in using the same credentials as on huggingface.co. To upload your model, just type
+
+.. code-block::
+
+    transformers-cli upload path/to/awesome-name-you-picked/
+
+This will upload the folder containing the weights, tokenizer and configuration we prepared in the previous section.
+
+By default you will be prompted to confirm that you want these files to be uploaded. If you are uploading multiple models and need to script that process, you can add `-y` to bypass the prompt. For example:
+
+.. code-block::
+
+    transformers-cli upload -y path/to/awesome-name-you-picked/
+
+
+If you want to upload a single file (a new version of your model, or the other framework checkpoint you want to add),
+just type:
+
+.. code-block::
+
+    transformers-cli upload path/to/awesome-name-you-picked/that-file 
+
+or
+
+.. code-block::
+
+   transformers-cli upload path/to/awesome-name-you-picked/that-file --filename awesome-name-you-picked/new_name
+
+if you want to change its filename.
+
+This uploads the model to your personal account. If you want your model to be namespaced by your organization name
+rather than your username, add the following flag to any command:
+
+.. code-block::
+
+    --organization organization_name
+
+so for instance:
+
+.. code-block::
+
+    transformers-cli upload path/to/awesome-name-you-picked/ --organization organization_name
+
+Your model will then be accessible through its identifier, which is, as we saw above,
+`username/awesome-name-you-picked` or `organization/awesome-name-you-picked`.
+
+Add a model card
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To make sure everyone knows what your model can do, what its limitations and potential bias or ethetical
+considerations, please add a README.md model card to the 🤗 Transformers repo under `model_cards/`. It should then be
+placed in a subfolder with your username or organization, then another subfolder named like your model
+(`awesome-name-you-picked`). Or just click on the "Create a model card on GitHub" button on the model page, it will
+get you directly to the right location. If you need one, `here <https://github.com/huggingface/model_card>`__ is a
+model card template (meta-suggestions are welcome).
+
+If your model is fine-tuned from another model coming from the model hub (all 🤗 Transformers pretrained models do),
+don't forget to link to its model card so that people can fully trace how your model was built.
+
+If you have never made a pull request to the 🤗 Transformers repo, look at the
+:doc:`contributing guide <contributing>` to see the steps to follow.
+
+.. Note::
+
+    You can also send your model card in the folder you uploaded with the CLI by placing it in a `README.md` file
+    inside `path/to/awesome-name-you-picked/`.
+
+Using your model
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Your model now has a page on huggingface.co/models 🔥
+
+Anyone can load it from code:
+
+.. code-block::
+
+    tokenizer = AutoTokenizer.from_pretrained("namespace/awesome-name-you-picked")
+    model = AutoModel.from_pretrained("namespace/awesome-name-you-picked")
+
+Additional commands
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+You can list all the files you uploaded on the hub like this:
+
+.. code-block::
+
+    transformers-cli s3 ls
+
+You can also delete unneeded files with
+
+.. code-block::
+
+    transformers-cli s3 rm awesome-name-you-picked/filename
+
diff --git a/docs/source/model_summary.rst b/docs/source/model_summary.rst
index e88d8de02a..acfaf243e9 100644
--- a/docs/source/model_summary.rst
+++ b/docs/source/model_summary.rst
@@ -1,750 +1,750 @@
-Summary of the models
-================================================
-
-This is a summary of the models available in 🤗 Transformers. It assumes you’re familiar with the original
-`transformer model <https://arxiv.org/abs/1706.03762>`_. For a gentle introduction check the `annotated transformer
-<http://nlp.seas.harvard.edu/2018/04/03/attention.html>`_. Here we focus on the high-level differences between the
-models. You can check them more in detail in their respective documentation. Also checkout the
-:doc:`pretrained model page </pretrained_models>` to see the checkpoints available for each type of model and all `the
-community models <https://huggingface.co/models>`_.
-
-Each one of the models in the library falls into one of the following categories:
-
-  * :ref:`autoregressive-models`
-  * :ref:`autoencoding-models`
-  * :ref:`seq-to-seq-models`
-  * :ref:`multimodal-models`
-  * :ref:`retrieval-based-models`
-
-Autoregressive models are pretrained on the classic language modeling task: guess the next token having read all the
-previous ones. They correspond to the decoder of the original transformer model, and a mask is used on top of the full
-sentence so that the attention heads can only see what was before in the next, and not what’s after. Although those
-models can be fine-tuned and achieve great results on many tasks, the most natural application is text generation.
-A typical example of such models is GPT.
-
-Autoencoding models are pretrained by corrupting the input tokens in some way and trying to reconstruct the original
-sentence. They correspond to the encoder of the original transformer model in the sense that they get access to the
-full inputs without any mask. Those models usually build a bidirectional representation of the whole sentence. They can
-be fine-tuned and achieve great results on many tasks such as text generation, but their most natural application is
-sentence classification or token classification. A typical example of such models is BERT.
-
-Note that the only difference between autoregressive models and autoencoding models is in the way the model is
-pretrained. Therefore, the same architecture can be used for both autoregressive and autoencoding models. When a given
-model has been used for both types of pretraining, we have put it in the category corresponding to the article where it was first
-introduced.
-
-Sequence-to-sequence models use both the encoder and the decoder of the original transformer, either for translation
-tasks or by transforming other tasks to sequence-to-sequence problems. They can be fine-tuned to many tasks but their
-most natural applications are translation, summarization and question answering. The original transformer model is an
-example of such a model (only for translation), T5 is an example that can be fine-tuned on other tasks.
-
-Multimodal models mix text inputs with other kinds (e.g. images) and are more specific to a given task.
-
-.. _autoregressive-models:
-
-Autoregressive models
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-As mentioned before, these models rely on the decoder part of the original transformer and use an attention mask so
-that at each position, the model can only look at the tokens before the attention heads.
-
-Original GPT
-----------------------------------------------
-
-.. raw:: html
-
-   <a href="https://huggingface.co/models?filter=openai-gpt">
-       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-openai--gpt-blueviolet">
-   </a>
-   <a href="model_doc/gpt.html">
-       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-openai--gpt-blueviolet">
-   </a>
-
-`Improving Language Understanding by Generative Pre-Training <https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf>`_,
-Alec Radford et al.
-
-The first autoregressive model based on the transformer architecture, pretrained on the Book Corpus dataset.
-
-The library provides versions of the model for language modeling and multitask language modeling/multiple choice
-classification.
-
-GPT-2
-----------------------------------------------
-
-.. raw:: html
-
-   <a href="https://huggingface.co/models?filter=gpt2">
-       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-gpt2-blueviolet">
-   </a>
-   <a href="model_doc/gpt2.html">
-       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-gpt2-blueviolet">
-   </a>
-
-`Language Models are Unsupervised Multitask Learners <https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`_,
-Alec Radford et al.
-
-A bigger and better version of GPT, pretrained on WebText (web pages from outgoing links in Reddit with 3 karmas or
-more).
-
-The library provides versions of the model for language modeling and multitask language modeling/multiple choice
-classification.
-
-CTRL
-----------------------------------------------
-
-.. raw:: html
-
-   <a href="https://huggingface.co/models?filter=ctrl">
-       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-ctrl-blueviolet">
-   </a>
-   <a href="model_doc/ctrl.html">
-       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-ctrl-blueviolet">
-   </a>
-
-`CTRL: A Conditional Transformer Language Model for Controllable Generation <https://arxiv.org/abs/1909.05858>`_,
-Nitish Shirish Keskar et al.
-
-Same as the GPT model but adds the idea of control codes. Text is generated from a prompt (can be empty) and one (or
-several) of those control codes which are then used to influence the text generation: generate with the style of
-wikipedia article, a book or a movie review.
-
-The library provides a version of the model for language modeling only.
-
-Transformer-XL
-----------------------------------------------
-
-.. raw:: html
-
-   <a href="https://huggingface.co/models?filter=transfo-xl">
-       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-transfo--xl-blueviolet">
-   </a>
-   <a href="model_doc/transformerxl.html">
-       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-transfo--xl-blueviolet">
-   </a>
-
-`Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context <https://arxiv.org/abs/1901.02860>`_,
-Zihang Dai et al.
-
-Same as a regular GPT model, but introduces a recurrence mechanism for two consecutive segments (similar to a regular
-RNNs with two consecutive inputs). In this context, a segment is a number of consecutive tokens (for instance 512) that
-may span across multiple documents, and segments are fed in order to the model.
-
-Basically, the hidden states of the previous segment are concatenated to the current input to compute the attention
-scores. This allows the model to pay attention to information that was in the previous segment as well as the current
-one. By stacking multiple attention layers, the receptive field can be increased to multiple previous segments.
-
-This changes the positional embeddings to positional relative embeddings (as the regular positional embeddings would
-give the same results in the current input and the current hidden state at a given position) and needs to make some
-adjustments in the way attention scores are computed.
-
-The library provides a version of the model for language modeling only.
-
-.. _reformer:
-
-Reformer
-----------------------------------------------
-
-.. raw:: html
-
-   <a href="https://huggingface.co/models?filter=reformer">
-       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-reformer-blueviolet">
-   </a>
-   <a href="model_doc/reformer.html">
-       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-reformer-blueviolet">
-   </a>
-
-`Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451>`_,
-Nikita Kitaev et al .
-
-An autoregressive transformer model with lots of tricks to reduce memory footprint and compute time. Those tricks
-include:
-
-  * Use :ref:`Axial position encoding <axial-pos-encoding>` (see below for more details). It’s a mechanism to avoid
-    having a huge positional encoding matrix (when the sequence length is very big) by factorizing it into smaller
-    matrices.
-  * Replace traditional attention by :ref:`LSH (local-sensitive hashing) attention <lsh-attention>` (see below for more
-    details). It's a technique to avoid computing the full product query-key in the attention layers.
-  * Avoid storing the intermediate results of each layer by using reversible transformer layers to obtain them during
-    the backward pass (subtracting the residuals from the input of the next layer gives them back) or recomputing them
-    for results inside a given layer (less efficient than storing them but saves memory).
-  * Compute the feedforward operations by chunks and not on the whole batch.
-
-With those tricks, the model can be fed much larger sentences than traditional transformer autoregressive models.
-
-**Note:** This model could be very well be used in an autoencoding setting, there is no checkpoint for such a
-pretraining yet, though.
-
-The library provides a version of the model for language modeling only.
-
-XLNet
-----------------------------------------------
-
-.. raw:: html
-
-   <a href="https://huggingface.co/models?filter=xlnet">
-       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-xlnet-blueviolet">
-   </a>
-   <a href="model_doc/xlnet.html">
-       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-xlnet-blueviolet">
-   </a>
-
-`XLNet: Generalized Autoregressive Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`_,
-Zhilin Yang et al.
-
-XLNet is not a traditional autoregressive model but uses a training strategy that builds on that. It permutes the
-tokens in the sentence, then allows the model to use the last n tokens to predict the token n+1. Since this is all done
-with a mask, the sentence is actually fed in the model in the right order, but instead of masking the first n tokens
-for n+1, XLNet uses a mask that hides the previous tokens in some given permutation of 1,...,sequence length.
-
-XLNet also uses the same recurrence mechanism as Transformer-XL to build long-term dependencies.
-
-The library provides a version of the model for language modeling, token classification, sentence classification,
-multiple choice classification and question answering.
-
-.. _autoencoding-models:
-
-Autoencoding models
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-As mentioned before, these models rely on the encoder part of the original transformer and use no mask so the model can
-look at all the tokens in the attention heads. For pretraining, targets are the original sentences and inputs are their corrupted versions.
-
-BERT
-----------------------------------------------
-
-.. raw:: html
-
-   <a href="https://huggingface.co/models?filter=bert">
-       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-bert-blueviolet">
-   </a>
-   <a href="model_doc/bert.html">
-       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-bert-blueviolet">
-   </a>
-
-`BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding <https://arxiv.org/abs/1810.04805>`_,
-Jacob Devlin et al.
-
-Corrupts the inputs by using random masking, more precisely, during pretraining, a given percentage of tokens (usually
-15%) is masked by:
-
-  * a special mask token with probability 0.8
-  * a random token different from the one masked with probability 0.1
-  * the same token with probability 0.1
-
-The model must predict the original sentence, but has a second objective: inputs are two sentences A and B (with a
-separation token in between). With probability 50%, the sentences are consecutive in the corpus, in the remaining 50%
-they are not related. The model has to predict if the sentences are consecutive or not.
-
-The library provides a version of the model for language modeling (traditional or masked), next sentence prediction,
-token classification, sentence classification, multiple choice classification and question answering.
-
-ALBERT
-----------------------------------------------
-
-.. raw:: html
-
-   <a href="https://huggingface.co/models?filter=albert">
-       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-albert-blueviolet">
-   </a>
-   <a href="model_doc/albert.html">
-       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-albert-blueviolet">
-   </a>
-
-`ALBERT: A Lite BERT for Self-supervised Learning of Language Representations <https://arxiv.org/abs/1909.11942>`_,
-Zhenzhong Lan et al.
-
-Same as BERT but with a few tweaks:
-
-  * Embedding size E is different from hidden size H justified because the embeddings are context independent (one
-    embedding vector represents one token), whereas hidden states are context dependent (one hidden state represents a
-    sequence of tokens) so it's more logical to have H >> E. Also, the embedding matrix is large since it's V x E (V
-    being the vocab size). If E < H, it has less parameters.
-  * Layers are split in groups that share parameters (to save memory).
-  * Next sentence prediction is replaced by a sentence ordering prediction: in the inputs, we have two sentences A and B
-    (that are consecutive) and we either feed A followed by B or B followed by A. The model must predict if they have
-    been swapped or not.
-
-The library provides a version of the model for masked language modeling, token classification, sentence
-classification, multiple choice classification and question answering.
-
-RoBERTa
-----------------------------------------------
-
-.. raw:: html
-
-   <a href="https://huggingface.co/models?filter=roberta">
-       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-roberta-blueviolet">
-   </a>
-   <a href="model_doc/roberta.html">
-       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-roberta-blueviolet">
-   </a>
-
-`RoBERTa: A Robustly Optimized BERT Pretraining Approach <https://arxiv.org/abs/1907.11692>`_,
-Yinhan Liu et al.
-
-Same as BERT with better pretraining tricks:
-
-  * dynamic masking: tokens are masked differently at each epoch, whereas BERT does it once and for all
-  * no NSP (next sentence prediction) loss and instead of putting just two sentences together, put a chunk of
-    contiguous texts together to reach 512 tokens (so the sentences are in an order than may span several documents)
-  * train with larger batches
-  * use BPE with bytes as a subunit and not characters (because of unicode characters)
-
-The library provides a version of the model for masked language modeling, token classification, sentence
-classification, multiple choice classification and question answering.
-
-DistilBERT
-----------------------------------------------
-
-.. raw:: html
-
-   <a href="https://huggingface.co/models?filter=distilbert">
-       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-distilbert-blueviolet">
-   </a>
-   <a href="model_doc/distilbert.html">
-       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-distilbert-blueviolet">
-   </a>
-
-`DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter <https://arxiv.org/abs/1910.01108>`_,
-Victor Sanh et al.
-
-Same as BERT but smaller. Trained by distillation of the pretrained BERT model, meaning it's been trained to predict
-the same probabilities as the larger model. The actual objective is a combination of:
-
-  * finding the same probabilities as the teacher model
-  * predicting the masked tokens correctly (but no next-sentence objective)
-  * a cosine similarity between the hidden states of the student and the teacher model
-
-The library provides a version of the model for masked language modeling, token classification, sentence classification
-and question answering.
-
-XLM
-----------------------------------------------
-
-.. raw:: html
-
-   <a href="https://huggingface.co/models?filter=xlm">
-       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-xlm-blueviolet">
-   </a>
-   <a href="model_doc/xlm.html">
-       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-xlm-blueviolet">
-   </a>
-
-`Cross-lingual Language Model Pretraining <https://arxiv.org/abs/1901.07291>`_, Guillaume Lample and Alexis Conneau
-
-A transformer model trained on several languages. There are three different type of training for this model and the
-library provides checkpoints for all of them:
-
-  * Causal language modeling (CLM) which is the traditional autoregressive training (so this model could be in the
-    previous section as well). One of the languages is selected for each training sample, and the model input is a
-    sentence of 256 tokens, that may span over several documents in one of those languages.
-  * Masked language modeling (MLM) which is like RoBERTa. One of the languages is selected for each training sample,
-    and the model input is a sentence of 256 tokens, that may span over several documents in one of those languages, with
-    dynamic masking of the tokens.
-  * A combination of MLM and translation language modeling (TLM). This consists of concatenating a sentence in two
-    different languages, with random masking. To predict one of the masked tokens, the model can use both, the
-    surrounding context in language 1 and the context given by language 2.
-
-Checkpoints refer to which method was used for pretraining by having `clm`, `mlm` or `mlm-tlm` in their names. On top
-of positional embeddings, the model has language embeddings. When training using MLM/CLM, this gives the model an
-indication of the language used, and when training using MLM+TLM, an indication of the language used for each part.
-
-The library provides a version of the model for language modeling, token classification, sentence classification and
-question answering.
-
-XLM-RoBERTa
-----------------------------------------------
-
-.. raw:: html
-
-   <a href="https://huggingface.co/models?filter=xlm-roberta">
-       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-xlm--roberta-blueviolet">
-   </a>
-   <a href="model_doc/xlmroberta.html">
-       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-xlm--roberta-blueviolet">
-   </a>
-
-`Unsupervised Cross-lingual Representation Learning at Scale <https://arxiv.org/abs/1911.02116>`_, Alexis Conneau et
-al.
-
-Uses RoBERTa tricks on the XLM approach, but does not use the translation language modeling objective. It only uses
-masked language modeling on sentences coming from one language. However, the model is trained on many more languages
-(100) and doesn't use the language embeddings, so it's capable of detecting the input language by itself.
-
-The library provides a version of the model for masked language modeling, token classification, sentence
-classification, multiple choice classification and question answering.
-
-FlauBERT
-----------------------------------------------
-
-.. raw:: html
-
-   <a href="https://huggingface.co/models?filter=flaubert">
-       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-flaubert-blueviolet">
-   </a>
-   <a href="model_doc/flaubert.html">
-       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-flaubert-blueviolet">
-   </a>
-
-`FlauBERT: Unsupervised Language Model Pre-training for French <https://arxiv.org/abs/1912.05372>`_, Hang Le et al.
-
-Like RoBERTa, without the sentence ordering prediction (so just trained on the MLM objective).
-
-The library provides a version of the model for language modeling and sentence classification.
-
-ELECTRA
-----------------------------------------------
-
-.. raw:: html
-
-   <a href="https://huggingface.co/models?filter=electra">
-       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-electra-blueviolet">
-   </a>
-   <a href="model_doc/electra.html">
-       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-electra-blueviolet">
-   </a>
-
-`ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators <https://arxiv.org/abs/2003.10555>`_,
-Kevin Clark et al.
-
-ELECTRA is a transformer model pretrained with the use of another (small) masked language model. The inputs are
-corrupted by that language model, which takes an input text that is randomly masked and outputs a text in which ELECTRA
-has to predict which token is an original and which one has been replaced. Like for GAN training, the small language
-model is trained for a few steps (but with the original texts as objective, not to fool the ELECTRA model like in a
-traditional GAN setting) then the ELECTRA model is trained for a few steps.
-
-The library provides a version of the model for masked language modeling, token classification and sentence
-classification.
-
-Funnel Transformer
-----------------------------------------------
-
-.. raw:: html
-
-   <a href="https://huggingface.co/models?filter=funnel">
-       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-funnel-blueviolet">
-   </a>
-   <a href="model_doc/funnel.html">
-       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-funnel-blueviolet">
-   </a>
-
-`Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing
-<https://arxiv.org/abs/2006.03236>`_, Zihang Dai et al.
-
-Funnel Transformer is a transformer model using pooling, a bit like a ResNet model: layers are grouped in blocks, and
-at the beginning of each block (except the first one), the hidden states are pooled among the sequence dimension. This
-way, their length is divided by 2, which speeds up the computation of the next hidden states. All pretrained models
-have three blocks, which means the final hidden state has a sequence length that is one fourth of the original sequence
-length.
-
-For tasks such as classification, this is not a problem, but for tasks like masked language modeling or token
-classification, we need a hidden state with the same sequence length as the original input. In those cases, the final
-hidden states are upsampled to the input sequence length and go through two additional layers. That's why there are two
-versions of each checkpoint. The version suffixed with "-base" contains only the three blocks, while the version
-without that suffix contains the three blocks and the upsampling head with its additional layers.
-
-The pretrained models available use the same pretraining objective as ELECTRA.
-
-The library provides a version of the model for masked language modeling, token classification, sentence
-classification, multiple choice classification and question answering.
-
-.. _longformer:
-
-Longformer
-----------------------------------------------
-
-.. raw:: html
-
-   <a href="https://huggingface.co/models?filter=longformer">
-       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-longformer-blueviolet">
-   </a>
-   <a href="model_doc/longformer.html">
-       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-longformer-blueviolet">
-   </a>
-
-`Longformer: The Long-Document Transformer <https://arxiv.org/abs/2004.05150>`_, Iz Beltagy et al.
-
-A transformer model replacing the attention matrices by sparse matrices to go faster. Often, the local context (e.g.,
-what are the two tokens left and right?) is enough to take action for a given token. Some preselected input tokens are
-still given global attention, but the attention matrix has way less parameters, resulting in a speed-up. See the
-:ref:`local attention section <local-attention>` for more information.
-
-It is pretrained the same way a RoBERTa otherwise.
-
-**Note:** This model could be very well be used in an autoregressive setting, there is no checkpoint for such a
-pretraining yet, though.
-
-The library provides a version of the model for masked language modeling, token classification, sentence
-classification, multiple choice classification and question answering.
-
-.. _seq-to-seq-models:
-
-Sequence-to-sequence models
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-As mentioned before, these models keep both the encoder and the decoder of the original transformer.
-
-BART
-----------------------------------------------
-
-.. raw:: html
-
-   <a href="https://huggingface.co/models?filter=bart">
-       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-bart-blueviolet">
-   </a>
-   <a href="model_doc/bart.html">
-       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-bart-blueviolet">
-   </a>
-
-`BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
-<https://arxiv.org/abs/1910.13461>`_, Mike Lewis et al.
-
-Sequence-to-sequence model with an encoder and a decoder. Encoder is fed a corrupted version of the tokens, decoder is
-fed the original tokens (but has a mask to hide the future words like a regular transformers decoder). For the encoder, on the
-pretraining tasks, a composition of the following transformations are applied:
-
-  * mask random tokens (like in BERT)
-  * delete random tokens
-  * mask a span of k tokens with a single mask token (a span of 0 tokens is an insertion of a mask token)
-  * permute sentences
-  * rotate the document to make it start at a specific token
-
-The library provides a version of this model for conditional generation and sequence classification.
-
-Pegasus
-----------------------------------------------
-
-.. raw:: html
-
-   <a href="https://huggingface.co/models?filter=pegasus">
-       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-pegasus-blueviolet">
-   </a>
-   <a href="model_doc/pegasus.html">
-       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-pegasus-blueviolet">
-   </a>
-
-`PEGASUS: Pre-training with Extracted Gap-sentences forAbstractive Summarization 
-<https://arxiv.org/pdf/1912.08777.pdf>`_, Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu on Dec 18, 2019.
-
-Sequence-to-sequence model with the same encoder-decoder model architecture as BART. Pegasus is pre-trained jointly on two self-supervised objective functions: Masked Language Modeling (MLM) and a novel summarization specific pre-training objective, called Gap Sentence Generation (GSG).
-
-  * MLM: encoder input tokens are randomely replaced by a mask tokens and have to be predicted by the encoder (like in BERT)
-  * GSG: whole encoder input sentences are replaced by a second mask token and fed to the decoder, but which has a causal mask to hide the future words like a regular auto-regressive transformer decoder.
-
-In contrast to BART, Pegasus' pretraining task is intentionally similar to summarization: important sentences are masked and are generated together as one output sequence from the remaining sentences, similar to an extractive summary.
-
-The library provides a version of this model for conditional generation, which should be used for summarization.
-
-
-MarianMT
-----------------------------------------------
-
-.. raw:: html
-
-   <a href="https://huggingface.co/models?filter=marian">
-       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-marian-blueviolet">
-   </a>
-   <a href="model_doc/marian.html">
-       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-marian-blueviolet">
-   </a>
-
-`Marian: Fast Neural Machine Translation in C++ <https://arxiv.org/abs/1804.00344>`_, Marcin Junczys-Dowmunt et al.
-
-A framework for translation models, using the same models as BART
-
-The library provides a version of this model for conditional generation.
-
-T5
-----------------------------------------------
-
-.. raw:: html
-
-   <a href="https://huggingface.co/models?filter=t5">
-       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-t5-blueviolet">
-   </a>
-   <a href="model_doc/t5.html">
-       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-t5-blueviolet">
-   </a>
-
-`Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer <https://arxiv.org/abs/1910.10683>`_,
-Colin Raffel et al.
-
-Uses the traditional transformer model (with a slight change in the positional embeddings, which are learned at
-each layer). To be able to operate on all NLP tasks, it transforms them into text-to-text problems by using specific
-prefixes: “summarize: ”, “question: ”, “translate English to German: ” and so forth.
-
-The pretraining includes both supervised and self-supervised training. Supervised training is conducted on downstream
-tasks provided by the GLUE and SuperGLUE benchmarks (converting them into text-to-text tasks as explained above).
-
-Self-supervised training uses corrupted tokens, by randomly removing 15% of the tokens and
-replacing them with individual sentinel tokens (if several consecutive tokens are marked for removal, the whole group is replaced with a single sentinel token). The input of the encoder is the corrupted sentence, the input of the decoder is the
-original sentence and the target is then the dropped out tokens delimited by their sentinel tokens.
-
-For instance, if we have the sentence “My dog is very cute .”, and we decide to remove the tokens: "dog", "is" and "cute", the encoder
-input becomes “My <x> very <y> .” and the target input becomes “<x> dog is <y> cute .<z>”
-
-The library provides a version of this model for conditional generation.
-
-MBart
-----------------------------------------------
-
-.. raw:: html
-
-   <a href="https://huggingface.co/models?filter=mbart">
-       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-mbart-blueviolet">
-   </a>
-   <a href="model_doc/mbart.html">
-       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-mbart-blueviolet">
-   </a>
-
-`Multilingual Denoising Pre-training for Neural Machine Translation <https://arxiv.org/abs/2001.08210>`_ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov
-Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
-
-The model architecture and pre-training objective is same as BART, but MBart is trained on 25 languages 
-and is intended for supervised and unsupervised machine translation. MBart is one of the first methods 
-for pre-training a complete sequence-to-sequence model by denoising full texts in multiple languages,
-
-The library provides a version of this model for conditional generation.
-
-The `mbart-large-en-ro checkpoint <https://huggingface.co/facebook/mbart-large-en-ro>`_ can be used for english -> romanian translation.
-
-The `mbart-large-cc25 <https://huggingface.co/facebook/mbart-large-cc25>`_ checkpoint can be finetuned for other translation and summarization tasks, using code in ```examples/seq2seq/``` , but is not very useful without finetuning.
-
-.. _multimodal-models:
-
-Multimodal models
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-There is one multimodal model in the library which has not been pretrained in the self-supervised fashion like the
-others.
-
-MMBT
-----------------------------------------------
-
-`Supervised Multimodal Bitransformers for Classifying Images and Text <https://arxiv.org/abs/1909.02950>`_, Douwe Kiela
-et al.
-
-A transformers model used in multimodal settings, combining a text and an image to make predictions. The transformer
-model takes as inputs the embeddings of the tokenized text and the final activations of a pretrained on images resnet
-(after the pooling layer) that goes through a linear layer (to go from number of features at the end of the
-resnet to the hidden state dimension of the transformer).
-
-The different inputs are concatenated, and on top of the positional embeddings, a segment embedding is added to let the
-model know which part of the input vector corresponds to the text and which to the image.
-
-The pretrained model only works for classification.
-
-..
-    More information in this :doc:`model documentation </model_doc/mmbt.html>`.
-    TODO: write this page
-
-.. _retrieval-based-models:
-
-Retrieval-based models
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Some models use documents retrieval during (pre)training and inference for open-domain question answering, for example.
-
-
-DPR
-----------------------------------------------
-
-.. raw:: html
-
-   <a href="https://huggingface.co/models?filter=dpr">
-       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-dpr-blueviolet">
-   </a>
-   <a href="model_doc/dpr.html">
-       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-dpr-blueviolet">
-   </a>
-
-`Dense Passage Retrieval for Open-Domain Question Answering <https://arxiv.org/abs/2004.04906>`_,
-Vladimir Karpukhin et al.
-
-Dense Passage Retrieval (DPR) - is a set of tools and models for state-of-the-art open-domain question-answering research.
-
-
-DPR consists in three models:
-
-  * Question encoder: encode questions as vectors
-  * Context encoder: encode contexts as vectors
-  * Reader: extract the answer of the questions inside retrieved contexts, along with a relevance score (high if the inferred span actually answers the question).
-
-DPR's pipeline (not implemented yet) uses a retrieval step to find the top k contexts given a certain question, and then it calls the reader with the question and the retrieved documents to get the answer.
-
-RAG
-----------------------------------------------
-
-.. raw:: html
-
-   <a href="https://huggingface.co/models?filter=rag">
-       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-rag-blueviolet">
-   </a>
-   <a href="model_doc/rag.html">
-       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-rag-blueviolet">
-   </a>
-
-`Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks <https://arxiv.org/abs/2005.11401>`_,
-Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela
-
-Retrieval-augmented generation ("RAG") models combine the powers of pretrained dense retrieval (DPR) and Seq2Seq models.
-RAG models retrieve docs, pass them to a seq2seq model, then marginalize to generate outputs.
-The retriever and seq2seq modules are initialized from pretrained models, and fine-tuned jointly, allowing both retrieval and generation to adapt to downstream tasks.
-
-The two models RAG-Token and RAG-Sequence are available for generation.
-
-More technical aspects
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Full vs sparse attention
-----------------------------------------------
-
-Most transformer models use full attention in the sense that the attention matrix is square. It can be a big
-computational bottleneck when you have long texts. Longformer and reformer are models that try to be more efficient and
-use a sparse version of the attention matrix to speed up training.
-
-.. _lsh-attention:
-
-**LSH attention**
-
-:ref:`Reformer <reformer>` uses LSH attention. In the softmax(QK^t), only the biggest elements (in the softmax
-dimension) of the matrix QK^t are going to give useful contributions. So for each query q in Q, we can  consider only
-the keys k in K that are close to q. A hash function is used to determine if q and k are close. The attention mask is
-modified to mask the current token (except at the first position), because it will give a query and a key equal (so very
-similar to each other). Since the hash can be a bit random, several hash functions are used in practice (determined by
-a n_rounds parameter) and then are averaged together.
-
-.. _local-attention:
-
-**Local attention**
-
-:ref:`Longformer <longformer>` uses local attention: often, the local context (e.g., what are the two tokens to the left and
-right?) is enough to take action for a given token. Also, by stacking attention layers that have a small window, the
-last layer will have a receptive field of more than just the tokens in the window, allowing them to build a
-representation of the whole sentence.
-
-Some preselected input tokens are also given global attention: for those few tokens, the attention matrix can access
-all tokens and this process is symmetric: all other tokens have access to those specific tokens (on top of the ones in
-their local window). This is shown in Figure 2d of the paper, see below for a sample attention mask:
-
-.. image:: imgs/local_attention_mask.png
-   :scale: 50 %
-   :align: center
-
-Using those attention matrices with less parameters then allows the model to have inputs having a bigger sequence
-length.
-
-Other tricks
-----------------------------------------------
-
-.. _axial-pos-encoding:
-
-**Axial positional encodings**
-
-:ref:`Reformer <reformer>` uses axial positional encodings: in traditional transformer models, the positional encoding
-E is a matrix of size :math:`l` by :math:`d`, :math:`l` being the sequence length and :math:`d` the dimension of the
-hidden state. If you have very long texts, this matrix can be huge and take way too much space on the GPU. To alleviate that, axial positional encodings consist of factorizing that big matrix E in two smaller matrices E1 and
-E2, with dimensions :math:`l_{1} \times d_{1}` and :math:`l_{2} \times d_{2}`, such that :math:`l_{1} \times l_{2} = l`
-and :math:`d_{1} + d_{2} = d` (with the product for the lengths, this ends up being way smaller). The embedding for
-time step :math:`j` in E is obtained by concatenating the embeddings for timestep :math:`j \% l1` in E1 and
-:math:`j // l1` in E2.
+Summary of the models
+=======================================================================================================================
+
+This is a summary of the models available in 🤗 Transformers. It assumes you’re familiar with the original
+`transformer model <https://arxiv.org/abs/1706.03762>`_. For a gentle introduction check the `annotated transformer
+<http://nlp.seas.harvard.edu/2018/04/03/attention.html>`_. Here we focus on the high-level differences between the
+models. You can check them more in detail in their respective documentation. Also checkout the
+:doc:`pretrained model page </pretrained_models>` to see the checkpoints available for each type of model and all `the
+community models <https://huggingface.co/models>`_.
+
+Each one of the models in the library falls into one of the following categories:
+
+  * :ref:`autoregressive-models`
+  * :ref:`autoencoding-models`
+  * :ref:`seq-to-seq-models`
+  * :ref:`multimodal-models`
+  * :ref:`retrieval-based-models`
+
+Autoregressive models are pretrained on the classic language modeling task: guess the next token having read all the
+previous ones. They correspond to the decoder of the original transformer model, and a mask is used on top of the full
+sentence so that the attention heads can only see what was before in the next, and not what’s after. Although those
+models can be fine-tuned and achieve great results on many tasks, the most natural application is text generation.
+A typical example of such models is GPT.
+
+Autoencoding models are pretrained by corrupting the input tokens in some way and trying to reconstruct the original
+sentence. They correspond to the encoder of the original transformer model in the sense that they get access to the
+full inputs without any mask. Those models usually build a bidirectional representation of the whole sentence. They can
+be fine-tuned and achieve great results on many tasks such as text generation, but their most natural application is
+sentence classification or token classification. A typical example of such models is BERT.
+
+Note that the only difference between autoregressive models and autoencoding models is in the way the model is
+pretrained. Therefore, the same architecture can be used for both autoregressive and autoencoding models. When a given
+model has been used for both types of pretraining, we have put it in the category corresponding to the article where it was first
+introduced.
+
+Sequence-to-sequence models use both the encoder and the decoder of the original transformer, either for translation
+tasks or by transforming other tasks to sequence-to-sequence problems. They can be fine-tuned to many tasks but their
+most natural applications are translation, summarization and question answering. The original transformer model is an
+example of such a model (only for translation), T5 is an example that can be fine-tuned on other tasks.
+
+Multimodal models mix text inputs with other kinds (e.g. images) and are more specific to a given task.
+
+.. _autoregressive-models:
+
+Autoregressive models
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+As mentioned before, these models rely on the decoder part of the original transformer and use an attention mask so
+that at each position, the model can only look at the tokens before the attention heads.
+
+Original GPT
+-----------------------------------------------------------------------------------------------------------------------
+
+.. raw:: html
+
+   <a href="https://huggingface.co/models?filter=openai-gpt">
+       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-openai--gpt-blueviolet">
+   </a>
+   <a href="model_doc/gpt.html">
+       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-openai--gpt-blueviolet">
+   </a>
+
+`Improving Language Understanding by Generative Pre-Training <https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf>`_,
+Alec Radford et al.
+
+The first autoregressive model based on the transformer architecture, pretrained on the Book Corpus dataset.
+
+The library provides versions of the model for language modeling and multitask language modeling/multiple choice
+classification.
+
+GPT-2
+-----------------------------------------------------------------------------------------------------------------------
+
+.. raw:: html
+
+   <a href="https://huggingface.co/models?filter=gpt2">
+       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-gpt2-blueviolet">
+   </a>
+   <a href="model_doc/gpt2.html">
+       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-gpt2-blueviolet">
+   </a>
+
+`Language Models are Unsupervised Multitask Learners <https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`_,
+Alec Radford et al.
+
+A bigger and better version of GPT, pretrained on WebText (web pages from outgoing links in Reddit with 3 karmas or
+more).
+
+The library provides versions of the model for language modeling and multitask language modeling/multiple choice
+classification.
+
+CTRL
+-----------------------------------------------------------------------------------------------------------------------
+
+.. raw:: html
+
+   <a href="https://huggingface.co/models?filter=ctrl">
+       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-ctrl-blueviolet">
+   </a>
+   <a href="model_doc/ctrl.html">
+       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-ctrl-blueviolet">
+   </a>
+
+`CTRL: A Conditional Transformer Language Model for Controllable Generation <https://arxiv.org/abs/1909.05858>`_,
+Nitish Shirish Keskar et al.
+
+Same as the GPT model but adds the idea of control codes. Text is generated from a prompt (can be empty) and one (or
+several) of those control codes which are then used to influence the text generation: generate with the style of
+wikipedia article, a book or a movie review.
+
+The library provides a version of the model for language modeling only.
+
+Transformer-XL
+-----------------------------------------------------------------------------------------------------------------------
+
+.. raw:: html
+
+   <a href="https://huggingface.co/models?filter=transfo-xl">
+       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-transfo--xl-blueviolet">
+   </a>
+   <a href="model_doc/transformerxl.html">
+       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-transfo--xl-blueviolet">
+   </a>
+
+`Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context <https://arxiv.org/abs/1901.02860>`_,
+Zihang Dai et al.
+
+Same as a regular GPT model, but introduces a recurrence mechanism for two consecutive segments (similar to a regular
+RNNs with two consecutive inputs). In this context, a segment is a number of consecutive tokens (for instance 512) that
+may span across multiple documents, and segments are fed in order to the model.
+
+Basically, the hidden states of the previous segment are concatenated to the current input to compute the attention
+scores. This allows the model to pay attention to information that was in the previous segment as well as the current
+one. By stacking multiple attention layers, the receptive field can be increased to multiple previous segments.
+
+This changes the positional embeddings to positional relative embeddings (as the regular positional embeddings would
+give the same results in the current input and the current hidden state at a given position) and needs to make some
+adjustments in the way attention scores are computed.
+
+The library provides a version of the model for language modeling only.
+
+.. _reformer:
+
+Reformer
+-----------------------------------------------------------------------------------------------------------------------
+
+.. raw:: html
+
+   <a href="https://huggingface.co/models?filter=reformer">
+       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-reformer-blueviolet">
+   </a>
+   <a href="model_doc/reformer.html">
+       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-reformer-blueviolet">
+   </a>
+
+`Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451>`_,
+Nikita Kitaev et al .
+
+An autoregressive transformer model with lots of tricks to reduce memory footprint and compute time. Those tricks
+include:
+
+  * Use :ref:`Axial position encoding <axial-pos-encoding>` (see below for more details). It’s a mechanism to avoid
+    having a huge positional encoding matrix (when the sequence length is very big) by factorizing it into smaller
+    matrices.
+  * Replace traditional attention by :ref:`LSH (local-sensitive hashing) attention <lsh-attention>` (see below for more
+    details). It's a technique to avoid computing the full product query-key in the attention layers.
+  * Avoid storing the intermediate results of each layer by using reversible transformer layers to obtain them during
+    the backward pass (subtracting the residuals from the input of the next layer gives them back) or recomputing them
+    for results inside a given layer (less efficient than storing them but saves memory).
+  * Compute the feedforward operations by chunks and not on the whole batch.
+
+With those tricks, the model can be fed much larger sentences than traditional transformer autoregressive models.
+
+**Note:** This model could be very well be used in an autoencoding setting, there is no checkpoint for such a
+pretraining yet, though.
+
+The library provides a version of the model for language modeling only.
+
+XLNet
+-----------------------------------------------------------------------------------------------------------------------
+
+.. raw:: html
+
+   <a href="https://huggingface.co/models?filter=xlnet">
+       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-xlnet-blueviolet">
+   </a>
+   <a href="model_doc/xlnet.html">
+       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-xlnet-blueviolet">
+   </a>
+
+`XLNet: Generalized Autoregressive Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`_,
+Zhilin Yang et al.
+
+XLNet is not a traditional autoregressive model but uses a training strategy that builds on that. It permutes the
+tokens in the sentence, then allows the model to use the last n tokens to predict the token n+1. Since this is all done
+with a mask, the sentence is actually fed in the model in the right order, but instead of masking the first n tokens
+for n+1, XLNet uses a mask that hides the previous tokens in some given permutation of 1,...,sequence length.
+
+XLNet also uses the same recurrence mechanism as Transformer-XL to build long-term dependencies.
+
+The library provides a version of the model for language modeling, token classification, sentence classification,
+multiple choice classification and question answering.
+
+.. _autoencoding-models:
+
+Autoencoding models
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+As mentioned before, these models rely on the encoder part of the original transformer and use no mask so the model can
+look at all the tokens in the attention heads. For pretraining, targets are the original sentences and inputs are their corrupted versions.
+
+BERT
+-----------------------------------------------------------------------------------------------------------------------
+
+.. raw:: html
+
+   <a href="https://huggingface.co/models?filter=bert">
+       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-bert-blueviolet">
+   </a>
+   <a href="model_doc/bert.html">
+       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-bert-blueviolet">
+   </a>
+
+`BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding <https://arxiv.org/abs/1810.04805>`_,
+Jacob Devlin et al.
+
+Corrupts the inputs by using random masking, more precisely, during pretraining, a given percentage of tokens (usually
+15%) is masked by:
+
+  * a special mask token with probability 0.8
+  * a random token different from the one masked with probability 0.1
+  * the same token with probability 0.1
+
+The model must predict the original sentence, but has a second objective: inputs are two sentences A and B (with a
+separation token in between). With probability 50%, the sentences are consecutive in the corpus, in the remaining 50%
+they are not related. The model has to predict if the sentences are consecutive or not.
+
+The library provides a version of the model for language modeling (traditional or masked), next sentence prediction,
+token classification, sentence classification, multiple choice classification and question answering.
+
+ALBERT
+-----------------------------------------------------------------------------------------------------------------------
+
+.. raw:: html
+
+   <a href="https://huggingface.co/models?filter=albert">
+       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-albert-blueviolet">
+   </a>
+   <a href="model_doc/albert.html">
+       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-albert-blueviolet">
+   </a>
+
+`ALBERT: A Lite BERT for Self-supervised Learning of Language Representations <https://arxiv.org/abs/1909.11942>`_,
+Zhenzhong Lan et al.
+
+Same as BERT but with a few tweaks:
+
+  * Embedding size E is different from hidden size H justified because the embeddings are context independent (one
+    embedding vector represents one token), whereas hidden states are context dependent (one hidden state represents a
+    sequence of tokens) so it's more logical to have H >> E. Also, the embedding matrix is large since it's V x E (V
+    being the vocab size). If E < H, it has less parameters.
+  * Layers are split in groups that share parameters (to save memory).
+  * Next sentence prediction is replaced by a sentence ordering prediction: in the inputs, we have two sentences A and B
+    (that are consecutive) and we either feed A followed by B or B followed by A. The model must predict if they have
+    been swapped or not.
+
+The library provides a version of the model for masked language modeling, token classification, sentence
+classification, multiple choice classification and question answering.
+
+RoBERTa
+-----------------------------------------------------------------------------------------------------------------------
+
+.. raw:: html
+
+   <a href="https://huggingface.co/models?filter=roberta">
+       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-roberta-blueviolet">
+   </a>
+   <a href="model_doc/roberta.html">
+       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-roberta-blueviolet">
+   </a>
+
+`RoBERTa: A Robustly Optimized BERT Pretraining Approach <https://arxiv.org/abs/1907.11692>`_,
+Yinhan Liu et al.
+
+Same as BERT with better pretraining tricks:
+
+  * dynamic masking: tokens are masked differently at each epoch, whereas BERT does it once and for all
+  * no NSP (next sentence prediction) loss and instead of putting just two sentences together, put a chunk of
+    contiguous texts together to reach 512 tokens (so the sentences are in an order than may span several documents)
+  * train with larger batches
+  * use BPE with bytes as a subunit and not characters (because of unicode characters)
+
+The library provides a version of the model for masked language modeling, token classification, sentence
+classification, multiple choice classification and question answering.
+
+DistilBERT
+-----------------------------------------------------------------------------------------------------------------------
+
+.. raw:: html
+
+   <a href="https://huggingface.co/models?filter=distilbert">
+       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-distilbert-blueviolet">
+   </a>
+   <a href="model_doc/distilbert.html">
+       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-distilbert-blueviolet">
+   </a>
+
+`DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter <https://arxiv.org/abs/1910.01108>`_,
+Victor Sanh et al.
+
+Same as BERT but smaller. Trained by distillation of the pretrained BERT model, meaning it's been trained to predict
+the same probabilities as the larger model. The actual objective is a combination of:
+
+  * finding the same probabilities as the teacher model
+  * predicting the masked tokens correctly (but no next-sentence objective)
+  * a cosine similarity between the hidden states of the student and the teacher model
+
+The library provides a version of the model for masked language modeling, token classification, sentence classification
+and question answering.
+
+XLM
+-----------------------------------------------------------------------------------------------------------------------
+
+.. raw:: html
+
+   <a href="https://huggingface.co/models?filter=xlm">
+       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-xlm-blueviolet">
+   </a>
+   <a href="model_doc/xlm.html">
+       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-xlm-blueviolet">
+   </a>
+
+`Cross-lingual Language Model Pretraining <https://arxiv.org/abs/1901.07291>`_, Guillaume Lample and Alexis Conneau
+
+A transformer model trained on several languages. There are three different type of training for this model and the
+library provides checkpoints for all of them:
+
+  * Causal language modeling (CLM) which is the traditional autoregressive training (so this model could be in the
+    previous section as well). One of the languages is selected for each training sample, and the model input is a
+    sentence of 256 tokens, that may span over several documents in one of those languages.
+  * Masked language modeling (MLM) which is like RoBERTa. One of the languages is selected for each training sample,
+    and the model input is a sentence of 256 tokens, that may span over several documents in one of those languages, with
+    dynamic masking of the tokens.
+  * A combination of MLM and translation language modeling (TLM). This consists of concatenating a sentence in two
+    different languages, with random masking. To predict one of the masked tokens, the model can use both, the
+    surrounding context in language 1 and the context given by language 2.
+
+Checkpoints refer to which method was used for pretraining by having `clm`, `mlm` or `mlm-tlm` in their names. On top
+of positional embeddings, the model has language embeddings. When training using MLM/CLM, this gives the model an
+indication of the language used, and when training using MLM+TLM, an indication of the language used for each part.
+
+The library provides a version of the model for language modeling, token classification, sentence classification and
+question answering.
+
+XLM-RoBERTa
+-----------------------------------------------------------------------------------------------------------------------
+
+.. raw:: html
+
+   <a href="https://huggingface.co/models?filter=xlm-roberta">
+       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-xlm--roberta-blueviolet">
+   </a>
+   <a href="model_doc/xlmroberta.html">
+       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-xlm--roberta-blueviolet">
+   </a>
+
+`Unsupervised Cross-lingual Representation Learning at Scale <https://arxiv.org/abs/1911.02116>`_, Alexis Conneau et
+al.
+
+Uses RoBERTa tricks on the XLM approach, but does not use the translation language modeling objective. It only uses
+masked language modeling on sentences coming from one language. However, the model is trained on many more languages
+(100) and doesn't use the language embeddings, so it's capable of detecting the input language by itself.
+
+The library provides a version of the model for masked language modeling, token classification, sentence
+classification, multiple choice classification and question answering.
+
+FlauBERT
+-----------------------------------------------------------------------------------------------------------------------
+
+.. raw:: html
+
+   <a href="https://huggingface.co/models?filter=flaubert">
+       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-flaubert-blueviolet">
+   </a>
+   <a href="model_doc/flaubert.html">
+       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-flaubert-blueviolet">
+   </a>
+
+`FlauBERT: Unsupervised Language Model Pre-training for French <https://arxiv.org/abs/1912.05372>`_, Hang Le et al.
+
+Like RoBERTa, without the sentence ordering prediction (so just trained on the MLM objective).
+
+The library provides a version of the model for language modeling and sentence classification.
+
+ELECTRA
+-----------------------------------------------------------------------------------------------------------------------
+
+.. raw:: html
+
+   <a href="https://huggingface.co/models?filter=electra">
+       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-electra-blueviolet">
+   </a>
+   <a href="model_doc/electra.html">
+       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-electra-blueviolet">
+   </a>
+
+`ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators <https://arxiv.org/abs/2003.10555>`_,
+Kevin Clark et al.
+
+ELECTRA is a transformer model pretrained with the use of another (small) masked language model. The inputs are
+corrupted by that language model, which takes an input text that is randomly masked and outputs a text in which ELECTRA
+has to predict which token is an original and which one has been replaced. Like for GAN training, the small language
+model is trained for a few steps (but with the original texts as objective, not to fool the ELECTRA model like in a
+traditional GAN setting) then the ELECTRA model is trained for a few steps.
+
+The library provides a version of the model for masked language modeling, token classification and sentence
+classification.
+
+Funnel Transformer
+-----------------------------------------------------------------------------------------------------------------------
+
+.. raw:: html
+
+   <a href="https://huggingface.co/models?filter=funnel">
+       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-funnel-blueviolet">
+   </a>
+   <a href="model_doc/funnel.html">
+       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-funnel-blueviolet">
+   </a>
+
+`Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing
+<https://arxiv.org/abs/2006.03236>`_, Zihang Dai et al.
+
+Funnel Transformer is a transformer model using pooling, a bit like a ResNet model: layers are grouped in blocks, and
+at the beginning of each block (except the first one), the hidden states are pooled among the sequence dimension. This
+way, their length is divided by 2, which speeds up the computation of the next hidden states. All pretrained models
+have three blocks, which means the final hidden state has a sequence length that is one fourth of the original sequence
+length.
+
+For tasks such as classification, this is not a problem, but for tasks like masked language modeling or token
+classification, we need a hidden state with the same sequence length as the original input. In those cases, the final
+hidden states are upsampled to the input sequence length and go through two additional layers. That's why there are two
+versions of each checkpoint. The version suffixed with "-base" contains only the three blocks, while the version
+without that suffix contains the three blocks and the upsampling head with its additional layers.
+
+The pretrained models available use the same pretraining objective as ELECTRA.
+
+The library provides a version of the model for masked language modeling, token classification, sentence
+classification, multiple choice classification and question answering.
+
+.. _longformer:
+
+Longformer
+-----------------------------------------------------------------------------------------------------------------------
+
+.. raw:: html
+
+   <a href="https://huggingface.co/models?filter=longformer">
+       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-longformer-blueviolet">
+   </a>
+   <a href="model_doc/longformer.html">
+       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-longformer-blueviolet">
+   </a>
+
+`Longformer: The Long-Document Transformer <https://arxiv.org/abs/2004.05150>`_, Iz Beltagy et al.
+
+A transformer model replacing the attention matrices by sparse matrices to go faster. Often, the local context (e.g.,
+what are the two tokens left and right?) is enough to take action for a given token. Some preselected input tokens are
+still given global attention, but the attention matrix has way less parameters, resulting in a speed-up. See the
+:ref:`local attention section <local-attention>` for more information.
+
+It is pretrained the same way a RoBERTa otherwise.
+
+**Note:** This model could be very well be used in an autoregressive setting, there is no checkpoint for such a
+pretraining yet, though.
+
+The library provides a version of the model for masked language modeling, token classification, sentence
+classification, multiple choice classification and question answering.
+
+.. _seq-to-seq-models:
+
+Sequence-to-sequence models
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+As mentioned before, these models keep both the encoder and the decoder of the original transformer.
+
+BART
+-----------------------------------------------------------------------------------------------------------------------
+
+.. raw:: html
+
+   <a href="https://huggingface.co/models?filter=bart">
+       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-bart-blueviolet">
+   </a>
+   <a href="model_doc/bart.html">
+       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-bart-blueviolet">
+   </a>
+
+`BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
+<https://arxiv.org/abs/1910.13461>`_, Mike Lewis et al.
+
+Sequence-to-sequence model with an encoder and a decoder. Encoder is fed a corrupted version of the tokens, decoder is
+fed the original tokens (but has a mask to hide the future words like a regular transformers decoder). For the encoder, on the
+pretraining tasks, a composition of the following transformations are applied:
+
+  * mask random tokens (like in BERT)
+  * delete random tokens
+  * mask a span of k tokens with a single mask token (a span of 0 tokens is an insertion of a mask token)
+  * permute sentences
+  * rotate the document to make it start at a specific token
+
+The library provides a version of this model for conditional generation and sequence classification.
+
+Pegasus
+-----------------------------------------------------------------------------------------------------------------------
+
+.. raw:: html
+
+   <a href="https://huggingface.co/models?filter=pegasus">
+       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-pegasus-blueviolet">
+   </a>
+   <a href="model_doc/pegasus.html">
+       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-pegasus-blueviolet">
+   </a>
+
+`PEGASUS: Pre-training with Extracted Gap-sentences forAbstractive Summarization 
+<https://arxiv.org/pdf/1912.08777.pdf>`_, Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu on Dec 18, 2019.
+
+Sequence-to-sequence model with the same encoder-decoder model architecture as BART. Pegasus is pre-trained jointly on two self-supervised objective functions: Masked Language Modeling (MLM) and a novel summarization specific pre-training objective, called Gap Sentence Generation (GSG).
+
+  * MLM: encoder input tokens are randomely replaced by a mask tokens and have to be predicted by the encoder (like in BERT)
+  * GSG: whole encoder input sentences are replaced by a second mask token and fed to the decoder, but which has a causal mask to hide the future words like a regular auto-regressive transformer decoder.
+
+In contrast to BART, Pegasus' pretraining task is intentionally similar to summarization: important sentences are masked and are generated together as one output sequence from the remaining sentences, similar to an extractive summary.
+
+The library provides a version of this model for conditional generation, which should be used for summarization.
+
+
+MarianMT
+-----------------------------------------------------------------------------------------------------------------------
+
+.. raw:: html
+
+   <a href="https://huggingface.co/models?filter=marian">
+       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-marian-blueviolet">
+   </a>
+   <a href="model_doc/marian.html">
+       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-marian-blueviolet">
+   </a>
+
+`Marian: Fast Neural Machine Translation in C++ <https://arxiv.org/abs/1804.00344>`_, Marcin Junczys-Dowmunt et al.
+
+A framework for translation models, using the same models as BART
+
+The library provides a version of this model for conditional generation.
+
+T5
+-----------------------------------------------------------------------------------------------------------------------
+
+.. raw:: html
+
+   <a href="https://huggingface.co/models?filter=t5">
+       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-t5-blueviolet">
+   </a>
+   <a href="model_doc/t5.html">
+       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-t5-blueviolet">
+   </a>
+
+`Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer <https://arxiv.org/abs/1910.10683>`_,
+Colin Raffel et al.
+
+Uses the traditional transformer model (with a slight change in the positional embeddings, which are learned at
+each layer). To be able to operate on all NLP tasks, it transforms them into text-to-text problems by using specific
+prefixes: “summarize: ”, “question: ”, “translate English to German: ” and so forth.
+
+The pretraining includes both supervised and self-supervised training. Supervised training is conducted on downstream
+tasks provided by the GLUE and SuperGLUE benchmarks (converting them into text-to-text tasks as explained above).
+
+Self-supervised training uses corrupted tokens, by randomly removing 15% of the tokens and
+replacing them with individual sentinel tokens (if several consecutive tokens are marked for removal, the whole group is replaced with a single sentinel token). The input of the encoder is the corrupted sentence, the input of the decoder is the
+original sentence and the target is then the dropped out tokens delimited by their sentinel tokens.
+
+For instance, if we have the sentence “My dog is very cute .”, and we decide to remove the tokens: "dog", "is" and "cute", the encoder
+input becomes “My <x> very <y> .” and the target input becomes “<x> dog is <y> cute .<z>”
+
+The library provides a version of this model for conditional generation.
+
+MBart
+-----------------------------------------------------------------------------------------------------------------------
+
+.. raw:: html
+
+   <a href="https://huggingface.co/models?filter=mbart">
+       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-mbart-blueviolet">
+   </a>
+   <a href="model_doc/mbart.html">
+       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-mbart-blueviolet">
+   </a>
+
+`Multilingual Denoising Pre-training for Neural Machine Translation <https://arxiv.org/abs/2001.08210>`_ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov
+Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
+
+The model architecture and pre-training objective is same as BART, but MBart is trained on 25 languages 
+and is intended for supervised and unsupervised machine translation. MBart is one of the first methods 
+for pre-training a complete sequence-to-sequence model by denoising full texts in multiple languages,
+
+The library provides a version of this model for conditional generation.
+
+The `mbart-large-en-ro checkpoint <https://huggingface.co/facebook/mbart-large-en-ro>`_ can be used for english -> romanian translation.
+
+The `mbart-large-cc25 <https://huggingface.co/facebook/mbart-large-cc25>`_ checkpoint can be finetuned for other translation and summarization tasks, using code in ```examples/seq2seq/``` , but is not very useful without finetuning.
+
+.. _multimodal-models:
+
+Multimodal models
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+There is one multimodal model in the library which has not been pretrained in the self-supervised fashion like the
+others.
+
+MMBT
+-----------------------------------------------------------------------------------------------------------------------
+
+`Supervised Multimodal Bitransformers for Classifying Images and Text <https://arxiv.org/abs/1909.02950>`_, Douwe Kiela
+et al.
+
+A transformers model used in multimodal settings, combining a text and an image to make predictions. The transformer
+model takes as inputs the embeddings of the tokenized text and the final activations of a pretrained on images resnet
+(after the pooling layer) that goes through a linear layer (to go from number of features at the end of the
+resnet to the hidden state dimension of the transformer).
+
+The different inputs are concatenated, and on top of the positional embeddings, a segment embedding is added to let the
+model know which part of the input vector corresponds to the text and which to the image.
+
+The pretrained model only works for classification.
+
+..
+    More information in this :doc:`model documentation </model_doc/mmbt.html>`.
+    TODO: write this page
+
+.. _retrieval-based-models:
+
+Retrieval-based models
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Some models use documents retrieval during (pre)training and inference for open-domain question answering, for example.
+
+
+DPR
+-----------------------------------------------------------------------------------------------------------------------
+
+.. raw:: html
+
+   <a href="https://huggingface.co/models?filter=dpr">
+       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-dpr-blueviolet">
+   </a>
+   <a href="model_doc/dpr.html">
+       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-dpr-blueviolet">
+   </a>
+
+`Dense Passage Retrieval for Open-Domain Question Answering <https://arxiv.org/abs/2004.04906>`_,
+Vladimir Karpukhin et al.
+
+Dense Passage Retrieval (DPR) - is a set of tools and models for state-of-the-art open-domain question-answering research.
+
+
+DPR consists in three models:
+
+  * Question encoder: encode questions as vectors
+  * Context encoder: encode contexts as vectors
+  * Reader: extract the answer of the questions inside retrieved contexts, along with a relevance score (high if the inferred span actually answers the question).
+
+DPR's pipeline (not implemented yet) uses a retrieval step to find the top k contexts given a certain question, and then it calls the reader with the question and the retrieved documents to get the answer.
+
+RAG
+-----------------------------------------------------------------------------------------------------------------------
+
+.. raw:: html
+
+   <a href="https://huggingface.co/models?filter=rag">
+       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-rag-blueviolet">
+   </a>
+   <a href="model_doc/rag.html">
+       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-rag-blueviolet">
+   </a>
+
+`Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks <https://arxiv.org/abs/2005.11401>`_,
+Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela
+
+Retrieval-augmented generation ("RAG") models combine the powers of pretrained dense retrieval (DPR) and Seq2Seq models.
+RAG models retrieve docs, pass them to a seq2seq model, then marginalize to generate outputs.
+The retriever and seq2seq modules are initialized from pretrained models, and fine-tuned jointly, allowing both retrieval and generation to adapt to downstream tasks.
+
+The two models RAG-Token and RAG-Sequence are available for generation.
+
+More technical aspects
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Full vs sparse attention
+-----------------------------------------------------------------------------------------------------------------------
+
+Most transformer models use full attention in the sense that the attention matrix is square. It can be a big
+computational bottleneck when you have long texts. Longformer and reformer are models that try to be more efficient and
+use a sparse version of the attention matrix to speed up training.
+
+.. _lsh-attention:
+
+**LSH attention**
+
+:ref:`Reformer <reformer>` uses LSH attention. In the softmax(QK^t), only the biggest elements (in the softmax
+dimension) of the matrix QK^t are going to give useful contributions. So for each query q in Q, we can  consider only
+the keys k in K that are close to q. A hash function is used to determine if q and k are close. The attention mask is
+modified to mask the current token (except at the first position), because it will give a query and a key equal (so very
+similar to each other). Since the hash can be a bit random, several hash functions are used in practice (determined by
+a n_rounds parameter) and then are averaged together.
+
+.. _local-attention:
+
+**Local attention**
+
+:ref:`Longformer <longformer>` uses local attention: often, the local context (e.g., what are the two tokens to the left and
+right?) is enough to take action for a given token. Also, by stacking attention layers that have a small window, the
+last layer will have a receptive field of more than just the tokens in the window, allowing them to build a
+representation of the whole sentence.
+
+Some preselected input tokens are also given global attention: for those few tokens, the attention matrix can access
+all tokens and this process is symmetric: all other tokens have access to those specific tokens (on top of the ones in
+their local window). This is shown in Figure 2d of the paper, see below for a sample attention mask:
+
+.. image:: imgs/local_attention_mask.png
+   :scale: 50 %
+   :align: center
+
+Using those attention matrices with less parameters then allows the model to have inputs having a bigger sequence
+length.
+
+Other tricks
+-----------------------------------------------------------------------------------------------------------------------
+
+.. _axial-pos-encoding:
+
+**Axial positional encodings**
+
+:ref:`Reformer <reformer>` uses axial positional encodings: in traditional transformer models, the positional encoding
+E is a matrix of size :math:`l` by :math:`d`, :math:`l` being the sequence length and :math:`d` the dimension of the
+hidden state. If you have very long texts, this matrix can be huge and take way too much space on the GPU. To alleviate that, axial positional encodings consist of factorizing that big matrix E in two smaller matrices E1 and
+E2, with dimensions :math:`l_{1} \times d_{1}` and :math:`l_{2} \times d_{2}`, such that :math:`l_{1} \times l_{2} = l`
+and :math:`d_{1} + d_{2} = d` (with the product for the lengths, this ends up being way smaller). The embedding for
+time step :math:`j` in E is obtained by concatenating the embeddings for timestep :math:`j \% l1` in E1 and
+:math:`j // l1` in E2.
diff --git a/docs/source/multilingual.rst b/docs/source/multilingual.rst
index c35b01da3b..2ded7d83e1 100644
--- a/docs/source/multilingual.rst
+++ b/docs/source/multilingual.rst
@@ -1,5 +1,5 @@
 Multi-lingual models
-================================================
+=======================================================================================================================
 
 Most of the models available in this library are mono-lingual models (English, Chinese and German). A few
 multi-lingual models are available and have a different mechanisms than mono-lingual models.
@@ -8,13 +8,13 @@ This page details the usage of these models.
 The two models that currently support multiple languages are BERT and XLM.
 
 XLM
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 XLM has a total of 10 different checkpoints, only one of which is mono-lingual. The 9 remaining model checkpoints can
 be split in two categories: the checkpoints that make use of language embeddings, and those that don't
 
 XLM & Language Embeddings
-------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 This section concerns the following checkpoints:
 
@@ -82,7 +82,7 @@ The example `run_generation.py <https://github.com/huggingface/transformers/blob
 can generate text using the CLM checkpoints from XLM, using the language embeddings.
 
 XLM without Language Embeddings
-------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 This section concerns the following checkpoints:
 
@@ -94,7 +94,7 @@ sentence representations, differently from previously-mentioned XLM checkpoints.
 
 
 BERT
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 BERT has two checkpoints that can be used for multi-lingual tasks:
 
@@ -105,7 +105,7 @@ These checkpoints do not require language embeddings at inference time. They sho
 used in the context and infer accordingly.
 
 XLM-RoBERTa
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 XLM-RoBERTa was trained on 2.5TB of newly created clean CommonCrawl data in 100 languages. It provides strong
 gains over previously released multi-lingual models like mBERT or XLM on downstream taks like classification,
diff --git a/docs/source/perplexity.rst b/docs/source/perplexity.rst
index 555c932950..0be3d43940 100644
--- a/docs/source/perplexity.rst
+++ b/docs/source/perplexity.rst
@@ -1,5 +1,5 @@
 Perplexity of fixed-length models
-=================================
+=======================================================================================================================
 
 Perplexity (PPL) is one of the most common metrics for evaluating language
 models. Before diving in, we should note that the metric applies specifically
@@ -31,7 +31,7 @@ relationship to Bits Per Character (BPC) and data compression, check out this
 <https://thegradient.pub/understanding-evaluation-metrics-for-language-models/>`_.
 
 Calculating PPL with fixed-length models
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 If we weren't limited by a model's context size, we would evaluate the
 model's perplexity by autoregressively factorizing a sequence and
@@ -83,7 +83,7 @@ time. This allows computation to procede much faster while still giving the
 model a large context to make predictions at each step.
 
 Example: Calculating perplexity with GPT-2 in 🤗 Transformers
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Let's demonstrate this process with GPT-2.
 
diff --git a/docs/source/philosophy.rst b/docs/source/philosophy.rst
index 9fd17683e7..b449aeb083 100644
--- a/docs/source/philosophy.rst
+++ b/docs/source/philosophy.rst
@@ -1,5 +1,5 @@
 Philosophy
-==========
+=======================================================================================================================
 
 🤗 Transformers is an opinionated library built for:
 
@@ -48,7 +48,7 @@ A few other goals:
 - Switch easily between PyTorch and TensorFlow 2.0, allowing training using one framework and inference using another.
 
 Main concepts
-~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 The library is built around three types of classes for each model:
 
diff --git a/docs/source/preprocessing.rst b/docs/source/preprocessing.rst
index c5bc19c94b..5fe278e49e 100644
--- a/docs/source/preprocessing.rst
+++ b/docs/source/preprocessing.rst
@@ -1,343 +1,343 @@
-Preprocessing data
-==================
-
-In this tutorial, we'll explore how to preprocess your data using 🤗 Transformers. The main tool for this is what we
-
-call a :doc:`tokenizer <main_classes/tokenizer>`. You can build one using the tokenizer class associated to the model
-you would like to use, or directly with the :class:`~transformers.AutoTokenizer` class.
-
-As we saw in the :doc:`quicktour </quicktour>`, the tokenizer will first split a given text in words (or part of words,
-punctuation symbols, etc.) usually called `tokens`. Then it will convert those `tokens` into numbers, to be able to
-build a tensor out of them and feed them to the model. It will also add any additional inputs the model might expect to
-work properly.
-
-.. note::
-
-    If you plan on using a pretrained model, it's important to use the associated pretrained tokenizer: it will split
-    the text you give it in tokens the same way for the pretraining corpus, and it will use the same correspondence
-    token to index (that we usually call a `vocab`) as during pretraining.
-
-To automatically download the vocab used during pretraining or fine-tuning a given model, you can use the 
-:func:`~transformers.AutoTokenizer.from_pretrained` method:
-
-.. code-block::
-
-    from transformers import AutoTokenizer
-    tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
-
-Base use
-~~~~~~~~
-
-A :class:`~transformers.PreTrainedTokenizer` has many methods, but the only one you need to remember for preprocessing
-is its ``__call__``: you just need to feed your sentence to your tokenizer object.
-
-.. code-block::
-
-    >>> encoded_input = tokenizer("Hello, I'm a single sentence!")
-    >>> print(encoded_input)
-    {'input_ids': [101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102], 
-     'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
-     'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
-
-This returns a dictionary string to list of ints.
-The `input_ids <glossary.html#input-ids>`__ are the indices corresponding to each token in our sentence. We will see
-below what the `attention_mask <glossary.html#attention-mask>`__ is used for and in
-:ref:`the next section <sentence-pairs>` the goal of `token_type_ids <glossary.html#token-type-ids>`__.
-
-The tokenizer can decode a list of token ids in a proper sentence:
-
-.. code-block::
-
-    >>> tokenizer.decode(encoded_input["input_ids"])
-    "[CLS] Hello, I'm a single sentence! [SEP]"
-
-As you can see, the tokenizer automatically added some special tokens that the model expect. Not all model need special
-tokens; for instance, if we had used` gtp2-medium` instead of `bert-base-cased` to create our tokenizer, we would have
-seen the same sentence as the original one here. You can disable this behavior (which is only advised if you have added
-those special tokens yourself) by passing ``add_special_tokens=False``.
-
-If you have several sentences you want to process, you can do this efficiently by sending them as a list to the
-tokenizer:
-
-.. code-block::
-
-    >>> batch_sentences = ["Hello I'm a single sentence",
-    ...                    "And another sentence",
-    ...                    "And the very very last one"]
-    >>> encoded_inputs = tokenizer(batch_sentences)
-    >>> print(encoded_inputs)
-    {'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
-                   [101, 1262, 1330, 5650, 102],
-                   [101, 1262, 1103, 1304, 1304, 1314, 1141, 102]],
-     'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0],
-                        [0, 0, 0, 0, 0],
-                        [0, 0, 0, 0, 0, 0, 0, 0]],
-     'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1],
-                        [1, 1, 1, 1, 1],
-                        [1, 1, 1, 1, 1, 1, 1, 1]]}
-
-We get back a dictionary once again, this time with values being list of list of ints.
-
-If the purpose of sending several sentences at a time to the tokenizer is to build a batch to feed the model, you will
-probably want:
-
-- To pad each sentence to the maximum length there is in your batch.
-- To truncate each sentence to the maximum length the model can accept (if applicable).
-- To return tensors.
-
-You can do all of this by using the following options when feeding your list of sentences to the tokenizer:
-
-.. code-block::
-
-    >>> ## PYTORCH CODE
-    >>> batch = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
-    >>> print(batch)
-    {'input_ids': tensor([[ 101, 8667,  146,  112,  182,  170, 1423, 5650,  102],
-                          [ 101, 1262, 1330, 5650,  102,    0,    0,    0,    0],
-                          [ 101, 1262, 1103, 1304, 1304, 1314, 1141,  102,    0]]),
-     'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
-                               [0, 0, 0, 0, 0, 0, 0, 0, 0],
-                               [0, 0, 0, 0, 0, 0, 0, 0, 0]]), 
-     'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
-                               [1, 1, 1, 1, 1, 0, 0, 0, 0],
-                               [1, 1, 1, 1, 1, 1, 1, 1, 0]])}
-    >>> ## TENSORFLOW CODE
-    >>> batch = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="tf")
-    >>> print(batch)
-    {'input_ids': tf.Tensor([[ 101, 8667,  146,  112,  182,  170, 1423, 5650,  102],
-                          [ 101, 1262, 1330, 5650,  102,    0,    0,    0,    0],
-                          [ 101, 1262, 1103, 1304, 1304, 1314, 1141,  102,    0]]),
-     'token_type_ids': tf.Tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
-                               [0, 0, 0, 0, 0, 0, 0, 0, 0],
-                               [0, 0, 0, 0, 0, 0, 0, 0, 0]]), 
-     'attention_mask': tf.Tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
-                               [1, 1, 1, 1, 1, 0, 0, 0, 0],
-                               [1, 1, 1, 1, 1, 1, 1, 1, 0]])}
-
-It returns a dictionary string to tensor. We can now see what the `attention_mask <glossary.html#attention-mask>`__ is
-all about: it points out which tokens the model should pay attention to and which ones it should not (because they
-represent padding in this case).
-
-
-Note that if your model does not have a maximum length associated to it, the command above will throw a warning. You
-can safely ignore it. You can also pass ``verbose=False`` to stop the tokenizer to throw those kinds of warnings.
-
-.. _sentence-pairs:
-
-Preprocessing pairs of sentences
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Sometimes you need to feed pair of sentences to your model. For instance, if you want to classify if two sentences in a
-pair are similar, or for question-answering models, which take a context and a question. For BERT models, the input is
-then represented like this: :obj:`[CLS] Sequence A [SEP] Sequence B [SEP]`
-
-You can encode a pair of sentences in the format expected by your model by supplying the two sentences as two arguments
-(not a list since a list of two sentences will be interpreted as a batch of two single sentences, as we saw before).
-This will once again return a dict string to list of ints:
-
-.. code-block::
-
-    >>> encoded_input = tokenizer("How old are you?", "I'm 6 years old")
-    >>> print(encoded_input)
-    {'input_ids': [101, 1731, 1385, 1132, 1128, 136, 102, 146, 112, 182, 127, 1201, 1385, 102], 
-     'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1], 
-     'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
-
-This shows us what the `token_type_ids <glossary.html#token-type-ids>`__ are for: they indicate to the model which part
-of the inputs correspond to the first sentence and which part corresponds to the second sentence. Note that
-`token_type_ids` are not required or handled by all models. By default, a tokenizer will only return the inputs that
-its associated model expects. You can force the return (or the non-return) of any of those special arguments by
-using ``return_input_ids`` or ``return_token_type_ids``.
-
-If we decode the token ids we obtained, we will see that the special tokens have been properly added.
-
-.. code-block::
-
-    >>> tokenizer.decode(encoded_input["input_ids"])
-    "[CLS] How old are you? [SEP] I'm 6 years old [SEP]"
-
-If you have a list of pairs of sequences you want to process, you should feed them as two lists to your tokenizer: the
-list of first sentences and the list of second sentences:
-
-.. code-block::
-
-    >>> batch_sentences = ["Hello I'm a single sentence",
-    ...                    "And another sentence",
-    ...                    "And the very very last one"]
-    >>> batch_of_second_sentences = ["I'm a sentence that goes with the first sentence",
-    ...                              "And I should be encoded with the second sentence",
-    ...                              "And I go with the very last one"]
-    >>> encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences)
-    >>> print(encoded_inputs)
-    {'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102, 146, 112, 182, 170, 5650, 1115, 2947, 1114, 1103, 1148, 5650, 102], 
-                   [101, 1262, 1330, 5650, 102, 1262, 146, 1431, 1129, 12544, 1114, 1103, 1248, 5650, 102], 
-                   [101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 1262, 146, 1301, 1114, 1103, 1304, 1314, 1141, 102]], 
-    'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
-                       [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
-                       [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 
-    'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
-                       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
-                       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
-
-As we can see, it returns a dictionary with the values being list of lists of ints.
-
-To double-check what is fed to the model, we can decode each list in `input_ids` one by one:
-
-.. code-block::
-
-    >>> for ids in encoded_inputs["input_ids"]:
-    >>>     print(tokenizer.decode(ids))
-    [CLS] Hello I'm a single sentence [SEP] I'm a sentence that goes with the first sentence [SEP]
-    [CLS] And another sentence [SEP] And I should be encoded with the second sentence [SEP]
-    [CLS] And the very very last one [SEP] And I go with the very last one [SEP]
-
-Once again, you can automatically pad your inputs to the maximum sentence length in the batch, truncate to the maximum
-length the model can accept and return tensors directly with the following:
-
-.. code-block::
-
-    ## PYTORCH CODE
-    batch = tokenizer(batch_sentences, batch_of_second_sentences, padding=True, truncation=True, return_tensors="pt")
-    ## TENSORFLOW CODE
-    batch = tokenizer(batch_sentences, batch_of_second_sentences, padding=True, truncation=True, return_tensors="tf")
-
-Everything you always wanted to know about padding and truncation
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-We have seen the commands that will work for most cases (pad your batch to the length of the maximum sentence and
-
-truncate to the maximum length the mode can accept). However, the API supports more strategies if you need them. The
-three arguments you need to know for this are :obj:`padding`, :obj:`truncation` and :obj:`max_length`.
-
-- :obj:`padding` controls the padding. It can be a boolean or a string which should be:
-
-    - :obj:`True` or :obj:`'longest'` to pad to the longest sequence in the batch (doing no padding if you only provide
-      a single sequence).
-    - :obj:`'max_length'` to pad to a length specified by the :obj:`max_length` argument or the maximum length accepted
-      by the model if no :obj:`max_length` is provided (``max_length=None``). If you only provide a single sequence,
-      padding will still be applied to it. 
-    - :obj:`False` or :obj:`'do_not_pad'` to not pad the sequences. As we have seen before, this is the default
-      behavior.
-
-- :obj:`truncation` controls the truncation. It can be a boolean or a string which should be:
-
-    - :obj:`True` or :obj:`'only_first'` truncate to a maximum length specified by the :obj:`max_length` argument or
-      the maximum length accepted by the model if no :obj:`max_length` is provided (``max_length=None``). This will
-      only truncate the first sentence of a pair if a pair of sequence (or a batch of pairs of sequences) is provided.
-    - :obj:`'only_second'` truncate to a maximum length specified by the :obj:`max_length` argument or the maximum
-      length accepted by the model if no :obj:`max_length` is provided (``max_length=None``). This will only truncate
-      the second sentence of a pair if a pair of sequence (or a batch of pairs of sequences) is provided.
-    - :obj:`'longest_first'` truncate to a maximum length specified by the :obj:`max_length` argument or the maximum
-      length accepted by the model if no :obj:`max_length` is provided (``max_length=None``). This will truncate token
-      by token, removing a token from the longest sequence in the pair until the proper length is reached.
-    - :obj:`False` or :obj:`'do_not_truncate'` to not truncate the sequences. As we have seen before, this is the
-      default behavior.
-
-- :obj:`max_length` to control the length of the padding/truncation. It can be an integer or :obj:`None`, in which case
-  it will default to the maximum length the model can accept. If the model has no specific maximum input length,
-  truncation/padding to :obj:`max_length` is deactivated.
-
-Here is a table summarizing the recommend way to setup padding and truncation. If you use pair of inputs sequence in
-any of the following examples, you can replace :obj:`truncation=True` by a :obj:`STRATEGY` selected in 
-:obj:`['only_first', 'only_second', 'longest_first']`, i.e. :obj:`truncation='only_second'` or
-:obj:`truncation= 'longest_first'` to control how both sequence in the pair are truncated as detailed before.
-
-+--------------------------------------+-----------------------------------+---------------------------------------------------------------------------------------------+
-| Truncation                           | Padding                           | Instruction                                                                                 |
-+======================================+===================================+=============================================================================================+
-| no truncation                        | no padding                        | :obj:`tokenizer(batch_sentences)`                                                           |
-|                                      +-----------------------------------+---------------------------------------------------------------------------------------------+
-|                                      | padding to max sequence in batch  | :obj:`tokenizer(batch_sentences, padding=True)` or                                          |
-|                                      |                                   | :obj:`tokenizer(batch_sentences, padding='longest')`                                        |
-|                                      +-----------------------------------+---------------------------------------------------------------------------------------------+
-|                                      | padding to max model input length | :obj:`tokenizer(batch_sentences, padding='max_length')`                                     |
-|                                      +-----------------------------------+---------------------------------------------------------------------------------------------+
-|                                      | padding to specific length        | :obj:`tokenizer(batch_sentences, padding='max_length', max_length=42)`                      |
-+--------------------------------------+-----------------------------------+---------------------------------------------------------------------------------------------+
-| truncation to max model input length | no padding                        | :obj:`tokenizer(batch_sentences, truncation=True)` or                                       |
-|                                      |                                   | :obj:`tokenizer(batch_sentences, truncation=STRATEGY)`                                      |
-|                                      +-----------------------------------+---------------------------------------------------------------------------------------------+
-|                                      | padding to max sequence in batch  | :obj:`tokenizer(batch_sentences, padding=True, truncation=True)` or                         |
-|                                      |                                   | :obj:`tokenizer(batch_sentences, padding=True, truncation=STRATEGY)`                        |
-|                                      +-----------------------------------+---------------------------------------------------------------------------------------------+
-|                                      | padding to max model input length | :obj:`tokenizer(batch_sentences, padding='max_length', truncation=True)` or                 |
-|                                      |                                   | :obj:`tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY)`                |
-|                                      +-----------------------------------+---------------------------------------------------------------------------------------------+
-|                                      | padding to specific length        | Not possible                                                                                |
-+--------------------------------------+-----------------------------------+---------------------------------------------------------------------------------------------+
-| truncation to specific length        | no padding                        | :obj:`tokenizer(batch_sentences, truncation=True, max_length=42)` or                        |
-|                                      |                                   | :obj:`tokenizer(batch_sentences, truncation=STRATEGY, max_length=42)`                       |
-|                                      +-----------------------------------+---------------------------------------------------------------------------------------------+
-|                                      | padding to max sequence in batch  | :obj:`tokenizer(batch_sentences, padding=True, truncation=True, max_length=42)` or          |
-|                                      |                                   | :obj:`tokenizer(batch_sentences, padding=True, truncation=STRATEGY, max_length=42)`         |
-|                                      +-----------------------------------+---------------------------------------------------------------------------------------------+
-|                                      | padding to max model input length | Not possible                                                                                |
-|                                      +-----------------------------------+---------------------------------------------------------------------------------------------+
-|                                      | padding to specific length        | :obj:`tokenizer(batch_sentences, padding='max_length', truncation=True, max_length=42)` or  |
-|                                      |                                   | :obj:`tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY, max_length=42)` |
-+--------------------------------------+-----------------------------------+---------------------------------------------------------------------------------------------+
-
-Pre-tokenized inputs
-~~~~~~~~~~~~~~~~~~~~
-
-The tokenizer also accept pre-tokenized inputs. This is particularly useful when you want to compute labels and extract
-predictions in `named entity recognition (NER) <https://en.wikipedia.org/wiki/Named-entity_recognition>`__ or
-`part-of-speech tagging (POS tagging) <https://en.wikipedia.org/wiki/Part-of-speech_tagging>`__.
-
-.. warning::
-
-    Pre-tokenized does not mean your inputs are already tokenized (you wouldn't need to pass them though the tokenizer
-    if that was the case) but just split into words (which is often the first step in subword tokenization algorithms
-    like BPE).
-
-If you want to use pre-tokenized inputs, just set :obj:`is_split_into_words=True` when passing your inputs to the
-tokenizer. For instance, we have:
-
-.. code-block::
-
-    >>> encoded_input = tokenizer(["Hello", "I'm", "a", "single", "sentence"], is_split_into_words=True)
-    >>> print(encoded_input)
-    {'input_ids': [101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
-     'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 
-     'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
-
-Note that the tokenizer still adds the ids of special tokens (if applicable) unless you pass
-``add_special_tokens=False``.
-
-This works exactly as before for batch of sentences or batch of pairs of sentences. You can encode a batch of sentences
-like this:
-
-.. code-block::
-
-    batch_sentences = [["Hello", "I'm", "a", "single", "sentence"],
-                       ["And", "another", "sentence"],
-                       ["And", "the", "very", "very", "last", "one"]]
-    encoded_inputs = tokenizer(batch_sentences, is_split_into_words=True)
-
-or a batch of pair sentences like this:
-
-.. code-block::
-
-    batch_of_second_sentences = [["I'm", "a", "sentence", "that", "goes", "with", "the", "first", "sentence"],
-                                 ["And", "I", "should", "be", "encoded", "with", "the", "second", "sentence"],
-                                 ["And", "I", "go", "with", "the", "very", "last", "one"]]
-    encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences, is_split_into_words=True)
-
-And you can add padding, truncation as well as directly return tensors like before:
-
-.. code-block::
-
-    ## PYTORCH CODE
-    batch = tokenizer(batch_sentences,
-                      batch_of_second_sentences,
-                      is_split_into_words=True,
-                      padding=True,
-                      truncation=True,
-                      return_tensors="pt")
-    ## TENSORFLOW CODE
-    batch = tokenizer(batch_sentences,
-                      batch_of_second_sentences,
-                      is_split_into_words=True,
-                      padding=True,
-                      truncation=True,
-                      return_tensors="tf")
+Preprocessing data
+=======================================================================================================================
+
+In this tutorial, we'll explore how to preprocess your data using 🤗 Transformers. The main tool for this is what we
+
+call a :doc:`tokenizer <main_classes/tokenizer>`. You can build one using the tokenizer class associated to the model
+you would like to use, or directly with the :class:`~transformers.AutoTokenizer` class.
+
+As we saw in the :doc:`quicktour </quicktour>`, the tokenizer will first split a given text in words (or part of words,
+punctuation symbols, etc.) usually called `tokens`. Then it will convert those `tokens` into numbers, to be able to
+build a tensor out of them and feed them to the model. It will also add any additional inputs the model might expect to
+work properly.
+
+.. note::
+
+    If you plan on using a pretrained model, it's important to use the associated pretrained tokenizer: it will split
+    the text you give it in tokens the same way for the pretraining corpus, and it will use the same correspondence
+    token to index (that we usually call a `vocab`) as during pretraining.
+
+To automatically download the vocab used during pretraining or fine-tuning a given model, you can use the 
+:func:`~transformers.AutoTokenizer.from_pretrained` method:
+
+.. code-block::
+
+    from transformers import AutoTokenizer
+    tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
+
+Base use
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A :class:`~transformers.PreTrainedTokenizer` has many methods, but the only one you need to remember for preprocessing
+is its ``__call__``: you just need to feed your sentence to your tokenizer object.
+
+.. code-block::
+
+    >>> encoded_input = tokenizer("Hello, I'm a single sentence!")
+    >>> print(encoded_input)
+    {'input_ids': [101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102], 
+     'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
+     'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
+
+This returns a dictionary string to list of ints.
+The `input_ids <glossary.html#input-ids>`__ are the indices corresponding to each token in our sentence. We will see
+below what the `attention_mask <glossary.html#attention-mask>`__ is used for and in
+:ref:`the next section <sentence-pairs>` the goal of `token_type_ids <glossary.html#token-type-ids>`__.
+
+The tokenizer can decode a list of token ids in a proper sentence:
+
+.. code-block::
+
+    >>> tokenizer.decode(encoded_input["input_ids"])
+    "[CLS] Hello, I'm a single sentence! [SEP]"
+
+As you can see, the tokenizer automatically added some special tokens that the model expect. Not all model need special
+tokens; for instance, if we had used` gtp2-medium` instead of `bert-base-cased` to create our tokenizer, we would have
+seen the same sentence as the original one here. You can disable this behavior (which is only advised if you have added
+those special tokens yourself) by passing ``add_special_tokens=False``.
+
+If you have several sentences you want to process, you can do this efficiently by sending them as a list to the
+tokenizer:
+
+.. code-block::
+
+    >>> batch_sentences = ["Hello I'm a single sentence",
+    ...                    "And another sentence",
+    ...                    "And the very very last one"]
+    >>> encoded_inputs = tokenizer(batch_sentences)
+    >>> print(encoded_inputs)
+    {'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
+                   [101, 1262, 1330, 5650, 102],
+                   [101, 1262, 1103, 1304, 1304, 1314, 1141, 102]],
+     'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0],
+                        [0, 0, 0, 0, 0],
+                        [0, 0, 0, 0, 0, 0, 0, 0]],
+     'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1],
+                        [1, 1, 1, 1, 1],
+                        [1, 1, 1, 1, 1, 1, 1, 1]]}
+
+We get back a dictionary once again, this time with values being list of list of ints.
+
+If the purpose of sending several sentences at a time to the tokenizer is to build a batch to feed the model, you will
+probably want:
+
+- To pad each sentence to the maximum length there is in your batch.
+- To truncate each sentence to the maximum length the model can accept (if applicable).
+- To return tensors.
+
+You can do all of this by using the following options when feeding your list of sentences to the tokenizer:
+
+.. code-block::
+
+    >>> ## PYTORCH CODE
+    >>> batch = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
+    >>> print(batch)
+    {'input_ids': tensor([[ 101, 8667,  146,  112,  182,  170, 1423, 5650,  102],
+                          [ 101, 1262, 1330, 5650,  102,    0,    0,    0,    0],
+                          [ 101, 1262, 1103, 1304, 1304, 1314, 1141,  102,    0]]),
+     'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
+                               [0, 0, 0, 0, 0, 0, 0, 0, 0],
+                               [0, 0, 0, 0, 0, 0, 0, 0, 0]]), 
+     'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
+                               [1, 1, 1, 1, 1, 0, 0, 0, 0],
+                               [1, 1, 1, 1, 1, 1, 1, 1, 0]])}
+    >>> ## TENSORFLOW CODE
+    >>> batch = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="tf")
+    >>> print(batch)
+    {'input_ids': tf.Tensor([[ 101, 8667,  146,  112,  182,  170, 1423, 5650,  102],
+                          [ 101, 1262, 1330, 5650,  102,    0,    0,    0,    0],
+                          [ 101, 1262, 1103, 1304, 1304, 1314, 1141,  102,    0]]),
+     'token_type_ids': tf.Tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
+                               [0, 0, 0, 0, 0, 0, 0, 0, 0],
+                               [0, 0, 0, 0, 0, 0, 0, 0, 0]]), 
+     'attention_mask': tf.Tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
+                               [1, 1, 1, 1, 1, 0, 0, 0, 0],
+                               [1, 1, 1, 1, 1, 1, 1, 1, 0]])}
+
+It returns a dictionary string to tensor. We can now see what the `attention_mask <glossary.html#attention-mask>`__ is
+all about: it points out which tokens the model should pay attention to and which ones it should not (because they
+represent padding in this case).
+
+
+Note that if your model does not have a maximum length associated to it, the command above will throw a warning. You
+can safely ignore it. You can also pass ``verbose=False`` to stop the tokenizer to throw those kinds of warnings.
+
+.. _sentence-pairs:
+
+Preprocessing pairs of sentences
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Sometimes you need to feed pair of sentences to your model. For instance, if you want to classify if two sentences in a
+pair are similar, or for question-answering models, which take a context and a question. For BERT models, the input is
+then represented like this: :obj:`[CLS] Sequence A [SEP] Sequence B [SEP]`
+
+You can encode a pair of sentences in the format expected by your model by supplying the two sentences as two arguments
+(not a list since a list of two sentences will be interpreted as a batch of two single sentences, as we saw before).
+This will once again return a dict string to list of ints:
+
+.. code-block::
+
+    >>> encoded_input = tokenizer("How old are you?", "I'm 6 years old")
+    >>> print(encoded_input)
+    {'input_ids': [101, 1731, 1385, 1132, 1128, 136, 102, 146, 112, 182, 127, 1201, 1385, 102], 
+     'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1], 
+     'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
+
+This shows us what the `token_type_ids <glossary.html#token-type-ids>`__ are for: they indicate to the model which part
+of the inputs correspond to the first sentence and which part corresponds to the second sentence. Note that
+`token_type_ids` are not required or handled by all models. By default, a tokenizer will only return the inputs that
+its associated model expects. You can force the return (or the non-return) of any of those special arguments by
+using ``return_input_ids`` or ``return_token_type_ids``.
+
+If we decode the token ids we obtained, we will see that the special tokens have been properly added.
+
+.. code-block::
+
+    >>> tokenizer.decode(encoded_input["input_ids"])
+    "[CLS] How old are you? [SEP] I'm 6 years old [SEP]"
+
+If you have a list of pairs of sequences you want to process, you should feed them as two lists to your tokenizer: the
+list of first sentences and the list of second sentences:
+
+.. code-block::
+
+    >>> batch_sentences = ["Hello I'm a single sentence",
+    ...                    "And another sentence",
+    ...                    "And the very very last one"]
+    >>> batch_of_second_sentences = ["I'm a sentence that goes with the first sentence",
+    ...                              "And I should be encoded with the second sentence",
+    ...                              "And I go with the very last one"]
+    >>> encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences)
+    >>> print(encoded_inputs)
+    {'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102, 146, 112, 182, 170, 5650, 1115, 2947, 1114, 1103, 1148, 5650, 102], 
+                   [101, 1262, 1330, 5650, 102, 1262, 146, 1431, 1129, 12544, 1114, 1103, 1248, 5650, 102], 
+                   [101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 1262, 146, 1301, 1114, 1103, 1304, 1314, 1141, 102]], 
+    'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
+                       [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
+                       [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 
+    'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
+                       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 
+                       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
+
+As we can see, it returns a dictionary with the values being list of lists of ints.
+
+To double-check what is fed to the model, we can decode each list in `input_ids` one by one:
+
+.. code-block::
+
+    >>> for ids in encoded_inputs["input_ids"]:
+    >>>     print(tokenizer.decode(ids))
+    [CLS] Hello I'm a single sentence [SEP] I'm a sentence that goes with the first sentence [SEP]
+    [CLS] And another sentence [SEP] And I should be encoded with the second sentence [SEP]
+    [CLS] And the very very last one [SEP] And I go with the very last one [SEP]
+
+Once again, you can automatically pad your inputs to the maximum sentence length in the batch, truncate to the maximum
+length the model can accept and return tensors directly with the following:
+
+.. code-block::
+
+    ## PYTORCH CODE
+    batch = tokenizer(batch_sentences, batch_of_second_sentences, padding=True, truncation=True, return_tensors="pt")
+    ## TENSORFLOW CODE
+    batch = tokenizer(batch_sentences, batch_of_second_sentences, padding=True, truncation=True, return_tensors="tf")
+
+Everything you always wanted to know about padding and truncation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We have seen the commands that will work for most cases (pad your batch to the length of the maximum sentence and
+
+truncate to the maximum length the mode can accept). However, the API supports more strategies if you need them. The
+three arguments you need to know for this are :obj:`padding`, :obj:`truncation` and :obj:`max_length`.
+
+- :obj:`padding` controls the padding. It can be a boolean or a string which should be:
+
+    - :obj:`True` or :obj:`'longest'` to pad to the longest sequence in the batch (doing no padding if you only provide
+      a single sequence).
+    - :obj:`'max_length'` to pad to a length specified by the :obj:`max_length` argument or the maximum length accepted
+      by the model if no :obj:`max_length` is provided (``max_length=None``). If you only provide a single sequence,
+      padding will still be applied to it. 
+    - :obj:`False` or :obj:`'do_not_pad'` to not pad the sequences. As we have seen before, this is the default
+      behavior.
+
+- :obj:`truncation` controls the truncation. It can be a boolean or a string which should be:
+
+    - :obj:`True` or :obj:`'only_first'` truncate to a maximum length specified by the :obj:`max_length` argument or
+      the maximum length accepted by the model if no :obj:`max_length` is provided (``max_length=None``). This will
+      only truncate the first sentence of a pair if a pair of sequence (or a batch of pairs of sequences) is provided.
+    - :obj:`'only_second'` truncate to a maximum length specified by the :obj:`max_length` argument or the maximum
+      length accepted by the model if no :obj:`max_length` is provided (``max_length=None``). This will only truncate
+      the second sentence of a pair if a pair of sequence (or a batch of pairs of sequences) is provided.
+    - :obj:`'longest_first'` truncate to a maximum length specified by the :obj:`max_length` argument or the maximum
+      length accepted by the model if no :obj:`max_length` is provided (``max_length=None``). This will truncate token
+      by token, removing a token from the longest sequence in the pair until the proper length is reached.
+    - :obj:`False` or :obj:`'do_not_truncate'` to not truncate the sequences. As we have seen before, this is the
+      default behavior.
+
+- :obj:`max_length` to control the length of the padding/truncation. It can be an integer or :obj:`None`, in which case
+  it will default to the maximum length the model can accept. If the model has no specific maximum input length,
+  truncation/padding to :obj:`max_length` is deactivated.
+
+Here is a table summarizing the recommend way to setup padding and truncation. If you use pair of inputs sequence in
+any of the following examples, you can replace :obj:`truncation=True` by a :obj:`STRATEGY` selected in 
+:obj:`['only_first', 'only_second', 'longest_first']`, i.e. :obj:`truncation='only_second'` or
+:obj:`truncation= 'longest_first'` to control how both sequence in the pair are truncated as detailed before.
+
++--------------------------------------+-----------------------------------+---------------------------------------------------------------------------------------------+
+| Truncation                           | Padding                           | Instruction                                                                                 |
++======================================+===================================+=============================================================================================+
+| no truncation                        | no padding                        | :obj:`tokenizer(batch_sentences)`                                                           |
+|                                      +-----------------------------------+---------------------------------------------------------------------------------------------+
+|                                      | padding to max sequence in batch  | :obj:`tokenizer(batch_sentences, padding=True)` or                                          |
+|                                      |                                   | :obj:`tokenizer(batch_sentences, padding='longest')`                                        |
+|                                      +-----------------------------------+---------------------------------------------------------------------------------------------+
+|                                      | padding to max model input length | :obj:`tokenizer(batch_sentences, padding='max_length')`                                     |
+|                                      +-----------------------------------+---------------------------------------------------------------------------------------------+
+|                                      | padding to specific length        | :obj:`tokenizer(batch_sentences, padding='max_length', max_length=42)`                      |
++--------------------------------------+-----------------------------------+---------------------------------------------------------------------------------------------+
+| truncation to max model input length | no padding                        | :obj:`tokenizer(batch_sentences, truncation=True)` or                                       |
+|                                      |                                   | :obj:`tokenizer(batch_sentences, truncation=STRATEGY)`                                      |
+|                                      +-----------------------------------+---------------------------------------------------------------------------------------------+
+|                                      | padding to max sequence in batch  | :obj:`tokenizer(batch_sentences, padding=True, truncation=True)` or                         |
+|                                      |                                   | :obj:`tokenizer(batch_sentences, padding=True, truncation=STRATEGY)`                        |
+|                                      +-----------------------------------+---------------------------------------------------------------------------------------------+
+|                                      | padding to max model input length | :obj:`tokenizer(batch_sentences, padding='max_length', truncation=True)` or                 |
+|                                      |                                   | :obj:`tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY)`                |
+|                                      +-----------------------------------+---------------------------------------------------------------------------------------------+
+|                                      | padding to specific length        | Not possible                                                                                |
++--------------------------------------+-----------------------------------+---------------------------------------------------------------------------------------------+
+| truncation to specific length        | no padding                        | :obj:`tokenizer(batch_sentences, truncation=True, max_length=42)` or                        |
+|                                      |                                   | :obj:`tokenizer(batch_sentences, truncation=STRATEGY, max_length=42)`                       |
+|                                      +-----------------------------------+---------------------------------------------------------------------------------------------+
+|                                      | padding to max sequence in batch  | :obj:`tokenizer(batch_sentences, padding=True, truncation=True, max_length=42)` or          |
+|                                      |                                   | :obj:`tokenizer(batch_sentences, padding=True, truncation=STRATEGY, max_length=42)`         |
+|                                      +-----------------------------------+---------------------------------------------------------------------------------------------+
+|                                      | padding to max model input length | Not possible                                                                                |
+|                                      +-----------------------------------+---------------------------------------------------------------------------------------------+
+|                                      | padding to specific length        | :obj:`tokenizer(batch_sentences, padding='max_length', truncation=True, max_length=42)` or  |
+|                                      |                                   | :obj:`tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY, max_length=42)` |
++--------------------------------------+-----------------------------------+---------------------------------------------------------------------------------------------+
+
+Pre-tokenized inputs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The tokenizer also accept pre-tokenized inputs. This is particularly useful when you want to compute labels and extract
+predictions in `named entity recognition (NER) <https://en.wikipedia.org/wiki/Named-entity_recognition>`__ or
+`part-of-speech tagging (POS tagging) <https://en.wikipedia.org/wiki/Part-of-speech_tagging>`__.
+
+.. warning::
+
+    Pre-tokenized does not mean your inputs are already tokenized (you wouldn't need to pass them though the tokenizer
+    if that was the case) but just split into words (which is often the first step in subword tokenization algorithms
+    like BPE).
+
+If you want to use pre-tokenized inputs, just set :obj:`is_split_into_words=True` when passing your inputs to the
+tokenizer. For instance, we have:
+
+.. code-block::
+
+    >>> encoded_input = tokenizer(["Hello", "I'm", "a", "single", "sentence"], is_split_into_words=True)
+    >>> print(encoded_input)
+    {'input_ids': [101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
+     'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 
+     'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
+
+Note that the tokenizer still adds the ids of special tokens (if applicable) unless you pass
+``add_special_tokens=False``.
+
+This works exactly as before for batch of sentences or batch of pairs of sentences. You can encode a batch of sentences
+like this:
+
+.. code-block::
+
+    batch_sentences = [["Hello", "I'm", "a", "single", "sentence"],
+                       ["And", "another", "sentence"],
+                       ["And", "the", "very", "very", "last", "one"]]
+    encoded_inputs = tokenizer(batch_sentences, is_split_into_words=True)
+
+or a batch of pair sentences like this:
+
+.. code-block::
+
+    batch_of_second_sentences = [["I'm", "a", "sentence", "that", "goes", "with", "the", "first", "sentence"],
+                                 ["And", "I", "should", "be", "encoded", "with", "the", "second", "sentence"],
+                                 ["And", "I", "go", "with", "the", "very", "last", "one"]]
+    encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences, is_split_into_words=True)
+
+And you can add padding, truncation as well as directly return tensors like before:
+
+.. code-block::
+
+    ## PYTORCH CODE
+    batch = tokenizer(batch_sentences,
+                      batch_of_second_sentences,
+                      is_split_into_words=True,
+                      padding=True,
+                      truncation=True,
+                      return_tensors="pt")
+    ## TENSORFLOW CODE
+    batch = tokenizer(batch_sentences,
+                      batch_of_second_sentences,
+                      is_split_into_words=True,
+                      padding=True,
+                      truncation=True,
+                      return_tensors="tf")
diff --git a/docs/source/pretrained_models.rst b/docs/source/pretrained_models.rst
index 85f3a14a94..3b48f12025 100644
--- a/docs/source/pretrained_models.rst
+++ b/docs/source/pretrained_models.rst
@@ -1,5 +1,5 @@
 Pretrained models
-================================================
+=======================================================================================================================
 
 Here is the full list of the currently provided pretrained models together with a short presentation of each model.
 
diff --git a/docs/source/quicktour.rst b/docs/source/quicktour.rst
index 37a9060187..bb6f3dba8c 100644
--- a/docs/source/quicktour.rst
+++ b/docs/source/quicktour.rst
@@ -1,5 +1,5 @@
 Quick tour
-==========
+=======================================================================================================================
 
 Let's have a quick look at the 🤗 Transformers library features. The library downloads pretrained models for
 Natural Language Understanding (NLU) tasks, such as analyzing the sentiment of a text, and Natural Language Generation (NLG),
@@ -14,7 +14,7 @@ will dig a little bit more and see how the library gives you access to those mod
     not, the code is expected to work for both backends without any change needed.
 
 Getting started on a task with a pipeline
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 The easiest way to use a pretrained model on a given task is to use :func:`~transformers.pipeline`. 🤗 Transformers
 provides the following tasks out of the box:
@@ -123,7 +123,7 @@ to share your fine-tuned model on the hub with the community, using :doc:`this t
 .. _pretrained-model:
 
 Under the hood: pretrained models
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Let's now see what happens beneath the hood when using those pipelines. As we saw, the model and tokenizer are created
 using the :obj:`from_pretrained` method:
@@ -142,7 +142,7 @@ using the :obj:`from_pretrained` method:
     >>> tokenizer = AutoTokenizer.from_pretrained(model_name)
 
 Using the tokenizer
-^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 We mentioned the tokenizer is responsible for the preprocessing of your texts. First, it will split a given text in
 words (or part of words, punctuation symbols, etc.) usually called `tokens`. There are multiple rules that can govern
@@ -210,7 +210,7 @@ padding token the model was pretrained with. The attention mask is also adapted
 You can learn more about tokenizers :doc:`here <preprocessing>`.
 
 Using the model
-^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Once your input has been preprocessed by the tokenizer, you can send it directly to the model. As we mentioned, it will
 contain all the relevant information the model needs. If you're using a TensorFlow model, you can pass the
@@ -330,7 +330,7 @@ Lastly, you can also ask the model to return all hidden states and all attention
     >>> all_hidden_states, all_attentions = tf_outputs[-2:]
 
 Accessing the code
-^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 The :obj:`AutoModel` and :obj:`AutoTokenizer` classes are just shortcuts that will automatically work with any
 pretrained model. Behind the scenes, the library has one model class per combination of architecture plus class, so the
@@ -358,7 +358,7 @@ without the auto magic:
     >>> tokenizer = DistilBertTokenizer.from_pretrained(model_name)
 
 Customizing the model
-^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 If you want to change how the model itself is built, you can define your custom configuration class. Each architecture
 comes with its own relevant configuration (in the case of DistilBERT, :class:`~transformers.DistilBertConfig`) which
diff --git a/docs/source/serialization.rst b/docs/source/serialization.rst
index 88e6344fba..ea14a5ac39 100644
--- a/docs/source/serialization.rst
+++ b/docs/source/serialization.rst
@@ -1,9 +1,9 @@
-**********************************************
+***********************************************************************************************************************
 Exporting transformers models
-**********************************************
+***********************************************************************************************************************
 
 ONNX / ONNXRuntime
-==============================================
+=======================================================================================================================
 
 Projects `ONNX (Open Neural Network eXchange) <http://onnx.ai>`_ and `ONNXRuntime (ORT) <https://microsoft.github.io/onnxruntime/>`_ are part of an effort from leading industries in the AI field
 to provide a unified and community-driven format to store and, by extension, efficiently execute neural network leveraging a variety
@@ -42,7 +42,7 @@ Also, the conversion tool supports different options which let you tune the beha
 
 
 Optimizations
-------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 ONNXRuntime includes some transformers-specific transformations to leverage optimized operations in the graph.
 Below are some of the operators which can be enabled to speed up inference through ONNXRuntime (*see note below*):
@@ -68,7 +68,7 @@ Optimizations can then be enabled when loading the model through ONNX runtime fo
     For more information about the optimizations enabled by ONNXRuntime, please have a look at the (`ONNXRuntime Github <https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/transformers>`_)
 
 Quantization
-------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 ONNX exporter supports generating a quantized version of the model to allow efficient inference.
 
@@ -116,7 +116,7 @@ Example of quantized BERT model export:
 
 
 TorchScript
-=======================================
+=======================================================================================================================
 
 .. note::
     This is the very beginning of our experiments with TorchScript and we are still exploring its capabilities
@@ -141,10 +141,10 @@ These necessities imply several things developers should be careful about. These
 
 
 Implications
-------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 TorchScript flag and tied weights
-------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 This flag is necessary because most of the language models in this repository have tied weights between their
 ``Embedding`` layer and their ``Decoding`` layer. TorchScript does not allow the export of models that have tied weights, therefore
 it is necessary to untie and clone the weights beforehand.
@@ -157,7 +157,7 @@ This is not the case for models that do not have a Language Model head, as those
 can be safely exported without the ``torchscript`` flag.
 
 Dummy inputs and standard lengths
-------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 The dummy inputs are used to do a model forward pass. While the inputs' values are propagating through the layers,
 Pytorch keeps track of the different operations executed on each tensor. These recorded operations are then used
@@ -178,12 +178,12 @@ It is recommended to be careful of the total number of operations done on each i
 when exporting varying sequence-length models.
 
 Using TorchScript in Python
--------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 Below is an example, showing how to save, load models as well as how to use the trace for inference.
 
 Saving a model
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 This snippet shows how to use TorchScript to export a ``BertModel``. Here the ``BertModel`` is instantiated
 according to a ``BertConfig`` class and then saved to disk under the filename ``traced_bert.pt``
@@ -229,7 +229,7 @@ according to a ``BertConfig`` class and then saved to disk under the filename ``
     torch.jit.save(traced_model, "traced_bert.pt")
 
 Loading a model
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 This snippet shows how to load the ``BertModel`` that was previously saved to disk under the name ``traced_bert.pt``.
 We are re-using the previously initialised ``dummy_input``.
@@ -242,7 +242,7 @@ We are re-using the previously initialised ``dummy_input``.
     all_encoder_layers, pooled_output = loaded_model(*dummy_input)
 
 Using a traced model for inference
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Using the traced model for inference is as simple as using its ``__call__`` dunder method:
 
diff --git a/docs/source/task_summary.rst b/docs/source/task_summary.rst
index 1926c22aaf..2309f04d88 100644
--- a/docs/source/task_summary.rst
+++ b/docs/source/task_summary.rst
@@ -1,5 +1,5 @@
 Summary of the tasks
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 This page shows the most frequent use-cases when using the library. The models available allow for many different
 configurations and a great versatility in use-cases. The most simple ones are presented here, showcasing usage
@@ -38,7 +38,7 @@ Both approaches are showcased here.
     This would produce random output.
 
 Sequence Classification
---------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 Sequence classification is the task of classifying sequences according to a given number of classes. An example
 of sequence classification is the GLUE dataset, which is entirely based on that task. If you would like to fine-tune
@@ -152,7 +152,7 @@ of each other. The process is the following:
     is paraphrase: 6%
 
 Extractive Question Answering
-----------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
 question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
@@ -297,7 +297,7 @@ Here is an example of question answering using a model and a tokenizer. The proc
 
 
 Language Modeling
-----------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 Language modeling is the task of fitting a model to a corpus, which can be domain specific. All popular transformer-based
 models are trained using a variant of language modeling, e.g. BERT with masked language modeling, GPT-2 with
@@ -308,7 +308,7 @@ domain-specific: using a language model trained over a very large corpus, and th
 or on scientific papers e.g. `LysandreJik/arxiv-nlp <https://huggingface.co/lysandre/arxiv-nlp>`__.
 
 Masked Language Modeling
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Masked language modeling is the task of masking tokens in a sequence with a masking token, and prompting the model to
 fill that mask with an appropriate token. This allows the model to attend to both the right context (tokens on the
@@ -421,7 +421,7 @@ This prints five sequences, with the top 5 tokens predicted by the model:
 
 
 Causal Language Modeling
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Causal language modeling is the task of predicting the token following a sequence of tokens. In this situation, the
 model only attends to the left context (tokens on the left of the mask). Such a training is particularly interesting
@@ -493,7 +493,7 @@ This outputs a (hopefully) coherent next token following the original sequence,
 In the next section, we show how this functionality is leveraged in :func:`~transformers.PreTrainedModel.generate` to generate multiple tokens up to a user-defined length.
 
 Text Generation
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 In text generation (*a.k.a* *open-ended text generation*) the goal is to create a coherent portion of text that is a continuation from the given context. The following example shows how *GPT-2* can be used in pipelines to generate text. As a default all models apply *Top-K* sampling when used in pipelines, as configured in their respective configurations (see `gpt-2 config <https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config.json>`__ for example).
 
@@ -576,7 +576,7 @@ For more information on how to apply different decoding strategies for text gene
 
 
 Named Entity Recognition
-----------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 Named Entity Recognition (NER) is the task of classifying tokens according to a class, for example, identifying a
 token as a person, an organisation or a location.
@@ -723,7 +723,7 @@ following array should be the output:
     [('[CLS]', 'O'), ('Hu', 'I-ORG'), ('##gging', 'I-ORG'), ('Face', 'I-ORG'), ('Inc', 'I-ORG'), ('.', 'O'), ('is', 'O'), ('a', 'O'), ('company', 'O'), ('based', 'O'), ('in', 'O'), ('New', 'I-LOC'), ('York', 'I-LOC'), ('City', 'I-LOC'), ('.', 'O'), ('Its', 'O'), ('headquarters', 'O'), ('are', 'O'), ('in', 'O'), ('D', 'I-LOC'), ('##UM', 'I-LOC'), ('##BO', 'I-LOC'), (',', 'O'), ('therefore', 'O'), ('very', 'O'), ('##c', 'O'), ('##lose', 'O'), ('to', 'O'), ('the', 'O'), ('Manhattan', 'I-LOC'), ('Bridge', 'I-LOC'), ('.', 'O'), ('[SEP]', 'O')]
 
 Summarization
-----------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 Summarization is the task of summarizing a document or an article into a shorter text.
 
@@ -798,7 +798,7 @@ In this example we use Google`s T5 model. Even though it was pre-trained only on
     >>> outputs = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
 
 Translation
-----------------------------------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 Translation is the task of translating a text from one language to another.
 
diff --git a/docs/source/testing.rst b/docs/source/testing.rst
index 76fe47c40b..82b88edaf2 100644
--- a/docs/source/testing.rst
+++ b/docs/source/testing.rst
@@ -1,5 +1,5 @@
 Testing
-==========
+=======================================================================================================================
 
 
 Let's take a look at how 🤗 Transformer models are tested and how you can write new tests and improve the existing ones.
@@ -10,7 +10,7 @@ There are 2 test suites in the repository:
 2. ``examples`` -- tests primarily for various applications that aren't part of the API
 
 How transformers are tested
----------------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 1. Once a PR is submitted it gets tested with 9 CircleCi jobs. Every new commit to that PR gets retested. These jobs are defined in this `config file <https://github.com/huggingface/transformers/blob/master/.circleci/config.yml>`__, so that if needed you can reproduce the same environment on your machine.
    
@@ -34,14 +34,14 @@ How transformers are tested
 
 
 Running tests
--------------
+-----------------------------------------------------------------------------------------------------------------------
 
 
 
 
 
 Choosing which tests to run
-~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 This document goes into many details of how tests can be run. If after reading everything, you need even more details you will find them `here <https://docs.pytest.org/en/latest/usage.html>`__.
 
@@ -75,7 +75,7 @@ which tells pytest to:
 
 
 Getting the list of all tests
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 All tests of the test suite:
 
@@ -92,7 +92,7 @@ All tests of a given test file:
 
    
 Run a specific test module
-~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 To run an individual test module:
 
@@ -102,7 +102,7 @@ To run an individual test module:
    
 
 Run specific tests
-~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Since unittest is used inside most of the tests, to run specific subtests you need to know the name of the unittest class containing those tests. For example, it could be:
 
@@ -156,7 +156,7 @@ And you can combine the two patterns in one:
 
 
 Run only modified tests
-~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 You can run the tests related to the unstaged files or the current branch (according to Git) by using `pytest-picked <https://github.com/anapaulagomes/pytest-picked>`__. This is a great way of quickly testing your changes didn't break anything, since it won't run the tests related to files you didn't touch.
 
@@ -172,7 +172,7 @@ All tests will be run from files and folders which are modified, but not
 yet committed.
 
 Automatically rerun failed tests on source modification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 `pytest-xdist <https://github.com/pytest-dev/pytest-xdist>`__ provides a
 very useful feature of detecting all failed tests, and then waiting for
@@ -212,7 +212,7 @@ alternative implementation of this functionality.
 
 
 Skip a test module
-~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 If you want to run all test modules, except a few you can exclude them by giving an explicit list of tests to run. For example, to run all except ``test_modeling_*.py`` tests:
 
@@ -222,7 +222,7 @@ If you want to run all test modules, except a few you can exclude them by giving
 
 
 Clearing state
-~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 CI builds and when isolation is important (against speed), cache should
 be cleared:
@@ -232,7 +232,7 @@ be cleared:
     pytest --cache-clear tests
 
 Running tests in parallel
-~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 As mentioned earlier ``make test`` runs tests in parallel via ``pytest-xdist`` plugin (``-n X`` argument, e.g. ``-n 2`` to run 2 parallel jobs).
 
@@ -246,7 +246,7 @@ tests in the same order, which should help with then somehow reducing
 that failing sequence to a minimum.
 
 Test order and repetition
-~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 It's good to repeat the tests several times, in sequence, randomly, or
 in sets, to detect any potential inter-dependency and state-related bugs
@@ -255,7 +255,7 @@ detect some problems that get uncovered by randomness of DL.
 
 
 Repeat tests
-^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 * `pytest-flakefinder <https://github.com/dropbox/pytest-flakefinder>`__:
 
@@ -277,7 +277,7 @@ And then run every test multiple times (50 by default):
 
 
 Run tests in a random order
-^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. code-block:: bash
 
@@ -333,10 +333,10 @@ details please see its `documentation <https://github.com/jbasko/pytest-random-o
 Another randomization alternative is: ``pytest-randomly`` <https://github.com/pytest-dev/pytest-randomly>`__. This module has a very similar functionality/interface, but it doesn't have the bucket modes available in ``pytest-random-order``. It has the same problem of imposing itself once installed.
 
 Look and feel variations
-~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 pytest-sugar
-^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 `pytest-sugar <https://github.com/Frozenball/pytest-sugar>`__ is a
 plugin that improves the look-n-feel, adds a progressbar, and show tests
@@ -358,7 +358,7 @@ or uninstall it.
 
 
 Report each sub-test name and its progress
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 For a single or a group of tests via ``pytest`` (after
 ``pip install pytest-pspec``):
@@ -370,7 +370,7 @@ For a single or a group of tests via ``pytest`` (after
 
 
 Instantly shows failed tests
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 `pytest-instafail <https://github.com/pytest-dev/pytest-instafail>`__
 shows failures and errors instantly instead of waiting until the end of
@@ -385,7 +385,7 @@ test session.
     pytest --instafail
 
 To GPU or not to GPU
-~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 On a GPU-enabled setup, to test in CPU-only mode add ``CUDA_VISIBLE_DEVICES=""``:
 
@@ -403,14 +403,14 @@ This is handy when you want to run different tasks on different GPUs.
     
 And we have these decorators that require the condition described by the marker.
 
-```
+``
 @require_torch
 @require_tf
 @require_multigpu
 @require_non_multigpu
 @require_torch_tpu
 @require_torch_and_cuda
-```
+``
 
 Some decorators like ``@parametrized`` rewrite test names, therefore ``@require_*`` skip decorators have to be listed last for them to work correctly. Here is an example of the correct usage:
 
@@ -437,7 +437,7 @@ Inside tests:
 
 
 Output capture
-~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 During test execution any output sent to ``stdout`` and ``stderr`` is
 captured. If a test or a setup method fails, its according captured
@@ -458,7 +458,7 @@ To send test results to JUnit format output:
 
 
 Color control
-~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 To have no color (e.g., yellow on white background is not readable):
 
@@ -469,7 +469,7 @@ To have no color (e.g., yellow on white background is not readable):
 
 
 Sending test report to online pastebin service
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Creating a URL for each test failure:
 
@@ -490,7 +490,7 @@ Creating a URL for a whole test session log:
 
 
 Writing tests
--------------
+-----------------------------------------------------------------------------------------------------------------------
 
 🤗 transformers tests are based on ``unittest``, but run by ``pytest``, so most of the time features from both systems can be used.
 
@@ -498,7 +498,7 @@ You can read `here <https://docs.pytest.org/en/stable/unittest.html>`__ which fe
 
 
 Parametrization
-~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Often, there is a need to run the same test multiple times, but with different arguments. It could be done from within the test, but then there is no way of running that test for just one set of arguments.
 
@@ -596,7 +596,7 @@ as in the previous example.
     
 
 Temporary files and directories
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Using unique temporary files and directories are essential for parallel test running, so that the tests won't overwrite each other's data. Also we want to get the temp files and directories removed at the end of each test that created them. Therefore, using packages like ``tempfile``, which address these needs is essential.
 
@@ -646,7 +646,7 @@ In this and all the following scenarios the temporary directory will be auto-rem
 
 
 Skipping tests
-~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 This is useful when a bug is found and a new test is written, yet the
 bug is not fixed yet. In order to be able to commit it to the main
@@ -673,7 +673,7 @@ causes some bad state that will affect other tests, do not use
 ``xfail``.
 
 Implementation
-^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 - Here is how to skip whole test unconditionally:
 
@@ -749,7 +749,7 @@ or skip the whole module:
 More details, example and ways are `here <https://docs.pytest.org/en/latest/skipping.html>`__.
 
 Custom markers
-~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 * Slow tests
 
@@ -776,7 +776,7 @@ Some decorators like ``@parametrized`` rewrite test names, therefore ``@slow`` a
     def test_integration_foo():
 
 Testing the stdout/stderr output
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 In order to test functions that write to ``stdout`` and/or ``stderr``,
 the test can access those streams using the ``pytest``'s `capsys
@@ -885,7 +885,7 @@ If you need to capture both streams at once, use the parent
 
 
 Capturing logger stream
-~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 If you need to validate the output of a logger, you can use :obj:`CaptureLogger`:
 
@@ -903,7 +903,7 @@ If you need to validate the output of a logger, you can use :obj:`CaptureLogger`
 
 
 Testing with environment variables
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 If you want to test the impact of environment variables for a specific test you can use a helper decorator ``transformers.testing_utils.mockenv``
 
@@ -917,7 +917,7 @@ If you want to test the impact of environment variables for a specific test you
 
 
 Getting reproducible results
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 In some situations you may want to remove randomness for your tests. To
 get identical reproducable results set, you will need to fix the seed:
@@ -944,7 +944,7 @@ get identical reproducable results set, you will need to fix the seed:
     tf.random.set_seed(seed)
 
 Debugging tests
-~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 To start a debugger at the point of the warning, do this:
 
diff --git a/docs/source/tokenizer_summary.rst b/docs/source/tokenizer_summary.rst
index 72b322a32c..24d5ef6671 100644
--- a/docs/source/tokenizer_summary.rst
+++ b/docs/source/tokenizer_summary.rst
@@ -1,243 +1,243 @@
-Tokenizer summary
------------------
-
-In this page, we will have a closer look at tokenization. As we saw in
-:doc:`the preprocessing tutorial <preprocessing>`, tokenizing a text is splitting it into words or subwords, which then
-are converted to ids. The second part is pretty straightforward, here we will focus on the first part. More
-specifically, we will look at the three main different kinds of tokenizers used in 🤗 Transformers:
-:ref:`Byte-Pair Encoding (BPE) <byte-pair-encoding>`, :ref:`WordPiece <wordpiece>` and
-:ref:`SentencePiece <sentencepiece>`, and provide examples of models using each of those.
-
-Note that on each model page, you can look at the documentation of the associated tokenizer to know which of those
-algorithms the pretrained model used. For instance, if we look at :class:`~transformers.BertTokenizer`, we can see it's
-using :ref:`WordPiece <wordpiece>`.
-
-Introduction to tokenization
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Splitting a text in smaller chunks is a task that's harder than it looks, and there are multiple ways of doing it. For
-instance, let's look at the sentence "Don't you love 🤗 Transformers? We sure do." A first simple way of tokenizing
-this text is just to split it by spaces, which would give:
-
-::
-
-    ["Don't", "you", "love", "🤗", "Transformers?", "We", "sure", "do."]
-
-This is a nice first step, but if we look at the tokens "Transformers?" or "do.", we can see we can do better. Those
-will be different than the tokens "Transformers" and "do" for our model, so we should probably take the punctuation
-into account. This would give:
-
-::
-
-    ["Don", "'", "t", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]
-
-which is better already. One thing that is annoying though is how it dealt with "Don't". "Don't" stands for do not, so
-it should probably be better tokenized as ``["Do", "n't"]``. This is where things start getting more complicated, and
-part of the reason each kind of model has its own tokenizer class. Depending on the rules we apply to split our texts
-into tokens, we'll get different tokenized versions of the same text. And of course, a given pretrained model won't
-perform properly if you don't use the exact same rules as the persons who pretrained it.
-
-`spaCy <https://spacy.io/>`__ and `Moses <http://www.statmt.org/moses/?n=Development.GetStarted>`__ are two popular
-rule-based tokenizers. On the text above, they'd output something like:
-
-::
-
-    ["Do", "n't", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]
-
-Space/punctuation-tokenization and rule-based tokenization are both examples of word tokenization, which is splitting a
-sentence into words. While it's the most intuitive way to separate texts in smaller chunks, it can have a problem when
-you have a huge corpus: it usually yields a very big vocabulary (the set of all unique tokens used).
-:doc:`Transformer XL <model_doc/transformerxl>` for instance uses space/punctuation-tokenization, and has a vocabulary
-size of 267,735!
-
-A huge vocabulary size means a huge embedding matrix at the start of the model, which will cause memory problems.
-TransformerXL deals with it by using a special kind of embeddings called adaptive embeddings, but in general,
-transformers models rarely have a vocabulary size greater than 50,000, especially if they are trained on a single
-language.
-
-So if tokenizing on words is unsatisfactory, we could go on the opposite direction and simply tokenize on characters.
-While it's very simple and would save a lot of memory, this doesn't allow the model to learn representations of texts
-as meaningful as when using a word tokenization, leading to a loss of performance. So to get the best of both worlds,
-all transformers models use a hybrid between word-level and character-level tokenization called subword tokenization.
-
-Subword tokenization
-^^^^^^^^^^^^^^^^^^^^
-
-Subword tokenization algorithms rely on the principle that most common words should be left as is, but rare words
-should be decomposed in meaningful subword units. For instance "annoyingly" might be considered a rare word and
-decomposed as "annoying" and "ly". This is especially useful in agglutinative languages such as Turkish, where you can
-form (almost) arbitrarily long complex words by stringing together some subwords.
-
-This allows the model to keep a reasonable vocabulary while still learning useful representations for common words or
-subwords. This also enables the model to process words it has never seen before, by decomposing them into
-subwords it knows. For instance, the base :class:`~transformers.BertTokenizer` will tokenize "I have a new GPU!" like
-this:
-
-.. code-block::
-
-    >>> from transformers import BertTokenizer
-    >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
-    >>> tokenizer.tokenize("I have a new GPU!")
-    ['i', 'have', 'a', 'new', 'gp', '##u', '!']
-
-Since we are considering the uncased model, the sentence was lowercased first. Then all the words were present in the
-vocabulary of the tokenizer, except for "gpu", so the tokenizer split it in subwords it knows: "gp" and "##u". The "##"
-means that the rest of the token should be attached to the previous one, without space (for when we need to decode
-predictions and reverse the tokenization).
-
-Another example is when we use the base :class:`~transformers.XLNetTokenizer` to tokenize our previous text:
-
-.. code-block::
-
-    >>> from transformers import XLNetTokenizer
-    >>> tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
-    >>> tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.")
-    ['▁Don', "'", 't', '▁you', '▁love', '▁', '🤗', '▁', 'Transform', 'ers', '?', '▁We', '▁sure', '▁do', '.']
-
-We'll get back to the meaning of those '▁' when we look at :ref:`SentencePiece <sentencepiece>` but you can see
-Transformers has been split into "Transform" and "ers".
-
-Let's now look at how the different subword tokenization algorithms work. Note that they all rely on some form of
-training which is usually done on the corpus the corresponding model will be trained on.
-
-.. _byte-pair-encoding:
-
-Byte-Pair Encoding
-~~~~~~~~~~~~~~~~~~
-
-Byte-Pair Encoding was introduced in `this paper <https://arxiv.org/abs/1508.07909>`__. It relies on a pretokenizer
-splitting the training data into words, which can be a simple space tokenization
-(:doc:`GPT-2 <model_doc/gpt2>` and :doc:`Roberta <model_doc/roberta>` uses this for instance) or a rule-based tokenizer
-(:doc:`XLM <model_doc/xlm>` use Moses for most languages, as does :doc:`FlauBERT <model_doc/flaubert>`),
-
-:doc:`GPT <model_doc/gpt>` uses Spacy and ftfy, and counts the frequency of each word in the training corpus.
-
-It then begins from the list of all characters, and will learn merge rules to form a new token from two symbols in the
-vocabulary until it has learned a vocabulary of the desired size (this is a hyperparameter to pick).
-
-Let's say that after the pre-tokenization we have the following words (the number indicating the frequency of each
-word):
-
-::
-
-    ('hug', 10), ('pug', 5), ('pun', 12), ('bun', 4), ('hugs', 5)
-
-Then the base vocabulary is ['b', 'g', 'h', 'n', 'p', 's', 'u'] and all our words are first split by character:
-
-::
-
-    ('h' 'u' 'g', 10), ('p' 'u' 'g', 5), ('p' 'u' 'n', 12), ('b' 'u' 'n', 4), ('h' 'u' 'g' 's', 5)
-
-We then take each pair of symbols and look at the most frequent. For instance 'hu' is present `10 + 5 = 15` times (10
-times in the 10 occurrences of 'hug', 5 times in the 5 occurrences of 'hugs'). The most frequent here is 'ug', present
-`10 + 5 + 5 = 20` times in total. So the first merge rule the tokenizer learns is to group all 'u' and 'g' together
-then it adds 'ug' to the vocabulary. Our corpus then becomes
-
-::
-
-    ('h' 'ug', 10), ('p' 'ug', 5), ('p' 'u' 'n', 12), ('b' 'u' 'n', 4), ('h' 'ug' 's', 5)
-
-and we continue by looking at the next most common pair of symbols. It's 'un', present 16 times, so we merge those two
-and add 'un' to the vocabulary. Then it's 'hug' (as 'h' + 'ug'), present 15 times, so we merge those two and add 'hug'
-to the vocabulary.
-
-At this stage, the vocabulary is ``['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug']`` and our corpus is
-represented as
-
-::
-
-    ('hug', 10), ('p' 'ug', 5), ('p' 'un', 12), ('b' 'un', 4), ('hug' 's', 5)
-
-If we stop there, the tokenizer can apply the rules it learned to new words (as long as they don't contain characters that
-were not in the base vocabulary). For instance 'bug' would be tokenized as ``['b', 'ug']`` but mug would be tokenized as
-``['<unk>', 'ug']`` since the 'm' is not in the base vocabulary. This doesn't happen to letters in general (since the
-base corpus uses all of them), but to special characters like emojis.
-
-As we said before, the vocabulary size (which is the base vocabulary size + the number of merges) is a hyperparameter
-to choose. For instance :doc:`GPT <model_doc/gpt>` has a vocabulary size of 40,478 since they have 478 base characters
-and chose to stop the training of the tokenizer at 40,000 merges.
-
-Byte-level BPE
-^^^^^^^^^^^^^^
-
-To deal with the fact the base vocabulary needs to get all base characters, which can be quite big if one allows for
-all unicode characters, the
-`GPT-2 paper <https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`__
-introduces a clever trick, which is to use bytes as the base vocabulary (which gives a size of 256). With some
-additional rules to deal with punctuation, this manages to be able to tokenize every text without needing an unknown
-token. For instance, the :doc:`GPT-2 model <model_doc/gpt>` has a vocabulary size of 50,257, which corresponds to the
-256 bytes base tokens, a special end-of-text token and the symbols learned with 50,000 merges.
-
-.. _wordpiece:
-
-WordPiece
-=========
-
-WordPiece is the subword tokenization algorithm used for :doc:`BERT <model_doc/bert>` (as well as
-:doc:`DistilBERT <model_doc/distilbert>` and :doc:`Electra <model_doc/electra>`) and was outlined in
-`this paper <https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf>`__. It relies
-on the same base as BPE, which is to initialize the vocabulary to every character present in the corpus and
-progressively learn a given number of merge rules, the difference is that it doesn't choose the pair that is the most
-frequent but the one that will maximize the likelihood on the corpus once merged.
-
-What does this mean? Well, in the previous example, it means we would only merge 'u' and 'g' if the probability of
-having 'ug' divided by the probability of having 'u' then 'g' is greater than for any other pair of symbols. It's
-subtly different from what BPE does in the sense that it evaluates what it "loses" by merging two symbols and makes
-sure it's `worth it`.
-
-.. _unigram:
-
-Unigram
-=======
-
-Unigram is a subword tokenization algorithm introduced in `this paper <https://arxiv.org/pdf/1804.10959.pdf>`__.
-Instead of starting with a group of base symbols and learning merges with some rule, like BPE or WordPiece, it starts
-from a large vocabulary (for instance, all pretokenized words and the most common substrings) that it will trim down
-progressively. It's not used directly for any of the pretrained models in the library, but it's used in conjunction
-with :ref:`SentencePiece <sentencepiece>`.
-
-More specifically, at a given step, unigram computes a loss from the corpus we have and the current vocabulary, then,
-for each subword, evaluate how much the loss would augment if the subword was removed from the vocabulary. It then
-sorts the subwords by this quantity (that represents how worse the loss becomes if the token is removed) and removes
-all the worst p tokens (for instance p could be 10% or 20%). It then repeats the process until the vocabulary has
-reached the desired size, always keeping the base characters (to be able to tokenize any word written with them, like
-BPE or WordPiece).
-
-Contrary to BPE and WordPiece that work out rules in a certain order that you can then apply in the same order when
-tokenizing new text, Unigram will have several ways of tokenizing a new text. For instance, if it ends up with the
-vocabulary
-
-::
-
-    ['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug']
-
-we had before, it could tokenize "hugs" as ``['hug', 's']``, ``['h', 'ug', 's']`` or ``['h', 'u', 'g', 's']``. So which
-one choose? On top of saving the vocabulary, the trained tokenizer will save the probability of each token in the
-training corpus. You can then give a probability to each tokenization (which is the product of the probabilities of the
-tokens forming it) and pick the most likely one (or if you want to apply some data augmentation, you could sample one
-of the tokenization according to their probabilities).
-
-Those probabilities define the loss that trains the tokenizer: if our corpus consists of the
-words :math:`x_{1}, \dots, x_{N}` and if for the word :math:`x_{i}` we note :math:`S(x_{i})` the set of all possible
-tokenizations of :math:`x_{i}` (with the current vocabulary), then the loss is defined as
-
-.. math::
-    \mathcal{L} = -\sum_{i=1}^{N} \log \left ( \sum_{x \in S(x_{i})} p(x) \right )
-
-.. _sentencepiece:
-
-SentencePiece
-=============
-
-All the methods we have been looking at so far required some form of pretokenization, which has a central problem: not
-all languages use spaces to separate words. This is a problem :doc:`XLM <model_doc/xlm>` solves by using specific
-pretokenizers for each of those languages (in this case, Chinese, Japanese and Thai). To solve this problem,
-SentencePiece (introduced in `this paper <https://arxiv.org/pdf/1808.06226.pdf>`__) treats the input as a raw stream,
-includes the space in the set of characters to use, then uses BPE or unigram to construct the appropriate vocabulary.
-
-That's why in the example we saw before using :class:`~transformers.XLNetTokenizer` (which uses SentencePiece), we had
-the '▁' character, that represents space. Decoding a tokenized text is then super easy: we just have to concatenate
-all of them together and replace '▁' with space.
-
-All transformers models in the library that use SentencePiece use it with unigram. Examples of models using it are
-:doc:`ALBERT <model_doc/albert>`, :doc:`XLNet <model_doc/xlnet>` or the :doc:`Marian framework <model_doc/marian>`.
+Tokenizer summary
+-----------------------------------------------------------------------------------------------------------------------
+
+In this page, we will have a closer look at tokenization. As we saw in
+:doc:`the preprocessing tutorial <preprocessing>`, tokenizing a text is splitting it into words or subwords, which then
+are converted to ids. The second part is pretty straightforward, here we will focus on the first part. More
+specifically, we will look at the three main different kinds of tokenizers used in 🤗 Transformers:
+:ref:`Byte-Pair Encoding (BPE) <byte-pair-encoding>`, :ref:`WordPiece <wordpiece>` and
+:ref:`SentencePiece <sentencepiece>`, and provide examples of models using each of those.
+
+Note that on each model page, you can look at the documentation of the associated tokenizer to know which of those
+algorithms the pretrained model used. For instance, if we look at :class:`~transformers.BertTokenizer`, we can see it's
+using :ref:`WordPiece <wordpiece>`.
+
+Introduction to tokenization
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Splitting a text in smaller chunks is a task that's harder than it looks, and there are multiple ways of doing it. For
+instance, let's look at the sentence "Don't you love 🤗 Transformers? We sure do." A first simple way of tokenizing
+this text is just to split it by spaces, which would give:
+
+.. code-block::
+
+    ["Don't", "you", "love", "🤗", "Transformers?", "We", "sure", "do."]
+
+This is a nice first step, but if we look at the tokens "Transformers?" or "do.", we can see we can do better. Those
+will be different than the tokens "Transformers" and "do" for our model, so we should probably take the punctuation
+into account. This would give:
+
+.. code-block::
+
+    ["Don", "'", "t", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]
+
+which is better already. One thing that is annoying though is how it dealt with "Don't". "Don't" stands for do not, so
+it should probably be better tokenized as ``["Do", "n't"]``. This is where things start getting more complicated, and
+part of the reason each kind of model has its own tokenizer class. Depending on the rules we apply to split our texts
+into tokens, we'll get different tokenized versions of the same text. And of course, a given pretrained model won't
+perform properly if you don't use the exact same rules as the persons who pretrained it.
+
+`spaCy <https://spacy.io/>`__ and `Moses <http://www.statmt.org/moses/?n=Development.GetStarted>`__ are two popular
+rule-based tokenizers. On the text above, they'd output something like:
+
+.. code-block::
+
+    ["Do", "n't", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]
+
+Space/punctuation-tokenization and rule-based tokenization are both examples of word tokenization, which is splitting a
+sentence into words. While it's the most intuitive way to separate texts in smaller chunks, it can have a problem when
+you have a huge corpus: it usually yields a very big vocabulary (the set of all unique tokens used).
+:doc:`Transformer XL <model_doc/transformerxl>` for instance uses space/punctuation-tokenization, and has a vocabulary
+size of 267,735!
+
+A huge vocabulary size means a huge embedding matrix at the start of the model, which will cause memory problems.
+TransformerXL deals with it by using a special kind of embeddings called adaptive embeddings, but in general,
+transformers models rarely have a vocabulary size greater than 50,000, especially if they are trained on a single
+language.
+
+So if tokenizing on words is unsatisfactory, we could go on the opposite direction and simply tokenize on characters.
+While it's very simple and would save a lot of memory, this doesn't allow the model to learn representations of texts
+as meaningful as when using a word tokenization, leading to a loss of performance. So to get the best of both worlds,
+all transformers models use a hybrid between word-level and character-level tokenization called subword tokenization.
+
+Subword tokenization
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Subword tokenization algorithms rely on the principle that most common words should be left as is, but rare words
+should be decomposed in meaningful subword units. For instance "annoyingly" might be considered a rare word and
+decomposed as "annoying" and "ly". This is especially useful in agglutinative languages such as Turkish, where you can
+form (almost) arbitrarily long complex words by stringing together some subwords.
+
+This allows the model to keep a reasonable vocabulary while still learning useful representations for common words or
+subwords. This also enables the model to process words it has never seen before, by decomposing them into
+subwords it knows. For instance, the base :class:`~transformers.BertTokenizer` will tokenize "I have a new GPU!" like
+this:
+
+.. code-block::
+
+    >>> from transformers import BertTokenizer
+    >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+    >>> tokenizer.tokenize("I have a new GPU!")
+    ['i', 'have', 'a', 'new', 'gp', '##u', '!']
+
+Since we are considering the uncased model, the sentence was lowercased first. Then all the words were present in the
+vocabulary of the tokenizer, except for "gpu", so the tokenizer split it in subwords it knows: "gp" and "##u". The "##"
+means that the rest of the token should be attached to the previous one, without space (for when we need to decode
+predictions and reverse the tokenization).
+
+Another example is when we use the base :class:`~transformers.XLNetTokenizer` to tokenize our previous text:
+
+.. code-block::
+
+    >>> from transformers import XLNetTokenizer
+    >>> tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
+    >>> tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.")
+    ['▁Don', "'", 't', '▁you', '▁love', '▁', '🤗', '▁', 'Transform', 'ers', '?', '▁We', '▁sure', '▁do', '.']
+
+We'll get back to the meaning of those '▁' when we look at :ref:`SentencePiece <sentencepiece>` but you can see
+Transformers has been split into "Transform" and "ers".
+
+Let's now look at how the different subword tokenization algorithms work. Note that they all rely on some form of
+training which is usually done on the corpus the corresponding model will be trained on.
+
+.. _byte-pair-encoding:
+
+Byte-Pair Encoding
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Byte-Pair Encoding was introduced in `this paper <https://arxiv.org/abs/1508.07909>`__. It relies on a pretokenizer
+splitting the training data into words, which can be a simple space tokenization
+(:doc:`GPT-2 <model_doc/gpt2>` and :doc:`Roberta <model_doc/roberta>` uses this for instance) or a rule-based tokenizer
+(:doc:`XLM <model_doc/xlm>` use Moses for most languages, as does :doc:`FlauBERT <model_doc/flaubert>`),
+
+:doc:`GPT <model_doc/gpt>` uses Spacy and ftfy, and counts the frequency of each word in the training corpus.
+
+It then begins from the list of all characters, and will learn merge rules to form a new token from two symbols in the
+vocabulary until it has learned a vocabulary of the desired size (this is a hyperparameter to pick).
+
+Let's say that after the pre-tokenization we have the following words (the number indicating the frequency of each
+word):
+
+.. code-block::
+
+    ('hug', 10), ('pug', 5), ('pun', 12), ('bun', 4), ('hugs', 5)
+
+Then the base vocabulary is ['b', 'g', 'h', 'n', 'p', 's', 'u'] and all our words are first split by character:
+
+.. code-block::
+
+    ('h' 'u' 'g', 10), ('p' 'u' 'g', 5), ('p' 'u' 'n', 12), ('b' 'u' 'n', 4), ('h' 'u' 'g' 's', 5)
+
+We then take each pair of symbols and look at the most frequent. For instance 'hu' is present `10 + 5 = 15` times (10
+times in the 10 occurrences of 'hug', 5 times in the 5 occurrences of 'hugs'). The most frequent here is 'ug', present
+`10 + 5 + 5 = 20` times in total. So the first merge rule the tokenizer learns is to group all 'u' and 'g' together
+then it adds 'ug' to the vocabulary. Our corpus then becomes
+
+.. code-block::
+
+    ('h' 'ug', 10), ('p' 'ug', 5), ('p' 'u' 'n', 12), ('b' 'u' 'n', 4), ('h' 'ug' 's', 5)
+
+and we continue by looking at the next most common pair of symbols. It's 'un', present 16 times, so we merge those two
+and add 'un' to the vocabulary. Then it's 'hug' (as 'h' + 'ug'), present 15 times, so we merge those two and add 'hug'
+to the vocabulary.
+
+At this stage, the vocabulary is ``['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug']`` and our corpus is
+represented as
+
+.. code-block::
+
+    ('hug', 10), ('p' 'ug', 5), ('p' 'un', 12), ('b' 'un', 4), ('hug' 's', 5)
+
+If we stop there, the tokenizer can apply the rules it learned to new words (as long as they don't contain characters that
+were not in the base vocabulary). For instance 'bug' would be tokenized as ``['b', 'ug']`` but mug would be tokenized as
+``['<unk>', 'ug']`` since the 'm' is not in the base vocabulary. This doesn't happen to letters in general (since the
+base corpus uses all of them), but to special characters like emojis.
+
+As we said before, the vocabulary size (which is the base vocabulary size + the number of merges) is a hyperparameter
+to choose. For instance :doc:`GPT <model_doc/gpt>` has a vocabulary size of 40,478 since they have 478 base characters
+and chose to stop the training of the tokenizer at 40,000 merges.
+
+Byte-level BPE
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To deal with the fact the base vocabulary needs to get all base characters, which can be quite big if one allows for
+all unicode characters, the
+`GPT-2 paper <https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`__
+introduces a clever trick, which is to use bytes as the base vocabulary (which gives a size of 256). With some
+additional rules to deal with punctuation, this manages to be able to tokenize every text without needing an unknown
+token. For instance, the :doc:`GPT-2 model <model_doc/gpt>` has a vocabulary size of 50,257, which corresponds to the
+256 bytes base tokens, a special end-of-text token and the symbols learned with 50,000 merges.
+
+.. _wordpiece:
+
+WordPiece
+=======================================================================================================================
+
+WordPiece is the subword tokenization algorithm used for :doc:`BERT <model_doc/bert>` (as well as
+:doc:`DistilBERT <model_doc/distilbert>` and :doc:`Electra <model_doc/electra>`) and was outlined in
+`this paper <https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf>`__. It relies
+on the same base as BPE, which is to initialize the vocabulary to every character present in the corpus and
+progressively learn a given number of merge rules, the difference is that it doesn't choose the pair that is the most
+frequent but the one that will maximize the likelihood on the corpus once merged.
+
+What does this mean? Well, in the previous example, it means we would only merge 'u' and 'g' if the probability of
+having 'ug' divided by the probability of having 'u' then 'g' is greater than for any other pair of symbols. It's
+subtly different from what BPE does in the sense that it evaluates what it "loses" by merging two symbols and makes
+sure it's `worth it`.
+
+.. _unigram:
+
+Unigram
+=======================================================================================================================
+
+Unigram is a subword tokenization algorithm introduced in `this paper <https://arxiv.org/pdf/1804.10959.pdf>`__.
+Instead of starting with a group of base symbols and learning merges with some rule, like BPE or WordPiece, it starts
+from a large vocabulary (for instance, all pretokenized words and the most common substrings) that it will trim down
+progressively. It's not used directly for any of the pretrained models in the library, but it's used in conjunction
+with :ref:`SentencePiece <sentencepiece>`.
+
+More specifically, at a given step, unigram computes a loss from the corpus we have and the current vocabulary, then,
+for each subword, evaluate how much the loss would augment if the subword was removed from the vocabulary. It then
+sorts the subwords by this quantity (that represents how worse the loss becomes if the token is removed) and removes
+all the worst p tokens (for instance p could be 10% or 20%). It then repeats the process until the vocabulary has
+reached the desired size, always keeping the base characters (to be able to tokenize any word written with them, like
+BPE or WordPiece).
+
+Contrary to BPE and WordPiece that work out rules in a certain order that you can then apply in the same order when
+tokenizing new text, Unigram will have several ways of tokenizing a new text. For instance, if it ends up with the
+vocabulary
+
+.. code-block::
+
+    ['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug']
+
+we had before, it could tokenize "hugs" as ``['hug', 's']``, ``['h', 'ug', 's']`` or ``['h', 'u', 'g', 's']``. So which
+one choose? On top of saving the vocabulary, the trained tokenizer will save the probability of each token in the
+training corpus. You can then give a probability to each tokenization (which is the product of the probabilities of the
+tokens forming it) and pick the most likely one (or if you want to apply some data augmentation, you could sample one
+of the tokenization according to their probabilities).
+
+Those probabilities define the loss that trains the tokenizer: if our corpus consists of the
+words :math:`x_{1}, \dots, x_{N}` and if for the word :math:`x_{i}` we note :math:`S(x_{i})` the set of all possible
+tokenizations of :math:`x_{i}` (with the current vocabulary), then the loss is defined as
+
+.. math::
+    \mathcal{L} = -\sum_{i=1}^{N} \log \left ( \sum_{x \in S(x_{i})} p(x) \right )
+
+.. _sentencepiece:
+
+SentencePiece
+=======================================================================================================================
+
+All the methods we have been looking at so far required some form of pretokenization, which has a central problem: not
+all languages use spaces to separate words. This is a problem :doc:`XLM <model_doc/xlm>` solves by using specific
+pretokenizers for each of those languages (in this case, Chinese, Japanese and Thai). To solve this problem,
+SentencePiece (introduced in `this paper <https://arxiv.org/pdf/1808.06226.pdf>`__) treats the input as a raw stream,
+includes the space in the set of characters to use, then uses BPE or unigram to construct the appropriate vocabulary.
+
+That's why in the example we saw before using :class:`~transformers.XLNetTokenizer` (which uses SentencePiece), we had
+the '▁' character, that represents space. Decoding a tokenized text is then super easy: we just have to concatenate
+all of them together and replace '▁' with space.
+
+All transformers models in the library that use SentencePiece use it with unigram. Examples of models using it are
+:doc:`ALBERT <model_doc/albert>`, :doc:`XLNet <model_doc/xlnet>` or the :doc:`Marian framework <model_doc/marian>`.
diff --git a/docs/source/training.rst b/docs/source/training.rst
index 5d0cbe982b..9a3e510583 100644
--- a/docs/source/training.rst
+++ b/docs/source/training.rst
@@ -1,5 +1,5 @@
 Training and fine-tuning
-========================
+=======================================================================================================================
 
 Model classes in 🤗 Transformers are designed to be compatible with native
 PyTorch and TensorFlow 2 and can be used seemlessly with either. In this
@@ -24,7 +24,7 @@ Sections:
 .. _pytorch:
 
 Fine-tuning in native PyTorch
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Model classes in 🤗 Transformers that don't begin with ``TF`` are
 `PyTorch Modules <https://pytorch.org/docs/master/generated/torch.nn.Module.html>`_,
@@ -141,7 +141,7 @@ with features like mixed precision and easy tensorboard logging.
 
 
 Freezing the encoder
---------------------
+-----------------------------------------------------------------------------------------------------------------------
 
 In some cases, you might be interested in keeping the weights of the
 pre-trained encoder frozen and optimizing only the weights of the head
@@ -158,7 +158,7 @@ submodule on any task-specific model in the library:
 .. _tensorflow:
 
 Fine-tuning in native TensorFlow 2
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Models can also be trained natively in TensorFlow 2. Just as with PyTorch,
 TensorFlow models can be instantiated with
@@ -210,7 +210,7 @@ can even save the model and then reload it as a PyTorch model (or vice-versa):
 .. _trainer:
 
 Trainer
-^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 We also provide a simple but feature-complete training and evaluation
 interface through :func:`~transformers.Trainer` and
@@ -303,7 +303,7 @@ launching tensorboard in your specified ``logging_dir`` directory.
 .. _additional-resources:
 
 Additional resources
-^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 - `A lightweight colab demo <https://colab.research.google.com/drive/1-JIJlao4dI-Ilww_NnTc0rxtp-ymgDgM?usp=sharing>`_
   which uses ``Trainer`` for IMDb sentiment classification.
diff --git a/src/transformers/configuration_albert.py b/src/transformers/configuration_albert.py
index 789d690b17..4e26d76033 100644
--- a/src/transformers/configuration_albert.py
+++ b/src/transformers/configuration_albert.py
@@ -32,54 +32,55 @@ ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
 
 class AlbertConfig(PretrainedConfig):
     r"""
-    This is the configuration class to store the configuration of a :class:`~transformers.AlbertModel`.
-    It is used to instantiate an ALBERT model according to the specified arguments, defining the model
-    architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
-    the ALBERT `xxlarge <https://huggingface.co/albert-xxlarge-v2>`__ architecture.
+    This is the configuration class to store the configuration of a :class:`~transformers.AlbertModel` or a
+    :class:`~transformers.TFAlbertModel`. It is used to instantiate an ALBERT model according to the specified
+    arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
+    configuration to that of the ALBERT `xxlarge <https://huggingface.co/albert-xxlarge-v2>`__ architecture.
 
     Configuration objects inherit from  :class:`~transformers.PretrainedConfig` and can be used
     to control the model outputs. Read the documentation from  :class:`~transformers.PretrainedConfig`
     for more information.
 
-
     Args:
-        vocab_size (:obj:`int`, optional, defaults to 30000):
-            Vocabulary size of the ALBERT model. Defines the different tokens that
-            can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.AlbertModel`.
-        embedding_size (:obj:`int`, optional, defaults to 128):
+        vocab_size (:obj:`int`, `optional`, defaults to 30000):
+            Vocabulary size of the ALBERT model. Defines the number of different tokens that can be represented by the
+            :obj:`inputs_ids` passed when calling :class:`~transformers.AlbertModel` or
+            :class:`~transformers.TFAlbertModel`.
+        embedding_size (:obj:`int`, `optional`, defaults to 128):
             Dimensionality of vocabulary embeddings.
-        hidden_size (:obj:`int`, optional, defaults to 4096):
+        hidden_size (:obj:`int`, `optional`, defaults to 4096):
             Dimensionality of the encoder layers and the pooler layer.
-        num_hidden_layers (:obj:`int`, optional, defaults to 12):
+        num_hidden_layers (:obj:`int`, `optional`, defaults to 12):
             Number of hidden layers in the Transformer encoder.
-        num_hidden_groups (:obj:`int`, optional, defaults to 1):
+        num_hidden_groups (:obj:`int`, `optional`, defaults to 1):
             Number of groups for the hidden layers, parameters in the same group are shared.
-        num_attention_heads (:obj:`int`, optional, defaults to 64):
+        num_attention_heads (:obj:`int`, `optional`, defaults to 64):
             Number of attention heads for each attention layer in the Transformer encoder.
-        intermediate_size (:obj:`int`, optional, defaults to 16384):
-            The dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
-        inner_group_num (:obj:`int`, optional, defaults to 1):
+        intermediate_size (:obj:`int`, `optional`, defaults to 16384):
+            The dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
+        inner_group_num (:obj:`int`, `optional`, defaults to 1):
             The number of inner repetition of attention and ffn.
-        hidden_act (:obj:`str` or :obj:`function`, optional, defaults to "gelu_new"):
+        hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu_new"`):
             The non-linear activation function (function or string) in the encoder and pooler.
-            If string, "gelu", "relu", "swish" and "gelu_new" are supported.
-        hidden_dropout_prob (:obj:`float`, optional, defaults to 0):
+            If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
+        hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0):
             The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
-        attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0):
+        attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0):
             The dropout ratio for the attention probabilities.
-        max_position_embeddings (:obj:`int`, optional, defaults to 512):
+        max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
             The maximum sequence length that this model might ever be used with. Typically set this to something
             large (e.g., 512 or 1024 or 2048).
-        type_vocab_size (:obj:`int`, optional, defaults to 2):
-            The vocabulary size of the `token_type_ids` passed into :class:`~transformers.AlbertModel`.
-        initializer_range (:obj:`float`, optional, defaults to 0.02):
+        type_vocab_size (:obj:`int`, `optional`, defaults to 2):
+            The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.AlbertModel` or
+            :class:`~transformers.TFAlbertModel`.
+        initializer_range (:obj:`float`, `optional`, defaults to 0.02):
             The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
-        layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):
+        layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
             The epsilon used by the layer normalization layers.
-        classifier_dropout_prob (:obj:`float`, optional, defaults to 0.1):
+        classifier_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
             The dropout ratio for attached classifiers.
 
-    Example::
+    Examples::
 
         >>> from transformers import AlbertConfig, AlbertModel
         >>> # Initializing an ALBERT-xxlarge style configuration
diff --git a/src/transformers/configuration_bert.py b/src/transformers/configuration_bert.py
index 3ef298cfcb..9de97e9c42 100644
--- a/src/transformers/configuration_bert.py
+++ b/src/transformers/configuration_bert.py
@@ -50,10 +50,10 @@ BERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
 
 class BertConfig(PretrainedConfig):
     r"""
-    This is the configuration class to store the configuration of a :class:`~transformers.BertModel`.
-    It is used to instantiate an BERT model according to the specified arguments, defining the model
-    architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
-    the BERT `bert-base-uncased <https://huggingface.co/bert-base-uncased>`__ architecture.
+    This is the configuration class to store the configuration of a :class:`~transformers.BertModel` or a
+    :class:`~transformers.TFBertModel`. It is used to instantiate a BERT model according to the specified
+    arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
+    configuration to that of the BERT `bert-base-uncased <https://huggingface.co/bert-base-uncased>`__ architecture.
 
     Configuration objects inherit from  :class:`~transformers.PretrainedConfig` and can be used
     to control the model outputs. Read the documentation from  :class:`~transformers.PretrainedConfig`
@@ -61,37 +61,39 @@ class BertConfig(PretrainedConfig):
 
 
     Args:
-        vocab_size (:obj:`int`, optional, defaults to 30522):
-            Vocabulary size of the BERT model. Defines the different tokens that
-            can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.BertModel`.
-        hidden_size (:obj:`int`, optional, defaults to 768):
+        vocab_size (:obj:`int`, `optional`, defaults to 30522):
+            Vocabulary size of the BERT model. Defines the number of different tokens that can be represented by the
+            :obj:`inputs_ids` passed when calling :class:`~transformers.BertModel` or
+            :class:`~transformers.TFBertModel`.
+        hidden_size (:obj:`int`, `optional`, defaults to 768):
             Dimensionality of the encoder layers and the pooler layer.
-        num_hidden_layers (:obj:`int`, optional, defaults to 12):
+        num_hidden_layers (:obj:`int`, `optional`, defaults to 12):
             Number of hidden layers in the Transformer encoder.
-        num_attention_heads (:obj:`int`, optional, defaults to 12):
+        num_attention_heads (:obj:`int`, `optional`, defaults to 12):
             Number of attention heads for each attention layer in the Transformer encoder.
-        intermediate_size (:obj:`int`, optional, defaults to 3072):
-            Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
-        hidden_act (:obj:`str` or :obj:`function`, optional, defaults to "gelu"):
+        intermediate_size (:obj:`int`, `optional`, defaults to 3072):
+            Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
+        hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
             The non-linear activation function (function or string) in the encoder and pooler.
-            If string, "gelu", "relu", "swish" and "gelu_new" are supported.
-        hidden_dropout_prob (:obj:`float`, optional, defaults to 0.1):
+            If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
+        hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
             The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
-        attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):
+        attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
             The dropout ratio for the attention probabilities.
-        max_position_embeddings (:obj:`int`, optional, defaults to 512):
+        max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
             The maximum sequence length that this model might ever be used with.
             Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
-        type_vocab_size (:obj:`int`, optional, defaults to 2):
-            The vocabulary size of the `token_type_ids` passed into :class:`~transformers.BertModel`.
-        initializer_range (:obj:`float`, optional, defaults to 0.02):
+        type_vocab_size (:obj:`int`, `optional`, defaults to 2):
+            The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.BertModel` or
+            :class:`~transformers.TFBertModel`.
+        initializer_range (:obj:`float`, `optional`, defaults to 0.02):
             The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
-        layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):
+        layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
             The epsilon used by the layer normalization layers.
-        gradient_checkpointing (:obj:`bool`, optional, defaults to :obj:`False`):
+        gradient_checkpointing (:obj:`bool`, `optional`, defaults to :obj:`False`):
             If True, use gradient checkpointing to save memory at the expense of slower backward pass.
 
-    Example::
+    Examples::
 
         >>> from transformers import BertModel, BertConfig
 
diff --git a/src/transformers/configuration_bert_generation.py b/src/transformers/configuration_bert_generation.py
index 387802828d..4e9c543cd4 100644
--- a/src/transformers/configuration_bert_generation.py
+++ b/src/transformers/configuration_bert_generation.py
@@ -19,18 +19,18 @@ from .configuration_utils import PretrainedConfig
 
 class BertGenerationConfig(PretrainedConfig):
     r"""
-    This is the configuration class to store the configuration of a :class:`~transformers.BertGenerationPreTrainedModel`.
-    It is used to instantiate a BertGenerationConfig model according to the specified arguments, defining the model architecture.
+    This is the configuration class to store the configuration of a
+    :class:`~transformers.BertGenerationPreTrainedModel`. It is used to instantiate a BertGeneration model according to
+    the specified arguments, defining the model architecture.
 
     Configuration objects inherit from  :class:`~transformers.PretrainedConfig` and can be used
     to control the model outputs. Read the documentation from  :class:`~transformers.PretrainedConfig`
     for more information.
 
-
     Args:
         vocab_size (:obj:`int`, `optional`, defaults to 50358):
-            Vocabulary size of the BertGeneration model. Defines the different tokens that
-            can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.BertGeneration`.
+            Vocabulary size of the BERT model. Defines the number of different tokens that can be represented by the
+            :obj:`inputs_ids` passed when calling :class:`~transformers.BertGeneration`.
         hidden_size (:obj:`int`, `optional`, defaults to 1024):
             Dimensionality of the encoder layers and the pooler layer.
         num_hidden_layers (:obj:`int`, `optional`, defaults to 24):
@@ -38,7 +38,7 @@ class BertGenerationConfig(PretrainedConfig):
         num_attention_heads (:obj:`int`, `optional`, defaults to 16):
             Number of attention heads for each attention layer in the Transformer encoder.
         intermediate_size (:obj:`int`, `optional`, defaults to 3072):
-            Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
+            Dimensionality of the "intermediate" (often called feed-forward) layer in the Transformer encoder.
         hidden_act (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
             The non-linear activation function (function or string) in the encoder and pooler.
             If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
@@ -56,7 +56,7 @@ class BertGenerationConfig(PretrainedConfig):
         gradient_checkpointing (:obj:`bool`, `optional`, defaults to :obj:`False`):
             If :obj:`True`, use gradient checkpointing to save memory at the expense of slower backward pass.
 
-    Example::
+    Examples::
 
         >>> from transformers import BertGenerationConfig, BertGenerationEncoder
 
diff --git a/src/transformers/configuration_ctrl.py b/src/transformers/configuration_ctrl.py
index 304aa06b0d..7104a8ceec 100644
--- a/src/transformers/configuration_ctrl.py
+++ b/src/transformers/configuration_ctrl.py
@@ -25,44 +25,45 @@ CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP = {"ctrl": "https://s3.amazonaws.com/models.h
 
 class CTRLConfig(PretrainedConfig):
     """
-    This is the configuration class to store the configuration of a :class:`~transformers.CTRLModel`.
-    It is used to instantiate an CTRL model according to the specified arguments, defining the model
-    architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
-    the `ctrl <https://huggingface.co/ctrl>`__ architecture from SalesForce.
+    This is the configuration class to store the configuration of a :class:`~transformers.CTRLModel` or a
+    :class:`~transformers.TFCTRLModel`. It is used to instantiate a CTRL model according to the specified
+    arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
+    configuration to that of the `ctrl <https://huggingface.co/ctrl>`__ architecture from SalesForce.
 
     Configuration objects inherit from  :class:`~transformers.PretrainedConfig` and can be used
     to control the model outputs. Read the documentation from  :class:`~transformers.PretrainedConfig`
     for more information.
 
     Args:
-        vocab_size (:obj:`int`, optional, defaults to 246534):
-            Vocabulary size of the CTRL model. Defines the different tokens that
-            can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.CTRLModel`.
-        n_positions (:obj:`int`, optional, defaults to 256):
+        vocab_size (:obj:`int`, `optional`, defaults to 246534):
+            Vocabulary size of the CTRL model. Defines the number of different tokens that can be represented by the
+            :obj:`inputs_ids` passed when calling :class:`~transformers.CTRLModel` or
+            :class:`~transformers.TFCTRLModel`.
+        n_positions (:obj:`int`, `optional`, defaults to 256):
             The maximum sequence length that this model might ever be used with.
             Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
-        n_ctx (:obj:`int`, optional, defaults to 256):
+        n_ctx (:obj:`int`, `optional`, defaults to 256):
             Dimensionality of the causal mask (usually same as n_positions).
-        n_embd (:obj:`int`, optional, defaults to 1280):
+        n_embd (:obj:`int`, `optional`, defaults to 1280):
             Dimensionality of the embeddings and hidden states.
-        dff (:obj:`int`, optional, defaults to 8192):
-            Dimensionality of the inner dimension of the FFN.
-        n_layer (:obj:`int`, optional, defaults to 48):
+        dff (:obj:`int`, `optional`, defaults to 8192):
+            Dimensionality of the inner dimension of the feed forward networks (FFN).
+        n_layer (:obj:`int`, `optional`, defaults to 48):
             Number of hidden layers in the Transformer encoder.
-        n_head (:obj:`int`, optional, defaults to 16):
+        n_head (:obj:`int`, `optional`, defaults to 16):
             Number of attention heads for each attention layer in the Transformer encoder.
-        resid_pdrop (:obj:`float`, optional, defaults to 0.1):
+        resid_pdrop (:obj:`float`, `optional`, defaults to 0.1):
             The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
-        embd_pdrop (:obj:`int`, optional, defaults to 0.1):
+        embd_pdrop (:obj:`int`, `optional`, defaults to 0.1):
             The dropout ratio for the embeddings.
-        attn_pdrop (:obj:`float`, optional, defaults to 0.1):
+        attn_pdrop (:obj:`float`, `optional`, defaults to 0.1):
             The dropout ratio for the attention.
-        layer_norm_epsilon (:obj:`float`, optional, defaults to 1e-6):
+        layer_norm_epsilon (:obj:`float`, `optional`, defaults to 1e-6):
             The epsilon to use in the layer normalization layers
-        initializer_range (:obj:`float`, optional, defaults to 0.02):
+        initializer_range (:obj:`float`, `optional`, defaults to 0.02):
             The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
 
-    Example::
+    Examples::
 
         >>> from transformers import CTRLModel, CTRLConfig
 
diff --git a/src/transformers/configuration_distilbert.py b/src/transformers/configuration_distilbert.py
index 2f0b2763b6..03b04c732d 100644
--- a/src/transformers/configuration_distilbert.py
+++ b/src/transformers/configuration_distilbert.py
@@ -33,50 +33,51 @@ DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
 
 class DistilBertConfig(PretrainedConfig):
     r"""
-    This is the configuration class to store the configuration of a :class:`~transformers.DistilBertModel`.
-    It is used to instantiate a DistilBERT model according to the specified arguments, defining the model
-    architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
-    the DistilBERT `distilbert-base-uncased <https://huggingface.co/distilbert-base-uncased>`__ architecture.
+    This is the configuration class to store the configuration of a :class:`~transformers.DistilBertModel` or a
+    :class:`~transformers.TFDistilBertModel`. It is used to instantiate a DistilBERT model according to the specified
+    arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
+    configuration to that of the DistilBERT
+    `distilbert-base-uncased <https://huggingface.co/distilbert-base-uncased>`__ architecture.
 
     Configuration objects inherit from  :class:`~transformers.PretrainedConfig` and can be used
     to control the model outputs. Read the documentation from  :class:`~transformers.PretrainedConfig`
     for more information.
 
-
     Args:
-        vocab_size (:obj:`int`, optional, defaults to 30522):
-            Vocabulary size of the DistilBERT model. Defines the different tokens that
-            can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.BertModel`.
-        max_position_embeddings (:obj:`int`, optional, defaults to 512):
+        vocab_size (:obj:`int`, `optional`, defaults to 30522):
+            Vocabulary size of the DistilBERT model. Defines the number of different tokens that can be represented by the
+            :obj:`inputs_ids` passed when calling :class:`~transformers.DistilBertModel` or
+            :class:`~transformers.TFDistilBertModel`.
+        max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
             The maximum sequence length that this model might ever be used with.
             Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
-        sinusoidal_pos_embds (:obj:`boolean`, optional, defaults to :obj:`False`):
+        sinusoidal_pos_embds (:obj:`boolean`, `optional`, defaults to :obj:`False`):
             Whether to use sinusoidal positional embeddings.
-        n_layers (:obj:`int`, optional, defaults to 6):
+        n_layers (:obj:`int`, `optional`, defaults to 6):
             Number of hidden layers in the Transformer encoder.
-        n_heads (:obj:`int`, optional, defaults to 12):
+        n_heads (:obj:`int`, `optional`, defaults to 12):
             Number of attention heads for each attention layer in the Transformer encoder.
-        dim (:obj:`int`, optional, defaults to 768):
+        dim (:obj:`int`, `optional`, defaults to 768):
             Dimensionality of the encoder layers and the pooler layer.
-        hidden_dim (:obj:`int`, optional, defaults to 3072):
-            The size of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
-        dropout (:obj:`float`, optional, defaults to 0.1):
+        hidden_dim (:obj:`int`, `optional`, defaults to 3072):
+            The size of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
+        dropout (:obj:`float`, `optional`, defaults to 0.1):
             The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
-        attention_dropout (:obj:`float`, optional, defaults to 0.1):
+        attention_dropout (:obj:`float`, `optional`, defaults to 0.1):
             The dropout ratio for the attention probabilities.
-        activation (:obj:`str` or :obj:`function`, optional, defaults to "gelu"):
+        activation (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
             The non-linear activation function (function or string) in the encoder and pooler.
-            If string, "gelu", "relu", "swish" and "gelu_new" are supported.
-        initializer_range (:obj:`float`, optional, defaults to 0.02):
+            If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
+        initializer_range (:obj:`float`, `optional`, defaults to 0.02):
             The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
-        qa_dropout (:obj:`float`, optional, defaults to 0.1):
+        qa_dropout (:obj:`float`, `optional`, defaults to 0.1):
             The dropout probabilities used in the question answering model
             :class:`~transformers.DistilBertForQuestionAnswering`.
-        seq_classif_dropout (:obj:`float`, optional, defaults to 0.2):
+        seq_classif_dropout (:obj:`float`, `optional`, defaults to 0.2):
             The dropout probabilities used in the sequence classification and the multiple choice model
             :class:`~transformers.DistilBertForSequenceClassification`.
 
-    Example::
+    Examples::
 
         >>> from transformers import DistilBertModel, DistilBertConfig
 
diff --git a/src/transformers/configuration_dpr.py b/src/transformers/configuration_dpr.py
index b8efb59867..ecc23b9100 100644
--- a/src/transformers/configuration_dpr.py
+++ b/src/transformers/configuration_dpr.py
@@ -32,8 +32,12 @@ class DPRConfig(PretrainedConfig):
     :class:`~transformers.DPRConfig` is the configuration class to store the configuration of a
     `DPRModel`.
 
-    This is the configuration class to store the configuration of a `DPRContextEncoder`, `DPRQuestionEncoder`, or a `DPRReader`.
-    It is used to instantiate the components of the DPR model.
+    This is the configuration class to store the configuration of a :class:`~transformers.DPRContextEncoder`,
+    :class:`~transformers.DPRQuestionEncoder`, or a :class:`~transformers.DPRReader`. It is used to instantiate the
+    components of the DPR model.
+
+    This class is a subclass of :class:`~transformers.BertConfig`. Please check the
+    superclass for the documentation of all kwargs.
 
     Args:
         vocab_size (:obj:`int`, `optional`, defaults to 30522):
diff --git a/src/transformers/configuration_electra.py b/src/transformers/configuration_electra.py
index bdc17c6779..c8cb568acc 100644
--- a/src/transformers/configuration_electra.py
+++ b/src/transformers/configuration_electra.py
@@ -33,11 +33,11 @@ ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP = {
 
 class ElectraConfig(PretrainedConfig):
     r"""
-    This is the configuration class to store the configuration of a :class:`~transformers.ElectraModel`.
-    It is used to instantiate an ELECTRA model according to the specified arguments, defining the model
-    architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
-    the ELECTRA `google/electra-small-discriminator <https://huggingface.co/google/electra-small-discriminator>`__
-    architecture.
+    This is the configuration class to store the configuration of a :class:`~transformers.ElectraModel` or a
+    :class:`~transformers.TFElectraModel`. It is used to instantiate a ELECTRA model according to the specified
+    arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
+    configuration to that of the ELECTRA
+    `google/electra-small-discriminator <https://huggingface.co/google/electra-small-discriminator>`__ architecture.
 
     Configuration objects inherit from  :class:`~transformers.PretrainedConfig` and can be used
     to control the model outputs. Read the documentation from  :class:`~transformers.PretrainedConfig`
@@ -45,59 +45,61 @@ class ElectraConfig(PretrainedConfig):
 
 
     Args:
-        vocab_size (:obj:`int`, optional, defaults to 30522):
-            Vocabulary size of the ELECTRA model. Defines the different tokens that
-            can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.ElectraModel`.
-        embedding_size (:obj:`int`, optional, defaults to 128):
+        vocab_size (:obj:`int`, `optional`, defaults to 30522):
+            Vocabulary size of the ELECTRA model. Defines the number of different tokens that can be represented by the
+            :obj:`inputs_ids` passed when calling :class:`~transformers.ElectraModel` or
+            :class:`~transformers.TFElectraModel`.
+        embedding_size (:obj:`int`, `optional`, defaults to 128):
             Dimensionality of the encoder layers and the pooler layer.
-        hidden_size (:obj:`int`, optional, defaults to 256):
+        hidden_size (:obj:`int`, `optional`, defaults to 256):
             Dimensionality of the encoder layers and the pooler layer.
-        num_hidden_layers (:obj:`int`, optional, defaults to 12):
+        num_hidden_layers (:obj:`int`, `optional`, defaults to 12):
             Number of hidden layers in the Transformer encoder.
-        num_attention_heads (:obj:`int`, optional, defaults to 4):
+        num_attention_heads (:obj:`int`, `optional`, defaults to 4):
             Number of attention heads for each attention layer in the Transformer encoder.
-        intermediate_size (:obj:`int`, optional, defaults to 1024):
+        intermediate_size (:obj:`int`, `optional`, defaults to 1024):
             Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
-        hidden_act (:obj:`str` or :obj:`function`, optional, defaults to "gelu"):
+        hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
             The non-linear activation function (function or string) in the encoder and pooler.
-            If string, "gelu", "relu", "swish" and "gelu_new" are supported.
-        hidden_dropout_prob (:obj:`float`, optional, defaults to 0.1):
+            If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
+        hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
             The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
-        attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):
+        attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
             The dropout ratio for the attention probabilities.
-        max_position_embeddings (:obj:`int`, optional, defaults to 512):
+        max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
             The maximum sequence length that this model might ever be used with.
             Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
-        type_vocab_size (:obj:`int`, optional, defaults to 2):
-            The vocabulary size of the `token_type_ids` passed into :class:`~transformers.ElectraModel`.
-        initializer_range (:obj:`float`, optional, defaults to 0.02):
+        type_vocab_size (:obj:`int`, `optional`, defaults to 2):
+            The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.ElectraModel` or
+            :class:`~transformers.TFElectraModel`.
+        initializer_range (:obj:`float`, `optional`, defaults to 0.02):
             The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
-        layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):
+        layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
             The epsilon used by the layer normalization layers.
-        summary_type (:obj:`string`, optional, defaults to "first"):
-            Argument used when doing sequence summary. Used in for the multiple choice head in
-            :class:`~transformers.ElectraForMultipleChoice`.
-            Is one of the following options:
+        summary_type (:obj:`str`, `optional`, defaults to :obj:`"first"`):
+            Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
 
-                - 'last' => take the last token hidden state (like XLNet)
-                - 'first' => take the first token hidden state (like Bert)
-                - 'mean' => take the mean of all tokens hidden states
-                - 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)
-                - 'attn' => Not implemented now, use multi-head attention
-        summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`):
-            Argument used when doing sequence summary. Used in for the multiple choice head in
-            :class:`~transformers.ElectraForMultipleChoice`.
-            Add a projection after the vector extraction
-        summary_activation (:obj:`string` or :obj:`None`, optional):
-            Argument used when doing sequence summary. Used in for the multiple choice head in
-            :class:`~transformers.ElectraForMultipleChoice`.
-            'gelu' => add a gelu activation to the output, Other => no activation.
-        summary_last_dropout (:obj:`float`, optional, defaults to 0.0):
-            Argument used when doing sequence summary. Used in for the multiple choice head in
-            :class:`~transformers.ElectraForMultipleChoice`.
-            Add a dropout after the projection and activation
+            Has to be one of the following options:
 
-    Example::
+                - :obj:`"last"`: Take the last token hidden state (like XLNet).
+                - :obj:`"first"`: Take the first token hidden state (like BERT).
+                - :obj:`"mean"`: Take the mean of all tokens hidden states.
+                - :obj:`"cls_index"`: Supply a Tensor of classification token position (like GPT/GPT-2).
+                - :obj:`"attn"`: Not implemented now, use multi-head attention.
+        summary_use_proj (:obj:`bool`, `optional`, defaults to :obj:`True`):
+            Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
+
+            Whether or not to add a projection after the vector extraction.
+        summary_activation (:obj:`str`, `optional`):
+            Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
+
+            Pass :obj:`"gelu"` for a gelu activation to the output, any other value will result in no activation.
+        summary_last_dropout (:obj:`float`, `optional`, defaults to 0.0):
+            Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
+
+            The dropout ratio to be used after the projection and activation.
+
+    Examples::
 
         >>> from transformers import ElectraModel, ElectraConfig
 
diff --git a/src/transformers/configuration_encoder_decoder.py b/src/transformers/configuration_encoder_decoder.py
index 7ce3bd327c..eff92bf245 100644
--- a/src/transformers/configuration_encoder_decoder.py
+++ b/src/transformers/configuration_encoder_decoder.py
@@ -25,22 +25,24 @@ logger = logging.get_logger(__name__)
 
 class EncoderDecoderConfig(PretrainedConfig):
     r"""
-    :class:`~transformers.EncoderDecoderConfig` is the configuration class to store the configuration of a `EncoderDecoderModel`.
+    :class:`~transformers.EncoderDecoderConfig` is the configuration class to store the configuration of a
+    :class:`~transformers.EncoderDecoderModel`. It is used to instantiate an Encoder Decoder model according to the
+    specified arguments, defining the encoder and decoder configs.
 
-    It is used to instantiate an Encoder Decoder model according to the specified arguments, defining the encoder and decoder configs.
-    Configuration objects inherit from  :class:`~transformers.PretrainedConfig`
-    and can be used to control the model outputs.
-    See the documentation for :class:`~transformers.PretrainedConfig` for more information.
+    Configuration objects inherit from  :class:`~transformers.PretrainedConfig` and can be used
+    to control the model outputs. Read the documentation from  :class:`~transformers.PretrainedConfig`
+    for more information.
 
     Args:
         kwargs (`optional`):
-            Remaining dictionary of keyword arguments. Notably:
-                encoder (:class:`PretrainedConfig`, optional, defaults to `None`):
-                    An instance of a configuration object that defines the encoder config.
-                decoder (:class:`PretrainedConfig`, optional, defaults to `None`):
-                    An instance of a configuration object that defines the decoder config.
+            Dictionary of keyword arguments. Notably:
 
-    Example::
+                - **encoder** (:class:`~transformers.PretrainedConfig`, `optional`) -- An instance of a configuration
+                  object that defines the encoder config.
+                - **decoder** (:class:`~transformers.PretrainedConfig`, `optional`) -- An instance of a configuration
+                  object that defines the decoder config.
+
+    Examples::
 
         >>> from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel
 
diff --git a/src/transformers/configuration_flaubert.py b/src/transformers/configuration_flaubert.py
index 05f9961d4e..711b07875a 100644
--- a/src/transformers/configuration_flaubert.py
+++ b/src/transformers/configuration_flaubert.py
@@ -30,11 +30,9 @@ FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
 
 class FlaubertConfig(XLMConfig):
     """
-    Configuration class to store the configuration of a `FlaubertModel`.
-    This is the configuration class to store the configuration of a :class:`~transformers.XLMModel`.
-    It is used to instantiate an XLM model according to the specified arguments, defining the model
-    architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
-    the `xlm-mlm-en-2048 <https://huggingface.co/xlm-mlm-en-2048>`__ architecture.
+    This is the configuration class to store the configuration of a :class:`~transformers.FlaubertModel` or a
+    :class:`~transformers.TFFlaubertModel`. It is used to instantiate a FlauBERT model according to the specified
+    arguments, defining the model architecture.
 
     Configuration objects inherit from  :class:`~transformers.PretrainedConfig` and can be used
     to control the model outputs. Read the documentation from  :class:`~transformers.PretrainedConfig`
@@ -47,95 +45,95 @@ class FlaubertConfig(XLMConfig):
         layerdrop (:obj:`float`, `optional`, defaults to 0.0):
             Probability to drop layers during training (Fan et al., Reducing Transformer Depth on Demand
             with Structured Dropout. ICLR 2020)
-        vocab_size (:obj:`int`, optional, defaults to 30145):
-            Vocabulary size of the Flaubert model. Defines the different tokens that
-            can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.FlaubertModel`.
-        emb_dim (:obj:`int`, optional, defaults to 2048):
+        vocab_size (:obj:`int`, `optional`, defaults to 30145):
+            Vocabulary size of the FlauBERT model. Defines the number of different tokens that can be represented by the
+            :obj:`inputs_ids` passed when calling :class:`~transformers.FlaubertModel` or
+            :class:`~transformers.TFFlaubertModel`.
+        emb_dim (:obj:`int`, `optional`, defaults to 2048):
             Dimensionality of the encoder layers and the pooler layer.
-        n_layer (:obj:`int`, optional, defaults to 12):
+        n_layer (:obj:`int`, `optional`, defaults to 12):
             Number of hidden layers in the Transformer encoder.
-        n_head (:obj:`int`, optional, defaults to 16):
+        n_head (:obj:`int`, `optional`, defaults to 16):
             Number of attention heads for each attention layer in the Transformer encoder.
-        dropout (:obj:`float`, optional, defaults to 0.1):
+        dropout (:obj:`float`, `optional`, defaults to 0.1):
             The dropout probability for all fully connected
             layers in the embeddings, encoder, and pooler.
-        attention_dropout (:obj:`float`, optional, defaults to 0.1):
+        attention_dropout (:obj:`float`, `optional`, defaults to 0.1):
             The dropout probability for the attention mechanism
-        gelu_activation (:obj:`boolean`, optional, defaults to :obj:`True`):
-            The non-linear activation function (function or string) in the
-            encoder and pooler. If set to `True`, "gelu" will be used instead of "relu".
-        sinusoidal_embeddings (:obj:`boolean`, optional, defaults to :obj:`False`):
-            Whether to use sinusoidal positional embeddings instead of absolute positional embeddings.
-        causal (:obj:`boolean`, optional, defaults to :obj:`False`):
-            Set this to `True` for the model to behave in a causal manner.
+        gelu_activation (:obj:`bool`, `optional`, defaults to :obj:`True`):
+            Whether or not to use a `gelu` actibation instead of `relu`.
+        sinusoidal_embeddings (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether or not to use sinusoidal positional embeddings instead of absolute positional embeddings.
+        causal (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether or not the model shoul behave in a causal manner.
             Causal models use a triangular attention mask in order to only attend to the left-side context instead
             if a bidirectional context.
-        asm (:obj:`boolean`, optional, defaults to :obj:`False`):
-            Whether to use an adaptive log softmax projection layer instead of a linear layer for the prediction
+        asm (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether or not to use an adaptive log softmax projection layer instead of a linear layer for the prediction
             layer.
-        n_langs (:obj:`int`, optional, defaults to 1):
+        n_langs (:obj:`int`, `optional`, defaults to 1):
             The number of languages the model handles. Set to 1 for monolingual models.
-        use_lang_emb (:obj:`boolean`, optional, defaults to :obj:`True`)
+        use_lang_emb (:obj:`bool`, `optional`, defaults to :obj:`True`)
             Whether to use language embeddings. Some models use additional language embeddings, see
             `the multilingual models page <http://huggingface.co/transformers/multilingual.html#xlm-language-embeddings>`__
             for information on how to use them.
-        max_position_embeddings (:obj:`int`, optional, defaults to 512):
+        max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
             The maximum sequence length that this model might
             ever be used with. Typically set this to something large just in case
             (e.g., 512 or 1024 or 2048).
-        embed_init_std (:obj:`float`, optional, defaults to 2048^-0.5):
+        embed_init_std (:obj:`float`, `optional`, defaults to 2048^-0.5):
             The standard deviation of the truncated_normal_initializer for
             initializing the embedding matrices.
-        init_std (:obj:`int`, optional, defaults to 50257):
+        init_std (:obj:`int`, `optional`, defaults to 50257):
             The standard deviation of the truncated_normal_initializer for
             initializing all weight matrices except the embedding matrices.
-        layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):
+        layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
             The epsilon used by the layer normalization layers.
-        bos_index (:obj:`int`, optional, defaults to 0):
+        bos_index (:obj:`int`, `optional`, defaults to 0):
             The index of the beginning of sentence token in the vocabulary.
-        eos_index (:obj:`int`, optional, defaults to 1):
+        eos_index (:obj:`int`, `optional`, defaults to 1):
             The index of the end of sentence token in the vocabulary.
-        pad_index (:obj:`int`, optional, defaults to 2):
+        pad_index (:obj:`int`, `optional`, defaults to 2):
             The index of the padding token in the vocabulary.
-        unk_index (:obj:`int`, optional, defaults to 3):
+        unk_index (:obj:`int`, `optional`, defaults to 3):
             The index of the unknown token in the vocabulary.
-        mask_index (:obj:`int`, optional, defaults to 5):
+        mask_index (:obj:`int`, `optional`, defaults to 5):
             The index of the masking token in the vocabulary.
-        is_encoder(:obj:`boolean`, optional, defaults to :obj:`True`):
-            Whether the initialized model should be a transformer encoder or decoder as seen in Vaswani et al.
-        summary_type (:obj:`string`, optional, defaults to "first"):
-            Argument used when doing sequence summary. Used in for the multiple choice head in
-            :class:`~transformers.XLMForSequenceClassification`.
-            Is one of the following options:
+        is_encoder(:obj:`bool`, `optional`, defaults to :obj:`True`):
+            Whether or not the initialized model should be a transformer encoder or decoder as seen in Vaswani et al.
+        summary_type (:obj:`string`, `optional`, defaults to "first"):
+            Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
 
-            - 'last' => take the last token hidden state (like XLNet)
-            - 'first' => take the first token hidden state (like Bert)
-            - 'mean' => take the mean of all tokens hidden states
-            - 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)
-            - 'attn' => Not implemented now, use multi-head attention
-        summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`):
-            Argument used when doing sequence summary. Used in for the multiple choice head in
-            :class:`~transformers.XLMForSequenceClassification`.
-            Add a projection after the vector extraction
-        summary_activation (:obj:`string` or :obj:`None`, optional):
-            Argument used when doing sequence summary. Used in for the multiple choice head in
-            :class:`~transformers.XLMForSequenceClassification`.
-            'tanh' => add a tanh activation to the output, Other => no activation.
-        summary_proj_to_labels (:obj:`boolean`, optional, defaults to :obj:`True`):
-            Argument used when doing sequence summary. Used in for the multiple choice head in
-            :class:`~transformers.XLMForSequenceClassification`.
-            If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False.
-        summary_first_dropout (:obj:`float`, optional, defaults to 0.1):
-            Argument used when doing sequence summary. Used in for the multiple choice head in
-            :class:`~transformers.XLMForSequenceClassification`.
-            Add a dropout before the projection and activation
-        start_n_top (:obj:`int`, optional, defaults to 5):
-            Used in the SQuAD evaluation script for XLM and XLNet.
-        end_n_top (:obj:`int`, optional, defaults to 5):
-            Used in the SQuAD evaluation script for XLM and XLNet.
-        mask_token_id (:obj:`int`, optional, defaults to 0):
+            Has to be one of the following options:
+
+                - :obj:`"last"`: Take the last token hidden state (like XLNet).
+                - :obj:`"first"`: Take the first token hidden state (like BERT).
+                - :obj:`"mean"`: Take the mean of all tokens hidden states.
+                - :obj:`"cls_index"`: Supply a Tensor of classification token position (like GPT/GPT-2).
+                - :obj:`"attn"`: Not implemented now, use multi-head attention.
+        summary_use_proj (:obj:`bool`, `optional`, defaults to :obj:`True`):
+            Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
+
+            Whether or not to add a projection after the vector extraction.
+        summary_activation (:obj:`str`, `optional`):
+            Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
+
+            Pass :obj:`"tanh"` for a tanh activation to the output, any other value will result in no activation.
+        summary_proj_to_labels (:obj:`bool`, `optional`, defaults to :obj:`True`):
+            Used in the sequence classification and multiple choice models.
+
+            Whether the projection outputs should have :obj:`config.num_labels` or :obj:`config.hidden_size` classes.
+        summary_first_dropout (:obj:`float`, `optional`, defaults to 0.1):
+            Used in the sequence classification and multiple choice models.
+
+            The dropout ratio to be used after the projection and activation.
+        start_n_top (:obj:`int`, `optional`, defaults to 5):
+            Used in the SQuAD evaluation script.
+        end_n_top (:obj:`int`, `optional`, defaults to 5):
+            Used in the SQuAD evaluation script.
+        mask_token_id (:obj:`int`, `optional`, defaults to 0):
             Model agnostic parameter to identify masked tokens when generating text in an MLM context.
-        lang_id (:obj:`int`, optional, defaults to 1):
+        lang_id (:obj:`int`, `optional`, defaults to 1):
             The ID of the language used by the model. This parameter is used when generating
             text in a given language.
     """
diff --git a/src/transformers/configuration_fsmt.py b/src/transformers/configuration_fsmt.py
index a9dd0efe43..747f47dd52 100644
--- a/src/transformers/configuration_fsmt.py
+++ b/src/transformers/configuration_fsmt.py
@@ -18,7 +18,6 @@
 import copy
 
 from .configuration_utils import PretrainedConfig
-from .file_utils import add_start_docstrings_to_callable
 from .utils import logging
 
 
@@ -27,33 +26,54 @@ logger = logging.get_logger(__name__)
 FSMT_PRETRAINED_CONFIG_ARCHIVE_MAP = {}
 
 
-FSMT_CONFIG_ARGS_DOC = r"""
+class DecoderConfig(PretrainedConfig):
+    r"""
+    Configuration class for FSMT's decoder specific things.
+    note: this is a private helper class
+    """
+    model_type = "fsmt_decoder"
+
+    def __init__(self, vocab_size=0, bos_token_id=0):
+        super().__init__()
+        self.vocab_size = vocab_size
+        self.bos_token_id = bos_token_id
+
+
+class FSMTConfig(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of a :class:`~transformers.FSMTModel`. It is used to
+    instantiate a FSMT model according to the specified arguments, defining the model architecture.
+
+    Configuration objects inherit from  :class:`~transformers.PretrainedConfig` and can be used
+    to control the model outputs. Read the documentation from  :class:`~transformers.PretrainedConfig`
+    for more information.
+
     Args:
         langs (:obj:`List[str]`):
-            source language, target_language (e.g. ['en', 'ru'])
+            A list with source language and target_language (e.g., ['en', 'ru']).
         src_vocab_size (:obj:`int`):
-            defines the different tokens that can be represented by `inputs_ids` passed to the forward
-            method in the encoder.
+            Vocabulary size of the encoder. Defines the number of different tokens that can be represented by the
+            :obj:`inputs_ids` passed to the forward method in the encoder.
         tgt_vocab_size (:obj:`int`):
-            defines the different tokens that can be represented by `inputs_ids` passed to the forward
-            method in the decoder.
+            Vocabulary size of the decoder. Defines the number of different tokens that can be represented by the
+            :obj:`inputs_ids` passed to the forward method in the decoder.
         d_model (:obj:`int`, `optional`, defaults to 1024):
             Dimensionality of the layers and the pooler layer.
         encoder_layers (:obj:`int`, `optional`, defaults to 12):
-            Number of encoder layers, 16 for pegasus, 6 for bart-base and marian
+            Number of encoder layers.
         decoder_layers (:obj:`int`, `optional`, defaults to 12):
-            Number of decoder layers, 16 for pegasus, 6 for bart-base and marian
+            Number of decoder layers.
         encoder_attention_heads (:obj:`int`, `optional`, defaults to 16):
             Number of attention heads for each attention layer in the Transformer encoder.
         decoder_attention_heads (:obj:`int`, `optional`, defaults to 16):
             Number of attention heads for each attention layer in the Transformer decoder.
         decoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096):
-            Dimensionality of the "intermediate" (i.e., feed-forward) layer in decoder.
+            Dimensionality of the "intermediate" (often named feed-forward) layer in decoder.
         encoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096):
-            Dimensionality of the "intermediate" (i.e., feed-forward) layer in decoder.
-        activation_function (:obj:`str` or :obj:`function`, `optional`, defaults to "relu"):
+            Dimensionality of the "intermediate" (often named feed-forward) layer in decoder.
+        activation_function (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"relu"`):
             The non-linear activation function (function or string) in the encoder and pooler.
-            If string, "gelu", "relu", "swish" and "gelu_new" are supported.
+            If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
         dropout (:obj:`float`, `optional`, defaults to 0.1):
             The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
         attention_dropout (:obj:`float`, `optional`, defaults to 0.0):
@@ -74,7 +94,7 @@ FSMT_CONFIG_ARGS_DOC = r"""
         eos_token_id (:obj:`int`, `optional`, defaults to 2)
             End of stream token id.
         decoder_start_token_id (:obj:`int`, `optional`):
-            This model starts decoding with `eos_token_id`
+            This model starts decoding with :obj:`eos_token_id`
         encoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0):
             Google "layerdrop arxiv", as its not explainable in one line.
         decoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0):
@@ -92,26 +112,14 @@ FSMT_CONFIG_ARGS_DOC = r"""
         early_stopping (:obj:`bool`, `optional`, defaults to :obj:`False`)
             Flag that will be used by default in the :obj:`generate` method of the model. Whether to stop
             the beam search when at least ``num_beams`` sentences are finished per batch or not.
-"""
 
+        Examples::
 
-class DecoderConfig(PretrainedConfig):
-    r"""
-    Configuration class for FSMT's decoder specific things.
-    note: this is a private helper class
-    """
-    model_type = "fsmt_decoder"
+            >>> from transformers import FSMTConfig, FSMTModel
 
-    def __init__(self, vocab_size=0, bos_token_id=0):
-        super().__init__()
-        self.vocab_size = vocab_size
-        self.bos_token_id = bos_token_id
+            >>> config = FSMTConfig.from_pretrained('facebook/wmt19-en-ru')
+            >>> model = FSMTModel(config)
 
-
-@add_start_docstrings_to_callable(FSMT_CONFIG_ARGS_DOC)
-class FSMTConfig(PretrainedConfig):
-    r"""
-    Configuration class for FSMT.
     """
     model_type = "fsmt"
 
@@ -149,17 +157,6 @@ class FSMTConfig(PretrainedConfig):
         early_stopping=False,
         **common_kwargs
     ):
-        r"""
-        :class:`~transformers.FSMTConfig` is the configuration class for `FSMTModel`.
-
-        Examples::
-
-            >>> from transformers import FSMTConfig, FSMTModel
-
-            >>> config = FSMTConfig.from_pretrained('facebook/wmt19-en-ru')
-            >>> model = FSMTModel(config)
-
-        """
         if "hidden_size" in common_kwargs:
             raise ValueError("hidden size is called d_model")
         super().__init__(
diff --git a/src/transformers/configuration_funnel.py b/src/transformers/configuration_funnel.py
index 857371e549..e48e24d720 100644
--- a/src/transformers/configuration_funnel.py
+++ b/src/transformers/configuration_funnel.py
@@ -36,20 +36,21 @@ FUNNEL_PRETRAINED_CONFIG_ARCHIVE_MAP = {
 
 class FunnelConfig(PretrainedConfig):
     r"""
-    This is the configuration class to store the configuration of a :class:`~transformers.FunnelModel`.
-    It is used to instantiate an Funnel Transformer model according to the specified arguments, defining the model
-    architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
-    the Funnel Transformer `funnel-transformer/small <https://huggingface.co/funnel-transformer/small>`__ architecture.
+    This is the configuration class to store the configuration of a :class:`~transformers.FunnelModel` or a
+    :class:`~transformers.TFBertModel`. It is used to instantiate a Funnel Transformer model according to the specified
+    arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
+    configuration to that of the Funnel Transformer `funnel-transformer/small
+    <https://huggingface.co/funnel-transformer/small>`__ architecture.
 
     Configuration objects inherit from  :class:`~transformers.PretrainedConfig` and can be used
     to control the model outputs. Read the documentation from  :class:`~transformers.PretrainedConfig`
     for more information.
 
-
     Args:
         vocab_size (:obj:`int`, `optional`, defaults to 30522):
-            Vocabulary size of the Funnel transformer. Defines the different tokens that
-            can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.FunnelModel`.
+            Vocabulary size of the Funnel transformer. Defines the number of different tokens that can be represented
+            by the :obj:`inputs_ids` passed when calling :class:`~transformers.FunnelModel` or
+            :class:`~transformers.TFFunnelModel`.
         block_sizes (:obj:`List[int]`, `optional`, defaults to :obj:`[4, 4, 4]`):
             The sizes of the blocks used in the model.
         block_repeats (:obj:`List[int]`, `optional`):
@@ -77,7 +78,8 @@ class FunnelConfig(PretrainedConfig):
             The maximum sequence length that this model might ever be used with.
             Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
         type_vocab_size (:obj:`int`, `optional`, defaults to 3):
-            The vocabulary size of the `token_type_ids` passed into :class:`~transformers.FunnelModel`.
+            The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.FunnelModel` or
+            :class:`~transformers.TFFunnelModel`.
         initializer_range (:obj:`float`, `optional`, defaults to 0.1):
             The standard deviation of the `uniform initializer` for initializing all weight matrices in attention
             layers.
diff --git a/src/transformers/configuration_gpt2.py b/src/transformers/configuration_gpt2.py
index af90ef959b..87588811d9 100644
--- a/src/transformers/configuration_gpt2.py
+++ b/src/transformers/configuration_gpt2.py
@@ -32,10 +32,10 @@ GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP = {
 
 class GPT2Config(PretrainedConfig):
     """
-    This is the configuration class to store the configuration of a :class:`~transformers.GPT2Model`.
-    It is used to instantiate an GPT-2 model according to the specified arguments, defining the model
-    architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
-    the GPT-2 `small <https://huggingface.co/gpt2>`__ architecture.
+    This is the configuration class to store the configuration of a :class:`~transformers.GPT2Model` or a
+    :class:`~transformers.TFGPT2Model`. It is used to instantiate a GPT-2 model according to the specified
+    arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
+    configuration to that of the GPT-2 `small <https://huggingface.co/gpt2>`__ architecture.
 
     Configuration objects inherit from  :class:`~transformers.PretrainedConfig` and can be used
     to control the model outputs. Read the documentation from  :class:`~transformers.PretrainedConfig`
@@ -43,60 +43,66 @@ class GPT2Config(PretrainedConfig):
 
 
     Args:
-        vocab_size (:obj:`int`, optional, defaults to 50257):
-            Vocabulary size of the GPT-2 model. Defines the different tokens that
-            can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.GPT2Model`.
-        n_positions (:obj:`int`, optional, defaults to 1024):
+        vocab_size (:obj:`int`, `optional`, defaults to 50257):
+            Vocabulary size of the GPT-2 model. Defines the number of different tokens that can be represented by the
+            :obj:`inputs_ids` passed when calling :class:`~transformers.GPT2Model` or
+            :class:`~transformers.TFGPT2Model`.
+        n_positions (:obj:`int`, `optional`, defaults to 1024):
             The maximum sequence length that this model might ever be used with.
             Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
-        n_ctx (:obj:`int`, optional, defaults to 1024):
+        n_ctx (:obj:`int`, `optional`, defaults to 1024):
             Dimensionality of the causal mask (usually same as n_positions).
-        n_embd (:obj:`int`, optional, defaults to 768):
+        n_embd (:obj:`int`, `optional`, defaults to 768):
             Dimensionality of the embeddings and hidden states.
-        n_layer (:obj:`int`, optional, defaults to 12):
+        n_layer (:obj:`int`, `optional`, defaults to 12):
             Number of hidden layers in the Transformer encoder.
-        n_head (:obj:`int`, optional, defaults to 12):
+        n_head (:obj:`int`, `optional`, defaults to 12):
             Number of attention heads for each attention layer in the Transformer encoder.
-        n_inner (:obj:`int`, optional, defaults to None):
+        n_inner (:obj:`int`, `optional`, defaults to None):
             Dimensionality of the inner feed-forward layers. :obj:`None` will set it to 4 times n_embd
-        activation_function (:obj:`str`, optional, defaults to 'gelu'):
-            Activation function selected in the list ["relu", "swish", "gelu", "tanh", "gelu_new"].
-        resid_pdrop (:obj:`float`, optional, defaults to 0.1):
+        activation_function (:obj:`str`, `optional`, defaults to :obj:`"gelu"`):
+            Activation function, to be selected in the list :obj:`["relu", "swish", "gelu", "tanh", "gelu_new"]`.
+        resid_pdrop (:obj:`float`, `optional`, defaults to 0.1):
             The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
-        embd_pdrop (:obj:`int`, optional, defaults to 0.1):
+        embd_pdrop (:obj:`int`, `optional`, defaults to 0.1):
             The dropout ratio for the embeddings.
-        attn_pdrop (:obj:`float`, optional, defaults to 0.1):
+        attn_pdrop (:obj:`float`, `optional`, defaults to 0.1):
             The dropout ratio for the attention.
-        layer_norm_epsilon (:obj:`float`, optional, defaults to 1e-5):
+        layer_norm_epsilon (:obj:`float`, `optional`, defaults to 1e-5):
             The epsilon to use in the layer normalization layers
-        initializer_range (:obj:`float`, optional, defaults to 0.02):
+        initializer_range (:obj:`float`, `optional`, defaults to 0.02):
             The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
-        summary_type (:obj:`string`, optional, defaults to "cls_index"):
-            Argument used when doing sequence summary. Used in for the multiple choice head in
-            :class:`~transformers.GPT2DoubleHeadsModel`.
-            Is one of the following options:
+        summary_type (:obj:`string`, `optional`, defaults to :obj:`"cls_index"`):
+            Argument used when doing sequence summary, used in the models
+            :class:`~transformers.GPT2DoubleHeadsModel` and :class:`~transformers.TFGPT2DoubleHeadsModel`.
 
-            - 'last' => take the last token hidden state (like XLNet)
-            - 'first' => take the first token hidden state (like Bert)
-            - 'mean' => take the mean of all tokens hidden states
-            - 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)
-            - 'attn' => Not implemented now, use multi-head attention
-        summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`):
+            Has to be one of the following options:
+
+                - :obj:`"last"`: Take the last token hidden state (like XLNet).
+                - :obj:`"first"`: Take the first token hidden state (like BERT).
+                - :obj:`"mean"`: Take the mean of all tokens hidden states.
+                - :obj:`"cls_index"`: Supply a Tensor of classification token position (like GPT/GPT-2).
+                - :obj:`"attn"`: Not implemented now, use multi-head attention.
+        summary_use_proj (:obj:`bool`, `optional`, defaults to :obj:`True`):
+            Argument used when doing sequence summary, used in the models
+            :class:`~transformers.GPT2DoubleHeadsModel` and :class:`~transformers.TFGPT2DoubleHeadsModel`.
+
+            Whether or not to add a projection after the vector extraction.
+        summary_activation (:obj:`str`, `optional`):
             Argument used when doing sequence summary. Used in for the multiple choice head in
             :class:`~transformers.GPT2DoubleHeadsModel`.
-            Add a projection after the vector extraction
-        summary_activation (:obj:`string` or :obj:`None`, optional):
-            Argument used when doing sequence summary. Used in for the multiple choice head in
-            :class:`~transformers.GPT2DoubleHeadsModel`.
-            'tanh' => add a tanh activation to the output, Other => no activation.
-        summary_proj_to_labels (:obj:`boolean`, optional, defaults to :obj:`True`):
-            Argument used when doing sequence summary. Used in for the multiple choice head in
-            :class:`~transformers.GPT2DoubleHeadsModel`.
-            If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False.
-        summary_first_dropout (:obj:`float`, optional, defaults to 0.1):
-            Argument used when doing sequence summary. Used in for the multiple choice head in
-            :class:`~transformers.GPT2DoubleHeadsModel`.
-            Add a dropout before the projection and activation
+
+            Pass :obj:`"tanh"` for a tanh activation to the output, any other value will result in no activation.
+        summary_proj_to_labels (:obj:`bool`, `optional`, defaults to :obj:`True`):
+            Argument used when doing sequence summary, used in the models
+            :class:`~transformers.GPT2DoubleHeadsModel` and :class:`~transformers.TFGPT2DoubleHeadsModel`.
+
+            Whether the projection outputs should have :obj:`config.num_labels` or :obj:`config.hidden_size` classes.
+        summary_first_dropout (:obj:`float`, `optional`, defaults to 0.1):
+            Argument used when doing sequence summary, used in the models
+            :class:`~transformers.GPT2DoubleHeadsModel` and :class:`~transformers.TFGPT2DoubleHeadsModel`.
+
+            The dropout ratio to be used after the projection and activation.
 
     Example::
 
diff --git a/src/transformers/configuration_longformer.py b/src/transformers/configuration_longformer.py
index e69eef440e..2d04ef9f6c 100644
--- a/src/transformers/configuration_longformer.py
+++ b/src/transformers/configuration_longformer.py
@@ -33,6 +33,10 @@ LONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP = {
 
 class LongformerConfig(RobertaConfig):
     r"""
+    This is the configuration class to store the configuration of a :class:`~transformers.LongformerModel` or a
+    :class:`~transformers.TFLongformerModel`. It is used to instantiate a Longformer model according to the specified
+    arguments, defining the model architecture.
+
     This is the configuration class to store the configuration of a :class:`~transformers.LongformerModel`.
     It is used to instantiate an Longformer model according to the specified arguments, defining the model
     architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
@@ -42,8 +46,8 @@ class LongformerConfig(RobertaConfig):
     It reuses the same defaults. Please check the parent class for more information.
 
     Args:
-        attention_window (:obj:`int` or :obj:`List[int]`, optional, defaults to 512):
-            Size of an attention window around each token. If :obj:`int`, use the same size for all layers.
+        attention_window (:obj:`int` or :obj:`List[int]`, `optional`, defaults to 512):
+            Size of an attention window around each token. If an :obj:`int`, use the same size for all layers.
             To specify a different window size for each layer, use a :obj:`List[int]` where
             ``len(attention_window) == num_hidden_layers``.
 
diff --git a/src/transformers/configuration_lxmert.py b/src/transformers/configuration_lxmert.py
index 96993ff2c2..4e5bc8d5b2 100644
--- a/src/transformers/configuration_lxmert.py
+++ b/src/transformers/configuration_lxmert.py
@@ -29,83 +29,91 @@ LXMERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
 
 class LxmertConfig(PretrainedConfig):
     r"""
-    This is the configuration class to store the configuration of a :class:`~transformers.BertModel`.
-    It is used to instantiate an Lxmert model according to the specified arguments, defining the model
-    architecture.
+    This is the configuration class to store the configuration of a :class:`~transformers.LxmertModel` or a
+    :class:`~transformers.TFLxmertModel`. It is used to instantiate a LXMERT model according to the specified
+    arguments, defining the model architecture.
+
+    Configuration objects inherit from  :class:`~transformers.PretrainedConfig` and can be used
+    to control the model outputs. Read the documentation from  :class:`~transformers.PretrainedConfig`
+    for more information.
 
 
     Args:
-        vocab_size (:obj:`int`, optional, defaults to 30522):
-            Vocabulary size of the BERT model. Defines the different tokens that
-            can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.BertModel`.
-        hidden_size (:obj:`int`, optional, defaults to 768):
+        vocab_size (:obj:`int`, `optional`, defaults to 30522):
+            Vocabulary size of the LXMERT model. Defines the number of different tokens that can be represented by the
+            :obj:`inputs_ids` passed when calling :class:`~transformers.LxmertModel` or
+            :class:`~transformers.TFLxmertModel`.
+        hidden_size (:obj:`int`, `optional`, defaults to 768):
             Dimensionality of the encoder layers and the pooler layer.
-        r_layers (:obj:`int`, optional, defaults to 5):
+        r_layers (:obj:`int`, `optional`, defaults to 5):
             Number of hidden layers in the Transformer visual encoder.
-        l_layers (:obj:`int`, optional, defaults to 9):
+        l_layers (:obj:`int`, `optional`, defaults to 9):
             Number of hidden layers in the Transformer language encoder.
-        x_layers (:obj:`int`, optional, defaults to 5):
+        x_layers (:obj:`int`, `optional`, defaults to 5):
             Number of hidden layers in the Transformer cross modality encoder.
-        num_attention_heads (:obj:`int`, optional, defaults to 5):
+        num_attention_heads (:obj:`int`, `optional`, defaults to 5):
             Number of attention heads for each attention layer in the Transformer encoder.
-        intermediate_size (:obj:`int`, optional, defaults to 3072):
-            Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
-        hidden_act (:obj:`str` or :obj:`function`, optional, defaults to "gelu"):
+        intermediate_size (:obj:`int`, `optional`, defaults to 3072):
+            Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
+        hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
             The non-linear activation function (function or string) in the encoder and pooler.
-            If string, "gelu", "relu", "swish" and "gelu_new" are supported.
-        hidden_dropout_prob (:obj:`float`, optional, defaults to 0.1):
+            If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
+        hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
             The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
-        attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):
+        attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
             The dropout ratio for the attention probabilities.
-        max_position_embeddings (:obj:`int`, optional, defaults to 512):
+        max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
             The maximum sequence length that this model might ever be used with.
             Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
-        type_vocab_size (:obj:`int`, optional, defaults to 2):
+        type_vocab_size (:obj:`int`, `optional`, defaults to 2):
             The vocabulary size of the `token_type_ids` passed into :class:`~transformers.BertModel`.
-        initializer_range (:obj:`float`, optional, defaults to 0.02):
+        initializer_range (:obj:`float`, `optional`, defaults to 0.02):
             The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
-        layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):
+        layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
             The epsilon used by the layer normalization layers.
-        visual_feat_dim (:obj:`int`, optional, defaults to 2048):
+        visual_feat_dim (:obj:`int`, `optional`, defaults to 2048):
             This represents the last dimension of the pooled-object features used as input for the model,
             representing the size of each object feature itself.
-        visual_pos_dim (:obj:`int`, optional, defaults to 4):
+        visual_pos_dim (:obj:`int`, `optional`, defaults to 4):
             This represents the number of spacial features that are mixed into the visual features.
             The default is set to 4 because most commonly this will represent the location of a bounding box.
-            i.e. (x, y, width, height)
-        visual_loss_normalizer (:obj:`float`, optional, defaults to 1/15):
+            i.e., (x, y, width, height)
+        visual_loss_normalizer (:obj:`float`, `optional`, defaults to 1/15):
             This represents the scaling factor in which each visual loss is multiplied by if during pretraining,
             one decided to train with multiple vision-based loss objectives.
-        num_qa_labels (:obj:`int`, optional, defaults to 9500):
-            This represents the total number of different question answering (QA) labels there are. If using more than one dataset with QA,
-            the user will need to account for the total number of labels that all of the datasets have in total.
-        num_object_labels (:obj:`int`, optional, defaults to 1600):
-            This represents the total number of semantically unique objects that lxmert will be able to classify a pooled-object feature
-            as belonging too.
-        num_attr_labels (:obj:`int`, optional, defaults to 400):
-            This represents the total number of semantically unique attributes that lxmert will be able to classify a pooled-object feature
-            as possessing.
-        task_matched (:obj:`bool`, optional, defaults to :obj:`True`):
-            This task is used for sentence-image matching. If the sentence correctly describes the image the label will be 1.
-            If the sentence does not correctly describe the image, the label will be 0.
-        task_mask_lm (:obj:`bool`, optional, defaults to :obj:`True`):
-            This task is the defacto masked langauge modeling used in pretraining models such as BERT.
-        task_obj_predict (:obj:`bool`, optional, defaults to :obj:`True`):
-            This task is set to true if the user would like to perform one of the following loss objectives:
-            object predicition, atrribute predicition, feature regression
-        task_qa (:obj:`bool`, optional, defaults to :obj:`True`):
-            This task specifies whether or not Lxmert will calculate the question-asnwering loss objective
-        visual_obj_loss (:obj:`bool`, optional, defaults to :obj:`True`):
-            This task specifies whether or not Lxmert will calculate the object-prediction loss objective
-        visual_attr_loss (:obj:`bool`, optional, defaults to :obj:`True`):
-            This task specifies whether or not Lxmert will calculate the attribute-prediction loss objective
-        visual_feat_loss (:obj:`bool`, optional, defaults to :obj:`True`):
-            This task specifies whether or not Lxmert will calculate the feature-regression loss objective
-        output_attentions (:obj:`bool`, optional, defaults to :obj:`False`):
-                if True, the vision, langauge, and cross-modality layers will be returned
-        output_hidden_states (:obj:`bool`, optional, defaults to :obj:`False`):
-                if True, final cross-modality hidden states for language and vision features will be returned
-
+        num_qa_labels (:obj:`int`, `optional`, defaults to 9500):
+            This represents the total number of different question answering (QA) labels there are. If using more than
+            one dataset with QA, the user will need to account for the total number of labels that all of the datasets
+            have in total.
+        num_object_labels (:obj:`int`, `optional`, defaults to 1600):
+            This represents the total number of semantically unique objects that lxmert will be able to classify a
+            pooled-object feature as belonging too.
+        num_attr_labels (:obj:`int`, `optional`, defaults to 400):
+            This represents the total number of semantically unique attributes that lxmert will be able to classify a
+            pooled-object feature as possessing.
+        task_matched (:obj:`bool`, `optional`, defaults to :obj:`True`):
+            This task is used for sentence-image matching. If the sentence correctly describes the image the label
+            will be 1. If the sentence does not correctly describe the image, the label will be 0.
+        task_mask_lm (:obj:`bool`, `optional`, defaults to :obj:`True`):
+            Whether or not to add masked language modeling (as used in pretraining models such as BERT) to the loss
+            objective.
+        task_obj_predict (:obj:`bool`, `optional`, defaults to :obj:`True`):
+            Whether or not to add object predicition, attribute predicition and feature regression to the loss
+            objective.
+        task_qa (:obj:`bool`, `optional`, defaults to :obj:`True`):
+            Whether or not to add the question-asnwering loss to the objective
+        visual_obj_loss (:obj:`bool`, `optional`, defaults to :obj:`True`):
+            Whether or not to calculate the object-prediction loss objective
+        visual_attr_loss (:obj:`bool`, `optional`, defaults to :obj:`True`):
+            Whether or not to calculate the attribute-prediction loss objective
+        visual_feat_loss (:obj:`bool`, `optional`, defaults to :obj:`True`):
+            Whether or not to calculate the feature-regression loss objective
+        output_attentions (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether or not the model should return the attentions from the vision, langauge, and cross-modality
+            layers should be returned.
+        output_hidden_states (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether or not the model should return the hidden states from the vision, langauge, and cross-modality
+            layers should be returned.
     """
 
     model_type = "lxmert"
diff --git a/src/transformers/configuration_mmbt.py b/src/transformers/configuration_mmbt.py
index dc3d3b78dd..d650ada16b 100644
--- a/src/transformers/configuration_mmbt.py
+++ b/src/transformers/configuration_mmbt.py
@@ -22,15 +22,16 @@ logger = logging.get_logger(__name__)
 
 
 class MMBTConfig(object):
-    """Configuration class to store the configuration of a `MMBT Model`.
+    """
+    This is the configuration class to store the configuration of a :class:`~transformers.MMBTModel`. It is used to
+    instantiate a MMBT model according to the specified arguments, defining the model architecture.
 
     Args:
-        config (:obj:`~transformers.PreTrainedConfig`):
-            Config of the underlying Transformer models. Its values are
-            copied over to use a single config.
-        num_labels (:obj:`int` or :obj:`None`, optional, defaults to `None`):
+        config (:class:`~transformers.PreTrainedConfig`):
+            Config of the underlying Transformer models. Its values are copied over to use a single config.
+        num_labels (:obj:`int`, `optional`):
             Size of final Linear layer for classification.
-        modal_hidden_size (:obj:`int`, optional, defautls to 2048):
+        modal_hidden_size (:obj:`int`, `optional`, defautls to 2048):
             Embedding dimension of the non-text modality encoder.
     """
 
diff --git a/src/transformers/configuration_mobilebert.py b/src/transformers/configuration_mobilebert.py
index 8fd965a30d..93b1243ce0 100644
--- a/src/transformers/configuration_mobilebert.py
+++ b/src/transformers/configuration_mobilebert.py
@@ -25,9 +25,9 @@ MOBILEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
 
 class MobileBertConfig(PretrainedConfig):
     r"""
-    This is the configuration class to store the configuration of a :class:`~transformers.MobileBertModel`.
-    It is used to instantiate a MobileBERT model according to the specified arguments, defining the model
-    architecture.
+    This is the configuration class to store the configuration of a :class:`~transformers.MobileBertModel` or a
+    :class:`~transformers.TFMobileBertModel`. It is used to instantiate a MobileBERT model according to the specified
+    arguments, defining the model architecture.
 
     Configuration objects inherit from  :class:`~transformers.PretrainedConfig` and can be used
     to control the model outputs. Read the documentation from  :class:`~transformers.PretrainedConfig`
@@ -35,54 +35,56 @@ class MobileBertConfig(PretrainedConfig):
 
 
     Args:
-        vocab_size (:obj:`int`, optional, defaults to 30522):
-            Vocabulary size of the MobileBERT model. Defines the different tokens that
-            can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.MobileBertModel`.
-        hidden_size (:obj:`int`, optional, defaults to 512):
+        vocab_size (:obj:`int`, `optional`, defaults to 30522):
+            Vocabulary size of the MobileBERT model. Defines the number of different tokens that can be represented by
+            the :obj:`inputs_ids` passed when calling :class:`~transformers.MobileBertModel` or
+            :class:`~transformers.TFMobileBertModel`.
+        hidden_size (:obj:`int`, `optional`, defaults to 512):
             Dimensionality of the encoder layers and the pooler layer.
-        num_hidden_layers (:obj:`int`, optional, defaults to 24):
+        num_hidden_layers (:obj:`int`, `optional`, defaults to 24):
             Number of hidden layers in the Transformer encoder.
-        num_attention_heads (:obj:`int`, optional, defaults to 4):
+        num_attention_heads (:obj:`int`, `optional`, defaults to 4):
             Number of attention heads for each attention layer in the Transformer encoder.
-        intermediate_size (:obj:`int`, optional, defaults to 512):
-            Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
-        hidden_act (:obj:`str` or :obj:`function`, optional, defaults to "relu"):
+        intermediate_size (:obj:`int`, `optional`, defaults to 512):
+            Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
+        hidden_act (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"relu"`):
             The non-linear activation function (function or string) in the encoder and pooler.
-            If string, "gelu", "relu", "swish" and "gelu_new" are supported.
-        hidden_dropout_prob (:obj:`float`, optional, defaults to 0.0):
+            If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
+        hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.0):
             The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
-        attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):
+        attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
             The dropout ratio for the attention probabilities.
-        max_position_embeddings (:obj:`int`, optional, defaults to 512):
+        max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
             The maximum sequence length that this model might ever be used with.
             Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
-        type_vocab_size (:obj:`int`, optional, defaults to 2):
-            The vocabulary size of the `token_type_ids` passed into :class:`~transformers.MobileBertModel`.
-        initializer_range (:obj:`float`, optional, defaults to 0.02):
+        type_vocab_size (:obj:`int`, `optional`, defaults to 2):
+            The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.MobileBertModel`
+            or :class:`~transformers.TFMobileBertModel`.
+        initializer_range (:obj:`float`, `optional`, defaults to 0.02):
             The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
-        layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):
+        layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
             The epsilon used by the layer normalization layers.
 
-        pad_token_id (:obj:`int`, optional, defaults to 0):
+        pad_token_id (:obj:`int`, `optional`, defaults to 0):
             The ID of the token in the word embedding to use as padding.
-        embedding_size (:obj:`int`, optional, defaults to 128):
+        embedding_size (:obj:`int`, `optional`, defaults to 128):
             The dimension of the word embedding vectors.
-        trigram_input (:obj:`bool`, optional, defaults to :obj:`True`):
+        trigram_input (:obj:`bool`, `optional`, defaults to :obj:`True`):
             Use a convolution of trigram as input.
-        use_bottleneck (:obj:`bool`, optional, defaults to :obj:`True`):
+        use_bottleneck (:obj:`bool`, `optional`, defaults to :obj:`True`):
             Whether to use bottleneck in BERT.
-        intra_bottleneck_size (:obj:`int`, optional, defaults to 128):
+        intra_bottleneck_size (:obj:`int`, `optional`, defaults to 128):
             Size of bottleneck layer output.
-        use_bottleneck_attention (:obj:`bool`, optional, defaults to :obj:`False`):
+        use_bottleneck_attention (:obj:`bool`, `optional`, defaults to :obj:`False`):
             Whether to use attention inputs from the bottleneck transformation.
-        key_query_shared_bottleneck (:obj:`bool`, optional, defaults to :obj:`True`):
+        key_query_shared_bottleneck (:obj:`bool`, `optional`, defaults to :obj:`True`):
             Whether to use the same linear transformation for query&key in the bottleneck.
-        num_feedforward_networks (:obj:`int`, optional, defaults to 4):
+        num_feedforward_networks (:obj:`int`, `optional`, defaults to 4):
             Number of FFNs in a block.
-        normalization_type (:obj:`str`, optional, defaults to "no_norm"):
-            The normalization type in BERT.
+        normalization_type (:obj:`str`, `optional`, defaults to :obj:`"no_norm"`):
+            The normalization type in MobileBERT.
 
-    Example:
+    Examples:
 
         >>> from transformers import MobileBertModel, MobileBertConfig
 
diff --git a/src/transformers/configuration_openai.py b/src/transformers/configuration_openai.py
index dc0bd04184..87c281a35a 100644
--- a/src/transformers/configuration_openai.py
+++ b/src/transformers/configuration_openai.py
@@ -28,73 +28,79 @@ OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
 
 class OpenAIGPTConfig(PretrainedConfig):
     """
-    This is the configuration class to store the configuration of a :class:`~transformers.OpenAIGPTModel`.
-    It is used to instantiate an GPT model according to the specified arguments, defining the model
-    architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
-    the `GPT <https://huggingface.co/openai-gpt>`__ architecture from OpenAI.
+    This is the configuration class to store the configuration of a :class:`~transformers.OpenAIGPTModel` or a
+    :class:`~transformers.TFOpenAIGPTModel`. It is used to instantiate a GPT model according to the specified
+    arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
+    configuration to that of the `GPT <https://huggingface.co/openai-gpt>`__ architecture from OpenAI.
 
     Configuration objects inherit from  :class:`~transformers.PretrainedConfig` and can be used
     to control the model outputs. Read the documentation from  :class:`~transformers.PretrainedConfig`
     for more information.
 
     Args:
-        vocab_size (:obj:`int`, optional, defaults to 40478):
-            Vocabulary size of the GPT model. Defines the different tokens that
-            can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.CTRLModel`.
-        n_positions (:obj:`int`, optional, defaults to 512):
+        vocab_size (:obj:`int`, `optional`, defaults to 40478):
+            Vocabulary size of the GPT-2 model. Defines the number of different tokens that can be represented by the
+            :obj:`inputs_ids` passed when calling :class:`~transformers.OpenAIGPTModel` or
+            :class:`~transformers.TFOpenAIGPTModel`.
+        n_positions (:obj:`int`, `optional`, defaults to 512):
             The maximum sequence length that this model might ever be used with.
             Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
-        n_ctx (:obj:`int`, optional, defaults to 512):
+        n_ctx (:obj:`int`, `optional`, defaults to 512):
             Dimensionality of the causal mask (usually same as n_positions).
-        n_embd (:obj:`int`, optional, defaults to 768):
+        n_embd (:obj:`int`, `optional`, defaults to 768):
             Dimensionality of the embeddings and hidden states.
-        n_layer (:obj:`int`, optional, defaults to 12):
+        n_layer (:obj:`int`, `optional`, defaults to 12):
             Number of hidden layers in the Transformer encoder.
-        n_head (:obj:`int`, optional, defaults to 12):
+        n_head (:obj:`int`, `optional`, defaults to 12):
             Number of attention heads for each attention layer in the Transformer encoder.
-        afn (:obj:`str` or :obj:`function`, optional, defaults to "gelu"):
+        afn (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
             The non-linear activation function (function or string) in the encoder and pooler.
-            If string, "gelu", "relu", "swish" and "gelu_new" are supported.
-        resid_pdrop (:obj:`float`, optional, defaults to 0.1):
+            If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
+        resid_pdrop (:obj:`float`, `optional`, defaults to 0.1):
             The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
-        embd_pdrop (:obj:`int`, optional, defaults to 0.1):
+        embd_pdrop (:obj:`int`, `optional`, defaults to 0.1):
             The dropout ratio for the embeddings.
-        attn_pdrop (:obj:`float`, optional, defaults to 0.1):
+        attn_pdrop (:obj:`float`, `optional`, defaults to 0.1):
             The dropout ratio for the attention.
-        layer_norm_epsilon (:obj:`float`, optional, defaults to 1e-5):
+        layer_norm_epsilon (:obj:`float`, `optional`, defaults to 1e-5):
             The epsilon to use in the layer normalization layers
-        initializer_range (:obj:`float`, optional, defaults to 0.02):
+        initializer_range (:obj:`float`, `optional`, defaults to 0.02):
             The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
-        predict_special_tokens (:obj:`boolean`, optional, defaults to :obj:`True`):
-            Whether special tokens should be predicted when the model is has a language modeling head.
-        summary_type (:obj:`string`, optional, defaults to "cls_index"):
-            Argument used when doing sequence summary. Used in for the multiple choice head in
-            :class:`~transformers.OpenAIGPTDoubleHeadsModel`.
-            Is one of the following options:
+        predict_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`True`):
+            Whether or not special tokens should be predicted when the model has a language modeling head.
+        summary_type (:obj:`str`, `optional`, defaults to :obj:`"cls_index"`):
+            Argument used when doing sequence summary, used in the models
+            :class:`~transformers.OpenAIGPTDoubleHeadsModel` and :class:`~transformers.OpenAIGPTDoubleHeadsModel`.
 
-            - 'last' => take the last token hidden state (like XLNet)
-            - 'first' => take the first token hidden state (like Bert)
-            - 'mean' => take the mean of all tokens hidden states
-            - 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)
-            - 'attn' => Not implemented now, use multi-head attention
-        summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`):
-            Argument used when doing sequence summary. Used in for the multiple choice head in
-            :class:`~transformers.OpenAIGPTDoubleHeadsModel`.
-            Add a projection after the vector extraction
-        summary_activation (:obj:`string` or :obj:`None`, optional):
-            Argument used when doing sequence summary. Used in for the multiple choice head in
-            :class:`~transformers.OpenAIGPTDoubleHeadsModel`.
-            'tanh' => add a tanh activation to the output, Other => no activation.
-        summary_proj_to_labels (:obj:`boolean`, optional, defaults to :obj:`True`):
-            Argument used when doing sequence summary. Used in for the multiple choice head in
-            :class:`~transformers.OpenAIGPTDoubleHeadsModel`.
-            If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False.
-        summary_first_dropout (:obj:`float`, optional, defaults to 0.1):
-            Argument used when doing sequence summary. Used in for the multiple choice head in
-            :class:`~transformers.OpenAIGPTDoubleHeadsModel`.
-            Add a dropout before the projection and activation
+            Has to be one of the following options:
 
-    Example::
+                - :obj:`"last"`: Take the last token hidden state (like XLNet).
+                - :obj:`"first"`: Take the first token hidden state (like BERT).
+                - :obj:`"mean"`: Take the mean of all tokens hidden states.
+                - :obj:`"cls_index"`: Supply a Tensor of classification token position (like GPT/GPT-2).
+                - :obj:`"attn"`: Not implemented now, use multi-head attention.
+        summary_use_proj (:obj:`bool`, `optional`, defaults to :obj:`True`):
+            Argument used when doing sequence summary, used in the models
+            :class:`~transformers.OpenAIGPTDoubleHeadsModel` and :class:`~transformers.OpenAIGPTDoubleHeadsModel`.
+
+            Whether or not to add a projection after the vector extraction.
+        summary_activation (:obj:`str`, `optional`):
+            Argument used when doing sequence summary, used in the models
+            :class:`~transformers.OpenAIGPTDoubleHeadsModel` and :class:`~transformers.OpenAIGPTDoubleHeadsModel`.
+
+            Pass :obj:`"tanh"` for a tanh activation to the output, any other value will result in no activation.
+        summary_proj_to_labels (:obj:`bool`, `optional`, defaults to :obj:`True`):
+            Argument used when doing sequence summary, used in the models
+            :class:`~transformers.OpenAIGPTDoubleHeadsModel` and :class:`~transformers.OpenAIGPTDoubleHeadsModel`.
+
+            Whether the projection outputs should have :obj:`config.num_labels` or :obj:`config.hidden_size` classes.
+        summary_first_dropout (:obj:`float`, `optional`, defaults to 0.1):
+            Argument used when doing sequence summary, used in the models
+            :class:`~transformers.OpenAIGPTDoubleHeadsModel` and :class:`~transformers.OpenAIGPTDoubleHeadsModel`.
+
+            The dropout ratio to be used after the projection and activation.
+
+    Examples::
 
         >>> from transformers import OpenAIGPTConfig, OpenAIGPTModel
 
diff --git a/src/transformers/configuration_reformer.py b/src/transformers/configuration_reformer.py
index 87c4b9e6fb..4b7a732760 100755
--- a/src/transformers/configuration_reformer.py
+++ b/src/transformers/configuration_reformer.py
@@ -29,96 +29,120 @@ REFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP = {
 
 class ReformerConfig(PretrainedConfig):
     r"""
-    This is the configuration class to store the configuration of a :class:`~transformers.ReformerModel`.
-    It is used to instantiate an Reformer model according to the specified arguments, defining the model
-    architecture.
+    This is the configuration class to store the configuration of a :class:`~transformers.ReformerModel`. It is used to
+    instantiate a Reformer model according to the specified arguments, defining the model architecture.
 
     Configuration objects inherit from  :class:`~transformers.PretrainedConfig` and can be used
     to control the model outputs. Read the documentation from  :class:`~transformers.PretrainedConfig`
     for more information.
 
     Args:
-        attention_head_size (:obj:`int`, optional, defaults to 64):
+        attention_head_size (:obj:`int`, `optional`, defaults to 64):
             Dimensionality of the projected key, query and value vectors
-        attn_layers (:obj:`list(str)`, optional, defaults to ["local", "lsh", "local", "lsh", "local", "lsh"]):
+        attn_layers (:obj:`List[str]`, `optional`, defaults to :obj:`["local", "lsh", "local", "lsh", "local", "lsh"]`):
             List of attention layer types in ascending order. It can be chosen between a
-            LSHSelfAttention layer ("lsh") and a LocalSelfAttention layer ("local").
-            For more information on LSHSelfAttention layer, see `LSH Self Attention <reformer.html#lsh-self-attention>`__ .
-            For more information on LocalSelfAttention layer, see `Local Self Attention <reformer.html#local-sensitive-hashing-self-attention>`__ .
-        axial_pos_embds (:obj:`bool`, optional, defaults to :obj:`True`):
-            If `True` use axial position embeddings. For more information on how axial position embeddings work, see `Axial Position Encodings <reformer.html#axial-positional-encodings>`__
-        axial_norm_std (:obj:`float`, optional, defaluts to 1.0):
-            The standard deviation of the normal_initializer for initializing the weight matrices of the axial positional encodings.
-        axial_pos_shape (:obj:`list(int)`, optional, defaults to `[64, 64]`):
-            The position dims of the axial position encodings.
-            During training the product of the position dims has to equal the sequence length.
-            For more information on how axial position embeddings work, see `Axial Position Encodings <reformer.html#axial-positional-encodings>`__.
-        axial_pos_embds_dim (:obj:`list(int)`, optional, defaults to `[64, 192]`):
-            The embedding dims of the axial position encodings.
-            The sum of the embedding dims has to equal the hidden size.
-            For more information on how axial position embeddings work, see `Axial Position Encodings <reformer.html#axial-positional-encodings>`__.
-        chunk_size_lm_head (:obj:`int`, optional, defaults to 0):
+            LSHSelfAttention layer (:obj:`"lsh"`) and a LocalSelfAttention layer (:obj:`"local"`).
+
+            For more information on LSHSelfAttention layer, see `LSH Self Attention
+            <reformer.html#lsh-self-attention>`__. For more information on LocalSelfAttention layer, see `Local Self
+            Attention <reformer.html#local-sensitive-hashing-self-attention>`__.
+        axial_pos_embds (:obj:`bool`, `optional`, defaults to :obj:`True`):
+            Whether or not to use axial position embeddings. For more information on how axial position embeddings
+            work, see `Axial Position Encodings <reformer.html#axial-positional-encodings>`__.
+        axial_norm_std (:obj:`float`, `optional`, defaults to 1.0):
+            The standard deviation of the normal_initializer for initializing the weight matrices of the axial
+            positional encodings.
+        axial_pos_shape (:obj:`List[int]`, `optional`, defaults to :obj:`[64, 64]`):
+            The position dims of the axial position encodings. During training the product of the position dims has to
+            be equal to the sequence length.
+
+            For more information on how axial position embeddings work, see `Axial Position Encodings
+            <reformer.html#axial-positional-encodings>`__.
+        axial_pos_embds_dim (:obj:`List[int]`, `optional`, defaults to :obj:`[64, 192]`):
+            The embedding dims of the axial position encodings. The sum of the embedding dims has to be equal to the
+            hidden size.
+
+            For more information on how axial position embeddings work, see `Axial Position Encodings
+            <reformer.html#axial-positional-encodings>`__.
+        chunk_size_lm_head (:obj:`int`, `optional`, defaults to 0):
             The chunk size of the final language model feed forward head layer.
             A chunk size of 0 means that the feed forward layer is not chunked.
             A chunk size of n means that the feed forward layer processes n < sequence_length embeddings at a time.
-            For more information on feed forward chunking, see `How does Feed Forward Chunking work? <../glossary.html#feed-forward-chunking>`__ .
-        eos_token_id (:obj:`int`, optional, defaults to 2):
-            The token id for the <EOS> token.
-        feed_forward_size (:obj:`int`, optional, defaults to 512):
-            Dimensionality of the "feed_forward" (i.e., feed-forward) layer in the residual attention block.
-        hash_seed (:obj:`int`, optional, defaults to `None`):
-            Seed that can be used to make local sensitive hashing in LSHSelfAttention deterministic. This should only be set for testing purposed. For evaluation and training purposes `hash_seed` should be set to `None` to ensure fully random rotations in local sensitive hashing scheme.
-        hidden_act (:obj:`str` or :obj:`function`, optional, defaults to "relu"):
-            The non-linear activation function (function or string) in the feed forward layer in the residual attention block.
-            If string, "gelu", "relu", "swish", "gelu_new" and "gelu_fast" are supported.
-        hidden_dropout_prob (:obj:`float`, optional, defaults to 0.05):
+
+            For more information on feed forward chunking, see `How does Feed Forward Chunking work?
+            <../glossary.html#feed-forward-chunking>`__.
+        eos_token_id (:obj:`int`, `optional`, defaults to 2):
+            The token id for the end-of-sentence token.
+        feed_forward_size (:obj:`int`, `optional`, defaults to 512):
+            Dimensionality of the feed_forward layer in the residual attention block.
+        hash_seed (:obj:`int`, `optional`):
+            Seed that can be used to make local sensitive hashing in :obj:`LSHSelfAttention` deterministic. This should
+            only be set for testing purposed. For evaluation and training purposes :obj:`hash_seed` should be left as
+            :obj:`None` to ensure fully random rotations in local sensitive hashing scheme.
+        hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"relu"`):
+            The non-linear activation function (function or string) in the feed forward layer in the residual attention
+            block.
+            If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
+        hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.05):
             The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
-        hidden_size (:obj:`int`, optional, defaults to 256):
+        hidden_size (:obj:`int`, `optional`, defaults to 256):
             Dimensionality of the output hidden states of the residual attention blocks.
-        initializer_range (:obj:`float`, optional, defaults to 0.02):
+        initializer_range (:obj:`float`, `optional`, defaults to 0.02):
             The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
-        is_decoder (:obj:`bool`, optional, defaults to :obj:`False`):
-            If `is_decoder` is True, a causal mask is used in addition to `attention_mask`.
-            When using the Reformer for causal language modeling, `is_decoder` is set to `True`.
-        layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):
+        is_decoder (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether ot not to use a causal mask in addition to the :obj:`attention_mask` passed to
+            :class:`~transformers.ReformerModel`. When using the Reformer for causal language modeling, this argument
+            should be set to :obj:`True`.
+        layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
             The epsilon used by the layer normalization layers.
-        local_chunk_length (:obj:`int`, optional, defaults to 64):
-            Length of chunk which attends to itself in LocalSelfAttention. Chunking reduces memory complexity from sequence length x sequence length (self attention) to chunk length x chunk length x sequence length / chunk length (chunked self attention).
-        local_num_chunks_before (:obj:`int`, optional, defaults to 1):
-            Number of previous neighbouring chunks to attend to in LocalSelfAttention layer to itself.
-        local_num_chunks_after (:obj:`int`, optional, defaults to 0):
-            Number of following neighbouring chunks to attend to in LocalSelfAttention layer in addition to itself.
-        local_attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):
-            The dropout ratio for the attention probabilities in LocalSelfAttention.
-        lsh_attn_chunk_length (:obj:`int`, optional, defaults to 64):
-            Length of chunk which attends to itself in LSHSelfAttention. Chunking reduces memory complexity from sequence length x sequence length (self attention) to chunk length x chunk length x sequence length / chunk length (chunked self attention).
-        lsh_num_chunks_before (:obj:`int`, optional, defaults to 1):
-            Number of previous neighbouring chunks to attend to in LSHSelfAttention layer to itself.
-        lsh_num_chunks_after (:obj:`int`, optional, defaults to 0):
-            Number of following neighbouring chunks to attend to in LSHSelfAttention layer to itself.
-        lsh_attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):
-            The dropout ratio for the attention probabilities in LSHSelfAttention.
-        max_position_embeddings (:obj:`int`, optional, defaults to 4096):
+        local_chunk_length (:obj:`int`, `optional`, defaults to 64):
+            Length of chunk which attends to itself in :obj:`LocalSelfAttention`. Chunking reduces memory complexity
+            from sequence length x sequence length (self attention) to
+            chunk length x chunk length x sequence length / chunk length (chunked self attention).
+        local_num_chunks_before (:obj:`int`, `optional`, defaults to 1):
+            Number of previous neighbouring chunks to attend to in :obj:`LocalSelfAttention` layer to itself.
+        local_num_chunks_after (:obj:`int`, `optional`, defaults to 0):
+            Number of following neighbouring chunks to attend to in :obj:`LocalSelfAttention` layer in addition to
+            itself.
+        local_attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
+            The dropout ratio for the attention probabilities in :obj:`LocalSelfAttention`.
+        lsh_attn_chunk_length (:obj:`int`, `optional`, defaults to 64):
+            Length of chunk which attends to itself in :obj:`LSHSelfAttention`. Chunking reduces memory complexity from
+            sequence length x sequence length (self attention) to
+            chunk length x chunk length x sequence length / chunk length (chunked self attention).
+        lsh_num_chunks_before (:obj:`int`, `optional`, defaults to 1):
+            Number of previous neighbouring chunks to attend to in :obj:`LSHSelfAttention` layer to itself.
+        lsh_num_chunks_after (:obj:`int`, `optional`, defaults to 0):
+            Number of following neighbouring chunks to attend to in :obj:`LSHSelfAttention` layer to itself.
+        lsh_attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
+            The dropout ratio for the attention probabilities in :obj:`LSHSelfAttention`.
+        max_position_embeddings (:obj:`int`, `optional`, defaults to 4096):
             The maximum sequence length that this model might ever be used with.
             Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
-        num_attention_heads (:obj:`int`, optional, defaults to 12):
+        num_attention_heads (:obj:`int`, `optional`, defaults to 12):
             Number of attention heads for each attention layer in the Transformer encoder.
-        num_buckets (:obj:`int` or :obj:`list(int)`, optional, defaults to `None`):
-            Number of buckets, the key query vectors can be "hashed into" using the locality sensitive hashing scheme. Each query key vector is hashed into a hash in `1, ..., num_buckets`.
-            The number of buckets can also be factorized into a list for improved memory complexity. In this case, each query key vector is hashed into a hash in `1-1, 1-2, ..., num_buckets[0]-1, ..., num_buckets[0]-num_buckets[1]` if `num_buckets` is factorized into two factors.
-            The number of buckets (or the product the factors) should approximately equal sequence length / lsh_chunk_length. If `num_buckets` is set to `None`, a good value for `num_buckets` is calculated on the fly.
-        num_hashes (:obj:`int`, optional, defaults to 1):
-            Number of hashing rounds (e.g. number of random rotations) in Local Sensitive Hashing scheme.
-            The higher `num_hashes`, the more accurate the `LSHSelfAttention` becomes, but also the more memory and time intensive the hashing becomes.
-        pad_token_id (:obj:`int`, optional, defaults to 0):
-            The token id for the <PAD> token.
-        vocab_size (:obj:`int`, optional, defaults to 320):
-            Vocabulary size of the Reformer model. Defines the different tokens that
-            can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.ReformerModel`.
+        num_buckets (:obj:`int` or :obj:`List[int]`, `optional`):
+            Number of buckets, the key query vectors can be "hashed into" using the locality sensitive hashing scheme.
+            Each query key vector is hashed into a hash in :obj:`1, ..., num_buckets`.
+            The number of buckets can also be factorized into a list for improved memory complexity. In this case, each
+            query key vector is hashed into a hash in
+            :obj:`1-1, 1-2, ..., num_buckets[0]-1, ..., num_buckets[0]-num_buckets[1]` if :obj:`num_buckets` is
+            factorized into two factors.
+            The number of buckets (or the product the factors) should approximately equal
+            sequence length / lsh_chunk_length. If :obj:`num_buckets` not set, a good value is calculated on the fly.
+        num_hashes (:obj:`int`, `optional`, defaults to 1):
+            Number of hashing rounds (e.g., number of random rotations) in Local Sensitive Hashing scheme.
+            The higher :obj:`num_hashes`, the more accurate the :obj:`LSHSelfAttention` becomes, but also the more
+            memory and time intensive the hashing becomes.
+        pad_token_id (:obj:`int`, `optional`, defaults to 0):
+            The token id for the padding token.
+        vocab_size (:obj:`int`, `optional`, defaults to 320):\
+            Vocabulary size of the BERT model. Defines the number of different tokens that can be represented by the
+            :obj:`inputs_ids` passed when calling :class:`~transformers.ReformerModel`.
         tie_word_embeddings (:obj:`bool`, `optional`, defaults to :obj:`False`):
             Whether to tie input and output embeddings.
 
-    Example::
+    Examples::
 
         >>> from transformers import ReformerModel, ReformerConfig
 
diff --git a/src/transformers/configuration_retribert.py b/src/transformers/configuration_retribert.py
index 9082a9d41e..ac4ddc55d4 100644
--- a/src/transformers/configuration_retribert.py
+++ b/src/transformers/configuration_retribert.py
@@ -38,38 +38,37 @@ class RetriBertConfig(PretrainedConfig):
 
 
     Args:
-        vocab_size (:obj:`int`, optional, defaults to 30522):
-            Vocabulary size of the BERT model. Defines the different tokens that
-            can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.BertModel`.
-        hidden_size (:obj:`int`, optional, defaults to 768):
+        vocab_size (:obj:`int`, `optional`, defaults to 30522):
+            Vocabulary size of the RetriBERT model. Defines the number of different tokens that can be represented by
+            the :obj:`inputs_ids` passed when calling :class:`~transformers.RetriBertModel`
+        hidden_size (:obj:`int`, `optional`, defaults to 768):
             Dimensionality of the encoder layers and the pooler layer.
-        num_hidden_layers (:obj:`int`, optional, defaults to 12):
+        num_hidden_layers (:obj:`int`, `optional`, defaults to 12):
             Number of hidden layers in the Transformer encoder.
-        num_attention_heads (:obj:`int`, optional, defaults to 12):
+        num_attention_heads (:obj:`int`, `optional`, defaults to 12):
             Number of attention heads for each attention layer in the Transformer encoder.
-        intermediate_size (:obj:`int`, optional, defaults to 3072):
-            Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
-        hidden_act (:obj:`str` or :obj:`function`, optional, defaults to "gelu"):
+        intermediate_size (:obj:`int`, `optional`, defaults to 3072):
+            Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
+        hidden_act (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
             The non-linear activation function (function or string) in the encoder and pooler.
-            If string, "gelu", "relu", "swish" and "gelu_new" are supported.
-        hidden_dropout_prob (:obj:`float`, optional, defaults to 0.1):
+            If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
+        hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
             The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
-        attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):
+        attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
             The dropout ratio for the attention probabilities.
-        max_position_embeddings (:obj:`int`, optional, defaults to 512):
+        max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
             The maximum sequence length that this model might ever be used with.
             Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
-        type_vocab_size (:obj:`int`, optional, defaults to 2):
+        type_vocab_size (:obj:`int`, `optional`, defaults to 2):
             The vocabulary size of the `token_type_ids` passed into :class:`~transformers.BertModel`.
-        initializer_range (:obj:`float`, optional, defaults to 0.02):
+        initializer_range (:obj:`float`, `optional`, defaults to 0.02):
             The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
-        layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):
+        layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
             The epsilon used by the layer normalization layers.
-        share_encoders (:obj:`bool`, optional, defaults to :obj:`True`):
-            Whether to use the same Bert-type encoder for the queries and document
-        projection_dim (:obj:`int`, optional, defaults to 128):
+        share_encoders (:obj:`bool`, `optional`, defaults to :obj:`True`):
+            Whether or not to use the same Bert-type encoder for the queries and document
+        projection_dim (:obj:`int`, `optional`, defaults to 128):
             Final dimension of the query and document representation after projection
-
     """
     model_type = "retribert"
 
diff --git a/src/transformers/configuration_roberta.py b/src/transformers/configuration_roberta.py
index 2283158efc..0e7c3b84d6 100644
--- a/src/transformers/configuration_roberta.py
+++ b/src/transformers/configuration_roberta.py
@@ -33,10 +33,10 @@ ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP = {
 
 class RobertaConfig(BertConfig):
     r"""
-    This is the configuration class to store the configuration of a :class:`~transformers.RobertaModel`.
-    It is used to instantiate an RoBERTa model according to the specified arguments, defining the model
-    architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
-    the BERT `bert-base-uncased <https://huggingface.co/bert-base-uncased>`__ architecture.
+    This is the configuration class to store the configuration of a :class:`~transformers.RobertaModel` or a
+    :class:`~transformers.TFRobertaModel`. It is used to instantiate a RoBERTa model according to the specified
+    arguments, defining the model architecture.
+
 
     Configuration objects inherit from  :class:`~transformers.PretrainedConfig` and can be used
     to control the model outputs. Read the documentation from  :class:`~transformers.PretrainedConfig`
@@ -45,7 +45,7 @@ class RobertaConfig(BertConfig):
     The :class:`~transformers.RobertaConfig` class directly inherits :class:`~transformers.BertConfig`.
     It reuses the same defaults. Please check the parent class for more information.
 
-    Example::
+    Examples::
 
         >>> from transformers import RobertaConfig, RobertaModel
 
diff --git a/src/transformers/configuration_t5.py b/src/transformers/configuration_t5.py
index 80e624c052..3457f37632 100644
--- a/src/transformers/configuration_t5.py
+++ b/src/transformers/configuration_t5.py
@@ -31,33 +31,44 @@ T5_PRETRAINED_CONFIG_ARCHIVE_MAP = {
 
 class T5Config(PretrainedConfig):
     r"""
-    :class:`~transformers.T5Config` is the configuration class to store the configuration of a
-    `T5Model`.
+    This is the configuration class to store the configuration of a :class:`~transformers.T5Model` or a
+    :class:`~transformers.TFT5Model`. It is used to instantiate a T5 model according to the specified
+    arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
+    configuration to that of the T5 `t5-small <https://huggingface.co/t5-small>`__ architecture.
 
+    Configuration objects inherit from  :class:`~transformers.PretrainedConfig` and can be used
+    to control the model outputs. Read the documentation from  :class:`~transformers.PretrainedConfig`
+    for more information.
 
     Arguments:
-        vocab_size_or_config_json_file: Vocabulary size of `inputs_ids` in `T5Model`.
-        d_model: Size of the encoder layers and the pooler layer. `d_model` can also accesed via the property `hidden_size`.
-        num_layers: Number of hidden layers in the Transformer encoder. `num_layers` can also be accessed via the property `num_hidden_layers`.
-        d_kv: Size of the key, query, value projections per attention head. `d_kv` has to be equal to `d_model // num_heads`.
-        d_ff: Size of the intermediate feed forward layer in each `T5Block`.
-        num_heads: Number of attention heads for each attention layer in
-            the Transformer encoder. `num_heads` can also be accessed via the property `num_attention_heads`.
-        intermediate_size: The size of the "intermediate" (i.e., feed-forward)
-            layer in the Transformer encoder.
-        hidden_act: The non-linear activation function (function or string) in the
-            encoder and pooler. If string, "gelu", "relu", "swish" and "gelu_new" are supported.
-        hidden_dropout_prob: The dropout probabilitiy for all fully connected
-            layers in the embeddings, encoder, and pooler.
-        attention_probs_dropout_prob: The dropout ratio for the attention
-            probabilities.
-        n_positions: The maximum sequence length that this model might
-            ever be used with. Typically set this to something large just in case
-            (e.g., 512 or 1024 or 2048). `n_positions` can also be accessed via the property `max_position_embeddings`.
-        type_vocab_size: The vocabulary size of the `token_type_ids` passed into
-            `T5Model`.
-        initializer_factor: A factor for initializing all weight matrices (should be kept to 1.0, used for initialization testing).
-        layer_norm_eps: The epsilon used by LayerNorm.
+        vocab_size (:obj:`int`, `optional`, defaults to 32128):
+            Vocabulary size of the T5 model. Defines the number of different tokens that can be represented by the
+            :obj:`inputs_ids` passed when calling :class:`~transformers.T5Model` or
+            :class:`~transformers.TFT5Model`.
+        n_positions (:obj:`int`, `optional`, defaults to 512):
+            The maximum sequence length that this model might ever be used with. Typically set this to something large
+            just in case (e.g., 512 or 1024 or 2048).
+        d_model (:obj:`int`, `optional`, defaults to 512):
+            Size of the encoder layers and the pooler layer.
+        d_kv (:obj:`int`, `optional`, defaults to 64):
+            Size of the key, query, value projections per attention head. :obj:`d_kv` has to be equal to
+            :obj:`d_model // num_heads`.
+        d_ff (:obj:`int`, `optional`, defaults to 2048):
+            Size of the intermediate feed forward layer in each :obj:`T5Block`.
+        num_layers (:obj:`int`, `optional`, defaults to 6):
+            Number of hidden layers in the Transformer encoder.
+        num_heads (:obj:`int`, `optional`, defaults to 8):
+            Number of attention heads for each attention layer in
+            the Transformer encoder.
+        relative_attention_num_buckets (:obj:`int`, `optional`, defaults to 32):
+            The number of buckets to use for each attention layer.
+        dropout_rate (:obj:`float`, `optional`, defaults to 0.1):
+            The ratio for all dropout layers.
+        layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-6):
+            The epsilon used by the layer normalization layers.
+        initializer_factor (:obj:`float`, `optional`, defaults to 1):
+            A factor for initializing all weight matrices (should be kept to 1, used internally for initialization
+            testing).
     """
     model_type = "t5"
 
diff --git a/src/transformers/configuration_transfo_xl.py b/src/transformers/configuration_transfo_xl.py
index 4864ec80ca..b678a780ef 100644
--- a/src/transformers/configuration_transfo_xl.py
+++ b/src/transformers/configuration_transfo_xl.py
@@ -31,69 +31,70 @@ TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP = {
 
 class TransfoXLConfig(PretrainedConfig):
     """
-    This is the configuration class to store the configuration of a :class:`~transformers.TransfoXLModel`.
-    It is used to instantiate a Transformer XL model according to the specified arguments, defining the model
-    architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
-    the `Transformer XL <https://huggingface.co/transfo-xl-wt103>`__ architecture.
+    This is the configuration class to store the configuration of a :class:`~transformers.TransfoXLModel` or a
+    :class:`~transformers.TFTransfoXLModel`. It is used to instantiate a Transformer-XL model according to the specified
+    arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
+    configuration to that of the `Transformer XL <https://huggingface.co/transfo-xl-wt103>`__ architecture.
 
     Configuration objects inherit from  :class:`~transformers.PretrainedConfig` and can be used
     to control the model outputs. Read the documentation from  :class:`~transformers.PretrainedConfig`
     for more information.
 
     Args:
-        vocab_size (:obj:`int`, optional, defaults to 267735):
-            Vocabulary size of the Transformer XL model. Defines the different tokens that
-            can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.TransfoXLModel`.
-        cutoffs (:obj:`List[int]`, optional, defaults to :obj:`[20000, 40000, 200000]`):
-            Cutoffs for the adaptive softmax
-        d_model (:obj:`int`, optional, defaults to 1024):
+        vocab_size (:obj:`int`, `optional`, defaults to 267735):
+            Vocabulary size of the BERT model. Defines the number of different tokens that can be represented by the
+            :obj:`inputs_ids` passed when calling :class:`~transformers.TransfoXLModel` or
+            :class:`~transformers.TFTransfoXLModel`.
+        cutoffs (:obj:`List[int]`, `optional`, defaults to :obj:`[20000, 40000, 200000]`):
+            Cutoffs for the adaptive softmax.
+        d_model (:obj:`int`, `optional`, defaults to 1024):
             Dimensionality of the model's hidden states.
-        d_embed (:obj:`int`, optional, defaults to 1024):
+        d_embed (:obj:`int`, `optional`, defaults to 1024):
             Dimensionality of the embeddings
-        n_head (:obj:`int`, optional, defaults to 16):
+        n_head (:obj:`int`, `optional`, defaults to 16):
             Number of attention heads for each attention layer in the Transformer encoder.
-        d_head (:obj:`int`, optional, defaults to 64):
+        d_head (:obj:`int`, `optional`, defaults to 64):
             Dimensionality of the model's heads.
-        d_inner (:obj:`int`, optional, defaults to 4096):
+        d_inner (:obj:`int`, `optional`, defaults to 4096):
             Inner dimension in FF
-        div_val (:obj:`int`, optional, defaults to 4):
+        div_val (:obj:`int`, `optional`, defaults to 4):
             Divident value for adapative input and softmax
-        pre_lnorm (:obj:`boolean`, optional, defaults to :obj:`False`):
-            Apply LayerNorm to the input instead of the output
-        n_layer (:obj:`int`, optional, defaults to 18):
+        pre_lnorm (:obj:`boolean`, `optional`, defaults to :obj:`False`):
+            Whether or not to apply LayerNorm to the input instead of the output in the blocks.
+        n_layer (:obj:`int`, `optional`, defaults to 18):
             Number of hidden layers in the Transformer encoder.
-        mem_len (:obj:`int`, optional, defaults to 1600):
-            Length of the retained previous heads
-        clamp_len (:obj:`int`, optional, defaults to 1000):
-            use the same pos embeddings after clamp_len
-        same_length (:obj:`boolean`, optional, defaults to :obj:`True`):
-            Use the same attn length for all tokens
-        proj_share_all_but_first (:obj:`boolean`, optional, defaults to :obj:`True`):
+        mem_len (:obj:`int`, `optional`, defaults to 1600):
+            Length of the retained previous heads.
+        clamp_len (:obj:`int`, `optional`, defaults to 1000):
+            Use the same pos embeddings after clamp_len.
+        same_length (:obj:`boolean`, `optional`, defaults to :obj:`True`):
+            Whether or not to use the same attn length for all tokens
+        proj_share_all_but_first (:obj:`boolean`, `optional`, defaults to :obj:`True`):
             True to share all but first projs, False not to share.
-        attn_type (:obj:`int`, optional, defaults to 0):
+        attn_type (:obj:`int`, `optional`, defaults to 0):
             Attention type. 0 for Transformer-XL, 1 for Shaw et al, 2 for Vaswani et al, 3 for Al Rfou et al.
-        sample_softmax (:obj:`int`, optional, defaults to -1):
-            number of samples in sampled softmax
-        adaptive (:obj:`boolean`, optional, defaults to :obj:`True`):
-            use adaptive softmax
-        dropout (:obj:`float`, optional, defaults to 0.1):
+        sample_softmax (:obj:`int`, `optional`, defaults to -1):
+            Number of samples in the sampled softmax.
+        adaptive (:obj:`boolean`, `optional`, defaults to :obj:`True`):
+            Whether or not to use adaptive softmax.
+        dropout (:obj:`float`, `optional`, defaults to 0.1):
             The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
-        dropatt (:obj:`float`, optional, defaults to 0):
+        dropatt (:obj:`float`, `optional`, defaults to 0):
             The dropout ratio for the attention probabilities.
-        untie_r (:obj:`boolean`, optional, defaults to :obj:`True`):
-            Untie relative position biases
-        init (:obj:`string`, optional, defaults to `normal`):
-            Parameter initializer to use
-        init_range (:obj:`float`, optional, defaults to 0.01):
+        untie_r (:obj:`boolean`, `optional`, defaults to :obj:`True`):
+            Whether ot not to untie relative position biases.
+        init (:obj:`str`, `optional`, defaults to :obj:`"normal"`):
+            Parameter initializer to use.
+        init_range (:obj:`float`, `optional`, defaults to 0.01):
             Parameters initialized by U(-init_range, init_range).
-        proj_init_std (:obj:`float`, optional, defaults to 0.01):
+        proj_init_std (:obj:`float`, `optional`, defaults to 0.01):
             Parameters initialized by N(0, init_std)
-        init_std (:obj:`float`, optional, defaults to 0.02):
+        init_std (:obj:`float`, `optional`, defaults to 0.02):
             Parameters initialized by N(0, init_std)
-        layer_norm_epsilon (:obj:`float`, optional, defaults to 1e-5):
+        layer_norm_epsilon (:obj:`float`, `optional`, defaults to 1e-5):
             The epsilon to use in the layer normalization layers
 
-    Example::
+    Examples::
 
         >>> from transformers import TransfoXLConfig, TransfoXLModel
 
diff --git a/src/transformers/configuration_xlm.py b/src/transformers/configuration_xlm.py
index f1c3430b0f..a11edd6bd5 100644
--- a/src/transformers/configuration_xlm.py
+++ b/src/transformers/configuration_xlm.py
@@ -36,109 +36,109 @@ XLM_PRETRAINED_CONFIG_ARCHIVE_MAP = {
 
 class XLMConfig(PretrainedConfig):
     """
-    This is the configuration class to store the configuration of a :class:`~transformers.XLMModel`.
-    It is used to instantiate an XLM model according to the specified arguments, defining the model
-    architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
-    the `xlm-mlm-en-2048 <https://huggingface.co/xlm-mlm-en-2048>`__ architecture.
+    This is the configuration class to store the configuration of a :class:`~transformers.XLMModel` or a
+    :class:`~transformers.TFXLMModel`. It is used to instantiate a XLM model according to the specified
+    arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
+    configuration to that of the `xlm-mlm-en-2048 <https://huggingface.co/xlm-mlm-en-2048>`__ architecture.
 
     Configuration objects inherit from  :class:`~transformers.PretrainedConfig` and can be used
     to control the model outputs. Read the documentation from  :class:`~transformers.PretrainedConfig`
     for more information.
 
     Args:
-        vocab_size (:obj:`int`, optional, defaults to 30145):
-            Vocabulary size of the XLM model. Defines the different tokens that
-            can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.XLMModel`.
-        emb_dim (:obj:`int`, optional, defaults to 2048):
+        vocab_size (:obj:`int`, `optional`, defaults to 30145):
+            Vocabulary size of the BERT model. Defines the number of different tokens that can be represented by the
+            :obj:`inputs_ids` passed when calling :class:`~transformers.XLMModel` or
+            :class:`~transformers.TFXLMModel`.
+        emb_dim (:obj:`int`, `optional`, defaults to 2048):
             Dimensionality of the encoder layers and the pooler layer.
-        n_layer (:obj:`int`, optional, defaults to 12):
+        n_layer (:obj:`int`, `optional`, defaults to 12):
             Number of hidden layers in the Transformer encoder.
-        n_head (:obj:`int`, optional, defaults to 16):
+        n_head (:obj:`int`, `optional`, defaults to 16):
             Number of attention heads for each attention layer in the Transformer encoder.
-        dropout (:obj:`float`, optional, defaults to 0.1):
+        dropout (:obj:`float`, `optional`, defaults to 0.1):
             The dropout probability for all fully connected
             layers in the embeddings, encoder, and pooler.
-        attention_dropout (:obj:`float`, optional, defaults to 0.1):
+        attention_dropout (:obj:`float`, `optional`, defaults to 0.1):
             The dropout probability for the attention mechanism
-        gelu_activation (:obj:`boolean`, optional, defaults to :obj:`True`):
-            The non-linear activation function (function or string) in the
-            encoder and pooler. If set to `True`, "gelu" will be used instead of "relu".
-        sinusoidal_embeddings (:obj:`boolean`, optional, defaults to :obj:`False`):
-            Whether to use sinusoidal positional embeddings instead of absolute positional embeddings.
-        causal (:obj:`boolean`, optional, defaults to :obj:`False`):
-            Set this to `True` for the model to behave in a causal manner.
+        gelu_activation (:obj:`bool`, `optional`, defaults to :obj:`True`):
+            Whether or not to use `gelu` for the activations instead of `relu`.
+        sinusoidal_embeddings (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether or not to use sinusoidal positional embeddings instead of absolute positional embeddings.
+        causal (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether or not the model should behave in a causal manner.
             Causal models use a triangular attention mask in order to only attend to the left-side context instead
             if a bidirectional context.
-        asm (:obj:`boolean`, optional, defaults to :obj:`False`):
-            Whether to use an adaptive log softmax projection layer instead of a linear layer for the prediction
+        asm (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether or not to use an adaptive log softmax projection layer instead of a linear layer for the prediction
             layer.
-        n_langs (:obj:`int`, optional, defaults to 1):
+        n_langs (:obj:`int`, `optional`, defaults to 1):
             The number of languages the model handles. Set to 1 for monolingual models.
-        use_lang_emb (:obj:`boolean`, optional, defaults to :obj:`True`)
+        use_lang_emb (:obj:`bool`, `optional`, defaults to :obj:`True`)
             Whether to use language embeddings. Some models use additional language embeddings, see
             `the multilingual models page <http://huggingface.co/transformers/multilingual.html#xlm-language-embeddings>`__
             for information on how to use them.
-        max_position_embeddings (:obj:`int`, optional, defaults to 512):
+        max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
             The maximum sequence length that this model might
             ever be used with. Typically set this to something large just in case
             (e.g., 512 or 1024 or 2048).
-        embed_init_std (:obj:`float`, optional, defaults to 2048^-0.5):
+        embed_init_std (:obj:`float`, `optional`, defaults to 2048^-0.5):
             The standard deviation of the truncated_normal_initializer for
             initializing the embedding matrices.
-        init_std (:obj:`int`, optional, defaults to 50257):
+        init_std (:obj:`int`, `optional`, defaults to 50257):
             The standard deviation of the truncated_normal_initializer for
             initializing all weight matrices except the embedding matrices.
-        layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):
+        layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
             The epsilon used by the layer normalization layers.
-        bos_index (:obj:`int`, optional, defaults to 0):
+        bos_index (:obj:`int`, `optional`, defaults to 0):
             The index of the beginning of sentence token in the vocabulary.
-        eos_index (:obj:`int`, optional, defaults to 1):
+        eos_index (:obj:`int`, `optional`, defaults to 1):
             The index of the end of sentence token in the vocabulary.
-        pad_index (:obj:`int`, optional, defaults to 2):
+        pad_index (:obj:`int`, `optional`, defaults to 2):
             The index of the padding token in the vocabulary.
-        unk_index (:obj:`int`, optional, defaults to 3):
+        unk_index (:obj:`int`, `optional`, defaults to 3):
             The index of the unknown token in the vocabulary.
-        mask_index (:obj:`int`, optional, defaults to 5):
+        mask_index (:obj:`int`, `optional`, defaults to 5):
             The index of the masking token in the vocabulary.
-        is_encoder(:obj:`boolean`, optional, defaults to :obj:`True`):
-            Whether the initialized model should be a transformer encoder or decoder as seen in Vaswani et al.
-        summary_type (:obj:`string`, optional, defaults to "first"):
-            Argument used when doing sequence summary. Used in for the multiple choice head in
-            :class:`~transformers.XLMForSequenceClassification`.
-            Is one of the following options:
+        is_encoder(:obj:`bool`, `optional`, defaults to :obj:`True`):
+            Whether or not the initialized model should be a transformer encoder or decoder as seen in Vaswani et al.
+        summary_type (:obj:`string`, `optional`, defaults to "first"):
+            Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
 
-            - 'last' => take the last token hidden state (like XLNet)
-            - 'first' => take the first token hidden state (like Bert)
-            - 'mean' => take the mean of all tokens hidden states
-            - 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)
-            - 'attn' => Not implemented now, use multi-head attention
-        summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`):
-            Argument used when doing sequence summary. Used in for the multiple choice head in
-            :class:`~transformers.XLMForSequenceClassification`.
-            Add a projection after the vector extraction
-        summary_activation (:obj:`string` or :obj:`None`, optional):
-            Argument used when doing sequence summary. Used in for the multiple choice head in
-            :class:`~transformers.XLMForSequenceClassification`.
-            'tanh' => add a tanh activation to the output, Other => no activation.
-        summary_proj_to_labels (:obj:`boolean`, optional, defaults to :obj:`True`):
-            Argument used when doing sequence summary. Used in for the multiple choice head in
-            :class:`~transformers.XLMForSequenceClassification`.
-            If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False.
-        summary_first_dropout (:obj:`float`, optional, defaults to 0.1):
-            Argument used when doing sequence summary. Used in for the multiple choice head in
-            :class:`~transformers.XLMForSequenceClassification`.
-            Add a dropout before the projection and activation
-        start_n_top (:obj:`int`, optional, defaults to 5):
-            Used in the SQuAD evaluation script for XLM and XLNet.
-        end_n_top (:obj:`int`, optional, defaults to 5):
-            Used in the SQuAD evaluation script for XLM and XLNet.
-        mask_token_id (:obj:`int`, optional, defaults to 0):
+            Has to be one of the following options:
+
+                - :obj:`"last"`: Take the last token hidden state (like XLNet).
+                - :obj:`"first"`: Take the first token hidden state (like BERT).
+                - :obj:`"mean"`: Take the mean of all tokens hidden states.
+                - :obj:`"cls_index"`: Supply a Tensor of classification token position (like GPT/GPT-2).
+                - :obj:`"attn"`: Not implemented now, use multi-head attention.
+        summary_use_proj (:obj:`bool`, `optional`, defaults to :obj:`True`):
+            Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
+
+            Whether or not to add a projection after the vector extraction.
+        summary_activation (:obj:`str`, `optional`):
+            Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
+
+            Pass :obj:`"tanh"` for a tanh activation to the output, any other value will result in no activation.
+        summary_proj_to_labels (:obj:`bool`, `optional`, defaults to :obj:`True`):
+            Used in the sequence classification and multiple choice models.
+
+            Whether the projection outputs should have :obj:`config.num_labels` or :obj:`config.hidden_size` classes.
+        summary_first_dropout (:obj:`float`, `optional`, defaults to 0.1):
+            Used in the sequence classification and multiple choice models.
+
+            The dropout ratio to be used after the projection and activation.
+        start_n_top (:obj:`int`, `optional`, defaults to 5):
+            Used in the SQuAD evaluation script.
+        end_n_top (:obj:`int`, `optional`, defaults to 5):
+            Used in the SQuAD evaluation script.
+        mask_token_id (:obj:`int`, `optional`, defaults to 0):
             Model agnostic parameter to identify masked tokens when generating text in an MLM context.
-        lang_id (:obj:`int`, optional, defaults to 1):
+        lang_id (:obj:`int`, `optional`, defaults to 1):
             The ID of the language used by the model. This parameter is used when generating
             text in a given language.
 
-    Example::
+    Examples::
 
         >>> from transformers import XLMConfig, XLMModel
 
diff --git a/src/transformers/configuration_xlnet.py b/src/transformers/configuration_xlnet.py
index b0ea0da858..656461d63b 100644
--- a/src/transformers/configuration_xlnet.py
+++ b/src/transformers/configuration_xlnet.py
@@ -31,85 +31,86 @@ XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP = {
 
 class XLNetConfig(PretrainedConfig):
     """
-    This is the configuration class to store the configuration of a :class:`~transformers.XLNetModel`.
-    It is used to instantiate an XLNet model according to the specified arguments, defining the model
-    architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
-    the `xlnet-large-cased <https://huggingface.co/xlnet-large-cased>`__ architecture.
+    This is the configuration class to store the configuration of a :class:`~transformers.XLNetModel` or a
+    :class:`~transformers.TFXLNetModel`. It is used to instantiate a XLNet model according to the specified
+    arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
+    configuration to that of the `xlnet-large-cased <https://huggingface.co/xlnet-large-cased>`__ architecture.
 
     Configuration objects inherit from  :class:`~transformers.PretrainedConfig` and can be used
     to control the model outputs. Read the documentation from  :class:`~transformers.PretrainedConfig`
     for more information.
 
     Args:
-        vocab_size (:obj:`int`, optional, defaults to 32000):
-            Vocabulary size of the XLNet model. Defines the different tokens that
-            can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.XLNetModel`.
-        d_model (:obj:`int`, optional, defaults to 1024):
+        vocab_size (:obj:`int`, `optional`, defaults to 32000):
+            Vocabulary size of the XLNet model. Defines the number of different tokens that can be represented by the
+            :obj:`inputs_ids` passed when calling :class:`~transformers.XLNetModel` or
+            :class:`~transformers.TFXLNetModel`.
+        d_model (:obj:`int`, `optional`, defaults to 1024):
             Dimensionality of the encoder layers and the pooler layer.
-        n_layer (:obj:`int`, optional, defaults to 24):
+        n_layer (:obj:`int`, `optional`, defaults to 24):
             Number of hidden layers in the Transformer encoder.
-        n_head (:obj:`int`, optional, defaults to 16):
+        n_head (:obj:`int`, `optional`, defaults to 16):
             Number of attention heads for each attention layer in the Transformer encoder.
-        d_inner (:obj:`int`, optional, defaults to 4096):
-            Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
-        ff_activation (:obj:`string`, optional, defaults to "gelu"):
+        d_inner (:obj:`int`, `optional`, defaults to 4096):
+            Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
+        ff_activation (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
             The non-linear activation function (function or string) in the
-            encoder and pooler. If string, "gelu", "relu" and "swish" are supported.
-        untie_r (:obj:`boolean`, optional, defaults to :obj:`True`):
-            Untie relative position biases
-        attn_type (:obj:`string`, optional, defaults to "bi"):
-            The attention type used by the model. Set 'bi' for XLNet, 'uni' for Transformer-XL.
-        initializer_range (:obj:`float`, optional, defaults to 0.02):
+            If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
+        untie_r (:obj:`bool`, `optional`, defaults to :obj:`True`):
+            Whether or not to untie relative position biases
+        attn_type (:obj:`str`, `optional`, defaults to :obj:`"bi"`):
+            The attention type used by the model. Set :obj:`"bi"` for XLNet, :obj:`"uni"` for Transformer-XL.
+        initializer_range (:obj:`float`, `optional`, defaults to 0.02):
             The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
-        layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):
+        layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
             The epsilon used by the layer normalization layers.
-        dropout (:obj:`float`, optional, defaults to 0.1):
+        dropout (:obj:`float`, `optional`, defaults to 0.1):
             The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
-        mem_len (:obj:`int` or :obj:`None`, optional):
+        mem_len (:obj:`int` or :obj:`None`, `optional`):
             The number of tokens to cache. The key/value pairs that have already been pre-computed
             in a previous forward pass won't be re-computed. See the
             `quickstart <https://huggingface.co/transformers/quickstart.html#using-the-past>`__
             for more information.
-        reuse_len (:obj:`int` or :obj:`None`, optional):
+        reuse_len (:obj:`int`, `optional`):
             The number of tokens in the current batch to be cached and reused in the future.
-        bi_data (:obj:`boolean`, optional, defaults to :obj:`False`):
-            Whether to use bidirectional input pipeline. Usually set to `True` during
-            pretraining and `False` during finetuning.
-        clamp_len (:obj:`int`, optional, defaults to -1):
+        bi_data (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether or not to use bidirectional input pipeline. Usually set to :obj:`True` during
+            pretraining and :obj:`False` during finetuning.
+        clamp_len (:obj:`int`, `optional`, defaults to -1):
             Clamp all relative distances larger than clamp_len.
             Setting this attribute to -1 means no clamping.
-        same_length (:obj:`boolean`, optional, defaults to :obj:`False`):
-            Whether to use the same attention length for each token.
-        summary_type (:obj:`string`, optional, defaults to "last"):
-            Argument used when doing sequence summary. Used in for the multiple choice head in
-            :class:`~transformers.XLNetForSequenceClassification` and :class:`~transformers.XLNetForMultipleChoice`.
-            Is one of the following options:
+        same_length (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether or not to use the same attention length for each token.
+        summary_type (:obj:`str`, `optional`, defaults to "last"):
+            Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
 
-            - 'last' => take the last token hidden state (like XLNet)
-            - 'first' => take the first token hidden state (like Bert)
-            - 'mean' => take the mean of all tokens hidden states
-            - 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)
-            - 'attn' => Not implemented now, use multi-head attention
-        summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`):
-            Argument used when doing sequence summary. Used in for the multiple choice head in
-            :class:`~transformers.XLNetForSequenceClassification` and :class:`~transformers.XLNetForMultipleChoice`.
-            Add a projection after the vector extraction
-        summary_activation (:obj:`string` or :obj:`None`, optional):
-            Argument used when doing sequence summary. Used in for the multiple choice head in
-            :class:`~transformers.XLNetForSequenceClassification` and :class:`~transformers.XLNetForMultipleChoice`.
-            'tanh' => add a tanh activation to the output, Other => no activation.
-        summary_proj_to_labels (:obj:`boolean`, optional, defaults to :obj:`True`):
-            Argument used when doing sequence summary. Used in for the multiple choice head in
-            :class:`~transformers.XLNetForSequenceClassification` and :class:`~transformers.XLNetForMultipleChoice`.
-            If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False.
-        summary_last_dropout (:obj:`float`, optional, defaults to 0.1):
-            Argument used when doing sequence summary. Used in for the multiple choice head in
-            :class:`~transformers.XLNetForSequenceClassification` and :class:`~transformers.XLNetForMultipleChoice`.
-            Add a dropout after the projection and activation
-        start_n_top (:obj:`int`, optional, defaults to 5):
-            Used in the SQuAD evaluation script for XLM and XLNet.
-        end_n_top (:obj:`int`, optional, defaults to 5):
-            Used in the SQuAD evaluation script for XLM and XLNet.
+            Has to be one of the following options:
+
+                - :obj:`"last"`: Take the last token hidden state (like XLNet).
+                - :obj:`"first"`: Take the first token hidden state (like BERT).
+                - :obj:`"mean"`: Take the mean of all tokens hidden states.
+                - :obj:`"cls_index"`: Supply a Tensor of classification token position (like GPT/GPT-2).
+                - :obj:`"attn"`: Not implemented now, use multi-head attention.
+        summary_use_proj (:obj:`bool`, `optional`, defaults to :obj:`True`):
+            Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
+
+            Whether or not to add a projection after the vector extraction.
+        summary_activation (:obj:`str`, `optional`):
+            Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
+
+            Pass :obj:`"tanh"` for a tanh activation to the output, any other value will result in no activation.
+        summary_proj_to_labels (:obj:`boo`, `optional`, defaults to :obj:`True`):
+            Used in the sequence classification and multiple choice models.
+
+            Whether the projection outputs should have :obj:`config.num_labels` or :obj:`config.hidden_size` classes.
+        summary_last_dropout (:obj:`float`, `optional`, defaults to 0.1):
+            Used in the sequence classification and multiple choice models.
+
+            The dropout ratio to be used after the projection and activation.
+        start_n_top (:obj:`int`, `optional`, defaults to 5):
+            Used in the SQuAD evaluation script.
+        end_n_top (:obj:`int`, `optional`, defaults to 5):
+            Used in the SQuAD evaluation script.
         use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
             Whether or not the model should return the last pre-computed hidden states.
 
@@ -117,7 +118,7 @@ class XLNetConfig(PretrainedConfig):
                 This flag behaves differently from with other models: it just controls the inference behavior, during
                 training the model always uses ``use_cache=True``.
 
-    Example::
+    Examples::
 
         >>> from transformers import XLNetConfig, XLNetModel
 
diff --git a/src/transformers/file_utils.py b/src/transformers/file_utils.py
index 4daac636d1..9c9578dd48 100644
--- a/src/transformers/file_utils.py
+++ b/src/transformers/file_utils.py
@@ -471,6 +471,7 @@ TF_SEQUENCE_CLASSIFICATION_SAMPLE = r"""
 
 TF_MASKED_LM_SAMPLE = r"""
     Example::
+
         >>> from transformers import {tokenizer_class}, {model_class}
         >>> import tensorflow as tf
 
diff --git a/src/transformers/modeling_albert.py b/src/transformers/modeling_albert.py
index c0ac425bd5..f368a66253 100755
--- a/src/transformers/modeling_albert.py
+++ b/src/transformers/modeling_albert.py
@@ -428,7 +428,8 @@ class AlbertForPreTrainingOutput(ModelOutput):
 
     Args:
         loss (`optional`, returned when ``labels`` is provided, ``torch.FloatTensor`` of shape :obj:`(1,)`):
-            Total loss as the sum of the masked language modeling loss and the next sequence prediction (classification) loss.
+            Total loss as the sum of the masked language modeling loss and the next sequence prediction
+            (classification) loss.
         prediction_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):
             Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
         sop_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`):
@@ -456,7 +457,11 @@ class AlbertForPreTrainingOutput(ModelOutput):
 
 ALBERT_START_DOCSTRING = r"""
 
-    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.
+    This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
+    methods the library implements for all its model (such as downloading or saving, resizing the input embeddings,
+    pruning heads etc.)
+
+    This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
     Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
     usage and behavior.
 
@@ -468,27 +473,31 @@ ALBERT_START_DOCSTRING = r"""
 
 ALBERT_INPUTS_DOCSTRING = r"""
     Args:
-        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
+        input_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`):
             Indices of input sequence tokens in the vocabulary.
 
-            Indices can be obtained using :class:`transformers.AlbertTokenizer`.
-            See :func:`transformers.PreTrainedTokenizer.encode` and
-            :func:`transformers.PreTrainedTokenizer` for details.
+            Indices can be obtained using :class:`~transformers.AlbertTokenizer`.
+            See :meth:`transformers.PreTrainedTokenizer.__call__` and
+            :meth:`transformers.PreTrainedTokenizer.encode` for details.
 
             `What are input IDs? <../glossary.html#input-ids>`__
-        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`):
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
 
             `What are attention masks? <../glossary.html#attention-mask>`__
-        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
             Segment token indices to indicate first and second portions of the inputs.
-            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
-            corresponds to a `sentence B` token
+            Indices are selected in ``[0, 1]``:
+
+            - 0 corresponds to a `sentence A` token,
+            - 1 corresponds to a `sentence B` token.
 
             `What are token type IDs? <../glossary.html#token-type-ids>`_
-        position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+        position_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
             Indices of positions of each input sequence tokens in the position embeddings.
             Selected in the range ``[0, config.max_position_embeddings - 1]``.
 
@@ -496,18 +505,22 @@ ALBERT_INPUTS_DOCSTRING = r"""
         head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
-            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
-        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
+
+            - 1 indicates the head is **not masked**,
+            - 0 indicates the head is **masked**.
+
+        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`({0}, hidden_size)`, `optional`):
             Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
-            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
+            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
+            vectors than the model's internal embedding lookup matrix.
         output_attentions (:obj:`bool`, `optional`):
-            If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
+            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
+            tensors for more detail.
         output_hidden_states (:obj:`bool`, `optional`):
-            If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
+            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
+            more detail.
         return_dict (:obj:`bool`, `optional`):
-            If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
-            plain tuple.
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
 """
 
 
@@ -562,7 +575,7 @@ class AlbertModel(AlbertPreTrainedModel):
             inner_group_idx = int(layer - group_idx * self.config.inner_group_num)
             self.encoder.albert_layer_groups[group_idx].albert_layers[inner_group_idx].attention.prune_heads(heads)
 
-    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="albert-base-v2",
@@ -656,7 +669,7 @@ class AlbertForPreTraining(AlbertPreTrainedModel):
     def get_input_embeddings(self):
         return self.albert.embeddings.word_embeddings
 
-    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @replace_return_docstrings(output_type=AlbertForPreTrainingOutput, config_class=_CONFIG_FOR_DOC)
     def forward(
         self,
@@ -674,22 +687,22 @@ class AlbertForPreTraining(AlbertPreTrainedModel):
         **kwargs,
     ):
         r"""
-            labels (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`):
-                Labels for computing the masked language modeling loss.
-                Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
-                Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
-                in ``[0, ..., config.vocab_size]``
-            sentence_order_label (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`):
-                Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair (see :obj:`input_ids` docstring)
-                Indices should be in ``[0, 1]``.
-                ``0`` indicates original order (sequence A, then sequence B),
-                ``1`` indicates switched order (sequence B, then sequence A).
-            kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`):
-                Used to hide legacy arguments that have been deprecated.
+        labels (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`):
+            Labels for computing the masked language modeling loss.
+            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
+            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
+            in ``[0, ..., config.vocab_size]``
+        sentence_order_label (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`):
+            Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair (see :obj:`input_ids` docstring)
+            Indices should be in ``[0, 1]``.
+            ``0`` indicates original order (sequence A, then sequence B),
+            ``1`` indicates switched order (sequence B, then sequence A).
+        kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`):
+            Used to hide legacy arguments that have been deprecated.
 
         Returns:
 
-        Examples::
+        Example::
 
             >>> from transformers import AlbertTokenizer, AlbertForPreTraining
             >>> import torch
@@ -807,7 +820,7 @@ class AlbertForMaskedLM(AlbertPreTrainedModel):
     def get_input_embeddings(self):
         return self.albert.embeddings.word_embeddings
 
-    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="albert-base-v2",
@@ -894,7 +907,7 @@ class AlbertForSequenceClassification(AlbertPreTrainedModel):
 
         self.init_weights()
 
-    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="albert-base-v2",
@@ -978,7 +991,7 @@ class AlbertForTokenClassification(AlbertPreTrainedModel):
 
         self.init_weights()
 
-    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="albert-base-v2",
@@ -1061,7 +1074,7 @@ class AlbertForQuestionAnswering(AlbertPreTrainedModel):
 
         self.init_weights()
 
-    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="albert-base-v2",
@@ -1085,11 +1098,11 @@ class AlbertForQuestionAnswering(AlbertPreTrainedModel):
         r"""
         start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for position (index) of the start of the labelled span for computing the token classification loss.
-            Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
         end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for position (index) of the end of the labelled span for computing the token classification loss.
-            Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
         """
         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
@@ -1158,7 +1171,7 @@ class AlbertForMultipleChoice(AlbertPreTrainedModel):
 
         self.init_weights()
 
-    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format("(batch_size, num_choices, sequence_length)"))
+    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format("batch_size, num_choices, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="albert-base-v2",
diff --git a/src/transformers/modeling_bert.py b/src/transformers/modeling_bert.py
index e3f9cd72de..d54cab1ccb 100755
--- a/src/transformers/modeling_bert.py
+++ b/src/transformers/modeling_bert.py
@@ -619,7 +619,8 @@ class BertForPreTrainingOutput(ModelOutput):
 
     Args:
         loss (`optional`, returned when ``labels`` is provided, ``torch.FloatTensor`` of shape :obj:`(1,)`):
-            Total loss as the sum of the masked language modeling loss and the next sequence prediction (classification) loss.
+            Total loss as the sum of the masked language modeling loss and the next sequence prediction
+            (classification) loss.
         prediction_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):
             Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
         seq_relationship_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`):
@@ -646,7 +647,12 @@ class BertForPreTrainingOutput(ModelOutput):
 
 
 BERT_START_DOCSTRING = r"""
-    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.
+
+    This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
+    methods the library implements for all its model (such as downloading or saving, resizing the input embeddings,
+    pruning heads etc.)
+
+    This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
     Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
     usage and behavior.
 
@@ -658,27 +664,31 @@ BERT_START_DOCSTRING = r"""
 
 BERT_INPUTS_DOCSTRING = r"""
     Args:
-        input_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`):
+        input_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`):
             Indices of input sequence tokens in the vocabulary.
 
-            Indices can be obtained using :class:`transformers.BertTokenizer`.
-            See :func:`transformers.PreTrainedTokenizer.encode` and
-            :func:`transformers.PreTrainedTokenizer.__call__` for details.
+            Indices can be obtained using :class:`~transformers.BertTokenizer`.
+            See :meth:`transformers.PreTrainedTokenizer.encode` and
+            :meth:`transformers.PreTrainedTokenizer.__call__` for details.
 
             `What are input IDs? <../glossary.html#input-ids>`__
-        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`{0}`, `optional`):
+        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`):
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
 
             `What are attention masks? <../glossary.html#attention-mask>`__
-        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`):
+        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
             Segment token indices to indicate first and second portions of the inputs.
-            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
-            corresponds to a `sentence B` token
+            Indices are selected in ``[0, 1]``:
+
+            - 0 corresponds to a `sentence A` token,
+            - 1 corresponds to a `sentence B` token.
 
             `What are token type IDs? <../glossary.html#token-type-ids>`_
-        position_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`):
+        position_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
             Indices of positions of each input sequence tokens in the position embeddings.
             Selected in the range ``[0, config.max_position_embeddings - 1]``.
 
@@ -686,18 +696,22 @@ BERT_INPUTS_DOCSTRING = r"""
         head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
-            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
-        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
+
+            - 1 indicates the head is **not masked**,
+            - 0 indicates the head is **masked**.
+
+        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`({0}, hidden_size)`, `optional`):
             Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
-            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
+            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
+            vectors than the model's internal embedding lookup matrix.
         output_attentions (:obj:`bool`, `optional`):
-            If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
+            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
+            tensors for more detail.
         output_hidden_states (:obj:`bool`, `optional`):
-            If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
+            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
+            more detail.
         return_dict (:obj:`bool`, `optional`):
-            If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
-            plain tuple.
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
 """
 
 
@@ -710,18 +724,15 @@ class BertModel(BertPreTrainedModel):
 
     The model can behave as an encoder (with only self-attention) as well
     as a decoder, in which case a layer of cross-attention is added between
-    the self-attention layers, following the architecture described in `Attention is all you need`_ by Ashish Vaswani,
-    Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.
+    the self-attention layers, following the architecture described in `Attention is all you need
+    <https://arxiv.org/abs/1706.03762>`__ by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
+    Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.
 
     To behave as an decoder the model needs to be initialized with the
     :obj:`is_decoder` argument of the configuration set to :obj:`True`.
     To be used in a Seq2Seq model, the model needs to initialized with both :obj:`is_decoder`
     argument and :obj:`add_cross_attention` set to :obj:`True`; an
     :obj:`encoder_hidden_states` is then expected as an input to the forward pass.
-
-    .. _`Attention is all you need`:
-        https://arxiv.org/abs/1706.03762
-
     """
 
     def __init__(self, config):
@@ -748,7 +759,7 @@ class BertModel(BertPreTrainedModel):
         for layer, heads in heads_to_prune.items():
             self.encoder.layer[layer].attention.prune_heads(heads)
 
-    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="bert-base-uncased",
@@ -777,7 +788,9 @@ class BertModel(BertPreTrainedModel):
             Mask to avoid performing attention on the padding token indices of the encoder input. This mask
             is used in the cross-attention if the model is configured as a decoder.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
         """
         output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
         output_hidden_states = (
@@ -867,7 +880,7 @@ class BertForPreTraining(BertPreTrainedModel):
     def get_output_embeddings(self):
         return self.cls.predictions.decoder
 
-    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @replace_return_docstrings(output_type=BertForPreTrainingOutput, config_class=_CONFIG_FOR_DOC)
     def forward(
         self,
@@ -885,22 +898,23 @@ class BertForPreTraining(BertPreTrainedModel):
         **kwargs
     ):
         r"""
-            labels (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`):
-                Labels for computing the masked language modeling loss.
-                Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
-                Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
-                in ``[0, ..., config.vocab_size]``
-            next_sentence_label (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`):
-                Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair (see :obj:`input_ids` docstring)
-                Indices should be in ``[0, 1]``.
-                ``0`` indicates sequence B is a continuation of sequence A,
-                ``1`` indicates sequence B is a random sequence.
-            kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`):
-                Used to hide legacy arguments that have been deprecated.
+        labels (:obj:`torch.LongTensor` of shape ``(batch_size, sequence_length)``, `optional`):
+            Labels for computing the masked language modeling loss.
+            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
+            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
+            in ``[0, ..., config.vocab_size]``
+        next_sentence_label (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`):
+            Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair (see :obj:`input_ids` docstring)
+            Indices should be in ``[0, 1]``:
+
+            - 0 indicates sequence B is a continuation of sequence A,
+            - 1 indicates sequence B is a random sequence.
+        kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`):
+            Used to hide legacy arguments that have been deprecated.
 
         Returns:
 
-        Examples::
+        Example::
 
             >>> from transformers import BertTokenizer, BertForPreTraining
             >>> import torch
@@ -976,7 +990,7 @@ class BertLMHeadModel(BertPreTrainedModel):
     def get_output_embeddings(self):
         return self.cls.predictions.decoder
 
-    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @replace_return_docstrings(output_type=CausalLMOutput, config_class=_CONFIG_FOR_DOC)
     def forward(
         self,
@@ -994,19 +1008,21 @@ class BertLMHeadModel(BertPreTrainedModel):
         return_dict=None,
     ):
         r"""
-            encoder_hidden_states  (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
-                Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention
-                if the model is configured as a decoder.
-            encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
-                Mask to avoid performing attention on the padding token indices of the encoder input. This mask
-                is used in the cross-attention if the model is configured as a decoder.
-                Mask values selected in ``[0, 1]``:
-                ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
-            labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
-                Labels for computing the left-to-right language modeling loss (next word prediction).
-                Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
-                Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
-                in ``[0, ..., config.vocab_size]``
+        encoder_hidden_states  (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
+            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention
+            if the model is configured as a decoder.
+        encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+            Mask to avoid performing attention on the padding token indices of the encoder input. This mask
+            is used in the cross-attention if the model is configured as a decoder.
+            Mask values selected in ``[0, 1]``:
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
+        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+            Labels for computing the left-to-right language modeling loss (next word prediction).
+            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
+            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
+            n ``[0, ..., config.vocab_size]``
 
         Returns:
 
@@ -1092,7 +1108,7 @@ class BertForMaskedLM(BertPreTrainedModel):
     def get_output_embeddings(self):
         return self.cls.predictions.decoder
 
-    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="bert-base-uncased",
@@ -1196,7 +1212,7 @@ class BertForNextSentencePrediction(BertPreTrainedModel):
 
         self.init_weights()
 
-    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @replace_return_docstrings(output_type=NextSentencePredictorOutput, config_class=_CONFIG_FOR_DOC)
     def forward(
         self,
@@ -1212,11 +1228,12 @@ class BertForNextSentencePrediction(BertPreTrainedModel):
         return_dict=None,
     ):
         r"""
-            next_sentence_label (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
-                Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair (see ``input_ids`` docstring)
-                Indices should be in ``[0, 1]``.
-                ``0`` indicates sequence B is a continuation of sequence A,
-                ``1`` indicates sequence B is a random sequence.
+        next_sentence_label (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
+            Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair
+            (see ``input_ids`` docstring).  Indices should be in ``[0, 1]``:
+
+            - 0 indicates sequence B is a continuation of sequence A,
+            - 1 indicates sequence B is a random sequence.
 
         Returns:
 
@@ -1287,7 +1304,7 @@ class BertForSequenceClassification(BertPreTrainedModel):
 
         self.init_weights()
 
-    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="bert-base-uncased",
@@ -1370,7 +1387,7 @@ class BertForMultipleChoice(BertPreTrainedModel):
 
         self.init_weights()
 
-    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("(batch_size, num_choices, sequence_length)"))
+    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("batch_size, num_choices, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="bert-base-uncased",
@@ -1393,8 +1410,8 @@ class BertForMultipleChoice(BertPreTrainedModel):
         r"""
         labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for computing the multiple choice classification loss.
-            Indices should be in ``[0, ..., num_choices-1]`` where `num_choices` is the size of the second dimension
-            of the input tensors. (see `input_ids` above)
+            Indices should be in ``[0, ..., num_choices-1]`` where :obj:`num_choices` is the size of the second dimension
+            of the input tensors. (See :obj:`input_ids` above)
         """
         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
         num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]
@@ -1460,7 +1477,7 @@ class BertForTokenClassification(BertPreTrainedModel):
 
         self.init_weights()
 
-    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="bert-base-uncased",
@@ -1545,7 +1562,7 @@ class BertForQuestionAnswering(BertPreTrainedModel):
 
         self.init_weights()
 
-    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="bert-base-uncased",
@@ -1569,11 +1586,11 @@ class BertForQuestionAnswering(BertPreTrainedModel):
         r"""
         start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for position (index) of the start of the labelled span for computing the token classification loss.
-            Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
         end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for position (index) of the end of the labelled span for computing the token classification loss.
-            Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
         """
         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
diff --git a/src/transformers/modeling_bert_generation.py b/src/transformers/modeling_bert_generation.py
index f2f6b1dabd..9a0114e3dc 100755
--- a/src/transformers/modeling_bert_generation.py
+++ b/src/transformers/modeling_bert_generation.py
@@ -188,7 +188,12 @@ class BertGenerationPreTrainedModel(PreTrainedModel):
 
 
 BERT_GENERATION_START_DOCSTRING = r"""
-    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.
+
+    This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
+    methods the library implements for all its model (such as downloading or saving, resizing the input embeddings,
+    pruning heads etc.)
+
+    This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
     Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
     usage and behavior.
 
@@ -200,21 +205,23 @@ BERT_GENERATION_START_DOCSTRING = r"""
 
 BERT_GENERATION_INPUTS_DOCSTRING = r"""
     Args:
-        input_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`):
+        input_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`):
             Indices of input sequence tokens in the vocabulary.
 
-            Indices can be obtained using :class:`transformers.BertGenerationTokenizer`.
-            See :func:`transformers.PreTrainedTokenizer.encode` and
-            :func:`transformers.PreTrainedTokenizer.__call__` for details.
+            Indices can be obtained using :class:`~transformers.BertGenerationTokenizer`.
+            See :meth:`transformers.PreTrainedTokenizer.__call__` and
+            :meth:`transformers.PreTrainedTokenizer.encode` for details.
 
             `What are input IDs? <../glossary.html#input-ids>`__
-        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`{0}`, `optional`):
+        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`):
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
 
             `What are attention masks? <../glossary.html#attention-mask>`__
-        position_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`):
+        position_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
             Indices of positions of each input sequence tokens in the position embeddings.
             Selected in the range ``[0, config.max_position_embeddings - 1]``.
 
@@ -222,18 +229,22 @@ BERT_GENERATION_INPUTS_DOCSTRING = r"""
         head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
-            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
-        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
+
+            - 1 indicates the head is **not masked**,
+            - 0 indicates the head is **masked**.
+
+        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`({0}, hidden_size)`, `optional`):
             Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
-            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
+            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
+            vectors than the model's internal embedding lookup matrix.
         output_attentions (:obj:`bool`, `optional`):
-            If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
+            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
+            tensors for more detail.
         output_hidden_states (:obj:`bool`, `optional`):
-            If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
+            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
+            more detail.
         return_dict (:obj:`bool`, `optional`):
-            If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
-            plain tuple.
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
 """
 
 
@@ -246,10 +257,13 @@ class BertGenerationEncoder(BertGenerationPreTrainedModel):
 
     The model can behave as an encoder (with only self-attention) as well
     as a decoder, in which case a layer of cross-attention is added between
-    the self-attention layers, following the architecture described in `Attention is all you need <https://arxiv.org/abs/1706.03762>`__ by Ashish Vaswani,
-    Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.
+    the self-attention layers, following the architecture described in `Attention is all you need
+    <https://arxiv.org/abs/1706.03762>`__ by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
+    Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.
 
-    This model should be used when leveraging Bert or Roberta checkpoints for the `EncoderDecoderModel` class as described in `Leveraging Pre-trained Checkpoints for Sequence Generation Tasks <https://arxiv.org/abs/1907.12461>`__ by Sascha Rothe, Shashi Narayan, and Aliaksei Severyn.
+    This model should be used when leveraging Bert or Roberta checkpoints for the
+    :class:`~transformers.EncoderDecoderModel` class as described in `Leveraging Pre-trained Checkpoints for Sequence
+    Generation Tasks <https://arxiv.org/abs/1907.12461>`__ by Sascha Rothe, Shashi Narayan, and Aliaksei Severyn.
 
     To behave as an decoder the model needs to be initialized with the
     :obj:`is_decoder` argument of the configuration set to :obj:`True`.
@@ -281,7 +295,7 @@ class BertGenerationEncoder(BertGenerationPreTrainedModel):
         for layer, heads in heads_to_prune.items():
             self.encoder.layer[layer].attention.prune_heads(heads)
 
-    @add_start_docstrings_to_callable(BERT_GENERATION_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(BERT_GENERATION_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="google/bert_for_seq_generation_L-24_bbc_encoder",
@@ -410,7 +424,7 @@ class BertGenerationDecoder(BertGenerationPreTrainedModel):
     def get_output_embeddings(self):
         return self.lm_head.decoder
 
-    @add_start_docstrings_to_callable(BERT_GENERATION_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(BERT_GENERATION_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @replace_return_docstrings(output_type=CausalLMOutput, config_class=_CONFIG_FOR_DOC)
     def forward(
         self,
@@ -427,19 +441,21 @@ class BertGenerationDecoder(BertGenerationPreTrainedModel):
         return_dict=None,
     ):
         r"""
-            encoder_hidden_states  (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
-                Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention
-                if the model is configured as a decoder.
-            encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
-                Mask to avoid performing attention on the padding token indices of the encoder input. This mask
-                is used in the cross-attention if the model is configured as a decoder.
-                Mask values selected in ``[0, 1]``:
-                ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
-            labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
-                Labels for computing the left-to-right language modeling loss (next word prediction).
-                Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
-                Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
-                in ``[0, ..., config.vocab_size]``
+        encoder_hidden_states  (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
+            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention
+            if the model is configured as a decoder.
+        encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+            Mask to avoid performing attention on the padding token indices of the encoder input. This mask
+            is used in the cross-attention if the model is configured as a decoder.
+            Mask values selected in ``[0, 1]``:
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
+        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+            Labels for computing the left-to-right language modeling loss (next word prediction).
+            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
+            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with
+            labels in ``[0, ..., config.vocab_size]``
 
         Returns:
 
diff --git a/src/transformers/modeling_camembert.py b/src/transformers/modeling_camembert.py
index 0d5e6c2423..119f3672f6 100644
--- a/src/transformers/modeling_camembert.py
+++ b/src/transformers/modeling_camembert.py
@@ -42,7 +42,11 @@ CAMEMBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [
 
 CAMEMBERT_START_DOCSTRING = r"""
 
-    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.
+    This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
+    methods the library implements for all its model (such as downloading or saving, resizing the input embeddings,
+    pruning heads etc.)
+
+    This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
     Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
     usage and behavior.
 
diff --git a/src/transformers/modeling_ctrl.py b/src/transformers/modeling_ctrl.py
index e301d63438..3d0128c79b 100644
--- a/src/transformers/modeling_ctrl.py
+++ b/src/transformers/modeling_ctrl.py
@@ -233,7 +233,12 @@ class CTRLPreTrainedModel(PreTrainedModel):
 
 
 CTRL_START_DOCSTRING = r"""
-    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.
+
+    This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
+    methods the library implements for all its model (such as downloading or saving, resizing the input embeddings,
+    pruning heads etc.)
+
+    This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
     Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
     usage and behavior.
 
@@ -245,33 +250,38 @@ CTRL_START_DOCSTRING = r"""
 
 CTRL_INPUTS_DOCSTRING = r"""
     Args:
-        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, input_ids_length)`):
+        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
             :obj:`input_ids_length` = ``sequence_length`` if ``past_key_values`` is ``None`` else
             ``past_key_values[0].shape[-2]`` (``sequence_length`` of input past key value states).
             Indices of input sequence tokens in the vocabulary.
 
-            If ``past_key_values`` is used, only input_ids that do not have their past calculated should be passed as
+            If ``past_key_values`` is used, only input IDs that do not have their past calculated should be passed as
             ``input_ids``.
 
-            Indices can be obtained using :class:`transformers.CTRLTokenizer`.
-            See :func:`transformers.PreTrainedTokenizer.encode` and
-            :func:`transformers.PreTrainedTokenizer.__call__` for details.
+            Indices can be obtained using :class:`~transformers.CTRLTokenizer`.
+            See :meth:`transformers.PreTrainedTokenizer.__call__` and
+            :meth:`transformers.PreTrainedTokenizer.encode` for details.
 
             `What are input IDs? <../glossary.html#input-ids>`__
         past_key_values (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):
             Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
             (see ``past_key_values`` output below). Can be used to speed up sequential decoding.
-            The ``input_ids`` which have their past given to this model should not be passed as input ids as they have already been computed.
+            The ``input_ids`` which have their past given to this model should not be passed as input ids as they have
+            already been computed.
         attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
 
             `What are attention masks? <../glossary.html#attention-mask>`__
         token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
             Segment token indices to indicate first and second portions of the inputs.
-            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
-            corresponds to a `sentence B` token
+            Indices are selected in ``[0, 1]``:
+
+            - 0 corresponds to a `sentence A` token,
+            - 1 corresponds to a `sentence B` token.
 
             `What are token type IDs? <../glossary.html#token-type-ids>`_
         position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
@@ -282,21 +292,25 @@ CTRL_INPUTS_DOCSTRING = r"""
         head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
-            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
+
+            - 1 indicates the head is **not masked**,
+            - 0 indicates the head is **masked**.
+
         inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
-            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
-            If ``past_key_values`` is used, optionally only the last `inputs_embeds` have to be input (see ``past_key_values``).
-        use_cache (:obj:`bool`):
-            If `use_cache` is True, ``past_key_values`` key value states are returned and
-            can be used to speed up decoding (see ``past_key_values``). Defaults to `True`.
+            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
+            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
+            vectors than the model's internal embedding lookup matrix.
+        use_cache (:obj:`bool`, `optional`):
+            If set to :obj:`True`, ``past_key_values`` key value states are returned and can be used to speed up
+            decoding (see ``past_key_values``).
         output_attentions (:obj:`bool`, `optional`):
-            If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
+            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
+            tensors for more detail.
         output_hidden_states (:obj:`bool`, `optional`):
-            If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
+            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
+            more detail.
         return_dict (:obj:`bool`, `optional`):
-            If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
-            plain tuple.
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
 """
 
 
diff --git a/src/transformers/modeling_distilbert.py b/src/transformers/modeling_distilbert.py
index 87611f5ac8..cec34a0953 100755
--- a/src/transformers/modeling_distilbert.py
+++ b/src/transformers/modeling_distilbert.py
@@ -372,7 +372,11 @@ class DistilBertPreTrainedModel(PreTrainedModel):
 
 DISTILBERT_START_DOCSTRING = r"""
 
-    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.
+    This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
+    methods the library implements for all its model (such as downloading or saving, resizing the input embeddings,
+    pruning heads etc.)
+
+    This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
     Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
     usage and behavior.
 
@@ -384,35 +388,41 @@ DISTILBERT_START_DOCSTRING = r"""
 
 DISTILBERT_INPUTS_DOCSTRING = r"""
     Args:
-        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
+        input_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`):
             Indices of input sequence tokens in the vocabulary.
 
-            Indices can be obtained using :class:`transformers.DistilBertTokenizer`.
-            See :func:`transformers.PreTrainedTokenizer.encode` and
-            :func:`transformers.PreTrainedTokenizer.__call__` for details.
+            Indices can be obtained using :class:`~transformers.DistilBertTokenizer`.
+            See :meth:`transformers.PreTrainedTokenizer.encode` and
+            :meth:`transformers.PreTrainedTokenizer.__call__` for details.
 
             `What are input IDs? <../glossary.html#input-ids>`__
-        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`):
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
 
             `What are attention masks? <../glossary.html#attention-mask>`__
         head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
-            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
-        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
+
+            - 1 indicates the head is **not masked**,
+            - 0 indicates the head is **masked**.
+
+        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`({0}, hidden_size)`, `optional`):
             Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
-            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
+            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
+            vectors than the model's internal embedding lookup matrix.
         output_attentions (:obj:`bool`, `optional`):
-            If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
+            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
+            tensors for more detail.
         output_hidden_states (:obj:`bool`, `optional`):
-            If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
+            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
+            more detail.
         return_dict (:obj:`bool`, `optional`):
-            If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
-            plain tuple.
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
 """
 
 
@@ -443,7 +453,7 @@ class DistilBertModel(DistilBertPreTrainedModel):
         for layer, heads in heads_to_prune.items():
             self.transformer.layer[layer].attention.prune_heads(heads)
 
-    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING.format("batch_size, num_choices"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="distilbert-base-uncased",
@@ -516,7 +526,7 @@ class DistilBertForMaskedLM(DistilBertPreTrainedModel):
     def get_output_embeddings(self):
         return self.vocab_projector
 
-    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING.format("batch_size, num_choices"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="distilbert-base-uncased",
@@ -539,8 +549,8 @@ class DistilBertForMaskedLM(DistilBertPreTrainedModel):
         labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
             Labels for computing the masked language modeling loss.
             Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
-            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
-            in ``[0, ..., config.vocab_size]``
+            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with
+            labels in ``[0, ..., config.vocab_size]``.
         kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`):
             Used to hide legacy arguments that have been deprecated.
         """
@@ -601,7 +611,7 @@ class DistilBertForSequenceClassification(DistilBertPreTrainedModel):
 
         self.init_weights()
 
-    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING.format("batch_size, num_choices"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="distilbert-base-uncased",
@@ -681,7 +691,7 @@ class DistilBertForQuestionAnswering(DistilBertPreTrainedModel):
 
         self.init_weights()
 
-    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING.format("batch_size, num_choices"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="distilbert-base-uncased",
@@ -703,11 +713,11 @@ class DistilBertForQuestionAnswering(DistilBertPreTrainedModel):
         r"""
         start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for position (index) of the start of the labelled span for computing the token classification loss.
-            Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
         end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for position (index) of the end of the labelled span for computing the token classification loss.
-            Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
         """
         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
@@ -857,7 +867,7 @@ class DistilBertForMultipleChoice(DistilBertPreTrainedModel):
 
         self.init_weights()
 
-    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING.format("(batch_size, num_choices, sequence_length)"))
+    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING.format("batch_size, num_choices, sequence_length"))
     @replace_return_docstrings(output_type=MultipleChoiceModelOutput, config_class=_CONFIG_FOR_DOC)
     def forward(
         self,
@@ -871,10 +881,10 @@ class DistilBertForMultipleChoice(DistilBertPreTrainedModel):
         return_dict=None,
     ):
         r"""
-            labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
-                Labels for computing the multiple choice classification loss.
-                Indices should be in ``[0, ..., num_choices-1]`` where `num_choices` is the size of the second dimension
-                of the input tensors. (see `input_ids` above)
+        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
+            Labels for computing the multiple choice classification loss.
+            Indices should be in ``[0, ..., num_choices-1]`` where :obj:`num_choices` is the size of the second dimension
+            of the input tensors. (See :obj:`input_ids` above)
 
         Returns:
 
diff --git a/src/transformers/modeling_dpr.py b/src/transformers/modeling_dpr.py
index d7e14d4ac0..6230e45ce7 100644
--- a/src/transformers/modeling_dpr.py
+++ b/src/transformers/modeling_dpr.py
@@ -315,7 +315,11 @@ class DPRPretrainedReader(PreTrainedModel):
 
 DPR_START_DOCSTRING = r"""
 
-    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.
+    This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
+    methods the library implements for all its model (such as downloading or saving, resizing the input embeddings,
+    pruning heads etc.)
+
+    This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
     Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
     usage and behavior.
 
@@ -327,9 +331,9 @@ DPR_START_DOCSTRING = r"""
 
 DPR_ENCODERS_INPUTS_DOCSTRING = r"""
     Args:
-        input_ids: (:obj:``torch.LongTensor`` of shape ``(batch_size, sequence_length)``):
+        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
             Indices of input sequence tokens in the vocabulary.
-            To match pre-training, DPR input sequence should be formatted with [CLS] and [SEP] tokens as follows:
+            To match pretraining, DPR input sequence should be formatted with [CLS] and [SEP] tokens as follows:
 
             (a) For sequence pairs (for a pair title+text for example):
 
@@ -346,57 +350,74 @@ DPR_ENCODERS_INPUTS_DOCSTRING = r"""
             DPR is a model with absolute position embeddings so it's usually advised to pad the inputs on
             the right rather than the left.
 
-            Indices can be obtained using :class:`transformers.DPRTokenizer`.
-            See :func:`transformers.PreTrainedTokenizer.encode` and
-            :func:`transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details.
-        attention_mask: (:obj:``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``, `optional`):
+            Indices can be obtained using :class:`~transformers.DPRTokenizer`.
+            See :meth:`transformers.PreTrainedTokenizer.encode` and
+            :meth:`transformers.PreTrainedTokenizer.__call__` for details.
+        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
-        token_type_ids: (:obj:``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`):
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
+
+            `What are attention masks? <../glossary.html#attention-mask>`__
+        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
             Segment token indices to indicate first and second portions of the inputs.
-            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
-            corresponds to a `sentence B` token
+            Indices are selected in ``[0, 1]``:
+
+            - 0 corresponds to a `sentence A` token,
+            - 1 corresponds to a `sentence B` token.
+
+            `What are token type IDs? <../glossary.html#token-type-ids>`_
         inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
             Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
-            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
+            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
+            vectors than the model's internal embedding lookup matrix.
         output_attentions (:obj:`bool`, `optional`):
-            If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
+            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
+            tensors for more detail.
         output_hidden_states (:obj:`bool`, `optional`):
-            If set to ``True``, the hidden states tensors of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
+            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
+            more detail.
         return_dict (:obj:`bool`, `optional`):
-            If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
-            plain tuple.
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
 """
 
 DPR_READER_INPUTS_DOCSTRING = r"""
     Args:
-        input_ids: (:obj:``torch.LongTensor`` of shape ``(n_passages, sequence_length)``):
+        input_ids: (:obj:`Tuple[torch.LongTensor]` of shapes :obj:`(n_passages, sequence_length)`):
             Indices of input sequence tokens in the vocabulary.
             It has to be a sequence triplet with 1) the question and 2) the passages titles and 3) the passages texts
-            To match pre-training, DPR `input_ids` sequence should be formatted with [CLS] and [SEP] with the format:
+            To match pretraining, DPR :obj:`input_ids` sequence should be formatted with [CLS] and [SEP] with the
+            format:
 
-                [CLS] <question token ids> [SEP] <titles ids> [SEP] <texts ids>
+                ``[CLS] <question token ids> [SEP] <titles ids> [SEP] <texts ids>``
 
             DPR is a model with absolute position embeddings so it's usually advised to pad the inputs on
             the right rather than the left.
 
-            Indices can be obtained using :class:`transformers.DPRReaderTokenizer`.
-            See :class:`transformers.DPRReaderTokenizer` for more details
-        attention_mask: (:obj:torch.FloatTensor``, of shape ``(n_passages, sequence_length)``, `optional`:
+            Indices can be obtained using :class:`~transformers.DPRReaderTokenizer`. See this class documentation for
+            more details.
+        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(n_passages, sequence_length)`, `optional`):
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
+
+            `What are attention masks? <../glossary.html#attention-mask>`__
         inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(n_passages, sequence_length, hidden_size)`, `optional`):
             Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
-            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
+            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
+            vectors than the model's internal embedding lookup matrix.
         output_attentions (:obj:`bool`, `optional`):
-            If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
+            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
+            tensors for more detail.
         output_hidden_states (:obj:`bool`, `optional`):
-            If set to ``True``, the hidden states tensors of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
+            Whether or not to rturn the hidden states of all layers. See ``hidden_states`` under returned tensors for
+            more detail.
         return_dict (:obj:`bool`, `optional`):
-            If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
-            plain tuple.
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
 """
 
 
diff --git a/src/transformers/modeling_electra.py b/src/transformers/modeling_electra.py
index 73f0a16fea..af797d2006 100644
--- a/src/transformers/modeling_electra.py
+++ b/src/transformers/modeling_electra.py
@@ -218,7 +218,12 @@ class ElectraForPreTrainingOutput(ModelOutput):
 
 
 ELECTRA_START_DOCSTRING = r"""
-    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.
+
+    This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
+    methods the library implements for all its model (such as downloading or saving, resizing the input embeddings,
+    pruning heads etc.)
+
+    This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
     Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
     usage and behavior.
 
@@ -230,27 +235,31 @@ ELECTRA_START_DOCSTRING = r"""
 
 ELECTRA_INPUTS_DOCSTRING = r"""
     Args:
-        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
+        input_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`):
             Indices of input sequence tokens in the vocabulary.
 
-            Indices can be obtained using :class:`transformers.ElectraTokenizer`.
-            See :func:`transformers.PreTrainedTokenizer.encode` and
-            :func:`transformers.PreTrainedTokenizer.__call__` for details.
+            Indices can be obtained using :class:`~transformers.ElectraTokenizer`.
+            See :meth:`transformers.PreTrainedTokenizer.encode` and
+            :meth:`transformers.PreTrainedTokenizer.__call__` for details.
 
             `What are input IDs? <../glossary.html#input-ids>`__
-        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`):
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
 
             `What are attention masks? <../glossary.html#attention-mask>`__
-        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
             Segment token indices to indicate first and second portions of the inputs.
-            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
-            corresponds to a `sentence B` token
+            Indices are selected in ``[0, 1]``:
+
+            - 0 corresponds to a `sentence A` token,
+            - 1 corresponds to a `sentence B` token.
 
             `What are token type IDs? <../glossary.html#token-type-ids>`_
-        position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+        position_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
             Indices of positions of each input sequence tokens in the position embeddings.
             Selected in the range ``[0, config.max_position_embeddings - 1]``.
 
@@ -258,26 +267,33 @@ ELECTRA_INPUTS_DOCSTRING = r"""
         head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
-            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
-        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
+
+            - 1 indicates the head is **not masked**,
+            - 0 indicates the head is **masked**.
+
+        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`({0}, hidden_size)`, `optional`):
             Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
-            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
-        encoder_hidden_states  (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
+            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
+            vectors than the model's internal embedding lookup matrix.
+        encoder_hidden_states  (:obj:`torch.FloatTensor` of shape :obj:`({0}, hidden_size)`, `optional`):
             Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention
             if the model is configured as a decoder.
-        encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+        encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`):
             Mask to avoid performing attention on the padding token indices of the encoder input. This mask
             is used in the cross-attention if the model is configured as a decoder.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - 1 indicates the head is **not masked**,
+            - 0 indicates the head is **masked**.
+
         output_attentions (:obj:`bool`, `optional`):
-            If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
+            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
+            tensors for more detail.
         output_hidden_states (:obj:`bool`, `optional`):
-            If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
+            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
+            more detail.
         return_dict (:obj:`bool`, `optional`):
-            If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
-            plain tuple.
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
 """
 
 
@@ -318,7 +334,7 @@ class ElectraModel(ElectraPreTrainedModel):
         for layer, heads in heads_to_prune.items():
             self.encoder.layer[layer].attention.prune_heads(heads)
 
-    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="google/electra-small-discriminator",
@@ -414,7 +430,7 @@ class ElectraForSequenceClassification(ElectraPreTrainedModel):
 
         self.init_weights()
 
-    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="google/electra-small-discriminator",
@@ -496,7 +512,7 @@ class ElectraForPreTraining(ElectraPreTrainedModel):
         self.discriminator_predictions = ElectraDiscriminatorPredictions(config)
         self.init_weights()
 
-    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @replace_return_docstrings(output_type=ElectraForPreTrainingOutput, config_class=_CONFIG_FOR_DOC)
     def forward(
         self,
@@ -512,11 +528,12 @@ class ElectraForPreTraining(ElectraPreTrainedModel):
         return_dict=None,
     ):
         r"""
-            labels (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`):
-                Labels for computing the ELECTRA loss. Input should be a sequence of tokens (see :obj:`input_ids` docstring)
-                Indices should be in ``[0, 1]``.
-                ``0`` indicates the token is an original token,
-                ``1`` indicates the token was replaced.
+        labels (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`):
+            Labels for computing the ELECTRA loss. Input should be a sequence of tokens (see :obj:`input_ids` docstring)
+            Indices should be in ``[0, 1]``:
+
+            - 0 indicates the token is an original token,
+            - 1 indicates the token was replaced.
 
         Returns:
 
@@ -592,7 +609,7 @@ class ElectraForMaskedLM(ElectraPreTrainedModel):
     def get_output_embeddings(self):
         return self.generator_lm_head
 
-    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="google/electra-small-discriminator",
@@ -681,7 +698,7 @@ class ElectraForTokenClassification(ElectraPreTrainedModel):
         self.classifier = nn.Linear(config.hidden_size, config.num_labels)
         self.init_weights()
 
-    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="google/electra-small-discriminator",
@@ -767,7 +784,7 @@ class ElectraForQuestionAnswering(ElectraPreTrainedModel):
 
         self.init_weights()
 
-    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="google/electra-small-discriminator",
@@ -791,11 +808,11 @@ class ElectraForQuestionAnswering(ElectraPreTrainedModel):
         r"""
         start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for position (index) of the start of the labelled span for computing the token classification loss.
-            Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
         end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for position (index) of the end of the labelled span for computing the token classification loss.
-            Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
         """
         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
@@ -866,7 +883,7 @@ class ElectraForMultipleChoice(ElectraPreTrainedModel):
 
         self.init_weights()
 
-    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING.format("(batch_size, num_choices, sequence_length)"))
+    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING.format("batch_size, num_choices, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="google/electra-small-discriminator",
@@ -889,8 +906,8 @@ class ElectraForMultipleChoice(ElectraPreTrainedModel):
         r"""
         labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for computing the multiple choice classification loss.
-            Indices should be in ``[0, ..., num_choices-1]`` where `num_choices` is the size of the second dimension
-            of the input tensors. (see `input_ids` above)
+            Indices should be in ``[0, ..., num_choices-1]`` where :obj:`num_choices` is the size of the second dimension
+            of the input tensors. (See :obj:`input_ids` above)
         """
         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
         num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]
diff --git a/src/transformers/modeling_encoder_decoder.py b/src/transformers/modeling_encoder_decoder.py
index 65ccb763f3..375822528d 100644
--- a/src/transformers/modeling_encoder_decoder.py
+++ b/src/transformers/modeling_encoder_decoder.py
@@ -30,16 +30,28 @@ logger = logging.get_logger(__name__)
 _CONFIG_FOR_DOC = "EncoderDecoderConfig"
 
 ENCODER_DECODER_START_DOCSTRING = r"""
-    This class can be used to inialize a sequence-to-sequnece model with any pretrained autoencoding model as the encoder and any pretrained autoregressive model as the decoder. The encoder is loaded via :meth:`~transformers.AutoModel.from_pretrained` function and the decoder is loaded via :meth:`~transformers.AutoModelForCausalLM.from_pretrained` function.
-    Cross-attention layers are automatically added to the decoder and should be fine-tuned on a downstream generative task, *i.e.* summarization.
+    This class can be used to inialize a sequence-to-sequnece model with any pretrained autoencoding model as the
+    encoder and any pretrained autoregressive model as the decoder. The encoder is loaded via
+    :meth:`~transformers.AutoModel.from_pretrained` function and the decoder is loaded via
+    :meth:`~transformers.AutoModelForCausalLM.from_pretrained` function.
+    Cross-attention layers are automatically added to the decoder and should be fine-tuned on a downstream generative
+    task, like summarization.
 
-    The effectiveness of initializing sequence-to-sequence models with pre-trained checkpoints for sequence generation tasks was shown in `Leveraging Pre-trained Checkpoints for Sequence Generation Tasks <https://arxiv.org/abs/1907.12461>`__ by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
-    Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu.
+    The effectiveness of initializing sequence-to-sequence models with pretrained checkpoints for sequence generation
+    tasks was shown in `Leveraging Pre-trained Checkpoints for Sequence Generation Tasks
+    <https://arxiv.org/abs/1907.12461>`__ by Sascha Rothe, Shashi Narayan, Aliaksei Severyn. Michael Matena, Yanqi
+    Zhou, Wei Li, Peter J. Liu.
 
-    After such an Encoder Decoder model has been trained / fine-tuned, it can be saved / loaded just like any other models (see Examples for more information).
+    After such an Encoder Decoder model has been trained/fine-tuned, it can be saved/loaded just like any other models
+    (see the examples for more information).
 
-    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#module>`__ sub-class. Use it as a
-    regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
+    This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
+    methods the library implements for all its model (such as downloading or saving, resizing the input embeddings,
+    pruning heads etc.)
+
+    This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
+    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
+    usage and behavior.
 
     Parameters:
         config (:class:`~transformers.T5Config`): Model configuration class with all the parameters of the model.
@@ -50,38 +62,47 @@ ENCODER_DECODER_START_DOCSTRING = r"""
 ENCODER_DECODER_INPUTS_DOCSTRING = r"""
     Args:
         input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
-            Indices of input sequence tokens in the vocabulary for the encoder.
-            Indices can be obtained using :class:`~transformers.PretrainedTokenizer`.
-            See :meth:`~transformers.PreTrainedTokenizer.encode` and
-            :meth:`~transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details.
+            Indices of input sequence tokens in the vocabulary.
+
+            Indices can be obtained using :class:`~transformers.PreTrainedTokenizer`.
+            See :meth:`transformers.PreTrainedTokenizer.encode` and
+            :meth:`transformers.PreTrainedTokenizer.__call__` for details.
+
+            `What are input IDs? <../glossary.html#input-ids>`__
         inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
             Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
-            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
+            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
+            vectors than the model's internal embedding lookup matrix.
         attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
-            Mask to avoid performing attention on padding token indices for the encoder.
+            Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
+
+            `What are attention masks? <../glossary.html#attention-mask>`__
         encoder_outputs (:obj:`tuple(torch.FloatTensor)`, `optional`):
             This tuple must consist of (:obj:`last_hidden_state`, `optional`: :obj:`hidden_states`, `optional`: :obj:`attentions`)
-            `last_hidden_state` (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`) is a tensor of hidden-states at the output of the last layer of the encoder.
+            :obj:`last_hidden_state` (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`)
+            is a tensor of hidden-states at the output of the last layer of the encoder.
             Used in the cross-attention of the decoder.
         decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`):
             Provide for sequence to sequence training to the decoder.
-            Indices can be obtained using :class:`transformers.PretrainedTokenizer`.
-            See :func:`transformers.PreTrainedTokenizer.encode` and
-            :func:`transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details.
+            Indices can be obtained using :class:`~transformers.PretrainedTokenizer`.
+            See :meth:`transformers.PreTrainedTokenizer.encode` and
+            :meth:`transformers.PreTrainedTokenizer.__call__` for details.
         decoder_attention_mask (:obj:`torch.BoolTensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`):
-            Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.
+            Default behavior: generate a tensor that ignores pad tokens in :obj:`decoder_input_ids`. Causal mask will
+            also be used by default.
         decoder_inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, target_sequence_length, hidden_size)`, `optional`):
-            Optionally, instead of passing :obj:`decoder_input_ids` you can choose to directly pass an embedded representation.
-            This is useful if you want more control over how to convert `decoder_input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
+            Optionally, instead of passing :obj:`decoder_input_ids` you can choose to directly pass an embedded
+            representation. This is useful if you want more control over how to convert :obj:`decoder_input_ids`
+            indices into associated vectors than the model's internal embedding lookup matrix.
         labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
             Labels for computing the masked language modeling loss for the decoder.
             Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
-            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
-            in ``[0, ..., config.vocab_size]``
+            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with
+            labels in ``[0, ..., config.vocab_size]``
         return_dict (:obj:`bool`, `optional`):
             If set to ``True``, the model will return a :class:`~transformers.file_utils.Seq2SeqLMOutput` instead of a
             plain tuple.
@@ -97,8 +118,8 @@ class EncoderDecoderModel(PreTrainedModel):
     :class:`~transformers.EncoderDecoder` is a generic model class that will be
     instantiated as a transformer architecture with one of the base model
     classes of the library as encoder and another one as
-    decoder when created with the `AutoModel.from_pretrained(pretrained_model_name_or_path)`
-    class method for the encoder and `AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path)` class method for the decoder.
+    decoder when created with the :meth`~transformers.AutoModel.from_pretrained` class method for the encoder and
+    :meth`~transformers.AutoModelForCausalLM.from_pretrained` class method for the decoder.
     """
     config_class = EncoderDecoderConfig
     base_model_prefix = "encoder_decoder"
@@ -169,40 +190,57 @@ class EncoderDecoderModel(PreTrainedModel):
         *model_args,
         **kwargs
     ) -> PreTrainedModel:
-        r"""Instantiates an encoder and a decoder from one or two base classes of the library from pre-trained model checkpoints.
+        r"""
+        Instantiate an encoder and a decoder from one or two base classes of the library from pretrained model
+        checkpoints.
 
 
-        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated).
-        To train the model, you need to first set it back in training mode with `model.train()`.
+        The model is set in evaluation mode by default using :obj:`model.eval()` (Dropout modules are deactivated).
+        To train the model, you need to first set it back in training mode with :obj:`model.train()`.
 
         Params:
-            encoder_pretrained_model_name_or_path (:obj: `str`, `optional`, defaults to `None`):
-                information necessary to initiate the encoder. Either:
+            encoder_pretrained_model_name_or_path (:obj: `str`, `optional`):
+                Information necessary to initiate the encoder. Can be either:
 
-                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.
-                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.
-                - a path to a `directory` containing model weights saved using :func:`~transformers.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/encoder``.
-                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.
+                    - A string with the `shortcut name` of a pretrained model to load from cache or download, e.g.,
+                      ``bert-base-uncased``.
+                    - A string with the `identifier name` of a pretrained model that was user-uploaded to our S3, e.g.,
+                      ``dbmdz/bert-base-german-cased``.
+                    - A path to a `directory` containing model weights saved using
+                      :func:`~transformers.PreTrainedModel.save_pretrained`, e.g., ``./my_model_directory/``.
+                    - A path or url to a `tensorflow index checkpoint file` (e.g, ``./tf_model/model.ckpt.index``). In
+                      this case, ``from_tf`` should be set to :obj:`True` and a configuration object should be provided
+                      as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in
+                      a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.
 
             decoder_pretrained_model_name_or_path (:obj: `str`, `optional`, defaults to `None`):
-                information necessary to initiate the decoder. Either:
+                Information necessary to initiate the decoder. Can be either:
 
-                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.
-                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.
-                - a path to a `directory` containing model weights saved using :func:`~transformers.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/decoder``.
-                - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.
+                    - A string with the `shortcut name` of a pretrained model to load from cache or download, e.g.,
+                      ``bert-base-uncased``.
+                    - A string with the `identifier name` of a pretrained model that was user-uploaded to our S3, e.g.,
+                      ``dbmdz/bert-base-german-cased``.
+                    - A path to a `directory` containing model weights saved using
+                      :func:`~transformers.PreTrainedModel.save_pretrained`, e.g., ``./my_model_directory/``.
+                    - A path or url to a `tensorflow index checkpoint file` (e.g, ``./tf_model/model.ckpt.index``). In
+                      this case, ``from_tf`` should be set to :obj:`True` and a configuration object should be provided
+                      as ``config`` argument. This loading path is slower than converting the TensorFlow checkpoint in
+                      a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.
 
-            model_args: (`optional`) Sequence of positional arguments:
-                All remaning positional arguments will be passed to the underlying model's ``__init__`` method
+            model_args (remaining positional arguments, `optional`):
+                All remaning positional arguments will be passed to the underlying model's ``__init__`` method.
 
-            kwargs: (`optional`) Remaining dictionary of keyword arguments.
-                Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attentions=True``).
-                - To update the encoder configuration, use the prefix `encoder_` for each configuration parameter
-                - To update the decoder configuration, use the prefix `decoder_` for each configuration parameter
-                - To update the parent model configuration, do not use a prefix for each configuration parameter
-                Behave differently depending on whether a :obj:`config` is provided or automatically loaded.
+            kwargs (remaining dictionary of keyword arguments, `optional`):
+                Can be used to update the configuration object (after it being loaded) and initiate the model (e.g.,
+                :obj:`output_attentions=True`).
 
-        Examples::
+                - To update the encoder configuration, use the prefix `encoder_` for each configuration parameter.
+                - To update the decoder configuration, use the prefix `decoder_` for each configuration parameter.
+                - To update the parent model configuration, do not use a prefix for each configuration parameter.
+
+                Behaves differently depending on whether a :obj:`config` is provided or automatically loaded.
+
+        Example::
 
             >>> from transformers import EncoderDecoderModel
             >>> # initialize a bert2bert from two pretrained BERT models. Note that the cross-attention layers will be randomly initialized
diff --git a/src/transformers/modeling_flaubert.py b/src/transformers/modeling_flaubert.py
index f4b1796333..a8fbb06105 100644
--- a/src/transformers/modeling_flaubert.py
+++ b/src/transformers/modeling_flaubert.py
@@ -52,7 +52,11 @@ FLAUBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [
 
 FLAUBERT_START_DOCSTRING = r"""
 
-    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.
+    This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
+    methods the library implements for all its model (such as downloading or saving, resizing the input embeddings,
+    pruning heads etc.)
+
+    This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
     Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
     usage and behavior.
 
@@ -67,21 +71,25 @@ FLAUBERT_INPUTS_DOCSTRING = r"""
         input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
             Indices of input sequence tokens in the vocabulary.
 
-            Indices can be obtained using :class:`transformers.BertTokenizer`.
-            See :func:`transformers.PreTrainedTokenizer.encode` and
-            :func:`transformers.PreTrainedTokenizer.__call__` for details.
+            Indices can be obtained using :class:`~transformers.FlaubertTokenizer`.
+            See :meth:`transformers.PreTrainedTokenizer.encode` and
+            :meth:`transformers.PreTrainedTokenizer.__call__` for details.
 
             `What are input IDs? <../glossary.html#input-ids>`__
         attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
 
             `What are attention masks? <../glossary.html#attention-mask>`__
         token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
             Segment token indices to indicate first and second portions of the inputs.
-            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
-            corresponds to a `sentence B` token
+            Indices are selected in ``[0, 1]``:
+
+            - 0 corresponds to a `sentence A` token,
+            - 1 corresponds to a `sentence B` token.
 
             `What are token type IDs? <../glossary.html#token-type-ids>`_
         position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
@@ -91,28 +99,32 @@ FLAUBERT_INPUTS_DOCSTRING = r"""
             `What are position IDs? <../glossary.html#position-ids>`_
         lengths (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
             Length of each sentence that can be used to avoid performing attention on padding token indices.
-            You can also use `attention_mask` for the same result (see above), kept here for compatbility.
+            You can also use :obj:`attention_mask` for the same result (see above), kept here for compatbility.
             Indices selected in ``[0, ..., input_ids.size(-1)]``:
         cache (:obj:`Dict[str, torch.FloatTensor]`, `optional`):
-            dictionary with ``torch.FloatTensor`` that contains pre-computed
+            Dictionary strings to ``torch.FloatTensor`` that contains precomputed
             hidden-states (key and values in the attention blocks) as computed by the model
-            (see `cache` output below). Can be used to speed up sequential decoding.
+            (see :obj:`cache` output below). Can be used to speed up sequential decoding.
             The dictionary object will be modified in-place during the forward pass to add newly computed hidden-states.
         head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
-            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
+
+            - 1 indicates the head is **not masked**,
+            - 0 indicates the head is **masked**.
+
         inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
             Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
-            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
+            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
+            vectors than the model's internal embedding lookup matrix.
         output_attentions (:obj:`bool`, `optional`):
-            If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
+            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
+            tensors for more detail.
         output_hidden_states (:obj:`bool`, `optional`):
-            If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
+            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
+            more detail.
         return_dict (:obj:`bool`, `optional`):
-            If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
-            plain tuple.
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
 """
 
 
@@ -372,8 +384,8 @@ class FlaubertForQuestionAnsweringSimple(XLMForQuestionAnsweringSimple):
 
 
 @add_start_docstrings(
-    """Flaubert Model with a beam-search span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of
-    the hidden-states output to compute `span start logits` and `span end logits`). """,
+    """Flaubert Model with a beam-search span classification head on top for extractive question-answering tasks like
+    SQuAD (a linear layers on top of the hidden-states output to compute `span start logits` and `span end logits`). """,
     FLAUBERT_START_DOCSTRING,
 )
 class FlaubertForQuestionAnswering(XLMForQuestionAnswering):
diff --git a/src/transformers/modeling_fsmt.py b/src/transformers/modeling_fsmt.py
index 73b9b4292f..1bffa2f577 100644
--- a/src/transformers/modeling_fsmt.py
+++ b/src/transformers/modeling_fsmt.py
@@ -176,8 +176,13 @@ PYTHONPATH="src:examples/seq2seq" python examples/seq2seq/run_eval.py facebook/w
 
 FSMT_START_DOCSTRING = r"""
 
-    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class. Use it as a regular PyTorch Module and
-    refer to the PyTorch documentation for all matters related to general usage and behavior.
+    This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
+    methods the library implements for all its model (such as downloading or saving, resizing the input embeddings,
+    pruning heads etc.)
+
+    This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
+    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
+    usage and behavior.
 
     Parameters:
         config (:class:`~transformers.FSMTConfig`): Model configuration class with all the parameters of the model.
@@ -207,39 +212,52 @@ FSMT_GENERATION_EXAMPLE = r"""
 FSMT_INPUTS_DOCSTRING = r"""
     Args:
         input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
-               Indices of input sequence tokens in the vocabulary. Use FSMTTokenizer.encode to produce them.
-            Padding will be ignored by default should you provide it.
-            Indices can be obtained using :class:`transformers.FSMTTokenizer.encode(text)`.
+            Indices of input sequence tokens in the vocabulary.
+
+            IIndices can be obtained using :class:`~transformers.FSTMTokenizer`.
+            See :meth:`transformers.PreTrainedTokenizer.encode` and
+            :meth:`transformers.PreTrainedTokenizer.__call__` for details.
+
+            `What are input IDs? <../glossary.html#input-ids>`__
         attention_mask (:obj:`torch.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
-            Mask to avoid performing attention on padding token indices in input_ids.
+            Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
-        encoder_outputs (:obj:`tuple(tuple(torch.FloatTensor)`, `optional`):
-            Tuple consists of (`last_hidden_state`, `optional`: `hidden_states`, `optional`: `attentions`)
-            `last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`) is a sequence of hidden-states at the output of the last layer of the encoder.
-            Used in the cross-attention of the decoder.
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
+
+            `What are attention masks? <../glossary.html#attention-mask>`__
+        encoder_outputs (:obj:`Tuple(torch.FloatTensor)`, `optional`):
+            Tuple consists of (:obj:`last_hidden_state`, `optional`: :obj:`hidden_states`, `optional`: :obj:`attentions`)
+            :obj:`last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)` is a sequence of
+            hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
         decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`):
-            Provide for translation and summarization training. By default, the model will create this tensor by shifting the input_ids right, following the paper.
+            Provide for translation and summarization training. By default, the model will create this tensor by
+            shifting the input_ids right, following the paper.
         decoder_attention_mask (:obj:`torch.BoolTensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`):
-            Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.
-            If you want to change padding behavior, you should read :func:`~transformers.modeling_fairseqtranslator._prepare_decoder_inputs` and modify.
+            Default behavior: generate a tensor that ignores pad tokens in :obj:`decoder_input_ids`. Causal mask will
+            also be used by default.
+            If you want to change padding behavior, you should read
+            :func:`modeling_fstm._prepare_fstm_decoder_inputs` and modify.
             See diagram 1 in the paper for more info on the default strategy
-        past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
-            Contains pre-computed key and value hidden-states of the attention blocks.
+        past_key_values (:obj:`Tuple(torch.FloatTensor)` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
+            Contains precomputed key and value hidden-states of the attention blocks.
             Can be used to speed up decoding.
-            If ``past_key_values`` are used, the user can optionally input only the last
-            ``decoder_input_ids`` (those that don't have their past key value states given to this model) of shape
-            :obj:`(batch_size, 1)` instead of all ``decoder_input_ids`` of shape :obj:`(batch_size, sequence_length)`.
+            If :obj:`past_key_values` are used, the user can optionally input only the last
+            :obj:`decoder_input_ids` (those that don't have their past key value states given to this model) of shape
+            :obj:`(batch_size, 1)` instead of all :obj:`decoder_input_ids` of shape
+            :obj:`(batch_size, sequence_length)`.
         use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
-            If `use_cache` is True, ``past_key_values`` are returned and can be used to speed up decoding (see
-            ``past_key_values``).
+            If set to :obj:`True`, ``past_key_values`` key value states are returned and can be used to speed up
+            decoding (see ``past_key_values``).
         output_attentions (:obj:`bool`, `optional`):
-            If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
+            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
+            tensors for more detail.
         output_hidden_states (:obj:`bool`, `optional`):
-            If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
+            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
+            more detail.
         return_dict (:obj:`bool`, `optional`):
-            If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
-            plain tuple.
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
 """
 
 
diff --git a/src/transformers/modeling_funnel.py b/src/transformers/modeling_funnel.py
index 9e4e40c5db..30302c4134 100644
--- a/src/transformers/modeling_funnel.py
+++ b/src/transformers/modeling_funnel.py
@@ -826,11 +826,17 @@ class FunnelForPreTrainingOutput(ModelOutput):
     attentions: Optional[Tuple[torch.FloatTensor]] = None
 
 
-FUNNEL_START_DOCSTRING = r"""    The Funnel Transformer model was proposed in
+FUNNEL_START_DOCSTRING = r"""
+
+    The Funnel Transformer model was proposed in
     `Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing
     <https://arxiv.org/abs/2006.03236>`__ by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
 
-    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.
+    This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
+    methods the library implements for all its model (such as downloading or saving, resizing the input embeddings,
+    pruning heads etc.)
+
+    This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
     Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
     usage and behavior.
 
@@ -841,38 +847,41 @@ FUNNEL_START_DOCSTRING = r"""    The Funnel Transformer model was proposed in
 """
 
 FUNNEL_INPUTS_DOCSTRING = r"""
-    Inputs:
-        input_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`):
+    Args:
+        input_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`):
             Indices of input sequence tokens in the vocabulary.
 
-            Indices can be obtained using :class:`transformers.FunnelTokenizer`.
-            See :func:`transformers.PreTrainedTokenizer.encode` and
-            :func:`transformers.PreTrainedTokenizer.__call__` for details.
+            Indices can be obtained using :class:`~transformers.BertTokenizer`.
+            See :meth:`transformers.PreTrainedTokenizer.encode` and
+            :meth:`transformers.PreTrainedTokenizer.__call__` for details.
 
             `What are input IDs? <../glossary.html#input-ids>`__
-        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):
+        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`):
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
             ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
 
             `What are attention masks? <../glossary.html#attention-mask>`__
-        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):
+        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
             Segment token indices to indicate first and second portions of the inputs.
-            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
-            corresponds to a `sentence B` token
+            Indices are selected in ``[0, 1]``:
+
+            - 0 corresponds to a `sentence A` token,
+            - 1 corresponds to a `sentence B` token.
 
             `What are token type IDs? <../glossary.html#token-type-ids>`_
-        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):
+        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`({0}, hidden_size)`, `optional`):
             Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
-            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
-        output_attentions (:obj:`bool`, `optional`, defaults to :obj:`None`):
-            If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
-        output_hidden_states (:obj:`bool`, `optional`, defaults to :obj:`None`):
-            If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
-        return_dict (:obj:`bool`, `optional`, defaults to :obj:`None`):
-            If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
-            plain tuple.
+            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
+            vectors than the model's internal embedding lookup matrix.
+        output_attentions (:obj:`bool`, `optional`):
+            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
+            tensors for more detail.
+        output_hidden_states (:obj:`bool`, `optional`):
+            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
+            more detail.
+        return_dict (:obj:`bool`, `optional`):
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
 """
 
 
@@ -896,7 +905,7 @@ class FunnelBaseModel(FunnelPreTrainedModel):
     def set_input_embeddings(self, new_embeddings):
         self.embeddings.word_embeddings = new_embeddings
 
-    @add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="funnel-transformer/small-base",
@@ -973,7 +982,7 @@ class FunnelModel(FunnelPreTrainedModel):
     def set_input_embeddings(self, new_embeddings):
         self.embeddings.word_embeddings = new_embeddings
 
-    @add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="funnel-transformer/small",
@@ -1071,7 +1080,7 @@ class FunnelForPreTraining(FunnelPreTrainedModel):
         self.discriminator_predictions = FunnelDiscriminatorPredictions(config)
         self.init_weights()
 
-    @add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @replace_return_docstrings(output_type=FunnelForPreTrainingOutput, config_class=_CONFIG_FOR_DOC)
     def forward(
         self,
@@ -1085,11 +1094,12 @@ class FunnelForPreTraining(FunnelPreTrainedModel):
         return_dict=None,
     ):
         r"""
-            labels (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`, defaults to :obj:`None`):
-                Labels for computing the ELECTRA-style loss. Input should be a sequence of tokens (see :obj:`input_ids` docstring)
-                Indices should be in ``[0, 1]``.
-                ``0`` indicates the token is an original token,
-                ``1`` indicates the token was replaced.
+        labels (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`):
+            Labels for computing the ELECTRA-style loss. Input should be a sequence of tokens (see :obj:`input_ids` docstring)
+            Indices should be in ``[0, 1]``:
+
+            - 0 indicates the token is an original token,
+            - 1 indicates the token was replaced.
 
         Returns:
 
@@ -1155,7 +1165,7 @@ class FunnelForMaskedLM(FunnelPreTrainedModel):
     def get_output_embeddings(self):
         return self.lm_head
 
-    @add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="funnel-transformer/small",
@@ -1174,7 +1184,7 @@ class FunnelForMaskedLM(FunnelPreTrainedModel):
         return_dict=None,
     ):
         r"""
-        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
+        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
             Labels for computing the masked language modeling loss.
             Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
             Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
@@ -1226,7 +1236,7 @@ class FunnelForSequenceClassification(FunnelPreTrainedModel):
         self.classifier = FunnelClassificationHead(config, config.num_labels)
         self.init_weights()
 
-    @add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="funnel-transformer/small-base",
@@ -1245,7 +1255,7 @@ class FunnelForSequenceClassification(FunnelPreTrainedModel):
         return_dict=None,
     ):
         r"""
-        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):
+        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for computing the sequence classification/regression loss.
             Indices should be in :obj:`[0, ..., config.num_labels - 1]`.
             If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),
@@ -1302,7 +1312,7 @@ class FunnelForMultipleChoice(FunnelPreTrainedModel):
         self.classifier = FunnelClassificationHead(config, 1)
         self.init_weights()
 
-    @add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("(batch_size, num_choices, sequence_length)"))
+    @add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("batch_size, num_choices, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="funnel-transformer/small-base",
@@ -1321,10 +1331,10 @@ class FunnelForMultipleChoice(FunnelPreTrainedModel):
         return_dict=None,
     ):
         r"""
-        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):
+        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for computing the multiple choice classification loss.
-            Indices should be in ``[0, ..., num_choices-1]`` where `num_choices` is the size of the second dimension
-            of the input tensors. (see `input_ids` above)
+            Indices should be in ``[0, ..., num_choices-1]`` where :obj:`num_choices` is the size of the second dimension
+            of the input tensors. (See :obj:`input_ids` above)
         """
         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
         num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]
@@ -1386,7 +1396,7 @@ class FunnelForTokenClassification(FunnelPreTrainedModel):
 
         self.init_weights()
 
-    @add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="funnel-transformer/small",
@@ -1405,7 +1415,7 @@ class FunnelForTokenClassification(FunnelPreTrainedModel):
         return_dict=None,
     ):
         r"""
-        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
+        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
             Labels for computing the token classification loss.
             Indices should be in ``[0, ..., config.num_labels - 1]``.
         """
@@ -1466,7 +1476,7 @@ class FunnelForQuestionAnswering(FunnelPreTrainedModel):
 
         self.init_weights()
 
-    @add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="funnel-transformer/small",
@@ -1486,13 +1496,13 @@ class FunnelForQuestionAnswering(FunnelPreTrainedModel):
         return_dict=None,
     ):
         r"""
-        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):
+        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for position (index) of the start of the labelled span for computing the token classification loss.
-            Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
-        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):
+        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for position (index) of the end of the labelled span for computing the token classification loss.
-            Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
         """
         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
diff --git a/src/transformers/modeling_gpt2.py b/src/transformers/modeling_gpt2.py
index e48e6d9b18..f1671c69cb 100644
--- a/src/transformers/modeling_gpt2.py
+++ b/src/transformers/modeling_gpt2.py
@@ -391,7 +391,11 @@ class GPT2DoubleHeadsModelOutput(ModelOutput):
 
 GPT2_START_DOCSTRING = r"""
 
-    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.
+    This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
+    methods the library implements for all its model (such as downloading or saving, resizing the input embeddings,
+    pruning heads etc.)
+
+    This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
     Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
     usage and behavior.
 
@@ -411,27 +415,31 @@ GPT2_INPUTS_DOCSTRING = r"""
             If ``past_key_values`` is used, only ``input_ids`` that do not have their past calculated should be passed
             as ``input_ids``.
 
-            Indices can be obtained using :class:`transformers.GPT2Tokenizer`.
-            See :func:`transformers.PreTrainedTokenizer.encode` and
-            :func:`transformers.PreTrainedTokenizer.__call__` for details.
+            Indices can be obtained using :class:`~transformers.GPT2Tokenizer`.
+            See :meth:`transformers.PreTrainedTokenizer.encode` and
+            :meth:`transformers.PreTrainedTokenizer.__call__` for details.
 
             `What are input IDs? <../glossary.html#input-ids>`__
-
         past_key_values (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):
-            Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
+            Contains precomputed hidden-states (key and values in the attention blocks) as computed by the model
             (see ``past_key_values`` output below). Can be used to speed up sequential decoding.
-            The ``input_ids`` which have their past given to this model should not be passed as ``input_ids`` as they have already been computed.
+            The ``input_ids`` which have their past given to this model should not be passed as ``input_ids`` as they
+            have already been computed.
         attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
 
             `What are attention masks? <../glossary.html#attention-mask>`__
         token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, input_ids_length)`, `optional`):
-            `input_ids_length` = `sequence_length if `past` is None else 1
             Segment token indices to indicate first and second portions of the inputs.
-            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
-            corresponds to a `sentence B` token
+            Indices are selected in ``[0, 1]``:
+
+            - 0 corresponds to a `sentence A` token,
+            - 1 corresponds to a `sentence B` token.
+
             `What are token type IDs? <../glossary.html#token-type-ids>`_
         position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
             Indices of positions of each input sequence tokens in the position embeddings.
@@ -441,20 +449,28 @@ GPT2_INPUTS_DOCSTRING = r"""
         head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
-            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
+
+            - 1 indicates the head is **not masked**,
+            - 0 indicates the head is **masked**.
+
         inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
-            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
-            If ``past_key_values`` is used, optionally only the last `inputs_embeds` have to be input (see ``past_key_values``).
-        use_cache (:obj:`bool`):
-            If `use_cache` is True, ``past_key_values`` key value states are returned and can be used to speed up decoding (see ``past_key_values``). Defaults to `True`.
+            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
+            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
+            vectors than the model's internal embedding lookup matrix.
+
+            If ``past_key_values`` is used, optionally only the last :obj:`inputs_embeds` have to be input (see
+            ``past_key_values``).
+        use_cache (:obj:`bool`, `optional`):
+            If set to :obj:`True`, ``past_key_values`` key value states are returned and can be used to speed up
+            decoding (see ``past_key_values``).
         output_attentions (:obj:`bool`, `optional`):
-            If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
+            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
+            tensors for more detail.
         output_hidden_states (:obj:`bool`, `optional`):
-            If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
+            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
+            more detail.
         return_dict (:obj:`bool`, `optional`):
-            If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
-            plain tuple.
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
 """
 
 
@@ -809,25 +825,25 @@ class GPT2DoubleHeadsModel(GPT2PreTrainedModel):
         **kwargs,
     ):
         r"""
-            mc_token_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, num_choices)`, `optional`, default to index of the last token of the input)
-                Index of the classification token in each input sequence.
-                Selected in the range ``[0, input_ids.size(-1) - 1[``.
-            labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`)
-                Labels for language modeling.
-                Note that the labels **are shifted** inside the model, i.e. you can set ``labels = input_ids``
-                Indices are selected in ``[-1, 0, ..., config.vocab_size]``
-                All labels set to ``-100`` are ignored (masked), the loss is only
-                computed for labels in ``[0, ..., config.vocab_size]``
-            mc_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size)`, `optional`)
-                Labels for computing the multiple choice classification loss.
-                Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension
-                of the input tensors. (see `input_ids` above)
-            kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`):
-                Used to hide legacy arguments that have been deprecated.
+        mc_token_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, num_choices)`, `optional`, default to index of the last token of the input)
+            Index of the classification token in each input sequence.
+            Selected in the range ``[0, input_ids.size(-1) - 1[``.
+        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`)
+            Labels for language modeling.
+            Note that the labels **are shifted** inside the model, i.e. you can set ``labels = input_ids``
+            Indices are selected in ``[-1, 0, ..., config.vocab_size]``
+            All labels set to ``-100`` are ignored (masked), the loss is only
+            computed for labels in ``[0, ..., config.vocab_size]``
+        mc_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size)`, `optional`)
+            Labels for computing the multiple choice classification loss.
+            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension
+            of the input tensors. (see `input_ids` above)
+        kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`):
+            Used to hide legacy arguments that have been deprecated.
 
         Return:
 
-        Examples::
+        Example::
 
             >>> import torch
             >>> from transformers import GPT2Tokenizer, GPT2DoubleHeadsModel
diff --git a/src/transformers/modeling_layoutlm.py b/src/transformers/modeling_layoutlm.py
index 49d81149c9..f30a602113 100644
--- a/src/transformers/modeling_layoutlm.py
+++ b/src/transformers/modeling_layoutlm.py
@@ -519,7 +519,7 @@ LAYOUTLM_START_DOCSTRING = r"""    The LayoutLM model was proposed in
 """
 
 LAYOUTLM_INPUTS_DOCSTRING = r"""
-    Inputs:
+    Args:
         input_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`):
             Indices of input sequence tokens in the vocabulary.
 
diff --git a/src/transformers/modeling_longformer.py b/src/transformers/modeling_longformer.py
index f1979225e2..56f0e31ced 100755
--- a/src/transformers/modeling_longformer.py
+++ b/src/transformers/modeling_longformer.py
@@ -811,65 +811,76 @@ class LongformerPreTrainedModel(PreTrainedModel):
 
 
 LONGFORMER_START_DOCSTRING = r"""
-    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ sub-class.
+
+    This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
+    methods the library implements for all its model (such as downloading or saving, resizing the input embeddings,
+    pruning heads etc.)
+
+    This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
     Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
     usage and behavior.
 
     Parameters:
         config (:class:`~transformers.LongformerConfig`): Model configuration class with all the parameters of the
-            model. Initializing with a config file does not load the weights associated with the model, only the configuration.
-            Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model weights.
+            model. Initializing with a config file does not load the weights associated with the model, only the
+            configuration. Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model
+            weights.
 """
 
 LONGFORMER_INPUTS_DOCSTRING = r"""
     Args:
-        input_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`):
+        input_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`):
             Indices of input sequence tokens in the vocabulary.
 
-            Indices can be obtained using :class:`transformers.LonmgformerTokenizer`.
-            See :func:`transformers.PreTrainedTokenizer.encode` and
-            :func:`transformers.PreTrainedTokenizer.__call__` for details.
+            Indices can be obtained using :class:`~transformers.LongformerTokenizer`.
+            See :meth:`transformers.PreTrainedTokenizer.encode` and
+            :meth:`transformers.PreTrainedTokenizer.__call__` for details.
 
             `What are input IDs? <../glossary.html#input-ids>`__
-        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`{0}`, `optional`):
+        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`):
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
 
             `What are attention masks? <../glossary.html#attention-mask>`__
-
-        global_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`{0}`, `optional`):
+        global_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`):
             Mask to decide the attention given on each token, local attention or global attenion.
             Tokens with global attention attends to all other tokens, and all other tokens attend to them. This is important for
             task-specific finetuning because it makes the model more flexible at representing the task. For example,
             for classification, the <s> token should be given global attention. For QA, all question tokens should also have
             global attention. Please refer to the `Longformer paper <https://arxiv.org/abs/2004.05150>`__ for more details.
             Mask values selected in ``[0, 1]``:
-            ``0`` for local attention (a sliding window attention),
-            ``1`` for global attention (tokens that attend to all other tokens, and all other tokens attend to them).
 
-        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`):
+            - 0 for local attention (a sliding window attention),
+            - 1 for global attention (tokens that attend to all other tokens, and all other tokens attend to them).
+
+        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
             Segment token indices to indicate first and second portions of the inputs.
-            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
-            corresponds to a `sentence B` token
+            Indices are selected in ``[0, 1]``:
+
+            - 0 corresponds to a `sentence A` token,
+            - 1 corresponds to a `sentence B` token.
 
             `What are token type IDs? <../glossary.html#token-type-ids>`_
-        position_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`):
+        position_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
             Indices of positions of each input sequence tokens in the position embeddings.
             Selected in the range ``[0, config.max_position_embeddings - 1]``.
 
             `What are position IDs? <../glossary.html#position-ids>`_
-        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
+        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`({0}, hidden_size)`, `optional`):
             Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
-            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
+            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
+            vectors than the model's internal embedding lookup matrix.
         output_attentions (:obj:`bool`, `optional`):
-            If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
+            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
+            tensors for more detail.
         output_hidden_states (:obj:`bool`, `optional`):
-            If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
+            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
+            more detail.
         return_dict (:obj:`bool`, `optional`):
-            If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
-            plain tuple.
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
 """
 
 
@@ -879,13 +890,14 @@ LONGFORMER_INPUTS_DOCSTRING = r"""
 )
 class LongformerModel(LongformerPreTrainedModel):
     """
-    This class copied code from :class:`~transformers.RobertaModel` and overwrote standard self-attention with longformer self-attention to provide the ability to process
+    This class copied code from :class:`~transformers.RobertaModel` and overwrote standard self-attention with
+    longformer self-attention to provide the ability to process
     long sequences following the self-attention approach described in `Longformer: the Long-Document Transformer
     <https://arxiv.org/abs/2004.05150>`__ by Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer self-attention
     combines a local (sliding window) and global attention to extend to long documents without the O(n^2) increase in
     memory and compute.
 
-    The self-attention module `LongformerSelfAttention` implemented here supports the combination of local and
+    The self-attention module :obj:`LongformerSelfAttention` implemented here supports the combination of local and
     global attention but it lacks support for autoregressive attention and dilated attention. Autoregressive
     and dilated attention are more relevant for autoregressive language modeling than finetuning on downstream
     tasks. Future release will add support for autoregressive attention, but the support for dilated attention
@@ -989,7 +1001,7 @@ class LongformerModel(LongformerPreTrainedModel):
             attention_mask = global_attention_mask + 1
         return attention_mask
 
-    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=_CONFIG_FOR_DOC)
     def forward(
         self,
@@ -1116,7 +1128,7 @@ class LongformerForMaskedLM(LongformerPreTrainedModel):
     def get_output_embeddings(self):
         return self.lm_head.decoder
 
-    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @replace_return_docstrings(output_type=MaskedLMOutput, config_class=_CONFIG_FOR_DOC)
     def forward(
         self,
@@ -1133,13 +1145,13 @@ class LongformerForMaskedLM(LongformerPreTrainedModel):
         **kwargs
     ):
         r"""
-            labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
-                Labels for computing the masked language modeling loss.
-                Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
-                Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
-                in ``[0, ..., config.vocab_size]``
-            kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`):
-                Used to hide legacy arguments that have been deprecated.
+        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+            Labels for computing the masked language modeling loss.
+            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
+            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
+            in ``[0, ..., config.vocab_size]``
+        kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`):
+            Used to hide legacy arguments that have been deprecated.
 
         Returns:
 
@@ -1219,7 +1231,7 @@ class LongformerForSequenceClassification(BertPreTrainedModel):
 
         self.init_weights()
 
-    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="allenai/longformer-base-4096",
@@ -1310,8 +1322,8 @@ class LongformerClassificationHead(nn.Module):
 
 
 @add_start_docstrings(
-    """Longformer Model with a span classification head on top for extractive question-answering tasks like SQuAD / TriviaQA (a linear layers on top of
-    the hidden-states output to compute `span start logits` and `span end logits`). """,
+    """Longformer Model with a span classification head on top for extractive question-answering tasks like SQuAD /
+    TriviaQA (a linear layers on top of the hidden-states output to compute `span start logits` and `span end logits`). """,
     LONGFORMER_START_DOCSTRING,
 )
 class LongformerForQuestionAnswering(BertPreTrainedModel):
@@ -1327,7 +1339,7 @@ class LongformerForQuestionAnswering(BertPreTrainedModel):
 
         self.init_weights()
 
-    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @replace_return_docstrings(output_type=QuestionAnsweringModelOutput, config_class=_CONFIG_FOR_DOC)
     def forward(
         self,
@@ -1344,14 +1356,15 @@ class LongformerForQuestionAnswering(BertPreTrainedModel):
         return_dict=None,
     ):
         r"""
-            start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
-                Labels for position (index) of the start of the labelled span for computing the token classification loss.
-                Positions are clamped to the length of the sequence (`sequence_length`).
-                Position outside of the sequence are not taken into account for computing the loss.
-            end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
-                Labels for position (index) of the end of the labelled span for computing the token classification loss.
-                Positions are clamped to the length of the sequence (`sequence_length`).
-                Position outside of the sequence are not taken into account for computing the loss.
+        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
+            Labels for position (index) of the start of the labelled span for computing the token classification loss.
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
+            Position outside of the sequence are not taken into account for computing the loss.
+        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
+            Labels for position (index) of the end of the labelled span for computing the token classification loss.
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
+            Position outside of the sequence are not taken into account for computing the loss.
+
         Returns:
 
         Examples::
@@ -1458,7 +1471,7 @@ class LongformerForTokenClassification(BertPreTrainedModel):
 
         self.init_weights()
 
-    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="allenai/longformer-base-4096",
@@ -1546,7 +1559,7 @@ class LongformerForMultipleChoice(BertPreTrainedModel):
 
         self.init_weights()
 
-    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format("(batch_size, num_choices, sequence_length)"))
+    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format("batch_size, num_choices, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="allenai/longformer-base-4096",
@@ -1569,8 +1582,8 @@ class LongformerForMultipleChoice(BertPreTrainedModel):
         r"""
         labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for computing the multiple choice classification loss.
-            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension
-            of the input tensors. (see `input_ids` above)
+            Indices should be in ``[0, ..., num_choices-1]`` where :obj:`num_choices` is the size of the second dimension
+            of the input tensors. (See :obj:`input_ids` above)
         """
         num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]
         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
diff --git a/src/transformers/modeling_lxmert.py b/src/transformers/modeling_lxmert.py
index adf670a9ea..dface2fdbd 100644
--- a/src/transformers/modeling_lxmert.py
+++ b/src/transformers/modeling_lxmert.py
@@ -805,13 +805,18 @@ class LxmertPreTrainedModel(PreTrainedModel):
 
 
 LXMERT_START_DOCSTRING = r"""
-    The LXMERT model was proposed in `LXMERT: Learning Cross-Modality Encoder Representations from Transformers <https://arxiv.org/abs/1908.07490>`__
-    by Hao Tan and Mohit Bansal. It's a vision and language transformer model,
-    pre-trained on a variety of multi-modal datasets comprising of GQA, VQAv2.0, MCSCOCO captions, and Visual genome,
+
+    The LXMERT model was proposed in `LXMERT: Learning Cross-Modality Encoder Representations from Transformers
+    <https://arxiv.org/abs/1908.07490>`__ by Hao Tan and Mohit Bansal. It's a vision and language transformer model,
+    pretrained on a variety of multi-modal datasets comprising of GQA, VQAv2.0, MCSCOCO captions, and Visual genome,
     using a combination of masked language modeling, region of interest feature regression,
     cross entropy loss for question answering attribute prediction, and object tag predicition.
 
-    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.
+    This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
+    methods the library implements for all its model (such as downloading or saving, resizing the input embeddings,
+    pruning heads etc.)
+
+    This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
     Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
     usage and behavior.
 
@@ -824,50 +829,61 @@ LXMERT_START_DOCSTRING = r"""
 LXMERT_INPUTS_DOCSTRING = r"""
 
     Args:
-        input_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`):
+        input_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`):
             Indices of input sequence tokens in the vocabulary.
 
-            Indices can be obtained using :class:`transformers.LxmertTokenizer`.
-            See :func:`transformers.PreTrainedTokenizer.encode` and
-            :func:`transformers.PreTrainedTokenizer.__call__` for details.
+            Indices can be obtained using :class:`~transformers.LxmertTokenizer`.
+            See :meth:`transformers.PreTrainedTokenizer.encode` and
+            :meth:`transformers.PreTrainedTokenizer.__call__` for details.
 
             `What are input IDs? <../glossary.html#input-ids>`__
         visual_feats: (:obj:`torch.FloatTensor` of shape :obj:՝(batch_size, num_visual_features, visual_feat_dim)՝):
-            This input represents visual features. They ROI pooled object features from bounding boxes using a faster-RCNN model)
-            These are currently not provided by the transformers library
+            This input represents visual features. They ROI pooled object features from bounding boxes using a
+            faster-RCNN model)
+
+            These are currently not provided by the transformers library.
         visual_pos: (:obj:`torch.FloatTensor` of shape :obj:՝(batch_size, num_visual_features, visual_pos_dim)՝):
             This input represents spacial features corresponding to their relative (via index) visual features.
-            The pre-trained lxmert model expects these spacial features to be normalized bounding boxes on a scale of 0~1.
-            These are currently not provided by the transformers library
-        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`{0}`, `optional`):
+            The pre-trained LXMERT model expects these spacial features to be normalized bounding boxes on a scale of
+            0 to 1.
+
+            These are currently not provided by the transformers library.
+        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`):
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
 
             `What are attention masks? <../glossary.html#attention-mask>`__
-        visual_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`{0}`, `optional`):
+        visual_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`):
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
 
             `What are attention masks? <../glossary.html#attention-mask>`__
-        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`):
+        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
             Segment token indices to indicate first and second portions of the inputs.
-            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
-            corresponds to a `sentence B` token
+            Indices are selected in ``[0, 1]``:
 
-            `What are token type IDs? <../glossary.html#token-type-ids>`_
-        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
+            - 0 corresponds to a `sentence A` token,
+            - 1 corresponds to a `sentence B` token.
+
+            `What are token type IDs? <../glossary.html#token-type-ids>`__
+        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`({0}, hidden_size)`, `optional`):
             Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
-            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
-        output_attentions: (:obj:`bool`, `optional`):
-             If set to ``True``, the attentions tensors of all attention layers for the visual, language, and cross-modality encoder are returned.
+            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
+            vectors than the model's internal embedding lookup matrix.
+        output_attentions (:obj:`bool`, `optional`):
+            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
+            tensors for more detail.
         output_hidden_states (:obj:`bool`, `optional`):
-            If set to ``True``, the hidden states for each respective modality will be returned when used as the input vector in the cross-modality encoder.
+            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
+            more detail.
         return_dict (:obj:`bool`, `optional`):
-            If set to ``True``, the model will return a :class:`~transformers.file_utils.LxmertModelOutput` instead of a
-            plain tuple.
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
 """
 
 
@@ -889,7 +905,7 @@ class LxmertModel(LxmertPreTrainedModel):
     def set_input_embeddings(self, new_embeddings):
         self.embeddings.word_embeddings = new_embeddings
 
-    @add_start_docstrings_to_callable(LXMERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(LXMERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="unc-nlp/lxmert-base-uncased",
@@ -1141,7 +1157,7 @@ class LxmertForPreTraining(LxmertPreTrainedModel):
 
         return new_qa_logit_layer
 
-    @add_start_docstrings_to_callable(LXMERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(LXMERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @replace_return_docstrings(output_type=LxmertForPreTrainingOutput, config_class=_CONFIG_FOR_DOC)
     def forward(
         self,
@@ -1166,16 +1182,17 @@ class LxmertForPreTraining(LxmertPreTrainedModel):
             Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
             Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
             in ``[0, ..., config.vocab_size]``
-        obj_labels: (``Dict[Str: Tuple[Torch.FloatTensor, Torch.FloatTensor]]``, `optional`, defaults to :obj: `None`):
+        obj_labels: (``Dict[Str: Tuple[Torch.FloatTensor, Torch.FloatTensor]]``, `optional`):
             each key is named after each one of the visual losses and each element of the tuple is of the shape
             ``(batch_size, num_features)`` and ``(batch_size, num_features, visual_feature_dim)``
             for each the label id and the label score respectively
         matched_label (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`):
             Labels for computing the whether or not the text input matches the image (classification) loss. Input should be a sequence pair (see :obj:`input_ids` docstring)
-            Indices should be in ``[0, 1]``.
-            ``0`` indicates that the sentence does not match the image
-            ``1`` indicates that the sentence does match the image
-        ans: (``Torch.Tensor`` of shape ``(batch_size)``, `optional`, defaults to :obj: `None`):
+            Indices should be in ``[0, 1]``:
+
+            - 0 indicates that the sentence does not match the image,
+            - 1 indicates that the sentence does match the image.
+        ans: (``Torch.Tensor`` of shape ``(batch_size)``, `optional`):
             a one hot representation hof the correct answer `optional`
 
         Returns:
@@ -1364,7 +1381,7 @@ class LxmertForQuestionAnswering(LxmertPreTrainedModel):
 
         return new_qa_logit_layer
 
-    @add_start_docstrings_to_callable(LXMERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(LXMERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="unc-nlp/lxmert-base-uncased",
@@ -1387,7 +1404,7 @@ class LxmertForQuestionAnswering(LxmertPreTrainedModel):
     ):
         r"""
         labels: (``Torch.Tensor`` of shape ``(batch_size)``, `optional`):
-            a one hot representation of the correct answer
+            A one-hot representation of the correct answer
 
         Returns:
         """
diff --git a/src/transformers/modeling_mmbt.py b/src/transformers/modeling_mmbt.py
index bde42ecb2c..ac3bb153be 100644
--- a/src/transformers/modeling_mmbt.py
+++ b/src/transformers/modeling_mmbt.py
@@ -76,20 +76,20 @@ class ModalEmbeddings(nn.Module):
         return embeddings
 
 
-MMBT_START_DOCSTRING = r"""    MMBT model was proposed in
-    `Supervised Multimodal Bitransformers for Classifying Images and Text`_
+MMBT_START_DOCSTRING = r"""
+    MMBT model was proposed in
+    `Supervised Multimodal Bitransformers for Classifying Images and Text <https://github.com/facebookresearch/mmbt>`__
     by Douwe Kiela, Suvrat Bhooshan, Hamed Firooz, Davide Testuggine.
     It's a supervised multimodal bitransformer model that fuses information from text and other image encoders,
     and obtain state-of-the-art performance on various multimodal classification benchmark tasks.
 
-    This model is a PyTorch `torch.nn.Module`_ sub-class. Use it as a regular PyTorch Module and
-    refer to the PyTorch documentation for all matter related to general usage and behavior.
+    This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
+    methods the library implements for all its model (such as downloading or saving, resizing the input embeddings,
+    pruning heads etc.)
 
-    .. _`Supervised Multimodal Bitransformers for Classifying Images and Text`:
-        https://github.com/facebookresearch/mmbt
-
-    .. _`torch.nn.Module`:
-        https://pytorch.org/docs/stable/nn.html#module
+    This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
+    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
+    usage and behavior.
 
     Parameters:
         config (:class:`~transformers.MMBTConfig`): Model configuration class with all the parameters of the model.
@@ -100,55 +100,83 @@ MMBT_START_DOCSTRING = r"""    MMBT model was proposed in
             It should take in a batch of modal inputs and return k, n dimension embeddings.
 """
 
-MMBT_INPUTS_DOCSTRING = r"""    Inputs:
+MMBT_INPUTS_DOCSTRING = r"""
+    Args:
         input_modal (``torch.FloatTensor`` of shape ``(batch_size, ***)``):
             The other modality data. It will be the shape that the encoder for that type expects.
             e.g. With an Image Encoder, the shape would be (batch_size, channels, height, width)
         input_ids (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``):
             Indices of input sequence tokens in the vocabulary.
             It does not expect [CLS] token to be added as it's appended to the end of other modality embeddings.
-            See :func:`transformers.PreTrainedTokenizer.encode` and
-            :func:`transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details.
+            Indices can be obtained using :class:`~transformers.BertTokenizer`.
+            See :meth:`transformers.PreTrainedTokenizer.encode` and
+            :meth:`transformers.PreTrainedTokenizer.__call__` for details.
+
+            `What are input IDs? <../glossary.html#input-ids>`__
         modal_start_tokens (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`):
-            Optional start token to be added to Other Modality Embedding. [CLS] Most commonly used for Classification tasks.
+            Optional start token to be added to Other Modality Embedding. [CLS] Most commonly used for classification
+            tasks.
         modal_end_tokens (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`):
             Optional end token to be added to Other Modality Embedding. [SEP] Most commonly used.
         attention_mask (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``:
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
+
+            `What are attention masks? <../glossary.html#attention-mask>`__
         token_type_ids (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
-            Segment token indices to indicate different portions of the inputs.
+            Segment token indices to indicate first and second portions of the inputs.
+            Indices are selected in ``[0, 1]``:
+
+            - 0 corresponds to a `sentence A` token,
+            - 1 corresponds to a `sentence B` token.
+
+            `What are token type IDs? <../glossary.html#token-type-ids>`_
         modal_token_type_ids (`optional`) ``torch.LongTensor`` of shape ``(batch_size, modal_sequence_length)``:
             Segment token indices to indicate different portions of the non-text modality.
             The embeddings from these tokens will be summed with the respective token embeddings for the non-text modality.
         position_ids (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`):
             Indices of positions of each input sequence tokens in the position embeddings.
+            Selected in the range ``[0, config.max_position_embeddings - 1]``.
+
+            `What are position IDs? <../glossary.html#position-ids>`__
         modal_position_ids (``torch.LongTensor`` of shape ``(batch_size, modal_sequence_length)``, `optional`):
             Indices of positions of each input sequence tokens in the position embeddings for the non-text modality.
+            Selected in the range ``[0, config.max_position_embeddings - 1]``.
+
+            `What are position IDs? <../glossary.html#position-ids>`__
         head_mask (``torch.FloatTensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``, `optional`):
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
-            ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.
+
+            - 1 indicates the head is **not masked**,
+            - 0 indicates the head is **masked**.
+
         inputs_embeds (``torch.FloatTensor`` of shape ``(batch_size, sequence_length, embedding_dim)``, `optional`):
-            Optionally, instead of passing ``input_ids`` you can choose to directly pass an embedded representation.
-            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
+            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
+            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
+            vectors than the model's internal embedding lookup matrix.
         encoder_hidden_states (``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)``, `optional`):
-            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model
-            is configured as a decoder.
+            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if
+            the model is configured as a decoder.
         encoder_attention_mask (``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``, `optional`):
             Mask to avoid performing attention on the padding token indices of the encoder input. This mask
             is used in the cross-attention if the model is configured as a decoder.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
+
         output_attentions (:obj:`bool`, `optional`):
-            If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
+            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
+            tensors for more detail.
         output_hidden_states (:obj:`bool`, `optional`):
-            If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
+            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
+            more detail.
         return_dict (:obj:`bool`, `optional`):
-            If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
-            plain tuple.
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
 """
 
 
diff --git a/src/transformers/modeling_mobilebert.py b/src/transformers/modeling_mobilebert.py
index e81b4706f3..15f05053fc 100644
--- a/src/transformers/modeling_mobilebert.py
+++ b/src/transformers/modeling_mobilebert.py
@@ -690,7 +690,8 @@ class MobileBertForPreTrainingOutput(ModelOutput):
 
     Args:
         loss (`optional`, returned when ``labels`` is provided, ``torch.FloatTensor`` of shape :obj:`(1,)`):
-            Total loss as the sum of the masked language modeling loss and the next sequence prediction (classification) loss.
+            Total loss as the sum of the masked language modeling loss and the next sequence prediction
+            (classification) loss.
         prediction_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):
             Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
         seq_relationship_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`):
@@ -717,7 +718,12 @@ class MobileBertForPreTrainingOutput(ModelOutput):
 
 
 MOBILEBERT_START_DOCSTRING = r"""
-    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.
+
+    This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
+    methods the library implements for all its model (such as downloading or saving, resizing the input embeddings,
+    pruning heads etc.)
+
+    This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
     Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
     usage and behavior.
 
@@ -729,27 +735,31 @@ MOBILEBERT_START_DOCSTRING = r"""
 
 MOBILEBERT_INPUTS_DOCSTRING = r"""
     Args:
-        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
+        input_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`):
             Indices of input sequence tokens in the vocabulary.
 
-            Indices can be obtained using :class:`transformers.MobileBertTokenizer`.
-            See :func:`transformers.PreTrainedTokenizer.encode` and
-            :func:`transformers.PreTrainedTokenizer.__call__` for details.
+            Indices can be obtained using :class:`~transformers.BertTokenizer`.
+            See :meth:`transformers.PreTrainedTokenizer.encode` and
+            :meth:`transformers.PreTrainedTokenizer.__call__` for details.
 
             `What are input IDs? <../glossary.html#input-ids>`__
-        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`):
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
 
             `What are attention masks? <../glossary.html#attention-mask>`__
-        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
             Segment token indices to indicate first and second portions of the inputs.
-            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
-            corresponds to a `sentence B` token
+            Indices are selected in ``[0, 1]``:
+
+            - 0 corresponds to a `sentence A` token,
+            - 1 corresponds to a `sentence B` token.
 
             `What are token type IDs? <../glossary.html#token-type-ids>`_
-        position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+        position_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
             Indices of positions of each input sequence tokens in the position embeddings.
             Selected in the range ``[0, config.max_position_embeddings - 1]``.
 
@@ -757,11 +767,14 @@ MOBILEBERT_INPUTS_DOCSTRING = r"""
         head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
-            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
-        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
+
+            - 1 indicates the head is **not masked**,
+            - 0 indicates the head is **masked**.
+
+        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`({0}, hidden_size)`, `optional`):
             Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
-            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
+            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
+            vectors than the model's internal embedding lookup matrix.
         encoder_hidden_states  (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
             Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention
             if the model is configured as a decoder.
@@ -769,14 +782,18 @@ MOBILEBERT_INPUTS_DOCSTRING = r"""
             Mask to avoid performing attention on the padding token indices of the encoder input. This mask
             is used in the cross-attention if the model is configured as a decoder.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
+
         output_attentions (:obj:`bool`, `optional`):
-            If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
+            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
+            tensors for more detail.
         output_hidden_states (:obj:`bool`, `optional`):
-            If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
+            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
+            more detail.
         return_dict (:obj:`bool`, `optional`):
-            If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
-            plain tuple.
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
 """
 
 
@@ -814,7 +831,7 @@ class MobileBertModel(MobileBertPreTrainedModel):
         for layer, heads in heads_to_prune.items():
             self.encoder.layer[layer].attention.prune_heads(heads)
 
-    @add_start_docstrings_to_callable(MOBILEBERT_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(MOBILEBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="google/mobilebert-uncased",
@@ -946,7 +963,7 @@ class MobileBertForPreTraining(MobileBertPreTrainedModel):
         if output_embeddings is not None and self.config.tie_word_embeddings:
             self._tie_or_clone_weights(output_embeddings, self.get_input_embeddings())
 
-    @add_start_docstrings_to_callable(MOBILEBERT_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(MOBILEBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @replace_return_docstrings(output_type=MobileBertForPreTrainingOutput, config_class=_CONFIG_FOR_DOC)
     def forward(
         self,
@@ -963,16 +980,18 @@ class MobileBertForPreTraining(MobileBertPreTrainedModel):
         return_dict=None,
     ):
         r"""
-            labels (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`):
-                Labels for computing the masked language modeling loss.
-                Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
-                Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
-                in ``[0, ..., config.vocab_size]``
-            next_sentence_label (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`):
-                Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair (see :obj:`input_ids` docstring)
-                Indices should be in ``[0, 1]``.
-                ``0`` indicates sequence B is a continuation of sequence A,
-                ``1`` indicates sequence B is a random sequence.
+        labels (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`):
+            Labels for computing the masked language modeling loss.
+            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
+            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
+            in ``[0, ..., config.vocab_size]``
+        next_sentence_label (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`):
+            Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair (see :obj:`input_ids` docstring)
+            Indices should be in ``[0, 1]``:
+
+            - 0 indicates sequence B is a continuation of sequence A,
+            - 1 indicates sequence B is a random sequence.
+
         Returns:
 
         Examples::
@@ -1061,7 +1080,7 @@ class MobileBertForMaskedLM(MobileBertPreTrainedModel):
         if output_embeddings is not None and self.config.tie_word_embeddings:
             self._tie_or_clone_weights(output_embeddings, self.get_input_embeddings())
 
-    @add_start_docstrings_to_callable(MOBILEBERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(MOBILEBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="google/mobilebert-uncased",
@@ -1158,7 +1177,7 @@ class MobileBertForNextSentencePrediction(MobileBertPreTrainedModel):
 
         self.init_weights()
 
-    @add_start_docstrings_to_callable(MOBILEBERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(MOBILEBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @replace_return_docstrings(output_type=NextSentencePredictorOutput, config_class=_CONFIG_FOR_DOC)
     def forward(
         self,
@@ -1174,11 +1193,12 @@ class MobileBertForNextSentencePrediction(MobileBertPreTrainedModel):
         return_dict=None,
     ):
         r"""
-            next_sentence_label (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
-                Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair (see ``input_ids`` docstring)
-                Indices should be in ``[0, 1]``.
-                ``0`` indicates sequence B is a continuation of sequence A,
-                ``1`` indicates sequence B is a random sequence.
+        next_sentence_label (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
+            Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair (see ``input_ids`` docstring)
+            Indices should be in ``[0, 1]``.
+
+            - 0 indicates sequence B is a continuation of sequence A,
+            - 1 indicates sequence B is a random sequence.
 
         Returns:
 
@@ -1247,7 +1267,7 @@ class MobileBertForSequenceClassification(MobileBertPreTrainedModel):
 
         self.init_weights()
 
-    @add_start_docstrings_to_callable(MOBILEBERT_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(MOBILEBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="google/mobilebert-uncased",
@@ -1328,7 +1348,7 @@ class MobileBertForQuestionAnswering(MobileBertPreTrainedModel):
 
         self.init_weights()
 
-    @add_start_docstrings_to_callable(MOBILEBERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(MOBILEBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="google/mobilebert-uncased",
@@ -1352,11 +1372,11 @@ class MobileBertForQuestionAnswering(MobileBertPreTrainedModel):
         r"""
         start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for position (index) of the start of the labelled span for computing the token classification loss.
-            Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
         end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for position (index) of the end of the labelled span for computing the token classification loss.
-            Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
         """
         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
@@ -1425,7 +1445,7 @@ class MobileBertForMultipleChoice(MobileBertPreTrainedModel):
 
         self.init_weights()
 
-    @add_start_docstrings_to_callable(MOBILEBERT_INPUTS_DOCSTRING.format("(batch_size, num_choices, sequence_length)"))
+    @add_start_docstrings_to_callable(MOBILEBERT_INPUTS_DOCSTRING.format("batch_size, num_choices, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="google/mobilebert-uncased",
@@ -1448,8 +1468,8 @@ class MobileBertForMultipleChoice(MobileBertPreTrainedModel):
         r"""
         labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for computing the multiple choice classification loss.
-            Indices should be in ``[0, ..., num_choices-1]`` where `num_choices` is the size of the second dimension
-            of the input tensors. (see `input_ids` above)
+            Indices should be in ``[0, ..., num_choices-1]`` where :obj:`num_choices` is the size of the second dimension
+            of the input tensors. (See :obj:`input_ids` above)
         """
         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
         num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]
@@ -1515,7 +1535,7 @@ class MobileBertForTokenClassification(MobileBertPreTrainedModel):
 
         self.init_weights()
 
-    @add_start_docstrings_to_callable(MOBILEBERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(MOBILEBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="google/mobilebert-uncased",
diff --git a/src/transformers/modeling_openai.py b/src/transformers/modeling_openai.py
index f6128f34c0..a3029c6ca2 100644
--- a/src/transformers/modeling_openai.py
+++ b/src/transformers/modeling_openai.py
@@ -331,7 +331,11 @@ class OpenAIGPTDoubleHeadsModelOutput(ModelOutput):
 
 OPENAI_GPT_START_DOCSTRING = r"""
 
-    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.
+    This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
+    methods the library implements for all its model (such as downloading or saving, resizing the input embeddings,
+    pruning heads etc.)
+
+    This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
     Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
     usage and behavior.
 
@@ -346,43 +350,51 @@ OPENAI_GPT_INPUTS_DOCSTRING = r"""
         input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
             Indices of input sequence tokens in the vocabulary.
 
-            Indices can be obtained using :class:`transformers.OpenAIGPTTokenizer`.
-            See :func:`transformers.PreTrainedTokenizer.encode` and
-            :func:`transformers.PreTrainedTokenizer.__call__` for details.
+            Indices can be obtained using :class:`~transformers.OpenAIGPTTokenizer`.
+            See :meth:`transformers.PreTrainedTokenizer.encode` and
+            :meth:`transformers.PreTrainedTokenizer.__call__` for details.
 
             `What are input IDs? <../glossary.html#input-ids>`__
         attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
 
             `What are attention masks? <../glossary.html#attention-mask>`__
         token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
             Segment token indices to indicate first and second portions of the inputs.
-            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
-            corresponds to a `sentence B` token
+            Indices are selected in ``[0, 1]``:
+
+            - 0 corresponds to a `sentence A` token,
+            - 1 corresponds to a `sentence B` token.
 
             `What are token type IDs? <../glossary.html#token-type-ids>`_
         position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
             Indices of positions of each input sequence tokens in the position embeddings.
             Selected in the range ``[0, config.max_position_embeddings - 1]``.
 
-            `What are position IDs? <../glossary.html#position-ids>`_
+            `What are position IDs? <../glossary.html#position-ids>`__
         head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
-            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
+
+            - 1 indicates the head is **not masked**,
+            - 0 indicates the head is **masked**.
+
         inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
             Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
-            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
+            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
+            vectors than the model's internal embedding lookup matrix.
         output_attentions (:obj:`bool`, `optional`):
-            If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
+            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
+            tensors for more detail.
         output_hidden_states (:obj:`bool`, `optional`):
-            If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
+            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
+            more detail.
         return_dict (:obj:`bool`, `optional`):
-            If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
-            plain tuple.
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
 """
 
 
@@ -635,21 +647,21 @@ class OpenAIGPTDoubleHeadsModel(OpenAIGPTPreTrainedModel):
         **kwargs
     ):
         r"""
-            mc_token_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, num_choices)`, `optional`, default to index of the last token of the input)
-                Index of the classification token in each input sequence.
-                Selected in the range ``[0, input_ids.size(-1) - 1]``.
-            labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`)
-                Labels for language modeling.
-                Note that the labels **are shifted** inside the model, i.e. you can set ``labels = input_ids``
-                Indices are selected in ``[-1, 0, ..., config.vocab_size]``
-                All labels set to ``-100`` are ignored (masked), the loss is only
-                computed for labels in ``[0, ..., config.vocab_size]``
-            mc_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size)`, `optional`)
-                Labels for computing the multiple choice classification loss.
-                Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension
-                of the input tensors. (see `input_ids` above)
-            kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`):
-                Used to hide legacy arguments that have been deprecated.
+        mc_token_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, num_choices)`, `optional`, default to index of the last token of the input)
+            Index of the classification token in each input sequence.
+            Selected in the range ``[0, input_ids.size(-1) - 1]``.
+        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`)
+            Labels for language modeling.
+            Note that the labels **are shifted** inside the model, i.e. you can set ``labels = input_ids``
+            Indices are selected in ``[-1, 0, ..., config.vocab_size]``
+            All labels set to ``-100`` are ignored (masked), the loss is only
+            computed for labels in ``[0, ..., config.vocab_size]``
+        mc_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size)`, `optional`)
+            Labels for computing the multiple choice classification loss.
+            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension
+            of the input tensors. (see `input_ids` above)
+        kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`):
+            Used to hide legacy arguments that have been deprecated.
 
         Return:
 
diff --git a/src/transformers/modeling_outputs.py b/src/transformers/modeling_outputs.py
index e6a4b0ed8b..f8c8b2c09b 100644
--- a/src/transformers/modeling_outputs.py
+++ b/src/transformers/modeling_outputs.py
@@ -73,7 +73,8 @@ class BaseModelOutputWithPast(ModelOutput):
         last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):
             Sequence of hidden-states at the output of the last layer of the model.
 
-            If `past_key_values` is used only the last hidden-state of the sequences of shape :obj:`(batch_size, 1, hidden_size)` is output.
+            If :obj:`past_key_values` is used only the last hidden-state of the sequences of shape
+            :obj:`(batch_size, 1, hidden_size)` is output.
         past_key_values (:obj:`List[torch.FloatTensor]`, `optional`, returned when ``use_cache=True`` is passed or when ``config.use_cache=True``):
             List of :obj:`torch.FloatTensor` of length :obj:`config.n_layers`,  with each tensor of shape
             :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`).
@@ -475,9 +476,9 @@ class QuestionAnsweringModelOutput(ModelOutput):
     Args:
         loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):
             Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.
-        start_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):
+        start_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`):
             Span-start scores (before SoftMax).
-        end_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):
+        end_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`):
             Span-end scores (before SoftMax).
         hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
             Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
@@ -507,9 +508,9 @@ class Seq2SeqQuestionAnsweringModelOutput(ModelOutput):
     Args:
         loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):
             Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.
-        start_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):
+        start_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`):
             Span-start scores (before SoftMax).
-        end_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):
+        end_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`):
             Span-end scores (before SoftMax).
         past_key_values (:obj:`List[torch.FloatTensor]`, `optional`, returned when ``use_cache=True`` is passed or when ``config.use_cache=True``):
             List of :obj:`torch.FloatTensor` of length :obj:`config.n_layers`,  with each tensor of shape
diff --git a/src/transformers/modeling_reformer.py b/src/transformers/modeling_reformer.py
index e524467173..0cb516ee2e 100755
--- a/src/transformers/modeling_reformer.py
+++ b/src/transformers/modeling_reformer.py
@@ -1837,16 +1837,16 @@ class ReformerModelOutput(ModelOutput):
             ``num_predict`` corresponds to ``target_mapping.shape[1]``. If ``target_mapping`` is ``None``, then
             ``num_predict`` corresponds to ``sequence_length``.
         past_buckets_states (:obj:`List[Tuple(torch.LongTensor, torch.FloatTensor)]`, `optional`, returned when ``use_cache=True`` is passed or when ``config.use_cache=True``):
-            List of :obj:`tuple(torch.LongTensor, torch.FloatTensor` of length :obj:`config.n_layers`,  with :obj:`tuple(0)` being the previous `buckets` of shape
-            :obj:`(batch_size, num_heads, num_hashes, sequence_length)`)
-            and :obj:`tuple(1)` being the previous `hidden_states` of shape
+            List of :obj:`Tuple(torch.LongTensor, torch.FloatTensor` of length :obj:`config.n_layers`, with the first
+            element being the previous `buckets` of shape :obj:`(batch_size, num_heads, num_hashes, sequence_length)`)
+            and the second being the previous `hidden_states` of shape
             :obj:`(batch_size, sequence_length, hidden_size)`).
 
-            Contains pre-computed buckets and hidden-states that can be used (see
-            ``past_buckets_states`` input) to speed up sequential decoding.
+            Contains precomputed buckets and hidden-states that can be used (see ``past_buckets_states`` input) to
+            speed up sequential decoding.
         hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
-            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
-            of shape :obj:`(batch_size, sequence_length, hidden_size)`.
+            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings and one for the output of each
+            layer) of shape :obj:`(batch_size, sequence_length, hidden_size)`.
 
             Hidden-states of the model at the output of each layer plus the initial embedding outputs.
         attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
@@ -1877,16 +1877,16 @@ class ReformerModelWithLMHeadOutput(ModelOutput):
             ``num_predict`` corresponds to ``target_mapping.shape[1]``. If ``target_mapping`` is ``None``, then
             ``num_predict`` corresponds to ``sequence_length``.
         past_buckets_states (:obj:`List[Tuple(torch.LongTensor, torch.FloatTensor)]`, `optional`, returned when ``use_cache=True`` is passed or when ``config.use_cache=True``):
-            List of :obj:`tuple(torch.LongTensor, torch.FloatTensor` of length :obj:`config.n_layers`,  with :obj:`tuple(0)` being the previous `buckets` of shape
-            :obj:`(batch_size, num_heads, num_hashes, sequence_length)`)
-            and :obj:`tuple(1)` being the previous `hidden_states` of shape
+            List of :obj:`Tuple(torch.LongTensor, torch.FloatTensor` of length :obj:`config.n_layers`, with the first
+            element being the previous `buckets` of shape :obj:`(batch_size, num_heads, num_hashes, sequence_length)`)
+            and the second being the previous `hidden_states` of shape
             :obj:`(batch_size, sequence_length, hidden_size)`).
 
-            Contains pre-computed buckets and hidden-states that can be used (see
-            ``past_buckets_states`` input) to speed up sequential decoding.
+            Contains precomputed buckets and hidden-states that can be used (see ``past_buckets_states`` input) to
+            speed up sequential decoding.
         hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
-            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
-            of shape :obj:`(batch_size, sequence_length, hidden_size)`.
+            TTuple of :obj:`torch.FloatTensor` (one for the output of the embeddings and one for the output of each
+            layer) of shape :obj:`(batch_size, sequence_length, hidden_size)`.
 
             Hidden-states of the model at the output of each layer plus the initial embedding outputs.
         attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
@@ -1908,7 +1908,11 @@ REFORMER_START_DOCSTRING = r"""
     Reformer was proposed in `Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.0445>`__
     by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
 
-    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ sub-class.
+    This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
+    methods the library implements for all its model (such as downloading or saving, resizing the input embeddings,
+    pruning heads etc.)
+
+    This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
     Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
     usage and behavior.
 
@@ -1926,51 +1930,59 @@ REFORMER_INPUTS_DOCSTRING = r"""
             chunk lengths (lsh's, local's or both). During evaluation, the indices are automatically
             padded to be a multiple of the chunk length.
 
-            Indices can be obtained using :class:`transformers.ReformerTokenizer`.
-            See :func:`transformers.PreTrainedTokenizer.encode` and
-            :func:`transformers.PreTrainedTokenizer.__call__` for details.
+            Indices can be obtained using :class:`~transformers.ReformerTokenizer`.
+            See :meth:`transformers.PreTrainedTokenizer.encode` and
+            :meth:`transformers.PreTrainedTokenizer.__call__` for details.
 
             `What are input IDs? <../glossary.html#input-ids>`__
         attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
 
             `What are attention masks? <../glossary.html#attention-mask>`__
         position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
             Indices of positions of each input sequence tokens in the position embeddings.
             Selected in the range ``[0, config.max_position_embeddings - 1]``.
 
-            `What are position IDs? <../glossary.html#position-ids>`_
+            `What are position IDs? <../glossary.html#position-ids>`__
         head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
-            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
+
+            - 1 indicates the head is **not masked**,
+            - 0 indicates the head is **masked**.
+
         inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
             Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
-            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
+            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
+            vectors than the model's internal embedding lookup matrix.
         num_hashes (:obj:`int`, `optional`):
-            `num_hashes` is the number of hashing rounds that should be performed during
-            bucketing. Setting `num_hashes` overwrites the default `num_hashes` defined
-            in `config.num_hashes`.
-            For more information, see `num_hashes` in :class:`transformers.ReformerConfig`.
-        past_buckets_states (:obj:`List[Tuple(torch.LongTensor, torch.FloatTensor)]`, `optional`, defaults `None`):
-            List of :obj:`tuple(torch.LongTensor, torch.FloatTensor` of length :obj:`config.n_layers`,  with :obj:`tuple(0)` being the previous `buckets` of shape
-            :obj:`(batch_size, num_heads, num_hashes, sequence_length)`)
-            and :obj:`tuple(1)` being the previous `hidden_states` of shape
+            The number of hashing rounds that should be performed during bucketing. Setting this argument overwrites
+            the default defined in :obj:`config.num_hashes`.
+
+            For more information, see :obj:`num_hashes` in :class:`~transformers.ReformerConfig`.
+        past_buckets_states (:obj:`List[Tuple(torch.LongTensor, torch.FloatTensor)]`, `optional`):
+            List of :obj:`Tuple(torch.LongTensor, torch.FloatTensor` of length :obj:`config.n_layers`, with the first
+            element being the previous `buckets` of shape :obj:`(batch_size, num_heads, num_hashes, sequence_length)`)
+            and the second being the previous `hidden_states` of shape
             :obj:`(batch_size, sequence_length, hidden_size)`).
 
-            List of tuples that contains all previous computed hidden states and buckets (only relevant for LSH Self-Attention). Can be used to speed up sequential decoding.
+            Contains precomputed hidden-states and buckets (only relevant for LSH Self-Attention). Can be used to speed
+            up sequential decoding.
         use_cache (:obj:`bool`, `optional`):
-            If set to ``True``, the ``past_buckets_states`` of all attention layers are returned.
+            If set to :obj:`True`, ``past_key_values`` key value states are returned and can be used to speed up
+            decoding (see ``past_key_values``).
         output_attentions (:obj:`bool`, `optional`):
-            If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
+            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
+            tensors for more detail.
         output_hidden_states (:obj:`bool`, `optional`):
-            If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
+            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
+            more detail.
         return_dict (:obj:`bool`, `optional`):
-            If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
-            plain tuple.
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
 """
 
 
@@ -2528,11 +2540,11 @@ class ReformerForQuestionAnswering(ReformerPreTrainedModel):
         r"""
         start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for position (index) of the start of the labelled span for computing the token classification loss.
-            Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
         end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for position (index) of the end of the labelled span for computing the token classification loss.
-            Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
         """
         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
diff --git a/src/transformers/modeling_retribert.py b/src/transformers/modeling_retribert.py
index 5f0adc117a..56e9bee9d2 100644
--- a/src/transformers/modeling_retribert.py
+++ b/src/transformers/modeling_retribert.py
@@ -61,7 +61,11 @@ class RetriBertPreTrainedModel(PreTrainedModel):
 
 RETRIBERT_START_DOCSTRING = r"""
 
-    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.
+    This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
+    methods the library implements for all its model (such as downloading or saving, resizing the input embeddings,
+    pruning heads etc.)
+
+    This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
     Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
     usage and behavior.
 
@@ -171,29 +175,31 @@ class RetriBertModel(RetriBertPreTrainedModel):
             input_ids_query (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
                 Indices of input sequence tokens in the vocabulary for the queries in a batch.
 
-                Indices can be obtained using :class:`transformers.RetriBertTokenizer`.
-                See :func:`transformers.PreTrainedTokenizer.encode` and
-                :func:`transformers.PreTrainedTokenizer.__call__` for details.
+                Indices can be obtained using :class:`~transformers.RetriBertTokenizer`.
+                See :meth:`transformers.PreTrainedTokenizer.encode` and
+                :meth:`transformers.PreTrainedTokenizer.__call__` for details.
 
                 `What are input IDs? <../glossary.html#input-ids>`__
             attention_mask_query (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
-                Mask to avoid performing attention on queries padding token indices.
+                Mask to avoid performing attention on padding token indices.
                 Mask values selected in ``[0, 1]``:
-                ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+                - 1 for tokens that are **not masked**,
+                - 0 for tokens that are **maked**.
 
                 `What are attention masks? <../glossary.html#attention-mask>`__
             input_ids_doc (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
                 Indices of input sequence tokens in the vocabulary for the documents in a batch.
             attention_mask_doc (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
                 Mask to avoid performing attention on documents padding token indices.
-
             checkpoint_batch_size (:obj:`int`, `optional`, defaults to `:obj:`-1`):
-                If greater than 0, uses gradient checkpointing to only compute sequence representation on checkpoint_batch_size examples at a time
-                on the GPU. All query representations are still compared to all document representations in the batch.
+                If greater than 0, uses gradient checkpointing to only compute sequence representation on
+                :obj:`checkpoint_batch_size` examples at a time on the GPU. All query representations are still
+                compared to all document representations in the batch.
 
         Return:
-            :obj:`torch.FloatTensor` the bi-directional cross-entropy loss obtained while trying to match each query to its corresponding document
-            and each cocument to its corresponding query in the batch
+            :obj:`torch.FloatTensor`: The bidirectional cross-entropy loss obtained while trying to match each query to
+            its corresponding document and each cocument to its corresponding query in the batch
         """
         device = input_ids_query.device
         q_reps = self.embed_questions(input_ids_query, attention_mask_query, checkpoint_batch_size)
diff --git a/src/transformers/modeling_roberta.py b/src/transformers/modeling_roberta.py
index f0550d7881..75f2401081 100644
--- a/src/transformers/modeling_roberta.py
+++ b/src/transformers/modeling_roberta.py
@@ -481,7 +481,11 @@ class RobertaPreTrainedModel(PreTrainedModel):
 
 ROBERTA_START_DOCSTRING = r"""
 
-    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.
+    This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
+    methods the library implements for all its model (such as downloading or saving, resizing the input embeddings,
+    pruning heads etc.)
+
+    This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
     Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
     usage and behavior.
 
@@ -493,27 +497,31 @@ ROBERTA_START_DOCSTRING = r"""
 
 ROBERTA_INPUTS_DOCSTRING = r"""
     Args:
-        input_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`):
+        input_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`):
             Indices of input sequence tokens in the vocabulary.
 
-            Indices can be obtained using :class:`transformers.RobertaTokenizer`.
-            See :func:`transformers.PreTrainedTokenizer.encode` and
-            :func:`transformers.PreTrainedTokenizer.__call__` for details.
+            Indices can be obtained using :class:`~transformers.RobertaTokenizer`.
+            See :meth:`transformers.PreTrainedTokenizer.encode` and
+            :meth:`transformers.PreTrainedTokenizer.__call__` for details.
 
             `What are input IDs? <../glossary.html#input-ids>`__
-        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`{0}`, `optional`):
+        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`):
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
 
             `What are attention masks? <../glossary.html#attention-mask>`__
-        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`):
+        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
             Segment token indices to indicate first and second portions of the inputs.
-            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
-            corresponds to a `sentence B` token
+            Indices are selected in ``[0, 1]``:
+
+            - 0 corresponds to a `sentence A` token,
+            - 1 corresponds to a `sentence B` token.
 
             `What are token type IDs? <../glossary.html#token-type-ids>`_
-        position_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`):
+        position_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
             Indices of positions of each input sequence tokens in the position embeddings.
             Selected in the range ``[0, config.max_position_embeddings - 1]``.
 
@@ -521,18 +529,22 @@ ROBERTA_INPUTS_DOCSTRING = r"""
         head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
-            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
-        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
+
+            - 1 indicates the head is **not masked**,
+            - 0 indicates the head is **masked**.
+
+        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`({0}, hidden_size)`, `optional`):
             Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
-            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
+            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
+            vectors than the model's internal embedding lookup matrix.
         output_attentions (:obj:`bool`, `optional`):
-            If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
+            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
+            tensors for more detail.
         output_hidden_states (:obj:`bool`, `optional`):
-            If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
+            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
+            more detail.
         return_dict (:obj:`bool`, `optional`):
-            If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
-            plain tuple.
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
 """
 
 
@@ -705,7 +717,7 @@ class RobertaForCausalLM(RobertaPreTrainedModel):
     def get_output_embeddings(self):
         return self.lm_head.decoder
 
-    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @replace_return_docstrings(output_type=CausalLMOutput, config_class=_CONFIG_FOR_DOC)
     def forward(
         self,
@@ -723,19 +735,22 @@ class RobertaForCausalLM(RobertaPreTrainedModel):
         return_dict=None,
     ):
         r"""
-            encoder_hidden_states  (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
-                Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention
-                if the model is configured as a decoder.
-            encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
-                Mask to avoid performing attention on the padding token indices of the encoder input. This mask
-                is used in the cross-attention if the model is configured as a decoder.
-                Mask values selected in ``[0, 1]``:
-                ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
-            labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
-                Labels for computing the left-to-right language modeling loss (next word prediction).
-                Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
-                Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
-                in ``[0, ..., config.vocab_size]``
+        encoder_hidden_states  (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
+            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention
+            if the model is configured as a decoder.
+        encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+            Mask to avoid performing attention on the padding token indices of the encoder input. This mask
+            is used in the cross-attention if the model is configured as a decoder.
+            Mask values selected in ``[0, 1]``:
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **masked**.
+
+        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+            Labels for computing the left-to-right language modeling loss (next word prediction).
+            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
+            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with
+            labels in ``[0, ..., config.vocab_size]``
 
         Returns:
 
@@ -821,7 +836,7 @@ class RobertaForMaskedLM(RobertaPreTrainedModel):
     def get_output_embeddings(self):
         return self.lm_head.decoder
 
-    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="roberta-base",
@@ -935,7 +950,7 @@ class RobertaForSequenceClassification(RobertaPreTrainedModel):
 
         self.init_weights()
 
-    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="roberta-base",
@@ -1015,7 +1030,7 @@ class RobertaForMultipleChoice(RobertaPreTrainedModel):
 
         self.init_weights()
 
-    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING.format("(batch_size, num_choices, sequence_length)"))
+    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING.format("batch_size, num_choices, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="roberta-base",
@@ -1038,8 +1053,8 @@ class RobertaForMultipleChoice(RobertaPreTrainedModel):
         r"""
         labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for computing the multiple choice classification loss.
-            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension
-            of the input tensors. (see `input_ids` above)
+            Indices should be in ``[0, ..., num_choices-1]`` where :obj:`num_choices` is the size of the second dimension
+            of the input tensors. (See :obj:`input_ids` above)
         """
         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
         num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]
@@ -1104,7 +1119,7 @@ class RobertaForTokenClassification(RobertaPreTrainedModel):
 
         self.init_weights()
 
-    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="roberta-base",
@@ -1208,7 +1223,7 @@ class RobertaForQuestionAnswering(RobertaPreTrainedModel):
 
         self.init_weights()
 
-    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="roberta-base",
@@ -1232,11 +1247,11 @@ class RobertaForQuestionAnswering(RobertaPreTrainedModel):
         r"""
         start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for position (index) of the start of the labelled span for computing the token classification loss.
-            Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
         end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for position (index) of the end of the labelled span for computing the token classification loss.
-            Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
         """
         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
diff --git a/src/transformers/modeling_t5.py b/src/transformers/modeling_t5.py
index 81f837a3a0..71b49b3419 100644
--- a/src/transformers/modeling_t5.py
+++ b/src/transformers/modeling_t5.py
@@ -804,13 +804,19 @@ class T5Stack(T5PreTrainedModel):
 
 
 T5_START_DOCSTRING = r"""
+
     The T5 model was proposed in `Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
     <https://arxiv.org/abs/1910.10683>`__ by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang,
     Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu.
     It's an encoder decoder transformer pre-trained in a text-to-text denoising generative setting.
 
-    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#module>`__ sub-class. Use it as a
-    regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
+    This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
+    methods the library implements for all its model (such as downloading or saving, resizing the input embeddings,
+    pruning heads etc.)
+
+    This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
+    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
+    usage and behavior.
 
     Parameters:
         config (:class:`~transformers.T5Config`): Model configuration class with all the parameters of the model.
@@ -822,57 +828,77 @@ T5_INPUTS_DOCSTRING = r"""
     Args:
         input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
             Indices of input sequence tokens in the vocabulary.
-            T5 is a model with relative position embeddings so you should be able to pad the inputs on both the right and the left.
-            Indices can be obtained using :class:`transformers.T5Tokenizer`.
-            See :func:`transformers.PreTrainedTokenizer.encode` and
-            :func:`transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details.
-            To know more on how to prepare :obj:`input_ids` for pre-training take a look at
+            T5 is a model with relative position embeddings so you should be able to pad the inputs on both the right
+            and the left.
+
+            Indices can be obtained using :class:`~transformers.T5Tokenizer`.
+            See :meth:`transformers.PreTrainedTokenizer.encode` and
+            :meth:`transformers.PreTrainedTokenizer.__call__` for detail.
+
+            To know more on how to prepare :obj:`input_ids` for pretraining take a look a
             `T5 Training <./t5.html#training>`__.
         attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
+
+            `What are attention masks? <../glossary.html#attention-mask>`__
         encoder_outputs (:obj:`tuple(tuple(torch.FloatTensor)`, `optional`):
-            Tuple consists of (`last_hidden_state`, `optional`: `hidden_states`, `optional`: `attentions`)
-            `last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`) is a sequence of hidden-states at the output of the last layer of the encoder.
-            Used in the cross-attention of the decoder.
+            Tuple consists of (:obj:`last_hidden_state`, :obj:`optional`: `hidden_states`, :obj:`optional`: `attentions`)
+            :obj:`last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)` is a sequence of
+            hidden states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
         decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`):
-            Provide for sequence to sequence training. T5 uses the pad_token_id as the starting token for decoder_input_ids generation.
-            If `past_key_values` is used, optionally only the last `decoder_input_ids` have to be input (see `past_key_values`).
-            To know more on how to prepare :obj:`decoder_input_ids` for pre-training take a look at
-            `T5 Training <./t5.html#training>`__. If decoder_input_ids and decoder_inputs_embeds are both None,
-            decoder_input_ids takes the value of input_ids.
+            Provide for sequence to sequence training. T5 uses the :obj:`pad_token_id` as the starting token for
+            :obj:`decoder_input_ids` generation.
+            If :obj:`past_key_values` is used, optionally only the last :obj:`decoder_input_ids` have to be input (see
+            :obj:`past_key_values`).
+
+            To know more on how to prepare :obj:`decoder_input_ids` for pretraining take a look at
+            `T5 Training <./t5.html#training>`__. If :obj:`decoder_input_ids` and :obj:`decoder_inputs_embeds` are both
+            unset, :obj:`decoder_input_ids` takes the value of :obj:`input_ids`.
         decoder_attention_mask (:obj:`torch.BoolTensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`):
-            Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.
+            Default behavior: generate a tensor that ignores pad tokens in :obj:`decoder_input_ids`. Causal mask will
+            also be used by default.
         past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
-            Contains pre-computed key and value hidden-states of the attention blocks.
-            Can be used to speed up decoding.
-            If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids`
+            Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
+
+            If :obj:`past_key_values` are used, the user can optionally input only the last :obj:`decoder_input_ids`
             (those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)`
-            instead of all `decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`.
-        use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
-            If `use_cache` is True, `past_key_values` are returned and can be used to speed up decoding (see `past_key_values`).
+            instead of all :obj:`decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`.
+        use_cache (:obj:`bool`, `optional`):
+            If set to :obj:`True`, ``past_key_values`` key value states are returned and can be used to speed up
+            decoding (see ``past_key_values``).
         inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
             Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
-            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
+            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
+            vectors than the model's internal embedding lookup matrix.
         decoder_inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, target_sequence_length, hidden_size)`, `optional`):
-            Optionally, instead of passing :obj:`decoder_input_ids` you can choose to directly pass an embedded representation.
-            If `past_key_values` is used, optionally only the last `decoder_inputs_embeds` have to be input (see `past_key_values`).
-            This is useful if you want more control over how to convert `decoder_input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix. If decoder_input_ids and decoder_inputs_embeds are both None,
-            decoder_inputs_embeds takes the value of inputs_embeds.
-        head_mask: (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
+            Optionally, instead of passing :obj:`decoder_input_ids` you can choose to directly pass an embedded
+            representation.
+            If :obj:`past_key_values` is used, optionally only the last :obj:`decoder_inputs_embeds` have to be input
+            (see :obj:`past_key_values`).
+            This is useful if you want more control over how to convert :obj:`decoder_input_ids` indices into
+            associated vectors than the model's internal embedding lookup matrix.
+
+            If :obj:`decoder_input_ids` and :obj:`decoder_inputs_embeds` are both
+            unset, :obj:`decoder_input_embeds` takes the value of :obj:`input_embeds`.
+        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
-            ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.
+
+            - 1 indicates the head is **not masked**,
+            - 0 indicates the head is **masked**.
+
         output_attentions (:obj:`bool`, `optional`):
-            If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
+            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
+            tensors for more detail.
         output_hidden_states (:obj:`bool`, `optional`):
-            If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
+            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
+            more detail.
         return_dict (:obj:`bool`, `optional`):
-            If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
-            plain tuple.
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
 """
 
 
@@ -1095,13 +1121,13 @@ class T5ForConditionalGeneration(T5PreTrainedModel):
         **kwargs,
     ):
         r"""
-            labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
-                Labels for computing the sequence classification/regression loss.
-                Indices should be in :obj:`[-100, 0, ..., config.vocab_size - 1]`.
-                All labels set to ``-100`` are ignored (masked), the loss is only
-                computed for labels in ``[0, ..., config.vocab_size]``
-            kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`):
-                Used to hide legacy arguments that have been deprecated.
+        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
+            Labels for computing the sequence classification/regression loss.
+            Indices should be in :obj:`[-100, 0, ..., config.vocab_size - 1]`.
+            All labels set to ``-100`` are ignored (masked), the loss is only
+            computed for labels in ``[0, ..., config.vocab_size]``
+        kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`):
+            Used to hide legacy arguments that have been deprecated.
 
         Returns:
 
diff --git a/src/transformers/modeling_tf_albert.py b/src/transformers/modeling_tf_albert.py
index b2d13956b6..e92bdbc29c 100644
--- a/src/transformers/modeling_tf_albert.py
+++ b/src/transformers/modeling_tf_albert.py
@@ -699,34 +699,33 @@ class TFAlbertForPreTrainingOutput(ModelOutput):
 
 
 ALBERT_START_DOCSTRING = r"""
-    This model is a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ sub-class.
-    Use it as a regular TF 2.0 Keras Model and
-    refer to the TF 2.0 documentation for all matter related to general usage and behavior.
 
-    .. _`ALBERT: A Lite BERT for Self-supervised Learning of Language Representations`:
-        https://arxiv.org/abs/1909.11942
+    This model inherits from :class:`~transformers.TFPreTrainedModel`. Check the superclass documentation for the
+    generic methods the library implements for all its model (such as downloading or saving, resizing the input
+    embeddings, pruning heads etc.)
 
-    .. _`tf.keras.Model`:
-        https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/Model
+    This model is also a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ subclass.
+    Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general
+    usage and behavior.
 
     .. note::
 
         TF 2.0 models accepts two formats as inputs:
 
-            - having all inputs as keyword arguments (like PyTorch models), or
-            - having all inputs as a list, tuple or dict in the first positional arguments.
+        - having all inputs as keyword arguments (like PyTorch models), or
+        - having all inputs as a list, tuple or dict in the first positional arguments.
 
-        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having
+        This second option is useful when using :meth:`tf.keras.Model.fit` method which currently requires having
         all the tensors in the first argument of the model call function: :obj:`model(inputs)`.
 
         If you choose this second option, there are three possibilities you can use to gather all the input Tensors
         in the first positional argument :
 
-        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`
+        - a single Tensor with :obj:`input_ids` only and nothing else: :obj:`model(inputs_ids)`
         - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:
           :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`
         - a dictionary with one or several input Tensors associated to the input names given in the docstring:
-          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`
+          :obj:`model({"input_ids": input_ids, "token_type_ids": token_type_ids})`
 
     Args:
         config (:class:`~transformers.AlbertConfig`): Model configuration class with all the parameters of the model.
@@ -736,27 +735,31 @@ ALBERT_START_DOCSTRING = r"""
 
 ALBERT_INPUTS_DOCSTRING = r"""
     Args:
-        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`):
+        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`):
             Indices of input sequence tokens in the vocabulary.
 
-            Indices can be obtained using :class:`transformers.AlbertTokenizer`.
-            See :func:`transformers.PreTrainedTokenizer.encode` and
-            :func:`transformers.PreTrainedTokenizer.__call__` for details.
+            Indices can be obtained using :class:`~transformers.AlbertTokenizer`.
+            See :func:`transformers.PreTrainedTokenizer.__call__` and
+            :func:`transformers.PreTrainedTokenizer.encode` for details.
 
             `What are input IDs? <../glossary.html#input-ids>`__
-        attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`):
+        attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`, `optional`):
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
 
             `What are attention masks? <../glossary.html#attention-mask>`__
-        token_type_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`):
+        token_type_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`, `optional`):
             Segment token indices to indicate first and second portions of the inputs.
-            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
-            corresponds to a `sentence B` token
+            Indices are selected in ``[0, 1]``:
+
+            - 0 corresponds to a `sentence A` token,
+            - 1 corresponds to a `sentence B` token.
 
             `What are token type IDs? <../glossary.html#token-type-ids>`_
-        position_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`):
+        position_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`, `optional`):
             Indices of positions of each input sequence tokens in the position embeddings.
             Selected in the range ``[0, config.max_position_embeddings - 1]``.
 
@@ -764,21 +767,25 @@ ALBERT_INPUTS_DOCSTRING = r"""
         head_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
-            ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.
-        inputs_embeds (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
+
+            - 1 indicates the head is **not masked**,
+            - 0 indicates the head is **masked**.
+
+        inputs_embeds (:obj:`tf.Tensor` of shape :obj:`({0}, hidden_size)`, `optional`):
             Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
-            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
-        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):
-            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them
-            (if set to :obj:`False`) for evaluation.
+            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
+            vectors than the model's internal embedding lookup matrix.
         output_attentions (:obj:`bool`, `optional`):
-            If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
+            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
+            tensors for more detail.
         output_hidden_states (:obj:`bool`, `optional`):
-            If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
+            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
+            more detail.
         return_dict (:obj:`bool`, `optional`):
-            If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
-            plain tuple.
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
+        training (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether or not to use the model in training mode (some modules like dropout modules have different
+            behaviors between training and evaluation).
 """
 
 
@@ -791,7 +798,7 @@ class TFAlbertModel(TFAlbertPreTrainedModel):
         super().__init__(config, *inputs, **kwargs)
         self.albert = TFAlbertMainLayer(config, name="albert")
 
-    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="albert-base-v2",
@@ -820,20 +827,25 @@ class TFAlbertForPreTraining(TFAlbertPreTrainedModel):
     def get_output_embeddings(self):
         return self.albert.embeddings
 
-    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @replace_return_docstrings(output_type=TFAlbertForPreTrainingOutput, config_class=_CONFIG_FOR_DOC)
     def call(self, inputs, **kwargs):
         r"""
         Return:
 
-        Examples::
+        Example::
+
             >>> import tensorflow as tf
             >>> from transformers import AlbertTokenizer, TFAlbertForPreTraining
+
             >>> tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
             >>> model = TFAlbertForPreTraining.from_pretrained('albert-base-v2')
+
             >>> input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True))[None, :]  # Batch size 1
             >>> outputs = model(input_ids)
-            >>> prediction_scores, sop_scores = outputs[:2]
+
+            >>> prediction_logits = outputs.prediction_logits
+            >>> sop_logits = outputs.sop_logits
         """
         return_dict = kwargs.get("return_dict")
         return_dict = return_dict if return_dict is not None else self.albert.return_dict
@@ -881,7 +893,7 @@ class TFAlbertForMaskedLM(TFAlbertPreTrainedModel, TFMaskedLanguageModelingLoss)
     def get_output_embeddings(self):
         return self.albert.embeddings
 
-    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="albert-base-v2",
@@ -903,7 +915,7 @@ class TFAlbertForMaskedLM(TFAlbertPreTrainedModel, TFMaskedLanguageModelingLoss)
         training=False,
     ):
         r"""
-        labels (:obj::obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+        labels (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
             Labels for computing the masked language modeling loss.
             Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
             Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
@@ -963,7 +975,7 @@ class TFAlbertForSequenceClassification(TFAlbertPreTrainedModel, TFSequenceClass
             config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="classifier"
         )
 
-    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="albert-base-v2",
@@ -1047,7 +1059,7 @@ class TFAlbertForTokenClassification(TFAlbertPreTrainedModel, TFTokenClassificat
             config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="classifier"
         )
 
-    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="albert-base-v2",
@@ -1114,7 +1126,8 @@ class TFAlbertForTokenClassification(TFAlbertPreTrainedModel, TFTokenClassificat
 
 
 @add_start_docstrings(
-    """Albert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of the hidden-states output to compute `span start logits` and `span end logits`). """,
+    """Albert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear
+    layer on top of the hidden-states output to compute `span start logits` and `span end logits`). """,
     ALBERT_START_DOCSTRING,
 )
 class TFAlbertForQuestionAnswering(TFAlbertPreTrainedModel, TFQuestionAnsweringLoss):
@@ -1127,7 +1140,7 @@ class TFAlbertForQuestionAnswering(TFAlbertPreTrainedModel, TFQuestionAnsweringL
             config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="qa_outputs"
         )
 
-    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="albert-base-v2",
@@ -1152,11 +1165,11 @@ class TFAlbertForQuestionAnswering(TFAlbertPreTrainedModel, TFQuestionAnsweringL
         r"""
         start_positions (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for position (index) of the start of the labelled span for computing the token classification loss.
-            Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
         end_positions (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for position (index) of the end of the labelled span for computing the token classification loss.
-            Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
         """
         return_dict = return_dict if return_dict is not None else self.albert.return_dict
@@ -1232,7 +1245,7 @@ class TFAlbertForMultipleChoice(TFAlbertPreTrainedModel, TFMultipleChoiceLoss):
         """
         return {"input_ids": tf.constant(MULTIPLE_CHOICE_DUMMY_INPUTS)}
 
-    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format("(batch_size, num_choices, sequence_length)"))
+    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format("batch_size, num_choices, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="albert-base-v2",
@@ -1256,8 +1269,8 @@ class TFAlbertForMultipleChoice(TFAlbertPreTrainedModel, TFMultipleChoiceLoss):
         r"""
         labels (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for computing the multiple choice classification loss.
-            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension
-            of the input tensors. (see `input_ids` above)
+            Indices should be in ``[0, ..., num_choices]`` where :obj:`num_choices` is the size of the second dimension
+            of the input tensors. (See :obj:`input_ids` above)
         """
         if isinstance(inputs, (tuple, list)):
             input_ids = inputs[0]
diff --git a/src/transformers/modeling_tf_bert.py b/src/transformers/modeling_tf_bert.py
index be1e25f546..42fbef1afa 100644
--- a/src/transformers/modeling_tf_bert.py
+++ b/src/transformers/modeling_tf_bert.py
@@ -670,30 +670,35 @@ class TFBertForPreTrainingOutput(ModelOutput):
 
 
 BERT_START_DOCSTRING = r"""
-    This model is a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ sub-class.
-    Use it as a regular TF 2.0 Keras Model and
-    refer to the TF 2.0 documentation for all matter related to general usage and behavior.
+
+    This model inherits from :class:`~transformers.TFPreTrainedModel`. Check the superclass documentation for the
+    generic methods the library implements for all its model (such as downloading or saving, resizing the input
+    embeddings, pruning heads etc.)
+
+    This model is also a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ subclass.
+    Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general
+    usage and behavior.
 
     .. note::
 
         TF 2.0 models accepts two formats as inputs:
 
-            - having all inputs as keyword arguments (like PyTorch models), or
-            - having all inputs as a list, tuple or dict in the first positional arguments.
+        - having all inputs as keyword arguments (like PyTorch models), or
+        - having all inputs as a list, tuple or dict in the first positional arguments.
 
-        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having
+        This second option is useful when using :meth:`tf.keras.Model.fit` method which currently requires having
         all the tensors in the first argument of the model call function: :obj:`model(inputs)`.
 
         If you choose this second option, there are three possibilities you can use to gather all the input Tensors
         in the first positional argument :
 
-        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`
+        - a single Tensor with :obj:`input_ids` only and nothing else: :obj:`model(inputs_ids)`
         - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:
           :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`
         - a dictionary with one or several input Tensors associated to the input names given in the docstring:
-          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`
+          :obj:`model({"input_ids": input_ids, "token_type_ids": token_type_ids})`
 
-    Parameters:
+    Args:
         config (:class:`~transformers.BertConfig`): Model configuration class with all the parameters of the model.
             Initializing with a config file does not load the weights associated with the model, only the configuration.
             Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model weights.
@@ -701,27 +706,31 @@ BERT_START_DOCSTRING = r"""
 
 BERT_INPUTS_DOCSTRING = r"""
     Args:
-        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`):
+        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`):
             Indices of input sequence tokens in the vocabulary.
 
-            Indices can be obtained using :class:`transformers.BertTokenizer`.
-            See :func:`transformers.PreTrainedTokenizer.encode` and
-            :func:`transformers.PreTrainedTokenizer.__call__` for details.
+            Indices can be obtained using :class:`~transformers.BertTokenizer`.
+            See :func:`transformers.PreTrainedTokenizer.__call__` and
+            :func:`transformers.PreTrainedTokenizer.encode` for details.
 
             `What are input IDs? <../glossary.html#input-ids>`__
-        attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`):
+        attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`, `optional`):
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
 
             `What are attention masks? <../glossary.html#attention-mask>`__
-        token_type_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`):
+        token_type_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`, `optional`):
             Segment token indices to indicate first and second portions of the inputs.
-            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
-            corresponds to a `sentence B` token
+            Indices are selected in ``[0, 1]``:
+
+            - 0 corresponds to a `sentence A` token,
+            - 1 corresponds to a `sentence B` token.
 
             `What are token type IDs? <../glossary.html#token-type-ids>`__
-        position_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`):
+        position_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`, `optional`):
             Indices of positions of each input sequence tokens in the position embeddings.
             Selected in the range ``[0, config.max_position_embeddings - 1]``.
 
@@ -729,21 +738,25 @@ BERT_INPUTS_DOCSTRING = r"""
         head_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
-            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
-        inputs_embeds (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, embedding_dim)`, `optional`):
+
+            - 1 indicates the head is **not masked**,
+            - 0 indicates the head is **masked**.
+
+        inputs_embeds (:obj:`tf.Tensor` of shape :obj:`({0}, hidden_size)`, `optional`):
             Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
-            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
-        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):
-            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them
-            (if set to :obj:`False`) for evaluation.
+            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
+            vectors than the model's internal embedding lookup matrix.
         output_attentions (:obj:`bool`, `optional`):
-            If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
+            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
+            tensors for more detail.
         output_hidden_states (:obj:`bool`, `optional`):
-            If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
+            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
+            more detail.
         return_dict (:obj:`bool`, `optional`):
-            If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
-            plain tuple.
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
+        training (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether or not to use the model in training mode (some modules like dropout modules have different
+            behaviors between training and evaluation).
 """
 
 
@@ -756,7 +769,7 @@ class TFBertModel(TFBertPreTrainedModel):
         super().__init__(config, *inputs, **kwargs)
         self.bert = TFBertMainLayer(config, name="bert")
 
-    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="bert-base-cased",
@@ -784,7 +797,7 @@ class TFBertForPreTraining(TFBertPreTrainedModel):
     def get_output_embeddings(self):
         return self.bert.embeddings
 
-    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @replace_return_docstrings(output_type=TFBertForPreTrainingOutput, config_class=_CONFIG_FOR_DOC)
     def call(self, inputs, **kwargs):
         r"""
@@ -838,7 +851,7 @@ class TFBertForMaskedLM(TFBertPreTrainedModel, TFMaskedLanguageModelingLoss):
     def get_output_embeddings(self):
         return self.bert.embeddings
 
-    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="bert-base-cased",
@@ -996,7 +1009,7 @@ class TFBertForNextSentencePrediction(TFBertPreTrainedModel):
         self.bert = TFBertMainLayer(config, name="bert")
         self.nsp = TFBertNSPHead(config, name="nsp___cls")
 
-    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @replace_return_docstrings(output_type=TFNextSentencePredictorOutput, config_class=_CONFIG_FOR_DOC)
     def call(self, inputs, **kwargs):
         r"""
@@ -1050,7 +1063,7 @@ class TFBertForSequenceClassification(TFBertPreTrainedModel, TFSequenceClassific
             config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="classifier"
         )
 
-    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="bert-base-cased",
@@ -1142,7 +1155,7 @@ class TFBertForMultipleChoice(TFBertPreTrainedModel, TFMultipleChoiceLoss):
         """
         return {"input_ids": tf.constant(MULTIPLE_CHOICE_DUMMY_INPUTS)}
 
-    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("(batch_size, num_choices, sequence_length)"))
+    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("batch_size, num_choices, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="bert-base-cased",
@@ -1166,8 +1179,8 @@ class TFBertForMultipleChoice(TFBertPreTrainedModel, TFMultipleChoiceLoss):
         r"""
         labels (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for computing the multiple choice classification loss.
-            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension
-            of the input tensors. (see `input_ids` above)
+            Indices should be in ``[0, ..., num_choices]`` where :obj:`num_choices` is the size of the second dimension
+            of the input tensors. (See :obj:`input_ids` above)
         """
         if isinstance(inputs, (tuple, list)):
             input_ids = inputs[0]
@@ -1260,7 +1273,7 @@ class TFBertForTokenClassification(TFBertPreTrainedModel, TFTokenClassificationL
             config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="classifier"
         )
 
-    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="bert-base-cased",
@@ -1327,8 +1340,8 @@ class TFBertForTokenClassification(TFBertPreTrainedModel, TFTokenClassificationL
 
 
 @add_start_docstrings(
-    """Bert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of
-    the hidden-states output to compute `span start logits` and `span end logits`). """,
+    """Bert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear
+    layer on top of the hidden-states output to compute `span start logits` and `span end logits`). """,
     BERT_START_DOCSTRING,
 )
 class TFBertForQuestionAnswering(TFBertPreTrainedModel, TFQuestionAnsweringLoss):
@@ -1341,7 +1354,7 @@ class TFBertForQuestionAnswering(TFBertPreTrainedModel, TFQuestionAnsweringLoss)
             config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="qa_outputs"
         )
 
-    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="bert-base-cased",
@@ -1366,11 +1379,11 @@ class TFBertForQuestionAnswering(TFBertPreTrainedModel, TFQuestionAnsweringLoss)
         r"""
         start_positions (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for position (index) of the start of the labelled span for computing the token classification loss.
-            Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
         end_positions (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for position (index) of the end of the labelled span for computing the token classification loss.
-            Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
         """
         return_dict = return_dict if return_dict is not None else self.bert.return_dict
diff --git a/src/transformers/modeling_tf_camembert.py b/src/transformers/modeling_tf_camembert.py
index 2a6768d18d..8d5e6468d5 100644
--- a/src/transformers/modeling_tf_camembert.py
+++ b/src/transformers/modeling_tf_camembert.py
@@ -37,24 +37,32 @@ TF_CAMEMBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [
 
 CAMEMBERT_START_DOCSTRING = r"""
 
+    This model inherits from :class:`~transformers.TFPreTrainedModel`. Check the superclass documentation for the
+    generic methods the library implements for all its model (such as downloading or saving, resizing the input
+    embeddings, pruning heads etc.)
+
+    This model is also a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ subclass.
+    Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general
+    usage and behavior.
+
     .. note::
 
         TF 2.0 models accepts two formats as inputs:
 
-            - having all inputs as keyword arguments (like PyTorch models), or
-            - having all inputs as a list, tuple or dict in the first positional arguments.
+        - having all inputs as keyword arguments (like PyTorch models), or
+        - having all inputs as a list, tuple or dict in the first positional arguments.
 
-        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having
+        This second option is useful when using :meth:`tf.keras.Model.fit` method which currently requires having
         all the tensors in the first argument of the model call function: :obj:`model(inputs)`.
 
         If you choose this second option, there are three possibilities you can use to gather all the input Tensors
         in the first positional argument :
 
-        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`
+        - a single Tensor with :obj:`input_ids` only and nothing else: :obj:`model(inputs_ids)`
         - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:
           :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`
         - a dictionary with one or several input Tensors associated to the input names given in the docstring:
-          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`
+          :obj:`model({"input_ids": input_ids, "token_type_ids": token_type_ids})`
 
     Parameters:
         config (:class:`~transformers.CamembertConfig`): Model configuration class with all the parameters of the
diff --git a/src/transformers/modeling_tf_ctrl.py b/src/transformers/modeling_tf_ctrl.py
index afd84b3a77..c9fdb5473d 100644
--- a/src/transformers/modeling_tf_ctrl.py
+++ b/src/transformers/modeling_tf_ctrl.py
@@ -436,23 +436,32 @@ class TFCTRLPreTrainedModel(TFPreTrainedModel):
 
 CTRL_START_DOCSTRING = r"""
 
+    This model inherits from :class:`~transformers.TFPreTrainedModel`. Check the superclass documentation for the
+    generic methods the library implements for all its model (such as downloading or saving, resizing the input
+    embeddings, pruning heads etc.)
+
+    This model is also a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ subclass.
+    Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general
+    usage and behavior.
+
     .. note::
+
         TF 2.0 models accepts two formats as inputs:
 
-            - having all inputs as keyword arguments (like PyTorch models), or
-            - having all inputs as a list, tuple or dict in the first positional arguments.
+        - having all inputs as keyword arguments (like PyTorch models), or
+        - having all inputs as a list, tuple or dict in the first positional arguments.
 
-        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having
+        This second option is useful when using :meth:`tf.keras.Model.fit` method which currently requires having
         all the tensors in the first argument of the model call function: :obj:`model(inputs)`.
 
         If you choose this second option, there are three possibilities you can use to gather all the input Tensors
         in the first positional argument :
 
-        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`
+        - a single Tensor with :obj:`input_ids` only and nothing else: :obj:`model(inputs_ids)`
         - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:
           :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`
         - a dictionary with one or several input Tensors associated to the input names given in the docstring:
-          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`
+          :obj:`model({"input_ids": input_ids, "token_type_ids": token_type_ids})`
 
     Parameters:
         config (:class:`~transformers.CTRLConfig`): Model configuration class with all the parameters of the model.
@@ -463,60 +472,70 @@ CTRL_START_DOCSTRING = r"""
 CTRL_INPUTS_DOCSTRING = r"""
     Args:
         input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, input_ids_length)`):
-            :obj:`input_ids_length` = ``sequence_length`` if ``past`` is ``None`` else ``past[0].shape[-2]`` (``sequence_length`` of input past key value states).
+            :obj:`input_ids_length` = ``sequence_length`` if ``past`` is ``None`` else ``past[0].shape[-2]``
+            (``sequence_length`` of input past key value states).
 
             Indices of input sequence tokens in the vocabulary.
 
-            If `past` is used, only input_ids that do not have their past calculated should be passed as input_ids (see `past`).
+            If :obj:`past` is used, only input IDs that do not have their past calculated should be passed as
+            ``input_ids``.
 
-            Indices can be obtained using :class:`transformers.CTRLTokenizer`.
-            See :func:`transformers.PreTrainedTokenizer.encode` and
-            :func:`transformers.PreTrainedTokenizer.__call__` for details.
+            Indices can be obtained using :class:`~transformers.CTRLTokenizer`.
+            See :meth:`transformers.PreTrainedTokenizer.__call__` and
+            :meth:`transformers.PreTrainedTokenizer.encode` for details.
 
             `What are input IDs? <../glossary.html#input-ids>`__
         past (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):
             Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
-            (see `past` output below). Can be used to speed up sequential decoding.
+            (see :obj:`past` output below). Can be used to speed up sequential decoding.
             The token ids which have their past given to this model
             should not be passed as input ids as they have already been computed.
         attention_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`):
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
 
             `What are attention masks? <../glossary.html#attention-mask>`__
         token_type_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`):
             Segment token indices to indicate first and second portions of the inputs.
-            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
-            corresponds to a `sentence B` token
+            Indices are selected in ``[0, 1]``:
 
-            `What are token type IDs? <../glossary.html#token-type-ids>`_
+            - 0 corresponds to a `sentence A` token,
+            - 1 corresponds to a `sentence B` token.
+
+            `What are token type IDs? <../glossary.html#token-type-ids>`__
         position_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`):
             Indices of positions of each input sequence tokens in the position embeddings.
             Selected in the range ``[0, config.max_position_embeddings - 1]``.
 
-            `What are position IDs? <../glossary.html#position-ids>`_
-        head_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
+            `What are position IDs? <../glossary.html#position-ids>`__
+        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
-            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
+
+            - 1 indicates the head is **not masked**,
+            - 0 indicates the head is **masked**.
+
         inputs_embeds (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
             Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
-            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
-        use_cache (:obj:`bool`):
-            If `use_cache` is True, `past` key value states are returned and
-            can be used to speed up decoding (see `past`). Defaults to `True`.
-        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):
-            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them
-            (if set to :obj:`False`) for evaluation.
+            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
+            vectors than the model's internal embedding lookup matrix.
+        use_cache (:obj:`bool`, `optional`):
+            If set to :obj:`True`, ``past`` key value states are returned and can be used to speed up
+            decoding (see ``past``).
         output_attentions (:obj:`bool`, `optional`):
-            If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
+            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
+            tensors for more detail.
         output_hidden_states (:obj:`bool`, `optional`):
-            If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
+            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
+            more detail.
         return_dict (:obj:`bool`, `optional`):
-            If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
-            plain tuple.
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
+        training (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether or not to use the model in training mode (some modules like dropout modules have different
+            behaviors between training and evaluation).
 """
 
 
diff --git a/src/transformers/modeling_tf_distilbert.py b/src/transformers/modeling_tf_distilbert.py
index 4583328c3a..48d8595f78 100644
--- a/src/transformers/modeling_tf_distilbert.py
+++ b/src/transformers/modeling_tf_distilbert.py
@@ -505,28 +505,33 @@ class TFDistilBertPreTrainedModel(TFPreTrainedModel):
 
 
 DISTILBERT_START_DOCSTRING = r"""
-    This model is a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ sub-class.
-    Use it as a regular TF 2.0 Keras Model and
-    refer to the TF 2.0 documentation for all matter related to general usage and behavior.
+
+    This model inherits from :class:`~transformers.TFPreTrainedModel`. Check the superclass documentation for the
+    generic methods the library implements for all its model (such as downloading or saving, resizing the input
+    embeddings, pruning heads etc.)
+
+    This model is also a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ subclass.
+    Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general
+    usage and behavior.
 
     .. note::
 
         TF 2.0 models accepts two formats as inputs:
 
-            - having all inputs as keyword arguments (like PyTorch models), or
-            - having all inputs as a list, tuple or dict in the first positional arguments.
+        - having all inputs as keyword arguments (like PyTorch models), or
+        - having all inputs as a list, tuple or dict in the first positional arguments.
 
-        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having
+        This second option is useful when using :meth:`tf.keras.Model.fit` method which currently requires having
         all the tensors in the first argument of the model call function: :obj:`model(inputs)`.
 
         If you choose this second option, there are three possibilities you can use to gather all the input Tensors
         in the first positional argument :
 
-        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`
+        - a single Tensor with :obj:`input_ids` only and nothing else: :obj:`model(inputs_ids)`
         - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:
           :obj:`model([input_ids, attention_mask])`
         - a dictionary with one or several input Tensors associated to the input names given in the docstring:
-          :obj:`model({'input_ids': input_ids})`
+          :obj:`model({"input_ids": input_ids})`
 
     Parameters:
         config (:class:`~transformers.DistilBertConfig`): Model configuration class with all the parameters of the model.
@@ -536,38 +541,44 @@ DISTILBERT_START_DOCSTRING = r"""
 
 DISTILBERT_INPUTS_DOCSTRING = r"""
     Args:
-        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`):
+        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`):
             Indices of input sequence tokens in the vocabulary.
 
-            Indices can be obtained using :class:`transformers.BertTokenizer`.
-            See :func:`transformers.PreTrainedTokenizer.encode` and
-            :func:`transformers.PreTrainedTokenizer.__call__` for details.
+            Indices can be obtained using :class:`~transformers.DistilBertTokenizer`.
+            See :func:`transformers.PreTrainedTokenizer.__call__` and
+            :func:`transformers.PreTrainedTokenizer.encode` for details.
 
             `What are input IDs? <../glossary.html#input-ids>`__
-        attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+        attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`, `optional`):
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
 
             `What are attention masks? <../glossary.html#attention-mask>`__
         head_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
-            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
-        inputs_embeds (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, embedding_dim)`, `optional`):
+
+            - 1 indicates the head is **not masked**,
+            - 0 indicates the head is **masked**.
+
+        iinputs_embeds (:obj:`tf.Tensor` of shape :obj:`({0}, hidden_size)`, `optional`):
             Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
-            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
-        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):
-            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them
-            (if set to :obj:`False`) for evaluation.
+            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
+            vectors than the model's internal embedding lookup matrix.
         output_attentions (:obj:`bool`, `optional`):
-            If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
+            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
+            tensors for more detail.
         output_hidden_states (:obj:`bool`, `optional`):
-            If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
+            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
+            more detail.
         return_dict (:obj:`bool`, `optional`):
-            If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
-            plain tuple.
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
+        training (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether or not to use the model in training mode (some modules like dropout modules have different
+            behaviors between training and evaluation).
 """
 
 
@@ -580,7 +591,7 @@ class TFDistilBertModel(TFDistilBertPreTrainedModel):
         super().__init__(config, *inputs, **kwargs)
         self.distilbert = TFDistilBertMainLayer(config, name="distilbert")  # Embeddings
 
-    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="distilbert-base-uncased",
@@ -631,7 +642,7 @@ class TFDistilBertForMaskedLM(TFDistilBertPreTrainedModel, TFMaskedLanguageModel
     def get_output_embeddings(self):
         return self.vocab_projector.input_embeddings
 
-    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="distilbert-base-uncased",
@@ -718,7 +729,7 @@ class TFDistilBertForSequenceClassification(TFDistilBertPreTrainedModel, TFSeque
         )
         self.dropout = tf.keras.layers.Dropout(config.seq_classif_dropout)
 
-    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="distilbert-base-uncased",
@@ -799,7 +810,7 @@ class TFDistilBertForTokenClassification(TFDistilBertPreTrainedModel, TFTokenCla
             config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="classifier"
         )
 
-    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="distilbert-base-uncased",
@@ -891,7 +902,7 @@ class TFDistilBertForMultipleChoice(TFDistilBertPreTrainedModel, TFMultipleChoic
         """
         return {"input_ids": tf.constant(MULTIPLE_CHOICE_DUMMY_INPUTS)}
 
-    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING.format("batch_size, num_choices, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="distilbert-base-uncased",
@@ -913,8 +924,8 @@ class TFDistilBertForMultipleChoice(TFDistilBertPreTrainedModel, TFMultipleChoic
         r"""
         labels (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for computing the multiple choice classification loss.
-            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension
-            of the input tensors. (see `input_ids` above)
+            Indices should be in ``[0, ..., num_choices]`` where :obj:`num_choices` is the size of the second dimension
+            of the input tensors. (See :obj:`input_ids` above)
         """
         if isinstance(inputs, (tuple, list)):
             input_ids = inputs[0]
@@ -986,8 +997,8 @@ class TFDistilBertForMultipleChoice(TFDistilBertPreTrainedModel, TFMultipleChoic
 
 
 @add_start_docstrings(
-    """DistilBert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of
-    the hidden-states output to compute `span start logits` and `span end logits`). """,
+    """DistilBert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a
+    linear layer on top of the hidden-states output to compute `span start logits` and `span end logits`). """,
     DISTILBERT_START_DOCSTRING,
 )
 class TFDistilBertForQuestionAnswering(TFDistilBertPreTrainedModel, TFQuestionAnsweringLoss):
@@ -1001,7 +1012,7 @@ class TFDistilBertForQuestionAnswering(TFDistilBertPreTrainedModel, TFQuestionAn
         assert config.num_labels == 2, f"Incorrect number of labels {config.num_labels} instead of 2"
         self.dropout = tf.keras.layers.Dropout(config.qa_dropout)
 
-    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="distilbert-base-uncased",
@@ -1024,11 +1035,11 @@ class TFDistilBertForQuestionAnswering(TFDistilBertPreTrainedModel, TFQuestionAn
         r"""
         start_positions (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for position (index) of the start of the labelled span for computing the token classification loss.
-            Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
         end_positions (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for position (index) of the end of the labelled span for computing the token classification loss.
-            Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
         """
         return_dict = return_dict if return_dict is not None else self.distilbert.return_dict
diff --git a/src/transformers/modeling_tf_electra.py b/src/transformers/modeling_tf_electra.py
index 0c4ed4115a..e444f5d63e 100644
--- a/src/transformers/modeling_tf_electra.py
+++ b/src/transformers/modeling_tf_electra.py
@@ -369,28 +369,33 @@ class TFElectraForPreTrainingOutput(ModelOutput):
 
 
 ELECTRA_START_DOCSTRING = r"""
-    This model is a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ sub-class.
-    Use it as a regular TF 2.0 Keras Model and
-    refer to the TF 2.0 documentation for all matter related to general usage and behavior.
+
+    This model inherits from :class:`~transformers.TFPreTrainedModel`. Check the superclass documentation for the
+    generic methods the library implements for all its model (such as downloading or saving, resizing the input
+    embeddings, pruning heads etc.)
+
+    This model is also a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ subclass.
+    Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general
+    usage and behavior.
 
     .. note::
 
         TF 2.0 models accepts two formats as inputs:
 
-            - having all inputs as keyword arguments (like PyTorch models), or
-            - having all inputs as a list, tuple or dict in the first positional arguments.
+        - having all inputs as keyword arguments (like PyTorch models), or
+        - having all inputs as a list, tuple or dict in the first positional arguments.
 
-        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having
+        This second option is useful when using :meth:`tf.keras.Model.fit` method which currently requires having
         all the tensors in the first argument of the model call function: :obj:`model(inputs)`.
 
         If you choose this second option, there are three possibilities you can use to gather all the input Tensors
         in the first positional argument :
 
-        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`
+        - a single Tensor with :obj:`input_ids` only and nothing else: :obj:`model(inputs_ids)`
         - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:
           :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`
         - a dictionary with one or several input Tensors associated to the input names given in the docstring:
-          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`
+          :obj:`model({"input_ids": input_ids, "token_type_ids": token_type_ids})`
 
     Parameters:
         config (:class:`~transformers.ElectraConfig`): Model configuration class with all the parameters of the model.
@@ -400,21 +405,23 @@ ELECTRA_START_DOCSTRING = r"""
 
 ELECTRA_INPUTS_DOCSTRING = r"""
     Args:
-        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`):
+        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`):
             Indices of input sequence tokens in the vocabulary.
 
-            Indices can be obtained using :class:`transformers.ElectraTokenizer`.
-            See :func:`transformers.PreTrainedTokenizer.encode` and
-            :func:`transformers.PreTrainedTokenizer.__call__` for details.
+            Indices can be obtained using :class:`~transformers.ElectraTokenizer`.
+            See :func:`transformers.PreTrainedTokenizer.__call__` and
+            :func:`transformers.PreTrainedTokenizer.encode` for details.
 
             `What are input IDs? <../glossary.html#input-ids>`__
-        attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+        attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`, `optional`):
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
 
             `What are attention masks? <../glossary.html#attention-mask>`__
-        position_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`):
+        position_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`, `optional`):
             Indices of positions of each input sequence tokens in the position embeddings.
             Selected in the range ``[0, config.max_position_embeddings - 1]``.
 
@@ -422,21 +429,25 @@ ELECTRA_INPUTS_DOCSTRING = r"""
         head_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
-            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
-        inputs_embeds (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, embedding_dim)`, `optional`):
+
+            - 1 indicates the head is **not masked**,
+            - 0 indicates the head is **masked**.
+
+        inputs_embeds (:obj:`tf.Tensor` of shape :obj:`({0}, hidden_size)`, `optional`):
             Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
-            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
-        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):
-            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them
-            (if set to :obj:`False`) for evaluation.
+            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
+            vectors than the model's internal embedding lookup matrix.
         output_attentions (:obj:`bool`, `optional`):
-            If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
+            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
+            tensors for more detail.
         output_hidden_states (:obj:`bool`, `optional`):
-            If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
+            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
+            more detail.
         return_dict (:obj:`bool`, `optional`):
-            If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
-            plain tuple.
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
+        training (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether or not to use the model in training mode (some modules like dropout modules have different
+            behaviors between training and evaluation).
 """
 
 
@@ -453,7 +464,7 @@ class TFElectraModel(TFElectraPreTrainedModel):
         super().__init__(config, *inputs, **kwargs)
         self.electra = TFElectraMainLayer(config, name="electra")
 
-    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="google/electra-small-discriminator",
@@ -480,7 +491,7 @@ class TFElectraForPreTraining(TFElectraPreTrainedModel):
         self.electra = TFElectraMainLayer(config, name="electra")
         self.discriminator_predictions = TFElectraDiscriminatorPredictions(config, name="discriminator_predictions")
 
-    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @replace_return_docstrings(output_type=TFElectraForPreTrainingOutput, config_class=_CONFIG_FOR_DOC)
     def call(
         self,
@@ -575,7 +586,7 @@ class TFElectraForMaskedLM(TFElectraPreTrainedModel, TFMaskedLanguageModelingLos
     def get_output_embeddings(self):
         return self.generator_lm_head
 
-    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="google/electra-small-generator",
@@ -677,7 +688,7 @@ class TFElectraForSequenceClassification(TFElectraPreTrainedModel, TFSequenceCla
         self.electra = TFElectraMainLayer(config, name="electra")
         self.classifier = TFElectraClassificationHead(config, name="classifier")
 
-    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="google/electra-small-discriminator",
@@ -767,7 +778,7 @@ class TFElectraForMultipleChoice(TFElectraPreTrainedModel, TFMultipleChoiceLoss)
         """
         return {"input_ids": tf.constant(MULTIPLE_CHOICE_DUMMY_INPUTS)}
 
-    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING.format("(batch_size, num_choices, sequence_length)"))
+    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING.format("batch_size, num_choices, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="google/electra-small-discriminator",
@@ -791,8 +802,8 @@ class TFElectraForMultipleChoice(TFElectraPreTrainedModel, TFMultipleChoiceLoss)
         r"""
         labels (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for computing the multiple choice classification loss.
-            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension
-            of the input tensors. (see `input_ids` above)
+            Indices should be in ``[0, ..., num_choices]`` where :obj:`num_choices` is the size of the second dimension
+            of the input tensors. (See :obj:`input_ids` above)
         """
         if isinstance(inputs, (tuple, list)):
             input_ids = inputs[0]
@@ -884,7 +895,7 @@ class TFElectraForTokenClassification(TFElectraPreTrainedModel, TFTokenClassific
             config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="classifier"
         )
 
-    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="google/electra-small-discriminator",
@@ -963,7 +974,7 @@ class TFElectraForQuestionAnswering(TFElectraPreTrainedModel, TFQuestionAnswerin
             config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="qa_outputs"
         )
 
-    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="google/electra-small-discriminator",
@@ -988,11 +999,11 @@ class TFElectraForQuestionAnswering(TFElectraPreTrainedModel, TFQuestionAnswerin
         r"""
         start_positions (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for position (index) of the start of the labelled span for computing the token classification loss.
-            Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
         end_positions (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for position (index) of the end of the labelled span for computing the token classification loss.
-            Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
         """
         return_dict = return_dict if return_dict is not None else self.electra.config.return_dict
diff --git a/src/transformers/modeling_tf_flaubert.py b/src/transformers/modeling_tf_flaubert.py
index 74fa96b7ce..1b7ba4f7f0 100644
--- a/src/transformers/modeling_tf_flaubert.py
+++ b/src/transformers/modeling_tf_flaubert.py
@@ -46,9 +46,32 @@ TF_FLAUBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [
 
 FLAUBERT_START_DOCSTRING = r"""
 
-    This model is a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ sub-class.
-    Use it as a regular TF 2.0 Keras Model and
-    refer to the TF 2.0 documentation for all matter related to general usage and behavior.
+    This model inherits from :class:`~transformers.TFPreTrainedModel`. Check the superclass documentation for the
+    generic methods the library implements for all its model (such as downloading or saving, resizing the input
+    embeddings, pruning heads etc.)
+
+    This model is also a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ subclass.
+    Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general
+    usage and behavior.
+
+    .. note::
+
+        TF 2.0 models accepts two formats as inputs:
+
+        - having all inputs as keyword arguments (like PyTorch models), or
+        - having all inputs as a list, tuple or dict in the first positional arguments.
+
+        This second option is useful when using :meth:`tf.keras.Model.fit` method which currently requires having
+        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.
+
+        If you choose this second option, there are three possibilities you can use to gather all the input Tensors
+        in the first positional argument :
+
+        - a single Tensor with :obj:`input_ids` only and nothing else: :obj:`model(inputs_ids)`
+        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:
+          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`
+        - a dictionary with one or several input Tensors associated to the input names given in the docstring:
+          :obj:`model({"input_ids": input_ids, "token_type_ids": token_type_ids})`
 
     Parameters:
         config (:class:`~transformers.FlaubertConfig`): Model configuration class with all the parameters of the model.
@@ -58,57 +81,77 @@ FLAUBERT_START_DOCSTRING = r"""
 
 FLAUBERT_INPUTS_DOCSTRING = r"""
     Args:
-        input_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`):
+        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`):
             Indices of input sequence tokens in the vocabulary.
-            Indices can be obtained using :class:`transformers.BertTokenizer`.
-            See :func:`transformers.PreTrainedTokenizer.encode` and
-            :func:`transformers.PreTrainedTokenizer.__call__` for details.
+
+            Indices can be obtained using :class:`~transformers.FlaubertTokenizer`.
+            See :func:`transformers.PreTrainedTokenizer.__call__` and
+            :func:`transformers.PreTrainedTokenizer.encode` for details.
+
             `What are input IDs? <../glossary.html#input-ids>`__
-        attention_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+        attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - ``1`` for tokens that are **not masked**,
+            - ``0`` for tokens that are **maked**.
+
             `What are attention masks? <../glossary.html#attention-mask>`__
         langs (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`):
             A parallel sequence of tokens to be used to indicate the language of each token in the input.
             Indices are languages ids which can be obtained from the language names by using two conversion mappings
             provided in the configuration of the model (only provided for multilingual models).
-            More precisely, the `language name -> language id` mapping is in `model.config.lang2id` (dict str -> int) and
-            the `language id -> language name` mapping is `model.config.id2lang` (dict int -> str).
-            See usage examples detailed in the `multilingual documentation <https://huggingface.co/transformers/multilingual.html>`__.
+            More precisely, the `language name to language id` mapping is in :obj:`model.config.lang2id` (which is a
+            dictionary strring to int) and the `language id to language name` mapping is in :obj:`model.config.id2lang`
+            (dictionary int to string).
+
+            See usage examples detailed in the :doc:`multilingual documentation <../multilingual>`.
         token_type_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`):
             Segment token indices to indicate first and second portions of the inputs.
-            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
-            corresponds to a `sentence B` token
-            `What are token type IDs? <../glossary.html#token-type-ids>`_
+            Indices are selected in ``[0, 1]``:
+
+            - ``0`` corresponds to a `sentence A` token,
+            - ``1`` corresponds to a `sentence B` token.
+
+            `What are token type IDs? <../glossary.html#token-type-ids>`__
         position_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`):
             Indices of positions of each input sequence tokens in the position embeddings.
             Selected in the range ``[0, config.max_position_embeddings - 1]``.
-            `What are position IDs? <../glossary.html#position-ids>`_
+
+            `What are position IDs? <../glossary.html#position-ids>`__
         lengths (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size,)`, `optional`):
             Length of each sentence that can be used to avoid performing attention on padding token indices.
             You can also use `attention_mask` for the same result (see above), kept here for compatbility.
             Indices selected in ``[0, ..., input_ids.size(-1)]``:
         cache (:obj:`Dict[str, tf.Tensor]`, `optional`):
-            dictionary with ``tf.Tensor`` that contains pre-computed
-            hidden-states (key and values in the attention blocks) as computed by the model
-            (see `cache` output below). Can be used to speed up sequential decoding.
-            The dictionary object will be modified in-place during the forward pass to add newly computed hidden-states.
-        head_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
+            Dictionary string to ``tf.FloatTensor`` that contains precomputed hidden states (key and values in the
+            attention blocks) as computed by the model (see :obj:`cache` output below). Can be used to speed up
+            sequential decoding.
+
+            The dictionary object will be modified in-place during the forward pass to add newly computed
+            hidden-states.
+        head_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
-            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
-        inputs_embeds (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
+
+            - ``1`` indicates the head is **not masked**,
+            - ``0`` indicates the head is **masked**.
+
+        inputs_embeds (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
             Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
-            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
+            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
+            vectors than the model's internal embedding lookup matrix.
         output_attentions (:obj:`bool`, `optional`):
-            If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
+            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
+            tensors for more detail.
         output_hidden_states (:obj:`bool`, `optional`):
-            If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
+            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
+            more detail.
         return_dict (:obj:`bool`, `optional`):
-            If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
-            plain tuple.
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
+        training (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether or not to use the model in training mode (some modules like dropout modules have different
+            behaviors between training and evaluation).
 """
 
 
@@ -365,8 +408,8 @@ class TFFlaubertForSequenceClassification(TFXLMForSequenceClassification):
 
 
 @add_start_docstrings(
-    """Flaubert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of
-    the hidden-states output to compute `span start logits` and `span end logits`). """,
+    """Flaubert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a
+    linear layer on top of the hidden-states output to compute `span start logits` and `span end logits`). """,
     FLAUBERT_START_DOCSTRING,
 )
 class TFFlaubertForQuestionAnsweringSimple(TFXLMForQuestionAnsweringSimple):
diff --git a/src/transformers/modeling_tf_funnel.py b/src/transformers/modeling_tf_funnel.py
index 523dd9277e..3d35d2c4af 100644
--- a/src/transformers/modeling_tf_funnel.py
+++ b/src/transformers/modeling_tf_funnel.py
@@ -1045,32 +1045,37 @@ class TFFunnelForPreTrainingOutput(ModelOutput):
 
 
 FUNNEL_START_DOCSTRING = r"""
+
     The Funnel Transformer model was proposed in
     `Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing
     <https://arxiv.org/abs/2006.03236>`__ by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
 
-    This model is a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ sub-class.
-    Use it as a regular TF 2.0 Keras Model and
-    refer to the TF 2.0 documentation for all matter related to general usage and behavior.
+    This model inherits from :class:`~transformers.TFPreTrainedModel`. Check the superclass documentation for the
+    generic methods the library implements for all its model (such as downloading or saving, resizing the input
+    embeddings, pruning heads etc.)
+
+    This model is also a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ subclass.
+    Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general
+    usage and behavior.
 
     .. note::
 
         TF 2.0 models accepts two formats as inputs:
 
-            - having all inputs as keyword arguments (like PyTorch models), or
-            - having all inputs as a list, tuple or dict in the first positional arguments.
+        - having all inputs as keyword arguments (like PyTorch models), or
+        - having all inputs as a list, tuple or dict in the first positional arguments.
 
-        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having
+        This second option is useful when using :meth:`tf.keras.Model.fit` method which currently requires having
         all the tensors in the first argument of the model call function: :obj:`model(inputs)`.
 
         If you choose this second option, there are three possibilities you can use to gather all the input Tensors
         in the first positional argument :
 
-        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`
+        - a single Tensor with :obj:`input_ids` only and nothing else: :obj:`model(inputs_ids)`
         - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:
           :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`
         - a dictionary with one or several input Tensors associated to the input names given in the docstring:
-          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`
+          :obj:`model({"input_ids": input_ids, "token_type_ids": token_type_ids})`
 
     Parameters:
         config (:class:`~transformers.XxxConfig`): Model configuration class with all the parameters of the model.
@@ -1080,49 +1085,45 @@ FUNNEL_START_DOCSTRING = r"""
 
 FUNNEL_INPUTS_DOCSTRING = r"""
     Args:
-        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`):
+        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`):
             Indices of input sequence tokens in the vocabulary.
 
-            Indices can be obtained using :class:`transformers.XxxTokenizer`.
-            See :func:`transformers.PreTrainedTokenizer.encode` and
-            :func:`transformers.PreTrainedTokenizer.__call__` for details.
+            Indices can be obtained using :class:`~transformers.FunnelTokenizer`.
+            See :func:`transformers.PreTrainedTokenizer.__call__` and
+            :func:`transformers.PreTrainedTokenizer.encode` for details.
 
             `What are input IDs? <../glossary.html#input-ids>`__
-        attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`):
+        attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`, `optional`):
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
 
             `What are attention masks? <../glossary.html#attention-mask>`__
-        token_type_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`):
+        token_type_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`, `optional`):
             Segment token indices to indicate first and second portions of the inputs.
-            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
-            corresponds to a `sentence B` token
+            Indices are selected in ``[0, 1]``:
+
+            - 0 corresponds to a `sentence A` token,
+            - 1 corresponds to a `sentence B` token.
 
             `What are token type IDs? <../glossary.html#token-type-ids>`__
-        position_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`):
-            Indices of positions of each input sequence tokens in the position embeddings.
-            Selected in the range ``[0, config.max_position_embeddings - 1]``.
-
-            `What are position IDs? <../glossary.html#position-ids>`__
-        head_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
-            Mask to nullify selected heads of the self-attention modules.
-            Mask values selected in ``[0, 1]``:
-            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
-        inputs_embeds (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, embedding_dim)`, `optional`):
+        inputs_embeds (:obj:`tf.Tensor` of shape :obj:`({0}, hidden_size)`, `optional`):
             Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
-            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
+            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
+            vectors than the model's internal embedding lookup matrix.
         output_attentions (:obj:`bool`, `optional`):
-            If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
+            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
+            tensors for more detail.
         output_hidden_states (:obj:`bool`, `optional`):
-            If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
+            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
+            more detail.
         return_dict (:obj:`bool`, `optional`):
-            If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
-            plain tuple.
-        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):
-            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them
-            (if set to :obj:`False`) for evaluation.
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
+        training (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether or not to use the model in training mode (some modules like dropout modules have different
+            behaviors between training and evaluation).
 """
 
 
@@ -1136,7 +1137,7 @@ class TFFunnelBaseModel(TFFunnelPreTrainedModel):
         super().__init__(config, *inputs, **kwargs)
         self.funnel = TFFunnelBaseLayer(config, name="funnel")
 
-    @add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="funnel-transformer/small-base",
@@ -1156,7 +1157,7 @@ class TFFunnelModel(TFFunnelPreTrainedModel):
         super().__init__(config, *inputs, **kwargs)
         self.funnel = TFFunnelMainLayer(config, name="funnel")
 
-    @add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="funnel-transformer/small",
@@ -1179,7 +1180,7 @@ class TFFunnelForPreTraining(TFFunnelPreTrainedModel):
         self.funnel = TFFunnelMainLayer(config, name="funnel")
         self.discriminator_predictions = TFFunnelDiscriminatorPredictions(config, name="discriminator_predictions")
 
-    @add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @replace_return_docstrings(output_type=TFFunnelForPreTrainingOutput, config_class=_CONFIG_FOR_DOC)
     def call(
         self,
@@ -1239,7 +1240,7 @@ class TFFunnelForMaskedLM(TFFunnelPreTrainedModel, TFMaskedLanguageModelingLoss)
         self.funnel = TFFunnelMainLayer(config, name="funnel")
         self.lm_head = TFFunnelMaskedLMHead(config, self.funnel.embeddings, name="lm_head")
 
-    @add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="funnel-transformer/small",
@@ -1314,7 +1315,7 @@ class TFFunnelForSequenceClassification(TFFunnelPreTrainedModel, TFSequenceClass
         self.funnel = TFFunnelBaseLayer(config, name="funnel")
         self.classifier = TFFunnelClassificationHead(config, config.num_labels, name="classifier")
 
-    @add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="funnel-transformer/small-base",
@@ -1398,7 +1399,7 @@ class TFFunnelForMultipleChoice(TFFunnelPreTrainedModel, TFMultipleChoiceLoss):
         """
         return {"input_ids": tf.constant(MULTIPLE_CHOICE_DUMMY_INPUTS)}
 
-    @add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("(batch_size, num_choices, sequence_length)"))
+    @add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("batch_size, num_choices, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="funnel-transformer/small-base",
@@ -1420,9 +1421,8 @@ class TFFunnelForMultipleChoice(TFFunnelPreTrainedModel, TFMultipleChoiceLoss):
         r"""
         labels (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for computing the multiple choice classification loss.
-            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension
-            of the input tensors. (see `input_ids` above)s after the attention softmax, used to compute the weighted average in the self-attention
-            heads.
+            Indices should be in ``[0, ..., num_choices]`` where :obj:`num_choices` is the size of the second dimension
+            of the input tensors. (See :obj:`input_ids` above)
         """
         if isinstance(inputs, (tuple, list)):
             input_ids = inputs[0]
@@ -1510,7 +1510,7 @@ class TFFunnelForTokenClassification(TFFunnelPreTrainedModel, TFTokenClassificat
             config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="classifier"
         )
 
-    @add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="funnel-transformer/small",
@@ -1587,7 +1587,7 @@ class TFFunnelForQuestionAnswering(TFFunnelPreTrainedModel, TFQuestionAnsweringL
             config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="qa_outputs"
         )
 
-    @add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(FUNNEL_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="funnel-transformer/small",
@@ -1610,11 +1610,11 @@ class TFFunnelForQuestionAnswering(TFFunnelPreTrainedModel, TFQuestionAnsweringL
         r"""
         start_positions (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for position (index) of the start of the labelled span for computing the token classification loss.
-            Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
         end_positions (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for position (index) of the end of the labelled span for computing the token classification loss.
-            Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
         """
         return_dict = return_dict if return_dict is not None else self.funnel.return_dict
diff --git a/src/transformers/modeling_tf_gpt2.py b/src/transformers/modeling_tf_gpt2.py
index 95b78a5253..ac1c4b2a8a 100644
--- a/src/transformers/modeling_tf_gpt2.py
+++ b/src/transformers/modeling_tf_gpt2.py
@@ -450,23 +450,32 @@ class TFGPT2DoubleHeadsModelOutput(ModelOutput):
 
 GPT2_START_DOCSTRING = r"""
 
+    This model inherits from :class:`~transformers.TFPreTrainedModel`. Check the superclass documentation for the
+    generic methods the library implements for all its model (such as downloading or saving, resizing the input
+    embeddings, pruning heads etc.)
+
+    This model is also a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ subclass.
+    Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general
+    usage and behavior.
+
     .. note::
+
         TF 2.0 models accepts two formats as inputs:
 
-            - having all inputs as keyword arguments (like PyTorch models), or
-            - having all inputs as a list, tuple or dict in the first positional arguments.
+        - having all inputs as keyword arguments (like PyTorch models), or
+        - having all inputs as a list, tuple or dict in the first positional arguments.
 
-        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having
+        This second option is useful when using :meth:`tf.keras.Model.fit` method which currently requires having
         all the tensors in the first argument of the model call function: :obj:`model(inputs)`.
 
         If you choose this second option, there are three possibilities you can use to gather all the input Tensors
         in the first positional argument :
 
-        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`
+        - a single Tensor with :obj:`input_ids` only and nothing else: :obj:`model(inputs_ids)`
         - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:
           :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`
         - a dictionary with one or several input Tensors associated to the input names given in the docstring:
-          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`
+          :obj:`model({"input_ids": input_ids, "token_type_ids": token_type_ids})`
 
     Parameters:
         config (:class:`~transformers.GPT2Config`): Model configuration class with all the parameters of the model.
@@ -477,56 +486,66 @@ GPT2_START_DOCSTRING = r"""
 GPT2_INPUTS_DOCSTRING = r"""
     Args:
         input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, input_ids_length)`):
-            :obj:`input_ids_length` = ``sequence_length`` if ``past`` is ``None`` else ``past[0].shape[-2]`` (``sequence_length`` of input past key value states).
+            :obj:`input_ids_length` = ``sequence_length`` if ``past`` is ``None`` else ``past[0].shape[-2]``
+            (``sequence_length`` of input past key value states).
             Indices of input sequence tokens in the vocabulary.
 
-            If `past` is used, only `input_ids` that do not have their past calculated should be passed as `input_ids`.
+            If :obj:`past` is used, only input IDs that do not have their past calculated should be passed as
+            ``input_ids``.
 
-            Indices can be obtained using :class:`transformers.GPT2Tokenizer`.
-            See :func:`transformers.PreTrainedTokenizer.encode` and
-            :func:`transformers.PreTrainedTokenizer.__call__` for details.
+            Indices can be obtained using :class:`~transformers.GPT2Tokenizer`.
+            See :func:`transformers.PreTrainedTokenizer.__call__` and
+            :func:`transformers.PreTrainedTokenizer.encode` for details.
 
             `What are input IDs? <../glossary.html#input-ids>`__
         past (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):
             Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
-            (see `past` output below). Can be used to speed up sequential decoding.
+            (see :obj:`past` output below). Can be used to speed up sequential decoding.
             The token ids which have their past given to this model
-            should not be passed as `input_ids` as they have already been computed.
+            should not be passed as input ids as they have already been computed.
         attention_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`):
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
 
             `What are attention masks? <../glossary.html#attention-mask>`__
         token_type_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`):
             Segment token indices to indicate first and second portions of the inputs.
-            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
-            corresponds to a `sentence B` token
+            Indices are selected in ``[0, 1]``:
 
-            `What are token type IDs? <../glossary.html#token-type-ids>`_
+            - 0 corresponds to a `sentence A` token,
+            - 1 corresponds to a `sentence B` token.
+
+            `What are token type IDs? <../glossary.html#token-type-ids>`__
         position_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`):
             Indices of positions of each input sequence tokens in the position embeddings.
             Selected in the range ``[0, config.max_position_embeddings - 1]``.
 
-            `What are position IDs? <../glossary.html#position-ids>`_
-        head_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
+            `What are position IDs? <../glossary.html#position-ids>`__
+        head_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
-            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
-        inputs_embeds (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
+
+            - 1 indicates the head is **not masked**,
+            - 0 indicates the head is **masked**.
+
+        inputs_embeds (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
             Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
-            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
-        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):
-            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them
-            (if set to :obj:`False`) for evaluation.
+            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
+            vectors than the model's internal embedding lookup matrix.
         output_attentions (:obj:`bool`, `optional`):
-            If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
+            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
+            tensors for more detail.
         output_hidden_states (:obj:`bool`, `optional`):
-            If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
+            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
+            more detail.
         return_dict (:obj:`bool`, `optional`):
-            If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
-            plain tuple.
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
+        training (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether or not to use the model in training mode (some modules like dropout modules have different
+            behaviors between training and evaluation).
 """
 
 
@@ -685,9 +704,9 @@ class TFGPT2DoubleHeadsModel(TFGPT2PreTrainedModel):
         training=False,
     ):
         r"""
-            mc_token_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, num_choices)`, `optional`, default to index of the last token of the input)
-                Index of the classification token in each input sequence.
-                Selected in the range ``[0, input_ids.size(-1) - 1[``.
+        mc_token_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, num_choices)`, `optional`, default to index of the last token of the input)
+            Index of the classification token in each input sequence.
+            Selected in the range ``[0, input_ids.size(-1) - 1[``.
 
         Return:
 
diff --git a/src/transformers/modeling_tf_longformer.py b/src/transformers/modeling_tf_longformer.py
index 5e3d69de64..c5552f716e 100644
--- a/src/transformers/modeling_tf_longformer.py
+++ b/src/transformers/modeling_tf_longformer.py
@@ -1220,28 +1220,33 @@ class TFLongformerPreTrainedModel(TFPreTrainedModel):
 
 
 LONGFORMER_START_DOCSTRING = r"""
-    This model is a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ sub-class.
-    Use it as a regular TF 2.0 Keras Model and
-    refer to the TF 2.0 documentation for all matter related to general usage and behavior.
+
+    This model inherits from :class:`~transformers.TFPreTrainedModel`. Check the superclass documentation for the
+    generic methods the library implements for all its model (such as downloading or saving, resizing the input
+    embeddings, pruning heads etc.)
+
+    This model is also a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ subclass.
+    Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general
+    usage and behavior.
 
     .. note::
 
         TF 2.0 models accepts two formats as inputs:
 
-            - having all inputs as keyword arguments (like PyTorch models), or
-            - having all inputs as a list, tuple or dict in the first positional arguments.
+        - having all inputs as keyword arguments (like PyTorch models), or
+        - having all inputs as a list, tuple or dict in the first positional arguments.
 
-        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having
+        This second option is useful when using :meth:`tf.keras.Model.fit` method which currently requires having
         all the tensors in the first argument of the model call function: :obj:`model(inputs)`.
 
         If you choose this second option, there are three possibilities you can use to gather all the input Tensors
         in the first positional argument :
 
-        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`
+        - a single Tensor with :obj:`input_ids` only and nothing else: :obj:`model(inputs_ids)`
         - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:
           :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`
         - a dictionary with one or several input Tensors associated to the input names given in the docstring:
-          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`
+          :obj:`model({"input_ids": input_ids, "token_type_ids": token_type_ids})`
 
     Parameters:
         config (:class:`~transformers.LongformerConfig`): Model configuration class with all the parameters of the model.
@@ -1252,53 +1257,61 @@ LONGFORMER_START_DOCSTRING = r"""
 
 LONGFORMER_INPUTS_DOCSTRING = r"""
     Args:
-        input_ids (:obj:`tf.Tensor` of shape :obj:`{0}`):
+        input_ids (:obj:`tf.Tensor` of shape :obj:`({0})`):
             Indices of input sequence tokens in the vocabulary.
 
-            Indices can be obtained using :class:`transformers.LonmgformerTokenizer`.
-            See :func:`transformers.PreTrainedTokenizer.encode` and
-            :func:`transformers.PreTrainedTokenizer.__call__` for details.
+            Indices can be obtained using :class:`~transformers.LongformerTokenizer`.
+            See :func:`transformers.PreTrainedTokenizer.__call__` and
+            :func:`transformers.PreTrainedTokenizer.encode` for details.
 
             `What are input IDs? <../glossary.html#input-ids>`__
-        attention_mask (:obj:`tf.Tensor` of shape :obj:`{0}`, `optional`):
+        attention_mask (:obj:`tf.Tensor` of shape :obj:`({0})`, `optional`):
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
 
             `What are attention masks? <../glossary.html#attention-mask>`__
-
-        global_attention_mask (:obj:`tf.Tensor` of shape :obj:`{0}`, `optional`):
+        global_attention_mask (:obj:`tf.Tensor` of shape :obj:`({0})`, `optional`):
             Mask to decide the attention given on each token, local attention or global attenion.
             Tokens with global attention attends to all other tokens, and all other tokens attend to them. This is important for
             task-specific finetuning because it makes the model more flexible at representing the task. For example,
             for classification, the <s> token should be given global attention. For QA, all question tokens should also have
             global attention. Please refer to the `Longformer paper <https://arxiv.org/abs/2004.05150>`__ for more details.
             Mask values selected in ``[0, 1]``:
-            ``0`` for local attention (a sliding window attention),
-            ``1`` for global attention (tokens that attend to all other tokens, and all other tokens attend to them).
 
-        token_type_ids (:obj:`tf.Tensor` of shape :obj:`{0}`, `optional`):
+            - 0 for local attention (a sliding window attention),
+            - 1 for global attention (tokens that attend to all other tokens, and all other tokens attend to them).
+
+        token_type_ids (:obj:`tf.Tensor` of shape :obj:`({0})`, `optional`):
             Segment token indices to indicate first and second portions of the inputs.
-            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
-            corresponds to a `sentence B` token
+            Indices are selected in ``[0, 1]``:
 
-            `What are token type IDs? <../glossary.html#token-type-ids>`_
-        position_ids (:obj:`tf.Tensor` of shape :obj:`{0}`, `optional`):
+            - 0 corresponds to a `sentence A` token,
+            - 1 corresponds to a `sentence B` token.
+
+            `What are token type IDs? <../glossary.html#token-type-ids>`__
+        position_ids (:obj:`tf.Tensor` of shape :obj:`({0})`, `optional`):
             Indices of positions of each input sequence tokens in the position embeddings.
             Selected in the range ``[0, config.max_position_embeddings - 1]``.
 
-            `What are position IDs? <../glossary.html#position-ids>`_
-        inputs_embeds (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
+            `What are position IDs? <../glossary.html#position-ids>`__
+        inputs_embeds (:obj:`tf.Tensor` of shape :obj:`({0}, hidden_size)`, `optional`):
             Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
-            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
+            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
+            vectors than the model's internal embedding lookup matrix.
         output_attentions (:obj:`bool`, `optional`):
-            If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
+            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
+            tensors for more detail.
         output_hidden_states (:obj:`bool`, `optional`):
-            If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
+            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
+            more detail.
         return_dict (:obj:`bool`, `optional`):
-            If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
-            plain tuple.
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
+        training (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether or not to use the model in training mode (some modules like dropout modules have different
+            behaviors between training and evaluation).
 """
 
 
@@ -1308,13 +1321,15 @@ LONGFORMER_INPUTS_DOCSTRING = r"""
 )
 class TFLongformerModel(TFLongformerPreTrainedModel):
     """
-    This class copies code from :class:`~transformers.RobertaModel` and overwrites standard self-attention with longformer self-attention to provide the ability to process
+
+    This class copies code from :class:`~transformers.TFRobertaModel` and overwrites standard self-attention with
+    longformer self-attention to provide the ability to process
     long sequences following the self-attention approach described in `Longformer: the Long-Document Transformer
     <https://arxiv.org/abs/2004.05150>`__ by Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer self-attention
     combines a local (sliding window) and global attention to extend to long documents without the O(n^2) increase in
     memory and compute.
 
-    The self-attention module :obj:`LongformerSelfAttention` implemented here supports the combination of local and
+    The self-attention module :obj:`TFLongformerSelfAttention` implemented here supports the combination of local and
     global attention but it lacks support for autoregressive attention and dilated attention. Autoregressive
     and dilated attention are more relevant for autoregressive language modeling than finetuning on downstream
     tasks. Future release will add support for autoregressive attention, but the support for dilated attention
@@ -1326,7 +1341,7 @@ class TFLongformerModel(TFLongformerPreTrainedModel):
         super().__init__(config, *inputs, **kwargs)
         self.longformer = TFLongformerMainLayer(config, name="longformer")
 
-    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     def call(self, inputs, **kwargs):
         outputs = self.longformer(inputs, **kwargs)
         return outputs
@@ -1346,7 +1361,7 @@ class TFLongformerForMaskedLM(TFLongformerPreTrainedModel, TFMaskedLanguageModel
     def get_output_embeddings(self):
         return self.lm_head.decoder
 
-    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="allenai/longformer-base-4096",
@@ -1413,8 +1428,8 @@ class TFLongformerForMaskedLM(TFLongformerPreTrainedModel, TFMaskedLanguageModel
 
 
 @add_start_docstrings(
-    """Longformer Model with a span classification head on top for extractive question-answering tasks like SQuAD / TriviaQA (a linear layers on top of
-    the hidden-states output to compute `span start logits` and `span end logits`). """,
+    """Longformer Model with a span classification head on top for extractive question-answering tasks like SQuAD /
+    TriviaQA (a linear layer on top of the hidden-states output to compute `span start logits` and `span end logits`). """,
     LONGFORMER_START_DOCSTRING,
 )
 class TFLongformerForQuestionAnswering(TFLongformerPreTrainedModel, TFQuestionAnsweringLoss):
@@ -1429,7 +1444,7 @@ class TFLongformerForQuestionAnswering(TFLongformerPreTrainedModel, TFQuestionAn
             name="qa_outputs",
         )
 
-    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="allenai/longformer-large-4096-finetuned-triviaqa",
diff --git a/src/transformers/modeling_tf_lxmert.py b/src/transformers/modeling_tf_lxmert.py
index 92d1578567..bd9da4e2ba 100644
--- a/src/transformers/modeling_tf_lxmert.py
+++ b/src/transformers/modeling_tf_lxmert.py
@@ -859,33 +859,35 @@ class TFLxmertPreTrainedModel(TFPreTrainedModel):
 
 
 LXMERT_START_DOCSTRING = r"""
+
     The LXMERT model was proposed in `LXMERT: Learning Cross-Modality Encoder Representations from Transformers <https://arxiv.org/abs/1908.07490>`__
     by Hao Tan and Mohit Bansal. It's a vision and language transformer model,
     pre-trained on a variety of multi-modal datasets comprising of GQA, VQAv2.0, MCSCOCO captions, and Visual genome,
     using a combination of masked language modeling, region of interest feature regression,
     cross entropy loss for question answering attribute prediction, and object tag predicition.
 
-    This model is a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ sub-class.
-    Use it as a regular TF 2.0 Keras Model and
-    refer to the TF 2.0 documentation for all matter related to general usage and behavior.
+    This model is also a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ subclass.
+    Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general
+    usage and behavior.
+
+    .. note::
 
-    Note on the model inputs:
         TF 2.0 models accepts two formats as inputs:
 
-            - having all inputs as keyword arguments (like PyTorch models), or
-            - having all inputs as a list, tuple or dict in the first positional arguments.
+        - having all inputs as keyword arguments (like PyTorch models), or
+        - having all inputs as a list, tuple or dict in the first positional arguments.
 
-        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having
+        This second option is useful when using :meth:`tf.keras.Model.fit` method which currently requires having
         all the tensors in the first argument of the model call function: :obj:`model(inputs)`.
 
         If you choose this second option, there are three possibilities you can use to gather all the input Tensors
         in the first positional argument :
 
-        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`
+        - a single Tensor with :obj:`input_ids` only and nothing else: :obj:`model(inputs_ids)`
         - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:
           :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`
         - a dictionary with one or several input Tensors associated to the input names given in the docstring:
-          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`
+          :obj:`model({"input_ids": input_ids, "token_type_ids": token_type_ids})`
 
     Parameters:
         config (:class:`~transformers.LxmertConfig`): Model configuration class with all the parameters of the model.
@@ -898,32 +900,61 @@ LXMERT_INPUTS_DOCSTRING = r"""
         input_ids (:obj:`np.ndarray` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`):
             Indices of input sequence tokens in the vocabulary.
 
-            Indices can be obtained using :class:`transformers.LxmertTokenizer`.
-            See :func:`transformers.PreTrainedTokenizer.encode` and
-            :func:`transformers.PreTrainedTokenizer.__call__` for details.
+            Indices can be obtained using :class:`~transformers.LxmertTokenizer`.
+            See :func:`transformers.PreTrainedTokenizer.__call__` and
+            :func:`transformers.PreTrainedTokenizer.encode` for details.
 
             `What are input IDs? <../glossary.html#input-ids>`__
-        attention_mask (:obj:`tf.Tensor` of shape :obj:`{0}`, `optional`):
-            Mask to avoid performing attention on padding token indices.
-            Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
-
-            `What are attention masks? <../glossary.html#attention-mask>`__
-        token_type_ids (:obj:`tf.Tensor` of shape :obj:`{0}`, `optional`):
-            Segment token indices to indicate first and second portions of the inputs.
-            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
-            corresponds to a `sentence B` token
-
-            `What are token type IDs? <../glossary.html#token-type-ids>`_
         visual_feats: (:obj:`tf.Tensor` of shape :obj:՝(batch_size, num_visual_features, visual_feat_dim)՝):
-            This input represents visual features. They ROI pooled object features from bounding boxes using a faster-RCNN model)
-            These are currently not provided by the transformers library
-        visual_attention_mask (:obj:`tf.Tensor` of shape :obj:`{0}`, `optional`):
+            This input represents visual features. They ROI pooled object features from bounding boxes using a
+            faster-RCNN model)
+
+            These are currently not provided by the transformers library.
+        visual_pos: (:obj:`tf.Tensor` of shape :obj:՝(batch_size, num_visual_features, visual_feat_dim)՝):
+            This input represents spacial features corresponding to their relative (via index) visual features.
+            The pre-trained LXMERT model expects these spacial features to be normalized bounding boxes on a scale of
+            0 to 1.
+
+            These are currently not provided by the transformers library.
+        attention_mask (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
 
             `What are attention masks? <../glossary.html#attention-mask>`__
+        visual_attention_mask (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+            MMask to avoid performing attention on padding token indices.
+            Mask values selected in ``[0, 1]``:
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
+
+            `What are attention masks? <../glossary.html#attention-mask>`__
+        token_type_ids (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+            Segment token indices to indicate first and second portions of the inputs.
+            Indices are selected in ``[0, 1]``:
+
+            - 0 corresponds to a `sentence A` token,
+            - 1 corresponds to a `sentence B` token.
+
+            `What are token type IDs? <../glossary.html#token-type-ids>`__
+        inputs_embeds (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
+            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
+            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
+            vectors than the model's internal embedding lookup matrix.
+        output_attentions (:obj:`bool`, `optional`):
+            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
+            tensors for more detail.
+        output_hidden_states (:obj:`bool`, `optional`):
+            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
+            more detail.
+        return_dict (:obj:`bool`, `optional`):
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
+        training (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether or not to use the model in training mode (some modules like dropout modules have different
+            behaviors between training and evaluation).
 """
 
 
@@ -936,7 +967,7 @@ class TFLxmertModel(TFLxmertPreTrainedModel):
         super().__init__(config, *inputs, **kwargs)
         self.lxmert = TFLxmertMainLayer(config, name="lxmert")
 
-    @add_start_docstrings_to_callable(LXMERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(LXMERT_INPUTS_DOCSTRING)
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="unc-nlp/lxmert-base-uncased",
@@ -1189,7 +1220,7 @@ class TFLxmertForPreTraining(TFLxmertPreTrainedModel):
             **({"obj_labels": obj_labels} if self.config.task_obj_predict else {}),
         }
 
-    @add_start_docstrings_to_callable(LXMERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(LXMERT_INPUTS_DOCSTRING)
     @replace_return_docstrings(output_type=TFLxmertForPreTrainingOutput, config_class=_CONFIG_FOR_DOC)
     def call(
         self,
@@ -1219,10 +1250,12 @@ class TFLxmertForPreTraining(TFLxmertPreTrainedModel):
             ``(batch_size, num_features)`` and ``(batch_size, num_features, visual_feature_dim)``
             for each the label id and the label score respectively
         matched_label (``tf.Tensor`` of shape ``(batch_size,)``, `optional`):
-            Labels for computing the whether or not the text input matches the image (classification) loss. Input should be a sequence pair (see :obj:`input_ids` docstring)
-            Indices should be in ``[0, 1]``.
-            ``0`` indicates that the sentence does not match the image
-            ``1`` indicates that the sentence does match the image
+            Labels for computing the whether or not the text input matches the image (classification) loss. Input
+            should be a sequence pair (see :obj:`input_ids` docstring)
+            Indices should be in ``[0, 1]``:
+
+            - 0 indicates that the sentence does not match the image,
+            - 1 indicates that the sentence does match the image.
         ans: (``Torch.Tensor`` of shape ``(batch_size)``, `optional`, defaults to :obj: `None`):
             a one hot representation hof the correct answer `optional`
 
diff --git a/src/transformers/modeling_tf_mobilebert.py b/src/transformers/modeling_tf_mobilebert.py
index f620e3add8..6d1efe8b2d 100644
--- a/src/transformers/modeling_tf_mobilebert.py
+++ b/src/transformers/modeling_tf_mobilebert.py
@@ -843,28 +843,33 @@ class TFMobileBertForPreTrainingOutput(ModelOutput):
 
 
 MOBILEBERT_START_DOCSTRING = r"""
-    This model is a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ sub-class.
-    Use it as a regular TF 2.0 Keras Model and
-    refer to the TF 2.0 documentation for all matter related to general usage and behavior.
+
+    This model inherits from :class:`~transformers.TFPreTrainedModel`. Check the superclass documentation for the
+    generic methods the library implements for all its model (such as downloading or saving, resizing the input
+    embeddings, pruning heads etc.)
+
+    This model is also a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ subclass.
+    Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general
+    usage and behavior.
 
     .. note::
 
         TF 2.0 models accepts two formats as inputs:
 
-            - having all inputs as keyword arguments (like PyTorch models), or
-            - having all inputs as a list, tuple or dict in the first positional arguments.
+        - having all inputs as keyword arguments (like PyTorch models), or
+        - having all inputs as a list, tuple or dict in the first positional arguments.
 
-        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having
+        This second option is useful when using :meth:`tf.keras.Model.fit` method which currently requires having
         all the tensors in the first argument of the model call function: :obj:`model(inputs)`.
 
         If you choose this second option, there are three possibilities you can use to gather all the input Tensors
         in the first positional argument :
 
-        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`
+        - a single Tensor with :obj:`input_ids` only and nothing else: :obj:`model(inputs_ids)`
         - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:
           :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`
         - a dictionary with one or several input Tensors associated to the input names given in the docstring:
-          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`
+          :obj:`model({"input_ids": input_ids, "token_type_ids": token_type_ids})`
 
     Parameters:
         config (:class:`~transformers.MobileBertConfig`): Model configuration class with all the parameters of the model.
@@ -874,49 +879,57 @@ MOBILEBERT_START_DOCSTRING = r"""
 
 MOBILEBERT_INPUTS_DOCSTRING = r"""
     Args:
-        input_ids (:obj:`np.ndarray` or :obj:`tf.Tensor` of shape :obj:`{0}`):
+        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`):
             Indices of input sequence tokens in the vocabulary.
 
-            Indices can be obtained using :class:`transformers.MobileBertTokenizer`.
-            See :func:`transformers.PreTrainedTokenizer.encode` and
-            :func:`transformers.PreTrainedTokenizer.__call__` for details.
+            Indices can be obtained using :class:`~transformers.MobileBertTokenizer`.
+            See :func:`transformers.PreTrainedTokenizer.__call__` and
+            :func:`transformers.PreTrainedTokenizer.encode` for details.
 
             `What are input IDs? <../glossary.html#input-ids>`__
-        attention_mask (:obj:`np.ndarray` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`):
+        attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`, `optional`):
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
 
             `What are attention masks? <../glossary.html#attention-mask>`__
-        token_type_ids (:obj:`np.ndarray` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`):
+        token_type_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`, `optional`):
             Segment token indices to indicate first and second portions of the inputs.
-            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
-            corresponds to a `sentence B` token
+            Indices are selected in ``[0, 1]``:
+
+            - 0 corresponds to a `sentence A` token,
+            - 1 corresponds to a `sentence B` token.
 
             `What are token type IDs? <../glossary.html#token-type-ids>`__
-        position_ids (:obj:`np.ndarray` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`):
+        position_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`, `optional`):
             Indices of positions of each input sequence tokens in the position embeddings.
             Selected in the range ``[0, config.max_position_embeddings - 1]``.
 
             `What are position IDs? <../glossary.html#position-ids>`__
-        head_mask (:obj:`np.ndarray` or :obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
+        head_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
-            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
-        inputs_embeds (:obj:`np.ndarray` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, embedding_dim)`, `optional`):
-            Optionally, instead of passing :obj:`input_ids` you can  to directly pass an embedded representation.
-            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
-        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):
-            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them
-            (if set to :obj:`False`) for evaluation.
+
+            - 1 indicates the head is **not masked**,
+            - 0 indicates the head is **masked**.
+
+        inputs_embeds (:obj:`tf.Tensor` of shape :obj:`({0}, hidden_size)`, `optional`):
+            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
+            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
+            vectors than the model's internal embedding lookup matrix.
         output_attentions (:obj:`bool`, `optional`):
-            If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
+            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
+            tensors for more detail.
         output_hidden_states (:obj:`bool`, `optional`):
-            If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
+            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
+            more detail.
         return_dict (:obj:`bool`, `optional`):
-            If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
-            plain tuple.
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
+        training (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether or not to use the model in training mode (some modules like dropout modules have different
+            behaviors between training and evaluation).
 """
 
 
@@ -929,7 +942,7 @@ class TFMobileBertModel(TFMobileBertPreTrainedModel):
         super().__init__(config, *inputs, **kwargs)
         self.mobilebert = TFMobileBertMainLayer(config, name="mobilebert")
 
-    @add_start_docstrings_to_callable(MOBILEBERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(MOBILEBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="google/mobilebert-uncased",
@@ -956,7 +969,7 @@ class TFMobileBertForPreTraining(TFMobileBertPreTrainedModel):
     def get_output_embeddings(self):
         return self.mobilebert.embeddings
 
-    @add_start_docstrings_to_callable(MOBILEBERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(MOBILEBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @replace_return_docstrings(output_type=TFMobileBertForPreTrainingOutput, config_class=_CONFIG_FOR_DOC)
     def call(self, inputs, **kwargs):
         r"""
@@ -1004,7 +1017,7 @@ class TFMobileBertForMaskedLM(TFMobileBertPreTrainedModel, TFMaskedLanguageModel
     def get_output_embeddings(self):
         return self.mobilebert.embeddings
 
-    @add_start_docstrings_to_callable(MOBILEBERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(MOBILEBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="google/mobilebert-uncased",
@@ -1090,7 +1103,7 @@ class TFMobileBertForNextSentencePrediction(TFMobileBertPreTrainedModel):
         self.mobilebert = TFMobileBertMainLayer(config, name="mobilebert")
         self.cls = TFMobileBertOnlyNSPHead(config, name="seq_relationship___cls")
 
-    @add_start_docstrings_to_callable(MOBILEBERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(MOBILEBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @replace_return_docstrings(output_type=TFNextSentencePredictorOutput, config_class=_CONFIG_FOR_DOC)
     def call(self, inputs, **kwargs):
         r"""
@@ -1143,7 +1156,7 @@ class TFMobileBertForSequenceClassification(TFMobileBertPreTrainedModel, TFSeque
             config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="classifier"
         )
 
-    @add_start_docstrings_to_callable(MOBILEBERT_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(MOBILEBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="google/mobilebert-uncased",
@@ -1226,7 +1239,7 @@ class TFMobileBertForQuestionAnswering(TFMobileBertPreTrainedModel, TFQuestionAn
             config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="qa_outputs"
         )
 
-    @add_start_docstrings_to_callable(MOBILEBERT_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(MOBILEBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="google/mobilebert-uncased",
@@ -1251,11 +1264,11 @@ class TFMobileBertForQuestionAnswering(TFMobileBertPreTrainedModel, TFQuestionAn
         r"""
         start_positions (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for position (index) of the start of the labelled span for computing the token classification loss.
-            Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
         end_positions (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for position (index) of the end of the labelled span for computing the token classification loss.
-            Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
         """
         return_dict = return_dict if return_dict is not None else self.mobilebert.return_dict
@@ -1331,7 +1344,7 @@ class TFMobileBertForMultipleChoice(TFMobileBertPreTrainedModel, TFMultipleChoic
         """
         return {"input_ids": tf.constant(MULTIPLE_CHOICE_DUMMY_INPUTS)}
 
-    @add_start_docstrings_to_callable(MOBILEBERT_INPUTS_DOCSTRING.format("(batch_size, num_choices, sequence_length)"))
+    @add_start_docstrings_to_callable(MOBILEBERT_INPUTS_DOCSTRING.format("batch_size, num_choices, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="google/mobilebert-uncased",
@@ -1355,8 +1368,8 @@ class TFMobileBertForMultipleChoice(TFMobileBertPreTrainedModel, TFMultipleChoic
         r"""
         labels (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for computing the multiple choice classification loss.
-            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension
-            of the input tensors. (see `input_ids` above)
+            Indices should be in ``[0, ..., num_choices]`` where :obj:`num_choices` is the size of the second dimension
+            of the input tensors. (See :obj:`input_ids` above)
         """
         if isinstance(inputs, (tuple, list)):
             input_ids = inputs[0]
@@ -1449,7 +1462,7 @@ class TFMobileBertForTokenClassification(TFMobileBertPreTrainedModel, TFTokenCla
             config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="classifier"
         )
 
-    @add_start_docstrings_to_callable(MOBILEBERT_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(MOBILEBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="google/mobilebert-uncased",
diff --git a/src/transformers/modeling_tf_openai.py b/src/transformers/modeling_tf_openai.py
index 025d021978..d76d276226 100644
--- a/src/transformers/modeling_tf_openai.py
+++ b/src/transformers/modeling_tf_openai.py
@@ -395,23 +395,32 @@ class TFOpenAIGPTDoubleHeadsModelOutput(ModelOutput):
 
 OPENAI_GPT_START_DOCSTRING = r"""
 
+    This model inherits from :class:`~transformers.TFPreTrainedModel`. Check the superclass documentation for the
+    generic methods the library implements for all its model (such as downloading or saving, resizing the input
+    embeddings, pruning heads etc.)
+
+    This model is also a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ subclass.
+    Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general
+    usage and behavior.
+
     .. note::
+
         TF 2.0 models accepts two formats as inputs:
 
-            - having all inputs as keyword arguments (like PyTorch models), or
-            - having all inputs as a list, tuple or dict in the first positional arguments.
+        - having all inputs as keyword arguments (like PyTorch models), or
+        - having all inputs as a list, tuple or dict in the first positional arguments.
 
-        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having
+        This second option is useful when using :meth:`tf.keras.Model.fit` method which currently requires having
         all the tensors in the first argument of the model call function: :obj:`model(inputs)`.
 
         If you choose this second option, there are three possibilities you can use to gather all the input Tensors
         in the first positional argument :
 
-        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`
+        - a single Tensor with :obj:`input_ids` only and nothing else: :obj:`model(inputs_ids)`
         - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:
           :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`
         - a dictionary with one or several input Tensors associated to the input names given in the docstring:
-          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`
+          :obj:`model({"input_ids": input_ids, "token_type_ids": token_type_ids})`
 
 
     Parameters:
@@ -425,46 +434,54 @@ OPENAI_GPT_INPUTS_DOCSTRING = r"""
         input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`):
             Indices of input sequence tokens in the vocabulary.
 
-            Indices can be obtained using :class:`transformers.GPT2Tokenizer`.
-            See :func:`transformers.PreTrainedTokenizer.encode` and
-            :func:`transformers.PreTrainedTokenizer.__call__` for details.
+            Indices can be obtained using :class:`~transformers.OpenAIGPTTokenizer`.
+            See :func:`transformers.PreTrainedTokenizer.__call__` and
+            :func:`transformers.PreTrainedTokenizer.encode` for details.
 
             `What are input IDs? <../glossary.html#input-ids>`__
         attention_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`):
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
 
             `What are attention masks? <../glossary.html#attention-mask>`__
         token_type_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`):
             Segment token indices to indicate first and second portions of the inputs.
-            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
-            corresponds to a `sentence B` token
+            Indices are selected in ``[0, 1]``:
 
-            `What are token type IDs? <../glossary.html#token-type-ids>`_
+            - 0 corresponds to a `sentence A` token,
+            - 1 corresponds to a `sentence B` token.
+
+            `What are token type IDs? <../glossary.html#token-type-ids>`__
         position_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`):
             Indices of positions of each input sequence tokens in the position embeddings.
             Selected in the range ``[0, config.max_position_embeddings - 1]``.
 
-            `What are position IDs? <../glossary.html#position-ids>`_
+            `What are position IDs? <../glossary.html#position-ids>`__
         head_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
-            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
+
+            - 1 indicates the head is **not masked**,
+            - 0 indicates the head is **masked**.
+
         inputs_embeds (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
             Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
-            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
-        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):
-            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them
-            (if set to :obj:`False`) for evaluation.
+            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
+            vectors than the model's internal embedding lookup matrix.
         output_attentions (:obj:`bool`, `optional`):
-            If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
+            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
+            tensors for more detail.
         output_hidden_states (:obj:`bool`, `optional`):
-            If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
+            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
+            more detail.
         return_dict (:obj:`bool`, `optional`):
-            If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
-            plain tuple.
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
+        training (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether or not to use the model in training mode (some modules like dropout modules have different
+            behaviors between training and evaluation).
 """
 
 
@@ -608,9 +625,9 @@ class TFOpenAIGPTDoubleHeadsModel(TFOpenAIGPTPreTrainedModel):
         training=False,
     ):
         r"""
-            mc_token_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, num_choices)`, `optional`, default to index of the last token of the input)
-                Index of the classification token in each input sequence.
-                Selected in the range ``[0, input_ids.size(-1) - 1]``.
+        mc_token_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, num_choices)`, `optional`, default to index of the last token of the input)
+            Index of the classification token in each input sequence.
+            Selected in the range ``[0, input_ids.size(-1) - 1]``.
 
         Return:
 
diff --git a/src/transformers/modeling_tf_outputs.py b/src/transformers/modeling_tf_outputs.py
index d0914b6ddf..cbe5157d95 100644
--- a/src/transformers/modeling_tf_outputs.py
+++ b/src/transformers/modeling_tf_outputs.py
@@ -77,7 +77,8 @@ class TFBaseModelOutputWithPast(ModelOutput):
         last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):
             Sequence of hidden-states at the output of the last layer of the model.
 
-            If `past_key_values` is used only the last hidden-state of the sequences of shape :obj:`(batch_size, 1, hidden_size)` is output.
+            If :obj:`past_key_values` is used only the last hidden-state of the sequences of shape
+            :obj:`(batch_size, 1, hidden_size)` is output.
         past_key_values (:obj:`List[tf.Tensor]`, `optional`, returned when ``use_cache=True`` is passed or when ``config.use_cache=True``):
             List of :obj:`tf.Tensor` of length :obj:`config.n_layers`,  with each tensor of shape
             :obj:`(2, batch_size, num_heads, sequence_length, embed_size_per_head)`).
@@ -476,9 +477,9 @@ class TFQuestionAnsweringModelOutput(ModelOutput):
     Args:
         loss (:obj:`tf.Tensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):
             Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.
-        start_logits (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):
+        start_logits (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`):
             Span-start scores (before SoftMax).
-        end_logits (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):
+        end_logits (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`):
             Span-end scores (before SoftMax).
         hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
             Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
@@ -508,9 +509,9 @@ class TFSeq2SeqQuestionAnsweringModelOutput(ModelOutput):
     Args:
         loss (:obj:`tf.Tensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):
             Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.
-        start_logits (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):
+        start_logits (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`):
             Span-start scores (before SoftMax).
-        end_logits (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):
+        end_logits (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`):
             Span-end scores (before SoftMax).
         past_key_values (:obj:`List[tf.Tensor]`, `optional`, returned when ``use_cache=True`` is passed or when ``config.use_cache=True``):
             List of :obj:`tf.Tensor` of length :obj:`config.n_layers`,  with each tensor of shape
diff --git a/src/transformers/modeling_tf_roberta.py b/src/transformers/modeling_tf_roberta.py
index e0f35aa88c..92a38f5927 100644
--- a/src/transformers/modeling_tf_roberta.py
+++ b/src/transformers/modeling_tf_roberta.py
@@ -130,28 +130,33 @@ class TFRobertaPreTrainedModel(TFPreTrainedModel):
 
 
 ROBERTA_START_DOCSTRING = r"""
-    This model is a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ sub-class.
-    Use it as a regular TF 2.0 Keras Model and
-    refer to the TF 2.0 documentation for all matter related to general usage and behavior.
+
+    This model inherits from :class:`~transformers.TFPreTrainedModel`. Check the superclass documentation for the
+    generic methods the library implements for all its model (such as downloading or saving, resizing the input
+    embeddings, pruning heads etc.)
+
+    This model is also a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ subclass.
+    Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general
+    usage and behavior.
 
     .. note::
 
         TF 2.0 models accepts two formats as inputs:
 
-            - having all inputs as keyword arguments (like PyTorch models), or
-            - having all inputs as a list, tuple or dict in the first positional arguments.
+        - having all inputs as keyword arguments (like PyTorch models), or
+        - having all inputs as a list, tuple or dict in the first positional arguments.
 
-        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having
+        This second option is useful when using :meth:`tf.keras.Model.fit` method which currently requires having
         all the tensors in the first argument of the model call function: :obj:`model(inputs)`.
 
         If you choose this second option, there are three possibilities you can use to gather all the input Tensors
         in the first positional argument :
 
-        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`
+        - a single Tensor with :obj:`input_ids` only and nothing else: :obj:`model(inputs_ids)`
         - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:
           :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`
         - a dictionary with one or several input Tensors associated to the input names given in the docstring:
-          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`
+          :obj:`model({"input_ids": input_ids, "token_type_ids": token_type_ids})`
 
     Parameters:
         config (:class:`~transformers.RobertaConfig`): Model configuration class with all the parameters of the
@@ -161,27 +166,31 @@ ROBERTA_START_DOCSTRING = r"""
 
 ROBERTA_INPUTS_DOCSTRING = r"""
     Args:
-        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`):
+        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`):
             Indices of input sequence tokens in the vocabulary.
 
-            Indices can be obtained using :class:`transformers.RobertaTokenizer`.
-            See :func:`transformers.PreTrainedTokenizer.encode` and
-            :func:`transformers.PreTrainedTokenizer.__call__` for details.
+            Indices can be obtained using :class:`~transformers.RobertaTokenizer`.
+            See :func:`transformers.PreTrainedTokenizer.__call__` and
+            :func:`transformers.PreTrainedTokenizer.encode` for details.
 
             `What are input IDs? <../glossary.html#input-ids>`__
-        attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+        attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`, `optional`):
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
 
             `What are attention masks? <../glossary.html#attention-mask>`__
-        token_type_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+        token_type_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`, `optional`):
             Segment token indices to indicate first and second portions of the inputs.
-            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
-            corresponds to a `sentence B` token
+            Indices are selected in ``[0, 1]``:
+
+            - 0 corresponds to a `sentence A` token,
+            - 1 corresponds to a `sentence B` token.
 
             `What are token type IDs? <../glossary.html#token-type-ids>`__
-        position_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+        position_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`, `optional`):
             Indices of positions of each input sequence tokens in the position embeddings.
             Selected in the range ``[0, config.max_position_embeddings - 1]``.
 
@@ -189,21 +198,25 @@ ROBERTA_INPUTS_DOCSTRING = r"""
         head_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
-            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
-        inputs_embeds (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, embedding_dim)`, `optional`):
+
+            - 1 indicates the head is **not masked**,
+            - 0 indicates the head is **masked**.
+
+        inputs_embeds (:obj:`tf.Tensor` of shape :obj:`({0}, hidden_size)`, `optional`):
             Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
-            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
-        training (:obj:`boolean`, `optional`, defaults to :obj:`False`):
-            Whether to activate dropout modules (if set to :obj:`True`) during training or to de-activate them
-            (if set to :obj:`False`) for evaluation.
+            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
+            vectors than the model's internal embedding lookup matrix.
         output_attentions (:obj:`bool`, `optional`):
-            If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
+            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
+            tensors for more detail.
         output_hidden_states (:obj:`bool`, `optional`):
-            If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
+            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
+            more detail.
         return_dict (:obj:`bool`, `optional`):
-            If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
-            plain tuple.
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
+        training (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether or not to use the model in training mode (some modules like dropout modules have different
+            behaviors between training and evaluation).
 """
 
 
@@ -216,7 +229,7 @@ class TFRobertaModel(TFRobertaPreTrainedModel):
         super().__init__(config, *inputs, **kwargs)
         self.roberta = TFRobertaMainLayer(config, name="roberta")
 
-    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="roberta-base",
@@ -270,7 +283,7 @@ class TFRobertaForMaskedLM(TFRobertaPreTrainedModel, TFMaskedLanguageModelingLos
     def get_output_embeddings(self):
         return self.lm_head.decoder
 
-    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="roberta-base",
@@ -376,7 +389,7 @@ class TFRobertaForSequenceClassification(TFRobertaPreTrainedModel, TFSequenceCla
         self.roberta = TFRobertaMainLayer(config, name="roberta")
         self.classifier = TFRobertaClassificationHead(config, name="classifier")
 
-    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="roberta-base",
@@ -466,7 +479,7 @@ class TFRobertaForMultipleChoice(TFRobertaPreTrainedModel, TFMultipleChoiceLoss)
         """
         return {"input_ids": tf.constant(MULTIPLE_CHOICE_DUMMY_INPUTS)}
 
-    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING.format("batch_size, num_choices, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="roberta-base",
@@ -490,8 +503,8 @@ class TFRobertaForMultipleChoice(TFRobertaPreTrainedModel, TFMultipleChoiceLoss)
         r"""
         labels (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for computing the multiple choice classification loss.
-            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension
-            of the input tensors. (see `input_ids` above)
+            Indices should be in ``[0, ..., num_choices]`` where :obj:`num_choices` is the size of the second dimension
+            of the input tensors. (See :obj:`input_ids` above)
         """
         if isinstance(inputs, (tuple, list)):
             input_ids = inputs[0]
@@ -579,7 +592,7 @@ class TFRobertaForTokenClassification(TFRobertaPreTrainedModel, TFTokenClassific
             config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="classifier"
         )
 
-    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="roberta-base",
@@ -659,7 +672,7 @@ class TFRobertaForQuestionAnswering(TFRobertaPreTrainedModel, TFQuestionAnswerin
             config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="qa_outputs"
         )
 
-    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(ROBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="roberta-base",
@@ -684,11 +697,11 @@ class TFRobertaForQuestionAnswering(TFRobertaPreTrainedModel, TFQuestionAnswerin
         r"""
         start_positions (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for position (index) of the start of the labelled span for computing the token classification loss.
-            Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
         end_positions (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for position (index) of the end of the labelled span for computing the token classification loss.
-            Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
         """
         return_dict = return_dict if return_dict is not None else self.roberta.return_dict
diff --git a/src/transformers/modeling_tf_t5.py b/src/transformers/modeling_tf_t5.py
index 0ee0bbc038..dd23bd3cb1 100644
--- a/src/transformers/modeling_tf_t5.py
+++ b/src/transformers/modeling_tf_t5.py
@@ -846,30 +846,38 @@ class TFT5PreTrainedModel(TFPreTrainedModel):
 
 
 T5_START_DOCSTRING = r"""
+
     The T5 model was proposed in `Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
     <https://arxiv.org/abs/1910.10683>`__ by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang,
     Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu.
     It's an encoder decoder transformer pre-trained in a text-to-text denoising generative setting.
 
-    This model is a `tf.keras.Model <https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/Model>`__
-    sub-class. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to
-    general usage and behavior.
+    This model inherits from :class:`~transformers.TFPreTrainedModel`. Check the superclass documentation for the
+    generic methods the library implements for all its model (such as downloading or saving, resizing the input
+    embeddings, pruning heads etc.)
+
+    This model is also a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ subclass.
+    Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general
+    usage and behavior.
+
+    .. note::
 
-    Note on the model inputs:
         TF 2.0 models accepts two formats as inputs:
 
-            - having all inputs as keyword arguments (like PyTorch models), or
-            - having all inputs as a list, tuple or dict in the first positional arguments.
+        - having all inputs as keyword arguments (like PyTorch models), or
+        - having all inputs as a list, tuple or dict in the first positional arguments.
 
-        This second option is usefull when using `tf.keras.Model.fit()` method which currently requires having all the tensors in the first argument of the model call function: `model(inputs)`.
+        This second option is useful when using :meth:`tf.keras.Model.fit` method which currently requires having
+        all the tensors in the first argument of the model call function: :obj:`model(inputs)`.
 
-        If you choose this second option, there are three possibilities you can use to gather all the input Tensors in the first positional argument :
+        If you choose this second option, there are three possibilities you can use to gather all the input Tensors
+        in the first positional argument :
 
-        - a single Tensor with inputs only and nothing else: `model(inputs_ids)`
+        - a single Tensor with :obj:`input_ids` only and nothing else: :obj:`model(inputs_ids)`
         - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:
-            `model([inputs, attention_mask])` or `model([inputs, attention_mask, token_type_ids])`
-        - a dictionary with one or several input Tensors associaed to the input names given in the docstring:
-            `model({'inputs': inputs, 'token_type_ids': token_type_ids})`
+          :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`
+        - a dictionary with one or several input Tensors associated to the input names given in the docstring:
+          :obj:`model({"input_ids": input_ids, "token_type_ids": token_type_ids})`
 
     Parameters:
         config (:class:`~transformers.T5Config`): Model configuration class with all the parameters of the model.
@@ -879,53 +887,82 @@ T5_START_DOCSTRING = r"""
 
 T5_INPUTS_DOCSTRING = r"""
     Args:
-        inputs are usually used as a `dict` (see T5 description above for more information) containing all the following.
-
         inputs (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`):
             Indices of input sequence tokens in the vocabulary.
             T5 is a model with relative position embeddings so you should be able to pad the inputs on
             the right or the left.
-            Indices can be obtained using :class:`transformers.T5Tokenizer`.
+
+            Indices can be obtained using :class:`~transformers.BertTokenizer`.
+            See :func:`transformers.PreTrainedTokenizer.__call__` and
+            :func:`transformers.PreTrainedTokenizer.encode` for details.
+
             To know more on how to prepare :obj:`inputs` for pre-training take a look at
             `T5 Training <./t5.html#training>`__.
-            See :func:`transformers.PreTrainedTokenizer.encode` and
-            :func:`transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details.
         decoder_input_ids (:obj:`tf.Tensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`):
-            Provide for sequence to sequence training. T5 uses the pad_token_id as the starting token for decoder_input_ids generation.
-            If `past_key_values` is used, optionally only the last `decoder_input_ids` have to be input (see `past_key_values`).
+            Provide for sequence to sequence training. T5 uses the :obj:`pad_token_id` as the starting token for
+            :obj:`decoder_input_ids` generation.
+            If :obj:`past_key_values` is used, optionally only the last :obj:`decoder_input_ids` have to be input (see
+            :obj:`past_key_values`).
+
+            To know more on how to prepare :obj:`decoder_input_ids` for pretraining take a look at
+            `T5 Training <./t5.html#training>`__. If :obj:`decoder_input_ids` and :obj:`decoder_inputs_embeds` are both
+            unset, :obj:`decoder_input_ids` takes the value of :obj:`input_ids`.
         attention_mask (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
+
+            `What are attention masks? <../glossary.html#attention-mask>`__
         encoder_outputs (:obj:`tuple(tuple(tf.FloatTensor)`, `optional`):
-            Tuple consists of (`last_hidden_state`, `optional`: `hidden_states`, `optional`: `attentions`)
-            `last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`) is a sequence of hidden-states at the output of the last layer of the encoder.
-            Used in the cross-attention of the decoder.
+            Tuple consists of (:obj:`last_hidden_state`, :obj:`optional`: `hidden_states`, :obj:`optional`: `attentions`)
+            :obj:`last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)` is a sequence of
+            hidden states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
         decoder_attention_mask (:obj:`tf.Tensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`):
-            Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.
+            Default behavior: generate a tensor that ignores pad tokens in :obj:`decoder_input_ids`. Causal mask will
+            also be used by default.
         past_key_values (:obj:`tuple(tuple(tf.Tensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
-            Contains pre-computed key and value hidden-states of the attention blocks.
-            Can be used to speed up decoding.
-            If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids`
+            ontains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
+
+            If :obj:`past_key_values` are used, the user can optionally input only the last :obj:`decoder_input_ids`
             (those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)`
+            instead of all :obj:`decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`.
         use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
-            If `use_cache` is True, `past_key_values` are returned and can be used to speed up decoding (see `past_key_values`).
+            If set to :obj:`True`, ``past_key_values`` key value states are returned and can be used to speed up
+            decoding (see ``past_key_values``).
         inputs_embeds (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
-            Optionally, instead of passing :obj:`inputs` you can choose to directly pass an embedded representation.
-            This is useful if you want more control over how to convert `inputs` indices into associated vectors
-            than the model's internal embedding lookup matrix.
+            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
+            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
+            vectors than the model's internal embedding lookup matrix.
         decoder_inputs_embeds (:obj:`tf.Tensor` of shape :obj:`(batch_size, target_sequence_length, hidden_size)`, `optional`):
-            Optionally, instead of passing :obj:`decoder_input_ids` you can choose to directly pass an embedded representation.
-            This is useful if you want more control over how to convert `decoder_input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
-            To know more on how to prepare :obj:`decoder_input_ids` for pre-training take a look at
-            `T5 Training <./t5.html#training>`__.
+            Optionally, instead of passing :obj:`decoder_input_ids` you can choose to directly pass an embedded
+            representation.
+            If :obj:`past_key_values` is used, optionally only the last :obj:`decoder_inputs_embeds` have to be input
+            (see :obj:`past_key_values`).
+            This is useful if you want more control over how to convert :obj:`decoder_input_ids` indices into
+            associated vectors than the model's internal embedding lookup matrix.
+
+            If :obj:`decoder_input_ids` and :obj:`decoder_inputs_embeds` are both
+            unset, :obj:`decoder_input_embeds` takes the value of :obj:`input_embeds`.
         head_mask: (:obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
-            ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.
+
+            - 1 indicates the head is **not masked**,
+            - 0 indicates the head is **masked**.
+
         output_attentions (:obj:`bool`, `optional`):
-            If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
+            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
+            tensors for more detail.
+        output_hidden_states (:obj:`bool`, `optional`):
+            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
+            more detail.
+        return_dict (:obj:`bool`, `optional`):
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
+        training (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether or not to use the model in training mode (some modules like dropout modules have different
+            behaviors between training and evaluation).
 """
 
 
@@ -1206,9 +1243,9 @@ class TFT5ForConditionalGeneration(TFT5PreTrainedModel, TFCausalLanguageModeling
         **kwargs,
     ):
         r"""
-            labels (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
-                Labels for computing the cross entropy classification loss.
-                Indices should be in ``[0, ..., config.vocab_size - 1]``.
+        labels (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+            Labels for computing the cross entropy classification loss.
+            Indices should be in ``[0, ..., config.vocab_size - 1]``.
 
         Returns:
 
diff --git a/src/transformers/modeling_tf_transfo_xl.py b/src/transformers/modeling_tf_transfo_xl.py
index 68c0e85fb3..616b645cec 100644
--- a/src/transformers/modeling_tf_transfo_xl.py
+++ b/src/transformers/modeling_tf_transfo_xl.py
@@ -665,8 +665,8 @@ class TFTransfoXLModelOutput(ModelOutput):
             Sequence of hidden-states at the output of the last layer of the model.
         mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):
             Contains pre-computed hidden-states (key and values in the attention blocks).
-            Can be used (see `mems` input) to speed up sequential decoding. The token ids which have their past given to this model
-            should not be passed as input ids as they have already been computed.
+            Can be used (see :obj:`mems` input) to speed up sequential decoding. The token ids which have their past
+            given to this model should not be passed as input ids as they have already been computed.
         hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
             Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
             of shape :obj:`(batch_size, sequence_length, hidden_size)`.
@@ -698,8 +698,8 @@ class TFTransfoXLLMHeadModelOutput(ModelOutput):
             Prediction scores of the language modeling head (scores for each vocabulary token after SoftMax).
         mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):
             Contains pre-computed hidden-states (key and values in the attention blocks).
-            Can be used (see `mems` input) to speed up sequential decoding. The token ids which have their past given to this model
-            should not be passed as input ids as they have already been computed.
+            Can be used (see :obj:`mems` input) to speed up sequential decoding. The token ids which have their past
+            given to this model should not be passed as input ids as they have already been computed.
         hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
             Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
             of shape :obj:`(batch_size, sequence_length, hidden_size)`.
@@ -721,24 +721,32 @@ class TFTransfoXLLMHeadModelOutput(ModelOutput):
 
 TRANSFO_XL_START_DOCSTRING = r"""
 
+    This model inherits from :class:`~transformers.TFPreTrainedModel`. Check the superclass documentation for the
+    generic methods the library implements for all its model (such as downloading or saving, resizing the input
+    embeddings, pruning heads etc.)
+
+    This model is also a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ subclass.
+    Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general
+    usage and behavior.
+
     .. note::
 
         TF 2.0 models accepts two formats as inputs:
 
-            - having all inputs as keyword arguments (like PyTorch models), or
-            - having all inputs as a list, tuple or dict in the first positional arguments.
+        - having all inputs as keyword arguments (like PyTorch models), or
+        - having all inputs as a list, tuple or dict in the first positional arguments.
 
-        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having
+        This second option is useful when using :meth:`tf.keras.Model.fit` method which currently requires having
         all the tensors in the first argument of the model call function: :obj:`model(inputs)`.
 
         If you choose this second option, there are three possibilities you can use to gather all the input Tensors
         in the first positional argument :
 
-        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`
+        - a single Tensor with :obj:`input_ids` only and nothing else: :obj:`model(inputs_ids)`
         - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:
           :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`
         - a dictionary with one or several input Tensors associated to the input names given in the docstring:
-          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`
+          :obj:`model({"input_ids": input_ids, "token_type_ids": token_type_ids})`
 
     Parameters:
         config (:class:`~transformers.TransfoXLConfig`): Model configuration class with all the parameters of the model.
@@ -751,30 +759,36 @@ TRANSFO_XL_INPUTS_DOCSTRING = r"""
         input_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`):
             Indices of input sequence tokens in the vocabulary.
 
-            Indices can be obtained using :class:`transformers.TransfoXLTokenizer`.
-            See :func:`transformers.PreTrainedTokenizer.encode` and
-            :func:`transformers.PreTrainedTokenizer.__call__` for details.
+            Indices can be obtained using :class:`~transformers.BertTokenizer`.
+            See :func:`transformers.PreTrainedTokenizer.__call__` and
+            :func:`transformers.PreTrainedTokenizer.encode` for details.
 
             `What are input IDs? <../glossary.html#input-ids>`__
         mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):
             Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
-            (see `mems` output below). Can be used to speed up sequential decoding. The token ids which have their mems
-            given to this model should not be passed as input ids as they have already been computed.
+            (see :obj:`mems` output below). Can be used to speed up sequential decoding. The token ids which have their
+            mems given to this model should not be passed as :obj:`input_ids` as they have already been computed.
         head_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
-            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
+
+            - 1 indicates the head is **not masked**,
+            - 0 indicates the head is **masked**.
         inputs_embeds (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
             Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
-            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
+            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
+            vectors than the model's internal embedding lookup matrix.
         output_attentions (:obj:`bool`, `optional`):
-            If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
+            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
+            tensors for more detail.
         output_hidden_states (:obj:`bool`, `optional`):
-            If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
+            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
+            more detail.
         return_dict (:obj:`bool`, `optional`):
-            If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
-            plain tuple.
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
+        training (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether or not to use the model in training mode (some modules like dropout modules have different
+            behaviors between training and evaluation).
 """
 
 
diff --git a/src/transformers/modeling_tf_xlm.py b/src/transformers/modeling_tf_xlm.py
index 2e1d10e6b8..d4a70bbd82 100644
--- a/src/transformers/modeling_tf_xlm.py
+++ b/src/transformers/modeling_tf_xlm.py
@@ -554,24 +554,32 @@ class TFXLMWithLMHeadModelOutput(ModelOutput):
 
 XLM_START_DOCSTRING = r"""
 
+    This model inherits from :class:`~transformers.TFPreTrainedModel`. Check the superclass documentation for the
+    generic methods the library implements for all its model (such as downloading or saving, resizing the input
+    embeddings, pruning heads etc.)
+
+    This model is also a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ subclass.
+    Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general
+    usage and behavior.
+
     .. note::
 
         TF 2.0 models accepts two formats as inputs:
 
-            - having all inputs as keyword arguments (like PyTorch models), or
-            - having all inputs as a list, tuple or dict in the first positional arguments.
+        - having all inputs as keyword arguments (like PyTorch models), or
+        - having all inputs as a list, tuple or dict in the first positional arguments.
 
-        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having
+        This second option is useful when using :meth:`tf.keras.Model.fit` method which currently requires having
         all the tensors in the first argument of the model call function: :obj:`model(inputs)`.
 
         If you choose this second option, there are three possibilities you can use to gather all the input Tensors
         in the first positional argument :
 
-        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`
+        - a single Tensor with :obj:`input_ids` only and nothing else: :obj:`model(inputs_ids)`
         - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:
           :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`
         - a dictionary with one or several input Tensors associated to the input names given in the docstring:
-          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`
+          :obj:`model({"input_ids": input_ids, "token_type_ids": token_type_ids})`
 
     Parameters:
         config (:class:`~transformers.XLMConfig`): Model configuration class with all the parameters of the model.
@@ -581,63 +589,77 @@ XLM_START_DOCSTRING = r"""
 
 XLM_INPUTS_DOCSTRING = r"""
     Args:
-        input_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`):
+        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`):
             Indices of input sequence tokens in the vocabulary.
 
-            Indices can be obtained using :class:`transformers.BertTokenizer`.
-            See :func:`transformers.PreTrainedTokenizer.encode` and
-            :func:`transformers.PreTrainedTokenizer.__call__` for details.
+            Indices can be obtained using :class:`~transformers.BertTokenizer`.
+            See :func:`transformers.PreTrainedTokenizer.__call__` and
+            :func:`transformers.PreTrainedTokenizer.encode` for details.
 
             `What are input IDs? <../glossary.html#input-ids>`__
-        attention_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+        attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`, `optional`):
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
 
             `What are attention masks? <../glossary.html#attention-mask>`__
-        langs (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+        langs (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`({0})`, `optional`):
             A parallel sequence of tokens to be used to indicate the language of each token in the input.
             Indices are languages ids which can be obtained from the language names by using two conversion mappings
             provided in the configuration of the model (only provided for multilingual models).
-            More precisely, the `language name -> language id` mapping is in `model.config.lang2id` (dict str -> int) and
-            the `language id -> language name` mapping is `model.config.id2lang` (dict int -> str).
+            More precisely, the `language name to language id` mapping is in :obj:`model.config.lang2id` (which is a
+            dictionary strring to int) and the `language id to language name` mapping is in :obj:`model.config.id2lang`
+            (dictionary int to string).
 
-            See usage examples detailed in the `multilingual documentation <https://huggingface.co/transformers/multilingual.html>`__.
-        token_type_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+            See usage examples detailed in the :doc:`multilingual documentation <../multilingual>`.
+        ttoken_type_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`, `optional`):
             Segment token indices to indicate first and second portions of the inputs.
-            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
-            corresponds to a `sentence B` token
+            Indices are selected in ``[0, 1]``:
 
-            `What are token type IDs? <../glossary.html#token-type-ids>`_
-        position_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+            - 0 corresponds to a `sentence A` token,
+            - 1 corresponds to a `sentence B` token.
+
+            `What are token type IDs? <../glossary.html#token-type-ids>`__
+        position_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`, `optional`):
             Indices of positions of each input sequence tokens in the position embeddings.
             Selected in the range ``[0, config.max_position_embeddings - 1]``.
 
-            `What are position IDs? <../glossary.html#position-ids>`_
+            `What are position IDs? <../glossary.html#position-ids>`__
         lengths (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size,)`, `optional`):
             Length of each sentence that can be used to avoid performing attention on padding token indices.
             You can also use `attention_mask` for the same result (see above), kept here for compatbility.
-            Indices selected in ``[0, ..., input_ids.size(-1)]``:
+            Indices selected in ``[0, ..., input_ids.size(-1)]``.
         cache (:obj:`Dict[str, tf.Tensor]`, `optional`):
-            dictionary with ``tf.Tensor`` that contains pre-computed
-            hidden-states (key and values in the attention blocks) as computed by the model
-            (see `cache` output below). Can be used to speed up sequential decoding.
-            The dictionary object will be modified in-place during the forward pass to add newly computed hidden-states.
-        head_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
+            Dictionary string to ``torch.FloatTensor`` that contains precomputed hidden states (key and values in the
+            attention blocks) as computed by the model (see :obj:`cache` output below). Can be used to speed up
+            sequential decoding.
+
+            The dictionary object will be modified in-place during the forward pass to add newly computed
+            hidden-states.
+        head_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
-            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
-        inputs_embeds (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
+
+            - 1 indicates the head is **not masked**,
+            - 0 indicates the head is **masked**.
+
+        inputs_embeds (:obj:`tf.Tensor` of shape :obj:`({0}, hidden_size)`, `optional`):
             Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
-            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
+            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
+            vectors than the model's internal embedding lookup matrix.
         output_attentions (:obj:`bool`, `optional`):
-            If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
+            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
+            tensors for more detail.
         output_hidden_states (:obj:`bool`, `optional`):
-            If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
+            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
+            more detail.
         return_dict (:obj:`bool`, `optional`):
-            If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
-            plain tuple.
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
+        training (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether or not to use the model in training mode (some modules like dropout modules have different
+            behaviors between training and evaluation).
 """
 
 
@@ -650,7 +672,7 @@ class TFXLMModel(TFXLMPreTrainedModel):
         super().__init__(config, *inputs, **kwargs)
         self.transformer = TFXLMMainLayer(config, name="transformer")
 
-    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="xlm-mlm-en-2048",
@@ -723,7 +745,7 @@ class TFXLMWithLMHeadModel(TFXLMPreTrainedModel):
             langs = None
         return {"inputs": inputs, "langs": langs}
 
-    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="xlm-mlm-en-2048",
@@ -759,7 +781,7 @@ class TFXLMForSequenceClassification(TFXLMPreTrainedModel, TFSequenceClassificat
         self.transformer = TFXLMMainLayer(config, name="transformer")
         self.sequence_summary = TFSequenceSummary(config, initializer_range=config.init_std, name="sequence_summary")
 
-    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="xlm-mlm-en-2048",
@@ -858,7 +880,7 @@ class TFXLMForMultipleChoice(TFXLMPreTrainedModel, TFMultipleChoiceLoss):
             "langs": tf.constant(MULTIPLE_CHOICE_DUMMY_INPUTS),
         }
 
-    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING.format("batch_size, num_choices, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="xlm-mlm-en-2048",
@@ -885,8 +907,8 @@ class TFXLMForMultipleChoice(TFXLMPreTrainedModel, TFMultipleChoiceLoss):
         r"""
         labels (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for computing the multiple choice classification loss.
-            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension
-            of the input tensors. (see `input_ids` above)
+            Indices should be in ``[0, ..., num_choices]`` where :obj:`num_choices` is the size of the second dimension
+            of the input tensors. (See :obj:`input_ids` above)
         """
         if isinstance(inputs, (tuple, list)):
             input_ids = inputs[0]
@@ -998,7 +1020,7 @@ class TFXLMForTokenClassification(TFXLMPreTrainedModel, TFTokenClassificationLos
             config.num_labels, kernel_initializer=get_initializer(config.init_std), name="classifier"
         )
 
-    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="xlm-mlm-en-2048",
@@ -1071,7 +1093,7 @@ class TFXLMForTokenClassification(TFXLMPreTrainedModel, TFTokenClassificationLos
 
 
 @add_start_docstrings(
-    """XLM Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of
+    """XLM Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of
     the hidden-states output to compute `span start logits` and `span end logits`). """,
     XLM_START_DOCSTRING,
 )
@@ -1083,7 +1105,7 @@ class TFXLMForQuestionAnsweringSimple(TFXLMPreTrainedModel, TFQuestionAnsweringL
             config.num_labels, kernel_initializer=get_initializer(config.init_std), name="qa_outputs"
         )
 
-    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="xlm-mlm-en-2048",
@@ -1111,11 +1133,11 @@ class TFXLMForQuestionAnsweringSimple(TFXLMPreTrainedModel, TFQuestionAnsweringL
         r"""
         start_positions (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for position (index) of the start of the labelled span for computing the token classification loss.
-            Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
         end_positions (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for position (index) of the end of the labelled span for computing the token classification loss.
-            Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
         """
         return_dict = return_dict if return_dict is not None else self.transformer.return_dict
diff --git a/src/transformers/modeling_tf_xlm_roberta.py b/src/transformers/modeling_tf_xlm_roberta.py
index 5ea919b3e4..e7ae781e31 100644
--- a/src/transformers/modeling_tf_xlm_roberta.py
+++ b/src/transformers/modeling_tf_xlm_roberta.py
@@ -37,24 +37,32 @@ TF_XLM_ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST = [
 
 XLM_ROBERTA_START_DOCSTRING = r"""
 
+    This model inherits from :class:`~transformers.TFPreTrainedModel`. Check the superclass documentation for the
+    generic methods the library implements for all its model (such as downloading or saving, resizing the input
+    embeddings, pruning heads etc.)
+
+    This model is also a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ subclass.
+    Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general
+    usage and behavior.
+
     .. note::
 
         TF 2.0 models accepts two formats as inputs:
 
-            - having all inputs as keyword arguments (like PyTorch models), or
-            - having all inputs as a list, tuple or dict in the first positional arguments.
+        - having all inputs as keyword arguments (like PyTorch models), or
+        - having all inputs as a list, tuple or dict in the first positional arguments.
 
-        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having
+        This second option is useful when using :meth:`tf.keras.Model.fit` method which currently requires having
         all the tensors in the first argument of the model call function: :obj:`model(inputs)`.
 
         If you choose this second option, there are three possibilities you can use to gather all the input Tensors
         in the first positional argument :
 
-        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`
+        - a single Tensor with :obj:`input_ids` only and nothing else: :obj:`model(inputs_ids)`
         - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:
           :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`
         - a dictionary with one or several input Tensors associated to the input names given in the docstring:
-          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`
+          :obj:`model({"input_ids": input_ids, "token_type_ids": token_type_ids})`
 
     Parameters:
         config (:class:`~transformers.XLMRobertaConfig`): Model configuration class with all the parameters of the
diff --git a/src/transformers/modeling_tf_xlnet.py b/src/transformers/modeling_tf_xlnet.py
index 00ece9aaaa..46bfe5e1e6 100644
--- a/src/transformers/modeling_tf_xlnet.py
+++ b/src/transformers/modeling_tf_xlnet.py
@@ -807,9 +807,9 @@ class TFXLNetModelOutput(ModelOutput):
             ``num_predict`` corresponds to ``target_mapping.shape[1]``. If ``target_mapping`` is ``None``, then
             ``num_predict`` corresponds to ``sequence_length``.
         mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):
-            Contains pre-computed hidden-states.
-            Can be used (see `mems` input) to speed up sequential decoding. The token ids which have their past given to this model
-            should not be passed as input ids as they have already been computed.
+            Contains pre-computed hidden-states. Can be used (see :obj:`mems` input) to speed up sequential decoding.
+            The token ids which have their past given to this model should not be passed as :obj:`input_ids` as they
+            have already been computed.
         hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
             Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
             of shape :obj:`(batch_size, sequence_length, hidden_size)`.
@@ -843,9 +843,9 @@ class TFXLNetLMHeadModelOutput(ModelOutput):
             ``num_predict`` corresponds to ``target_mapping.shape[1]``. If ``target_mapping`` is ``None``, then
             ``num_predict`` corresponds to ``sequence_length``.
         mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):
-            Contains pre-computed hidden-states.
-            Can be used (see `mems` input) to speed up sequential decoding. The token ids which have their past given to this model
-            should not be passed as input ids as they have already been computed.
+            Contains pre-computed hidden-states. Can be used (see :obj:`mems` input) to speed up sequential decoding.
+            The token ids which have their past given to this model should not be passed as :obj:`input_ids` as they
+            have already been computed.
         hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
             Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
             of shape :obj:`(batch_size, sequence_length, hidden_size)`.
@@ -877,9 +877,9 @@ class TFXLNetForSequenceClassificationOutput(ModelOutput):
         logits (:obj:`tf.Tensor` of shape :obj:`(batch_size, config.num_labels)`):
             Classification (or regression if config.num_labels==1) scores (before SoftMax).
         mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):
-            Contains pre-computed hidden-states.
-            Can be used (see `mems` input) to speed up sequential decoding. The token ids which have their past given to this model
-            should not be passed as input ids as they have already been computed.
+            Contains pre-computed hidden-states. Can be used (see :obj:`mems` input) to speed up sequential decoding.
+            The token ids which have their past given to this model should not be passed as :obj:`input_ids` as they
+            have already been computed.
         hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
             Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
             of shape :obj:`(batch_size, sequence_length, hidden_size)`.
@@ -911,9 +911,9 @@ class TFXLNetForTokenClassificationOutput(ModelOutput):
         logits (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`):
             Classification scores (before SoftMax).
         mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):
-            Contains pre-computed hidden-states.
-            Can be used (see `mems` input) to speed up sequential decoding. The token ids which have their past given to this model
-            should not be passed as input ids as they have already been computed.
+            Contains pre-computed hidden-states. Can be used (see :obj:`mems` input) to speed up sequential decoding.
+            The token ids which have their past given to this model should not be passed as :obj:`input_ids` as they
+            have already been computed.
         hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
             Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
             of shape :obj:`(batch_size, sequence_length, hidden_size)`.
@@ -947,9 +947,9 @@ class TFXLNetForMultipleChoiceOutput(ModelOutput):
 
             Classification scores (before SoftMax).
         mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):
-            Contains pre-computed hidden-states.
-            Can be used (see `mems` input) to speed up sequential decoding. The token ids which have their past given to this model
-            should not be passed as input ids as they have already been computed.
+            Contains pre-computed hidden-states. Can be used (see :obj:`mems` input) to speed up sequential decoding.
+            The token ids which have their past given to this model should not be passed as :obj:`input_ids` as they
+            have already been computed.
         hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
             Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
             of shape :obj:`(batch_size, sequence_length, hidden_size)`.
@@ -983,9 +983,9 @@ class TFXLNetForQuestionAnsweringSimpleOutput(ModelOutput):
         end_logits (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length,)`):
             Span-end scores (before SoftMax).
         mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):
-            Contains pre-computed hidden-states.
-            Can be used (see `mems` input) to speed up sequential decoding. The token ids which have their past given to this model
-            should not be passed as input ids as they have already been computed.
+            Contains pre-computed hidden-states. Can be used (see :obj:`mems` input) to speed up sequential decoding.
+            The token ids which have their past given to this model should not be passed as :obj:`input_ids` as they
+            have already been computed.
         hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
             Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
             of shape :obj:`(batch_size, sequence_length, hidden_size)`.
@@ -1009,24 +1009,32 @@ class TFXLNetForQuestionAnsweringSimpleOutput(ModelOutput):
 
 XLNET_START_DOCSTRING = r"""
 
+    This model inherits from :class:`~transformers.TFPreTrainedModel`. Check the superclass documentation for the
+    generic methods the library implements for all its model (such as downloading or saving, resizing the input
+    embeddings, pruning heads etc.)
+
+    This model is also a `tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ subclass.
+    Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general
+    usage and behavior.
+
     .. note::
 
         TF 2.0 models accepts two formats as inputs:
 
-            - having all inputs as keyword arguments (like PyTorch models), or
-            - having all inputs as a list, tuple or dict in the first positional arguments.
+        - having all inputs as keyword arguments (like PyTorch models), or
+        - having all inputs as a list, tuple or dict in the first positional arguments.
 
-        This second option is useful when using :obj:`tf.keras.Model.fit()` method which currently requires having
+        This second option is useful when using :meth:`tf.keras.Model.fit` method which currently requires having
         all the tensors in the first argument of the model call function: :obj:`model(inputs)`.
 
         If you choose this second option, there are three possibilities you can use to gather all the input Tensors
         in the first positional argument :
 
-        - a single Tensor with input_ids only and nothing else: :obj:`model(inputs_ids)`
+        - a single Tensor with :obj:`input_ids` only and nothing else: :obj:`model(inputs_ids)`
         - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:
           :obj:`model([input_ids, attention_mask])` or :obj:`model([input_ids, attention_mask, token_type_ids])`
         - a dictionary with one or several input Tensors associated to the input names given in the docstring:
-          :obj:`model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`
+          :obj:`model({"input_ids": input_ids, "token_type_ids": token_type_ids})`
 
     Parameters:
         config (:class:`~transformers.XLNetConfig`): Model configuration class with all the parameters of the model.
@@ -1036,64 +1044,80 @@ XLNET_START_DOCSTRING = r"""
 
 XLNET_INPUTS_DOCSTRING = r"""
     Args:
-        input_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`):
+        input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`):
             Indices of input sequence tokens in the vocabulary.
 
-            Indices can be obtained using :class:`transformers.XLNetTokenizer`.
-            See :func:`transformers.PreTrainedTokenizer.encode` and
-            :func:`transformers.PreTrainedTokenizer.__call__` for details.
+            Indices can be obtained using :class:`~transformers.BertTokenizer`.
+            See :func:`transformers.PreTrainedTokenizer.__call__` and
+            :func:`transformers.PreTrainedTokenizer.encode` for details.
 
             `What are input IDs? <../glossary.html#input-ids>`__
-        attention_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+        attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`, `optional`):
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
 
             `What are attention masks? <../glossary.html#attention-mask>`__
-        mems (:obj:`List[tf.Tensor]` of length :obj:`config.n_layers`):
-            Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
-            (see `mems` output below). Can be used to speed up sequential decoding. The token ids which have their mems
-            given to this model should not be passed as input ids as they have already been computed.
+        mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):
+            Contains pre-computed hidden-states (see :obj:`mems` output below) . Can be used to speed up sequential
+            decoding. The token ids which have their past given to this model should not be passed as
+            :obj:`input_ids` as they have already been computed.
+
+            :obj:`use_cache` has to be set to :obj:`True` to make use of :obj:`mems`.
         perm_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, sequence_length)`, `optional`):
             Mask to indicate the attention pattern for each input token with values selected in ``[0, 1]``:
-            If ``perm_mask[k, i, j] = 0``, i attend to j in batch k;
-            if ``perm_mask[k, i, j] = 1``, i does not attend to j in batch k.
-            If None, each token attends to all the others (full bidirectional attention).
+
+            - if ``perm_mask[k, i, j] = 0``, i attend to j in batch k;
+            - if ``perm_mask[k, i, j] = 1``, i does not attend to j in batch k.
+
+            If not set, each token attends to all the others (full bidirectional attention).
             Only used during pretraining (to define factorization order) or for sequential decoding (generation).
         target_mapping (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, num_predict, sequence_length)`, `optional`):
             Mask to indicate the output tokens to use.
             If ``target_mapping[k, i, j] = 1``, the i-th predict in batch k is on the j-th token.
-            Only used during pretraining for partial prediction or for sequential decoding (generation).
-        token_type_ids (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+        token_type_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`({0})`, `optional`):
             Segment token indices to indicate first and second portions of the inputs.
-            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
-            corresponds to a `sentence B` token
+            Indices are selected in ``[0, 1]``:
 
-            `What are token type IDs? <../glossary.html#token-type-ids>`_
-        input_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+            - 0 corresponds to a `sentence A` token,
+            - 1 corresponds to a `sentence B` token.
+
+            `What are token type IDs? <../glossary.html#token-type-ids>`__
+        input_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`({0})`, `optional`):
             Mask to avoid performing attention on padding token indices.
-            Negative of `attention_mask`, i.e. with 0 for real tokens and 1 for padding.
-            Kept for compatibility with the original code base.
-            You can only uses one of `input_mask` and `attention_mask`
+            Negative of :obj:`attention_mask`, i.e. with 0 for real tokens and 1 for padding which is kept for
+            compatibility with the original code base.
+
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are MASKED, ``0`` for tokens that are NOT MASKED.
-        head_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
+
+            - 1 for tokens that are **masked**,
+            - 0 for tokens that are **not maked**.
+
+            You can only uses one of :obj:`input_mask` and :obj:`attention_mask`.
+        head_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
-            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
-        inputs_embeds (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
+
+            - 1 indicates the head is **not masked**,
+            - 0 indicates the head is **masked**.
+
+        inputs_embeds (:obj:`tf.Tensor` of shape :obj:`({0}, hidden_size)`, `optional`):
             Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
-            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
-        use_cache (:obj:`bool`):
-            If `use_cache` is True, `mems` are returned and can be used to speed up decoding (see `mems`). Defaults to `True`.
+            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
+            vectors than the model's internal embedding lookup matrix.
         output_attentions (:obj:`bool`, `optional`):
-            If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
+            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
+            tensors for more detail.
         output_hidden_states (:obj:`bool`, `optional`):
-            If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
+            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
+            more detail.
         return_dict (:obj:`bool`, `optional`):
-            If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
-            plain tuple.
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
+        training (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether or not to use the model in training mode (some modules like dropout modules have different
+            behaviors between training and evaluation).
 """
 
 
@@ -1106,7 +1130,7 @@ class TFXLNetModel(TFXLNetPreTrainedModel):
         super().__init__(config, *inputs, **kwargs)
         self.transformer = TFXLNetMainLayer(config, name="transformer")
 
-    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="xlnet-base-cased",
@@ -1172,7 +1196,7 @@ class TFXLNetLMHeadModel(TFXLNetPreTrainedModel, TFCausalLanguageModelingLoss):
 
         return inputs
 
-    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @replace_return_docstrings(output_type=TFXLNetLMHeadModelOutput, config_class=_CONFIG_FOR_DOC)
     def call(
         self,
@@ -1193,9 +1217,9 @@ class TFXLNetLMHeadModel(TFXLNetPreTrainedModel, TFCausalLanguageModelingLoss):
         training=False,
     ):
         r"""
-            labels (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
-                Labels for computing the cross entropy classification loss.
-                Indices should be in ``[0, ..., config.vocab_size - 1]``.
+        labels (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+            Labels for computing the cross entropy classification loss.
+            Indices should be in ``[0, ..., config.vocab_size - 1]``.
 
         Return:
 
@@ -1287,7 +1311,7 @@ class TFXLNetForSequenceClassification(TFXLNetPreTrainedModel, TFSequenceClassif
             config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="logits_proj"
         )
 
-    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="xlnet-base-cased",
@@ -1388,7 +1412,7 @@ class TFXLNetForMultipleChoice(TFXLNetPreTrainedModel, TFMultipleChoiceLoss):
         """
         return {"input_ids": tf.constant(MULTIPLE_CHOICE_DUMMY_INPUTS)}
 
-    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format("batch_size, num_choices, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="xlnet-base-cased",
@@ -1416,8 +1440,8 @@ class TFXLNetForMultipleChoice(TFXLNetPreTrainedModel, TFMultipleChoiceLoss):
         r"""
         labels (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for computing the multiple choice classification loss.
-            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension
-            of the input tensors. (see `input_ids` above)
+            Indices should be in ``[0, ..., num_choices]`` where :obj:`num_choices` is the size of the second dimension
+            of the input tensors. (See :obj:`input_ids` above)
         """
         if isinstance(inputs, (tuple, list)):
             input_ids = inputs[0]
@@ -1521,7 +1545,7 @@ class TFXLNetForTokenClassification(TFXLNetPreTrainedModel, TFTokenClassificatio
             config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="classifier"
         )
 
-    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="xlnet-base-cased",
@@ -1606,7 +1630,7 @@ class TFXLNetForQuestionAnsweringSimple(TFXLNetPreTrainedModel, TFQuestionAnswer
             config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="qa_outputs"
         )
 
-    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="xlnet-base-cased",
@@ -1635,11 +1659,11 @@ class TFXLNetForQuestionAnsweringSimple(TFXLNetPreTrainedModel, TFQuestionAnswer
         r"""
         start_positions (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for position (index) of the start of the labelled span for computing the token classification loss.
-            Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
         end_positions (:obj:`tf.Tensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for position (index) of the end of the labelled span for computing the token classification loss.
-            Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
         """
         return_dict = return_dict if return_dict is not None else self.transformer.return_dict
diff --git a/src/transformers/modeling_transfo_xl.py b/src/transformers/modeling_transfo_xl.py
index 83c253be54..f63e8534de 100644
--- a/src/transformers/modeling_transfo_xl.py
+++ b/src/transformers/modeling_transfo_xl.py
@@ -603,8 +603,8 @@ class TransfoXLModelOutput(ModelOutput):
             Sequence of hidden-states at the output of the last layer of the model.
         mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):
             Contains pre-computed hidden-states (key and values in the attention blocks).
-            Can be used (see `mems` input) to speed up sequential decoding. The token ids which have their past given to this model
-            should not be passed as input ids as they have already been computed.
+            Can be used (see :obj:`mems` input) to speed up sequential decoding. The token ids which have their past
+            given to this model should not be passed as input ids as they have already been computed.
         hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
             Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
             of shape :obj:`(batch_size, sequence_length, hidden_size)`.
@@ -636,8 +636,8 @@ class TransfoXLLMHeadModelOutput(ModelOutput):
             Prediction scores of the language modeling head (scores for each vocabulary token after SoftMax).
         mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):
             Contains pre-computed hidden-states (key and values in the attention blocks).
-            Can be used (see `mems` input) to speed up sequential decoding. The token ids which have their past given to this model
-            should not be passed as input ids as they have already been computed.
+            Can be used (see :obj:`mems` input) to speed up sequential decoding. The token ids which have their past
+            given to this model should not be passed as input ids as they have already been computed.
         hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
             Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
             of shape :obj:`(batch_size, sequence_length, hidden_size)`.
@@ -669,7 +669,11 @@ class TransfoXLLMHeadModelOutput(ModelOutput):
 
 TRANSFO_XL_START_DOCSTRING = r"""
 
-    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.
+    This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
+    methods the library implements for all its model (such as downloading or saving, resizing the input embeddings,
+    pruning heads etc.)
+
+    This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
     Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
     usage and behavior.
 
@@ -684,30 +688,34 @@ TRANSFO_XL_INPUTS_DOCSTRING = r"""
         input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
             Indices of input sequence tokens in the vocabulary.
 
-            Indices can be obtained using :class:`transformers.TransfoXLTokenizer`.
-            See :func:`transformers.PreTrainedTokenizer.encode` and
-            :func:`transformers.PreTrainedTokenizer.__call__` for details.
+            Indices can be obtained using :class:`~transformers.TransfoXLTokenizer`.
+            See :meth:`transformers.PreTrainedTokenizer.encode` and
+            :meth:`transformers.PreTrainedTokenizer.__call__` for details.
 
             `What are input IDs? <../glossary.html#input-ids>`__
         mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):
             Contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
-            (see `mems` output below). Can be used to speed up sequential decoding. The token ids which have their mems
-            given to this model should not be passed as input ids as they have already been computed.
+            (see :obj:`mems` output below). Can be used to speed up sequential decoding. The token ids which have their
+            mems given to this model should not be passed as :obj:`input_ids` as they have already been computed.
         head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
-            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
+
+            - 1 indicates the head is **not masked**,
+            - 0 indicates the head is **masked**.
+
         inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
             Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
-            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
+            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
+            vectors than the model's internal embedding lookup matrix.
         output_attentions (:obj:`bool`, `optional`):
-            If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
+            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
+            tensors for more detail.
         output_hidden_states (:obj:`bool`, `optional`):
-            If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
+            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
+            more detail.
         return_dict (:obj:`bool`, `optional`):
-            If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
-            plain tuple.
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
 """
 
 
diff --git a/src/transformers/modeling_utils.py b/src/transformers/modeling_utils.py
index 74642696a0..65195335bf 100755
--- a/src/transformers/modeling_utils.py
+++ b/src/transformers/modeling_utils.py
@@ -442,7 +442,7 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin):
 
     def set_input_embeddings(self, value: nn.Module):
         """
-        Set model's input embeddings
+        Set model's input embeddings.
 
         Args:
             value (:obj:`nn.Module`): A module mapping vocabulary to hidden states.
diff --git a/src/transformers/modeling_xlm.py b/src/transformers/modeling_xlm.py
index fe6f0ff109..7ea7eb3173 100755
--- a/src/transformers/modeling_xlm.py
+++ b/src/transformers/modeling_xlm.py
@@ -271,7 +271,8 @@ class XLMForQuestionAnsweringOutput(ModelOutput):
 
     Args:
         loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned if both :obj:`start_positions` and :obj:`end_positions` are provided):
-            Classification loss as the sum of start token, end token (and is_impossible if provided) classification losses.
+            Classification loss as the sum of start token, end token (and is_impossible if provided) classification
+            losses.
         start_top_log_probs (``torch.FloatTensor`` of shape ``(batch_size, config.start_n_top)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):
             Log probabilities for the top config.start_n_top start token possibilities (beam-search).
         start_top_index (``torch.LongTensor`` of shape ``(batch_size, config.start_n_top)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):
@@ -307,7 +308,11 @@ class XLMForQuestionAnsweringOutput(ModelOutput):
 
 XLM_START_DOCSTRING = r"""
 
-    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.
+    This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
+    methods the library implements for all its model (such as downloading or saving, resizing the input embeddings,
+    pruning heads etc.)
+
+    This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
     Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
     usage and behavior.
 
@@ -319,63 +324,74 @@ XLM_START_DOCSTRING = r"""
 
 XLM_INPUTS_DOCSTRING = r"""
     Args:
-        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
+        input_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`):
             Indices of input sequence tokens in the vocabulary.
 
-            Indices can be obtained using :class:`transformers.BertTokenizer`.
-            See :func:`transformers.PreTrainedTokenizer.encode` and
-            :func:`transformers.PreTrainedTokenizer.__call__` for details.
+            Indices can be obtained using :class:`~transformers.XLMTokenizer`.
+            See :meth:`transformers.PreTrainedTokenizer.encode` and
+            :meth:`transformers.PreTrainedTokenizer.__call__` for details.
 
             `What are input IDs? <../glossary.html#input-ids>`__
-        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`):
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
 
             `What are attention masks? <../glossary.html#attention-mask>`__
-        langs (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+        langs (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
             A parallel sequence of tokens to be used to indicate the language of each token in the input.
             Indices are languages ids which can be obtained from the language names by using two conversion mappings
             provided in the configuration of the model (only provided for multilingual models).
-            More precisely, the `language name -> language id` mapping is in `model.config.lang2id` (dict str -> int) and
-            the `language id -> language name` mapping is `model.config.id2lang` (dict int -> str).
+            More precisely, the `language name to language id` mapping is in :obj:`model.config.lang2id` (which is a
+            dictionary strring to int) and the `language id to language name` mapping is in :obj:`model.config.id2lang`
+            (dictionary int to string).
 
-            See usage examples detailed in the `multilingual documentation <https://huggingface.co/transformers/multilingual.html>`__.
-        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+            See usage examples detailed in the :doc:`multilingual documentation <../multilingual>`.
+        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
             Segment token indices to indicate first and second portions of the inputs.
-            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
-            corresponds to a `sentence B` token
+            Indices are selected in ``[0, 1]``:
 
-            `What are token type IDs? <../glossary.html#token-type-ids>`_
-        position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+            - 0 corresponds to a `sentence A` token,
+            - 1 corresponds to a `sentence B` token.
+
+            `What are token type IDs? <../glossary.html#token-type-ids>`__
+        position_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
             Indices of positions of each input sequence tokens in the position embeddings.
             Selected in the range ``[0, config.max_position_embeddings - 1]``.
 
-            `What are position IDs? <../glossary.html#position-ids>`_
+            `What are position IDs? <../glossary.html#position-ids>`__
         lengths (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
             Length of each sentence that can be used to avoid performing attention on padding token indices.
             You can also use `attention_mask` for the same result (see above), kept here for compatbility.
-            Indices selected in ``[0, ..., input_ids.size(-1)]``:
+            Indices selected in ``[0, ..., input_ids.size(-1)]``.
         cache (:obj:`Dict[str, torch.FloatTensor]`, `optional`):
-            dictionary with ``torch.FloatTensor`` that contains pre-computed
-            hidden-states (key and values in the attention blocks) as computed by the model
-            (see `cache` output below). Can be used to speed up sequential decoding.
-            The dictionary object will be modified in-place during the forward pass to add newly computed hidden-states.
+            Dictionary string to ``torch.FloatTensor`` that contains precomputed hidden states (key and values in the
+            attention blocks) as computed by the model (see :obj:`cache` output below). Can be used to speed up
+            sequential decoding.
+
+            The dictionary object will be modified in-place during the forward pass to add newly computed
+            hidden-states.
         head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
-            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
-        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
+
+            - 1 indicates the head is **not masked**,
+            - 0 indicates the head is **masked**.
+
+        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`({0}, hidden_size)`, `optional`):
             Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
-            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
+            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
+            vectors than the model's internal embedding lookup matrix.
         output_attentions (:obj:`bool`, `optional`):
-            If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
+            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
+            tensors for more detail.
         output_hidden_states (:obj:`bool`, `optional`):
-            If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
+            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
+            more detail.
         return_dict (:obj:`bool`, `optional`):
-            If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
-            plain tuple.
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
 """
 
 
@@ -469,7 +485,7 @@ class XLMModel(XLMPreTrainedModel):
         for layer, heads in heads_to_prune.items():
             self.attentions[layer].prune_heads(heads)
 
-    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="xlm-mlm-en-2048",
@@ -684,7 +700,7 @@ class XLMWithLMHeadModel(XLMPreTrainedModel):
             langs = None
         return {"input_ids": input_ids, "langs": langs}
 
-    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="xlm-mlm-en-2048",
@@ -761,7 +777,7 @@ class XLMForSequenceClassification(XLMPreTrainedModel):
 
         self.init_weights()
 
-    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="xlm-mlm-en-2048",
@@ -847,7 +863,7 @@ class XLMForQuestionAnsweringSimple(XLMPreTrainedModel):
 
         self.init_weights()
 
-    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="xlm-mlm-en-2048",
@@ -874,11 +890,11 @@ class XLMForQuestionAnsweringSimple(XLMPreTrainedModel):
         r"""
         start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for position (index) of the start of the labelled span for computing the token classification loss.
-            Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
         end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for position (index) of the end of the labelled span for computing the token classification loss.
-            Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
         """
         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
@@ -949,7 +965,7 @@ class XLMForQuestionAnswering(XLMPreTrainedModel):
 
         self.init_weights()
 
-    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @replace_return_docstrings(output_type=XLMForQuestionAnsweringOutput, config_class=_CONFIG_FOR_DOC)
     def forward(
         self,
@@ -972,21 +988,21 @@ class XLMForQuestionAnswering(XLMPreTrainedModel):
         return_dict=None,
     ):
         r"""
-            start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
-                Labels for position (index) of the start of the labelled span for computing the token classification loss.
-                Positions are clamped to the length of the sequence (`sequence_length`).
-                Position outside of the sequence are not taken into account for computing the loss.
-            end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
-                Labels for position (index) of the end of the labelled span for computing the token classification loss.
-                Positions are clamped to the length of the sequence (`sequence_length`).
-                Position outside of the sequence are not taken into account for computing the loss.
-            is_impossible (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`):
-                Labels whether a question has an answer or no answer (SQuAD 2.0)
-            cls_index (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`):
-                Labels for position (index) of the classification token to use as input for computing plausibility of the answer.
-            p_mask (``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``, `optional`):
-                Optional mask of tokens which can't be in answers (e.g. [CLS], [PAD], ...).
-                1.0 means token should be masked. 0.0 mean token is not masked.
+        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
+            Labels for position (index) of the start of the labelled span for computing the token classification loss.
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
+            Position outside of the sequence are not taken into account for computing the loss.
+        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
+            Labels for position (index) of the end of the labelled span for computing the token classification loss.
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
+            Position outside of the sequence are not taken into account for computing the loss.
+        is_impossible (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`):
+            Labels whether a question has an answer or no answer (SQuAD 2.0)
+        cls_index (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`):
+            Labels for position (index) of the classification token to use as input for computing plausibility of the answer.
+        p_mask (``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``, `optional`):
+            Optional mask of tokens which can't be in answers (e.g. [CLS], [PAD], ...).
+            1.0 means token should be masked. 0.0 mean token is not masked.
 
         Returns:
 
@@ -1065,7 +1081,7 @@ class XLMForTokenClassification(XLMPreTrainedModel):
 
         self.init_weights()
 
-    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="xlm-mlm-en-2048",
@@ -1156,7 +1172,7 @@ class XLMForMultipleChoice(XLMPreTrainedModel):
 
         self.init_weights()
 
-    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)
+    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING.format("batch_size, num_choicec, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="xlm-mlm-en-2048",
@@ -1180,10 +1196,10 @@ class XLMForMultipleChoice(XLMPreTrainedModel):
         return_dict=None,
     ):
         r"""
-        labels (:obj:`torch.Tensor` of shape :obj:`(batch_size,)`, `optional`):
+        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for computing the multiple choice classification loss.
-            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension
-            of the input tensors. (see `input_ids` above)
+            Indices should be in ``[0, ..., num_choices-1]`` where :obj:`num_choices` is the size of the second dimension
+            of the input tensors. (See :obj:`input_ids` above)
         """
         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
         num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]
diff --git a/src/transformers/modeling_xlm_roberta.py b/src/transformers/modeling_xlm_roberta.py
index e5375bc043..455759742c 100644
--- a/src/transformers/modeling_xlm_roberta.py
+++ b/src/transformers/modeling_xlm_roberta.py
@@ -44,7 +44,11 @@ XLM_ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST = [
 
 XLM_ROBERTA_START_DOCSTRING = r"""
 
-    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.
+    This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
+    methods the library implements for all its model (such as downloading or saving, resizing the input embeddings,
+    pruning heads etc.)
+
+    This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
     Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
     usage and behavior.
 
diff --git a/src/transformers/modeling_xlnet.py b/src/transformers/modeling_xlnet.py
index 3dc5b6c5a6..49893df05e 100755
--- a/src/transformers/modeling_xlnet.py
+++ b/src/transformers/modeling_xlnet.py
@@ -595,9 +595,9 @@ class XLNetModelOutput(ModelOutput):
             ``num_predict`` corresponds to ``target_mapping.shape[1]``. If ``target_mapping`` is ``None``, then
             ``num_predict`` corresponds to ``sequence_length``.
         mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):
-            Contains pre-computed hidden-states.
-            Can be used (see `mems` input) to speed up sequential decoding. The token ids which have their past given to this model
-            should not be passed as input ids as they have already been computed.
+            Contains pre-computed hidden-states. Can be used (see :obj:`mems` input) to speed up sequential decoding.
+            The token ids which have their past given to this model should not be passed as :obj:`input_ids` as they
+            have already been computed.
         hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
             Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
             of shape :obj:`(batch_size, sequence_length, hidden_size)`.
@@ -631,9 +631,9 @@ class XLNetLMHeadModelOutput(ModelOutput):
             ``num_predict`` corresponds to ``target_mapping.shape[1]``. If ``target_mapping`` is ``None``, then
             ``num_predict`` corresponds to ``sequence_length``.
         mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):
-            Contains pre-computed hidden-states.
-            Can be used (see `mems` input) to speed up sequential decoding. The token ids which have their past given to this model
-            should not be passed as input ids as they have already been computed.
+            Contains pre-computed hidden-states. Can be used (see :obj:`mems` input) to speed up sequential decoding.
+            The token ids which have their past given to this model should not be passed as :obj:`input_ids` as they
+            have already been computed.
         hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
             Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
             of shape :obj:`(batch_size, sequence_length, hidden_size)`.
@@ -665,9 +665,9 @@ class XLNetForSequenceClassificationOutput(ModelOutput):
         logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`):
             Classification (or regression if config.num_labels==1) scores (before SoftMax).
         mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):
-            Contains pre-computed hidden-states.
-            Can be used (see `mems` input) to speed up sequential decoding. The token ids which have their past given to this model
-            should not be passed as input ids as they have already been computed.
+            Contains pre-computed hidden-states. Can be used (see :obj:`mems` input) to speed up sequential decoding.
+            The token ids which have their past given to this model should not be passed as :obj:`input_ids` as they
+            have already been computed.
         hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
             Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
             of shape :obj:`(batch_size, sequence_length, hidden_size)`.
@@ -699,9 +699,9 @@ class XLNetForTokenClassificationOutput(ModelOutput):
         logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`):
             Classification scores (before SoftMax).
         mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):
-            Contains pre-computed hidden-states.
-            Can be used (see `mems` input) to speed up sequential decoding. The token ids which have their past given to this model
-            should not be passed as input ids as they have already been computed.
+            Contains pre-computed hidden-states. Can be used (see :obj:`mems` input) to speed up sequential decoding.
+            The token ids which have their past given to this model should not be passed as :obj:`input_ids` as they
+            have already been computed.
         hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
             Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
             of shape :obj:`(batch_size, sequence_length, hidden_size)`.
@@ -735,9 +735,9 @@ class XLNetForMultipleChoiceOutput(ModelOutput):
 
             Classification scores (before SoftMax).
         mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):
-            Contains pre-computed hidden-states.
-            Can be used (see `mems` input) to speed up sequential decoding. The token ids which have their past given to this model
-            should not be passed as input ids as they have already been computed.
+            Contains pre-computed hidden-states. Can be used (see :obj:`mems` input) to speed up sequential decoding.
+            The token ids which have their past given to this model should not be passed as :obj:`input_ids` as they
+            have already been computed.
         hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
             Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
             of shape :obj:`(batch_size, sequence_length, hidden_size)`.
@@ -771,9 +771,9 @@ class XLNetForQuestionAnsweringSimpleOutput(ModelOutput):
         end_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):
             Span-end scores (before SoftMax).
         mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):
-            Contains pre-computed hidden-states.
-            Can be used (see `mems` input) to speed up sequential decoding. The token ids which have their past given to this model
-            should not be passed as input ids as they have already been computed.
+            Contains pre-computed hidden-states. Can be used (see :obj:`mems` input) to speed up sequential decoding.
+            The token ids which have their past given to this model should not be passed as :obj:`input_ids` as they
+            have already been computed.
         hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
             Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
             of shape :obj:`(batch_size, sequence_length, hidden_size)`.
@@ -814,9 +814,9 @@ class XLNetForQuestionAnsweringOutput(ModelOutput):
         cls_logits (``torch.FloatTensor`` of shape ``(batch_size,)``, `optional`, returned if ``start_positions`` or ``end_positions`` is not provided):
             Log probabilities for the ``is_impossible`` label of the answers.
         mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):
-            Contains pre-computed hidden-states.
-            Can be used (see `mems` input) to speed up sequential decoding. The token ids which have their past given to this model
-            should not be passed as input ids as they have already been computed.
+            Contains pre-computed hidden-states. Can be used (see :obj:`mems` input) to speed up sequential decoding.
+            The token ids which have their past given to this model should not be passed as :obj:`input_ids` as they
+            have already been computed.
         hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
             Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
             of shape :obj:`(batch_size, sequence_length, hidden_size)`.
@@ -843,7 +843,11 @@ class XLNetForQuestionAnsweringOutput(ModelOutput):
 
 XLNET_START_DOCSTRING = r"""
 
-    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.
+    This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic
+    methods the library implements for all its model (such as downloading or saving, resizing the input embeddings,
+    pruning heads etc.)
+
+    This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ subclass.
     Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general
     usage and behavior.
 
@@ -858,62 +862,75 @@ XLNET_INPUTS_DOCSTRING = r"""
         input_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`):
             Indices of input sequence tokens in the vocabulary.
 
-            Indices can be obtained using :class:`transformers.BertTokenizer`.
+            Indices can be obtained using :class:`transformers.XLNetTokenizer`.
             See :func:`transformers.PreTrainedTokenizer.encode` and
             :func:`transformers.PreTrainedTokenizer.__call__` for details.
 
             `What are input IDs? <../glossary.html#input-ids>`__
-        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`{0}`, `optional`):
+        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`):
             Mask to avoid performing attention on padding token indices.
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **maked**.
 
             `What are attention masks? <../glossary.html#attention-mask>`__
         mems (:obj:`List[torch.FloatTensor]` of length :obj:`config.n_layers`):
-            Contains pre-computed hidden-states as computed by the model
-            (see `mems` output below). Can be used to speed up sequential decoding. The token ids which have their mems
-            given to this model should not be passed as input ids as they have already been computed.
-            `use_cache` has to be set to `True` to make use of `mems`.
+            Contains pre-computed hidden-states (see :obj:`mems` output below) . Can be used to speed up sequential
+            decoding. The token ids which have their past given to this model should not be passed as
+            :obj:`input_ids` as they have already been computed.
+
+            :obj:`use_cache` has to be set to :obj:`True` to make use of :obj:`mems`.
         perm_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, sequence_length)`, `optional`):
             Mask to indicate the attention pattern for each input token with values selected in ``[0, 1]``:
-            If ``perm_mask[k, i, j] = 0``, i attend to j in batch k;
-            if ``perm_mask[k, i, j] = 1``, i does not attend to j in batch k.
-            If None, each token attends to all the others (full bidirectional attention).
+
+            - if ``perm_mask[k, i, j] = 0``, i attend to j in batch k;
+            - if ``perm_mask[k, i, j] = 1``, i does not attend to j in batch k.
+
+            If not set, each token attends to all the others (full bidirectional attention).
             Only used during pretraining (to define factorization order) or for sequential decoding (generation).
         target_mapping (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_predict, sequence_length)`, `optional`):
             Mask to indicate the output tokens to use.
             If ``target_mapping[k, i, j] = 1``, the i-th predict in batch k is on the j-th token.
             Only used during pretraining for partial prediction or for sequential decoding (generation).
-        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`{0}`, `optional`):
+        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`):
             Segment token indices to indicate first and second portions of the inputs.
-            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
-            corresponds to a `sentence B` token. The classifier token should be represented by a ``2``.
+            Indices are selected in ``[0, 1]``:
 
-            `What are token type IDs? <../glossary.html#token-type-ids>`_
+            - 0 corresponds to a `sentence A` token,
+            - 1 corresponds to a `sentence B` token.
+
+            `What are token type IDs? <../glossary.html#token-type-ids>`__
         input_mask (:obj:`torch.FloatTensor` of shape :obj:`{0}`, `optional`):
             Mask to avoid performing attention on padding token indices.
-            Negative of `attention_mask`, i.e. with 0 for real tokens and 1 for padding.
-            Kept for compatibility with the original code base.
-            You can only uses one of `input_mask` and `attention_mask`
+            Negative of :obj:`attention_mask`, i.e. with 0 for real tokens and 1 for padding which is kept for
+            compatibility with the original code base.
+
             Mask values selected in ``[0, 1]``:
-            ``1`` for tokens that are MASKED, ``0`` for tokens that are NOT MASKED.
+
+            - 1 for tokens that are **masked**,
+            - 0 for tokens that are **not maked**.
+
+            You can only uses one of :obj:`input_mask` and :obj:`attention_mask`.
         head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`):
             Mask to nullify selected heads of the self-attention modules.
             Mask values selected in ``[0, 1]``:
-            :obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
-        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
+
+            - 1 indicates the head is **not masked**,
+            - 0 indicates the head is **masked**.
+
+        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`({0}, hidden_size)`, `optional`):
             Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
-            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
-            than the model's internal embedding lookup matrix.
-        use_cache (:obj:`bool`):
-            If `use_cache` is True, `mems` are returned and can be used to speed up decoding (see `mems`). Defaults to `True`.
+            This is useful if you want more control over how to convert :obj:`input_ids` indices into associated
+            vectors than the model's internal embedding lookup matrix.
         output_attentions (:obj:`bool`, `optional`):
-            If set to ``True``, the attentions tensors of all attention layers are returned. See ``attentions`` under returned tensors for more detail.
+            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
+            tensors for more detail.
         output_hidden_states (:obj:`bool`, `optional`):
-            If set to ``True``, the hidden states of all layers are returned. See ``hidden_states`` under returned tensors for more detail.
+            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
+            more detail.
         return_dict (:obj:`bool`, `optional`):
-            If set to ``True``, the model will return a :class:`~transformers.file_utils.ModelOutput` instead of a
-            plain tuple.
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
 """
 
 
@@ -1051,7 +1068,7 @@ class XLNetModel(XLNetPreTrainedModel):
         pos_emb = pos_emb.to(self.device)
         return pos_emb
 
-    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="xlnet-base-cased",
@@ -1328,7 +1345,7 @@ class XLNetLMHeadModel(XLNetPreTrainedModel):
 
         return inputs
 
-    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @replace_return_docstrings(output_type=XLNetLMHeadModelOutput, config_class=_CONFIG_FOR_DOC)
     def forward(
         self,
@@ -1348,13 +1365,19 @@ class XLNetLMHeadModel(XLNetPreTrainedModel):
         return_dict=None,
     ):
         r"""
-            labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, num_predict)`, `optional`):
-                Labels for masked language modeling.
-                `num_predict` corresponds to `target_mapping.shape[1]`. If `target_mapping` is `None`, then `num_predict` corresponds to `sequence_length`.
-                The labels should correspond to the masked input words that should be predicted and depends on `target_mapping`. Note in order to perform standard auto-regressive language modeling a `<mask>` token has to be added to the `input_ids` (see `prepare_inputs_for_generation` fn and examples below)
-                Indices are selected in ``[-100, 0, ..., config.vocab_size]``
-                All labels set to ``-100`` are ignored, the loss is only
-                computed for labels in ``[0, ..., config.vocab_size]``
+        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, num_predict)`, `optional`):
+            Labels for masked language modeling.
+            :obj:`num_predict` corresponds to :obj:`target_mapping.shape[1]`. If :obj:`target_mapping` is :obj`None`,
+            then :obj:`num_predict` corresponds to :obj:`sequence_length`.
+
+            The labels should correspond to the masked input words that should be predicted and depends on
+            :obj:`target_mapping`. Note in order to perform standard auto-regressive language modeling a
+            `<mask>` token has to be added to the :obj:`input_ids` (see the :obj:`prepare_inputs_for_generation`
+            function and examples below)
+
+            Indices are selected in ``[-100, 0, ..., config.vocab_size]``
+            All labels set to ``-100`` are ignored, the loss is only
+            computed for labels in ``[0, ..., config.vocab_size]``
 
         Return:
 
@@ -1445,7 +1468,7 @@ class XLNetForSequenceClassification(XLNetPreTrainedModel):
 
         self.init_weights()
 
-    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="xlnet-base-cased",
@@ -1537,7 +1560,7 @@ class XLNetForTokenClassification(XLNetPreTrainedModel):
 
         self.init_weights()
 
-    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="xlnet-base-cased",
@@ -1632,7 +1655,7 @@ class XLNetForMultipleChoice(XLNetPreTrainedModel):
 
         self.init_weights()
 
-    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format("(batch_size, num_choices, sequence_length)"))
+    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format("batch_size, num_choices, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="xlnet-base-cased",
@@ -1659,8 +1682,8 @@ class XLNetForMultipleChoice(XLNetPreTrainedModel):
         r"""
         labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for computing the multiple choice classification loss.
-            Indices should be in ``[0, ..., num_choices]`` where `num_choices` is the size of the second dimension
-            of the input tensors. (see `input_ids` above)
+            Indices should be in ``[0, ..., num_choices-1]`` where :obj:`num_choices` is the size of the second dimension
+            of the input tensors. (See :obj:`input_ids` above)
         """
         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
         use_cache = self.training or (use_cache if use_cache is not None else self.config.use_cache)
@@ -1731,7 +1754,7 @@ class XLNetForQuestionAnsweringSimple(XLNetPreTrainedModel):
 
         self.init_weights()
 
-    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @add_code_sample_docstrings(
         tokenizer_class=_TOKENIZER_FOR_DOC,
         checkpoint="xlnet-base-cased",
@@ -1759,11 +1782,11 @@ class XLNetForQuestionAnsweringSimple(XLNetPreTrainedModel):
         r"""
         start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for position (index) of the start of the labelled span for computing the token classification loss.
-            Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
         end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
             Labels for position (index) of the end of the labelled span for computing the token classification loss.
-            Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
             Position outside of the sequence are not taken into account for computing the loss.
         """
         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
@@ -1841,7 +1864,7 @@ class XLNetForQuestionAnswering(XLNetPreTrainedModel):
 
         self.init_weights()
 
-    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
+    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
     @replace_return_docstrings(output_type=XLNetForQuestionAnsweringOutput, config_class=_CONFIG_FOR_DOC)
     def forward(
         self,
@@ -1865,21 +1888,21 @@ class XLNetForQuestionAnswering(XLNetPreTrainedModel):
         return_dict=None,
     ):
         r"""
-            start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
-                Labels for position (index) of the start of the labelled span for computing the token classification loss.
-                Positions are clamped to the length of the sequence (`sequence_length`).
-                Position outside of the sequence are not taken into account for computing the loss.
-            end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
-                Labels for position (index) of the end of the labelled span for computing the token classification loss.
-                Positions are clamped to the length of the sequence (`sequence_length`).
-                Position outside of the sequence are not taken into account for computing the loss.
-            is_impossible (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`):
-                Labels whether a question has an answer or no answer (SQuAD 2.0)
-            cls_index (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`):
-                Labels for position (index) of the classification token to use as input for computing plausibility of the answer.
-            p_mask (``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``, `optional`):
-                Optional mask of tokens which can't be in answers (e.g. [CLS], [PAD], ...).
-                1.0 means token should be masked. 0.0 mean token is not masked.
+        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
+            Labels for position (index) of the start of the labelled span for computing the token classification loss.
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
+            Position outside of the sequence are not taken into account for computing the loss.
+        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
+            Labels for position (index) of the end of the labelled span for computing the token classification loss.
+            Positions are clamped to the length of the sequence (:obj:`sequence_length`).
+            Position outside of the sequence are not taken into account for computing the loss.
+        is_impossible (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`):
+            Labels whether a question has an answer or no answer (SQuAD 2.0)
+        cls_index (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`):
+            Labels for position (index) of the classification token to use as input for computing plausibility of the answer.
+        p_mask (``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``, `optional`):
+            Optional mask of tokens which can't be in answers (e.g. [CLS], [PAD], ...).
+            1.0 means token should be masked. 0.0 mean token is not masked.
 
         Returns:
 
diff --git a/src/transformers/tokenization_albert.py b/src/transformers/tokenization_albert.py
index e604ecd7f5..b5d9296dc5 100644
--- a/src/transformers/tokenization_albert.py
+++ b/src/transformers/tokenization_albert.py
@@ -56,49 +56,49 @@ SPIECE_UNDERLINE = "▁"
 
 class AlbertTokenizer(PreTrainedTokenizer):
     """
-    Constructs an ALBERT tokenizer. Based on `SentencePiece <https://github.com/google/sentencepiece>`__
+    Construct an ALBERT tokenizer. Based on `SentencePiece <https://github.com/google/sentencepiece>`__.
 
-    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the methods. Users
-    should refer to the superclass for more information regarding methods.
+    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the main methods.
+    Users should refer to this superclass for more information regarding those methods.
 
     Args:
-        vocab_file (:obj:`string`):
-            `SentencePiece <https://github.com/google/sentencepiece>`__ file (generally has a .spm extension) that
+        vocab_file (:obj:`str`):
+            `SentencePiece <https://github.com/google/sentencepiece>`__ file (generally has a `.spm` extension) that
             contains the vocabulary necessary to instantiate a tokenizer.
         do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):
-            Whether to lowercase the input when tokenizing.
+            Whether or not to lowercase the input when tokenizing.
         remove_space (:obj:`bool`, `optional`, defaults to :obj:`True`):
-            Whether to strip the text when tokenizing (removing excess spaces before and after the string).
+            Whether or not to strip the text when tokenizing (removing excess spaces before and after the string).
         keep_accents (:obj:`bool`, `optional`, defaults to :obj:`False`):
-            Whether to keep accents when tokenizing.
-        bos_token (:obj:`string`, `optional`, defaults to "[CLS]"):
-            The beginning of sequence token that was used during pre-training. Can be used a sequence classifier token.
+            Whether or not to keep accents when tokenizing.
+        bos_token (:obj:`str`, `optional`, defaults to :obj:`"[CLS]"`):
+            The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
 
             .. note::
 
                 When building a sequence using special tokens, this is not the token that is used for the beginning
                 of sequence. The token used is the :obj:`cls_token`.
-        eos_token (:obj:`string`, `optional`, defaults to "[SEP]"):
+        eos_token (:obj:`str`, `optional`, defaults to :obj:`"[SEP]"`):
             The end of sequence token.
 
             .. note::
 
                 When building a sequence using special tokens, this is not the token that is used for the end
                 of sequence. The token used is the :obj:`sep_token`.
-        unk_token (:obj:`string`, `optional`, defaults to "<unk>"):
+        unk_token (:obj:`str`, `optional`, defaults to :obj:`"<unk>"`):
             The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
             token instead.
-        sep_token (:obj:`string`, `optional`, defaults to "[SEP]"):
+        sep_token (:obj:`str`, `optional`, defaults to :obj:`"[SEP]"`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences
             for sequence classification or for a text and a question for question answering.
             It is also used as the last token of a sequence built with special tokens.
-        pad_token (:obj:`string`, `optional`, defaults to "<pad>"):
+        pad_token (:obj:`str`, `optional`, defaults to :obj:`"<pad>"`):
             The token used for padding, for example when batching sequences of different lengths.
-        cls_token (:obj:`string`, `optional`, defaults to "[CLS]"):
+        cls_token (:obj:`str`, `optional`, defaults to :obj:`"[CLS]"`):
             The classifier token which is used when doing sequence classification (classification of the whole
             sequence instead of per-token classification). It is the first token of the sequence when built with
             special tokens.
-        mask_token (:obj:`string`, `optional`, defaults to "[MASK]"):
+        mask_token (:obj:`str`, `optional`, defaults to :obj:`"[MASK]"`):
             The token used for masking values. This is the token used when training this model with masked language
             modeling. This is the token which the model will try to predict.
 
@@ -245,12 +245,12 @@ class AlbertTokenizer(PreTrainedTokenizer):
 
         Args:
             token_ids_0 (:obj:`List[int]`):
-                List of IDs to which the special tokens will be added
+                List of IDs to which the special tokens will be added.
             token_ids_1 (:obj:`List[int]`, `optional`):
                 Optional second list of IDs for sequence pairs.
 
         Returns:
-            :obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
+            :obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
         """
         sep = [self.sep_token_id]
         cls = [self.cls_token_id]
@@ -262,16 +262,16 @@ class AlbertTokenizer(PreTrainedTokenizer):
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
     ) -> List[int]:
         """
-        Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
+        Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
         special tokens using the tokenizer ``prepare_for_model`` method.
 
         Args:
             token_ids_0 (:obj:`List[int]`):
-                List of ids.
+                List of IDs.
             token_ids_1 (:obj:`List[int]`, `optional`):
                 Optional second list of IDs for sequence pairs.
             already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
-                Set to True if the token list is already formatted with special tokens for the model
+                Whether or not the token list is already formatted with special tokens for the model.
 
         Returns:
             :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
@@ -293,7 +293,7 @@ class AlbertTokenizer(PreTrainedTokenizer):
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
     ) -> List[int]:
         """
-        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.
+        Create a mask from the two sequences passed to be used in a sequence-pair classification task.
         An ALBERT sequence pair mask has the following format:
 
         ::
@@ -301,11 +301,11 @@ class AlbertTokenizer(PreTrainedTokenizer):
             0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
             | first sequence    | second sequence |
 
-        if token_ids_1 is None, only returns the first portion of the mask (0s).
+        If :obj:`token_ids_1` is :obj:`None`, this method only returns the first portion of the mask (0s).
 
         Args:
             token_ids_0 (:obj:`List[int]`):
-                List of ids.
+                List of IDs.
             token_ids_1 (:obj:`List[int]`, `optional`):
                 Optional second list of IDs for sequence pairs.
 
diff --git a/src/transformers/tokenization_bert.py b/src/transformers/tokenization_bert.py
index 98f8287e34..2b2aceca6f 100644
--- a/src/transformers/tokenization_bert.py
+++ b/src/transformers/tokenization_bert.py
@@ -119,44 +119,45 @@ def whitespace_tokenize(text):
 
 class BertTokenizer(PreTrainedTokenizer):
     r"""
-    Constructs a BERT tokenizer. Based on WordPiece.
+    Construct a BERT tokenizer. Based on WordPiece.
 
-    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the methods. Users
-    should refer to the superclass for more information regarding methods.
+    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the main methods.
+    Users should refer to this superclass for more information regarding those methods.
 
     Args:
-        vocab_file (:obj:`string`):
+        vocab_file (:obj:`str`):
             File containing the vocabulary.
         do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):
-            Whether to lowercase the input when tokenizing.
+            Whether or not to lowercase the input when tokenizing.
         do_basic_tokenize (:obj:`bool`, `optional`, defaults to :obj:`True`):
-            Whether to do basic tokenization before WordPiece.
+            Whether or not to do basic tokenization before WordPiece.
         never_split (:obj:`Iterable`, `optional`):
             Collection of tokens which will never be split during tokenization. Only has an effect when
             :obj:`do_basic_tokenize=True`
-        unk_token (:obj:`string`, `optional`, defaults to "[UNK]"):
+        unk_token (:obj:`str`, `optional`, defaults to :obj:`"[UNK]"`):
             The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
             token instead.
-        sep_token (:obj:`string`, `optional`, defaults to "[SEP]"):
+        sep_token (:obj:`str`, `optional`, defaults to :obj:`"[SEP]"`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences
             for sequence classification or for a text and a question for question answering.
             It is also used as the last token of a sequence built with special tokens.
-        pad_token (:obj:`string`, `optional`, defaults to "[PAD]"):
+        pad_token (:obj:`str`, `optional`, defaults to :obj:`"[PAD]"`):
             The token used for padding, for example when batching sequences of different lengths.
-        cls_token (:obj:`string`, `optional`, defaults to "[CLS]"):
+        cls_token (:obj:`str`, `optional`, defaults to :obj:`"[CLS]"`):
             The classifier token which is used when doing sequence classification (classification of the whole
             sequence instead of per-token classification). It is the first token of the sequence when built with
             special tokens.
-        mask_token (:obj:`string`, `optional`, defaults to "[MASK]"):
+        mask_token (:obj:`str`, `optional`, defaults to :obj:`"[MASK]"`):
             The token used for masking values. This is the token used when training this model with masked language
             modeling. This is the token which the model will try to predict.
         tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`):
-            Whether to tokenize Chinese characters.
-            This should likely be deactivated for Japanese:
-            see: https://github.com/huggingface/transformers/issues/328
+            Whether or not to tokenize Chinese characters.
+
+            This should likely be deactivated for Japanese (see this `issue
+            <https://github.com/huggingface/transformers/issues/328>`__).
         strip_accents: (:obj:`bool`, `optional`):
-            Whether to strip all accents. If this option is not specified (ie == None),
-            then it will be determined by the value for `lowercase` (as in the original Bert).
+            Whether or not to strip all accents. If this option is not specified, then it will be determined by the
+            value for :obj:`lowercase` (as in the original BERT).
     """
 
     vocab_files_names = VOCAB_FILES_NAMES
@@ -252,12 +253,12 @@ class BertTokenizer(PreTrainedTokenizer):
 
         Args:
             token_ids_0 (:obj:`List[int]`):
-                List of IDs to which the special tokens will be added
+                List of IDs to which the special tokens will be added.
             token_ids_1 (:obj:`List[int]`, `optional`):
                 Optional second list of IDs for sequence pairs.
 
         Returns:
-            :obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
+            :obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
         """
         if token_ids_1 is None:
             return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
@@ -269,16 +270,16 @@ class BertTokenizer(PreTrainedTokenizer):
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
     ) -> List[int]:
         """
-        Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
+        Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
         special tokens using the tokenizer ``prepare_for_model`` method.
 
         Args:
             token_ids_0 (:obj:`List[int]`):
-                List of ids.
+                List of IDs.
             token_ids_1 (:obj:`List[int]`, `optional`):
                 Optional second list of IDs for sequence pairs.
             already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
-                Set to True if the token list is already formatted with special tokens for the model
+                Whether or not the token list is already formatted with special tokens for the model.
 
         Returns:
             :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
@@ -300,7 +301,7 @@ class BertTokenizer(PreTrainedTokenizer):
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
     ) -> List[int]:
         """
-        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.
+        Create a mask from the two sequences passed to be used in a sequence-pair classification task.
         A BERT sequence pair mask has the following format:
 
         ::
@@ -308,11 +309,11 @@ class BertTokenizer(PreTrainedTokenizer):
             0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
             | first sequence    | second sequence |
 
-        if token_ids_1 is None, only returns the first portion of the mask (0's).
+        If :obj:`token_ids_1` is :obj:`None`, this method only returns the first portion of the mask (0s).
 
         Args:
             token_ids_0 (:obj:`List[int]`):
-                List of ids.
+                List of IDs.
             token_ids_1 (:obj:`List[int]`, `optional`):
                 Optional second list of IDs for sequence pairs.
 
@@ -328,7 +329,7 @@ class BertTokenizer(PreTrainedTokenizer):
 
     def save_vocabulary(self, vocab_path):
         """
-        Save the sentencepiece vocabulary (copy original file) and special tokens file to a directory.
+        Save the vocabulary (copy original file) and special tokens file to a directory.
 
         Args:
             vocab_path (:obj:`str`):
@@ -356,25 +357,26 @@ class BertTokenizer(PreTrainedTokenizer):
 
 
 class BasicTokenizer(object):
-    """Runs basic tokenization (punctuation splitting, lower casing, etc.)."""
+    """
+    Constructs a BasicTokenizer that will run basic tokenization (punctuation splitting, lower casing, etc.).
+
+    Args:
+        do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):
+            Whether or not to lowercase the input when tokenizing.
+        never_split (:obj:`Iterable`, `optional`):
+            Collection of tokens which will never be split during tokenization. Only has an effect when
+            :obj:`do_basic_tokenize=True`
+        tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`):
+            Whether or not to tokenize Chinese characters.
+
+            This should likely be deactivated for Japanese (see this `issue
+            <https://github.com/huggingface/transformers/issues/328>`__).
+        strip_accents: (:obj:`bool`, `optional`):
+            Whether or not to strip all accents. If this option is not specified, then it will be determined by the
+            value for :obj:`lowercase` (as in the original BERT).
+    """
 
     def __init__(self, do_lower_case=True, never_split=None, tokenize_chinese_chars=True, strip_accents=None):
-        """Constructs a BasicTokenizer.
-
-        Args:
-            **do_lower_case**: Whether to lower case the input.
-            **never_split**: (`optional`) list of str
-                Kept for backward compatibility purposes.
-                Now implemented directly at the base class level (see :func:`PreTrainedTokenizer.tokenize`)
-                List of token not to split.
-            **tokenize_chinese_chars**: (`optional`) boolean (default True)
-                Whether to tokenize Chinese characters.
-                This should likely be deactivated for Japanese:
-                see: https://github.com/huggingface/pytorch-pretrained-BERT/issues/328
-            **strip_accents**: (`optional`) boolean (default None)
-                Whether to strip all accents. If this option is not specified (ie == None),
-                then it will be determined by the value for `lowercase` (as in the original Bert).
-        """
         if never_split is None:
             never_split = []
         self.do_lower_case = do_lower_case
@@ -564,45 +566,43 @@ class WordpieceTokenizer(object):
 
 class BertTokenizerFast(PreTrainedTokenizerFast):
     r"""
-    Constructs a "Fast" BERT tokenizer (backed by HuggingFace's `tokenizers` library).
+    Construct a "fast" BERT tokenizer (backed by HuggingFace's `tokenizers` library). Based on WordPiece.
 
-    Bert tokenization is Based on WordPiece.
-
-    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizerFast` which contains most of the methods. Users
-    should refer to the superclass for more information regarding methods.
+    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizerFast` which contains most of the main
+    methods. Users should refer to this superclass for more information regarding those methods.
 
     Args:
-        vocab_file (:obj:`string`):
+        vocab_file (:obj:`str`):
             File containing the vocabulary.
         do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):
-            Whether to lowercase the input when tokenizing.
-        unk_token (:obj:`string`, `optional`, defaults to "[UNK]"):
+            Whether or not to lowercase the input when tokenizing.
+        unk_token (:obj:`str`, `optional`, defaults to :obj:`"[UNK]"`):
             The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
             token instead.
-        sep_token (:obj:`string`, `optional`, defaults to "[SEP]"):
+        sep_token (:obj:`str`, `optional`, defaults to :obj:`"[SEP]"`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences
             for sequence classification or for a text and a question for question answering.
             It is also used as the last token of a sequence built with special tokens.
-        pad_token (:obj:`string`, `optional`, defaults to "[PAD]"):
+        pad_token (:obj:`str`, `optional`, defaults to :obj:`"[PAD]"`):
             The token used for padding, for example when batching sequences of different lengths.
-        cls_token (:obj:`string`, `optional`, defaults to "[CLS]"):
+        cls_token (:obj:`str`, `optional`, defaults to :obj:`"[CLS]"`):
             The classifier token which is used when doing sequence classification (classification of the whole
             sequence instead of per-token classification). It is the first token of the sequence when built with
             special tokens.
-        mask_token (:obj:`string`, `optional`, defaults to "[MASK]"):
+        mask_token (:obj:`str`, `optional`, defaults to :obj:`"[MASK]"`):
             The token used for masking values. This is the token used when training this model with masked language
             modeling. This is the token which the model will try to predict.
         clean_text (:obj:`bool`, `optional`, defaults to :obj:`True`):
-            Whether to clean the text before tokenization by removing any control characters and
+            Whether or not to clean the text before tokenization by removing any control characters and
             replacing all whitespaces by the classic one.
         tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`):
-            Whether to tokenize Chinese characters.
-            This should likely be deactivated for Japanese:
-            see: https://github.com/huggingface/transformers/issues/328
+            Whether or not to tokenize Chinese characters.
+            This should likely be deactivated for Japanese (see `this issue
+            <https://github.com/huggingface/transformers/issues/328>`__).
         strip_accents: (:obj:`bool`, `optional`):
-            Whether to strip all accents. If this option is not specified (ie == None),
-            then it will be determined by the value for `lowercase` (as in the original Bert).
-        wordpieces_prefix: (:obj:`string`, `optional`, defaults to "##"):
+            Whether or not to strip all accents. If this option is not specified, then it will be determined by the
+            value for :obj:`lowercase` (as in the original BERT).
+        wordpieces_prefix: (:obj:`str`, `optional`, defaults to :obj:`"##"`):
             The prefix for subwords.
     """
 
@@ -651,6 +651,23 @@ class BertTokenizerFast(PreTrainedTokenizerFast):
         self.do_lower_case = do_lower_case
 
     def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
+        """
+        Build model inputs from a sequence or a pair of sequence for sequence classification tasks
+        by concatenating and adding special tokens.
+        A BERT sequence has the following format:
+
+        - single sequence: ``[CLS] X [SEP]``
+        - pair of sequences: ``[CLS] A [SEP] B [SEP]``
+
+        Args:
+            token_ids_0 (:obj:`List[int]`):
+                List of IDs to which the special tokens will be added.
+            token_ids_1 (:obj:`List[int]`, `optional`):
+                Optional second list of IDs for sequence pairs.
+
+        Returns:
+            :obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
+        """
         output = [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
 
         if token_ids_1:
@@ -662,7 +679,7 @@ class BertTokenizerFast(PreTrainedTokenizerFast):
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
     ) -> List[int]:
         """
-        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.
+        Create a mask from the two sequences passed to be used in a sequence-pair classification task.
         A BERT sequence pair mask has the following format:
 
         ::
@@ -670,11 +687,11 @@ class BertTokenizerFast(PreTrainedTokenizerFast):
             0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
             | first sequence    | second sequence |
 
-        if token_ids_1 is None, only returns the first portion of the mask (0's).
+        If :obj:`token_ids_1` is :obj:`None`, this method only returns the first portion of the mask (0s).
 
         Args:
             token_ids_0 (:obj:`List[int]`):
-                List of ids.
+                List of IDs.
             token_ids_1 (:obj:`List[int]`, `optional`):
                 Optional second list of IDs for sequence pairs.
 
diff --git a/src/transformers/tokenization_bert_generation.py b/src/transformers/tokenization_bert_generation.py
index 539b063556..fac4153fd8 100644
--- a/src/transformers/tokenization_bert_generation.py
+++ b/src/transformers/tokenization_bert_generation.py
@@ -34,23 +34,23 @@ tokenizer_url = (
 
 class BertGenerationTokenizer(PreTrainedTokenizer):
     """
-    Constructs a BertGenerationTokenizer tokenizer. Based on `SentencePiece <https://github.com/google/sentencepiece>`__ .
+    Construct a BertGeneration tokenizer. Based on `SentencePiece <https://github.com/google/sentencepiece>`__.
 
-    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the methods. Users
-    should refer to the superclass for more information regarding methods.
+    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the main methods.
+    Users should refer to this superclass for more information regarding those methods.
 
     Args:
-        vocab_file (:obj:`string`):
+        vocab_file (:obj:`str`):
             `SentencePiece <https://github.com/google/sentencepiece>`__ file (generally has a `.spm` extension) that
             contains the vocabulary necessary to instantiate a tokenizer.
-        eos_token (:obj:`string`, `optional`, defaults to :obj:`"</s>"`):
+        eos_token (:obj:`str`, `optional`, defaults to :obj:`"</s>"`):
             The end of sequence token.
-        bos_token (:obj:`string`, `optional`, defaults to :obj:`"<s>"`):
+        bos_token (:obj:`str`, `optional`, defaults to :obj:`"<s>"`):
             The begin of sequence token.
-        unk_token (:obj:`string`, `optional`, defaults to :obj:`"<unk>"`):
+        unk_token (:obj:`str`, `optional`, defaults to :obj:`"<unk>"`):
             The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
             token instead.
-        pad_token (:obj:`string`, `optional`, defaults to :obj:`"<pad>"`):
+        pad_token (:obj:`str`, `optional`, defaults to :obj:`"<pad>"`):
             The token used for padding, for example when batching sequences of different lengths.
     """
 
@@ -142,8 +142,15 @@ class BertGenerationTokenizer(PreTrainedTokenizer):
         return out_string
 
     def save_vocabulary(self, save_directory):
-        """Save the sentencepiece vocabulary (copy original file) and special tokens file
-        to a directory.
+        """
+        Save the sentencepiece vocabulary (copy original file) and special tokens file to a directory.
+
+        Args:
+            save_directory (:obj:`str`):
+                The directory in which to save the vocabulary.
+
+        Returns:
+            :obj:`Tuple(str)`: Paths to the files saved.
         """
         if not os.path.isdir(save_directory):
             logger.error("Vocabulary path ({}) should be a directory".format(save_directory))
diff --git a/src/transformers/tokenization_bertweet.py b/src/transformers/tokenization_bertweet.py
index e5cf3a4d40..3c30c0d40a 100644
--- a/src/transformers/tokenization_bertweet.py
+++ b/src/transformers/tokenization_bertweet.py
@@ -66,48 +66,46 @@ def get_pairs(word):
 
 class BertweetTokenizer(PreTrainedTokenizer):
     """
-    Constructs a BERTweet tokenizer. Peculiarities:
+    Constructs a BERTweet tokenizer, using Byte-Pair-Encoding.
 
-    - Byte-Pair-Encoding
-
-    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the methods. Users
-    should refer to the superclass for more information regarding methods.
+    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the main methods.
+    Users should refer to this superclass for more information regarding those methods.
 
     Args:
         vocab_file (:obj:`str`):
             Path to the vocabulary file.
         merges_file (:obj:`str`):
             Path to the merges file.
-        normalization (:obj:`boolean`, defaults to False)
-            Whether to apply a normalization pre-process.
-        bos_token (:obj:`string`, `optional`, defaults to "<s>"):
+        normalization (:obj:`bool`, `optional`, defaults to :obj:`False`)
+            Whether or not to apply a normalization preprocess.
+        bos_token (:obj:`str`, `optional`, defaults to :obj:`"<s>"`):
             The beginning of sequence token that was used during pre-training. Can be used a sequence classifier token.
 
             .. note::
 
                 When building a sequence using special tokens, this is not the token that is used for the beginning
                 of sequence. The token used is the :obj:`cls_token`.
-        eos_token (:obj:`string`, `optional`, defaults to "</s>"):
+        eos_token (:obj:`str`, `optional`, defaults to :obj:`"</s>"`):
             The end of sequence token.
 
             .. note::
 
                 When building a sequence using special tokens, this is not the token that is used for the end
                 of sequence. The token used is the :obj:`sep_token`.
-        sep_token (:obj:`string`, `optional`, defaults to "</s>"):
+        sep_token (:obj:`str`, `optional`, defaults to :obj:`"</s>"`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences
             for sequence classification or for a text and a question for question answering.
             It is also used as the last token of a sequence built with special tokens.
-        cls_token (:obj:`string`, `optional`, defaults to "<s>"):
+        cls_token (:obj:`str`, `optional`, defaults to :obj:`"<s>"`):
             The classifier token which is used when doing sequence classification (classification of the whole
             sequence instead of per-token classification). It is the first token of the sequence when built with
             special tokens.
-        unk_token (:obj:`string`, `optional`, defaults to "<unk>"):
+        unk_token (:obj:`str`, `optional`, defaults to :obj:`"<unk>"`):
             The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
             token instead.
-        pad_token (:obj:`string`, `optional`, defaults to "<pad>"):
+        pad_token (:obj:`str`, `optional`, defaults to :obj:`"<pad>"`):
             The token used for padding, for example when batching sequences of different lengths.
-        mask_token (:obj:`string`, `optional`, defaults to "<mask>"):
+        mask_token (:obj:`str`, `optional`, defaults to :obj:`"<mask>"`):
             The token used for masking values. This is the token used when training this model with masked language
             modeling. This is the token which the model will try to predict.
     """
@@ -182,19 +180,19 @@ class BertweetTokenizer(PreTrainedTokenizer):
         """
         Build model inputs from a sequence or a pair of sequence for sequence classification tasks
         by concatenating and adding special tokens.
-        A BERTweet sequence has the following format:
+         A BERTweet sequence has the following format:
 
         - single sequence: ``<s> X </s>``
         - pair of sequences: ``<s> A </s></s> B </s>``
 
         Args:
             token_ids_0 (:obj:`List[int]`):
-                List of IDs to which the special tokens will be added
-            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):
+                List of IDs to which the special tokens will be added.
+            token_ids_1 (:obj:`List[int]`, `optional`):
                 Optional second list of IDs for sequence pairs.
 
         Returns:
-            :obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
+            :obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
         """
 
         if token_ids_1 is None:
@@ -207,16 +205,16 @@ class BertweetTokenizer(PreTrainedTokenizer):
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
     ) -> List[int]:
         """
-        Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
-        special tokens using the tokenizer ``prepare_for_model`` methods.
+        Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
+        special tokens using the tokenizer ``prepare_for_model`` method.
 
         Args:
             token_ids_0 (:obj:`List[int]`):
-                List of ids.
-            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):
+                List of IDs.
+            token_ids_1 (:obj:`List[int]`, `optional`):
                 Optional second list of IDs for sequence pairs.
             already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
-                Set to True if the token list is already formatted with special tokens for the model
+                Whether or not the token list is already formatted with special tokens for the model.
 
         Returns:
             :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
@@ -238,18 +236,17 @@ class BertweetTokenizer(PreTrainedTokenizer):
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
     ) -> List[int]:
         """
-        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.
+        Create a mask from the two sequences passed to be used in a sequence-pair classification task.
         BERTweet does not make use of token type ids, therefore a list of zeros is returned.
 
         Args:
             token_ids_0 (:obj:`List[int]`):
-                List of ids.
-            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):
+                List of IDs.
+            token_ids_1 (:obj:`List[int]`, `optional`):
                 Optional second list of IDs for sequence pairs.
 
         Returns:
             :obj:`List[int]`: List of zeros.
-
         """
 
         sep = [self.sep_token_id]
@@ -389,10 +386,12 @@ class BertweetTokenizer(PreTrainedTokenizer):
 
     def save_vocabulary(self, save_directory):
         """
-        Save the vocabulary and special tokens file to a directory.
+        Save the sentencepiece vocabulary (copy original file) and special tokens file to a directory.
+
         Args:
             save_directory (:obj:`str`):
                 The directory in which to save the vocabulary.
+
         Returns:
             :obj:`Tuple(str)`: Paths to the files saved.
         """
diff --git a/src/transformers/tokenization_camembert.py b/src/transformers/tokenization_camembert.py
index 7e8fe34b87..c23758f74b 100644
--- a/src/transformers/tokenization_camembert.py
+++ b/src/transformers/tokenization_camembert.py
@@ -51,45 +51,44 @@ SPIECE_UNDERLINE = "▁"
 
 class CamembertTokenizer(PreTrainedTokenizer):
     """
-        Adapted from RobertaTokenizer and XLNetTokenizer
-        SentencePiece based tokenizer. Peculiarities:
+    Adapted from :class:`~transformers.RobertaTokenizer` and :class:`~transformers.XLNetTokenizer`. Construct a
+    CamemBERT tokenizer. Based on `SentencePiece <https://github.com/google/sentencepiece>`__.
 
-        - requires `SentencePiece <https://github.com/google/sentencepiece>`_
-
-    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the methods. Users
-    should refer to the superclass for more information regarding methods.
+    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the main methods.
+    Users should refer to this superclass for more information regarding those methods.
 
     Args:
         vocab_file (:obj:`str`):
-            Path to the vocabulary file.
-        bos_token (:obj:`string`, `optional`, defaults to "<s>"):
-            The beginning of sequence token that was used during pre-training. Can be used a sequence classifier token.
+            `SentencePiece <https://github.com/google/sentencepiece>`__ file (generally has a `.spm` extension) that
+            contains the vocabulary necessary to instantiate a tokenizer.
+        bos_token (:obj:`str`, `optional`, defaults to :obj:`"<s>"`):
+            The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
 
             .. note::
 
                 When building a sequence using special tokens, this is not the token that is used for the beginning
                 of sequence. The token used is the :obj:`cls_token`.
-        eos_token (:obj:`string`, `optional`, defaults to "</s>"):
+        eos_token (:obj:`str`, `optional`, defaults to :obj:`"</s>"`):
             The end of sequence token.
 
             .. note::
 
                 When building a sequence using special tokens, this is not the token that is used for the end
                 of sequence. The token used is the :obj:`sep_token`.
-        sep_token (:obj:`string`, `optional`, defaults to "</s>"):
+        sep_token (:obj:`str`, `optional`, defaults to :obj:`"</s>"`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences
             for sequence classification or for a text and a question for question answering.
             It is also used as the last token of a sequence built with special tokens.
-        cls_token (:obj:`string`, `optional`, defaults to "<s>"):
+        cls_token (:obj:`str`, `optional`, defaults to :obj:`"<s>"`):
             The classifier token which is used when doing sequence classification (classification of the whole
             sequence instead of per-token classification). It is the first token of the sequence when built with
             special tokens.
-        unk_token (:obj:`string`, `optional`, defaults to "<unk>"):
+        unk_token (:obj:`str`, `optional`, defaults to :obj:`"<unk>"`):
             The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
             token instead.
-        pad_token (:obj:`string`, `optional`, defaults to "<pad>"):
+        pad_token (:obj:`str`, `optional`, defaults to :obj:`"<pad>"`):
             The token used for padding, for example when batching sequences of different lengths.
-        mask_token (:obj:`string`, `optional`, defaults to "<mask>"):
+        mask_token (:obj:`str`, `optional`, defaults to :obj:`"<mask>"`):
             The token used for masking values. This is the token used when training this model with masked language
             modeling. This is the token which the model will try to predict.
         additional_special_tokens (:obj:`List[str]`, `optional`, defaults to :obj:`["<s>NOTUSED", "</s>NOTUSED"]`):
@@ -146,19 +145,19 @@ class CamembertTokenizer(PreTrainedTokenizer):
         """
         Build model inputs from a sequence or a pair of sequence for sequence classification tasks
         by concatenating and adding special tokens.
-        A CamemBERT sequence has the following format:
+        An CamemBERT sequence has the following format:
 
         - single sequence: ``<s> X </s>``
         - pair of sequences: ``<s> A </s></s> B </s>``
 
         Args:
             token_ids_0 (:obj:`List[int]`):
-                List of IDs to which the special tokens will be added
+                List of IDs to which the special tokens will be added.
             token_ids_1 (:obj:`List[int]`, `optional`):
                 Optional second list of IDs for sequence pairs.
 
         Returns:
-            :obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
+            :obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
         """
 
         if token_ids_1 is None:
@@ -171,16 +170,16 @@ class CamembertTokenizer(PreTrainedTokenizer):
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
     ) -> List[int]:
         """
-        Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
+        Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
         special tokens using the tokenizer ``prepare_for_model`` method.
 
         Args:
             token_ids_0 (:obj:`List[int]`):
-                List of ids.
+                List of IDs.
             token_ids_1 (:obj:`List[int]`, `optional`):
                 Optional second list of IDs for sequence pairs.
             already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
-                Set to True if the token list is already formatted with special tokens for the model
+                Whether or not the token list is already formatted with special tokens for the model.
 
         Returns:
             :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
@@ -201,18 +200,17 @@ class CamembertTokenizer(PreTrainedTokenizer):
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
     ) -> List[int]:
         """
-        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.
+        Create a mask from the two sequences passed to be used in a sequence-pair classification task.
         CamemBERT, like RoBERTa, does not make use of token type ids, therefore a list of zeros is returned.
 
         Args:
             token_ids_0 (:obj:`List[int]`):
-                List of ids.
+                List of IDs.
             token_ids_1 (:obj:`List[int]`, `optional`):
                 Optional second list of IDs for sequence pairs.
 
         Returns:
             :obj:`List[int]`: List of zeros.
-
         """
         sep = [self.sep_token_id]
         cls = [self.cls_token_id]
diff --git a/src/transformers/tokenization_ctrl.py b/src/transformers/tokenization_ctrl.py
index bcff0dca2f..f27cc57d8b 100644
--- a/src/transformers/tokenization_ctrl.py
+++ b/src/transformers/tokenization_ctrl.py
@@ -116,19 +116,17 @@ def get_pairs(word):
 
 class CTRLTokenizer(PreTrainedTokenizer):
     """
-    Constructs a CTRL tokenizer. Peculiarities:
+    Construct a CTRL tokenizer. Based on Byte-Pair-Encoding.
 
-    - Byte-Pair-Encoding
-
-    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the methods. Users
-    should refer to the superclass for more information regarding methods.
+    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the main methods.
+    Users should refer to this superclass for more information regarding those methods.
 
     Args:
         vocab_file (:obj:`str`):
             Path to the vocabulary file.
         merges_file (:obj:`str`):
             Path to the merges file.
-        unk_token (:obj:`string`, `optional`, defaults to "<unk>"):
+        unk_token (:obj:`str`, `optional`, defaults to :obj:`"<unk>"`):
             The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
             token instead.
     """
diff --git a/src/transformers/tokenization_distilbert.py b/src/transformers/tokenization_distilbert.py
index 10e9d54601..9de887328a 100644
--- a/src/transformers/tokenization_distilbert.py
+++ b/src/transformers/tokenization_distilbert.py
@@ -55,10 +55,10 @@ PRETRAINED_INIT_CONFIGURATION = {
 
 class DistilBertTokenizer(BertTokenizer):
     r"""
-    Constructs a  DistilBertTokenizer.
+    Construct a DistilBERT tokenizer.
 
-    :class:`~transformers.DistilBertTokenizer is identical to :class:`~transformers.BertTokenizer` and runs end-to-end
-    tokenization: punctuation splitting + wordpiece.
+    :class:`~transformers.DistilBertTokenizer` is identical to :class:`~transformers.BertTokenizer` and runs end-to-end
+    tokenization: punctuation splitting and wordpiece.
 
     Refer to superclass :class:`~transformers.BertTokenizer` for usage examples and documentation concerning
     parameters.
@@ -73,10 +73,10 @@ class DistilBertTokenizer(BertTokenizer):
 
 class DistilBertTokenizerFast(BertTokenizerFast):
     r"""
-    Constructs a  "Fast" DistilBertTokenizer (backed by HuggingFace's `tokenizers` library).
+    Construct a "fast" DistilBERT tokenizer (backed by HuggingFace's `tokenizers` library).
 
-    :class:`~transformers.DistilBertTokenizerFast` is identical to :class:`~transformers.BertTokenizerFast` and runs end-to-end
-    tokenization: punctuation splitting + wordpiece.
+    :class:`~transformers.DistilBertTokenizerFast` is identical to :class:`~transformers.BertTokenizerFast` and runs
+    end-to-end tokenization: punctuation splitting and wordpiece.
 
     Refer to superclass :class:`~transformers.BertTokenizerFast` for usage examples and documentation concerning
     parameters.
diff --git a/src/transformers/tokenization_dpr.py b/src/transformers/tokenization_dpr.py
index 41728d455d..e645ccd1c0 100644
--- a/src/transformers/tokenization_dpr.py
+++ b/src/transformers/tokenization_dpr.py
@@ -68,10 +68,10 @@ READER_PRETRAINED_INIT_CONFIGURATION = {
 
 class DPRContextEncoderTokenizer(BertTokenizer):
     r"""
-    Constructs a  DPRContextEncoderTokenizer.
+    Construct a DPRContextEncoder tokenizer.
 
-    :class:`~transformers.DPRContextEncoderTokenizer` is identical to :class:`~transformers.BertTokenizer` and runs end-to-end
-    tokenization: punctuation splitting + wordpiece.
+    :class:`~transformers.DPRContextEncoderTokenizer` is identical to :class:`~transformers.BertTokenizer` and runs
+    end-to-end tokenization: punctuation splitting and wordpiece.
 
     Refer to superclass :class:`~transformers.BertTokenizer` for usage examples and documentation concerning
     parameters.
@@ -85,10 +85,10 @@ class DPRContextEncoderTokenizer(BertTokenizer):
 
 class DPRContextEncoderTokenizerFast(BertTokenizerFast):
     r"""
-    Constructs a  "Fast" DPRContextEncoderTokenizer (backed by HuggingFace's `tokenizers` library).
+    Construct a "fast" DPRContextEncoder tokenizer (backed by HuggingFace's `tokenizers` library).
 
-    :class:`~transformers.DPRContextEncoderTokenizerFast` is identical to :class:`~transformers.BertTokenizerFast` and runs end-to-end
-    tokenization: punctuation splitting + wordpiece.
+    :class:`~transformers.DPRContextEncoderTokenizerFast` is identical to :class:`~transformers.BertTokenizerFast` and
+    runs end-to-end tokenization: punctuation splitting and wordpiece.
 
     Refer to superclass :class:`~transformers.BertTokenizerFast` for usage examples and documentation concerning
     parameters.
@@ -102,10 +102,10 @@ class DPRContextEncoderTokenizerFast(BertTokenizerFast):
 
 class DPRQuestionEncoderTokenizer(BertTokenizer):
     r"""
-    Constructs a  DPRQuestionEncoderTokenizer.
+    Constructs a DPRQuestionEncoder tokenizer.
 
-    :class:`~transformers.DPRQuestionEncoderTokenizer` is identical to :class:`~transformers.BertTokenizer` and runs end-to-end
-    tokenization: punctuation splitting + wordpiece.
+    :class:`~transformers.DPRQuestionEncoderTokenizer` is identical to :class:`~transformers.BertTokenizer` and runs
+    end-to-end tokenization: punctuation splitting and wordpiece.
 
     Refer to superclass :class:`~transformers.BertTokenizer` for usage examples and documentation concerning
     parameters.
@@ -119,10 +119,10 @@ class DPRQuestionEncoderTokenizer(BertTokenizer):
 
 class DPRQuestionEncoderTokenizerFast(BertTokenizerFast):
     r"""
-    Constructs a  "Fast" DPRQuestionEncoderTokenizer (backed by HuggingFace's `tokenizers` library).
+    Constructs a "fast" DPRQuestionEncoder tokenizer (backed by HuggingFace's `tokenizers` library).
 
-    :class:`~transformers.DPRQuestionEncoderTokenizerFast` is identical to :class:`~transformers.BertTokenizerFast` and runs end-to-end
-    tokenization: punctuation splitting + wordpiece.
+    :class:`~transformers.DPRQuestionEncoderTokenizerFast` is identical to :class:`~transformers.BertTokenizerFast` and
+    runs end-to-end tokenization: punctuation splitting and wordpiece.
 
     Refer to superclass :class:`~transformers.BertTokenizerFast` for usage examples and documentation concerning
     parameters.
@@ -142,59 +142,71 @@ DPRReaderOutput = collections.namedtuple("DPRReaderOutput", ["start_logits", "en
 
 
 CUSTOM_DPR_READER_DOCSTRING = r"""
-    Return a dictionary with the token ids of the input strings and other information to give to :obj:`.decode_best_spans`.
-    It converts the strings of a question and different passages (title + text) in a sequence of ids (integer), using the tokenizer and vocabulary.
-    The resulting `input_ids` is a matrix of size :obj:`(n_passages, sequence_length)` with the format:
+    Return a dictionary with the token ids of the input strings and other information to give to
+    :obj:`.decode_best_spans`.
+    It converts the strings of a question and different passages (title and text) in a sequence of IDs (integers),
+    using the tokenizer and vocabulary. The resulting :obj:`input_ids` is a matrix of size
+    :obj:`(n_passages, sequence_length)` with the format:
 
         [CLS] <question token ids> [SEP] <titles ids> [SEP] <texts ids>
 
-    Inputs:
-        questions (:obj:`str`, :obj:`List[str]`):
+    Args:
+        questions (:obj:`str` or :obj:`List[str]`):
             The questions to be encoded.
-            You can specify one question for many passages. In this case, the question will be duplicated like :obj:`[questions] * n_passages`.
+            You can specify one question for many passages. In this case, the question will be duplicated like
+            :obj:`[questions] * n_passages`.
             Otherwise you have to specify as many questions as in :obj:`titles` or :obj:`texts`.
-        titles (:obj:`str`, :obj:`List[str]`):
-            The passages titles to be encoded. This can be a string, a list of strings if there are several passages.
-        texts (:obj:`str`, :obj:`List[str]`):
-            The passages texts to be encoded. This can be a string, a list of strings if there are several passages.
-        padding (:obj:`Union[bool, str]`, `optional`, defaults to :obj:`False`):
-            Activate and control padding. Accepts the following values:
+        titles (:obj:`str` or :obj:`List[str]`):
+            The passages titles to be encoded. This can be a string or a list of strings if there are several passages.
+        texts (:obj:`str` or :obj:`List[str]`):
+            The passages texts to be encoded. This can be a string or a list of strings if there are several passages.
+        padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`False`):
+            Activates and controls padding. Accepts the following values:
 
-            * `True` or `'longest'`: pad to the longest sequence in the batch (or no padding if only a single sequence if provided),
-            * `'max_length'`: pad to a max length specified in `max_length` or to the max acceptable input length for the model if no length is provided (`max_length=None`)
-            * `False` or `'do_not_pad'` (default): No padding (i.e. can output batch with sequences of uneven lengths)
-        truncation (:obj:`Union[bool, str]`, `optional`, defaults to :obj:`False`):
-            Activate and control truncation. Accepts the following values:
+            * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a
+              single sequence if provided).
+            * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
+              maximum acceptable input length for the model if that argument is not provided.
+            * :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
+              different lengths).
+        truncation (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.TruncationStrategy`, `optional`, defaults to :obj:`False`):
+            Activates and controls truncation. Accepts the following values:
 
-            * `True` or `'only_first'`: truncate to a max length specified in `max_length` or to the max acceptable input length for the model if no length is provided (`max_length=None`).
-            * `False` or `'do_not_truncate'` (default): No truncation (i.e. can output batch with sequences length greater than the model max admissible input size)
-        max_length (:obj:`Union[int, None]`, `optional`):
-            Control the length for padding/truncation. Accepts the following values
+            * :obj:`True` or :obj:`'longest_first'`: Truncate to a maximum length specified with the argument
+              :obj:`max_length` or to the maximum acceptable input length for the model if that argument is not
+              provided. This will truncate token by token, removing a token from the longest sequence in the pair
+              if a pair of sequences (or a batch of pairs) is provided.
+            * :obj:`'only_first'`: Truncate to a maximum length specified with the argument :obj:`max_length` or to
+              the maximum acceptable input length for the model if that argument is not provided. This will only
+              truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
+            * :obj:`'only_second'`: Truncate to a maximum length specified with the argument :obj:`max_length` or
+              to the maximum acceptable input length for the model if that argument is not provided. This will only
+              truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
+            * :obj:`False` or :obj:`'do_not_truncate'` (default): No truncation (i.e., can output batch with
+              sequence lengths greater than the model maximum admissible input size).
+        max_length (:obj:`int`, `optional`):
+                Controls the maximum length to use by one of the truncation/padding parameters.
 
-            * `None` (default): This will use the predefined model max length if required by one of the truncation/padding parameters. If the model has no specific max input length (e.g. XLNet) truncation/padding to max length is deactivated.
-            * `any integer value` (e.g. `42`): Use this specific maximum length value if required by one of the truncation/padding parameters.
-        return_tensors (:obj:`str`, `optional`):
-            Can be set to 'tf', 'pt' or 'np' to return respectively TensorFlow :obj:`tf.constant`,
-            PyTorch :obj:`torch.Tensor` or Numpy :obj: `np.ndarray` instead of a list of python integers.
-        return_attention_mask (:obj:`bool`, `optional`, defaults to :obj:`none`):
-            Whether to return the attention mask. If left to the default, will return the attention mask according
-            to the specific tokenizer's default, defined by the :obj:`return_outputs` attribute.
+                If left unset or set to :obj:`None`, this will use the predefined model maximum length if a maximum
+                length is required by one of the truncation/padding parameters. If the model has no specific maximum
+                input length (like XLNet) truncation/padding to a maximum length will be deactivated.
+        return_tensors (:obj:`str` or :class:`~transformers.tokenization_utils_base.TensorType`, `optional`):
+                If set, will return tensors instead of list of python integers. Acceptable values are:
+
+                * :obj:`'tf'`: Return TensorFlow :obj:`tf.constant` objects.
+                * :obj:`'pt'`: Return PyTorch :obj:`torch.Tensor` objects.
+                * :obj:`'np'`: Return Numpy :obj:`np.ndarray` objects.
+        return_attention_mask (:obj:`bool`, `optional`):
+            Whether or not to return the attention mask. If not set, will return the attention mask according to the
+            specific tokenizer's default, defined by the :obj:`return_outputs` attribute.
 
             `What are attention masks? <../glossary.html#attention-mask>`__
 
     Return:
-        A Dictionary of shape::
-
-            {
-                input_ids: list[list[int]],
-                attention_mask: list[int] if return_attention_mask is True (default)
-            }
-
-        With the fields:
-
-        - ``input_ids``: list of token ids to be fed to a model
-        - ``attention_mask``: list of indices specifying which tokens should be attended to by the model
+        :obj:`Dict[str, List[List[int]]]`: A dictionary with the following keys:
 
+        - ``input_ids``: List of token ids to be fed to a model.
+        - ``attention_mask``: List of indices specifying which tokens should be attended to by the model.
         """
 
 
@@ -369,16 +381,14 @@ class CustomDPRReaderTokenizerMixin:
 @add_end_docstrings(CUSTOM_DPR_READER_DOCSTRING)
 class DPRReaderTokenizer(CustomDPRReaderTokenizerMixin, BertTokenizer):
     r"""
-    Constructs a  DPRReaderTokenizer.
+    Construct a DPRReader tokenizer.
 
-    :class:`~transformers.DPRReaderTokenizer` is alsmost identical to :class:`~transformers.BertTokenizer` and runs end-to-end
-    tokenization: punctuation splitting + wordpiece.
-
-    What is different is that is has three inputs strings: question, titles and texts that are combined to feed into the DPRReader model.
+    :class:`~transformers.DPRReaderTokenizer` is almost identical to :class:`~transformers.BertTokenizer` and runs
+    end-to-end tokenization: punctuation splitting and wordpiece. The difference is that is has three inputs strings:
+    question, titles and texts that are combined to be fed to the :class:`~transformers.DPRReader` model.
 
     Refer to superclass :class:`~transformers.BertTokenizer` for usage examples and documentation concerning
     parameters.
-
     """
 
     vocab_files_names = VOCAB_FILES_NAMES
@@ -391,14 +401,13 @@ class DPRReaderTokenizer(CustomDPRReaderTokenizerMixin, BertTokenizer):
 @add_end_docstrings(CUSTOM_DPR_READER_DOCSTRING)
 class DPRReaderTokenizerFast(CustomDPRReaderTokenizerMixin, BertTokenizerFast):
     r"""
-    Constructs a  DPRReaderTokenizerFast.
+    Constructs a "fast" DPRReader tokenizer (backed by HuggingFace's `tokenizers` library).
 
-    :class:`~transformers.DPRReaderTokenizerFast` is almost identical to :class:`~transformers.BertTokenizerFast` and runs end-to-end
-    tokenization: punctuation splitting + wordpiece.
+    :class:`~transformers.DPRReaderTokenizerFast` is almost identical to :class:`~transformers.BertTokenizerFast` and
+    runs end-to-end tokenization: punctuation splitting and wordpiece. The difference is that is has three inputs
+    strings: question, titles and texts that are combined to be fed to the :class:`~transformers.DPRReader` model.
 
-    What is different is that is has three inputs strings: question, titles and texts that are combined to feed into the DPRReader model.
-
-    Refer to superclass :class:`~transformers.BertTokenizer` for usage examples and documentation concerning
+    Refer to superclass :class:`~transformers.BertTokenizerFast` for usage examples and documentation concerning
     parameters.
 
     """
diff --git a/src/transformers/tokenization_electra.py b/src/transformers/tokenization_electra.py
index 80fb6a53b7..1184a0914a 100644
--- a/src/transformers/tokenization_electra.py
+++ b/src/transformers/tokenization_electra.py
@@ -51,9 +51,10 @@ PRETRAINED_INIT_CONFIGURATION = {
 
 class ElectraTokenizer(BertTokenizer):
     r"""
-    Constructs an Electra tokenizer.
+    Construct an ELECTRA tokenizer.
+
     :class:`~transformers.ElectraTokenizer` is identical to :class:`~transformers.BertTokenizer` and runs end-to-end
-    tokenization: punctuation splitting + wordpiece.
+    tokenization: punctuation splitting and wordpiece.
 
     Refer to superclass :class:`~transformers.BertTokenizer` for usage examples and documentation concerning
     parameters.
@@ -67,10 +68,10 @@ class ElectraTokenizer(BertTokenizer):
 
 class ElectraTokenizerFast(BertTokenizerFast):
     r"""
-    Constructs a "Fast" Electra Fast tokenizer (backed by HuggingFace's `tokenizers` library).
+    Construct a "fast" ELECTRA tokenizer (backed by HuggingFace's `tokenizers` library).
 
-    :class:`~transformers.ElectraTokenizerFast` is identical to :class:`~transformers.BertTokenizerFast` and runs end-to-end
-    tokenization: punctuation splitting + wordpiece.
+    :class:`~transformers.ElectraTokenizerFast` is identical to :class:`~transformers.BertTokenizerFast` and runs
+    end-to-end tokenization: punctuation splitting and wordpiece.
 
     Refer to superclass :class:`~transformers.BertTokenizerFast` for usage examples and documentation concerning
     parameters.
diff --git a/src/transformers/tokenization_flaubert.py b/src/transformers/tokenization_flaubert.py
index 73bf202032..182ad3441f 100644
--- a/src/transformers/tokenization_flaubert.py
+++ b/src/transformers/tokenization_flaubert.py
@@ -78,13 +78,13 @@ def convert_to_unicode(text):
 
 class FlaubertTokenizer(XLMTokenizer):
     """
-    BPE tokenizer for Flaubert
+    Construct a Flaubert tokenizer. Based on Byte-Pair Encoding. The tokenization process is the following:
 
-    - Moses preprocessing & tokenization
-    - Normalize all inputs text
-    - argument ``special_tokens`` and function ``set_special_tokens``, can be used to add additional symbols \
-      (ex: "__classify__") to a vocabulary
-    - `do_lowercase` controle lower casing (automatically set for pretrained vocabularies)
+    - Moses preprocessing and tokenization.
+    - Normalizing all inputs text.
+    - The arguments ``special_tokens`` and the function ``set_special_tokens``, can be used to add additional symbols
+      (like "__classify__") to a vocabulary.
+    - The argument :obj:`do_lowercase` controls lower casing (automatically set for pretrained vocabularies).
 
     This tokenizer inherits from :class:`~transformers.XLMTokenizer`. Please check the superclass for usage examples
     and documentation regarding arguments.
diff --git a/src/transformers/tokenization_fsmt.py b/src/transformers/tokenization_fsmt.py
index 63f720c70a..e5e095ee8c 100644
--- a/src/transformers/tokenization_fsmt.py
+++ b/src/transformers/tokenization_fsmt.py
@@ -122,44 +122,43 @@ def remove_non_printing_char(text):
 
 class FSMTTokenizer(PreTrainedTokenizer):
     """
-    BPE tokenizer for FSMT (fairseq transformer)
-    See: https://github.com/pytorch/fairseq/tree/master/examples/wmt19
+    Construct an FAIRSEQ Transformer tokenizer. Based on Byte-Pair Encoding. The tokenization process is the following:
 
-    - Moses preprocessing & tokenization for most supported languages
-    - (optionally) lower case & normalize all inputs text
-    - argument ``special_tokens`` and function ``set_special_tokens``, can be used to add additional symbols \
-      (ex: "__classify__") to a vocabulary
-    - `langs` defines a pair of languages
+    - Moses preprocessing and tokenization.
+    - Normalizing all inputs text.
+    - The arguments ``special_tokens`` and the function ``set_special_tokens``, can be used to add additional symbols
+      (like "__classify__") to a vocabulary.
+    - The argument :obj:`langs` defines a pair of languages.
 
-    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the methods. Users
-    should refer to the superclass for more information regarding methods.
+    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the main methods.
+    Users should refer to this superclass for more information regarding those methods.
 
     Args:
         langs (:obj:`List[str]`):
-            a list of two languages to translate from and to, e.g. ``["en", "ru"]``.
-        src_vocab_file (:obj:`string`):
-            Source language vocabulary file.
-        tgt_vocab_file (:obj:`string`):
-            Target language vocabulary file.
-        merges_file (:obj:`string`):
-            Merges file.
+            A list of two languages to translate from and to, for instance :obj:`["en", "ru"]`.
+        src_vocab_file (:obj:`str`):
+            File containing the vocabulary for the source language.
+        tgt_vocab_file (:obj:`st`):
+            File containing the vocabulary for the target language.
+        merges_file (:obj:`str`):
+            File containing the merges.
         do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):
-            Whether to lowercase the input when tokenizing.
-        unk_token (:obj:`string`, `optional`, defaults to "<unk>"):
+            Whether or not to lowercase the input when tokenizing.
+        unk_token (:obj:`str`, `optional`, defaults to :obj:`"<unk>"`):
             The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
             token instead.
-        bos_token (:obj:`string`, `optional`, defaults to "<s>"):
-            The beginning of sequence token that was used during pre-training. Can be used a sequence classifier token.
+        bos_token (:obj:`str`, `optional`, defaults to :obj:`"<s>"`):
+            The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
 
             .. note::
 
                 When building a sequence using special tokens, this is not the token that is used for the beginning
                 of sequence. The token used is the :obj:`cls_token`.
-        sep_token (:obj:`string`, `optional`, defaults to "</s>"):
+        sep_token (:obj:`str`, `optional`, defaults to :obj:`"</s>"`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences
             for sequence classification or for a text and a question for question answering.
             It is also used as the last token of a sequence built with special tokens.
-        pad_token (:obj:`string`, `optional`, defaults to "<pad>"):
+        pad_token (:obj:`str`, `optional`, defaults to :obj:`"<pad>"`):
             The token used for padding, for example when batching sequences of different lengths.
 
     """
@@ -369,20 +368,19 @@ class FSMTTokenizer(PreTrainedTokenizer):
         """
         Build model inputs from a sequence or a pair of sequence for sequence classification tasks
         by concatenating and adding special tokens.
-        A FAIRSEQ_TRANSFORMER sequence has the following format:
+        A FAIRSEQ Transformer sequence has the following format:
 
         - single sequence: ``<s> X </s>``
         - pair of sequences: ``<s> A </s> B </s>``
 
         Args:
             token_ids_0 (:obj:`List[int]`):
-                List of IDs to which the special tokens will be added
+                List of IDs to which the special tokens will be added.
             token_ids_1 (:obj:`List[int]`, `optional`):
                 Optional second list of IDs for sequence pairs.
 
         Returns:
-            :obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
-
+            :obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
         """
         sep = [self.sep_token_id]
 
@@ -395,16 +393,16 @@ class FSMTTokenizer(PreTrainedTokenizer):
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
     ) -> List[int]:
         """
-        Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
-        special tokens using the tokenizer ``prepare_for_model`` methods.
+        Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
+        special tokens using the tokenizer ``prepare_for_model`` method.
 
         Args:
             token_ids_0 (:obj:`List[int]`):
-                List of ids.
+                List of IDs.
             token_ids_1 (:obj:`List[int]`, `optional`):
                 Optional second list of IDs for sequence pairs.
             already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
-                Set to True if the token list is already formatted with special tokens for the model
+                Whether or not the token list is already formatted with special tokens for the model.
 
         Returns:
             :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
@@ -431,25 +429,28 @@ class FSMTTokenizer(PreTrainedTokenizer):
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
     ) -> List[int]:
         """
-        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.
-        An FAIRSEQ_TRANSFORMER sequence pair mask has the following format:
+        Create a mask from the two sequences passed to be used in a sequence-pair classification task.
+        A FAIRSEQ Transformer sequence pair mask has the following format:
 
         ::
 
             0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
             | first sequence    | second sequence |
 
-        if token_ids_1 is None, only returns the first portion of the mask (0s).
+        If :obj:`token_ids_1` is :obj:`None`, this method only returns the first portion of the mask (0s).
 
         Args:
             token_ids_0 (:obj:`List[int]`):
-                List of ids.
+                List of IDs.
             token_ids_1 (:obj:`List[int]`, `optional`):
                 Optional second list of IDs for sequence pairs.
 
         Returns:
             :obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given
             sequence(s).
+
+        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.
+        An FAIRSEQ_TRANSFORMER sequence pair mask has the following format:
         """
         sep = [self.sep_token_id]
 
@@ -470,8 +471,6 @@ class FSMTTokenizer(PreTrainedTokenizer):
         padding="longest",
         **unused,
     ) -> BatchEncoding:
-        """Prepare model inputs for translation. For best performance, translate one sentence at a time."""
-
         if type(src_texts) is not list:
             raise ValueError("src_texts is expected to be a list")
         if "" in src_texts:
@@ -499,7 +498,7 @@ class FSMTTokenizer(PreTrainedTokenizer):
         Save the vocabulary and special tokens file to a directory.
 
         Args:
-            save_directory (:obj:`str`):
+            vocab_path (:obj:`str`):
                 The directory in which to save the vocabulary.
 
         Returns:
diff --git a/src/transformers/tokenization_funnel.py b/src/transformers/tokenization_funnel.py
index a0ca4e4011..96084933e3 100644
--- a/src/transformers/tokenization_funnel.py
+++ b/src/transformers/tokenization_funnel.py
@@ -57,10 +57,10 @@ PRETRAINED_INIT_CONFIGURATION = {f"funnel-transformer/{name}": {"do_lower_case":
 
 class FunnelTokenizer(BertTokenizer):
     r"""
-    Tokenizer for the Funnel Transformer models.
+    Construct a Funnel Transformer tokenizer.
 
     :class:`~transformers.FunnelTokenizer` is identical to :class:`~transformers.BertTokenizer` and runs end-to-end
-    tokenization: punctuation splitting + wordpiece.
+    tokenization: punctuation splitting and wordpiece.
 
     Refer to superclass :class:`~transformers.BertTokenizer` for usage examples and documentation concerning
     parameters.
@@ -110,19 +110,19 @@ class FunnelTokenizer(BertTokenizer):
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
     ) -> List[int]:
         """
-        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.
-        Funnel Transformer expects a sequence pair mask that has the following format:
+        Create a mask from the two sequences passed to be used in a sequence-pair classification task.
+        A Funnel Transformer sequence pair mask has the following format:
 
         ::
 
             2 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
             | first sequence    | second sequence |
 
-        if token_ids_1 is None, only returns the first portion of the mask (0's).
+        If :obj:`token_ids_1` is :obj:`None`, this method only returns the first portion of the mask (0s).
 
         Args:
             token_ids_0 (:obj:`List[int]`):
-                List of ids.
+                List of IDs.
             token_ids_1 (:obj:`List[int]`, `optional`):
                 Optional second list of IDs for sequence pairs.
 
@@ -139,10 +139,10 @@ class FunnelTokenizer(BertTokenizer):
 
 class FunnelTokenizerFast(BertTokenizerFast):
     r"""
-    "Fast" tokenizer for the Funnel Transformer models (backed by HuggingFace's :obj:`tokenizers` library).
+    Construct a "fast" Funnel Transformer tokenizer (backed by HuggingFace's `tokenizers` library).
 
     :class:`~transformers.FunnelTokenizerFast` is identical to :class:`~transformers.BertTokenizerFast` and runs
-    end-to-end tokenization: punctuation splitting + wordpiece.
+    end-to-end tokenization: punctuation splitting and wordpiece.
 
     Refer to superclass :class:`~transformers.BertTokenizerFast` for usage examples and documentation concerning
     parameters.
@@ -192,19 +192,19 @@ class FunnelTokenizerFast(BertTokenizerFast):
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
     ) -> List[int]:
         """
-        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.
-        Funnel Transformer expects a sequence pair mask that has the following format:
+        Create a mask from the two sequences passed to be used in a sequence-pair classification task.
+        A Funnel Transformer sequence pair mask has the following format:
 
         ::
 
             2 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
             | first sequence    | second sequence |
 
-        if token_ids_1 is None, only returns the first portion of the mask (0's).
+        If :obj:`token_ids_1` is :obj:`None`, this method only returns the first portion of the mask (0s).
 
         Args:
             token_ids_0 (:obj:`List[int]`):
-                List of ids.
+                List of IDs.
             token_ids_1 (:obj:`List[int]`, `optional`):
                 Optional second list of IDs for sequence pairs.
 
diff --git a/src/transformers/tokenization_gpt2.py b/src/transformers/tokenization_gpt2.py
index dbe9899dc5..ee586050a9 100644
--- a/src/transformers/tokenization_gpt2.py
+++ b/src/transformers/tokenization_gpt2.py
@@ -103,7 +103,7 @@ def get_pairs(word):
 
 class GPT2Tokenizer(PreTrainedTokenizer):
     """
-    GPT-2 BPE tokenizer, using byte-level Byte-Pair-Encoding.
+    Construct a GPT-2 tokenizer. Based on byte-level Byte-Pair-Encoding.
 
     This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will
     be encoded differently whether it is at the beginning of the sentence (without space) or not:
@@ -124,24 +124,27 @@ class GPT2Tokenizer(PreTrainedTokenizer):
 
         When used with ``is_split_into_words=True``, this tokenizer will add a space before each word (even the first one).
 
-    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the methods. Users
-    should refer to the superclass for more information regarding methods.
+    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the main methods.
+    Users should refer to this superclass for more information regarding those methods.
 
     Args:
         vocab_file (:obj:`str`):
             Path to the vocabulary file.
         merges_file (:obj:`str`):
             Path to the merges file.
-        errors (:obj:`str`, `optional`, defaults to "replace"):
+        errors (:obj:`str`, `optional`, defaults to :obj:`"replace"`):
             Paradigm to follow when decoding bytes to UTF-8. See `bytes.decode
             <https://docs.python.org/3/library/stdtypes.html#bytes.decode>`__ for more information.
-        unk_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):
+        unk_token (:obj:`str`, `optional`, defaults to :obj:`<|endoftext|>`):
             The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
             token instead.
-        bos_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):
+        bos_token (:obj:`str`, `optional`, defaults to :obj:`<|endoftext|>`):
             The beginning of sequence token.
-        eos_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):
+        eos_token (:obj:`str`, `optional`, defaults to :obj:`<|endoftext|>`):
             The end of sequence token.
+        add_prefix_space (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether or not to add an initial space to the input. This allows to treat the leading word just as any
+            other word. (GPT2 tokenizer detect beginning of words by the preceding space).
     """
 
     vocab_files_names = VOCAB_FILES_NAMES
@@ -305,7 +308,7 @@ class GPT2Tokenizer(PreTrainedTokenizer):
 
 class GPT2TokenizerFast(PreTrainedTokenizerFast):
     """
-    Constructs a "Fast" GPT-2 BPE tokenizer (backed by HuggingFace's `tokenizers` library), using byte-level
+    Construct a "fast" GPT-2 tokenizer (backed by HuggingFace's `tokenizers` library). Based on byte-level
     Byte-Pair-Encoding.
 
     This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will
@@ -328,30 +331,29 @@ class GPT2TokenizerFast(PreTrainedTokenizerFast):
         When used with ``is_split_into_words=True``, this tokenizer needs to be instantiated with
         ``add_prefix_space=True``.
 
-    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the methods. Users
-    should refer to the superclass for more information regarding methods.
+    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizerFast` which contains most of the main
+    methods. Users should refer to this superclass for more information regarding those methods.
 
     Args:
         vocab_file (:obj:`str`):
             Path to the vocabulary file.
         merges_file (:obj:`str`):
             Path to the merges file.
-        errors (:obj:`str`, `optional`, defaults to "replace"):
+        errors (:obj:`str`, `optional`, defaults to :obj:`"replace"`):
             Paradigm to follow when decoding bytes to UTF-8. See `bytes.decode
             <https://docs.python.org/3/library/stdtypes.html#bytes.decode>`__ for more information.
-        unk_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):
+        unk_token (:obj:`str`, `optional`, defaults to :obj:`<|endoftext|>`):
             The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
             token instead.
-        bos_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):
+        bos_token (:obj:`str`, `optional`, defaults to :obj:`<|endoftext|>`):
             The beginning of sequence token.
-        eos_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):
+        eos_token (:obj:`str`, `optional`, defaults to :obj:`<|endoftext|>`):
             The end of sequence token.
-        add_prefix_space (:obj:`bool`, `optional`, defaults to `False`):
-            Whether to add a leading space to the first word.
-            This allows to treat the leading word just as any other word.
-            (GPT2 tokenizer detect beginning of words by the preceeding space)
-        trim_offsets (:obj:`bool`, `optional`, defaults to `True`):
-            Whether the post processing step should trim offsets to avoid including whitespaces.
+        add_prefix_space (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether or not to add an initial space to the input. This allows to treat the leading word just as any
+            other word. (GPT2 tokenizer detect beginning of words by the preceding space).
+        trim_offsets (:obj:`bool`, `optional`, defaults to :obj:`True`):
+            Whether or not the post-processing step should trim offsets to avoid including whitespaces.
     """
 
     vocab_files_names = VOCAB_FILES_NAMES
diff --git a/src/transformers/tokenization_longformer.py b/src/transformers/tokenization_longformer.py
index 4c7fff5b6f..4d2ee634f6 100644
--- a/src/transformers/tokenization_longformer.py
+++ b/src/transformers/tokenization_longformer.py
@@ -42,6 +42,12 @@ PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
 
 
 class LongformerTokenizer(RobertaTokenizer):
+    r"""
+    Construct a Longformer tokenizer.
+
+    :class:`~transformers.LongformerTokenizer` is identical to :class:`~transformers.RobertaTokenizer`. Refer to
+    the superclass for usage examples and documentation concerning parameters.
+    """
     # merges and vocab same as Roberta
     max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
     pretrained_vocab_files_map = {
@@ -51,6 +57,12 @@ class LongformerTokenizer(RobertaTokenizer):
 
 
 class LongformerTokenizerFast(RobertaTokenizerFast):
+    r"""
+    Construct a "fast" Longformer tokenizer (backed by HuggingFace's `tokenizers` library).
+
+    :class:`~transformers.LongformerTokenizerFast` is identical to :class:`~transformers.RobertaTokenizerFast`. Refer
+    to the superclass for usage examples and documentation concerning parameters.
+    """
     # merges and vocab same as Roberta
     max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
     pretrained_vocab_files_map = {
diff --git a/src/transformers/tokenization_lxmert.py b/src/transformers/tokenization_lxmert.py
index 316d43f95b..f85c9d124e 100644
--- a/src/transformers/tokenization_lxmert.py
+++ b/src/transformers/tokenization_lxmert.py
@@ -50,9 +50,10 @@ PRETRAINED_INIT_CONFIGURATION = {
 
 class LxmertTokenizer(BertTokenizer):
     r"""
-    Constructs an Lxmert tokenizer.
+    Construct an LXMERT tokenizer.
+
     :class:`~transformers.LxmertTokenizer` is identical to :class:`~transformers.BertTokenizer` and runs end-to-end
-    tokenization: punctuation splitting + wordpiece.
+    tokenization: punctuation splitting and wordpiece.
 
     Refer to superclass :class:`~transformers.BertTokenizer` for usage examples and documentation concerning
     parameters.
@@ -66,10 +67,10 @@ class LxmertTokenizer(BertTokenizer):
 
 class LxmertTokenizerFast(BertTokenizerFast):
     r"""
-    Constructs a "Fast" Lxmert Fast tokenizer (backed by HuggingFace's `tokenizers` library).
+    Construct a "fast" LXMERT tokenizer (backed by HuggingFace's `tokenizers` library).
 
-    :class:`~transformers.LxmertTokenizerFast` is identical to :class:`~transformers.BertTokenizerFast` and runs end-to-end
-    tokenization: punctuation splitting + wordpiece.
+    :class:`~transformers.LxmertTokenizerFast` is identical to :class:`~transformers.BertTokenizerFast` and runs
+    end-to-end tokenization: punctuation splitting and wordpiece.
 
     Refer to superclass :class:`~transformers.BertTokenizerFast` for usage examples and documentation concerning
     parameters.
diff --git a/src/transformers/tokenization_marian.py b/src/transformers/tokenization_marian.py
index 253c091528..784b26a83f 100644
--- a/src/transformers/tokenization_marian.py
+++ b/src/transformers/tokenization_marian.py
@@ -137,7 +137,6 @@ class MarianTokenizer(PreTrainedTokenizer):
         padding="longest",
         **unused,
     ) -> BatchEncoding:
-        """Prepare model inputs for translation. For best performance, translate one sentence at a time."""
         if "" in src_texts:
             raise ValueError(f"found empty string in src_texts: {src_texts}")
         self.current_spm = self.spm_source
diff --git a/src/transformers/tokenization_mbart.py b/src/transformers/tokenization_mbart.py
index 22a16f4a48..bb729aa324 100644
--- a/src/transformers/tokenization_mbart.py
+++ b/src/transformers/tokenization_mbart.py
@@ -171,67 +171,6 @@ class MBartTokenizer(XLMRobertaTokenizer):
         add_prefix_space: bool = False,  # ignored
         **kwargs,
     ) -> BatchEncoding:
-        """Prepare a batch that can be passed directly to an instance of MBartModel.
-
-        Arguments:
-            src_texts: (:obj:`list`):
-                list of documents to summarize or source language texts
-            src_lang: (:obj:`str`, `optional`, default='en_XX'):
-                default en_XX (english), the language we are translating from
-            tgt_texts: (:obj:`list`, `optional`):
-                list of tgt language texts or summaries.
-            tgt_lang: (:obj:`str`, `optional`, default='ro_RO'):
-                default ro_RO (romanian), the language we are translating to
-            max_length (:obj:`int`, `optional`):
-                Controls the maximum length for encoder inputs (documents to summarize or source language texts)
-                If left unset or set to :obj:`None`, this will use the predefined model maximum length if a maximum
-                length is required by one of the truncation/padding parameters. If the model has no specific maximum
-                input length (like XLNet) truncation/padding to a maximum length will be deactivated.
-            max_target_length (:obj:`int`, `optional`):
-                Controls the maximum length of decoder inputs (target language texts or summaries)
-                If left unset or set to :obj:`None`, this will use the max_length value.
-            padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`False`):
-                Activates and controls padding. Accepts the following values:
-
-                * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a
-                  single sequence if provided).
-                * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
-                  maximum acceptable input length for the model if that argument is not provided.
-                * :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
-                  different lengths).
-            return_tensors (:obj:`str` or :class:`~transformers.tokenization_utils_base.TensorType`, `optional`, defaults to "pt"):
-                If set, will return tensors instead of list of python integers. Acceptable values are:
-
-                * :obj:`'tf'`: Return TensorFlow :obj:`tf.constant` objects.
-                * :obj:`'pt'`: Return PyTorch :obj:`torch.Tensor` objects.
-                * :obj:`'np'`: Return Numpy :obj:`np.ndarray` objects.
-            truncation (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.TruncationStrategy`, `optional`, defaults to :obj:`True`):
-                Activates and controls truncation. Accepts the following values:
-
-                * :obj:`True` or :obj:`'longest_first'`: Truncate to a maximum length specified with the argument
-                  :obj:`max_length` or to the maximum acceptable input length for the model if that argument is not
-                  provided. This will truncate token by token, removing a token from the longest sequence in the pair
-                  if a pair of sequences (or a batch of pairs) is provided.
-                * :obj:`'only_first'`: Truncate to a maximum length specified with the argument :obj:`max_length` or to
-                  the maximum acceptable input length for the model if that argument is not provided. This will only
-                  truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
-                * :obj:`'only_second'`: Truncate to a maximum length specified with the argument :obj:`max_length` or
-                  to the maximum acceptable input length for the model if that argument is not provided. This will only
-                  truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
-                * :obj:`False` or :obj:`'do_not_truncate'` (default): No truncation (i.e., can output batch with
-                  sequence lengths greater than the model maximum admissible input size).
-
-        Return:
-            :class:`~transformers.BatchEncoding`: A :class:`~transformers.BatchEncoding` with the following fields:
-
-            - **input_ids** -- List of token ids to be fed to the encoder.
-            - **attention_mask** -- List of indices specifying which tokens should be attended to by the model.
-            - **labels** -- List of token ids for tgt_texts
-
-            The full set of keys ``[input_ids, attention_mask, decoder_input_ids,  labels]``,
-            will only be returned if tgt_texts is passed. Otherwise, input_ids, attention_mask will be the only keys.
-
-        """
         if max_length is None:
             max_length = self.max_len
         self.set_src_lang_special_tokens(src_lang)
diff --git a/src/transformers/tokenization_mobilebert.py b/src/transformers/tokenization_mobilebert.py
index 4bcd961518..72c0c1ec7f 100644
--- a/src/transformers/tokenization_mobilebert.py
+++ b/src/transformers/tokenization_mobilebert.py
@@ -35,10 +35,10 @@ PRETRAINED_INIT_CONFIGURATION = {}
 
 class MobileBertTokenizer(BertTokenizer):
     r"""
-    Constructs a  MobileBertTokenizer.
+    Construct a MobileBERT tokenizer.
 
     :class:`~transformers.MobileBertTokenizer is identical to :class:`~transformers.BertTokenizer` and runs end-to-end
-    tokenization: punctuation splitting + wordpiece.
+    tokenization: punctuation splitting and wordpiece.
 
     Refer to superclass :class:`~transformers.BertTokenizer` for usage examples and documentation concerning
     parameters.
@@ -52,10 +52,10 @@ class MobileBertTokenizer(BertTokenizer):
 
 class MobileBertTokenizerFast(BertTokenizerFast):
     r"""
-    Constructs a  "Fast" MobileBertTokenizer (backed by HuggingFace's `tokenizers` library).
+    Construct a "fast" MobileBERT tokenizer (backed by HuggingFace's `tokenizers` library).
 
-    :class:`~transformers.MobileBertTokenizerFast` is identical to :class:`~transformers.BertTokenizerFast` and runs end-to-end
-    tokenization: punctuation splitting + wordpiece.
+    :class:`~transformers.MobileBertTokenizerFast` is identical to :class:`~transformers.BertTokenizerFast` and runs
+    end-to-end tokenization: punctuation splitting and wordpiece.
 
     Refer to superclass :class:`~transformers.BertTokenizerFast` for usage examples and documentation concerning
     parameters.
diff --git a/src/transformers/tokenization_openai.py b/src/transformers/tokenization_openai.py
index 5ed70c2540..7106030d62 100644
--- a/src/transformers/tokenization_openai.py
+++ b/src/transformers/tokenization_openai.py
@@ -75,20 +75,21 @@ def text_standardize(text):
 
 class OpenAIGPTTokenizer(PreTrainedTokenizer):
     """
-    BPE tokenizer. Peculiarities:
+    Construct a GPT Tokenizer. Based on Byte-Pair-Encoding with the following peculiarities:
 
-    - lower case all inputs
-    - uses SpaCy tokenizer and ftfy for pre-BPE tokenization if they are installed, fallback to BERT's BasicTokenizer if not.
+    - lowercases all inputs,
+    - uses :obj:`SpaCy` tokenizer and :obj:`ftfy` for pre-BPE tokenization if they are installed, fallback to BERT's
+      :obj:`BasicTokenizer` if not.
 
-    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the methods. Users
-    should refer to the superclass for more information regarding methods.
+    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the main
+    methods. Users should refer to this superclass for more information regarding those methods.
 
     Args:
         vocab_file (:obj:`str`):
             Path to the vocabulary file.
         merges_file (:obj:`str`):
             Path to the merges file.
-        unk_token (:obj:`string`, `optional`, defaults to "<unk>"):
+        unk_token (:obj:`str`, `optional`, defaults to :obj:`"<unk>"`):
             The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
             token instead.
     """
@@ -206,7 +207,7 @@ class OpenAIGPTTokenizer(PreTrainedTokenizer):
         Save the vocabulary and special tokens file to a directory.
 
         Args:
-            save_directory (:obj:`str`):
+            vocab_path (:obj:`str`):
                 The directory in which to save the vocabulary.
 
         Returns:
@@ -239,22 +240,22 @@ class OpenAIGPTTokenizer(PreTrainedTokenizer):
 
 class OpenAIGPTTokenizerFast(PreTrainedTokenizerFast):
     """
-    Construct a "Fast" BPE tokenizer for OpenAI GPT (backed by HuggingFace's `tokenizers` library).
+    Construct a "fast" GPT Tokenizer (backed by HuggingFace's `tokenizers` library). Based on Byte-Pair-Encoding with
+    the following peculiarities:
 
-    Peculiarities:
+    - lowercases all inputs,
+    - uses :obj:`SpaCy` tokenizer and :obj:`ftfy` for pre-BPE tokenization if they are installed, fallback to BERT's
+      :obj:`BasicTokenizer` if not.
 
-    - lower case all inputs
-    - uses SpaCy tokenizer and ftfy for pre-BPE tokenization if they are installed, fallback to BERT's BasicTokenizer if not.
-
-    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the methods. Users
-    should refer to the superclass for more information regarding methods.
+    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizerFast` which contains most of the main
+    methods. Users should refer to this superclass for more information regarding those methods.
 
     Args:
         vocab_file (:obj:`str`):
             Path to the vocabulary file.
         merges_file (:obj:`str`):
             Path to the merges file.
-        unk_token (:obj:`string`, `optional`, defaults to "<unk>"):
+        unk_token (:obj:`str`, `optional`, defaults to :obj:`"<unk>"`):
             The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
             token instead.
     """
diff --git a/src/transformers/tokenization_phobert.py b/src/transformers/tokenization_phobert.py
index 4b41f3532d..cb7326ca32 100644
--- a/src/transformers/tokenization_phobert.py
+++ b/src/transformers/tokenization_phobert.py
@@ -66,46 +66,44 @@ def get_pairs(word):
 
 class PhobertTokenizer(PreTrainedTokenizer):
     """
-    Constructs a PhoBERT tokenizer. Peculiarities:
+    Construct a PhoBERT tokenizer. Based on Byte-Pair-Encoding.
 
-    - Byte-Pair-Encoding
-
-    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the methods. Users
-    should refer to the superclass for more information regarding methods.
+    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the main
+    methods. Users should refer to this superclass for more information regarding those methods.
 
     Args:
         vocab_file (:obj:`str`):
             Path to the vocabulary file.
         merges_file (:obj:`str`):
             Path to the merges file.
-        bos_token (:obj:`string`, `optional`, defaults to "<s>"):
-            The beginning of sequence token that was used during pre-training. Can be used a sequence classifier token.
+        bos_token (:obj:`st`, `optional`, defaults to :obj:`"<s>"`):
+            The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
 
             .. note::
 
                 When building a sequence using special tokens, this is not the token that is used for the beginning
                 of sequence. The token used is the :obj:`cls_token`.
-        eos_token (:obj:`string`, `optional`, defaults to "</s>"):
+        eos_token (:obj:`str`, `optional`, defaults to :obj:`"</s>"`):
             The end of sequence token.
 
             .. note::
 
                 When building a sequence using special tokens, this is not the token that is used for the end
                 of sequence. The token used is the :obj:`sep_token`.
-        sep_token (:obj:`string`, `optional`, defaults to "</s>"):
+        sep_token (:obj:`str`, `optional`, defaults to :obj:`"</s>"`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences
             for sequence classification or for a text and a question for question answering.
             It is also used as the last token of a sequence built with special tokens.
-        cls_token (:obj:`string`, `optional`, defaults to "<s>"):
+        cls_token (:obj:`str`, `optional`, defaults to :obj:`"<s>"`):
             The classifier token which is used when doing sequence classification (classification of the whole
             sequence instead of per-token classification). It is the first token of the sequence when built with
             special tokens.
-        unk_token (:obj:`string`, `optional`, defaults to "<unk>"):
+        unk_token (:obj:`str`, `optional`, defaults to :obj:`"<unk>"`):
             The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
             token instead.
-        pad_token (:obj:`string`, `optional`, defaults to "<pad>"):
+        pad_token (:obj:`str`, `optional`, defaults to :obj:`"<pad>"`):
             The token used for padding, for example when batching sequences of different lengths.
-        mask_token (:obj:`string`, `optional`, defaults to "<mask>"):
+        mask_token (:obj:`str`, `optional`, defaults to :obj:`"<mask>"`):
             The token used for masking values. This is the token used when training this model with masked language
             modeling. This is the token which the model will try to predict.
     """
@@ -171,12 +169,12 @@ class PhobertTokenizer(PreTrainedTokenizer):
 
         Args:
             token_ids_0 (:obj:`List[int]`):
-                List of IDs to which the special tokens will be added
-            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):
+                List of IDs to which the special tokens will be added.
+            token_ids_1 (:obj:`List[int]`, `optional`):
                 Optional second list of IDs for sequence pairs.
 
         Returns:
-            :obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
+            :obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
         """
 
         if token_ids_1 is None:
@@ -189,16 +187,16 @@ class PhobertTokenizer(PreTrainedTokenizer):
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
     ) -> List[int]:
         """
-        Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
-        special tokens using the tokenizer ``prepare_for_model`` methods.
+        Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
+        special tokens using the tokenizer ``prepare_for_model`` method.
 
         Args:
             token_ids_0 (:obj:`List[int]`):
-                List of ids.
-            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):
+                List of IDs.
+            token_ids_1 (:obj:`List[int]`, `optional`):
                 Optional second list of IDs for sequence pairs.
             already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
-                Set to True if the token list is already formatted with special tokens for the model
+                Whether or not the token list is already formatted with special tokens for the model.
 
         Returns:
             :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
@@ -220,18 +218,17 @@ class PhobertTokenizer(PreTrainedTokenizer):
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
     ) -> List[int]:
         """
-        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.
+        Create a mask from the two sequences passed to be used in a sequence-pair classification task.
         PhoBERT does not make use of token type ids, therefore a list of zeros is returned.
 
         Args:
             token_ids_0 (:obj:`List[int]`):
-                List of ids.
-            token_ids_1 (:obj:`List[int]`, `optional`, defaults to :obj:`None`):
+                List of IDs.
+            token_ids_1 (:obj:`List[int]`, `optional`):
                 Optional second list of IDs for sequence pairs.
 
         Returns:
             :obj:`List[int]`: List of zeros.
-
         """
 
         sep = [self.sep_token_id]
@@ -318,9 +315,11 @@ class PhobertTokenizer(PreTrainedTokenizer):
     def save_vocabulary(self, save_directory):
         """
         Save the vocabulary and special tokens file to a directory.
+
         Args:
             save_directory (:obj:`str`):
                 The directory in which to save the vocabulary.
+
         Returns:
             :obj:`Tuple(str)`: Paths to the files saved.
         """
diff --git a/src/transformers/tokenization_reformer.py b/src/transformers/tokenization_reformer.py
index 0920f77ee8..e416d30921 100644
--- a/src/transformers/tokenization_reformer.py
+++ b/src/transformers/tokenization_reformer.py
@@ -53,26 +53,26 @@ PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
 
 class ReformerTokenizer(PreTrainedTokenizer):
     """
-    Constructs an Reformer tokenizer. Based on `SentencePiece <https://github.com/google/sentencepiece>`__ .
+    Construct a Reformer tokenizer. Based on `SentencePiece <https://github.com/google/sentencepiece>`__ .
 
-    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the methods. Users
-    should refer to the superclass for more information regarding methods.
+    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the main methods.
+    Users should refer to this superclass for more information regarding those methods.
 
     Args:
-        vocab_file (:obj:`string`):
+        vocab_file (:obj:`str`):
             `SentencePiece <https://github.com/google/sentencepiece>`__ file (generally has a `.spm` extension) that
             contains the vocabulary necessary to instantiate a tokenizer.
-        eos_token (:obj:`string`, `optional`, defaults to "</s>"):
+        eos_token (:obj:`str`, `optional`, defaults to :obj:`"</s>"`):
             The end of sequence token.
 
             .. note::
 
                 When building a sequence using special tokens, this is not the token that is used for the end
                 of sequence. The token used is the :obj:`sep_token`.
-        unk_token (:obj:`string`, `optional`, defaults to "<unk>"):
+        unk_token (:obj:`str`, `optional`, defaults to :obj:`"<unk>"`):
             The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
             token instead.
-        pad_token (:obj:`string`, `optional`, defaults to "<pad>"):
+        pad_token (:obj:`str`, `optional`, defaults to :obj:`"<pad>"`):
             The token used for padding, for example when batching sequences of different lengths.
         additional_special_tokens (:obj:`List[str]`, `optional`):
             Additional special tokens used by the tokenizer.
@@ -165,8 +165,15 @@ class ReformerTokenizer(PreTrainedTokenizer):
         return out_string
 
     def save_vocabulary(self, save_directory):
-        """Save the sentencepiece vocabulary (copy original file) and special tokens file
-        to a directory.
+        """
+        Save the sentencepiece vocabulary (copy original file) and special tokens file to a directory.
+
+        Args:
+            save_directory (:obj:`str`):
+                The directory in which to save the vocabulary.
+
+        Returns:
+            :obj:`Tuple(str)`: Paths to the files saved.
         """
         if not os.path.isdir(save_directory):
             logger.error("Vocabulary path ({}) should be a directory".format(save_directory))
diff --git a/src/transformers/tokenization_retribert.py b/src/transformers/tokenization_retribert.py
index ad6ace2f9e..15bdad3a25 100644
--- a/src/transformers/tokenization_retribert.py
+++ b/src/transformers/tokenization_retribert.py
@@ -40,10 +40,10 @@ PRETRAINED_INIT_CONFIGURATION = {
 
 class RetriBertTokenizer(BertTokenizer):
     r"""
-    Constructs a  retribert.
+    Constructs a RetriBERT tokenizer.
 
-    :class:`~transformers.retribert is identical to :class:`~transformers.BertTokenizer` and runs end-to-end
-    tokenization: punctuation splitting + wordpiece.
+    :class:`~transformers.RetroBertTokenizer` is identical to :class:`~transformers.BertTokenizer` and runs end-to-end
+    tokenization: punctuation splitting and wordpiece.
 
     Refer to superclass :class:`~transformers.BertTokenizer` for usage examples and documentation concerning
     parameters.
@@ -58,10 +58,10 @@ class RetriBertTokenizer(BertTokenizer):
 
 class RetriBertTokenizerFast(BertTokenizerFast):
     r"""
-    Constructs a  "Fast" RetriBertTokenizerFast (backed by HuggingFace's `tokenizers` library).
+    Construct a "fast" RetriBERT tokenizer (backed by HuggingFace's `tokenizers` library).
 
-    :class:`~transformers.RetriBertTokenizerFast` is identical to :class:`~transformers.BertTokenizerFast` and runs end-to-end
-    tokenization: punctuation splitting + wordpiece.
+    :class:`~transformers.RetriBertTokenizerFast` is identical to :class:`~transformers.BertTokenizerFast` and runs
+    end-to-end tokenization: punctuation splitting and wordpiece.
 
     Refer to superclass :class:`~transformers.BertTokenizerFast` for usage examples and documentation concerning
     parameters.
diff --git a/src/transformers/tokenization_roberta.py b/src/transformers/tokenization_roberta.py
index 2601e21536..3aa312fd9b 100644
--- a/src/transformers/tokenization_roberta.py
+++ b/src/transformers/tokenization_roberta.py
@@ -62,7 +62,7 @@ PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
 
 class RobertaTokenizer(GPT2Tokenizer):
     """
-    Constructs a RoBERTa BPE tokenizer, derived from the GPT-2 tokenizer, using byte-level Byte-Pair-Encoding.
+    Constructs a RoBERTa tokenizer, derived from the GPT-2 tokenizer, using byte-level Byte-Pair-Encoding.
 
     This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will
     be encoded differently whether it is at the beginning of the sentence (without space) or not:
@@ -83,47 +83,50 @@ class RobertaTokenizer(GPT2Tokenizer):
 
         When used with ``is_split_into_words=True``, this tokenizer will add a space before each word (even the first one).
 
-    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the methods. Users
-    should refer to the superclass for more information regarding methods.
+    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizerFast` which contains most of the main
+    methods. Users should refer to this superclass for more information regarding those methods.
 
     Args:
         vocab_file (:obj:`str`):
             Path to the vocabulary file.
         merges_file (:obj:`str`):
             Path to the merges file.
-        errors (:obj:`str`, `optional`, defaults to "replace"):
+        errors (:obj:`str`, `optional`, defaults to :obj:`"replace"`):
             Paradigm to follow when decoding bytes to UTF-8. See `bytes.decode
             <https://docs.python.org/3/library/stdtypes.html#bytes.decode>`__ for more information.
-        bos_token (:obj:`string`, `optional`, defaults to "<s>"):
-            The beginning of sequence token that was used during pre-training. Can be used a sequence classifier token.
+        bos_token (:obj:`str`, `optional`, defaults to :obj:`"<s>"`):
+            The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
 
             .. note::
 
                 When building a sequence using special tokens, this is not the token that is used for the beginning
                 of sequence. The token used is the :obj:`cls_token`.
-        eos_token (:obj:`string`, `optional`, defaults to "</s>"):
+        eos_token (:obj:`str`, `optional`, defaults to :obj:`"</s>"`):
             The end of sequence token.
 
             .. note::
 
                 When building a sequence using special tokens, this is not the token that is used for the end
                 of sequence. The token used is the :obj:`sep_token`.
-        sep_token (:obj:`string`, `optional`, defaults to "</s>"):
+        sep_token (:obj:`str`, `optional`, defaults to :obj:`"</s>"`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences
             for sequence classification or for a text and a question for question answering.
             It is also used as the last token of a sequence built with special tokens.
-        cls_token (:obj:`string`, `optional`, defaults to "<s>"):
+        cls_token (:obj:`str`, `optional`, defaults to :obj:`"<s>"`):
             The classifier token which is used when doing sequence classification (classification of the whole
             sequence instead of per-token classification). It is the first token of the sequence when built with
             special tokens.
-        unk_token (:obj:`string`, `optional`, defaults to "<unk>"):
+        unk_token (:obj:`str`, `optional`, defaults to :obj:`"<unk>"`):
             The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
             token instead.
-        pad_token (:obj:`string`, `optional`, defaults to "<pad>"):
+        pad_token (:obj:`str`, `optional`, defaults to :obj:`"<pad>"`):
             The token used for padding, for example when batching sequences of different lengths.
-        mask_token (:obj:`string`, `optional`, defaults to "<mask>"):
+        mask_token (:obj:`str`, `optional`, defaults to :obj:`"<mask>"`):
             The token used for masking values. This is the token used when training this model with masked language
             modeling. This is the token which the model will try to predict.
+        add_prefix_space (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether or not to add an initial space to the input. This allows to treat the leading word just as any
+            other word. (RoBERTa tokenizer detect beginning of words by the preceding space).
     """
 
     vocab_files_names = VOCAB_FILES_NAMES
@@ -184,12 +187,12 @@ class RobertaTokenizer(GPT2Tokenizer):
 
         Args:
             token_ids_0 (:obj:`List[int]`):
-                List of IDs to which the special tokens will be added
+                List of IDs to which the special tokens will be added.
             token_ids_1 (:obj:`List[int]`, `optional`):
                 Optional second list of IDs for sequence pairs.
 
         Returns:
-            :obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
+            :obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
         """
         if token_ids_1 is None:
             return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
@@ -201,16 +204,16 @@ class RobertaTokenizer(GPT2Tokenizer):
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
     ) -> List[int]:
         """
-        Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
+        Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
         special tokens using the tokenizer ``prepare_for_model`` method.
 
         Args:
             token_ids_0 (:obj:`List[int]`):
-                List of ids.
+                List of IDs.
             token_ids_1 (:obj:`List[int]`, `optional`):
                 Optional second list of IDs for sequence pairs.
             already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
-                Set to True if the token list is already formatted with special tokens for the model
+                Whether or not the token list is already formatted with special tokens for the model.
 
         Returns:
             :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
@@ -231,18 +234,17 @@ class RobertaTokenizer(GPT2Tokenizer):
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
     ) -> List[int]:
         """
-        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.
+        Create a mask from the two sequences passed to be used in a sequence-pair classification task.
         RoBERTa does not make use of token type ids, therefore a list of zeros is returned.
 
         Args:
             token_ids_0 (:obj:`List[int]`):
-                List of ids.
+                List of IDs.
             token_ids_1 (:obj:`List[int]`, `optional`):
                 Optional second list of IDs for sequence pairs.
 
         Returns:
-            :obj:`List[int]`: List of zeros.
-
+            :obj:`List[int]`:  List of zeros.
         """
         sep = [self.sep_token_id]
         cls = [self.cls_token_id]
@@ -267,7 +269,7 @@ class RobertaTokenizer(GPT2Tokenizer):
 
 class RobertaTokenizerFast(GPT2TokenizerFast):
     """
-    Constructs a "Fast" RoBERTa BPE tokenizer (backed by HuggingFace's `tokenizers` library), derived from the GPT-2
+    Construct a "fast" RoBERTa tokenizer (backed by HuggingFace's `tokenizers` library), derived from the GPT-2
     tokenizer, using byte-level Byte-Pair-Encoding.
 
     This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will
@@ -290,29 +292,51 @@ class RobertaTokenizerFast(GPT2TokenizerFast):
         When used with ``is_split_into_words=True``, this tokenizer needs to be instantiated with
         ``add_prefix_space=True``.
 
-    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizerFast` which contains most of the methods. Users
-    should refer to the superclass for more information regarding methods.
+    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizerFast` which contains most of the main
+    methods. Users should refer to this superclass for more information regarding those methods.
 
     Args:
         vocab_file (:obj:`str`):
             Path to the vocabulary file.
         merges_file (:obj:`str`):
             Path to the merges file.
-        errors (:obj:`str`, `optional`, defaults to "replace"):
+        errors (:obj:`str`, `optional`, defaults to :obj:`"replace"`):
             Paradigm to follow when decoding bytes to UTF-8. See `bytes.decode
             <https://docs.python.org/3/library/stdtypes.html#bytes.decode>`__ for more information.
-        unk_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):
+        bos_token (:obj:`str`, `optional`, defaults to :obj:`"<s>"`):
+            The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
+
+            .. note::
+
+                When building a sequence using special tokens, this is not the token that is used for the beginning
+                of sequence. The token used is the :obj:`cls_token`.
+        eos_token (:obj:`str`, `optional`, defaults to :obj:`"</s>"`):
+            The end of sequence token.
+
+            .. note::
+
+                When building a sequence using special tokens, this is not the token that is used for the end
+                of sequence. The token used is the :obj:`sep_token`.
+        sep_token (:obj:`str`, `optional`, defaults to :obj:`"</s>"`):
+            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences
+            for sequence classification or for a text and a question for question answering.
+            It is also used as the last token of a sequence built with special tokens.
+        cls_token (:obj:`str`, `optional`, defaults to :obj:`"<s>"`):
+            The classifier token which is used when doing sequence classification (classification of the whole
+            sequence instead of per-token classification). It is the first token of the sequence when built with
+            special tokens.
+        unk_token (:obj:`str`, `optional`, defaults to :obj:`"<unk>"`):
             The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
             token instead.
-        bos_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):
-            The beginning of sequence token.
-        eos_token (:obj:`string`, `optional`, defaults to `<|endoftext|>`):
-            The end of sequence token.
-        add_prefix_space (:obj:`bool`, `optional`, defaults to `False`):
-            Whether to add a leading space to the first word.
-            This allows to treat the leading word just as any other word.
-            (GPT2 tokenizer detect beginning of words by the preceeding space)
-        trim_offsets (:obj:`bool`, `optional`, defaults to `True`):
+        pad_token (:obj:`str`, `optional`, defaults to :obj:`"<pad>"`):
+            The token used for padding, for example when batching sequences of different lengths.
+        mask_token (:obj:`str`, `optional`, defaults to :obj:`"<mask>"`):
+            The token used for masking values. This is the token used when training this model with masked language
+            modeling. This is the token which the model will try to predict.
+        add_prefix_space (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether or not to add an initial space to the input. This allows to treat the leading word just as any
+            other word. (RoBERTa tokenizer detect beginning of words by the preceding space).
+        trim_offsets (:obj:`bool`, `optional`, defaults to :obj:`True`):
             Whether the post processing step should trim offsets to avoid including whitespaces.
     """
 
@@ -377,18 +401,17 @@ class RobertaTokenizerFast(GPT2TokenizerFast):
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
     ) -> List[int]:
         """
-        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.
+        Create a mask from the two sequences passed to be used in a sequence-pair classification task.
         RoBERTa does not make use of token type ids, therefore a list of zeros is returned.
 
         Args:
             token_ids_0 (:obj:`List[int]`):
-                List of ids.
+                List of IDs.
             token_ids_1 (:obj:`List[int]`, `optional`):
                 Optional second list of IDs for sequence pairs.
 
         Returns:
-            :obj:`List[int]`: List of zeros.
-
+            :obj:`List[int]`:  List of zeros.
         """
         sep = [self.sep_token_id]
         cls = [self.cls_token_id]
diff --git a/src/transformers/tokenization_t5.py b/src/transformers/tokenization_t5.py
index a5569b6739..08438aa0c4 100644
--- a/src/transformers/tokenization_t5.py
+++ b/src/transformers/tokenization_t5.py
@@ -21,7 +21,9 @@ import warnings
 from shutil import copyfile
 from typing import List, Optional
 
+from .file_utils import add_start_docstrings
 from .tokenization_utils import BatchEncoding, PreTrainedTokenizer
+from .tokenization_utils_base import PREPARE_SEQ2SEQ_BATCH_DOCSTRING
 from .utils import logging
 
 
@@ -61,32 +63,33 @@ PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
 
 class T5Tokenizer(PreTrainedTokenizer):
     """
-    Constructs a T5 tokenizer. Based on `SentencePiece <https://github.com/google/sentencepiece>`__ .
+    Construct a T5 tokenizer. Based on `SentencePiece <https://github.com/google/sentencepiece>`__.
 
-    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the methods. Users
-    should refer to the superclass for more information regarding methods.
+    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the main methods.
+    Users should refer to this superclass for more information regarding those methods.
 
     Args:
-        vocab_file (:obj:`string`):
+        vocab_file (:obj:`str`):
             `SentencePiece <https://github.com/google/sentencepiece>`__ file (generally has a `.spm` extension) that
             contains the vocabulary necessary to instantiate a tokenizer.
-        eos_token (:obj:`string`, `optional`, defaults to "</s>"):
+        eos_token (:obj:`str`, `optional`, defaults to :obj:`"</s>"`):
             The end of sequence token.
 
             .. note::
 
                 When building a sequence using special tokens, this is not the token that is used for the end
                 of sequence. The token used is the :obj:`sep_token`.
-        unk_token (:obj:`string`, `optional`, defaults to "<unk>"):
+        unk_token (:obj:`str`, `optional`, defaults to :obj:`"<unk>"`):
             The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
             token instead.
-        pad_token (:obj:`string`, `optional`, defaults to "<pad>"):
+        pad_token (:obj:`str`, `optional`, defaults to :obj:`"<pad>"`):
             The token used for padding, for example when batching sequences of different lengths.
-        extra_ids (:obj:`List[str]`, `optional`, defaults to :obj:`100`):
+        extra_ids (:obj:`int`, `optional`, defaults to 100):
             Add a number of extra ids added to the end of the vocabulary for use as sentinels.
             These tokens are accessible as "<extra_id_{%d}>" where "{%d}" is a number between 0 and extra_ids-1.
-            Extra tokens are indexed from the end of the vocabulary up to beginnning ("<extra_id_0>" is the last token in the vocabulary like in T5 preprocessing
-            see: https://github.com/google-research/text-to-text-transfer-transformer/blob/9fd7b14a769417be33bc6c850f9598764913c833/t5/data/preprocessors.py#L2117)
+            Extra tokens are indexed from the end of the vocabulary up to beginnning ("<extra_id_0>" is the last token
+            in the vocabulary like in T5 preprocessing see `here
+            <https://github.com/google-research/text-to-text-transfer-transformer/blob/9fd7b14a769417be33bc6c850f9598764913c833/t5/data/preprocessors.py#L2117>`__).
         additional_special_tokens (:obj:`List[str]`, `optional`):
             Additional special tokens used by the tokenizer.
     """
@@ -149,19 +152,19 @@ class T5Tokenizer(PreTrainedTokenizer):
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
     ) -> List[int]:
         """
-        Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
+        Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
         special tokens using the tokenizer ``prepare_for_model`` method.
 
         Args:
             token_ids_0 (:obj:`List[int]`):
-                List of ids.
+                List of IDs.
             token_ids_1 (:obj:`List[int]`, `optional`):
                 Optional second list of IDs for sequence pairs.
             already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
-                Set to True if the token list is already formatted with special tokens for the model
+                Whether or not the token list is already formatted with special tokens for the model.
 
         Returns:
-            :obj:`List[int]`: A list of integers in the range [0, 1], 1 for a special token, 0 for a sequence token.
+            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
         """
         if already_has_special_tokens:
             if token_ids_1 is not None:
@@ -191,7 +194,6 @@ class T5Tokenizer(PreTrainedTokenizer):
         """
         Build model inputs from a sequence or a pair of sequence for sequence classification tasks
         by concatenating and adding special tokens.
-        For some t5 tasks, model.config.prefix is specified. This must be used before tokenization.
         A sequence has the following format:
 
         - single sequence: ``X </s>``
@@ -199,12 +201,12 @@ class T5Tokenizer(PreTrainedTokenizer):
 
         Args:
             token_ids_0 (:obj:`List[int]`):
-                List of IDs to which the special tokens will be added
+                List of IDs to which the special tokens will be added.
             token_ids_1 (:obj:`List[int]`, `optional`):
                 Optional second list of IDs for sequence pairs.
 
         Returns:
-            :obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
+            :obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
         """
         token_ids_0 = self._add_eos_if_not_present(token_ids_0)
         if token_ids_1 is None:
@@ -261,8 +263,15 @@ class T5Tokenizer(PreTrainedTokenizer):
         return out_string
 
     def save_vocabulary(self, save_directory):
-        """Save the sentencepiece vocabulary (copy original file) and special tokens file
-        to a directory.
+        """
+        Save the sentencepiece vocabulary (copy original file) and special tokens file to a directory.
+
+        Args:
+            save_directory (:obj:`str`):
+                The directory in which to save the vocabulary.
+
+        Returns:
+            :obj:`Tuple(str)`: Paths to the files saved.
         """
         if not os.path.isdir(save_directory):
             logger.error("Vocabulary path ({}) should be a directory".format(save_directory))
@@ -274,6 +283,7 @@ class T5Tokenizer(PreTrainedTokenizer):
 
         return (out_vocab_file,)
 
+    @add_start_docstrings(PREPARE_SEQ2SEQ_BATCH_DOCSTRING)
     def prepare_seq2seq_batch(
         self,
         src_texts: List[str],
@@ -285,59 +295,6 @@ class T5Tokenizer(PreTrainedTokenizer):
         truncation: bool = True,
         **kwargs,
     ) -> BatchEncoding:
-        r"""
-        Prepare a batch that can be passed directly to an instance of :class:`~transformers.T5Model`.
-        Args:
-            src_texts: (:obj:`List[str]`):
-                List of documents to summarize or source language texts.
-            tgt_texts: (:obj:`List[str]`, `optional`):
-                List of summaries or target language texts.
-            max_length (:obj:`int`, `optional`):
-                Controls the maximum length for encoder inputs (documents to summarize or source language texts).
-                If left unset or set to :obj:`None`, this will use the predefined model maximum length if a maximum
-                length is required by one of the truncation/padding parameters. If the model has no specific maximum
-                input length (like XLNet) truncation/padding to a maximum length will be deactivated.
-            max_target_length (:obj:`int`, `optional`):
-                Controls the maximum length of decoder inputs (target language texts or summaries).
-                If left unset or set to :obj:`None`, this will use the max_length value.
-            padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`False`):
-                Activates and controls padding. Accepts the following values:
-                * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a
-                  single sequence if provided).
-                * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
-                  maximum acceptable input length for the model if that argument is not provided.
-                * :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
-                  different lengths).
-            return_tensors (:obj:`str` or :class:`~transformers.tokenization_utils_base.TensorType`, `optional`, defaults to "pt"):
-                If set, will return tensors instead of list of python integers. Acceptable values are:
-                * :obj:`'tf'`: Return TensorFlow :obj:`tf.constant` objects.
-                * :obj:`'pt'`: Return PyTorch :obj:`torch.Tensor` objects.
-                * :obj:`'np'`: Return Numpy :obj:`np.ndarray` objects.
-            truncation (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.TruncationStrategy`, `optional`, defaults to :obj:`True`):
-                Activates and controls truncation. Accepts the following values:
-                * :obj:`True` or :obj:`'longest_first'`: Truncate to a maximum length specified with the argument
-                  :obj:`max_length` or to the maximum acceptable input length for the model if that argument is not
-                  provided. This will truncate token by token, removing a token from the longest sequence in the pair
-                  if a pair of sequences (or a batch of pairs) is provided.
-                * :obj:`'only_first'`: Truncate to a maximum length specified with the argument :obj:`max_length` or to
-                  the maximum acceptable input length for the model if that argument is not provided. This will only
-                  truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
-                * :obj:`'only_second'`: Truncate to a maximum length specified with the argument :obj:`max_length` or
-                  to the maximum acceptable input length for the model if that argument is not provided. This will only
-                  truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
-                * :obj:`False` or :obj:`'do_not_truncate'` (default): No truncation (i.e., can output batch with
-                  sequence lengths greater than the model maximum admissible input size).
-            **kwargs:
-                Additional keyword arguments passed along to :obj:`self.__call__`.
-        Returns:
-            :class:`~transformers.BatchEncoding`: A :class:`~transformers.BatchEncoding` with the following fields:
-            - **input_ids** -- List of token ids to be fed to the encoder.
-            - **attention_mask** -- List of indices specifying which tokens should be attended to by the model.
-            - **labels** -- List of token ids for tgt_texts
-
-            The full set of keys ``[input_ids, attention_mask, decoder_input_ids,  labels]``,
-            will only be returned if tgt_texts is passed. Otherwise, input_ids, attention_mask will be the only keys.
-        """
         if max_length is None:
             max_length = self.max_len
         model_inputs = self(
diff --git a/src/transformers/tokenization_transfo_xl.py b/src/transformers/tokenization_transfo_xl.py
index 8a677a713a..08c454d718 100644
--- a/src/transformers/tokenization_transfo_xl.py
+++ b/src/transformers/tokenization_transfo_xl.py
@@ -116,10 +116,40 @@ def detokenize_numbers(text: str) -> str:
 
 class TransfoXLTokenizer(PreTrainedTokenizer):
     """
-    Transformer-XL tokenizer adapted from Vocab class in https://github.com/kimiyoung/transformer-xl
+    Construct a Transformer-XL tokenizer adapted from Vocab class in `the original code
+    <https://github.com/kimiyoung/transformer-xl>`__. The Transformer-XL tokenizer is a word-level tokenizer (no sub-word tokenization).
 
-    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the methods. Users
-    should refer to the superclass for more information regarding methods.
+    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the main methods.
+    Users should refer to this superclass for more information regarding those methods.
+
+    Args:
+        special (:obj:`List[str]`, `optional`):
+            A list of special tokens (to be treated by the original implementation of this tokenizer).
+        min_freq (:obj:`int`, `optional`, defaults to 0):
+            The minimum number of times a token has to be present in order to be kept in the vocabulary (otherwise it
+            will be mapped to :obj:`unk_token`).
+        max_size (:obj:`int`, `optional`):
+            The maximum size of the vocabulary. If left unset, it will default to the size of the vocabulary found
+            after excluding the tokens according to the :obj:`min_freq` rule.
+        lower_case (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether or not to lowercase the input when tokenizing.
+        delimiter (:obj:`str`, `optional`):
+            The delimiter used btween tokens.
+        vocab_file (:obj:`str`, `optional`):
+            File containing the vocabulary (from the original implementation).
+        pretrained_vocab_file (:obj:`str`, `optional`):
+            File containing the vocabulary as saved with the :obj:`save_pretrained()` method.
+        never_split (xxx, `optional`):
+            Fill me with intesting stuff.
+        unk_token (:obj:`str`, `optional`, defaults to :obj:`"<unk>"`):
+            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
+            token instead.
+        eos_token (:obj:`str`, `optional`, defaults to :obj:`"<eos>"`):
+            The end of sequence token.
+        additional_special_tokens (:obj:`List[str]`, `optional`, defaults to :obj:`["<formula>"]`):
+            A list of additional special tokens (for the HuggingFace functionality).
+        language (:obj:`str`, `optional`, defaults to :obj:`"en"`):
+            The language of this tokenizer (used for mose preprocessing).
     """
 
     vocab_files_names = VOCAB_FILES_NAMES
@@ -499,14 +529,45 @@ class _TransfoXLDelimiterLookupTokenizer(BaseTokenizer):
 
 class TransfoXLTokenizerFast(PreTrainedTokenizerFast):
     """
-    Construct a "Fast" Transformer-XL tokenizer (backed by HuggingFace's `tokenizers` library).
+    Construct a "fast" Transformer-XL tokenizer (backed by HuggingFace's `tokenizers` library) adapted from Vocab class
+    in `the original code <https://github.com/kimiyoung/transformer-xl>`__. The Transformer-XL tokenizer is a
+    word-level tokenizer (no sub-word tokenization).
 
-    The Transformer-XL tokenizer is a word-level tokenizer (no sub-word tokenization).
+    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizerFast` which contains most of the main
+    methods. Users should refer to this superclass for more information regarding those methods.
 
-    Adapted from Vocab class in https://github.com/kimiyoung/transformer-xl
-
-    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizerFast` which contains most of the methods. Users
-    should refer to the superclass for more information regarding methods.
+    Args:
+        special (:obj:`List[str]`, `optional`):
+            A list of special tokens (to be treated by the original implementation of this tokenizer).
+        min_freq (:obj:`int`, `optional`, defaults to 0):
+            The minimum number of times a token has to be present in order to be kept in the vocabulary (otherwise it
+            will be mapped to :obj:`unk_token`).
+        max_size (:obj:`int`, `optional`):
+            The maximum size of the vocabulary. If left unset, it will default to the size of the vocabulary found
+            after excluding the tokens according to the :obj:`min_freq` rule.
+        lower_case (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether or not to lowercase the input when tokenizing.
+        delimiter (:obj:`str`, `optional`):
+            The delimiter used btween tokens.
+        vocab_file (:obj:`str`, `optional`):
+            File containing the vocabulary (from the original implementation).
+        pretrained_vocab_file (:obj:`str`, `optional`):
+            File containing the vocabulary as saved with the :obj:`save_pretrained()` method.
+        never_split (xxx, `optional`):
+            Fill me with intesting stuff.
+        unk_token (:obj:`str`, `optional`, defaults to :obj:`"<unk>"`):
+            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
+            token instead.
+        eos_token (:obj:`str`, `optional`, defaults to :obj:`"<eos>"`):
+            The end of sequence token.
+        additional_special_tokens (:obj:`List[str]`, `optional`, defaults to :obj:`["<formula>"]`):
+            A list of additional special tokens (for the HuggingFace functionality).
+        add_eos (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether or not to add the end-of-sentence token.
+        add_double_eos (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether or not to add the end-of-sentence token.
+        normalization (xxx, `optional`):
+            Fill me with intesting stuff.
     """
 
     vocab_files_names = VOCAB_FILES_NAMES_FAST
diff --git a/src/transformers/tokenization_utils_base.py b/src/transformers/tokenization_utils_base.py
index 7213abc255..3648f5e99b 100644
--- a/src/transformers/tokenization_utils_base.py
+++ b/src/transformers/tokenization_utils_base.py
@@ -1216,12 +1216,13 @@ INIT_TOKENIZER_DOCSTRING = r"""
 
 
 PREPARE_SEQ2SEQ_BATCH_DOCSTRING = """
+        Prepare model inputs for translation. For best performance, translate one sentence at a time.
 
         Arguments:
-            src_texts: (:obj:`list`):
-                list of documents to summarize or source language texts
-            tgt_texts: (:obj:`list`, `optional`):
-                list of tgt language texts or summaries.
+            src_texts (:obj:`List[str]`):
+                List of documents to summarize or source language texts.
+            tgt_texts (:obj:`list`, `optional`):
+                List of summaries or target language texts.
             max_length (:obj:`int`, `optional`):
                 Controls the maximum length for encoder inputs (documents to summarize or source language texts)
                 If left unset or set to :obj:`None`, this will use the predefined model maximum length if a maximum
@@ -1260,6 +1261,8 @@ PREPARE_SEQ2SEQ_BATCH_DOCSTRING = """
                   truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
                 * :obj:`False` or :obj:`'do_not_truncate'` (default): No truncation (i.e., can output batch with
                   sequence lengths greater than the model maximum admissible input size).
+            **kwargs:
+                Additional keyword arguments passed along to :obj:`self.__call__`.
 
         Return:
             :class:`~transformers.BatchEncoding`: A :class:`~transformers.BatchEncoding` with the following fields:
diff --git a/src/transformers/tokenization_xlm.py b/src/transformers/tokenization_xlm.py
index 086a947bec..3fc4b80d9d 100644
--- a/src/transformers/tokenization_xlm.py
+++ b/src/transformers/tokenization_xlm.py
@@ -529,58 +529,53 @@ def romanian_preprocessing(text):
 
 class XLMTokenizer(PreTrainedTokenizer):
     """
-    BPE tokenizer for XLM
+    Construct an XLM tokenizer. Based on Byte-Pair Encoding. The tokenization process is the following:
 
-    - Moses preprocessing & tokenization for most supported languages
-    - Language specific tokenization for Chinese (Jieba), Japanese (KyTea) and Thai (PyThaiNLP)
-    - (optionally) lower case & normalize all inputs text
-    - argument ``special_tokens`` and function ``set_special_tokens``, can be used to add additional symbols \
-      (ex: "__classify__") to a vocabulary
-    - `lang2id` attribute maps the languages supported by the model with their ids if provided (automatically set for pretrained vocabularies)
-    - `id2lang` attributes does reverse mapping if provided (automatically set for pretrained vocabularies)
+    - Moses preprocessing and tokenization for most supported languages.
+    - Language specific tokenization for Chinese (Jieba), Japanese (KyTea) and Thai (PyThaiNLP).
+    - Optionally lowercases and normalizes all inputs text.
+    - The arguments ``special_tokens`` and the function ``set_special_tokens``, can be used to add additional symbols
+      (like "__classify__") to a vocabulary.
+    - The :obj:`lang2id` attribute maps the languages supported by the model with their IDs if provided (automatically
+      set for pretrained vocabularies).
+    - The :obj:`id2lang` attributes does reverse mapping if provided (automatically set for pretrained vocabularies).
 
-    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the methods. Users
-    should refer to the superclass for more information regarding methods.
+    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the main methods.
+    Users should refer to this superclass for more information regarding those methods.
 
     Args:
-        vocab_file (:obj:`string`):
+        vocab_file (:obj:`str`):
             Vocabulary file.
-        merges_file (:obj:`string`):
+        merges_file (:obj:`str`):
             Merges file.
-        do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):
-            Whether to lowercase the input when tokenizing.
-        remove_space (:obj:`bool`, `optional`, defaults to :obj:`True`):
-            Whether to strip the text when tokenizing (removing excess spaces before and after the string).
-        keep_accents (:obj:`bool`, `optional`, defaults to :obj:`False`):
-            Whether to keep accents when tokenizing.
-        unk_token (:obj:`string`, `optional`, defaults to "<unk>"):
+        unk_token (:obj:`str`, `optional`, defaults to :obj:`"<unk>"`):
             The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
             token instead.
-        bos_token (:obj:`string`, `optional`, defaults to "<s>"):
-            The beginning of sequence token that was used during pre-training. Can be used a sequence classifier token.
+        bos_token (:obj:`str`, `optional`, defaults to :obj:`"<s>"`):
+            The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
 
             .. note::
 
                 When building a sequence using special tokens, this is not the token that is used for the beginning
                 of sequence. The token used is the :obj:`cls_token`.
-        sep_token (:obj:`string`, `optional`, defaults to "</s>"):
+        sep_token (:obj:`str`, `optional`, defaults to :obj:`"</s>"`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences
             for sequence classification or for a text and a question for question answering.
             It is also used as the last token of a sequence built with special tokens.
-        pad_token (:obj:`string`, `optional`, defaults to "<pad>"):
+        pad_token (:obj:`str`, `optional`, defaults to :obj:`"<pad>"`):
             The token used for padding, for example when batching sequences of different lengths.
-        cls_token (:obj:`string`, `optional`, defaults to "</s>"):
+        cls_token (:obj:`str`, `optional`, defaults to :obj:`"</s>"`):
             The classifier token which is used when doing sequence classification (classification of the whole
             sequence instead of per-token classification). It is the first token of the sequence when built with
             special tokens.
-        mask_token (:obj:`string`, `optional`, defaults to "<special1>"):
+        mask_token (:obj:`str`, `optional`, defaults to :obj:`"<special1>"`):
             The token used for masking values. This is the token used when training this model with masked language
             modeling. This is the token which the model will try to predict.
         additional_special_tokens (:obj:`List[str]`, `optional`, defaults to :obj:`["<special0>","<special1>","<special2>","<special3>","<special4>","<special5>","<special6>","<special7>","<special8>","<special9>"]`):
             List of additional special tokens.
         lang2id (:obj:`Dict[str, int]`, `optional`):
             Dictionary mapping languages string identifiers to their IDs.
-        id2lang (:obj:`Dict[int, str`, `optional`):
+        id2lang (:obj:`Dict[int, str]`, `optional`):
             Dictionary mapping language IDs to their string identifiers.
         do_lowercase_and_remove_accent (:obj:`bool`, `optional`, defaults to :obj:`True`):
             Whether to lowercase and remove accents when tokenizing.
@@ -855,19 +850,19 @@ class XLMTokenizer(PreTrainedTokenizer):
         """
         Build model inputs from a sequence or a pair of sequence for sequence classification tasks
         by concatenating and adding special tokens.
-        A XLM sequence has the following format:
+        An XLM sequence has the following format:
 
         - single sequence: ``<s> X </s>``
         - pair of sequences: ``<s> A </s> B </s>``
 
         Args:
             token_ids_0 (:obj:`List[int]`):
-                List of IDs to which the special tokens will be added
+                List of IDs to which the special tokens will be added.
             token_ids_1 (:obj:`List[int]`, `optional`):
                 Optional second list of IDs for sequence pairs.
 
         Returns:
-            :obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
+            :obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
 
         """
         bos = [self.bos_token_id]
@@ -881,16 +876,16 @@ class XLMTokenizer(PreTrainedTokenizer):
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
     ) -> List[int]:
         """
-        Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
-        special tokens using the tokenizer ``prepare_for_model`` methods.
+        Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
+        special tokens using the tokenizer ``prepare_for_model`` method.
 
         Args:
             token_ids_0 (:obj:`List[int]`):
-                List of ids.
+                List of IDs.
             token_ids_1 (:obj:`List[int]`, `optional`):
                 Optional second list of IDs for sequence pairs.
             already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
-                Set to True if the token list is already formatted with special tokens for the model
+                Whether or not the token list is already formatted with special tokens for the model.
 
         Returns:
             :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
@@ -917,7 +912,7 @@ class XLMTokenizer(PreTrainedTokenizer):
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
     ) -> List[int]:
         """
-        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.
+        Create a mask from the two sequences passed to be used in a sequence-pair classification task.
         An XLM sequence pair mask has the following format:
 
         ::
@@ -925,11 +920,11 @@ class XLMTokenizer(PreTrainedTokenizer):
             0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
             | first sequence    | second sequence |
 
-        if token_ids_1 is None, only returns the first portion of the mask (0s).
+        If :obj:`token_ids_1` is :obj:`None`, this method only returns the first portion of the mask (0s).
 
         Args:
             token_ids_0 (:obj:`List[int]`):
-                List of ids.
+                List of IDs.
             token_ids_1 (:obj:`List[int]`, `optional`):
                 Optional second list of IDs for sequence pairs.
 
diff --git a/src/transformers/tokenization_xlm_roberta.py b/src/transformers/tokenization_xlm_roberta.py
index e0763d645f..c19a6b0f8f 100644
--- a/src/transformers/tokenization_xlm_roberta.py
+++ b/src/transformers/tokenization_xlm_roberta.py
@@ -51,45 +51,43 @@ PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
 
 class XLMRobertaTokenizer(PreTrainedTokenizer):
     """
-        Adapted from RobertaTokenizer and XLNetTokenizer
-        SentencePiece based tokenizer. Peculiarities:
+    Adapted from :class:`~transfomers.RobertaTokenizer` and class:`~transfomers.XLNetTokenizer`. Based on
+    `SentencePiece <https://github.com/google/sentencepiece>`__.
 
-        - requires `SentencePiece <https://github.com/google/sentencepiece>`_
-
-    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the methods. Users
-    should refer to the superclass for more information regarding methods.
+    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the main methods.
+    Users should refer to this superclass for more information regarding those methods.
 
     Args:
         vocab_file (:obj:`str`):
             Path to the vocabulary file.
-        bos_token (:obj:`string`, `optional`, defaults to "<s>"):
-            The beginning of sequence token that was used during pre-training. Can be used a sequence classifier token.
+        bos_token (:obj:`str`, `optional`, defaults to :obj:`"<s>"`):
+            The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
 
             .. note::
 
                 When building a sequence using special tokens, this is not the token that is used for the beginning
                 of sequence. The token used is the :obj:`cls_token`.
-        eos_token (:obj:`string`, `optional`, defaults to "</s>"):
+        eos_token (:obj:`str`, `optional`, defaults to :obj:`"</s>"`):
             The end of sequence token.
 
             .. note::
 
                 When building a sequence using special tokens, this is not the token that is used for the end
                 of sequence. The token used is the :obj:`sep_token`.
-        sep_token (:obj:`string`, `optional`, defaults to "</s>"):
+        sep_token (:obj:`str`, `optional`, defaults to :obj:`"</s>"`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences
             for sequence classification or for a text and a question for question answering.
             It is also used as the last token of a sequence built with special tokens.
-        cls_token (:obj:`string`, `optional`, defaults to "<s>"):
+        cls_token (:obj:`str`, `optional`, defaults to :obj:`"<s>"`):
             The classifier token which is used when doing sequence classification (classification of the whole
             sequence instead of per-token classification). It is the first token of the sequence when built with
             special tokens.
-        unk_token (:obj:`string`, `optional`, defaults to "<unk>"):
+        unk_token (:obj:`str`, `optional`, defaults to :obj:`"<unk>"`):
             The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
             token instead.
-        pad_token (:obj:`string`, `optional`, defaults to "<pad>"):
+        pad_token (:obj:`str`, `optional`, defaults to :obj:`"<pad>"`):
             The token used for padding, for example when batching sequences of different lengths.
-        mask_token (:obj:`string`, `optional`, defaults to "<mask>"):
+        mask_token (:obj:`str`, `optional`, defaults to :obj:`"<mask>"`):
             The token used for masking values. This is the token used when training this model with masked language
             modeling. This is the token which the model will try to predict.
         additional_special_tokens (:obj:`List[str]`, `optional`, defaults to :obj:`["<s>NOTUSED", "</s>NOTUSED"]`):
@@ -180,19 +178,19 @@ class XLMRobertaTokenizer(PreTrainedTokenizer):
         """
         Build model inputs from a sequence or a pair of sequence for sequence classification tasks
         by concatenating and adding special tokens.
-        A XLM-R sequence has the following format:
+        An XLM-RoBERTa sequence has the following format:
 
         - single sequence: ``<s> X </s>``
         - pair of sequences: ``<s> A </s></s> B </s>``
 
         Args:
             token_ids_0 (:obj:`List[int]`):
-                List of IDs to which the special tokens will be added
+                List of IDs to which the special tokens will be added.
             token_ids_1 (:obj:`List[int]`, `optional`):
                 Optional second list of IDs for sequence pairs.
 
         Returns:
-            :obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
+            :obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
         """
 
         if token_ids_1 is None:
@@ -205,16 +203,16 @@ class XLMRobertaTokenizer(PreTrainedTokenizer):
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
     ) -> List[int]:
         """
-        Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
-        special tokens using the tokenizer ``prepare_for_model`` methods.
+        Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
+        special tokens using the tokenizer ``prepare_for_model`` method.
 
         Args:
             token_ids_0 (:obj:`List[int]`):
-                List of ids.
+                List of IDs.
             token_ids_1 (:obj:`List[int]`, `optional`):
                 Optional second list of IDs for sequence pairs.
             already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
-                Set to True if the token list is already formatted with special tokens for the model
+                Whether or not the token list is already formatted with special tokens for the model.
 
         Returns:
             :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
@@ -236,12 +234,12 @@ class XLMRobertaTokenizer(PreTrainedTokenizer):
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
     ) -> List[int]:
         """
-        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.
-        XLM-R does not make use of token type ids, therefore a list of zeros is returned.
+        Create a mask from the two sequences passed to be used in a sequence-pair classification task.
+        XLM-RoBERTa does not make use of token type ids, therefore a list of zeros is returned.
 
         Args:
             token_ids_0 (:obj:`List[int]`):
-                List of ids.
+                List of IDs.
             token_ids_1 (:obj:`List[int]`, `optional`):
                 Optional second list of IDs for sequence pairs.
 
diff --git a/src/transformers/tokenization_xlnet.py b/src/transformers/tokenization_xlnet.py
index 26eeff1d0a..59c61e5e7d 100644
--- a/src/transformers/tokenization_xlnet.py
+++ b/src/transformers/tokenization_xlnet.py
@@ -52,13 +52,13 @@ SEG_ID_PAD = 4
 
 class XLNetTokenizer(PreTrainedTokenizer):
     """
-    Constructs an XLNet tokenizer. Based on `SentencePiece <https://github.com/google/sentencepiece>`__
+    Construct an XLNet tokenizer. Based on `SentencePiece <https://github.com/google/sentencepiece>`__.
 
-    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the methods. Users
-    should refer to the superclass for more information regarding methods.
+    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the main methods.
+    Users should refer to this superclass for more information regarding those methods.
 
     Args:
-        vocab_file (:obj:`string`):
+        vocab_file (:obj:`str`):
             `SentencePiece <https://github.com/google/sentencepiece>`__ file (generally has a .spm extension) that
             contains the vocabulary necessary to instantiate a tokenizer.
         do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):
@@ -67,34 +67,34 @@ class XLNetTokenizer(PreTrainedTokenizer):
             Whether to strip the text when tokenizing (removing excess spaces before and after the string).
         keep_accents (:obj:`bool`, `optional`, defaults to :obj:`False`):
             Whether to keep accents when tokenizing.
-        bos_token (:obj:`string`, `optional`, defaults to "<s>"):
-            The beginning of sequence token that was used during pre-training. Can be used a sequence classifier token.
+        bos_token (:obj:`str`, `optional`, defaults to :obj:`"<s>"`):
+            The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
 
             .. note::
 
                 When building a sequence using special tokens, this is not the token that is used for the beginning
                 of sequence. The token used is the :obj:`cls_token`.
-        eos_token (:obj:`string`, `optional`, defaults to "</s>"):
+        eos_token (:obj:`str`, `optional`, defaults to :obj:`"</s>"`):
             The end of sequence token.
 
             .. note::
 
                 When building a sequence using special tokens, this is not the token that is used for the end
                 of sequence. The token used is the :obj:`sep_token`.
-        unk_token (:obj:`string`, `optional`, defaults to "<unk>"):
+        unk_token (:obj:`str`, `optional`, defaults to :obj:`"<unk>"`):
             The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
             token instead.
-        sep_token (:obj:`string`, `optional`, defaults to "<sep>"):
+        sep_token (:obj:`str`, `optional`, defaults to :obj:`"<sep>"`):
             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences
             for sequence classification or for a text and a question for question answering.
             It is also used as the last token of a sequence built with special tokens.
-        pad_token (:obj:`string`, `optional`, defaults to "<pad>"):
+        pad_token (:obj:`str`, `optional`, defaults to :obj:`"<pad>"`):
             The token used for padding, for example when batching sequences of different lengths.
-        cls_token (:obj:`string`, `optional`, defaults to "<cls>"):
+        cls_token (:obj:`str`, `optional`, defaults to :obj:`"<cls>"`):
             The classifier token which is used when doing sequence classification (classification of the whole
             sequence instead of per-token classification). It is the first token of the sequence when built with
             special tokens.
-        mask_token (:obj:`string`, `optional`, defaults to "<mask>"):
+        mask_token (:obj:`str`, `optional`, defaults to :obj:`"<mask>"`):
             The token used for masking values. This is the token used when training this model with masked language
             modeling. This is the token which the model will try to predict.
         additional_special_tokens (:obj:`List[str]`, `optional`, defaults to :obj:`["<eop>", "<eod>"]`):
@@ -249,12 +249,12 @@ class XLNetTokenizer(PreTrainedTokenizer):
 
         Args:
             token_ids_0 (:obj:`List[int]`):
-                List of IDs to which the special tokens will be added
+                List of IDs to which the special tokens will be added.
             token_ids_1 (:obj:`List[int]`, `optional`):
                 Optional second list of IDs for sequence pairs.
 
         Returns:
-            :obj:`List[int]`: list of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
+            :obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
         """
         sep = [self.sep_token_id]
         cls = [self.cls_token_id]
@@ -266,16 +266,16 @@ class XLNetTokenizer(PreTrainedTokenizer):
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
     ) -> List[int]:
         """
-        Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
-        special tokens using the tokenizer ``prepare_for_model`` methods.
+        Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
+        special tokens using the tokenizer ``prepare_for_model`` method.
 
         Args:
             token_ids_0 (:obj:`List[int]`):
-                List of ids.
+                List of IDs.
             token_ids_1 (:obj:`List[int]`, `optional`):
                 Optional second list of IDs for sequence pairs.
             already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
-                Set to True if the token list is already formatted with special tokens for the model
+                Whether or not the token list is already formatted with special tokens for the model.
 
         Returns:
             :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
@@ -297,16 +297,19 @@ class XLNetTokenizer(PreTrainedTokenizer):
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
     ) -> List[int]:
         """
-        Creates a mask from the two sequences passed to be used in a sequence-pair classification task.
+        Create a mask from the two sequences passed to be used in a sequence-pair classification task.
         An XLNet sequence pair mask has the following format:
-        0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 2
-        | first sequence    | second sequence     | CLS segment ID
 
-        if token_ids_1 is None, only returns the first portion of the mask (0's).
+        ::
+
+            0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
+            | first sequence    | second sequence |
+
+        If :obj:`token_ids_1` is :obj:`None`, this method only returns the first portion of the mask (0s).
 
         Args:
             token_ids_0 (:obj:`List[int]`):
-                List of ids.
+                List of IDs.
             token_ids_1 (:obj:`List[int]`, `optional`):
                 Optional second list of IDs for sequence pairs.