Release: v3.0.2

Fix fast tokenizers too (#5562 )
Various tokenizers fixes (#5558 )
2020-07-06 18:49:44 -04:00 · 2020-07-06 18:45:01 -04:00 · 2020-07-06 18:27:53 -04:00 · 2020-07-06 17:26:48 -04:00 · 2020-07-06 12:17:05 -04:00 · 2020-07-06 11:33:57 -04:00
138 changed files with 5346 additions and 3067 deletions
--- a/.circleci/config.yml
+++ b/.circleci/config.yml
@@ -57,7 +57,10 @@ jobs:
        steps:
            - checkout
            - run: sudo pip install .[mecab,testing]
-            - run: python -m pytest -sv ./tests/test_tokenization_bert_japanese.py
+            - run: python -m pytest -s ./tests/test_tokenization_bert_japanese.py | tee output.txt
+            - store_artifacts:
+                path: ~/transformers/output.txt
+                destination: test_output.txt
    run_examples_torch:
        working_directory: ~/transformers
        docker:
--- a/.circleci/deploy.sh
+++ b/.circleci/deploy.sh
@@ -46,4 +46,5 @@ deploy_doc "11c3257" v2.8.0
 deploy_doc "e7cfc1a" v2.9.0
 deploy_doc "7cb203f" v2.9.1
 deploy_doc "10d7239" v2.10.0 
-deploy_doc "b42586e" #v2.11.0 Latest stable release
+deploy_doc "b42586e" v2.11.0
+deploy_doc "b62ca59" #v3.0.0 Latest stable release
--- a/.github/workflows/self-push.yml
+++ b/.github/workflows/self-push.yml
@@ -51,4 +51,11 @@ jobs:
        USE_CUDA: yes
      run: |
        source .env/bin/activate
-        python -m pytest -n 2 --dist=loadfile -s -v ./tests/
+        python -m pytest -n 2 --dist=loadfile -s ./tests/ | tee output.txt
+    - name: cat output.txt
+      run: cat output.txt
+    - name: Upload output.txt
+      uses: actions/upload-artifact@v1
+      with:
+        name: pytest_output
+        path: output.txt
--- a/.github/workflows/self-scheduled.yml
+++ b/.github/workflows/self-scheduled.yml
@@ -46,5 +46,11 @@ jobs:
        USE_CUDA: yes
      run: |
        source .env/bin/activate
-        python -m pytest -n 1 --dist=loadfile -s -v ./tests/
-        
+        python -m pytest -n 1 --dist=loadfile -s ./tests/  | tee output.txt
+    - name: cat output.txt
+      run: cat output.txt
+    - name: Upload output.txt
+      uses: actions/upload-artifact@v1
+      with:
+        name: pytest_output
+        path: output.txt
--- a/docs/source/_static/js/custom.js
+++ b/docs/source/_static/js/custom.js
@@ -1,10 +1,11 @@
 // These two things need to be updated at each release for the version selector.
 // Last stable version
-const stableVersion = "v2.11.0"
+const stableVersion = "v3.0.0"
 // Dictionary doc folder to label
 const versionMapping = {
    "master": "master",
-    "": "v2.11.0 (stable)",
+    "": "v3.0.0 (stable)",
+    "v2.11.0": "v2.11.0",
    "v2.10.0": "v2.10.0",
    "v2.9.1": "v2.9.0/v2.9.1",
    "v2.8.0": "v2.8.0",
@@ -86,7 +87,7 @@ function addVersionControl() {
    const parts = location.toString().split('/');
    let versionIndex = parts.length - 2;
    // Index page may not have a last part with filename.html so we need to go up
-    if (parts[parts.length - 1] != "" && ! parts[parts.length - 1].match(/\.html$/)) {
+    if (parts[parts.length - 1] != "" && ! parts[parts.length - 1].match(/\.html$|^search.html?/)) {
        versionIndex = parts.length - 1;
    }
    // Main classes and models are nested so we need to go deeper
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -26,7 +26,7 @@ author = u'huggingface'
 # The short X.Y version
 version = u''
 # The full version, including alpha/beta/rc tags
-release = u'3.0.0'
+release = u'3.0.2'


 # -- General configuration ---------------------------------------------------
--- a/docs/source/glossary.rst
+++ b/docs/source/glossary.rst
@@ -11,7 +11,7 @@ General terms
  tokens at a certain timestep.
 - MLM: masked language modeling, a pretraining task where the model sees a corrupted version of the texts, usually done
  by masking some tokens randomly, and has to predict the original text.
- multimodal: a task taht combines texts with another kind of inputs (for instance images).
+- multimodal: a task that combines texts with another kind of inputs (for instance images).
 - NLG: natural language generation, all tasks related to generating text ( for instance talk with transformers,
  translation)
 - NLP: natural language processing, a generic way to say "deal with texts".
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -142,6 +142,7 @@ conversion utilities for the following models:
    preprocessing
    training
    model_sharing
+    tokenizer_summary
    multilingual

 .. toctree::
@@ -173,6 +174,7 @@ conversion utilities for the following models:
    main_classes/pipelines
    main_classes/optimizer_schedules
    main_classes/processors
+    main_classes/trainer
    model_doc/auto
    model_doc/encoderdecoder
    model_doc/bert
--- a/docs/source/main_classes/optimizer_schedules.rst
+++ b/docs/source/main_classes/optimizer_schedules.rst
@@ -1,4 +1,4 @@
-Optimizer
+Optimization
 ----------------------------------------------------

 The ``.optimization`` module provides:
@@ -7,24 +7,25 @@ The ``.optimization`` module provides:
 - several schedules in the form of schedule objects that inherit from ``_LRSchedule``:
 - a gradient accumulation class to accumulate the gradients of multiple batches

-``AdamW``
-~~~~~~~~~~~~~~~~
+``AdamW`` (PyTorch)
+~~~~~~~~~~~~~~~~~~~

 .. autoclass:: transformers.AdamW
    :members:

-``AdamWeightDecay``
-~~~~~~~~~~~~~~~~~~~
+``AdamWeightDecay`` (TensorFlow)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: transformers.AdamWeightDecay

 .. autofunction:: transformers.create_optimizer

 Schedules
----------------------------------------------------
+~~~~~~~~~~~~~~~~~~~
+
+Learning Rate Schedules (Pytorch)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-Learning Rate Schedules
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autofunction:: transformers.get_constant_schedule


@@ -56,16 +57,16 @@ Learning Rate Schedules
    :target: /imgs/warmup_linear_schedule.png
    :alt:

-``Warmup``
-~~~~~~~~~~~~~~~~
+``Warmup`` (TensorFlow)
+^^^^^^^^^^^^^^^^^^^^^^^

 .. autoclass:: transformers.WarmUp
    :members:

 Gradient Strategies
----------------------------------------------------
+~~~~~~~~~~~~~~~~~~~~

-``GradientAccumulator``
-~~~~~~~~~~~~~~~~~~~~~~~
+``GradientAccumulator`` (TensorFlow)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 .. autoclass:: transformers.GradientAccumulator
--- a/docs/source/main_classes/trainer.rst
+++ b/docs/source/main_classes/trainer.rst
@@ -0,0 +1,45 @@
+Trainer
+----------
+
+The :class:`~transformers.Trainer` and :class:`~transformers.TFTrainer` classes provide an API for feature-complete
+training in most standard use cases. It's used in most of the :doc:`example scripts <../examples>`.
+
+Before instantiating your :class:`~transformers.Trainer`/:class:`~transformers.TFTrainer`, create a 
+:class:`~transformers.TrainingArguments`/:class:`~transformers.TFTrainingArguments` to access all the points of
+customization during training.
+
+The API supports distributed training on multiple GPUs/TPUs, mixed precision through `NVIDIA Apex
+<https://github.com/NVIDIA/apex>`__ for PyTorch and :obj:`tf.keras.mixed_precision` for TensorFlow.
+
+``Trainer`` 
+~~~~~~~~~~~
+
+.. autoclass:: transformers.Trainer
+    :members:
+
+``TFTrainer`` 
+~~~~~~~~~~~~~
+
+.. autoclass:: transformers.TFTrainer
+    :members:
+
+``TrainingArguments``
+~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.TrainingArguments
+    :members:
+
+``TFTrainingArguments``
+~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.TFTrainingArguments
+    :members:
+
+Utilities
+~~~~~~~~~
+
+.. autoclass:: transformers.EvalPrediction
+
+.. autofunction:: transformers.set_seed
+
+.. autofunction:: transformers.torch_distributed_zero_first
--- a/docs/source/model_doc/reformer.rst
+++ b/docs/source/model_doc/reformer.rst
@@ -112,3 +112,17 @@ ReformerModelWithLMHead

 .. autoclass:: transformers.ReformerModelWithLMHead
    :members:
+
+
+ReformerForMaskedLM
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.ReformerForMaskedLM
+    :members:
+
+
+ReformerForQuestionAnswering
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.ReformerForQuestionAnswering
+    :members:
--- a/docs/source/model_sharing.rst
+++ b/docs/source/model_sharing.rst
@@ -171,8 +171,11 @@ Add a model card
 ^^^^^^^^^^^^^^^^

 To make sure everyone knows what your model can do, what its limitations and potential bias or ethetical
-considerations, please add a README.md model card to the 🤗 Transformers repo under `model_cards/`. It should be named
-`README.md` and follow `this template <https://github.com/huggingface/model_card>`__.
+considerations, please add a README.md model card to the 🤗 Transformers repo under `model_cards/`. It should then be
+placed in a subfolder with your username or organization, then another subfolder named like your model
+(`awesome-name-you-picked`). Or just click on the "Create a model card on GitHub" button on the model page, it will
+get you directly to the right location. If you need one, `here <https://github.com/huggingface/model_card>`__ is a
+model card template (meta-suggestions are welcome).

 If your model is fine-tuned from another model coming from the model hub (all 🤗 Transformers pretrained models do),
 don't forget to link to its model card so that people can fully trace how your model was built.
@@ -180,6 +183,11 @@ don't forget to link to its model card so that people can fully trace how your m
 If you have never made a pull request to the 🤗 Transformers repo, look at the
 :doc:`contributing guide <contributing>` to see the steps to follow.

+.. Note::
+
+    You can also send your model card in the folder you uploaded with the CLI by placing it in a `README.md` file
+    inside `path/to/awesome-name-you-picked/`.
+
 Using your model
 ^^^^^^^^^^^^^^^^

--- a/docs/source/quicktour.rst
+++ b/docs/source/quicktour.rst
@@ -146,8 +146,9 @@ Using the tokenizer

 We mentioned the tokenizer is responsible for the preprocessing of your texts. First, it will split a given text in
 words (or part of words, punctuation symbols, etc.) usually called `tokens`. There are multiple rules that can govern
-that process, which is why we need to instantiate the tokenizer using the name of the model, to make sure we use the
-same rules as when the model was pretrained.
+that process (you can learn more about them in the :doc:`tokenizer_summary <tokenizer_summary>`, which is why we need
+to instantiate the tokenizer using the name of the model, to make sure we use the same rules as when the model was
+pretrained.

 The second step is to convert those `tokens` into numbers, to be able to build a tensor out of them and feed them to
 the model. To do this, the tokenizer has a `vocab`, which is the part we download when we instantiate it with the
--- a/docs/source/tokenizer_summary.rst
+++ b/docs/source/tokenizer_summary.rst
@@ -0,0 +1,243 @@
+Tokenizer summary
+-----------------
+
+In this page, we will have a closer look at tokenization. As we saw in
+:doc:`the preprocessing tutorial <preprocessing>`, tokenizing a text is splitting it into words or subwords, which then
+are converted to ids. The second part is pretty straightforward, here we will focus on the first part. More
+specifically, we will look at the three main different kinds of tokenizers used in 🤗 Transformers:
+:ref:`Byte-Pair Encoding (BPE) <byte-pair-encoding>`, :ref:`WordPiece <wordpiece>` and
+:ref:`SentencePiece <sentencepiece>`, and provide examples of models using each of those.
+
+Note that on each model page, you can look at the documentation of the associated tokenizer to know which of those
+algorithms the pretrained model used. For instance, if we look at :class:`~transformers.BertTokenizer`, we can see it's
+using :ref:`WordPiece <wordpiece>`.
+
+Introduction to tokenization
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Splitting a text in smaller chunks is a task that's harder than it looks, and there are multiple ways of doing it. For
+instance, let's look at the sentence "Don't you love 🤗 Transformers? We sure do." A first simple way of tokenizing
+this text is just to split it by spaces, which would give:
+
+::
+
+    ["Don't", "you", "love", "🤗", "Transformers?", "We", "sure", "do."]
+
+This is a nice first step, but if we look at the tokens "Transformers?" or "do.", we can see we can do better. Those
+will be different than the tokens "Transformers" and "do" for our model, so we should probably take the punctuation
+into account. This would give:
+
+::
+
+    ["Don", "'", "t", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]
+
+which is better already. One thing that is annoying though is how it dealt with "Don't". "Don't" stands for do not, so
+it should probably be better tokenized as ``["Do", "n't"]``. This is where things start getting more complicated, and
+part of the reason each kind of model has its own tokenizer class. Depending on the rules we apply to split our texts
+into tokens, we'll get different tokenized versions of the same text. And of course, a given pretrained model won't
+perform properly if you don't use the exact same rules as the persons who pretrained it.
+
+`spaCy <https://spacy.io/>`__ and `Moses <http://www.statmt.org/moses/?n=Development.GetStarted>`__ are two popular
+rule-based tokenizers. On the text above, they'd output something like:
+
+::
+
+    ["Do", "n't", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]
+
+Space/punctuation-tokenization and rule-based tokenization are both examples of word tokenization, which is splitting a
+sentence into words. While it's the most intuitive way to separate texts in smaller chunks, it can have a problem when
+you have a huge corpus: it usually yields a very big vocabulary (the set of all unique tokens used).
+:doc:`Transformer XL <model_doc/transformerxl>` for instance uses space/punctuation-tokenization, and has a vocabulary
+size of 267,735!
+
+A huge vocabulary size means a huge embedding matrix at the start of the model, which will cause memory problems.
+TransformerXL deals with it by using a special kind of embeddings called adaptive embeddings, but in general,
+transformers model rarely have a vocabulary size greater than 50,000, especially if they are trained on a single
+language.
+
+So if tokenizing on words is unsatisfactory, we could go on the opposite direction and simply tokenize on characters.
+While it's very simple and would save a lot of memory, this doesn't allow the model to learn representations of texts
+as meaningful as when using a word tokenization, leading to a loss of performance. So to get the best of both worlds,
+all transformers models use a hybrid between word-level and character-level tokenization called subword tokenization.
+
+Subword tokenization
+^^^^^^^^^^^^^^^^^^^^
+
+Subword tokenization algorithms rely on the principle that most common words should be left as is, but rare words
+should be decomposed in meaningful subword units. For instance "annoyingly" might be considered a rare word and
+decomposed as "annoying" and "ly". This is especially useful in agglutinative languages such as Turkish, where you can
+form (almost) arbitrarily long complex words by stringing together some subwords.
+
+This allows the model to keep a reasonable vocabulary while still learning useful representations for common words or
+subwords. This also gives the ability to the model to process words it has never seen before, by decomposing them into
+subwords it knows. For instance, the base :class:`~transformers.BertTokenizer` will tokenize "I have a new GPU!" like
+this:
+
+::
+
+    >>> from transformers import BertTokenizer
+    >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+    >>> tokenizer.tokenize("I have a new GPU!")
+    ['i', 'have', 'a', 'new', 'gp', '##u', '!']
+
+Since we are considering the uncased model, the sentence was lowercased first. Then all the words were present in the
+vocabulary of the tokenizer, except for "gpu", so the tokenizer split it in subwords it knows: "gp" and "##u". The "##"
+means that the rest of the token should be attached to the previous one, without space (for when we need to decode
+predictions and reverse the tokenization).
+
+Another example is when we use the base :class:`~transformers.XLNetTokenizer` to tokenize our previous text:
+
+::
+
+    >>> from transformers import XLNetTokenizer
+    >>> tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
+    >>> tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.")
+    ['▁Don', "'", 't', '▁you', '▁love', '▁', '🤗', '▁', 'Transform', 'ers', '?', '▁We', '▁sure', '▁do', '.']
+
+We'll get back to the meaning of those '▁' when we look at :ref:`SentencePiece <sentencepiece>` but you can see
+Transformers has been split into "Transform" and "ers".
+
+Let's now look at how the different subword tokenization algorithms work. Note that they all rely on some form of
+training which is usually done on the corpus the corresponding model will be trained on.
+
+.. _byte-pair-encoding:
+
+Byte-Pair Encoding
+~~~~~~~~~~~~~~~~~~
+
+Byte-Pair Encoding was introduced in `this paper <https://arxiv.org/abs/1508.07909>`__. It relies on a pretokenizer
+splitting the training data into words, which can be a simple space tokenization
+(:doc:`GPT-2 <model_doc/gpt2>` and :doc:`Roberta <model_doc/roberta>` uses this for instance) or a rule-based tokenizer
+(:doc:`XLM <model_doc/xlm>` use Moses for most languages, as does :doc:`FlauBERT <model_doc/flaubert>`),
+
+:doc:`GPT <model_doc/gpt>` uses Spacy and ftfy) and, counts the frequency of each word in the training corpus.
+
+It then begins from the list of all characters, and will learn merge rules to form a new token from two symbols in the
+vocabulary until it has learned a vocabulary of the desired size (this is a hyperparameter to pick).
+
+Let's say that after the pre-tokenization we have the following words (the number indicating the frequency of each
+word):
+
+::
+
+    ('hug', 10), ('pug', 5), ('pun', 12), ('bun', 4), ('hugs', 5)
+
+Then the base vocabulary is ['b', 'g', 'h', 'n', 'p', 's', 'u'] and all our words are first split by character:
+
+::
+
+    ('h' 'u' 'g', 10), ('p' 'u' 'g', 5), ('p' 'u' 'n', 12), ('b' 'u' 'n', 4), ('h' 'u' 'g' 's', 5)
+
+We then take each pair of symbols and look at the most frequent. For instance 'hu' is present `10 + 5 = 15` times (10
+times in the 10 occurrences of 'hug', 5 times in the 5 occurrences of 'hugs'). The most frequent here is 'ug', present
+`10 + 5 + 2 + 5 = 22` times in total. So the first merge rule the tokenizer learns is to group all 'u' and 'g' together
+then it adds 'ug' to the vocabulary. Our corpus then becomes
+
+::
+
+    ('h' 'ug', 10), ('p' 'ug', 5), ('p' 'u' 'n', 12), ('b' 'u' 'n', 4), ('h' 'ug' 's', 5)
+
+and we continue by looking at the next most common pair of symbols. It's 'un', present 16 times, so we merge those two
+and add 'un' to the vocabulary. Then it's 'hug' (as 'h' + 'ug'), present 15 times, so we merge those two and add 'hug'
+to the vocabulary.
+
+At this stage, the vocabulary is ``['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug']`` and our corpus is
+represented as
+
+::
+
+    ('hug', 10), ('p' 'ug', 5), ('p' 'un', 12), ('b' 'un', 4), ('hug' 's', 5)
+
+If we stop there, the tokenizer can apply the rules it learned to new words (as long as they don't contain characters that
+were not in the base vocabulary). For instance 'bug' would be tokenized as ``['b', 'ug']`` but mug would be tokenized as
+``['<unk>', 'ug']`` since the 'm' is not in the base vocabulary. This doesn't happen to letters in general (since the
+base corpus uses all of them), but to special characters like emojis.
+
+As we said before, the vocabulary size (which is the base vocabulary size + the number of merges) is a hyperparameter
+to choose. For instance :doc:`GPT <model_doc/gpt>` has a vocabulary size of 40,478 since they have 478 base characters
+and chose to stop the training of the tokenizer at 40,000 merges.
+
+Byte-level BPE
+^^^^^^^^^^^^^^
+
+To deal with the fact the base vocabulary needs to get all base characters, which can be quite big if one allows for
+all unicode characters, the
+`GPT-2 paper <https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`__
+introduces a clever trick, which is to use bytes as the base vocabulary (which gives a size of 256). With some
+additional rules to deal with punctuation, this manages to be able to tokenize every text without needing an unknown
+token. For instance, the :doc:`GPT-2 model <model_doc/gpt>` has a vocabulary size of 50,257, which corresponds to the
+256 bytes base tokens, a special end-of-text token and the symbols learned with 50,000 merges.
+
+.. _wordpiece:
+
+WordPiece
+=========
+
+WordPiece is the subword tokenization algorithm used for :doc:`BERT <model_doc/bert>` (as well as
+:doc:`DistilBERT <model_doc/distilbert>` and :doc:`Electra <model_doc/electra>`) and was outlined in
+`this paper <https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf>`__. It relies
+on the same base as BPE, which is to initialize the vocabulary to every character present in the corpus and
+progressively learn a given number of merge rules, the difference is that it doesn't choose the pair that is the most
+frequent but the one that will maximize the likelihood on the corpus once merged. 
+
+What does this mean? Well, in the previous example, it means we would only merge 'u' and 'g' if the probability of
+having 'ug' divided by the probability of having 'u' then 'g' is greater than for any other pair of symbols. It's
+subtly different from what BPE does in the sense that it evaluates what it "loses" by merging two symbols and makes
+sure it's `worth it`.
+
+.. _unigram:
+
+Unigram
+=======
+
+Unigram is a subword tokenization algorithm introduced in `this paper <https://arxiv.org/pdf/1804.10959.pdf>`__.
+Instead of starting with a group of base symbols and learning merges with some rule, like BPE or WordPiece, it starts
+from a large vocabulary (for instance, all pretokenized words and the most common substrings) that it will trim down
+progressively. It's not used directly for any of the pretrained models in the library, but it's used in conjunction
+with :ref:`SentencePiece <sentencepiece>`.
+
+More specifically, at a given step, unigram computes a loss from the corpus we have and the current vocabulary, then,
+for each subword, evaluate how much the loss would augment if the subword was removed from the vocabulary. It then
+sorts the subwords by this quantity (that represents how worse the loss becomes if the token is removed) and removes
+all the worst p tokens (for instance p could be 10% or 20%). It then repeats the process until the vocabulary has
+reached the desired size, always keeping the base characters (to be able to tokenize any word written with them, like
+BPE or WordPiece).
+
+Contrary to BPE and WordPiece that work out rules in a certain order that you can then apply in the same order when
+tokenizing new text, Unigram will have several ways of tokenizing a new text. For instance, if it ends up with the
+vocabulary
+
+::
+
+    ['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug']
+
+we had before, it could tokenize "hugs" as ``['hug', 's']``, ``['h', 'ug', 's']`` or ``['h', 'u', 'g', 's']``. So which
+one choose? On top of saving the vocabulary, the trained tokenizer will save the probability of each token in the
+training corpus. You can then give a probability to each tokenization (which is the product of the probabilities of the
+tokens forming it) and pick the most likely one (or if you want to apply some data augmentation, you could sample one
+of the tokenization according to their probabilities).
+
+Those probabilities are what are used to define the loss that trains the tokenizer: if our corpus consists of the
+words :math:`x_{1}, \dots, x_{N}` and if for the word :math:`x_{i}` we note :math:`S(x_{i})` the set of all possible
+tokenizations of :math:`x_{i}` (with the current vocabulary), then the loss is defined as
+
+.. math::
+    \mathcal{L} = -\sum_{i=1}^{N} \log \left ( \sum_{x \in S(x_{i})} p(x) \right )
+
+.. _sentencepiece:
+
+SentencePiece
+=============
+
+All the methods we have been looking at so far required some from of pretrokenization, which has a central problem: not
+all languages use spaces to separate words. This is a problem :doc:`XLM <model_doc/xlm>` solves by using specific
+pretokenizers for each of those languages (in this case, Chinese, Japanese and Thai). To solve this problem,
+SentencePiece (introduced in `this paper <https://arxiv.org/pdf/1808.06226.pdf>`__) treats the input as a raw stream,
+includes the space in the set of characters to use, then uses BPE or unigram to construct the appropriate vocabulary.
+
+That's why in the example we saw before using :class:`~transformers.XLNetTokenizer` (which uses SentencePiece), we had
+some '▁' characters, that represent spaces. Decoding a tokenized text is then super easy: we just have to concatenate
+all of them together and replace those '▁' by spaces.
+
+All transformers models in the library that use SentencePiece use it with unigram. Examples of models using it are
+:doc:`ALBERT <model_doc/albert>`, :doc:`XLNet <model_doc/xlnet>` or the :doc:`Marian framework <model_doc/marian>`.
--- a/docs/source/training.rst
+++ b/docs/source/training.rst
@@ -39,7 +39,7 @@ of the specified model are used to initialize the model. The
 library also includes a number of task-specific final layers or 'heads' whose
 weights are instantiated randomly when not present in the specified
 pre-trained model. For example, instantiating a model with
-``BertForSequenceClassification.from_pretrained('bert-base-uncased', num_classes=2)``
+``BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)``
 will create a BERT model instance with encoder weights copied from the
 ``bert-base-uncased`` model and a randomly initialized sequence
 classification head on top of the encoder with an output size of 2. Models
@@ -272,7 +272,7 @@ optimize.
 :func:`~transformers.Trainer` uses a built-in default function to collate
 batches and prepare them to be fed into the model. If needed, you can also
 use the ``data_collator`` argument to pass your own collator function which
-takes in the data in the format provides by your dataset and returns a
+takes in the data in the format provided by your dataset and returns a
 batch ready to be fed into the model. Note that
 :func:`~transformers.TFTrainer` expects the passed datasets to be dataset
 objects from ``tensorflow_datasets``.
--- a/examples/README.md
+++ b/examples/README.md
@@ -1,4 +1,4 @@
-## Examples
+# Examples

 Version 2.9 of 🤗 Transformers introduces a new [`Trainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) class for PyTorch, and its equivalent [`TFTrainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer_tf.py) for TF 2.
 Running the examples requires PyTorch 1.3.1+ or TensorFlow 2.1+.
@@ -13,7 +13,7 @@ Here is the list of all our examples:
 This is still a work-in-progress – in particular documentation is still sparse – so please **contribute improvements/pull requests.**


-# The Big Table of Tasks
+## The Big Table of Tasks

 | Task | Example datasets | Trainer support | TFTrainer support | pytorch-lightning | Colab
 |---|---|:---:|:---:|:---:|:---:|
--- a/examples/multiple-choice/run_tf_multiple_choice.py
+++ b/examples/multiple-choice/run_tf_multiple_choice.py
@@ -108,7 +108,10 @@ def main():
        level=logging.INFO,
    )
    logger.warning(
-        "device: %s, n_gpu: %s, 16-bits training: %s", training_args.device, training_args.n_gpu, training_args.fp16,
+        "device: %s, n_replicas: %s, 16-bits training: %s",
+        training_args.device,
+        training_args.n_replicas,
+        training_args.fp16,
    )
    logger.info("Training/evaluation parameters %s", training_args)

--- a/examples/question-answering/run_tf_squad.py
+++ b/examples/question-answering/run_tf_squad.py
@@ -137,9 +137,9 @@ def main():
        level=logging.INFO,
    )
    logger.info(
-        "n_gpu: %s, distributed training: %s, 16-bits training: %s",
-        training_args.n_gpu,
-        bool(training_args.n_gpu > 1),
+        "n_replicas: %s, distributed training: %s, 16-bits training: %s",
+        training_args.n_replicas,
+        bool(training_args.n_replicas > 1),
        training_args.fp16,
    )
    logger.info("Training/evaluation parameters %s", training_args)
--- a/examples/seq2seq/README.md
+++ b/examples/seq2seq/README.md
@@ -1,3 +1,5 @@
+## Sequence to Sequence
+
 This directory contains examples for finetuning and evaluating transformers on summarization and translation tasks.
 Summarization support is more mature than translation support.
 Please tag @sshleifer with any issues/unexpected behaviors, or send a PR!
@@ -69,7 +71,7 @@ Load it with `BartForConditionalGeneration.from_pretrained(f'{output_dir}/best_t
 - If you want to run experiments on improving the summarization finetuning process, try the XSUM Shared Task (below). It's faster to train than CNNDM because the summaries are shorter.
 - For CNN/DailyMail, the default `val_max_target_length` and `test_max_target_length` will truncate the ground truth labels, resulting in slightly higher rouge scores. To get accurate rouge scores, you should rerun calculate_rouge on the `{output_dir}/test_generations.txt` file saved by `trainer.test()`
 - `--max_target_length=60 --val_max_target_length=60 --test_max_target_length=100 ` is a reasonable setting for XSUM.
- `wandb` can be used by specifying `--logger wandb_shared` or `--logger wandb`. It is useful for reproducibility. 
+- `wandb` can be used by specifying `--logger wandb`. It is useful for reproducibility. Specify the environment variable `WANDB_PROJECT='hf_xsum'` to do the XSUM shared task. 
 - This warning can be safely ignored: 
    > "Some weights of BartForConditionalGeneration were not initialized from the model checkpoint at facebook/bart-large-xsum and are newly initialized: ['final_logits_bias']"
 - Both finetuning and eval are 30% faster with `--fp16`. For that you need to [install apex](https://github.com/NVIDIA/apex#quick-start).
@@ -109,14 +111,14 @@ Compare XSUM results with others by using `--logger wandb_shared`. This requires

 Here is an example command, but you can do whatever you want. Hopefully this will make debugging and collaboration easier!
 ```bash
-./finetune.sh \
+WANDB_PROJECT='hf_xsum' ./finetune.sh \
    --data_dir $XSUM_DIR \
    --output_dir xsum_frozen_embs \
    --model_name_or_path facebook/bart-large \
-    --logger wandb_shared \
    --train_batch_size 16 --eval_batch_size 16 --freeze_embeds --freeze_encoder \
    --num_train_epochs 6 \
-    --max_target_length=60 --val_max_target_length=60 --test_max_target_length=100
+    --max_target_length=60 --val_max_target_length=60 --test_max_target_length=100 \
+    --logger wandb
 ```

 You can see your wandb logs [here](https://app.wandb.ai/sshleifer/hf_xsum?workspace=user-)
@@ -168,6 +170,7 @@ python run_eval.py sshleifer/distilbart-cnn-12-6 $DATA_DIR/val.source dbart_val_


 ### DistilBART
+![DBART](https://huggingface.co/front/thumbnails/distilbart_large.png)

 For the CNN/DailyMail dataset, (relatively longer, more extractive summaries), we found a simple technique that works:
 you just copy alternating layers from `bart-large-cnn` and finetune more on the same data. 
--- a/examples/seq2seq/bertabs/run_summarization.py
+++ b/examples/seq2seq/bertabs/run_summarization.py
@@ -30,7 +30,7 @@ Batch = namedtuple("Batch", ["document_names", "batch_size", "src", "segs", "mas

 def evaluate(args):
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)
-    model = BertAbs.from_pretrained("bertabs-finetuned-cnndm")
+    model = BertAbs.from_pretrained("remi/bertabs-finetuned-extractive-abstractive-summarization")
    model.to(args.device)
    model.eval()

--- a/examples/seq2seq/finetune.py
+++ b/examples/seq2seq/finetune.py
@@ -298,8 +298,6 @@ def main(args, model=None) -> SummarizationModule:
            model: SummarizationModule = SummarizationModule(args)
        else:
            model: SummarizationModule = TranslationModule(args)
-
-    dataset = Path(args.data_dir).name
    if (
        args.logger == "default"
        or args.fast_dev_run
@@ -310,12 +308,12 @@ def main(args, model=None) -> SummarizationModule:
    elif args.logger == "wandb":
        from pytorch_lightning.loggers import WandbLogger

-        logger = WandbLogger(name=model.output_dir.name, project=dataset)
+        logger = WandbLogger(name=model.output_dir.name)

    elif args.logger == "wandb_shared":
        from pytorch_lightning.loggers import WandbLogger

-        logger = WandbLogger(name=model.output_dir.name, project=f"hf_{dataset}")
+        logger = WandbLogger(name=model.output_dir.name)
    trainer: pl.Trainer = generic_train(
        model,
        args,
--- a/examples/seq2seq/finetune_bart_tiny.sh
+++ b/examples/seq2seq/finetune_bart_tiny.sh
@@ -12,7 +12,7 @@ export OUTPUT_DIR=${CURRENT_DIR}/${OUTPUT_DIR_NAME}
 # Make output directory if it doesn't exist
 mkdir -p $OUTPUT_DIR

-# Add parent directory to python path to access lightning_base.py and utils.py
+# Add parent directory to python path to access lightning_base.py and testing_utils.py
 export PYTHONPATH="../":"${PYTHONPATH}"
 python finetune.py \
 --data_dir=cnn_tiny/ \
--- a/examples/seq2seq/test_seq2seq_examples.py
+++ b/examples/seq2seq/test_seq2seq_examples.py
@@ -12,6 +12,7 @@ import torch
 from torch.utils.data import DataLoader

 from transformers import AutoTokenizer
+from transformers.testing_utils import require_multigpu

 from .distillation import distill_main, evaluate_checkpoint
 from .finetune import main
@@ -107,7 +108,7 @@ class TestSummarizationDistiller(unittest.TestCase):
        logging.disable(logging.CRITICAL)  # remove noisy download output from tracebacks
        return cls

-    @unittest.skipUnless(torch.cuda.device_count() > 1, "skipping multiGPU test")
+    @require_multigpu
    def test_multigpu(self):
        updates = dict(no_teacher=True, freeze_encoder=True, gpus=2, sortish_sampler=False,)
        self._test_distiller_cli(updates)
--- a/examples/text-classification/run_tf_glue.py
+++ b/examples/text-classification/run_tf_glue.py
@@ -131,9 +131,9 @@ def main():
        level=logging.INFO,
    )
    logger.info(
-        "n_gpu: %s, distributed training: %s, 16-bits training: %s",
-        training_args.n_gpu,
-        bool(training_args.n_gpu > 1),
+        "n_replicas: %s, distributed training: %s, 16-bits training: %s",
+        training_args.n_replicas,
+        bool(training_args.n_replicas > 1),
        training_args.fp16,
    )
    logger.info("Training/evaluation parameters %s", training_args)
--- a/examples/text-generation/run_generation.py
+++ b/examples/text-generation/run_generation.py
@@ -214,8 +214,14 @@ def main():
    if requires_preprocessing:
        prepare_input = PREPROCESSING_FUNCTIONS.get(args.model_type)
        preprocessed_prompt_text = prepare_input(args, model, tokenizer, prompt_text)
+
+        if model.__class__.__name__ in ["TransfoXLLMHeadModel"]:
+            tokenizer_kwargs = {"add_space_before_punct_symbol": True}
+        else:
+            tokenizer_kwargs = {}
+
        encoded_prompt = tokenizer.encode(
-            preprocessed_prompt_text, add_special_tokens=False, return_tensors="pt", add_space_before_punct_symbol=True
+            preprocessed_prompt_text, add_special_tokens=False, return_tensors="pt", **tokenizer_kwargs
        )
    else:
        encoded_prompt = tokenizer.encode(prompt_text, add_special_tokens=False, return_tensors="pt")
--- a/examples/token-classification/run_ner.py
+++ b/examples/token-classification/run_ner.py
@@ -75,7 +75,8 @@ class DataTrainingArguments:
        metadata={"help": "The input data dir. Should contain the .txt files for a CoNLL-2003-formatted task."}
    )
    labels: Optional[str] = field(
-        metadata={"help": "Path to a file containing all labels. If not specified, CoNLL-2003 labels are used."}
+        default=None,
+        metadata={"help": "Path to a file containing all labels. If not specified, CoNLL-2003 labels are used."},
    )
    max_seq_length: int = field(
        default=128,
--- a/examples/token-classification/run_tf_ner.py
+++ b/examples/token-classification/run_tf_ner.py
@@ -109,9 +109,9 @@ def main():
        level=logging.INFO,
    )
    logger.info(
-        "n_gpu: %s, distributed training: %s, 16-bits training: %s",
-        training_args.n_gpu,
-        bool(training_args.n_gpu > 1),
+        "n_replicas: %s, distributed training: %s, 16-bits training: %s",
+        training_args.n_replicas,
+        bool(training_args.n_replicas > 1),
        training_args.fp16,
    )
    logger.info("Training/evaluation parameters %s", training_args)
--- a/model_cards/MoseliMotsoehli/TswanaBert/README.md
+++ b/model_cards/MoseliMotsoehli/TswanaBert/README.md
@@ -0,0 +1,74 @@
+---
+language: setswana
+---
+
+# TswanaBert
+Pretrained model on the Tswana language using a masked language modeling (MLM) objective.
+
+## Model Description.
+TswanaBERT is a transformer model pre-trained on a corpus of Setswana in a self-supervised fashion by masking part of the input words and training to predict the masks by using byte-level tokens.
+
+## Intended uses & limitations
+The model can  be used for either masked language modeling or next word prediction. It can also be fine-tuned on a specific down-stream NLP application. 
+
+#### How to use
+
+```python
+>>> from transformers import pipeline
+>>> from transformers import AutoTokenizer, AutoModelWithLMHead
+
+>>> tokenizer = AutoTokenizer.from_pretrained("MoseliMotsoehli/TswanaBert")
+>>> model = AutoModelWithLMHead.from_pretrained("MoseliMotsoehli/TswanaBert")
+>>> unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)
+>>> unmasker("Ntshopotse <mask> e godile.")
+
+[{'score': 0.32749542593955994,
+  'sequence': '<s>Ntshopotse setse e godile.</s>',
+  'token': 538,
+  'token_str': 'Ġsetse'},
+ {'score': 0.060260992497205734,
+  'sequence': '<s>Ntshopotse le e godile.</s>',
+  'token': 270,
+  'token_str': 'Ġle'},
+ {'score': 0.058460816740989685,
+  'sequence': '<s>Ntshopotse bone e godile.</s>',
+  'token': 364,
+  'token_str': 'Ġbone'},
+ {'score': 0.05694682151079178,
+  'sequence': '<s>Ntshopotse ga e godile.</s>',
+  'token': 298,
+  'token_str': 'Ġga'},
+ {'score': 0.0565204992890358,
+  'sequence': '<s>Ntshopotse, e godile.</s>',
+  'token': 16,
+  'token_str': ','}]
+```
+
+#### Limitations and bias
+The model is trained on a relatively small collection of setwana, mostly from news articles and creative writtings, and so is not representative enough of the language as yet.
+
+## Training data
+
+1. The largest portion of this dataset (10k)  sentences of text, comes from the [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download)
+
+2. I Then added SABC news headlines collected by Marivate Vukosi, & Sefara Tshephisho, (2020)  that is generously made available on [zenoodo](http://doi.org/10.5281/zenodo.3668495 ). This added 185 tswana sentences to my corpus. 
+
+3. I went on to add 300 more sentences by scrapping following news sites and blogs that mosty originate in Botswana. I actively continue to expand the dataset.
+
+* http://setswana.blogspot.com/
+* https://omniglot.com/writing/tswana.php
+* http://www.dailynews.gov.bw/
+* http://www.mmegi.bw/index.php
+* https://tsena.co.bw
+* http://www.botswana.co.za/Cultural_Issues-travel/botswana-country-guide-en-route.html
+* https://www.poemhunter.com/poem/2013-setswana/
+https://www.poemhunter.com/poem/ngwana-wa-mosetsana/
+ 
+
+### BibTeX entry and citation info
+
+```bibtex
+@inproceedings{author = {Moseli Motsoehli},
+  year={2020}
+}
+```
--- a/model_cards/chrisliu298/arxiv_ai_gpt2/README.md
+++ b/model_cards/chrisliu298/arxiv_ai_gpt2/README.md
@@ -12,13 +12,13 @@ datasets:

 ## Model description

-This GPT-2 (774M) model is capable of generating abstracts given paper titles. It was trained using all research papers under aritficial intelligence (AI), machine learning (LG), computation and language (CL), and computer vision and pattern recognition (CV) on arXiv.
+This GPT-2 (774M) model is capable of generating abstracts given paper titles. It was trained using all research paper titles and abstracts under artificial intelligence (AI), machine learning (LG), computation and language (CL), and computer vision and pattern recognition (CV) on arXiv.

 ## Intended uses & limitations

 #### How to use

-To generate paper abstracts, use the provided `generate.py`. This file is very similar to HuggingFace's `run_generation.py` [here](https://github.com/huggingface/transformers/tree/master/examples/text-generation). You can simply replace the text with with your own model path (line 89) and change the input string to your paper title (line 127).
+To generate paper abstracts, use the provided `generate.py` [here](https://gist.github.com/chrisliu298/ccb8144888eace069da64ad3e6472d64). This is very similar to the HuggingFace's `run_generation.py` [here](https://github.com/huggingface/transformers/tree/master/examples/text-generation). You can simply replace the text with with your own model path (line 89) and change the input string to your paper title (line 127). If you want to use your own script, make sure to prepend `<|startoftext|> ` at the front and append ` <|sep|>` at the end of the paper title.

 ## Training data
 I selected a subset of the [arXiv Archive](https://github.com/staeiou/arxiv_archive) dataset (Geiger, 2019) as the training and evaluation data to fine-tune GPT-2. The original arXiv Archive dataset contains a full archive of metadata about papers on arxiv.org, from the start of the site in 1993 to the end of 2019. Our subset includes all the paper titles (query) and abstracts (context) under the Artificial Intelligence (cs.AI), Machine Learning (cs.LG), Computation and Language (cs.CL), and Computer Vision and Pattern Recognition (cs.CV) categories. I provide the information  of the sub-dataset and the distribution of the training and evaluation dataset as follows.
--- a/model_cards/gpt2-README.md
+++ b/model_cards/gpt2-README.md
@@ -13,8 +13,9 @@ Pretrained model on English language using a causal language modeling (CLM) obje
 [this paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
 and first released at [this page](https://openai.com/blog/better-language-models/).

-Disclaimer: The team releasing GPT-2 did not write a model card for this model so this model card has been written by
-the Hugging Face team.
+Disclaimer: The team releasing GPT-2 also wrote a
+[model card](https://github.com/openai/gpt-2/blob/master/model_card.md) for their model. Content from this model card
+has been written by the Hugging Face team to complete the information they provided and give specific examples of bias.

 ## Model description

@@ -79,7 +80,19 @@ output = model(encoded_input)
 ### Limitations and bias

 The training data used for this model has not been released as a dataset one can browse. We know it contains a lot of
-unfiltered from the internet, which is far from neutral. Therefore, the model can have biased predictions:
+unfiltered content from the internet, which is far from neutral. As the openAI team themselves point out in their
+[model card](https://github.com/openai/gpt-2/blob/master/model_card.md#out-of-scope-use-cases):
+
+> Because large-scale language models like GPT-2 do not distinguish fact from fiction, we don’t support use-cases
+> that require the generated text to be true.
+>
+> Additionally, language models like GPT-2 reflect the biases inherent to the systems they were trained on, so we do
+> not recommend that they be deployed into systems that interact with humans > unless the deployers first carry out a
+> study of biases relevant to the intended use-case. We found no statistically significant difference in gender, race,
+> and religious bias probes between 774M and 1.5B, implying all versions of GPT-2 should be approached with similar
+> levels of caution around use cases that are sensitive to biases around human attributes.
+
+Here's an example of how the model can have biased predictions:

 ```python
 >>> from transformers import pipeline, set_seed
@@ -110,7 +123,8 @@ This bias will also affect all fine-tuned versions of this model.
 The OpenAI team wanted to train this model on a corpus as large as possible. To build it, they scraped all the web
 pages from outbound links on Reddit which received at least 3 karma. Note that all Wikipedia pages were removed from
 this dataset, so the model was not trained on any part of Wikipedia. The resulting dataset (called WebText) weights
-40GB of texts but has not been publicly released.
+40GB of texts but has not been publicly released. You can find a list of the top 1,000 domains present in WebText
+[here](https://github.com/openai/gpt-2/blob/master/domains.txt).

 ## Training procedure

--- a/model_cards/mrm8488/electra-base-finetuned-squadv1/README.md
+++ b/model_cards/mrm8488/electra-base-finetuned-squadv1/README.md
@@ -0,0 +1,85 @@
+---
+language: english
+---
+
+# Electra base ⚡ + SQuAD v1 ❓
+
+[Electra-base-discriminator](https://huggingface.co/google/electra-base-discriminator) fine-tuned on [SQUAD v1.1 dataset](https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/) for **Q&A** downstream task.
+
+## Details of the downstream task (Q&A) - Model 🧠
+
+**ELECTRA** is a new method for self-supervised language representation learning. It can be used to pre-train transformer networks using relatively little compute. ELECTRA models are trained to distinguish "real" input tokens vs "fake" input tokens generated by another neural network, similar to the discriminator of a [GAN](https://arxiv.org/pdf/1406.2661.pdf). At small scale, ELECTRA achieves strong results even when trained on a single GPU. At large scale, ELECTRA achieves state-of-the-art results on the [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/) dataset.
+
+
+## Details of the downstream task (Q&A) - Dataset 📚
+
+**S**tanford **Q**uestion **A**nswering **D**ataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
+SQuAD v1.1 contains **100,000+** question-answer pairs on **500+** articles.
+
+## Model training 🏋️‍
+
+The model was trained on a Tesla P100 GPU and 25GB of RAM with the following command:
+
+```bash
+python transformers/examples/question-answering/run_squad.py \
+  --model_type electra \
+  --model_name_or_path 'google/electra-base-discriminator' \
+  --do_eval \
+  --do_train \
+  --do_lower_case \
+  --train_file '/content/dataset/train-v1.1.json' \
+  --predict_file '/content/dataset/dev-v1.1.json' \
+  --per_gpu_train_batch_size 16 \
+  --learning_rate 3e-5 \
+  --num_train_epochs 10 \
+  --max_seq_length 384 \
+  --doc_stride 128 \
+  --output_dir '/content/output' \
+  --overwrite_output_dir \
+  --save_steps 1000
+```
+
+## Test set Results 🧾
+
+| Metric | # Value   |
+| ------ | --------- |
+| **EM** | **83.03** |
+| **F1** | **90.77** |
+| **Size**| **+ 400 MB** |
+
+Very good metrics for such a "small" model!
+
+```json
+{
+'exact': 83.03689687795648,
+'f1': 90.77486052446231,
+'total': 10570,
+'HasAns_exact': 83.03689687795648,
+'HasAns_f1': 90.77486052446231,
+'HasAns_total': 10570,
+'best_exact': 83.03689687795648,
+'best_exact_thresh': 0.0,
+'best_f1': 90.77486052446231,
+'best_f1_thresh': 0.0
+}
+```
+
+### Model in action 🚀
+
+Fast usage with **pipelines**:
+
+```python
+from transformers import pipeline
+
+QnA_pipeline = pipeline('question-answering', model='mrm8488/electra-base-finetuned-squadv1')
+
+QnA_pipeline({
+    'context': 'A new strain of flu that has the potential to become a pandemic has been identified in China by scientists.',
+    'question': 'What has been discovered by scientists from China ?'
+})
+# Output:
+{'answer': 'A new strain of flu', 'end': 19, 'score': 0.9995211430099182, 'start': 0}
+```
+
+> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
+> Made with <span style="color: #e25555;">&hearts;</span> in Spain
--- a/model_cards/mrm8488/electra-small-finetuned-squadv2/README.md
+++ b/model_cards/mrm8488/electra-small-finetuned-squadv2/README.md
@@ -0,0 +1,86 @@
+---
+language: english
+---
+
+# Electra small ⚡ + SQuAD v2 ❓
+
+[Electra-small-discriminator](https://huggingface.co/google/electra-small-discriminator) fine-tuned on [SQUAD v2.0 dataset](https://rajpurkar.github.io/SQuAD-explorer/explore/v2.0/dev/) for **Q&A** downstream task.
+
+## Details of the downstream task (Q&A) - Model 🧠
+
+**ELECTRA** is a new method for self-supervised language representation learning. It can be used to pre-train transformer networks using relatively little compute. ELECTRA models are trained to distinguish "real" input tokens vs "fake" input tokens generated by another neural network, similar to the discriminator of a [GAN](https://arxiv.org/pdf/1406.2661.pdf). At small scale, ELECTRA achieves strong results even when trained on a single GPU. At large scale, ELECTRA achieves state-of-the-art results on the [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/) dataset.
+
+
+## Details of the downstream task (Q&A) - Dataset 📚
+
+**SQuAD2.0** combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.
+
+## Model training 🏋️‍
+
+The model was trained on a Tesla P100 GPU and 25GB of RAM with the following command:
+
+```bash
+python transformers/examples/question-answering/run_squad.py \
+  --model_type electra \
+  --model_name_or_path 'google/electra-small-discriminator' \
+  --do_eval \
+  --do_train \
+  --do_lower_case \
+  --train_file '/content/dataset/train-v2.0.json' \
+  --predict_file '/content/dataset/dev-v2.0.json' \
+  --per_gpu_train_batch_size 16 \
+  --learning_rate 3e-5 \
+  --num_train_epochs 10 \
+  --max_seq_length 384 \
+  --doc_stride 128 \
+  --output_dir '/content/output' \
+  --overwrite_output_dir \
+  --save_steps 1000 \
+  --version_2_with_negative
+```
+
+## Test set Results 🧾
+
+| Metric | # Value   |
+| ------ | --------- |
+| **EM** | **69.71** |
+| **F1** | **73.44** |
+| **Size**| **50 MB** |
+
+
+```json
+{
+'exact': 69.71279373368147,
+'f1': 73.4439546123672,
+'total': 11873,
+'HasAns_exact': 69.92240215924427,
+'HasAns_f1': 77.39542393937836,
+'HasAns_total': 5928,
+'NoAns_exact': 69.50378469301934,
+'NoAns_f1': 69.50378469301934,
+'NoAns_total': 5945,
+'best_exact': 69.71279373368147,
+'best_exact_thresh': 0.0,
+'best_f1': 73.44395461236732,
+'best_f1_thresh': 0.0
+}
+```
+
+### Model in action 🚀
+
+Fast usage with **pipelines**:
+
+```python
+from transformers import pipeline
+QnA_pipeline = pipeline('question-answering', model='mrm8488/electra-base-finetuned-squadv2')
+QnA_pipeline({
+    'context': 'A new strain of flu that has the potential to become a pandemic has been identified in China by scientists.',
+    'question': 'What has been discovered by scientists from China ?'
+})
+# Output:
+{'answer': 'A new strain of flu', 'end': 19, 'score': 0.8650811568752914, 'start': 0}
+```
+
+> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
+
+> Made with <span style="color: #e25555;">&hearts;</span> in Spain
--- a/model_cards/mrm8488/electricidad-small-finetuned-squadv1-es/README.md
+++ b/model_cards/mrm8488/electricidad-small-finetuned-squadv1-es/README.md
@@ -0,0 +1,102 @@
+---
+language: spanish
+thumbnail: https://imgur.com/uxAvBfh
+---
+
+# Electricidad small + Spanish SQuAD v1 ⚡❓
+
+[Electricidad-small-discriminator](https://huggingface.co/mrm8488/electricidad-small-discriminator) fine-tuned on [Spanish SQUAD v1.1 dataset](https://github.com/ccasimiro88/TranslateAlignRetrieve/tree/master/SQuAD-es-v1.1) for **Q&A** downstream task.
+
+## Details of the downstream task (Q&A) - Dataset 📚
+
+[SQuAD-es-v1.1](https://github.com/ccasimiro88/TranslateAlignRetrieve/tree/master/SQuAD-es-v1.1)
+
+| Dataset split | # Samples |
+| ------------- | --------- |
+| Train         | 130 K     |
+| Test          | 11 K      |
+
+## Model training 🏋️‍
+
+The model was trained on a Tesla P100 GPU and 25GB of RAM with the following command:
+
+```bash
+python /content/transformers/examples/question-answering/run_squad.py \
+  --model_type electra \
+  --model_name_or_path 'mrm8488/electricidad-small-discriminator' \
+  --do_eval \
+  --do_train \
+  --do_lower_case \
+  --train_file '/content/dataset/train-v1.1-es.json' \
+  --predict_file '/content/dataset/dev-v1.1-es.json' \
+  --per_gpu_train_batch_size 16 \
+  --learning_rate 3e-5 \
+  --num_train_epochs 10 \
+  --max_seq_length 384 \
+  --doc_stride 128 \
+  --output_dir '/content/electricidad-small-finetuned-squadv1-es' \
+  --overwrite_output_dir \
+  --save_steps 1000
+```
+
+## Test set Results 🧾
+
+| Metric | # Value   |
+| ------ | --------- |
+| **EM** | **46.82** |
+| **F1** | **64.79** |
+
+```json
+{
+'exact': 46.82119205298013,
+'f1': 64.79435260021918,
+'total': 10570,
+'HasAns_exact': 46.82119205298013,
+HasAns_f1': 64.79435260021918,
+'HasAns_total': 10570,
+'best_exact': 46.82119205298013,
+'best_exact_thresh': 0.0,
+'best_f1': 64.79435260021918,
+'best_f1_thresh': 0.0
+}
+```
+
+### Model in action 🚀
+
+Fast usage with **pipelines**:
+
+```python
+from transformers import pipeline
+
+qa_pipeline = pipeline(
+    "question-answering",
+    model="mrm8488/electricidad-small-finetuned-squadv1-es",
+    tokenizer="mrm8488/electricidad-small-finetuned-squadv1-es"
+)
+
+context = "Manuel ha creado una versión del modelo Electra small en español que alcanza una puntuación F1 de 65 en el dataset SQUAD-es y sólo pesa 50 MB"
+
+q1 = "Cuál es su marcador F1?"
+q2 = "¿Cuál es el tamaño del modelo?"
+q3 = "¿Quién lo ha creado?"
+q4 = "¿Que es lo que ha hecho Manuel?"
+
+
+questions = [q1, q2, q3, q4]
+
+for question in questions:
+  result = qa_pipeline({
+    'context': context,
+    'question': question})
+  print(result)
+
+# Output:
+{'score': 0.14836778166355025, 'start': 98, 'end': 100, 'answer': '65'}
+{'score': 0.32219420810758237, 'start': 136, 'end': 140, 'answer': '50 MB'}
+{'score': 0.9672326951118713, 'start': 0, 'end': 6, 'answer': 'Manuel'}
+{'score': 0.23552458113848118, 'start': 10, 'end': 53, 'answer': 'creado una versión del modelo Electra small'}
+```
+
+> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
+
+> Made with <span style="color: #e25555;">&hearts;</span> in Spain
--- a/model_cards/mrm8488/t5-base-finetuned-emotion/README.md
+++ b/model_cards/mrm8488/t5-base-finetuned-emotion/README.md
@@ -1,4 +1,4 @@
--
+---
 language: english
 ---

@@ -13,7 +13,7 @@ The **T5** model was presented in [Exploring the Limits of Transfer Learning wit

 Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.

-![model image](https://camo.githubusercontent.com/623b4dea0b653f2ad3f36c71ebfe749a677ac0a1/68747470733a2f2f6d69726f2e6d656469756d2e636f6d2f6d61782f343030362f312a44304a31674e51663876727255704b657944387750412e706e67)
+![model image](https://i.imgur.com/jVFMMWR.png)

 ## Details of the downstream task (Sentiment Recognition) - Dataset 📚

--- a/model_cards/mrm8488/t5-base-finetuned-sarcasm-twitter/README.md
+++ b/model_cards/mrm8488/t5-base-finetuned-sarcasm-twitter/README.md
@@ -1,4 +1,4 @@
--
+---
 language: english
 ---

@@ -11,8 +11,7 @@ The **T5** model was presented in [Exploring the Limits of Transfer Learning wit

 Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.

-![model image](https://camo.githubusercontent.com/623b4dea0b653f2ad3f36c71ebfe749a677ac0a1/68747470733a2f2f6d69726f2e6d656469756d2e636f6d2f6d61782f343030362f312a44304a31674e51663876727255704b657944387750412e706e67)
-
+![model image](https://i.imgur.com/jVFMMWR.png)
 ## Details of the downstream task (Sequence Classification as Text generation) - Dataset 📚

 [ Twitter Sarcasm Dataset](https://github.com/EducationalTestingService/sarcasm)
@@ -105,7 +104,7 @@ conversation = twit1 + me

 eval_conversation(conversation) #Output: 'derison'

-# We will get 'normal' when not sarcasm detected and 'derison' when detected
+# We will get 'normal' when sarcasm is not detected and 'derison' when detected
 ```

 > Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
--- a/model_cards/mrm8488/t5-base-finetuned-squadv2/README.md
+++ b/model_cards/mrm8488/t5-base-finetuned-squadv2/README.md
@@ -14,14 +14,17 @@ The **T5** model was presented in [Exploring the Limits of Transfer Learning wit

 Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.

+![model image](https://i.imgur.com/jVFMMWR.png)
+

 ## Details of the downstream task (Q&A) - Dataset 📚 🧐 ❓

 Dataset ID: ```squad_v2``` from  [HugginFace/NLP](https://github.com/huggingface/nlp)
+
 | Dataset  | Split | # samples |
 | -------- | ----- | --------- |
-| squad_v2 | train | 130319      |
-| squad_v2 | valid  | 11873     |
+| squad_v2 | train | 130319    |
+| squad_v2 | valid  | 11873    |

 How to load it from [nlp](https://github.com/huggingface/nlp)

--- a/model_cards/mrm8488/t5-base-finetuned-summarize-news/README.md
+++ b/model_cards/mrm8488/t5-base-finetuned-summarize-news/README.md
@@ -15,7 +15,7 @@ The **T5** model was presented in [Exploring the Limits of Transfer Learning wit

 Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.

-![model image](https://camo.githubusercontent.com/623b4dea0b653f2ad3f36c71ebfe749a677ac0a1/68747470733a2f2f6d69726f2e6d656469756d2e636f6d2f6d61782f343030362f312a44304a31674e51663876727255704b657944387750412e706e67)
+![model image](https://i.imgur.com/jVFMMWR.png)

 ## Details of the downstream task (Summarization) - Dataset 📚

--- a/model_cards/roberta-large-mnli-README.md
+++ b/model_cards/roberta-large-mnli-README.md
@@ -0,0 +1,22 @@
+---
+license: mit
+widget:
+- text: "I like you. </s></s> I love you."
+---
+
+
+## roberta-large-mnli
+
+Trained by Facebook, [original source](https://github.com/pytorch/fairseq/tree/master/examples/roberta)
+
+```bibtex
+@article{liu2019roberta,
+    title = {RoBERTa: A Robustly Optimized BERT Pretraining Approach},
+    author = {Yinhan Liu and Myle Ott and Naman Goyal and Jingfei Du and
+              Mandar Joshi and Danqi Chen and Omer Levy and Mike Lewis and
+              Luke Zettlemoyer and Veselin Stoyanov},
+    journal={arXiv preprint arXiv:1907.11692},
+    year = {2019},
+}
+```
+
--- a/model_cards/schmidek/electra-small-cased/README.md
+++ b/model_cards/schmidek/electra-small-cased/README.md
@@ -0,0 +1,11 @@
+---
+language: english
+license: apache-2.0
+---
+
+## ELECTRA-small-cased
+
+This is a cased version of `google/electra-small-discriminator`, trained on the
+[OpenWebText corpus](https://skylion007.github.io/OpenWebTextCorpus/).
+
+Uses the same tokenizer and vocab from `bert-base-cased`
--- a/notebooks/README.md
+++ b/notebooks/README.md
@@ -39,3 +39,4 @@ Pull Request so it can be included under the Community notebooks.
 |[Fine-tune BERT for Multi-label Classification](https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_multi_label_classification.ipynb)|How to fine-tune BERT for multi-label classification using PyTorch|[Abhishek Kumar Mishra](https://github.com/abhimishra91) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/abhimishra91/transformers-tutorials/blob/master/transformers_multi_label_classification.ipynb)|
 |[Fine-tune T5 for Summarization](https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_summarization_wandb.ipynb)|How to fine-tune T5 for summarization in PyTorch and track experiments with WandB|[Abhishek Kumar Mishra](https://github.com/abhimishra91) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/abhimishra91/transformers-tutorials/blob/master/transformers_summarization_wandb.ipynb)|
 |[Speed up Fine-Tuning in Transformers with Dynamic Padding / Bucketing](https://github.com/ELS-RD/transformers-notebook/blob/master/Divide_Hugging_Face_Transformers_training_time_by_2_or_more.ipynb)|How to speed up fine-tuning by a factor of 2 using dynamic padding / bucketing|[Michael Benesty](https://github.com/pommedeterresautee) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1CBfRU1zbfu7-ijiOqAAQUA-RJaxfcJoO?usp=sharing)|
+|[Pretrain Reformer for Masked Language Modeling](https://github.com/patrickvonplaten/notebooks/blob/master/Reformer_For_Masked_LM.ipynb)| How to train a Reformer model with bi-directional self-attention layers | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1tzzh0i8PgDQGV3SMFUGxM7_gGae3K-uW?usp=sharing)|
--- a/setup.py
+++ b/setup.py
@@ -73,11 +73,15 @@ extras["tf"] = [
    "tensorflow",
    "onnxconverter-common",
    "keras2onnx"
+    # "onnxconverter-common @ git+git://github.com/microsoft/onnxconverter-common.git@f64ca15989b6dc95a1f3507ff6e4c395ba12dff5#egg=onnxconverter-common",
+    # "keras2onnx @ git+git://github.com/onnx/keras-onnx.git@cbdc75cb950b16db7f0a67be96a278f8d2953b48#egg=keras2onnx"
 ]
 extras["tf-cpu"] = [
    "tensorflow-cpu",
    "onnxconverter-common",
    "keras2onnx"
+    # "onnxconverter-common @ git+git://github.com/microsoft/onnxconverter-common.git@f64ca15989b6dc95a1f3507ff6e4c395ba12dff5#egg=onnxconverter-common",
+    # "keras2onnx @ git+git://github.com/onnx/keras-onnx.git@cbdc75cb950b16db7f0a67be96a278f8d2953b48#egg=keras2onnx"
 ]
 extras["torch"] = ["torch"]

@@ -90,13 +94,14 @@ extras["docs"] = ["recommonmark", "sphinx", "sphinx-markdown-tables", "sphinx-rt
 extras["quality"] = [
    "black",
    "isort",
+    # "isort @ git+git://github.com/timothycrosley/isort.git@e63ae06ec7d70b06df9e528357650281a3d3ec22#egg=isort",
    "flake8",
 ]
 extras["dev"] = extras["testing"] + extras["quality"] + ["mecab-python3<1", "scikit-learn", "tensorflow", "torch"]

 setup(
    name="transformers",
-    version="3.0.0",
+    version="3.0.2",
    author="Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Sam Shleifer, Patrick von Platen, Google AI Language Team Authors, Open AI team Authors, Facebook AI Authors, Carnegie Mellon University Authors",
    author_email="thomas@huggingface.co",
    description="State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch",
@@ -109,7 +114,7 @@ setup(
    packages=find_packages("src"),
    install_requires=[
        "numpy",
-        "tokenizers == 0.8.0-rc4",
+        "tokenizers == 0.8.1.rc1",
        # dataclasses for Python versions that don't have it
        "dataclasses;python_version<'3.7'",
        # utilities from PyPA to e.g. compare versions
@@ -123,7 +128,7 @@ setup(
        # for OpenAI GPT
        "regex != 2019.12.17",
        # for XLNet
-        "sentencepiece",
+        "sentencepiece != 0.1.92",
        # for XLM
        "sacremoses",
    ],
--- a/src/transformers/init.py
+++ b/src/transformers/init.py
@@ -2,7 +2,7 @@
 # There's no way to ignore "F401 '...' imported but unused" warnings in this
 # module, but to preserve other warnings. So, don't check this module at all.

-__version__ = "3.0.0"
+__version__ = "3.0.2"

 # Work around to update TensorFlow's absl.logging threshold which alters the
 # default Python logging output behavior when present.
@@ -155,7 +155,7 @@ from .tokenization_xlm_roberta import XLMRobertaTokenizer
 from .tokenization_xlnet import SPIECE_UNDERLINE, XLNetTokenizer

 # Trainer
-from .trainer_utils import EvalPrediction
+from .trainer_utils import EvalPrediction, set_seed
 from .training_args import TrainingArguments
 from .training_args_tf import TFTrainingArguments

@@ -169,7 +169,8 @@ if is_sklearn_available():

 # Modeling
 if is_torch_available():
-    from .modeling_utils import PreTrainedModel, prune_layer, Conv1D, top_k_top_p_filtering, apply_chunking_to_forward
+    from .generation_utils import top_k_top_p_filtering
+    from .modeling_utils import PreTrainedModel, prune_layer, Conv1D, apply_chunking_to_forward
    from .modeling_auto import (
        AutoModel,
        AutoModelForPreTraining,
@@ -365,7 +366,9 @@ if is_torch_available():
        ReformerAttention,
        ReformerLayer,
        ReformerModel,
+        ReformerForMaskedLM,
        ReformerModelWithLMHead,
+        ReformerForQuestionAnswering,
        REFORMER_PRETRAINED_MODEL_ARCHIVE_LIST,
    )

@@ -396,7 +399,7 @@ if is_torch_available():
    )

    # Trainer
-    from .trainer import Trainer, set_seed, torch_distributed_zero_first, EvalPrediction
+    from .trainer import Trainer, torch_distributed_zero_first
    from .data.data_collator import default_data_collator, DataCollator, DataCollatorForLanguageModeling
    from .data.datasets import GlueDataset, TextDataset, LineByLineTextDataset, GlueDataTrainingArguments

@@ -406,9 +409,9 @@ if is_torch_available():

 # TensorFlow
 if is_tf_available():
+    from .generation_tf_utils import tf_top_k_top_p_filtering
    from .modeling_tf_utils import (
        shape_list,
-        tf_top_k_top_p_filtering,
        TFPreTrainedModel,
        TFSequenceSummary,
        TFSharedEmbeddings,
--- a/src/transformers/convert_pytorch_checkpoint_to_tf2.py
+++ b/src/transformers/convert_pytorch_checkpoint_to_tf2.py
@@ -71,10 +71,10 @@ from transformers import (
    XLMRobertaConfig,
    XLNetConfig,
    cached_path,
-    hf_bucket_url,
    is_torch_available,
    load_pytorch_checkpoint_in_tf2_model,
 )
+from transformers.file_utils import hf_bucket_url


 if is_torch_available():
--- a/src/transformers/data/data_collator.py
+++ b/src/transformers/data/data_collator.py
@@ -43,7 +43,8 @@ def default_data_collator(features: List[InputDataClass]) -> Dict[str, torch.Ten
    # Ensure that tensor is created with the correct type
    # (it should be automatically the case, but let's make sure of it.)
    if "label" in first and first["label"] is not None:
-        dtype = torch.long if type(first["label"]) is int else torch.float
+        label = first["label"].item() if isinstance(first["label"], torch.Tensor) else first["label"]
+        dtype = torch.long if isinstance(label, int) else torch.float
        batch["labels"] = torch.tensor([f["label"] for f in features], dtype=dtype)
    elif "label_ids" in first and first["label_ids"] is not None:
        if isinstance(first["label_ids"], torch.Tensor):
--- a/src/transformers/data/processors/glue.py
+++ b/src/transformers/data/processors/glue.py
@@ -17,6 +17,7 @@

 import logging
 import os
+from dataclasses import asdict
 from enum import Enum
 from typing import List, Optional, Union

@@ -81,26 +82,16 @@ if is_tf_available():

        def gen():
            for ex in features:
-                yield (
-                    {
-                        "input_ids": ex.input_ids,
-                        "attention_mask": ex.attention_mask,
-                        "token_type_ids": ex.token_type_ids,
-                    },
-                    ex.label,
-                )
+                d = {k: v for k, v in asdict(ex).items() if v is not None}
+                label = d.pop("label")
+                yield (d, label)
+
+        input_names = ["input_ids"] + tokenizer.model_input_names

        return tf.data.Dataset.from_generator(
            gen,
-            ({"input_ids": tf.int32, "attention_mask": tf.int32, "token_type_ids": tf.int32}, tf.int64),
-            (
-                {
-                    "input_ids": tf.TensorShape([None]),
-                    "attention_mask": tf.TensorShape([None]),
-                    "token_type_ids": tf.TensorShape([None]),
-                },
-                tf.TensorShape([]),
-            ),
+            ({k: tf.int32 for k in input_names}, tf.int64),
+            ({k: tf.TensorShape([None]) for k in input_names}, tf.TensorShape([])),
        )


--- a/src/transformers/data/processors/squad.py
+++ b/src/transformers/data/processors/squad.py
@@ -389,57 +389,102 @@ def squad_convert_examples_to_features(

        def gen():
            for i, ex in enumerate(features):
-                yield (
-                    {
-                        "input_ids": ex.input_ids,
-                        "attention_mask": ex.attention_mask,
-                        "token_type_ids": ex.token_type_ids,
-                        "feature_index": i,
-                        "qas_id": ex.qas_id,
-                    },
-                    {
-                        "start_positions": ex.start_position,
-                        "end_positions": ex.end_position,
-                        "cls_index": ex.cls_index,
-                        "p_mask": ex.p_mask,
-                        "is_impossible": ex.is_impossible,
-                    },
-                )
+                if ex.token_type_ids is None:
+                    yield (
+                        {
+                            "input_ids": ex.input_ids,
+                            "attention_mask": ex.attention_mask,
+                            "feature_index": i,
+                            "qas_id": ex.qas_id,
+                        },
+                        {
+                            "start_positions": ex.start_position,
+                            "end_positions": ex.end_position,
+                            "cls_index": ex.cls_index,
+                            "p_mask": ex.p_mask,
+                            "is_impossible": ex.is_impossible,
+                        },
+                    )
+                else:
+                    yield (
+                        {
+                            "input_ids": ex.input_ids,
+                            "attention_mask": ex.attention_mask,
+                            "token_type_ids": ex.token_type_ids,
+                            "feature_index": i,
+                            "qas_id": ex.qas_id,
+                        },
+                        {
+                            "start_positions": ex.start_position,
+                            "end_positions": ex.end_position,
+                            "cls_index": ex.cls_index,
+                            "p_mask": ex.p_mask,
+                            "is_impossible": ex.is_impossible,
+                        },
+                    )

        # Why have we split the batch into a tuple? PyTorch just has a list of tensors.
-        train_types = (
-            {
-                "input_ids": tf.int32,
-                "attention_mask": tf.int32,
-                "token_type_ids": tf.int32,
-                "feature_index": tf.int64,
-                "qas_id": tf.string,
-            },
-            {
-                "start_positions": tf.int64,
-                "end_positions": tf.int64,
-                "cls_index": tf.int64,
-                "p_mask": tf.int32,
-                "is_impossible": tf.int32,
-            },
-        )
+        if "token_type_ids" in tokenizer.model_input_names:
+            train_types = (
+                {
+                    "input_ids": tf.int32,
+                    "attention_mask": tf.int32,
+                    "token_type_ids": tf.int32,
+                    "feature_index": tf.int64,
+                    "qas_id": tf.string,
+                },
+                {
+                    "start_positions": tf.int64,
+                    "end_positions": tf.int64,
+                    "cls_index": tf.int64,
+                    "p_mask": tf.int32,
+                    "is_impossible": tf.int32,
+                },
+            )

-        train_shapes = (
-            {
-                "input_ids": tf.TensorShape([None]),
-                "attention_mask": tf.TensorShape([None]),
-                "token_type_ids": tf.TensorShape([None]),
-                "feature_index": tf.TensorShape([]),
-                "qas_id": tf.TensorShape([]),
-            },
-            {
-                "start_positions": tf.TensorShape([]),
-                "end_positions": tf.TensorShape([]),
-                "cls_index": tf.TensorShape([]),
-                "p_mask": tf.TensorShape([None]),
-                "is_impossible": tf.TensorShape([]),
-            },
-        )
+            train_shapes = (
+                {
+                    "input_ids": tf.TensorShape([None]),
+                    "attention_mask": tf.TensorShape([None]),
+                    "token_type_ids": tf.TensorShape([None]),
+                    "feature_index": tf.TensorShape([]),
+                    "qas_id": tf.TensorShape([]),
+                },
+                {
+                    "start_positions": tf.TensorShape([]),
+                    "end_positions": tf.TensorShape([]),
+                    "cls_index": tf.TensorShape([]),
+                    "p_mask": tf.TensorShape([None]),
+                    "is_impossible": tf.TensorShape([]),
+                },
+            )
+        else:
+            train_types = (
+                {"input_ids": tf.int32, "attention_mask": tf.int32, "feature_index": tf.int64, "qas_id": tf.string},
+                {
+                    "start_positions": tf.int64,
+                    "end_positions": tf.int64,
+                    "cls_index": tf.int64,
+                    "p_mask": tf.int32,
+                    "is_impossible": tf.int32,
+                },
+            )
+
+            train_shapes = (
+                {
+                    "input_ids": tf.TensorShape([None]),
+                    "attention_mask": tf.TensorShape([None]),
+                    "feature_index": tf.TensorShape([]),
+                    "qas_id": tf.TensorShape([]),
+                },
+                {
+                    "start_positions": tf.TensorShape([]),
+                    "end_positions": tf.TensorShape([]),
+                    "cls_index": tf.TensorShape([]),
+                    "p_mask": tf.TensorShape([None]),
+                    "is_impossible": tf.TensorShape([]),
+                },
+            )

        return tf.data.Dataset.from_generator(gen, train_types, train_shapes)
    else:
--- a/src/transformers/generation_tf_utils.py
+++ b/src/transformers/generation_tf_utils.py
--- a/src/transformers/generation_utils.py
+++ b/src/transformers/generation_utils.py
@@ -0,0 +1,993 @@
+# coding=utf-8
+# Copyright 2018 The Google AI Language Team Authors, Facebook AI Research authors and The HuggingFace Inc. team.
+# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import logging
+from typing import Iterable, Optional, Tuple
+
+import torch
+from torch import Tensor
+from torch.nn import functional as F
+
+
+logger = logging.getLogger(__name__)
+
+
+class GenerationMixin:
+    """
+    A class contraining all of the functions supporting generation, to be used as a mixin in PreTrainedModel.
+    """
+
+    def prepare_inputs_for_generation(self, input_ids, **kwargs):
+        return {"input_ids": input_ids}
+
+    def adjust_logits_during_generation(self, logits, **kwargs):
+        return logits
+
+    def _use_cache(self, outputs, use_cache):
+        """During generation, decide whether to pass the `past` variable to the next forward pass."""
+        if len(outputs) <= 1 or use_cache is False:
+            return False
+        if hasattr(self.config, "mem_len") and self.config.mem_len == 0:
+            return False
+        return True
+
+    def enforce_repetition_penalty_(self, lprobs, batch_size, num_beams, prev_output_tokens, repetition_penalty):
+        """repetition penalty (from CTRL paper https://arxiv.org/abs/1909.05858). """
+        for i in range(batch_size * num_beams):
+            for previous_token in set(prev_output_tokens[i].tolist()):
+                # if score < 0 then repetition penalty has to multiplied to reduce the previous token probability
+                if lprobs[i, previous_token] < 0:
+                    lprobs[i, previous_token] *= repetition_penalty
+                else:
+                    lprobs[i, previous_token] /= repetition_penalty
+
+    def postprocess_next_token_scores(
+        self,
+        scores,
+        input_ids,
+        no_repeat_ngram_size,
+        bad_words_ids,
+        cur_len,
+        min_length,
+        max_length,
+        eos_token_id,
+        repetition_penalty,
+        batch_size,
+        num_beams,
+    ):
+        # repetition penalty (from CTRL paper https://arxiv.org/abs/1909.05858)
+        if repetition_penalty != 1.0:
+            self.enforce_repetition_penalty_(
+                scores, batch_size, num_beams, input_ids, repetition_penalty,
+            )
+
+        # set eos token prob to zero if min_length is not reached
+        if eos_token_id is not None and cur_len < min_length:
+            scores[:, eos_token_id] = -float("inf")
+
+        if no_repeat_ngram_size > 0:
+            # calculate a list of banned tokens to prevent repetitively generating the same ngrams
+            num_batch_hypotheses = batch_size * num_beams
+            # from fairseq: https://github.com/pytorch/fairseq/blob/a07cb6f40480928c9e0548b737aadd36ee66ac76/fairseq/sequence_generator.py#L345
+            banned_batch_tokens = calc_banned_ngram_tokens(
+                input_ids, num_batch_hypotheses, no_repeat_ngram_size, cur_len
+            )
+            for i, banned_tokens in enumerate(banned_batch_tokens):
+                scores[i, banned_tokens] = -float("inf")
+
+        if bad_words_ids is not None:
+            # calculate a list of banned tokens according to bad words
+            banned_tokens = calc_banned_bad_words_ids(input_ids, bad_words_ids)
+
+            for i, banned_tokens in enumerate(banned_tokens):
+                scores[i, banned_tokens] = -float("inf")
+
+        return scores
+
+    @torch.no_grad()
+    def generate(
+        self,
+        input_ids: Optional[torch.LongTensor] = None,
+        max_length: Optional[int] = None,
+        min_length: Optional[int] = None,
+        do_sample: Optional[bool] = None,
+        early_stopping: Optional[bool] = None,
+        num_beams: Optional[int] = None,
+        temperature: Optional[float] = None,
+        top_k: Optional[int] = None,
+        top_p: Optional[float] = None,
+        repetition_penalty: Optional[float] = None,
+        bad_words_ids: Optional[Iterable[int]] = None,
+        bos_token_id: Optional[int] = None,
+        pad_token_id: Optional[int] = None,
+        eos_token_id: Optional[int] = None,
+        length_penalty: Optional[float] = None,
+        no_repeat_ngram_size: Optional[int] = None,
+        num_return_sequences: Optional[int] = None,
+        attention_mask: Optional[torch.LongTensor] = None,
+        decoder_start_token_id: Optional[int] = None,
+        use_cache: Optional[bool] = None,
+        **model_specific_kwargs
+    ) -> torch.LongTensor:
+        r""" Generates sequences for models with a LM head. The method currently supports greedy decoding, beam-search decoding, sampling with temperature, sampling with top-k or nucleus sampling.
+
+        Adapted in part from `Facebook's XLM beam search code`_.
+
+        .. _`Facebook's XLM beam search code`:
+           https://github.com/facebookresearch/XLM/blob/9e6f6814d17be4fe5b15f2e6c43eb2b2d76daeb4/src/model/transformer.py#L529
+
+
+        Parameters:
+
+            input_ids: (`optional`) `torch.LongTensor` of shape `(batch_size, sequence_length)`
+                The sequence used as a prompt for the generation. If `None` the method initializes
+                it as an empty `torch.LongTensor` of shape `(1,)`.
+
+            max_length: (`optional`) int
+                The max length of the sequence to be generated.  Between `min_length` and infinity. Default to 20.
+
+            min_length: (`optional`) int
+                The min length of the sequence to be generated.  Between 0 and infinity. Default to 0.
+
+            do_sample: (`optional`) bool
+                If set to `False` greedy decoding is used. Otherwise sampling is used. Defaults to `False` as defined in `configuration_utils.PretrainedConfig`.
+
+            early_stopping: (`optional`) bool
+                if set to `True` beam search is stopped when at least `num_beams` sentences finished per batch. Defaults to `False` as defined in `configuration_utils.PretrainedConfig`.
+
+            num_beams: (`optional`) int
+                Number of beams for beam search. Must be between 1 and infinity. 1 means no beam search. Default to 1.
+
+            temperature: (`optional`) float
+                The value used to module the next token probabilities. Must be strictly positive. Default to 1.0.
+
+            top_k: (`optional`) int
+                The number of highest probability vocabulary tokens to keep for top-k-filtering. Between 1 and infinity. Default to 50.
+
+            top_p: (`optional`) float
+                The cumulative probability of parameter highest probability vocabulary tokens to keep for nucleus sampling. Must be between 0 and 1. Default to 1.
+
+            repetition_penalty: (`optional`) float
+                The parameter for repetition penalty. Between 1.0 and infinity. 1.0 means no penalty. Default to 1.0.
+
+            pad_token_id: (`optional`) int
+                Padding token. Default to specicic model pad_token_id or None if it does not exist.
+
+            bos_token_id: (`optional`) int
+                BOS token. Defaults to `bos_token_id` as defined in the models config.
+
+            eos_token_id: (`optional`) int
+                EOS token. Defaults to `eos_token_id` as defined in the models config.
+
+            length_penalty: (`optional`) float
+                Exponential penalty to the length. Default to 1.
+
+            no_repeat_ngram_size: (`optional`) int
+                If set to int > 0, all ngrams of size `no_repeat_ngram_size` can only occur once.
+            bad_words_ids: (`optional`) list of lists of int
+                `bad_words_ids` contains tokens that are not allowed to be generated. In order to get the tokens of the words that should not appear in the generated text, use `tokenizer.encode(bad_word, add_prefix_space=True)`.
+
+            num_return_sequences: (`optional`) int
+                The number of independently computed returned sequences for each element in the batch. Default to 1.
+
+            attention_mask (`optional`) obj: `torch.LongTensor` of same shape as `input_ids`
+                Mask to avoid performing attention on padding token indices.
+                Mask values selected in ``[0, 1]``:
+                ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+                Defaults to `None`.
+
+                `What are attention masks? <../glossary.html#attention-mask>`__
+
+            decoder_start_token_id=None: (`optional`) int
+                If an encoder-decoder model starts decoding with a different token than BOS.
+                Defaults to `None` and is changed to `BOS` later.
+
+            use_cache: (`optional`) bool
+                If `use_cache` is True, past key values are used to speed up decoding if applicable to model. Defaults to `True`.
+
+            model_specific_kwargs: (`optional`) dict
+                Additional model specific kwargs will be forwarded to the `forward` function of the model.
+
+        Return:
+
+            output: `torch.LongTensor` of shape `(batch_size * num_return_sequences, sequence_length)`
+                sequence_length is either equal to max_length or shorter if all batches finished early due to the `eos_token_id`
+
+        Examples::
+
+            tokenizer = AutoTokenizer.from_pretrained('distilgpt2')   # Initialize tokenizer
+            model = AutoModelWithLMHead.from_pretrained('distilgpt2')    # Download model and configuration from S3 and cache.
+            outputs = model.generate(max_length=40)  # do greedy decoding
+            print('Generated: {}'.format(tokenizer.decode(outputs[0], skip_special_tokens=True)))
+
+            tokenizer = AutoTokenizer.from_pretrained('openai-gpt')   # Initialize tokenizer
+            model = AutoModelWithLMHead.from_pretrained('openai-gpt')    # Download model and configuration from S3 and cache.
+            input_context = 'The dog'
+            input_ids = tokenizer.encode(input_context, return_tensors='pt')  # encode input context
+            outputs = model.generate(input_ids=input_ids, num_beams=5, num_return_sequences=3, temperature=1.5)  # generate 3 independent sequences using beam search decoding (5 beams) with sampling from initial context 'The dog'
+            for i in range(3): #  3 output sequences were generated
+                print('Generated {}: {}'.format(i, tokenizer.decode(outputs[i], skip_special_tokens=True)))
+
+            tokenizer = AutoTokenizer.from_pretrained('distilgpt2')   # Initialize tokenizer
+            model = AutoModelWithLMHead.from_pretrained('distilgpt2')    # Download model and configuration from S3 and cache.
+            input_context = 'The dog'
+            input_ids = tokenizer.encode(input_context, return_tensors='pt')  # encode input context
+            outputs = model.generate(input_ids=input_ids, max_length=40, temperature=0.7, num_return_sequences=3)  # 3 generate sequences using by sampling
+            for i in range(3): #  3 output sequences were generated
+                print('Generated {}: {}'.format(i, tokenizer.decode(outputs[i], skip_special_tokens=True)))
+
+            tokenizer = AutoTokenizer.from_pretrained('ctrl')   # Initialize tokenizer
+            model = AutoModelWithLMHead.from_pretrained('ctrl')    # Download model and configuration from S3 and cache.
+            input_context = 'Legal My neighbor is'  # "Legal" is one of the control codes for ctrl
+            input_ids = tokenizer.encode(input_context, return_tensors='pt')  # encode input context
+            outputs = model.generate(input_ids=input_ids, max_length=50, temperature=0.7, repetition_penalty=1.2)  # generate sequences
+            print('Generated: {}'.format(tokenizer.decode(outputs[0], skip_special_tokens=True)))
+
+            tokenizer = AutoTokenizer.from_pretrained('gpt2')   # Initialize tokenizer
+            model = AutoModelWithLMHead.from_pretrained('gpt2')    # Download model and configuration from S3 and cache.
+            input_context = 'My cute dog'  # "Legal" is one of the control codes for ctrl
+            bad_words_ids = [tokenizer.encode(bad_word, add_prefix_space=True) for bad_word in ['idiot', 'stupid', 'shut up']]
+            input_ids = tokenizer.encode(input_context, return_tensors='pt')  # encode input context
+            outputs = model.generate(input_ids=input_ids, max_length=100, do_sample=True, bad_words_ids=bad_words_ids)  # generate sequences without allowing bad_words to be generated
+        """
+
+        # We cannot generate if the model does not have a LM head
+        if self.get_output_embeddings() is None:
+            raise AttributeError(
+                "You tried to generate sequences with a model that does not have a LM Head."
+                "Please use another model class (e.g. `OpenAIGPTLMHeadModel`, `XLNetLMHeadModel`, `GPT2LMHeadModel`, `CTRLLMHeadModel`, `T5WithLMHeadModel`, `TransfoXLLMHeadModel`, `XLMWithLMHeadModel`, `BartForConditionalGeneration` )"
+            )
+
+        max_length = max_length if max_length is not None else self.config.max_length
+        min_length = min_length if min_length is not None else self.config.min_length
+        do_sample = do_sample if do_sample is not None else self.config.do_sample
+        early_stopping = early_stopping if early_stopping is not None else self.config.early_stopping
+        use_cache = use_cache if use_cache is not None else self.config.use_cache
+        num_beams = num_beams if num_beams is not None else self.config.num_beams
+        temperature = temperature if temperature is not None else self.config.temperature
+        top_k = top_k if top_k is not None else self.config.top_k
+        top_p = top_p if top_p is not None else self.config.top_p
+        repetition_penalty = repetition_penalty if repetition_penalty is not None else self.config.repetition_penalty
+        bos_token_id = bos_token_id if bos_token_id is not None else self.config.bos_token_id
+        pad_token_id = pad_token_id if pad_token_id is not None else self.config.pad_token_id
+        eos_token_id = eos_token_id if eos_token_id is not None else self.config.eos_token_id
+        length_penalty = length_penalty if length_penalty is not None else self.config.length_penalty
+        no_repeat_ngram_size = (
+            no_repeat_ngram_size if no_repeat_ngram_size is not None else self.config.no_repeat_ngram_size
+        )
+        bad_words_ids = bad_words_ids if bad_words_ids is not None else self.config.bad_words_ids
+        num_return_sequences = (
+            num_return_sequences if num_return_sequences is not None else self.config.num_return_sequences
+        )
+        decoder_start_token_id = (
+            decoder_start_token_id if decoder_start_token_id is not None else self.config.decoder_start_token_id
+        )
+
+        if input_ids is not None:
+            batch_size = input_ids.shape[0]  # overriden by the input batch_size
+        else:
+            batch_size = 1
+
+        assert isinstance(max_length, int) and max_length > 0, "`max_length` should be a strictly positive integer."
+        assert isinstance(min_length, int) and min_length >= 0, "`min_length` should be a positive integer."
+        assert isinstance(do_sample, bool), "`do_sample` should be a boolean."
+        assert isinstance(early_stopping, bool), "`early_stopping` should be a boolean."
+        assert isinstance(use_cache, bool), "`use_cache` should be a boolean."
+        assert isinstance(num_beams, int) and num_beams > 0, "`num_beams` should be a strictly positive integer."
+        assert temperature > 0, "`temperature` should be strictly positive."
+        assert isinstance(top_k, int) and top_k >= 0, "`top_k` should be a positive integer."
+        assert 0 <= top_p <= 1, "`top_p` should be between 0 and 1."
+        assert repetition_penalty >= 1.0, "`repetition_penalty` should be >= 1."
+        assert input_ids is not None or (
+            isinstance(bos_token_id, int) and bos_token_id >= 0
+        ), "If input_ids is not defined, `bos_token_id` should be a positive integer."
+        assert pad_token_id is None or (
+            isinstance(pad_token_id, int) and (pad_token_id >= 0)
+        ), "`pad_token_id` should be a positive integer."
+        assert (eos_token_id is None) or (
+            isinstance(eos_token_id, int) and (eos_token_id >= 0)
+        ), "`eos_token_id` should be a positive integer."
+        assert length_penalty > 0, "`length_penalty` should be strictly positive."
+        assert (
+            isinstance(no_repeat_ngram_size, int) and no_repeat_ngram_size >= 0
+        ), "`no_repeat_ngram_size` should be a positive integer."
+        assert (
+            isinstance(num_return_sequences, int) and num_return_sequences > 0
+        ), "`num_return_sequences` should be a strictly positive integer."
+        assert (
+            bad_words_ids is None or isinstance(bad_words_ids, list) and isinstance(bad_words_ids[0], list)
+        ), "`bad_words_ids` is either `None` or a list of lists of tokens that should not be generated"
+
+        if input_ids is None:
+            assert isinstance(bos_token_id, int) and bos_token_id >= 0, (
+                "you should either supply a context to complete as `input_ids` input "
+                "or a `bos_token_id` (integer >= 0) as a first token to start the generation."
+            )
+            input_ids = torch.full(
+                (batch_size, 1), bos_token_id, dtype=torch.long, device=next(self.parameters()).device,
+            )
+        else:
+            assert input_ids.dim() == 2, "Input prompt should be of shape (batch_size, sequence length)."
+
+        # not allow to duplicate outputs when greedy decoding
+        if do_sample is False:
+            if num_beams == 1:
+                # no_beam_search greedy generation conditions
+                assert (
+                    num_return_sequences == 1
+                ), "Greedy decoding will always produce the same output for num_beams == 1 and num_return_sequences > 1. Please set num_return_sequences = 1"
+
+            else:
+                # beam_search greedy generation conditions
+                assert (
+                    num_beams >= num_return_sequences
+                ), "Greedy beam search decoding cannot return more sequences than it has beams. Please set num_beams >= num_return_sequences"
+
+        # create attention mask if necessary
+        # TODO (PVP): this should later be handled by the forward fn() in each model in the future see PR 3140
+        if (attention_mask is None) and (pad_token_id is not None) and (pad_token_id in input_ids):
+            attention_mask = input_ids.ne(pad_token_id).long()
+        elif attention_mask is None:
+            attention_mask = input_ids.new_ones(input_ids.shape)
+
+        # set pad_token_id to eos_token_id if not set. Important that this is done after
+        # attention_mask is created
+        if pad_token_id is None and eos_token_id is not None:
+            logger.warning(
+                "Setting `pad_token_id` to {} (first `eos_token_id`) to generate sequence".format(eos_token_id)
+            )
+            pad_token_id = eos_token_id
+
+        # current position and vocab size
+        if hasattr(self.config, "vocab_size"):
+            vocab_size = self.config.vocab_size
+        elif (
+            self.config.is_encoder_decoder
+            and hasattr(self.config, "decoder")
+            and hasattr(self.config.decoder, "vocab_size")
+        ):
+            vocab_size = self.config.decoder.vocab_size
+
+        # set effective batch size and effective batch multiplier according to do_sample
+        if do_sample:
+            effective_batch_size = batch_size * num_return_sequences
+            effective_batch_mult = num_return_sequences
+        else:
+            effective_batch_size = batch_size
+            effective_batch_mult = 1
+
+        if self.config.is_encoder_decoder:
+            if decoder_start_token_id is None:
+                decoder_start_token_id = bos_token_id
+
+            assert (
+                decoder_start_token_id is not None
+            ), "decoder_start_token_id or bos_token_id has to be defined for encoder-decoder generation"
+            assert hasattr(self, "get_encoder"), "{} should have a 'get_encoder' function defined".format(self)
+            assert callable(self.get_encoder), "{} should be a method".format(self.get_encoder)
+
+            # get encoder and store encoder outputs
+            encoder = self.get_encoder()
+
+            encoder_outputs: tuple = encoder(input_ids, attention_mask=attention_mask)
+
+        # Expand input ids if num_beams > 1 or num_return_sequences > 1
+        if num_return_sequences > 1 or num_beams > 1:
+            input_ids_len = input_ids.shape[-1]
+            input_ids = input_ids.unsqueeze(1).expand(batch_size, effective_batch_mult * num_beams, input_ids_len)
+            attention_mask = attention_mask.unsqueeze(1).expand(
+                batch_size, effective_batch_mult * num_beams, input_ids_len
+            )
+
+            input_ids = input_ids.contiguous().view(
+                effective_batch_size * num_beams, input_ids_len
+            )  # shape: (batch_size * num_return_sequences * num_beams, cur_len)
+            attention_mask = attention_mask.contiguous().view(
+                effective_batch_size * num_beams, input_ids_len
+            )  # shape: (batch_size * num_return_sequences * num_beams, cur_len)
+
+        if self.config.is_encoder_decoder:
+            # create empty decoder_input_ids
+            input_ids = torch.full(
+                (effective_batch_size * num_beams, 1),
+                decoder_start_token_id,
+                dtype=torch.long,
+                device=next(self.parameters()).device,
+            )
+            cur_len = 1
+
+            assert (
+                batch_size == encoder_outputs[0].shape[0]
+            ), f"expected encoder_outputs[0] to have 1st dimension bs={batch_size}, got {encoder_outputs[0].shape[0]} "
+
+            # expand batch_idx to assign correct encoder output for expanded input_ids (due to num_beams > 1 and num_return_sequences > 1)
+            expanded_batch_idxs = (
+                torch.arange(batch_size)
+                .view(-1, 1)
+                .repeat(1, num_beams * effective_batch_mult)
+                .view(-1)
+                .to(input_ids.device)
+            )
+            # expand encoder_outputs
+            encoder_outputs = (encoder_outputs[0].index_select(0, expanded_batch_idxs), *encoder_outputs[1:])
+
+        else:
+            encoder_outputs = None
+            cur_len = input_ids.shape[-1]
+
+        assert (
+            cur_len < max_length
+        ), f"The context has {cur_len} number of tokens, but `max_length` is only {max_length}. Please make sure that `max_length` is bigger than the number of tokens, by setting either `generate(max_length=...,...)` or `config.max_length = ...`"
+
+        if num_beams > 1:
+            output = self._generate_beam_search(
+                input_ids,
+                cur_len=cur_len,
+                max_length=max_length,
+                min_length=min_length,
+                do_sample=do_sample,
+                early_stopping=early_stopping,
+                temperature=temperature,
+                top_k=top_k,
+                top_p=top_p,
+                repetition_penalty=repetition_penalty,
+                no_repeat_ngram_size=no_repeat_ngram_size,
+                bad_words_ids=bad_words_ids,
+                pad_token_id=pad_token_id,
+                eos_token_id=eos_token_id,
+                batch_size=effective_batch_size,
+                num_return_sequences=num_return_sequences,
+                length_penalty=length_penalty,
+                num_beams=num_beams,
+                vocab_size=vocab_size,
+                encoder_outputs=encoder_outputs,
+                attention_mask=attention_mask,
+                use_cache=use_cache,
+                model_specific_kwargs=model_specific_kwargs,
+            )
+        else:
+            output = self._generate_no_beam_search(
+                input_ids,
+                cur_len=cur_len,
+                max_length=max_length,
+                min_length=min_length,
+                do_sample=do_sample,
+                temperature=temperature,
+                top_k=top_k,
+                top_p=top_p,
+                repetition_penalty=repetition_penalty,
+                no_repeat_ngram_size=no_repeat_ngram_size,
+                bad_words_ids=bad_words_ids,
+                pad_token_id=pad_token_id,
+                eos_token_id=eos_token_id,
+                batch_size=effective_batch_size,
+                encoder_outputs=encoder_outputs,
+                attention_mask=attention_mask,
+                use_cache=use_cache,
+                model_specific_kwargs=model_specific_kwargs,
+            )
+
+        return output
+
+    def _generate_no_beam_search(
+        self,
+        input_ids,
+        cur_len,
+        max_length,
+        min_length,
+        do_sample,
+        temperature,
+        top_k,
+        top_p,
+        repetition_penalty,
+        no_repeat_ngram_size,
+        bad_words_ids,
+        pad_token_id,
+        eos_token_id,
+        batch_size,
+        encoder_outputs,
+        attention_mask,
+        use_cache,
+        model_specific_kwargs,
+    ):
+        """ Generate sequences for each example without beam search (num_beams == 1).
+            All returned sequence are generated independantly.
+        """
+        # length of generated sentences / unfinished sentences
+        unfinished_sents = input_ids.new(batch_size).fill_(1)
+        sent_lengths = input_ids.new(batch_size).fill_(max_length)
+
+        past = (encoder_outputs, None) if encoder_outputs is not None else None
+
+        while cur_len < max_length:
+            model_inputs = self.prepare_inputs_for_generation(
+                input_ids, past=past, attention_mask=attention_mask, use_cache=use_cache, **model_specific_kwargs
+            )
+
+            outputs = self(**model_inputs)
+            next_token_logits = outputs[0][:, -1, :]
+
+            scores = self.postprocess_next_token_scores(
+                scores=next_token_logits,
+                input_ids=input_ids,
+                no_repeat_ngram_size=no_repeat_ngram_size,
+                bad_words_ids=bad_words_ids,
+                cur_len=cur_len,
+                min_length=min_length,
+                max_length=max_length,
+                eos_token_id=eos_token_id,
+                repetition_penalty=repetition_penalty,
+                batch_size=batch_size,
+                num_beams=1,
+            )
+
+            # if model has past, then set the past variable to speed up decoding
+            if self._use_cache(outputs, use_cache):
+                past = outputs[1]
+
+            if do_sample:
+                # Temperature (higher temperature => more likely to sample low probability tokens)
+                if temperature != 1.0:
+                    scores = scores / temperature
+                # Top-p/top-k filtering
+                next_token_logscores = top_k_top_p_filtering(scores, top_k=top_k, top_p=top_p)
+                # Sample
+                probs = F.softmax(next_token_logscores, dim=-1)
+                next_token = torch.multinomial(probs, num_samples=1).squeeze(1)
+            else:
+                # Greedy decoding
+                next_token = torch.argmax(next_token_logits, dim=-1)
+
+            # update generations and finished sentences
+            if eos_token_id is not None:
+                # pad finished sentences if eos_token_id exist
+                tokens_to_add = next_token * unfinished_sents + (pad_token_id) * (1 - unfinished_sents)
+            else:
+                tokens_to_add = next_token
+
+            # add token and increase length by one
+            input_ids = torch.cat([input_ids, tokens_to_add.unsqueeze(-1)], dim=-1)
+            cur_len = cur_len + 1
+
+            if eos_token_id is not None:
+                eos_in_sents = tokens_to_add == eos_token_id
+                # if sentence is unfinished and the token to add is eos, sent_lengths is filled with current length
+                is_sents_unfinished_and_token_to_add_is_eos = unfinished_sents.mul(eos_in_sents.long()).bool()
+                sent_lengths.masked_fill_(is_sents_unfinished_and_token_to_add_is_eos, cur_len)
+                # unfinished_sents is set to zero if eos in sentence
+                unfinished_sents.mul_((~eos_in_sents).long())
+
+            # stop when there is a </s> in each sentence, or if we exceed the maximul length
+            if unfinished_sents.max() == 0:
+                break
+
+            # extend attention_mask for new generated input if only decoder
+            if self.config.is_encoder_decoder is False:
+                attention_mask = torch.cat(
+                    [attention_mask, attention_mask.new_ones((attention_mask.shape[0], 1))], dim=-1
+                )
+
+        return input_ids
+
+    def _generate_beam_search(
+        self,
+        input_ids,
+        cur_len,
+        max_length,
+        min_length,
+        do_sample,
+        early_stopping,
+        temperature,
+        top_k,
+        top_p,
+        repetition_penalty,
+        no_repeat_ngram_size,
+        bad_words_ids,
+        pad_token_id,
+        eos_token_id,
+        batch_size,
+        num_return_sequences,
+        length_penalty,
+        num_beams,
+        vocab_size,
+        encoder_outputs,
+        attention_mask,
+        use_cache,
+        model_specific_kwargs,
+    ):
+        """ Generate sequences for each example with beam search.
+        """
+
+        # generated hypotheses
+        generated_hyps = [
+            BeamHypotheses(num_beams, max_length, length_penalty, early_stopping=early_stopping)
+            for _ in range(batch_size)
+        ]
+
+        # scores for each sentence in the beam
+        beam_scores = torch.zeros((batch_size, num_beams), dtype=torch.float, device=input_ids.device)
+
+        # for greedy decoding it is made sure that only tokens of the first beam are considered to avoid sampling the exact same tokens three times
+        if do_sample is False:
+            beam_scores[:, 1:] = -1e9
+        beam_scores = beam_scores.view(-1)  # shape (batch_size * num_beams,)
+
+        # cache compute states
+        past = (encoder_outputs, None) if encoder_outputs is not None else None
+
+        # done sentences
+        done = [False for _ in range(batch_size)]
+
+        while cur_len < max_length:
+            model_inputs = self.prepare_inputs_for_generation(
+                input_ids, past=past, attention_mask=attention_mask, use_cache=use_cache, **model_specific_kwargs
+            )
+            outputs = self(**model_inputs)  # (batch_size * num_beams, cur_len, vocab_size)
+            next_token_logits = outputs[0][:, -1, :]  # (batch_size * num_beams, vocab_size)
+
+            # if model has past, then set the past variable to speed up decoding
+            if self._use_cache(outputs, use_cache):
+                past = outputs[1]
+            if self.config.is_encoder_decoder and do_sample is False:
+                # TODO (PVP) still a bit hacky here - there might be a better solution
+                next_token_logits = self.adjust_logits_during_generation(
+                    next_token_logits, cur_len=cur_len, max_length=max_length
+                )
+
+            scores = F.log_softmax(next_token_logits, dim=-1)  # (batch_size * num_beams, vocab_size)
+
+            scores = self.postprocess_next_token_scores(
+                scores=scores,
+                input_ids=input_ids,
+                no_repeat_ngram_size=no_repeat_ngram_size,
+                bad_words_ids=bad_words_ids,
+                cur_len=cur_len,
+                min_length=min_length,
+                max_length=max_length,
+                eos_token_id=eos_token_id,
+                repetition_penalty=repetition_penalty,
+                batch_size=batch_size,
+                num_beams=num_beams,
+            )
+
+            assert scores.shape == (batch_size * num_beams, vocab_size), "Shapes of scores: {} != {}".format(
+                scores.shape, (batch_size * num_beams, vocab_size)
+            )
+
+            if do_sample:
+                _scores = scores + beam_scores[:, None].expand_as(scores)  # (batch_size * num_beams, vocab_size)
+                # Temperature
+                if temperature != 1.0:
+                    _scores = _scores / temperature
+                # Top-p/top-k filtering
+                _scores = top_k_top_p_filtering(
+                    _scores, top_k=top_k, top_p=top_p, min_tokens_to_keep=2
+                )  # (batch_size * num_beams, vocab_size)
+                # re-organize to group the beam together to sample from all beam_idxs
+                _scores = _scores.contiguous().view(
+                    batch_size, num_beams * vocab_size
+                )  # (batch_size, num_beams * vocab_size)
+
+                # Sample 2 next tokens for each beam (so we have some spare tokens and match output of greedy beam search)
+                probs = F.softmax(_scores, dim=-1)
+                next_tokens = torch.multinomial(probs, num_samples=2 * num_beams)  # (batch_size, num_beams * 2)
+                # Compute next scores
+                next_scores = torch.gather(_scores, -1, next_tokens)  # (batch_size, num_beams * 2)
+                # sort the sampled vector to make sure that the first num_beams samples are the best
+                next_scores, next_scores_indices = torch.sort(next_scores, descending=True, dim=1)
+                next_tokens = torch.gather(next_tokens, -1, next_scores_indices)  # (batch_size, num_beams * 2)
+
+            else:
+                next_scores = scores + beam_scores[:, None].expand_as(scores)  # (batch_size * num_beams, vocab_size)
+
+                # re-organize to group the beam together (we are keeping top hypothesis accross beams)
+                next_scores = next_scores.view(
+                    batch_size, num_beams * vocab_size
+                )  # (batch_size, num_beams * vocab_size)
+
+                next_scores, next_tokens = torch.topk(next_scores, 2 * num_beams, dim=1, largest=True, sorted=True)
+
+            assert next_scores.size() == next_tokens.size() == (batch_size, 2 * num_beams)
+
+            # next batch beam content
+            next_batch_beam = []
+
+            # for each sentence
+            for batch_idx in range(batch_size):
+
+                # if we are done with this sentence, add a pad token
+                if done[batch_idx]:
+                    assert (
+                        len(generated_hyps[batch_idx]) >= num_beams
+                    ), "Batch can only be done if at least {} beams have been generated".format(num_beams)
+                    assert (
+                        eos_token_id is not None and pad_token_id is not None
+                    ), "generated beams >= num_beams -> eos_token_id and pad_token have to be defined"
+                    next_batch_beam.extend([(0, pad_token_id, 0)] * num_beams)  # pad the batch
+                    continue
+
+                # next sentence beam content, this will get added to next_batch_beam
+                next_sent_beam = []
+
+                # next tokens for this sentence
+                for beam_token_rank, (beam_token_id, beam_token_score) in enumerate(
+                    zip(next_tokens[batch_idx], next_scores[batch_idx])
+                ):
+                    # get beam and token IDs
+                    beam_id = beam_token_id // vocab_size
+                    token_id = beam_token_id % vocab_size
+
+                    effective_beam_id = batch_idx * num_beams + beam_id
+                    # add to generated hypotheses if end of sentence
+                    if (eos_token_id is not None) and (token_id.item() == eos_token_id):
+                        # if beam_token does not belong to top num_beams tokens, it should not be added
+                        is_beam_token_worse_than_top_num_beams = beam_token_rank >= num_beams
+                        if is_beam_token_worse_than_top_num_beams:
+                            continue
+                        generated_hyps[batch_idx].add(
+                            input_ids[effective_beam_id].clone(), beam_token_score.item(),
+                        )
+                    else:
+                        # add next predicted token since it is not eos_token
+                        next_sent_beam.append((beam_token_score, token_id, effective_beam_id))
+
+                    # once the beam for next step is full, don't add more tokens to it.
+                    if len(next_sent_beam) == num_beams:
+                        break
+
+                # Check if we are done so that we can save a pad step if all(done)
+                done[batch_idx] = done[batch_idx] or generated_hyps[batch_idx].is_done(
+                    next_scores[batch_idx].max().item(), cur_len
+                )
+
+                # update next beam content
+                assert len(next_sent_beam) == num_beams, "Beam should always be full"
+                next_batch_beam.extend(next_sent_beam)
+                assert len(next_batch_beam) == num_beams * (batch_idx + 1), "We should have added num_beams each step"
+
+            # stop when we are done with each sentence
+            if all(done):
+                break
+
+            # sanity check / prepare next batch
+            assert len(next_batch_beam) == batch_size * num_beams
+            beam_scores = beam_scores.new([x[0] for x in next_batch_beam])
+            beam_tokens = input_ids.new([x[1] for x in next_batch_beam])
+            beam_idx = input_ids.new([x[2] for x in next_batch_beam])
+
+            # re-order batch and update current length
+            input_ids = input_ids[beam_idx, :]
+            input_ids = torch.cat([input_ids, beam_tokens.unsqueeze(1)], dim=-1)
+            cur_len = cur_len + 1
+
+            # re-order internal states
+            if past is not None:
+                past = self._reorder_cache(past, beam_idx)
+
+            # extend attention_mask for new generated input if only decoder
+            if self.config.is_encoder_decoder is False:
+                attention_mask = torch.cat(
+                    [attention_mask, attention_mask.new_ones((attention_mask.shape[0], 1))], dim=-1
+                )
+
+        # finalize all open beam hypotheses and add to generated hypotheses
+        for batch_idx in range(batch_size):
+            if done[batch_idx]:
+                continue
+
+            # test that beam scores match previously calculated scores if not eos and batch_idx not done
+            if eos_token_id is not None and all(
+                (token_id % vocab_size).item() != eos_token_id for token_id in next_tokens[batch_idx]
+            ):
+                assert torch.all(
+                    next_scores[batch_idx, :num_beams] == beam_scores.view(batch_size, num_beams)[batch_idx]
+                ), "If batch_idx is not done, final next scores: {} have to equal to accumulated beam_scores: {}".format(
+                    next_scores[:, :num_beams][batch_idx], beam_scores.view(batch_size, num_beams)[batch_idx],
+                )
+
+            # need to add best num_beams hypotheses to generated hyps
+            for beam_id in range(num_beams):
+                effective_beam_id = batch_idx * num_beams + beam_id
+                final_score = beam_scores[effective_beam_id].item()
+                final_tokens = input_ids[effective_beam_id]
+                generated_hyps[batch_idx].add(final_tokens, final_score)
+
+        # depending on whether greedy generation is wanted or not define different output_batch_size and output_num_return_sequences_per_batch
+        output_batch_size = batch_size if do_sample else batch_size * num_return_sequences
+        output_num_return_sequences_per_batch = 1 if do_sample else num_return_sequences
+
+        # select the best hypotheses
+        sent_lengths = input_ids.new(output_batch_size)
+        best = []
+
+        # retrieve best hypotheses
+        for i, hypotheses in enumerate(generated_hyps):
+            sorted_hyps = sorted(hypotheses.beams, key=lambda x: x[0])
+            for j in range(output_num_return_sequences_per_batch):
+                effective_batch_idx = output_num_return_sequences_per_batch * i + j
+                best_hyp = sorted_hyps.pop()[1]
+                sent_lengths[effective_batch_idx] = len(best_hyp)
+                best.append(best_hyp)
+
+        # shorter batches are padded
+        if sent_lengths.min().item() != sent_lengths.max().item():
+            assert pad_token_id is not None, "`Pad_token_id` has to be defined"
+            sent_max_len = min(sent_lengths.max().item() + 1, max_length)
+            decoded = input_ids.new(output_batch_size, sent_max_len).fill_(pad_token_id)
+
+            # fill with hypothesis and eos_token_id if necessary
+            for i, hypo in enumerate(best):
+                decoded[i, : sent_lengths[i]] = hypo
+                if sent_lengths[i] < max_length:
+                    decoded[i, sent_lengths[i]] = eos_token_id
+        else:
+            # none of the hypotheses have an eos_token
+            assert (len(hypo) == max_length for hypo in best)
+            decoded = torch.stack(best).type(torch.long).to(next(self.parameters()).device)
+
+        return decoded
+
+    @staticmethod
+    def _reorder_cache(past: Tuple, beam_idx: Tensor) -> Tuple[Tensor]:
+        return tuple(layer_past.index_select(1, beam_idx) for layer_past in past)
+
+
+def calc_banned_ngram_tokens(prev_input_ids: Tensor, num_hypos: int, no_repeat_ngram_size: int, cur_len: int) -> None:
+    """Copied from fairseq for no_repeat_ngram in beam_search"""
+    if cur_len + 1 < no_repeat_ngram_size:
+        # return no banned tokens if we haven't generated no_repeat_ngram_size tokens yet
+        return [[] for _ in range(num_hypos)]
+    generated_ngrams = [{} for _ in range(num_hypos)]
+    for idx in range(num_hypos):
+        gen_tokens = prev_input_ids[idx].tolist()
+        generated_ngram = generated_ngrams[idx]
+        for ngram in zip(*[gen_tokens[i:] for i in range(no_repeat_ngram_size)]):
+            prev_ngram_tuple = tuple(ngram[:-1])
+            generated_ngram[prev_ngram_tuple] = generated_ngram.get(prev_ngram_tuple, []) + [ngram[-1]]
+
+    def _get_generated_ngrams(hypo_idx):
+        # Before decoding the next token, prevent decoding of ngrams that have already appeared
+        start_idx = cur_len + 1 - no_repeat_ngram_size
+        ngram_idx = tuple(prev_input_ids[hypo_idx, start_idx:cur_len].tolist())
+        return generated_ngrams[hypo_idx].get(ngram_idx, [])
+
+    banned_tokens = [_get_generated_ngrams(hypo_idx) for hypo_idx in range(num_hypos)]
+    return banned_tokens
+
+
+def calc_banned_bad_words_ids(prev_input_ids: Iterable[int], bad_words_ids: Iterable[int]) -> Iterable[int]:
+    banned_tokens = []
+
+    def _tokens_match(prev_tokens, tokens):
+        if len(tokens) == 0:
+            # if bad word tokens is just one token always ban it
+            return True
+        if len(tokens) > len(prev_input_ids):
+            # if bad word tokens are longer then prev input_ids they can't be equal
+            return False
+
+        if prev_tokens[-len(tokens) :] == tokens:
+            # if tokens match
+            return True
+        else:
+            return False
+
+    for prev_input_ids_slice in prev_input_ids:
+        banned_tokens_slice = []
+
+        for banned_token_seq in bad_words_ids:
+            assert len(banned_token_seq) > 0, "Banned words token sequences {} cannot have an empty list".format(
+                bad_words_ids
+            )
+
+            if _tokens_match(prev_input_ids_slice.tolist(), banned_token_seq[:-1]) is False:
+                # if tokens do not match continue
+                continue
+
+            banned_tokens_slice.append(banned_token_seq[-1])
+
+        banned_tokens.append(banned_tokens_slice)
+
+    return banned_tokens
+
+
+def top_k_top_p_filtering(
+    logits: Tensor,
+    top_k: int = 0,
+    top_p: float = 1.0,
+    filter_value: float = -float("Inf"),
+    min_tokens_to_keep: int = 1,
+) -> Tensor:
+    """ Filter a distribution of logits using top-k and/or nucleus (top-p) filtering
+        Args:
+            logits: logits distribution shape (batch size, vocabulary size)
+            if top_k > 0: keep only top k tokens with highest probability (top-k filtering).
+            if top_p < 1.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering).
+                Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751)
+            Make sure we keep at least min_tokens_to_keep per batch example in the output
+        From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317
+    """
+    if top_k > 0:
+        top_k = min(max(top_k, min_tokens_to_keep), logits.size(-1))  # Safety check
+        # Remove all tokens with a probability less than the last token of the top-k
+        indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]
+        logits[indices_to_remove] = filter_value
+
+    if top_p < 1.0:
+        sorted_logits, sorted_indices = torch.sort(logits, descending=True)
+        cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
+
+        # Remove tokens with cumulative probability above the threshold (token with 0 are kept)
+        sorted_indices_to_remove = cumulative_probs > top_p
+        if min_tokens_to_keep > 1:
+            # Keep at least min_tokens_to_keep (set to min_tokens_to_keep-1 because we add the first one below)
+            sorted_indices_to_remove[..., :min_tokens_to_keep] = 0
+        # Shift the indices to the right to keep also the first token above the threshold
+        sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
+        sorted_indices_to_remove[..., 0] = 0
+
+        # scatter sorted tensors to original indexing
+        indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove)
+        logits[indices_to_remove] = filter_value
+    return logits
+
+
+class BeamHypotheses(object):
+    def __init__(self, num_beams, max_length, length_penalty, early_stopping):
+        """
+        Initialize n-best list of hypotheses.
+        """
+        self.max_length = max_length - 1  # ignoring bos_token
+        self.length_penalty = length_penalty
+        self.early_stopping = early_stopping
+        self.num_beams = num_beams
+        self.beams = []
+        self.worst_score = 1e9
+
+    def __len__(self):
+        """
+        Number of hypotheses in the list.
+        """
+        return len(self.beams)
+
+    def add(self, hyp, sum_logprobs):
+        """
+        Add a new hypothesis to the list.
+        """
+        score = sum_logprobs / len(hyp) ** self.length_penalty
+        if len(self) < self.num_beams or score > self.worst_score:
+            self.beams.append((score, hyp))
+            if len(self) > self.num_beams:
+                sorted_scores = sorted([(s, idx) for idx, (s, _) in enumerate(self.beams)])
+                del self.beams[sorted_scores[0][1]]
+                self.worst_score = sorted_scores[1][0]
+            else:
+                self.worst_score = min(score, self.worst_score)
+
+    def is_done(self, best_sum_logprobs, cur_len):
+        """
+        If there are enough hypotheses and that none of the hypotheses being generated
+        can become better than the worst one in the heap, then we are done with this sentence.
+        """
+
+        if len(self) < self.num_beams:
+            return False
+        elif self.early_stopping:
+            return True
+        else:
+            cur_score = best_sum_logprobs / cur_len ** self.length_penalty
+            ret = self.worst_score >= cur_score
+            return ret
--- a/src/transformers/modeling_auto.py
+++ b/src/transformers/modeling_auto.py
@@ -122,7 +122,12 @@ from .modeling_mobilebert import (
    MobileBertModel,
 )
 from .modeling_openai import OpenAIGPTLMHeadModel, OpenAIGPTModel
-from .modeling_reformer import ReformerModel, ReformerModelWithLMHead
+from .modeling_reformer import (
+    ReformerForMaskedLM,
+    ReformerForQuestionAnswering,
+    ReformerModel,
+    ReformerModelWithLMHead,
+)
 from .modeling_retribert import RetriBertModel
 from .modeling_roberta import (
    RobertaForMaskedLM,
@@ -266,6 +271,7 @@ MODEL_FOR_MASKED_LM_MAPPING = OrderedDict(
        (FlaubertConfig, FlaubertWithLMHeadModel),
        (XLMConfig, XLMWithLMHeadModel),
        (ElectraConfig, ElectraForMaskedLM),
+        (ReformerConfig, ReformerForMaskedLM),
    ]
 )

@@ -310,6 +316,7 @@ MODEL_FOR_QUESTION_ANSWERING_MAPPING = OrderedDict(
        (MobileBertConfig, MobileBertForQuestionAnswering),
        (XLMConfig, XLMForQuestionAnsweringSimple),
        (ElectraConfig, ElectraForQuestionAnswering),
+        (ReformerConfig, ReformerForQuestionAnswering),
    ]
 )

--- a/src/transformers/modeling_electra.py
+++ b/src/transformers/modeling_electra.py
@@ -133,7 +133,7 @@ class ElectraDiscriminatorPredictions(nn.Module):
        self.dense_prediction = nn.Linear(config.hidden_size, 1)
        self.config = config

-    def forward(self, discriminator_hidden_states, attention_mask):
+    def forward(self, discriminator_hidden_states):
        hidden_states = self.dense(discriminator_hidden_states)
        hidden_states = get_activation(self.config.hidden_act)(hidden_states)
        logits = self.dense_prediction(hidden_states).squeeze()
@@ -518,7 +518,7 @@ class ElectraForPreTraining(ElectraPreTrainedModel):
        )
        discriminator_sequence_output = discriminator_hidden_states[0]

-        logits = self.discriminator_predictions(discriminator_sequence_output, attention_mask)
+        logits = self.discriminator_predictions(discriminator_sequence_output)

        output = (logits,)

--- a/src/transformers/modeling_longformer.py
+++ b/src/transformers/modeling_longformer.py
--- a/src/transformers/modeling_mobilebert.py
+++ b/src/transformers/modeling_mobilebert.py
@@ -575,7 +575,7 @@ class MobileBertPooler(nn.Module):
            return first_token_tensor
        else:
            pooled_output = self.dense(first_token_tensor)
-            pooled_output = F.tanh(pooled_output)
+            pooled_output = torch.tanh(pooled_output)
            return pooled_output


--- a/src/transformers/modeling_reformer.py
+++ b/src/transformers/modeling_reformer.py
@@ -1704,6 +1704,7 @@ class ReformerModel(ReformerPreTrainedModel):
 class ReformerModelWithLMHead(ReformerPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
+        assert config.is_decoder, "If you want to use `ReformerLMHeadModel` make sure that `is_decoder=True`."
        self.reformer = ReformerModel(config)
        self.lm_head = ReformerOnlyLMHead(config)

@@ -1789,3 +1790,190 @@ class ReformerModelWithLMHead(ReformerPreTrainedModel):
            inputs_dict["num_hashes"] = kwargs["num_hashes"]

        return inputs_dict
+
+
+@add_start_docstrings("""Reformer Model with a `language modeling` head on top. """, REFORMER_START_DOCSTRING)
+class ReformerForMaskedLM(ReformerPreTrainedModel):
+    def __init__(self, config):
+        super().__init__(config)
+        assert (
+            not config.is_decoder
+        ), "If you want to use `ReformerForMaskedLM` make sure `config.is_decoder=False` for bi-directional self-attention."
+        self.reformer = ReformerModel(config)
+        self.lm_head = ReformerOnlyLMHead(config)
+
+        self.init_weights()
+
+    def get_output_embeddings(self):
+        return self.lm_head.decoder
+
+    def tie_weights(self):
+        # word embeddings are not tied in Reformer
+        pass
+
+    @add_start_docstrings_to_callable(REFORMER_INPUTS_DOCSTRING)
+    @add_code_sample_docstrings(tokenizer_class=_TOKENIZER_FOR_DOC, checkpoint="google/reformer-crime-and-punishment")
+    def forward(
+        self,
+        input_ids=None,
+        position_ids=None,
+        attention_mask=None,
+        head_mask=None,
+        inputs_embeds=None,
+        num_hashes=None,
+        labels=None,
+        output_hidden_states=None,
+        output_attentions=None,
+    ):
+        r"""
+        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
+                Labels for computing the masked language modeling loss.
+                Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
+                Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
+
+    Return:
+        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:
+        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):
+            Classification loss (cross entropy).
+        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)
+            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
+        all_hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
+            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
+            of shape :obj:`(batch_size, sequence_length, hidden_size)`.
+
+            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
+        all_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
+            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
+            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
+
+            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
+            heads.
+        """
+
+        reformer_outputs = self.reformer(
+            input_ids,
+            position_ids=position_ids,
+            attention_mask=attention_mask,
+            head_mask=head_mask,
+            inputs_embeds=inputs_embeds,
+            num_hashes=num_hashes,
+            output_hidden_states=output_hidden_states,
+            output_attentions=output_attentions,
+        )
+
+        sequence_output = reformer_outputs[0]
+        logits = self.lm_head(sequence_output)
+        outputs = (logits,) + reformer_outputs[1:]
+
+        if labels is not None:
+            loss_fct = CrossEntropyLoss()  # -100 index = padding token
+            masked_lm_loss = loss_fct(logits.view(-1, self.config.vocab_size), labels.view(-1))
+            outputs = (masked_lm_loss,) + outputs
+
+        return outputs  # (mlm_loss), lm_logits, (hidden_states), (attentions)
+
+
+@add_start_docstrings(
+    """Reformer Model with a span classification head on top for
+    extractive question-answering tasks like SQuAD / TriviaQA ( a linear layer on
+    top of hidden-states output to compute `span start logits` and `span end logits`. """,
+    REFORMER_START_DOCSTRING,
+)
+class ReformerForQuestionAnswering(ReformerPreTrainedModel):
+    def __init__(self, config):
+        super().__init__(config)
+        self.num_labels = config.num_labels
+
+        self.reformer = ReformerModel(config)
+        # 2 * config.hidden_size because we use reversible residual layers
+        self.qa_outputs = nn.Linear(2 * config.hidden_size, config.num_labels)
+
+        self.init_weights()
+
+    def tie_weights(self):
+        # word embeddings are not tied in Reformer
+        pass
+
+    @add_start_docstrings_to_callable(REFORMER_INPUTS_DOCSTRING)
+    @add_code_sample_docstrings(tokenizer_class=_TOKENIZER_FOR_DOC, checkpoint="google/reformer-crime-and-punishment")
+    def forward(
+        self,
+        input_ids=None,
+        position_ids=None,
+        attention_mask=None,
+        head_mask=None,
+        inputs_embeds=None,
+        num_hashes=None,
+        start_positions=None,
+        end_positions=None,
+        output_hidden_states=None,
+        output_attentions=None,
+    ):
+        r"""
+        start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):
+            Labels for position (index) of the start of the labelled span for computing the token classification loss.
+            Positions are clamped to the length of the sequence (`sequence_length`).
+            Position outside of the sequence are not taken into account for computing the loss.
+        end_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):
+            Labels for position (index) of the end of the labelled span for computing the token classification loss.
+            Positions are clamped to the length of the sequence (`sequence_length`).
+            Position outside of the sequence are not taken into account for computing the loss.
+    Return:
+        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.ReformerConfig`) and inputs:
+        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):
+            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.
+        start_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):
+            Span-start scores (before SoftMax).
+        end_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length,)`):
+            Span-end scores (before SoftMax).
+        all_hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
+            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
+            of shape :obj:`(batch_size, sequence_length, hidden_size)`.
+
+            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
+        all_attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
+            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
+            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
+
+            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
+            heads.
+        """
+
+        reformer_outputs = self.reformer(
+            input_ids,
+            position_ids=position_ids,
+            attention_mask=attention_mask,
+            head_mask=head_mask,
+            inputs_embeds=inputs_embeds,
+            num_hashes=num_hashes,
+            output_hidden_states=output_hidden_states,
+            output_attentions=output_attentions,
+        )
+
+        sequence_output = reformer_outputs[0]
+
+        logits = self.qa_outputs(sequence_output)
+        start_logits, end_logits = logits.split(1, dim=-1)
+        start_logits = start_logits.squeeze(-1)
+        end_logits = end_logits.squeeze(-1)
+
+        outputs = (start_logits, end_logits,) + reformer_outputs[1:]
+
+        if start_positions is not None and end_positions is not None:
+            # If we are on multi-GPU, split add a dimension
+            if len(start_positions.size()) > 1:
+                start_positions = start_positions.squeeze(-1)
+            if len(end_positions.size()) > 1:
+                end_positions = end_positions.squeeze(-1)
+            # sometimes the start/end positions are outside our model inputs, we ignore these terms
+            ignored_index = start_logits.size(1)
+            start_positions.clamp_(0, ignored_index)
+            end_positions.clamp_(0, ignored_index)
+
+            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)
+            start_loss = loss_fct(start_logits, start_positions)
+            end_loss = loss_fct(end_logits, end_positions)
+            total_loss = (start_loss + end_loss) / 2
+            outputs = (total_loss,) + outputs
+
+        return outputs  # (loss), start_logits, end_logits, (hidden_states), (attentions)
--- a/src/transformers/modeling_tf_auto.py
+++ b/src/transformers/modeling_tf_auto.py
@@ -141,7 +141,6 @@ logger = logging.getLogger(__name__)
 TF_MODEL_MAPPING = OrderedDict(
    [
        (AlbertConfig, TFAlbertModel),
-        (BertConfig, TFBertModel),
        (CamembertConfig, TFCamembertModel),
        (CTRLConfig, TFCTRLModel),
        (DistilBertConfig, TFDistilBertModel),
@@ -151,6 +150,7 @@ TF_MODEL_MAPPING = OrderedDict(
        (MobileBertConfig, TFMobileBertModel),
        (OpenAIGPTConfig, TFOpenAIGPTModel),
        (RobertaConfig, TFRobertaModel),
+        (BertConfig, TFBertModel),
        (T5Config, TFT5Model),
        (TransfoXLConfig, TFTransfoXLModel),
        (XLMConfig, TFXLMModel),
@@ -162,7 +162,6 @@ TF_MODEL_MAPPING = OrderedDict(
 TF_MODEL_FOR_PRETRAINING_MAPPING = OrderedDict(
    [
        (AlbertConfig, TFAlbertForPreTraining),
-        (BertConfig, TFBertForPreTraining),
        (CamembertConfig, TFCamembertForMaskedLM),
        (CTRLConfig, TFCTRLLMHeadModel),
        (DistilBertConfig, TFDistilBertForMaskedLM),
@@ -172,6 +171,7 @@ TF_MODEL_FOR_PRETRAINING_MAPPING = OrderedDict(
        (MobileBertConfig, TFMobileBertForPreTraining),
        (OpenAIGPTConfig, TFOpenAIGPTLMHeadModel),
        (RobertaConfig, TFRobertaForMaskedLM),
+        (BertConfig, TFBertForPreTraining),
        (T5Config, TFT5ForConditionalGeneration),
        (TransfoXLConfig, TFTransfoXLLMHeadModel),
        (XLMConfig, TFXLMWithLMHeadModel),
@@ -183,7 +183,6 @@ TF_MODEL_FOR_PRETRAINING_MAPPING = OrderedDict(
 TF_MODEL_WITH_LM_HEAD_MAPPING = OrderedDict(
    [
        (AlbertConfig, TFAlbertForMaskedLM),
-        (BertConfig, TFBertForMaskedLM),
        (CamembertConfig, TFCamembertForMaskedLM),
        (CTRLConfig, TFCTRLLMHeadModel),
        (DistilBertConfig, TFDistilBertForMaskedLM),
@@ -193,6 +192,7 @@ TF_MODEL_WITH_LM_HEAD_MAPPING = OrderedDict(
        (MobileBertConfig, TFMobileBertForMaskedLM),
        (OpenAIGPTConfig, TFOpenAIGPTLMHeadModel),
        (RobertaConfig, TFRobertaForMaskedLM),
+        (BertConfig, TFBertForMaskedLM),
        (T5Config, TFT5ForConditionalGeneration),
        (TransfoXLConfig, TFTransfoXLLMHeadModel),
        (XLMConfig, TFXLMWithLMHeadModel),
@@ -204,12 +204,12 @@ TF_MODEL_WITH_LM_HEAD_MAPPING = OrderedDict(
 TF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING = OrderedDict(
    [
        (AlbertConfig, TFAlbertForMultipleChoice),
-        (BertConfig, TFBertForMultipleChoice),
        (CamembertConfig, TFCamembertForMultipleChoice),
        (DistilBertConfig, TFDistilBertForMultipleChoice),
        (FlaubertConfig, TFFlaubertForMultipleChoice),
        (MobileBertConfig, TFMobileBertForMultipleChoice),
        (RobertaConfig, TFRobertaForMultipleChoice),
+        (BertConfig, TFBertForMultipleChoice),
        (XLMConfig, TFXLMForMultipleChoice),
        (XLMRobertaConfig, TFXLMRobertaForMultipleChoice),
        (XLNetConfig, TFXLNetForMultipleChoice),
@@ -219,13 +219,13 @@ TF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING = OrderedDict(
 TF_MODEL_FOR_QUESTION_ANSWERING_MAPPING = OrderedDict(
    [
        (AlbertConfig, TFAlbertForQuestionAnswering),
-        (BertConfig, TFBertForQuestionAnswering),
        (CamembertConfig, TFCamembertForQuestionAnswering),
        (DistilBertConfig, TFDistilBertForQuestionAnswering),
        (ElectraConfig, TFElectraForQuestionAnswering),
        (FlaubertConfig, TFFlaubertForQuestionAnsweringSimple),
        (MobileBertConfig, TFMobileBertForQuestionAnswering),
        (RobertaConfig, TFRobertaForQuestionAnswering),
+        (BertConfig, TFBertForQuestionAnswering),
        (XLMConfig, TFXLMForQuestionAnsweringSimple),
        (XLMRobertaConfig, TFXLMRobertaForQuestionAnswering),
        (XLNetConfig, TFXLNetForQuestionAnsweringSimple),
@@ -235,12 +235,12 @@ TF_MODEL_FOR_QUESTION_ANSWERING_MAPPING = OrderedDict(
 TF_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING = OrderedDict(
    [
        (AlbertConfig, TFAlbertForSequenceClassification),
-        (BertConfig, TFBertForSequenceClassification),
        (CamembertConfig, TFCamembertForSequenceClassification),
        (DistilBertConfig, TFDistilBertForSequenceClassification),
        (FlaubertConfig, TFFlaubertForSequenceClassification),
        (MobileBertConfig, TFMobileBertForSequenceClassification),
        (RobertaConfig, TFRobertaForSequenceClassification),
+        (BertConfig, TFBertForSequenceClassification),
        (XLMConfig, TFXLMForSequenceClassification),
        (XLMRobertaConfig, TFXLMRobertaForSequenceClassification),
        (XLNetConfig, TFXLNetForSequenceClassification),
@@ -250,13 +250,13 @@ TF_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING = OrderedDict(
 TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING = OrderedDict(
    [
        (AlbertConfig, TFAlbertForTokenClassification),
-        (BertConfig, TFBertForTokenClassification),
        (CamembertConfig, TFCamembertForTokenClassification),
        (DistilBertConfig, TFDistilBertForTokenClassification),
        (ElectraConfig, TFElectraForTokenClassification),
        (FlaubertConfig, TFFlaubertForTokenClassification),
        (MobileBertConfig, TFMobileBertForTokenClassification),
        (RobertaConfig, TFRobertaForTokenClassification),
+        (BertConfig, TFBertForTokenClassification),
        (XLMConfig, TFXLMForTokenClassification),
        (XLMRobertaConfig, TFXLMRobertaForTokenClassification),
        (XLNetConfig, TFXLNetForTokenClassification),
--- a/src/transformers/modeling_tf_t5.py
+++ b/src/transformers/modeling_tf_t5.py
@@ -478,7 +478,7 @@ class TFT5Block(tf.keras.layers.Layer):
        return outputs  # hidden-states, present_key_value_states, (self-attention weights), (self-attention position bias), (cross-attention weights), (cross-attention position bias)


-class _NoLayerEmbedTokens(object):
+class _NoLayerEmbedTokens:
    """
     this class wraps a the TFSharedEmbeddingTokens layer into a python 'no-keras-layer'
     class to avoid problem with weight restoring. Also it makes sure that the layer is
@@ -655,7 +655,7 @@ class TFT5MainLayer(tf.keras.layers.Layer):
        # Since we are adding it to the raw scores before the softmax, this is
        # effectively the same as removing these entirely.

-        # T5 has a mask that can compare sequence ids, we can simulate this here with this transposistion
+        # T5 has a mask that can compare sequence ids, we can simulate this here with this transposition
        # Cf. https://github.com/tensorflow/mesh/blob/8d2465e9bc93129b913b5ccc6a59aa97abd96ec6/mesh_tensorflow/transformer/transformer_layers.py#L270
        # extended_attention_mask = tf.math.equal(extended_attention_mask,
        #                                         tf.transpose(extended_attention_mask, perm=(-1, -2)))
@@ -682,16 +682,8 @@ class TFT5MainLayer(tf.keras.layers.Layer):
        else:
            encoder_extended_attention_mask = None

-        # Prepare head mask if needed
-        # 1.0 in head_mask indicate we keep the head
-        # attention_probs has shape bsz x n_heads x N x N
-        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
-        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
-        if head_mask is not None:
-            raise NotImplementedError
-        else:
-            head_mask = [None] * self.num_hidden_layers
-            # head_mask = tf.constant([0] * self.num_hidden_layers)
+        assert head_mask is None, "Head mask not supported"
+        head_mask = [None] * self.num_hidden_layers

        present_key_value_states = ()
        all_hidden_states = ()
@@ -1054,8 +1046,6 @@ class TFT5ForConditionalGeneration(TFT5PreTrainedModel):
        r"""
    Returns:
        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.T5Config`) and inputs:
-        loss (:obj:`tf.Tensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`lm_label` is provided):
-            Classification loss (cross entropy).
        prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)
            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
        decoder_past_key_value_states (:obj:`tuple(tuple(tf.Tensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length, embed_size_per_head)`, `optional`, returned when ``use_cache=True``):
--- a/src/transformers/modeling_tf_utils.py
+++ b/src/transformers/modeling_tf_utils.py
--- a/src/transformers/modeling_transfo_xl.py
+++ b/src/transformers/modeling_transfo_xl.py
@@ -1016,11 +1016,14 @@ class TransfoXLLMHeadModel(TransfoXLPreTrainedModel):
            return self.crit.out_layers[-1]

    def prepare_inputs_for_generation(self, input_ids, past, **model_kwargs):
-        inputs = {"input_ids": input_ids}
+        inputs = {}

        # if past is defined in model kwargs then use it for faster decoding
        if past:
            inputs["mems"] = past
+            inputs["input_ids"] = input_ids[:, -1].unsqueeze(-1)
+        else:
+            inputs["input_ids"] = input_ids

        return inputs

--- a/src/transformers/modeling_utils.py
+++ b/src/transformers/modeling_utils.py
@@ -17,7 +17,7 @@
 import inspect
 import logging
 import os
-from typing import Callable, Dict, Iterable, List, Optional, Tuple
+from typing import Callable, Dict, List, Optional, Tuple

 import torch
 from torch import Tensor, device, dtype, nn
@@ -35,6 +35,7 @@ from .file_utils import (
    hf_bucket_url,
    is_remote_url,
 )
+from .generation_utils import GenerationMixin


 logger = logging.getLogger(__name__)
@@ -261,7 +262,7 @@ class ModuleUtilsMixin:
        return head_mask


-class PreTrainedModel(nn.Module, ModuleUtilsMixin):
+class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin):
    r""" Base class for all models.

        :class:`~transformers.PreTrainedModel` takes care of storing the configuration of the models and handles methods for loading/downloading/saving models
@@ -801,967 +802,6 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):

        return model

-    def prepare_inputs_for_generation(self, input_ids, **kwargs):
-        return {"input_ids": input_ids}
-
-    def adjust_logits_during_generation(self, logits, **kwargs):
-        return logits
-
-    def _use_cache(self, outputs, use_cache):
-        """During generation, decide whether to pass the `past` variable to the next forward pass."""
-        if len(outputs) <= 1 or use_cache is False:
-            return False
-        if hasattr(self.config, "mem_len") and self.config.mem_len == 0:
-            return False
-        return True
-
-    def enforce_repetition_penalty_(self, lprobs, batch_size, num_beams, prev_output_tokens, repetition_penalty):
-        """repetition penalty (from CTRL paper https://arxiv.org/abs/1909.05858). """
-        for i in range(batch_size * num_beams):
-            for previous_token in set(prev_output_tokens[i].tolist()):
-                # if score < 0 then repetition penalty has to multiplied to reduce the previous token probability
-                if lprobs[i, previous_token] < 0:
-                    lprobs[i, previous_token] *= repetition_penalty
-                else:
-                    lprobs[i, previous_token] /= repetition_penalty
-
-    def postprocess_next_token_scores(
-        self,
-        scores,
-        input_ids,
-        no_repeat_ngram_size,
-        bad_words_ids,
-        cur_len,
-        min_length,
-        max_length,
-        eos_token_id,
-        repetition_penalty,
-        batch_size,
-        num_beams,
-    ):
-        # repetition penalty (from CTRL paper https://arxiv.org/abs/1909.05858)
-        if repetition_penalty != 1.0:
-            self.enforce_repetition_penalty_(
-                scores, batch_size, num_beams, input_ids, repetition_penalty,
-            )
-
-        # set eos token prob to zero if min_length is not reached
-        if eos_token_id is not None and cur_len < min_length:
-            scores[:, eos_token_id] = -float("inf")
-
-        if no_repeat_ngram_size > 0:
-            # calculate a list of banned tokens to prevent repetitively generating the same ngrams
-            num_batch_hypotheses = batch_size * num_beams
-            # from fairseq: https://github.com/pytorch/fairseq/blob/a07cb6f40480928c9e0548b737aadd36ee66ac76/fairseq/sequence_generator.py#L345
-            banned_batch_tokens = calc_banned_ngram_tokens(
-                input_ids, num_batch_hypotheses, no_repeat_ngram_size, cur_len
-            )
-            for i, banned_tokens in enumerate(banned_batch_tokens):
-                scores[i, banned_tokens] = -float("inf")
-
-        if bad_words_ids is not None:
-            # calculate a list of banned tokens according to bad words
-            banned_tokens = calc_banned_bad_words_ids(input_ids, bad_words_ids)
-
-            for i, banned_tokens in enumerate(banned_tokens):
-                scores[i, banned_tokens] = -float("inf")
-
-        return scores
-
-    @torch.no_grad()
-    def generate(
-        self,
-        input_ids: Optional[torch.LongTensor] = None,
-        max_length: Optional[int] = None,
-        min_length: Optional[int] = None,
-        do_sample: Optional[bool] = None,
-        early_stopping: Optional[bool] = None,
-        num_beams: Optional[int] = None,
-        temperature: Optional[float] = None,
-        top_k: Optional[int] = None,
-        top_p: Optional[float] = None,
-        repetition_penalty: Optional[float] = None,
-        bad_words_ids: Optional[Iterable[int]] = None,
-        bos_token_id: Optional[int] = None,
-        pad_token_id: Optional[int] = None,
-        eos_token_id: Optional[int] = None,
-        length_penalty: Optional[float] = None,
-        no_repeat_ngram_size: Optional[int] = None,
-        num_return_sequences: Optional[int] = None,
-        attention_mask: Optional[torch.LongTensor] = None,
-        decoder_start_token_id: Optional[int] = None,
-        use_cache: Optional[bool] = None,
-        **model_specific_kwargs
-    ) -> torch.LongTensor:
-        r""" Generates sequences for models with a LM head. The method currently supports greedy decoding, beam-search decoding, sampling with temperature, sampling with top-k or nucleus sampling.
-
-        Adapted in part from `Facebook's XLM beam search code`_.
-
-        .. _`Facebook's XLM beam search code`:
-           https://github.com/facebookresearch/XLM/blob/9e6f6814d17be4fe5b15f2e6c43eb2b2d76daeb4/src/model/transformer.py#L529
-
-
-        Parameters:
-
-            input_ids: (`optional`) `torch.LongTensor` of shape `(batch_size, sequence_length)`
-                The sequence used as a prompt for the generation. If `None` the method initializes
-                it as an empty `torch.LongTensor` of shape `(1,)`.
-
-            max_length: (`optional`) int
-                The max length of the sequence to be generated.  Between `min_length` and infinity. Default to 20.
-
-            min_length: (`optional`) int
-                The min length of the sequence to be generated.  Between 0 and infinity. Default to 0.
-
-            do_sample: (`optional`) bool
-                If set to `False` greedy decoding is used. Otherwise sampling is used. Defaults to `False` as defined in `configuration_utils.PretrainedConfig`.
-
-            early_stopping: (`optional`) bool
-                if set to `True` beam search is stopped when at least `num_beams` sentences finished per batch. Defaults to `False` as defined in `configuration_utils.PretrainedConfig`.
-
-            num_beams: (`optional`) int
-                Number of beams for beam search. Must be between 1 and infinity. 1 means no beam search. Default to 1.
-
-            temperature: (`optional`) float
-                The value used to module the next token probabilities. Must be strictly positive. Default to 1.0.
-
-            top_k: (`optional`) int
-                The number of highest probability vocabulary tokens to keep for top-k-filtering. Between 1 and infinity. Default to 50.
-
-            top_p: (`optional`) float
-                The cumulative probability of parameter highest probability vocabulary tokens to keep for nucleus sampling. Must be between 0 and 1. Default to 1.
-
-            repetition_penalty: (`optional`) float
-                The parameter for repetition penalty. Between 1.0 and infinity. 1.0 means no penalty. Default to 1.0.
-
-            pad_token_id: (`optional`) int
-                Padding token. Default to specicic model pad_token_id or None if it does not exist.
-
-            bos_token_id: (`optional`) int
-                BOS token. Defaults to `bos_token_id` as defined in the models config.
-
-            eos_token_id: (`optional`) int
-                EOS token. Defaults to `eos_token_id` as defined in the models config.
-
-            length_penalty: (`optional`) float
-                Exponential penalty to the length. Default to 1.
-
-            no_repeat_ngram_size: (`optional`) int
-                If set to int > 0, all ngrams of size `no_repeat_ngram_size` can only occur once.
-            bad_words_ids: (`optional`) list of lists of int
-                `bad_words_ids` contains tokens that are not allowed to be generated. In order to get the tokens of the words that should not appear in the generated text, use `tokenizer.encode(bad_word, add_prefix_space=True)`.
-
-            num_return_sequences: (`optional`) int
-                The number of independently computed returned sequences for each element in the batch. Default to 1.
-
-            attention_mask (`optional`) obj: `torch.LongTensor` of same shape as `input_ids`
-                Mask to avoid performing attention on padding token indices.
-                Mask values selected in ``[0, 1]``:
-                ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
-                Defaults to `None`.
-
-                `What are attention masks? <../glossary.html#attention-mask>`__
-
-            decoder_start_token_id=None: (`optional`) int
-                Start token id for the decoder. Defaults to ``decoder_start_token_id`` as defined the model's config or to the ``bos_token_id``
-                if no ``decoder_start_token_id`` is found in the config.
-                This is only relevant for encoder-decoder models.
-
-            use_cache: (`optional`) bool
-                If `use_cache` is True, past key values are used to speed up decoding if applicable to model. Defaults to `True`.
-
-            model_specific_kwargs: (`optional`) dict
-                Additional model specific kwargs will be forwarded to the `forward` function of the model.
-
-        Return:
-
-            output: `torch.LongTensor` of shape `(batch_size * num_return_sequences, sequence_length)`
-                sequence_length is either equal to max_length or shorter if all batches finished early due to the `eos_token_id`
-
-        Examples::
-
-            from transformers import AutoTokenizer, AutoModelForCausalLM
-
-            tokenizer = AutoTokenizer.from_pretrained('distilgpt2')   # Initialize tokenizer
-            model = AutoModelForCausalLM.from_pretrained('distilgpt2')    # Download model and configuration from S3 and cache.
-            outputs = model.generate(max_length=40)  # do greedy decoding
-            print('Generated: {}'.format(tokenizer.decode(outputs[0], skip_special_tokens=True)))
-
-            tokenizer = AutoTokenizer.from_pretrained('openai-gpt')   # Initialize tokenizer
-            model = AutoModelForCausalLM.from_pretrained('openai-gpt')    # Download model and configuration from S3 and cache.
-            input_context = 'The dog'
-            input_ids = tokenizer.encode(input_context, return_tensors='pt')  # encode input context
-            outputs = model.generate(input_ids=input_ids, num_beams=5, num_return_sequences=3, temperature=1.5)  # generate 3 independent sequences using beam search decoding (5 beams) with sampling from initial context 'The dog'
-            for i in range(3): #  3 output sequences were generated
-                print('Generated {}: {}'.format(i, tokenizer.decode(outputs[i], skip_special_tokens=True)))
-
-            tokenizer = AutoTokenizer.from_pretrained('distilgpt2')   # Initialize tokenizer
-            model = AutoModelForCausalLM.from_pretrained('distilgpt2')    # Download model and configuration from S3 and cache.
-            input_context = 'The dog'
-            input_ids = tokenizer.encode(input_context, return_tensors='pt')  # encode input context
-            outputs = model.generate(input_ids=input_ids, max_length=40, temperature=0.7, num_return_sequences=3, do_sample=True)  # 3 generate sequences using by sampling
-            for i in range(3): #  3 output sequences were generated
-                print('Generated {}: {}'.format(i, tokenizer.decode(outputs[i], skip_special_tokens=True)))
-
-            tokenizer = AutoTokenizer.from_pretrained('ctrl')   # Initialize tokenizer
-            model = AutoModelForCausalLM.from_pretrained('ctrl')    # Download model and configuration from S3 and cache.
-            input_context = 'Legal My neighbor is'  # "Legal" is one of the control codes for ctrl
-            input_ids = tokenizer.encode(input_context, return_tensors='pt')  # encode input context
-            outputs = model.generate(input_ids=input_ids, max_length=50, temperature=0.7, repetition_penalty=1.2)  # generate sequences
-            print('Generated: {}'.format(tokenizer.decode(outputs[0], skip_special_tokens=True)))
-
-            tokenizer = AutoTokenizer.from_pretrained('gpt2')   # Initialize tokenizer
-            model = AutoModelForCausalLM.from_pretrained('gpt2')    # Download model and configuration from S3 and cache.
-            input_context = 'My cute dog'  # "Legal" is one of the control codes for ctrl
-            bad_words_ids = [tokenizer.encode(bad_word, add_prefix_space=True) for bad_word in ['idiot', 'stupid', 'shut up']]
-            input_ids = tokenizer.encode(input_context, return_tensors='pt')  # encode input context
-            outputs = model.generate(input_ids=input_ids, max_length=100, do_sample=True, bad_words_ids=bad_words_ids)  # generate sequences without allowing bad_words to be generated
-        """
-
-        # We cannot generate if the model does not have a LM head
-        if self.get_output_embeddings() is None:
-            raise AttributeError(
-                "You tried to generate sequences with a model that does not have a LM Head."
-                "Please use another model class (e.g. `OpenAIGPTLMHeadModel`, `XLNetLMHeadModel`, `GPT2LMHeadModel`, `CTRLLMHeadModel`, `T5WithLMHeadModel`, `TransfoXLLMHeadModel`, `XLMWithLMHeadModel`, `BartForConditionalGeneration` )"
-            )
-
-        max_length = max_length if max_length is not None else self.config.max_length
-        min_length = min_length if min_length is not None else self.config.min_length
-        do_sample = do_sample if do_sample is not None else self.config.do_sample
-        early_stopping = early_stopping if early_stopping is not None else self.config.early_stopping
-        use_cache = use_cache if use_cache is not None else self.config.use_cache
-        num_beams = num_beams if num_beams is not None else self.config.num_beams
-        temperature = temperature if temperature is not None else self.config.temperature
-        top_k = top_k if top_k is not None else self.config.top_k
-        top_p = top_p if top_p is not None else self.config.top_p
-        repetition_penalty = repetition_penalty if repetition_penalty is not None else self.config.repetition_penalty
-        bos_token_id = bos_token_id if bos_token_id is not None else self.config.bos_token_id
-        pad_token_id = pad_token_id if pad_token_id is not None else self.config.pad_token_id
-        eos_token_id = eos_token_id if eos_token_id is not None else self.config.eos_token_id
-        length_penalty = length_penalty if length_penalty is not None else self.config.length_penalty
-        no_repeat_ngram_size = (
-            no_repeat_ngram_size if no_repeat_ngram_size is not None else self.config.no_repeat_ngram_size
-        )
-        bad_words_ids = bad_words_ids if bad_words_ids is not None else self.config.bad_words_ids
-        num_return_sequences = (
-            num_return_sequences if num_return_sequences is not None else self.config.num_return_sequences
-        )
-        decoder_start_token_id = (
-            decoder_start_token_id if decoder_start_token_id is not None else self.config.decoder_start_token_id
-        )
-
-        if input_ids is not None:
-            batch_size = input_ids.shape[0]  # overriden by the input batch_size
-        else:
-            batch_size = 1
-
-        assert isinstance(max_length, int) and max_length > 0, "`max_length` should be a strictly positive integer."
-        assert isinstance(min_length, int) and min_length >= 0, "`min_length` should be a positive integer."
-        assert isinstance(do_sample, bool), "`do_sample` should be a boolean."
-        assert isinstance(early_stopping, bool), "`early_stopping` should be a boolean."
-        assert isinstance(use_cache, bool), "`use_cache` should be a boolean."
-        assert isinstance(num_beams, int) and num_beams > 0, "`num_beams` should be a strictly positive integer."
-        assert temperature > 0, "`temperature` should be strictly positive."
-        assert isinstance(top_k, int) and top_k >= 0, "`top_k` should be a positive integer."
-        assert 0 <= top_p <= 1, "`top_p` should be between 0 and 1."
-        assert repetition_penalty >= 1.0, "`repetition_penalty` should be >= 1."
-        assert input_ids is not None or (
-            isinstance(bos_token_id, int) and bos_token_id >= 0
-        ), "If input_ids is not defined, `bos_token_id` should be a positive integer."
-        assert pad_token_id is None or (
-            isinstance(pad_token_id, int) and (pad_token_id >= 0)
-        ), "`pad_token_id` should be a positive integer."
-        assert (eos_token_id is None) or (
-            isinstance(eos_token_id, int) and (eos_token_id >= 0)
-        ), "`eos_token_id` should be a positive integer."
-        assert length_penalty > 0, "`length_penalty` should be strictly positive."
-        assert (
-            isinstance(no_repeat_ngram_size, int) and no_repeat_ngram_size >= 0
-        ), "`no_repeat_ngram_size` should be a positive integer."
-        assert (
-            isinstance(num_return_sequences, int) and num_return_sequences > 0
-        ), "`num_return_sequences` should be a strictly positive integer."
-        assert (
-            bad_words_ids is None or isinstance(bad_words_ids, list) and isinstance(bad_words_ids[0], list)
-        ), "`bad_words_ids` is either `None` or a list of lists of tokens that should not be generated"
-
-        if input_ids is None:
-            assert isinstance(bos_token_id, int) and bos_token_id >= 0, (
-                "you should either supply a context to complete as `input_ids` input "
-                "or a `bos_token_id` (integer >= 0) as a first token to start the generation."
-            )
-            input_ids = torch.full(
-                (batch_size, 1), bos_token_id, dtype=torch.long, device=next(self.parameters()).device,
-            )
-        else:
-            assert input_ids.dim() == 2, "Input prompt should be of shape (batch_size, sequence length)."
-
-        # not allow to duplicate outputs when greedy decoding
-        if do_sample is False:
-            if num_beams == 1:
-                # no_beam_search greedy generation conditions
-                assert (
-                    num_return_sequences == 1
-                ), "Greedy decoding will always produce the same output for num_beams == 1 and num_return_sequences > 1. Please set num_return_sequences = 1"
-
-            else:
-                # beam_search greedy generation conditions
-                assert (
-                    num_beams >= num_return_sequences
-                ), "Greedy beam search decoding cannot return more sequences than it has beams. Please set num_beams >= num_return_sequences"
-
-        # create attention mask if necessary
-        # TODO (PVP): this should later be handled by the forward fn() in each model in the future see PR 3140
-        if (attention_mask is None) and (pad_token_id is not None) and (pad_token_id in input_ids):
-            attention_mask = input_ids.ne(pad_token_id).long()
-        elif attention_mask is None:
-            attention_mask = input_ids.new_ones(input_ids.shape)
-
-        # set pad_token_id to eos_token_id if not set. Important that this is done after
-        # attention_mask is created
-        if pad_token_id is None and eos_token_id is not None:
-            logger.warning(
-                "Setting `pad_token_id` to {} (first `eos_token_id`) to generate sequence".format(eos_token_id)
-            )
-            pad_token_id = eos_token_id
-
-        # current position and vocab size
-        if hasattr(self.config, "vocab_size"):
-            vocab_size = self.config.vocab_size
-        elif (
-            self.config.is_encoder_decoder
-            and hasattr(self.config, "decoder")
-            and hasattr(self.config.decoder, "vocab_size")
-        ):
-            vocab_size = self.config.decoder.vocab_size
-
-        # set effective batch size and effective batch multiplier according to do_sample
-        if do_sample:
-            effective_batch_size = batch_size * num_return_sequences
-            effective_batch_mult = num_return_sequences
-        else:
-            effective_batch_size = batch_size
-            effective_batch_mult = 1
-
-        if self.config.is_encoder_decoder:
-            if decoder_start_token_id is None:
-                decoder_start_token_id = bos_token_id
-
-            assert (
-                decoder_start_token_id is not None
-            ), "decoder_start_token_id or bos_token_id has to be defined for encoder-decoder generation"
-            assert hasattr(self, "get_encoder"), "{} should have a 'get_encoder' function defined".format(self)
-            assert callable(self.get_encoder), "{} should be a method".format(self.get_encoder)
-
-            # get encoder and store encoder outputs
-            encoder = self.get_encoder()
-
-            encoder_outputs: tuple = encoder(input_ids, attention_mask=attention_mask)
-
-        # Expand input ids if num_beams > 1 or num_return_sequences > 1
-        if num_return_sequences > 1 or num_beams > 1:
-            input_ids_len = input_ids.shape[-1]
-            input_ids = input_ids.unsqueeze(1).expand(batch_size, effective_batch_mult * num_beams, input_ids_len)
-            attention_mask = attention_mask.unsqueeze(1).expand(
-                batch_size, effective_batch_mult * num_beams, input_ids_len
-            )
-
-            input_ids = input_ids.contiguous().view(
-                effective_batch_size * num_beams, input_ids_len
-            )  # shape: (batch_size * num_return_sequences * num_beams, cur_len)
-            attention_mask = attention_mask.contiguous().view(
-                effective_batch_size * num_beams, input_ids_len
-            )  # shape: (batch_size * num_return_sequences * num_beams, cur_len)
-
-        if self.config.is_encoder_decoder:
-            # create empty decoder_input_ids
-            input_ids = torch.full(
-                (effective_batch_size * num_beams, 1),
-                decoder_start_token_id,
-                dtype=torch.long,
-                device=next(self.parameters()).device,
-            )
-            cur_len = 1
-
-            assert (
-                batch_size == encoder_outputs[0].shape[0]
-            ), f"expected encoder_outputs[0] to have 1st dimension bs={batch_size}, got {encoder_outputs[0].shape[0]} "
-
-            # expand batch_idx to assign correct encoder output for expanded input_ids (due to num_beams > 1 and num_return_sequences > 1)
-            expanded_batch_idxs = (
-                torch.arange(batch_size)
-                .view(-1, 1)
-                .repeat(1, num_beams * effective_batch_mult)
-                .view(-1)
-                .to(input_ids.device)
-            )
-            # expand encoder_outputs
-            encoder_outputs = (encoder_outputs[0].index_select(0, expanded_batch_idxs), *encoder_outputs[1:])
-
-        else:
-            encoder_outputs = None
-            cur_len = input_ids.shape[-1]
-
-        if num_beams > 1:
-            output = self._generate_beam_search(
-                input_ids,
-                cur_len=cur_len,
-                max_length=max_length,
-                min_length=min_length,
-                do_sample=do_sample,
-                early_stopping=early_stopping,
-                temperature=temperature,
-                top_k=top_k,
-                top_p=top_p,
-                repetition_penalty=repetition_penalty,
-                no_repeat_ngram_size=no_repeat_ngram_size,
-                bad_words_ids=bad_words_ids,
-                pad_token_id=pad_token_id,
-                eos_token_id=eos_token_id,
-                batch_size=effective_batch_size,
-                num_return_sequences=num_return_sequences,
-                length_penalty=length_penalty,
-                num_beams=num_beams,
-                vocab_size=vocab_size,
-                encoder_outputs=encoder_outputs,
-                attention_mask=attention_mask,
-                use_cache=use_cache,
-                model_specific_kwargs=model_specific_kwargs,
-            )
-        else:
-            output = self._generate_no_beam_search(
-                input_ids,
-                cur_len=cur_len,
-                max_length=max_length,
-                min_length=min_length,
-                do_sample=do_sample,
-                temperature=temperature,
-                top_k=top_k,
-                top_p=top_p,
-                repetition_penalty=repetition_penalty,
-                no_repeat_ngram_size=no_repeat_ngram_size,
-                bad_words_ids=bad_words_ids,
-                pad_token_id=pad_token_id,
-                eos_token_id=eos_token_id,
-                batch_size=effective_batch_size,
-                encoder_outputs=encoder_outputs,
-                attention_mask=attention_mask,
-                use_cache=use_cache,
-                model_specific_kwargs=model_specific_kwargs,
-            )
-
-        return output
-
-    def _generate_no_beam_search(
-        self,
-        input_ids,
-        cur_len,
-        max_length,
-        min_length,
-        do_sample,
-        temperature,
-        top_k,
-        top_p,
-        repetition_penalty,
-        no_repeat_ngram_size,
-        bad_words_ids,
-        pad_token_id,
-        eos_token_id,
-        batch_size,
-        encoder_outputs,
-        attention_mask,
-        use_cache,
-        model_specific_kwargs,
-    ):
-        """ Generate sequences for each example without beam search (num_beams == 1).
-            All returned sequence are generated independantly.
-        """
-        # length of generated sentences / unfinished sentences
-        unfinished_sents = input_ids.new(batch_size).fill_(1)
-        sent_lengths = input_ids.new(batch_size).fill_(max_length)
-
-        past = (encoder_outputs, None) if encoder_outputs is not None else None
-
-        while cur_len < max_length:
-            model_inputs = self.prepare_inputs_for_generation(
-                input_ids, past=past, attention_mask=attention_mask, use_cache=use_cache, **model_specific_kwargs
-            )
-
-            outputs = self(**model_inputs)
-            next_token_logits = outputs[0][:, -1, :]
-
-            scores = self.postprocess_next_token_scores(
-                scores=next_token_logits,
-                input_ids=input_ids,
-                no_repeat_ngram_size=no_repeat_ngram_size,
-                bad_words_ids=bad_words_ids,
-                cur_len=cur_len,
-                min_length=min_length,
-                max_length=max_length,
-                eos_token_id=eos_token_id,
-                repetition_penalty=repetition_penalty,
-                batch_size=batch_size,
-                num_beams=1,
-            )
-
-            # if model has past, then set the past variable to speed up decoding
-            if self._use_cache(outputs, use_cache):
-                past = outputs[1]
-
-            if do_sample:
-                # Temperature (higher temperature => more likely to sample low probability tokens)
-                if temperature != 1.0:
-                    scores = scores / temperature
-                # Top-p/top-k filtering
-                next_token_logscores = top_k_top_p_filtering(scores, top_k=top_k, top_p=top_p)
-                # Sample
-                probs = F.softmax(next_token_logscores, dim=-1)
-                next_token = torch.multinomial(probs, num_samples=1).squeeze(1)
-            else:
-                # Greedy decoding
-                next_token = torch.argmax(next_token_logits, dim=-1)
-
-            # update generations and finished sentences
-            if eos_token_id is not None:
-                # pad finished sentences if eos_token_id exist
-                tokens_to_add = next_token * unfinished_sents + (pad_token_id) * (1 - unfinished_sents)
-            else:
-                tokens_to_add = next_token
-
-            # add token and increase length by one
-            input_ids = torch.cat([input_ids, tokens_to_add.unsqueeze(-1)], dim=-1)
-            cur_len = cur_len + 1
-
-            if eos_token_id is not None:
-                eos_in_sents = tokens_to_add == eos_token_id
-                # if sentence is unfinished and the token to add is eos, sent_lengths is filled with current length
-                is_sents_unfinished_and_token_to_add_is_eos = unfinished_sents.mul(eos_in_sents.long()).bool()
-                sent_lengths.masked_fill_(is_sents_unfinished_and_token_to_add_is_eos, cur_len)
-                # unfinished_sents is set to zero if eos in sentence
-                unfinished_sents.mul_((~eos_in_sents).long())
-
-            # stop when there is a </s> in each sentence, or if we exceed the maximul length
-            if unfinished_sents.max() == 0:
-                break
-
-            # extend attention_mask for new generated input if only decoder
-            if self.config.is_encoder_decoder is False:
-                attention_mask = torch.cat(
-                    [attention_mask, attention_mask.new_ones((attention_mask.shape[0], 1))], dim=-1
-                )
-
-        return input_ids
-
-    def _generate_beam_search(
-        self,
-        input_ids,
-        cur_len,
-        max_length,
-        min_length,
-        do_sample,
-        early_stopping,
-        temperature,
-        top_k,
-        top_p,
-        repetition_penalty,
-        no_repeat_ngram_size,
-        bad_words_ids,
-        pad_token_id,
-        eos_token_id,
-        batch_size,
-        num_return_sequences,
-        length_penalty,
-        num_beams,
-        vocab_size,
-        encoder_outputs,
-        attention_mask,
-        use_cache,
-        model_specific_kwargs,
-    ):
-        """ Generate sequences for each example with beam search.
-        """
-
-        # generated hypotheses
-        generated_hyps = [
-            BeamHypotheses(num_beams, max_length, length_penalty, early_stopping=early_stopping)
-            for _ in range(batch_size)
-        ]
-
-        # scores for each sentence in the beam
-        beam_scores = torch.zeros((batch_size, num_beams), dtype=torch.float, device=input_ids.device)
-
-        # for greedy decoding it is made sure that only tokens of the first beam are considered to avoid sampling the exact same tokens three times
-        if do_sample is False:
-            beam_scores[:, 1:] = -1e9
-        beam_scores = beam_scores.view(-1)  # shape (batch_size * num_beams,)
-
-        # cache compute states
-        past = (encoder_outputs, None) if encoder_outputs is not None else None
-
-        # done sentences
-        done = [False for _ in range(batch_size)]
-
-        while cur_len < max_length:
-            model_inputs = self.prepare_inputs_for_generation(
-                input_ids, past=past, attention_mask=attention_mask, use_cache=use_cache, **model_specific_kwargs
-            )
-            outputs = self(**model_inputs)  # (batch_size * num_beams, cur_len, vocab_size)
-            next_token_logits = outputs[0][:, -1, :]  # (batch_size * num_beams, vocab_size)
-
-            # if model has past, then set the past variable to speed up decoding
-            if self._use_cache(outputs, use_cache):
-                past = outputs[1]
-            if self.config.is_encoder_decoder and do_sample is False:
-                # TODO (PVP) still a bit hacky here - there might be a better solution
-                next_token_logits = self.adjust_logits_during_generation(
-                    next_token_logits, cur_len=cur_len, max_length=max_length
-                )
-
-            scores = F.log_softmax(next_token_logits, dim=-1)  # (batch_size * num_beams, vocab_size)
-
-            scores = self.postprocess_next_token_scores(
-                scores=scores,
-                input_ids=input_ids,
-                no_repeat_ngram_size=no_repeat_ngram_size,
-                bad_words_ids=bad_words_ids,
-                cur_len=cur_len,
-                min_length=min_length,
-                max_length=max_length,
-                eos_token_id=eos_token_id,
-                repetition_penalty=repetition_penalty,
-                batch_size=batch_size,
-                num_beams=num_beams,
-            )
-
-            assert scores.shape == (batch_size * num_beams, vocab_size), "Shapes of scores: {} != {}".format(
-                scores.shape, (batch_size * num_beams, vocab_size)
-            )
-
-            if do_sample:
-                _scores = scores + beam_scores[:, None].expand_as(scores)  # (batch_size * num_beams, vocab_size)
-                # Temperature
-                if temperature != 1.0:
-                    _scores = _scores / temperature
-                # Top-p/top-k filtering
-                _scores = top_k_top_p_filtering(
-                    _scores, top_k=top_k, top_p=top_p, min_tokens_to_keep=2
-                )  # (batch_size * num_beams, vocab_size)
-                # re-organize to group the beam together to sample from all beam_idxs
-                _scores = _scores.contiguous().view(
-                    batch_size, num_beams * vocab_size
-                )  # (batch_size, num_beams * vocab_size)
-
-                # Sample 2 next tokens for each beam (so we have some spare tokens and match output of greedy beam search)
-                probs = F.softmax(_scores, dim=-1)
-                next_tokens = torch.multinomial(probs, num_samples=2 * num_beams)  # (batch_size, num_beams * 2)
-                # Compute next scores
-                next_scores = torch.gather(_scores, -1, next_tokens)  # (batch_size, num_beams * 2)
-                # sort the sampled vector to make sure that the first num_beams samples are the best
-                next_scores, next_scores_indices = torch.sort(next_scores, descending=True, dim=1)
-                next_tokens = torch.gather(next_tokens, -1, next_scores_indices)  # (batch_size, num_beams * 2)
-
-            else:
-                next_scores = scores + beam_scores[:, None].expand_as(scores)  # (batch_size * num_beams, vocab_size)
-
-                # re-organize to group the beam together (we are keeping top hypothesis accross beams)
-                next_scores = next_scores.view(
-                    batch_size, num_beams * vocab_size
-                )  # (batch_size, num_beams * vocab_size)
-
-                next_scores, next_tokens = torch.topk(next_scores, 2 * num_beams, dim=1, largest=True, sorted=True)
-
-            assert next_scores.size() == next_tokens.size() == (batch_size, 2 * num_beams)
-
-            # next batch beam content
-            next_batch_beam = []
-
-            # for each sentence
-            for batch_idx in range(batch_size):
-
-                # if we are done with this sentence, add a pad token
-                if done[batch_idx]:
-                    assert (
-                        len(generated_hyps[batch_idx]) >= num_beams
-                    ), "Batch can only be done if at least {} beams have been generated".format(num_beams)
-                    assert (
-                        eos_token_id is not None and pad_token_id is not None
-                    ), "generated beams >= num_beams -> eos_token_id and pad_token have to be defined"
-                    next_batch_beam.extend([(0, pad_token_id, 0)] * num_beams)  # pad the batch
-                    continue
-
-                # next sentence beam content, this will get added to next_batch_beam
-                next_sent_beam = []
-
-                # next tokens for this sentence
-                for beam_token_rank, (beam_token_id, beam_token_score) in enumerate(
-                    zip(next_tokens[batch_idx], next_scores[batch_idx])
-                ):
-                    # get beam and token IDs
-                    beam_id = beam_token_id // vocab_size
-                    token_id = beam_token_id % vocab_size
-
-                    effective_beam_id = batch_idx * num_beams + beam_id
-                    # add to generated hypotheses if end of sentence
-                    if (eos_token_id is not None) and (token_id.item() == eos_token_id):
-                        # if beam_token does not belong to top num_beams tokens, it should not be added
-                        is_beam_token_worse_than_top_num_beams = beam_token_rank >= num_beams
-                        if is_beam_token_worse_than_top_num_beams:
-                            continue
-                        generated_hyps[batch_idx].add(
-                            input_ids[effective_beam_id].clone(), beam_token_score.item(),
-                        )
-                    else:
-                        # add next predicted token since it is not eos_token
-                        next_sent_beam.append((beam_token_score, token_id, effective_beam_id))
-
-                    # once the beam for next step is full, don't add more tokens to it.
-                    if len(next_sent_beam) == num_beams:
-                        break
-
-                # Check if we are done so that we can save a pad step if all(done)
-                done[batch_idx] = done[batch_idx] or generated_hyps[batch_idx].is_done(
-                    next_scores[batch_idx].max().item(), cur_len
-                )
-
-                # update next beam content
-                assert len(next_sent_beam) == num_beams, "Beam should always be full"
-                next_batch_beam.extend(next_sent_beam)
-                assert len(next_batch_beam) == num_beams * (batch_idx + 1), "We should have added num_beams each step"
-
-            # stop when we are done with each sentence
-            if all(done):
-                break
-
-            # sanity check / prepare next batch
-            assert len(next_batch_beam) == batch_size * num_beams
-            beam_scores = beam_scores.new([x[0] for x in next_batch_beam])
-            beam_tokens = input_ids.new([x[1] for x in next_batch_beam])
-            beam_idx = input_ids.new([x[2] for x in next_batch_beam])
-
-            # re-order batch and update current length
-            input_ids = input_ids[beam_idx, :]
-            input_ids = torch.cat([input_ids, beam_tokens.unsqueeze(1)], dim=-1)
-            cur_len = cur_len + 1
-
-            # re-order internal states
-            if past is not None:
-                past = self._reorder_cache(past, beam_idx)
-
-            # extend attention_mask for new generated input if only decoder
-            if self.config.is_encoder_decoder is False:
-                attention_mask = torch.cat(
-                    [attention_mask, attention_mask.new_ones((attention_mask.shape[0], 1))], dim=-1
-                )
-
-        # finalize all open beam hypotheses and add to generated hypotheses
-        for batch_idx in range(batch_size):
-            if done[batch_idx]:
-                continue
-
-            # test that beam scores match previously calculated scores if not eos and batch_idx not done
-            if eos_token_id is not None and all(
-                (token_id % vocab_size).item() != eos_token_id for token_id in next_tokens[batch_idx]
-            ):
-                assert torch.all(
-                    next_scores[batch_idx, :num_beams] == beam_scores.view(batch_size, num_beams)[batch_idx]
-                ), "If batch_idx is not done, final next scores: {} have to equal to accumulated beam_scores: {}".format(
-                    next_scores[:, :num_beams][batch_idx], beam_scores.view(batch_size, num_beams)[batch_idx],
-                )
-
-            # need to add best num_beams hypotheses to generated hyps
-            for beam_id in range(num_beams):
-                effective_beam_id = batch_idx * num_beams + beam_id
-                final_score = beam_scores[effective_beam_id].item()
-                final_tokens = input_ids[effective_beam_id]
-                generated_hyps[batch_idx].add(final_tokens, final_score)
-
-        # depending on whether greedy generation is wanted or not define different output_batch_size and output_num_return_sequences_per_batch
-        output_batch_size = batch_size if do_sample else batch_size * num_return_sequences
-        output_num_return_sequences_per_batch = 1 if do_sample else num_return_sequences
-
-        # select the best hypotheses
-        sent_lengths = input_ids.new(output_batch_size)
-        best = []
-
-        # retrieve best hypotheses
-        for i, hypotheses in enumerate(generated_hyps):
-            sorted_hyps = sorted(hypotheses.beams, key=lambda x: x[0])
-            for j in range(output_num_return_sequences_per_batch):
-                effective_batch_idx = output_num_return_sequences_per_batch * i + j
-                best_hyp = sorted_hyps.pop()[1]
-                sent_lengths[effective_batch_idx] = len(best_hyp)
-                best.append(best_hyp)
-
-        # shorter batches are padded
-        if sent_lengths.min().item() != sent_lengths.max().item():
-            assert pad_token_id is not None, "`Pad_token_id` has to be defined"
-            sent_max_len = min(sent_lengths.max().item() + 1, max_length)
-            decoded = input_ids.new(output_batch_size, sent_max_len).fill_(pad_token_id)
-
-            # fill with hypothesis and eos_token_id if necessary
-            for i, hypo in enumerate(best):
-                decoded[i, : sent_lengths[i]] = hypo
-                if sent_lengths[i] < max_length:
-                    decoded[i, sent_lengths[i]] = eos_token_id
-        else:
-            # none of the hypotheses have an eos_token
-            assert (len(hypo) == max_length for hypo in best)
-            decoded = torch.stack(best).type(torch.long).to(next(self.parameters()).device)
-
-        return decoded
-
-    @staticmethod
-    def _reorder_cache(past: Tuple, beam_idx: Tensor) -> Tuple[Tensor]:
-        return tuple(layer_past.index_select(1, beam_idx) for layer_past in past)
-
-
-def calc_banned_ngram_tokens(prev_input_ids: Tensor, num_hypos: int, no_repeat_ngram_size: int, cur_len: int) -> None:
-    """Copied from fairseq for no_repeat_ngram in beam_search"""
-    if cur_len + 1 < no_repeat_ngram_size:
-        # return no banned tokens if we haven't generated no_repeat_ngram_size tokens yet
-        return [[] for _ in range(num_hypos)]
-    generated_ngrams = [{} for _ in range(num_hypos)]
-    for idx in range(num_hypos):
-        gen_tokens = prev_input_ids[idx].tolist()
-        generated_ngram = generated_ngrams[idx]
-        for ngram in zip(*[gen_tokens[i:] for i in range(no_repeat_ngram_size)]):
-            prev_ngram_tuple = tuple(ngram[:-1])
-            generated_ngram[prev_ngram_tuple] = generated_ngram.get(prev_ngram_tuple, []) + [ngram[-1]]
-
-    def _get_generated_ngrams(hypo_idx):
-        # Before decoding the next token, prevent decoding of ngrams that have already appeared
-        start_idx = cur_len + 1 - no_repeat_ngram_size
-        ngram_idx = tuple(prev_input_ids[hypo_idx, start_idx:cur_len].tolist())
-        return generated_ngrams[hypo_idx].get(ngram_idx, [])
-
-    banned_tokens = [_get_generated_ngrams(hypo_idx) for hypo_idx in range(num_hypos)]
-    return banned_tokens
-
-
-def calc_banned_bad_words_ids(prev_input_ids: Iterable[int], bad_words_ids: Iterable[int]) -> Iterable[int]:
-    banned_tokens = []
-
-    def _tokens_match(prev_tokens, tokens):
-        if len(tokens) == 0:
-            # if bad word tokens is just one token always ban it
-            return True
-        if len(tokens) > len(prev_input_ids):
-            # if bad word tokens are longer then prev input_ids they can't be equal
-            return False
-
-        if prev_tokens[-len(tokens) :] == tokens:
-            # if tokens match
-            return True
-        else:
-            return False
-
-    for prev_input_ids_slice in prev_input_ids:
-        banned_tokens_slice = []
-
-        for banned_token_seq in bad_words_ids:
-            assert len(banned_token_seq) > 0, "Banned words token sequences {} cannot have an empty list".format(
-                bad_words_ids
-            )
-
-            if _tokens_match(prev_input_ids_slice.tolist(), banned_token_seq[:-1]) is False:
-                # if tokens do not match continue
-                continue
-
-            banned_tokens_slice.append(banned_token_seq[-1])
-
-        banned_tokens.append(banned_tokens_slice)
-
-    return banned_tokens
-
-
-def top_k_top_p_filtering(
-    logits: Tensor,
-    top_k: int = 0,
-    top_p: float = 1.0,
-    filter_value: float = -float("Inf"),
-    min_tokens_to_keep: int = 1,
-) -> Tensor:
-    """ Filter a distribution of logits using top-k and/or nucleus (top-p) filtering
-        Args:
-            logits: logits distribution shape (batch size, vocabulary size)
-            if top_k > 0: keep only top k tokens with highest probability (top-k filtering).
-            if top_p < 1.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering).
-                Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751)
-            Make sure we keep at least min_tokens_to_keep per batch example in the output
-        From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317
-    """
-    if top_k > 0:
-        top_k = min(max(top_k, min_tokens_to_keep), logits.size(-1))  # Safety check
-        # Remove all tokens with a probability less than the last token of the top-k
-        indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]
-        logits[indices_to_remove] = filter_value
-
-    if top_p < 1.0:
-        sorted_logits, sorted_indices = torch.sort(logits, descending=True)
-        cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
-
-        # Remove tokens with cumulative probability above the threshold (token with 0 are kept)
-        sorted_indices_to_remove = cumulative_probs > top_p
-        if min_tokens_to_keep > 1:
-            # Keep at least min_tokens_to_keep (set to min_tokens_to_keep-1 because we add the first one below)
-            sorted_indices_to_remove[..., :min_tokens_to_keep] = 0
-        # Shift the indices to the right to keep also the first token above the threshold
-        sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
-        sorted_indices_to_remove[..., 0] = 0
-
-        # scatter sorted tensors to original indexing
-        indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove)
-        logits[indices_to_remove] = filter_value
-    return logits
-
-
-class BeamHypotheses(object):
-    def __init__(self, num_beams, max_length, length_penalty, early_stopping):
-        """
-        Initialize n-best list of hypotheses.
-        """
-        self.max_length = max_length - 1  # ignoring bos_token
-        self.length_penalty = length_penalty
-        self.early_stopping = early_stopping
-        self.num_beams = num_beams
-        self.beams = []
-        self.worst_score = 1e9
-
-    def __len__(self):
-        """
-        Number of hypotheses in the list.
-        """
-        return len(self.beams)
-
-    def add(self, hyp, sum_logprobs):
-        """
-        Add a new hypothesis to the list.
-        """
-        score = sum_logprobs / len(hyp) ** self.length_penalty
-        if len(self) < self.num_beams or score > self.worst_score:
-            self.beams.append((score, hyp))
-            if len(self) > self.num_beams:
-                sorted_scores = sorted([(s, idx) for idx, (s, _) in enumerate(self.beams)])
-                del self.beams[sorted_scores[0][1]]
-                self.worst_score = sorted_scores[1][0]
-            else:
-                self.worst_score = min(score, self.worst_score)
-
-    def is_done(self, best_sum_logprobs, cur_len):
-        """
-        If there are enough hypotheses and that none of the hypotheses being generated
-        can become better than the worst one in the heap, then we are done with this sentence.
-        """
-
-        if len(self) < self.num_beams:
-            return False
-        elif self.early_stopping:
-            return True
-        else:
-            cur_score = best_sum_logprobs / cur_len ** self.length_penalty
-            ret = self.worst_score >= cur_score
-            return ret
-

 class Conv1D(nn.Module):
    def __init__(self, nf, nx):
--- a/src/transformers/optimization.py
+++ b/src/transformers/optimization.py
@@ -16,6 +16,7 @@

 import logging
 import math
+from typing import Callable, Iterable, Tuple

 import torch
 from torch.optim import Optimizer
@@ -25,18 +26,40 @@ from torch.optim.lr_scheduler import LambdaLR
 logger = logging.getLogger(__name__)


-def get_constant_schedule(optimizer, last_epoch=-1):
-    """ Create a schedule with a constant learning rate.
+def get_constant_schedule(optimizer: Optimizer, last_epoch: int = -1):
+    """
+    Create a schedule with a constant learning rate, using the learning rate set in optimizer.
+
+    Args:
+        optimizer (:class:`~torch.optim.Optimizer`):
+            The optimizer for which to schedule the learning rate.
+        last_epoch (:obj:`int`, `optional`, defaults to -1):
+            The index of the last epoch when resuming training.
+
+    Return:
+        :obj:`torch.optim.lr_scheduler.LambdaLR` with the appropriate schedule.
    """
    return LambdaLR(optimizer, lambda _: 1, last_epoch=last_epoch)


-def get_constant_schedule_with_warmup(optimizer, num_warmup_steps, last_epoch=-1):
-    """ Create a schedule with a constant learning rate preceded by a warmup
-    period during which the learning rate increases linearly between 0 and 1.
+def get_constant_schedule_with_warmup(optimizer: Optimizer, num_warmup_steps: int, last_epoch: int = -1):
+    """
+    Create a schedule with a constant learning rate preceded by a warmup period during which the learning rate
+    increases linearly between 0 and the initial lr set in the optimizer.
+
+    Args:
+        optimizer (:class:`~torch.optim.Optimizer`):
+            The optimizer for which to schedule the learning rate.
+        num_warmup_steps (:obj:`int`):
+            The number of steps for the warmup phase.
+        last_epoch (:obj:`int`, `optional`, defaults to -1):
+            The index of the last epoch when resuming training.
+
+    Return:
+        :obj:`torch.optim.lr_scheduler.LambdaLR` with the appropriate schedule.
    """

-    def lr_lambda(current_step):
+    def lr_lambda(current_step: int):
        if current_step < num_warmup_steps:
            return float(current_step) / float(max(1.0, num_warmup_steps))
        return 1.0
@@ -45,11 +68,25 @@ def get_constant_schedule_with_warmup(optimizer, num_warmup_steps, last_epoch=-1


 def get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps, last_epoch=-1):
-    """ Create a schedule with a learning rate that decreases linearly after
-    linearly increasing during a warmup period.
+    """
+    Create a schedule with a learning rate that decreases linearly from the initial lr set in the optimizer to 0,
+    after a warmup period during which it increases linearly from 0 to the initial lr set in the optimizer.
+
+    Args:
+        optimizer (:class:`~torch.optim.Optimizer`):
+            The optimizer for which to schedule the learning rate.
+        num_warmup_steps (:obj:`int`):
+            The number of steps for the warmup phase.
+        num_training_steps (:obj:`int`):
+            The totale number of training steps.
+        last_epoch (:obj:`int`, `optional`, defaults to -1):
+            The index of the last epoch when resuming training.
+
+    Return:
+        :obj:`torch.optim.lr_scheduler.LambdaLR` with the appropriate schedule.
    """

-    def lr_lambda(current_step):
+    def lr_lambda(current_step: int):
        if current_step < num_warmup_steps:
            return float(current_step) / float(max(1, num_warmup_steps))
        return max(
@@ -59,10 +96,29 @@ def get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_st
    return LambdaLR(optimizer, lr_lambda, last_epoch)


-def get_cosine_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps, num_cycles=0.5, last_epoch=-1):
-    """ Create a schedule with a learning rate that decreases following the
-    values of the cosine function between 0 and `pi * cycles` after a warmup
-    period during which it increases linearly between 0 and 1.
+def get_cosine_schedule_with_warmup(
+    optimizer: Optimizer, num_warmup_steps: int, num_training_steps: int, num_cycles: float = 0.5, last_epoch: int = -1
+):
+    """
+    Create a schedule with a learning rate that decreases following the values of the cosine function between the
+    initial lr set in the optimizer to 0, after a warmup period during which it increases linearly between 0 and the
+    initial lr set in the optimizer.
+
+    Args:
+        optimizer (:class:`~torch.optim.Optimizer`):
+            The optimizer for which to schedule the learning rate.
+        num_warmup_steps (:obj:`int`):
+            The number of steps for the warmup phase.
+        num_training_steps (:obj:`int`):
+            The total number of training steps.
+        num_cycles (:obj:`float`, `optional`, defaults to 0.5):
+            The number of waves in the cosine schedule (the defaults is to just decrease from the max value to 0
+            following a half-cosine).
+        last_epoch (:obj:`int`, `optional`, defaults to -1):
+            The index of the last epoch when resuming training.
+
+    Return:
+        :obj:`torch.optim.lr_scheduler.LambdaLR` with the appropriate schedule.
    """

    def lr_lambda(current_step):
@@ -75,11 +131,27 @@ def get_cosine_schedule_with_warmup(optimizer, num_warmup_steps, num_training_st


 def get_cosine_with_hard_restarts_schedule_with_warmup(
-    optimizer, num_warmup_steps, num_training_steps, num_cycles=1.0, last_epoch=-1
+    optimizer: Optimizer, num_warmup_steps: int, num_training_steps: int, num_cycles: int = 1, last_epoch: int = -1
 ):
-    """ Create a schedule with a learning rate that decreases following the
-    values of the cosine function with several hard restarts, after a warmup
-    period during which it increases linearly between 0 and 1.
+    """
+    Create a schedule with a learning rate that decreases following the values of the cosine function between the
+    initial lr set in the optimizer to 0, with several hard restarts, after a warmup period during which it increases
+    linearly between 0 and the initial lr set in the optimizer.
+
+    Args:
+        optimizer (:class:`~torch.optim.Optimizer`):
+            The optimizer for which to schedule the learning rate.
+        num_warmup_steps (:obj:`int`):
+            The number of steps for the warmup phase.
+        num_training_steps (:obj:`int`):
+            The total number of training steps.
+        num_cycles (:obj:`int`, `optional`, defaults to 1):
+            The number of hard restarts to use.
+        last_epoch (:obj:`int`, `optional`, defaults to -1):
+            The index of the last epoch when resuming training.
+
+    Return:
+        :obj:`torch.optim.lr_scheduler.LambdaLR` with the appropriate schedule.
    """

    def lr_lambda(current_step):
@@ -94,17 +166,34 @@ def get_cosine_with_hard_restarts_schedule_with_warmup(


 class AdamW(Optimizer):
-    """ Implements Adam algorithm with weight decay fix.
+    """
+    Implements Adam algorithm with weight decay fix as introduced in
+    `Decoupled Weight Decay Regularization <https://arxiv.org/abs/1711.05101>`__.

    Parameters:
-        lr (float): learning rate. Default 1e-3.
-        betas (tuple of 2 floats): Adams beta parameters (b1, b2). Default: (0.9, 0.999)
-        eps (float): Adams epsilon. Default: 1e-6
-        weight_decay (float): Weight decay. Default: 0.0
-        correct_bias (bool): can be set to False to avoid correcting bias in Adam (e.g. like in Bert TF repository). Default True.
+        params (:obj:`Iterable[torch.nn.parameter.Parameter]`):
+            Iterable of parameters to optimize or dictionaries defining parameter groups.
+        lr (:obj:`float`, `optional`, defaults to 1e-3):
+            The learning rate to use.
+        betas (:obj:`Tuple[float,float]`, `optional`, defaults to (0.9, 0.999)):
+            Adam's betas parameters (b1, b2).
+        eps (:obj:`float`, `optional`, defaults to 1e-6):
+            Adam's epsilon for numerical stability.
+        weight_decay (:obj:`float`, `optional`, defaults to 0):
+            Decoupled weight decay to apply.
+        correct_bias (:obj:`bool`, `optional`, defaults to `True`):
+            Whether ot not to correct bias in Adam (for instance, in Bert TF repository they use :obj:`False`).
    """

-    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-6, weight_decay=0.0, correct_bias=True):
+    def __init__(
+        self,
+        params: Iterable[torch.nn.parameter.Parameter],
+        lr: float = 1e-3,
+        betas: Tuple[float, float] = (0.9, 0.999),
+        eps: float = 1e-6,
+        weight_decay: float = 0.0,
+        correct_bias: bool = True,
+    ):
        if lr < 0.0:
            raise ValueError("Invalid learning rate: {} - should be >= 0.0".format(lr))
        if not 0.0 <= betas[0] < 1.0:
@@ -116,12 +205,12 @@ class AdamW(Optimizer):
        defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay, correct_bias=correct_bias)
        super().__init__(params, defaults)

-    def step(self, closure=None):
-        """Performs a single optimization step.
+    def step(self, closure: Callable = None):
+        """
+        Performs a single optimization step.

        Arguments:
-            closure (callable, optional): A closure that reevaluates the model
-                and returns the loss.
+            closure (:obj:`Callable`, `optional`): A closure that reevaluates the model and returns the loss.
        """
        loss = None
        if closure is not None:
--- a/src/transformers/optimization_tf.py
+++ b/src/transformers/optimization_tf.py
@@ -16,15 +16,36 @@


 import re
+from typing import Callable, List, Optional, Union

 import tensorflow as tf


 class WarmUp(tf.keras.optimizers.schedules.LearningRateSchedule):
-    """Applies a warmup schedule on a given learning rate decay schedule."""
+    """
+    Applies a warmup schedule on a given learning rate decay schedule.
+
+    Args:
+        initial_learning_rate (:obj:`float`):
+            The initial learning rate for the schedule after the warmup (so this will be the learning rate at the end
+            of the warmup).
+        decay_schedule_fn (:obj:`Callable`):
+            The schedule function to apply after the warmup for the rest of training.
+        warmup_steps (:obj:`int`):
+            The number of steps for the warmup part of training.
+        power (:obj:`float`, `optional`, defaults to 1):
+            The power to use for the polynomial warmup (defaults is a linear warmup).
+        name (:obj:`str`, `optional`):
+            Optional name prefix for the returned tensors during the schedule.
+    """

    def __init__(
-        self, initial_learning_rate, decay_schedule_fn, warmup_steps, power=1.0, name=None,
+        self,
+        initial_learning_rate: float,
+        decay_schedule_fn: Callable,
+        warmup_steps: int,
+        power: float = 1.0,
+        name: str = None,
    ):
        super().__init__()
        self.initial_learning_rate = initial_learning_rate
@@ -59,15 +80,34 @@ class WarmUp(tf.keras.optimizers.schedules.LearningRateSchedule):


 def create_optimizer(
-    init_lr,
-    num_train_steps,
-    num_warmup_steps,
-    min_lr_ratio=0.0,
-    adam_epsilon=1e-8,
-    weight_decay_rate=0.0,
-    include_in_weight_decay=None,
+    init_lr: float,
+    num_train_steps: int,
+    num_warmup_steps: int,
+    min_lr_ratio: float = 0.0,
+    adam_epsilon: float = 1e-8,
+    weight_decay_rate: float = 0.0,
+    include_in_weight_decay: Optional[List[str]] = None,
 ):
-    """Creates an optimizer with learning rate schedule."""
+    """
+    Creates an optimizer with a learning rate schedule using a warmup phase followed by a linear decay.
+
+    Args:
+        init_lr (:obj:`float`):
+            The desired learning rate at the end of the warmup phase.
+        num_train_step (:obj:`int`):
+            The total number of training steps.
+        num_warmup_steps (:obj:`int`):
+            The number of warmup steps.
+        min_lr_ratio (:obj:`float`, `optional`, defaults to 0):
+            The final learning rate at the end of the linear decay will be :obj:`init_lr * min_lr_ratio`.
+        adam_epsilon (:obj:`float`, `optional`, defaults to 1e-8):
+            The epsilon to use in Adam.
+        weight_decay_rate (:obj:`float`, `optional`, defaults to 0):
+            The weight decay to use.
+        include_in_weight_decay (:obj:`List[str]`, `optional`):
+            List of the parameter names (or re patterns) to apply weight decay to. If none is passed, weight decay is
+            applied to all parameters except bias and layer norm parameters.
+    """
    # Implements linear decay of the learning rate.
    lr_schedule = tf.keras.optimizers.schedules.PolynomialDecay(
        initial_learning_rate=init_lr,
@@ -96,26 +136,55 @@ def create_optimizer(


 class AdamWeightDecay(tf.keras.optimizers.Adam):
-    """Adam enables L2 weight decay and clip_by_global_norm on gradients.
-    Just adding the square of the weights to the loss function is *not* the
-    correct way of using L2 regularization/weight decay with Adam, since that will
-    interact with the m and v parameters in strange ways.
-    Instead we want ot decay the weights in a manner that doesn't interact with
-    the m/v parameters. This is equivalent to adding the square of the weights to
-    the loss with plain (non-momentum) SGD.
+    """
+    Adam enables L2 weight decay and clip_by_global_norm on gradients. Just adding the square of the weights to the
+    loss function is *not* the correct way of using L2 regularization/weight decay with Adam, since that will interact
+    with the m and v parameters in strange ways as shown in
+    `Decoupled Weight Decay Regularization <https://arxiv.org/abs/1711.05101>`__.
+
+    Instead we want ot decay the weights in a manner that doesn't interact with the m/v parameters. This is equivalent
+    to adding the square of the weights to the loss with plain (non-momentum) SGD.
+
+    Args:
+        learning_rate (:obj:`Union[float, tf.keras.optimizers.schedules.LearningRateSchedule]`, `optional`, defaults to 1e-3):
+            The learning rate to use or a schedule.
+        beta_1 (:obj:`float`, `optional`, defaults to 0.9):
+            The beta1 parameter in Adam, which is the exponential decay rate for the 1st momentum estimates.
+        beta_2 (:obj:`float`, `optional`, defaults to 0.999):
+            The beta2 parameter in Adam, which is the exponential decay rate for the 2nd momentum estimates.
+        epsilon (:obj:`float`, `optional`, defaults to 1e-7):
+            The epsilon paramenter in Adam, which is a small constant for numerical stability.
+        amsgrad (:obj:`bool`, `optional`, default to `False`):
+            Wheter to apply AMSGrad varient of this algorithm or not, see
+            `On the Convergence of Adam and Beyond <https://arxiv.org/abs/1904.09237>`__.
+        weight_decay_rate (:obj:`float`, `optional`, defaults to 0):
+            The weight decay to apply.
+        include_in_weight_decay (:obj:`List[str]`, `optional`):
+            List of the parameter names (or re patterns) to apply weight decay to. If none is passed, weight decay is
+            applied to all parameters by default (unless they are in :obj:`exclude_from_weight_decay`).
+        exclude_from_weight_decay (:obj:`List[str]`, `optional`):
+            List of the parameter names (or re patterns) to exclude from applying weight decay to. If a
+            :obj:`include_in_weight_decay` is passed, the names in it will supersede this list.
+        name (:obj:`str`, `optional`, defaults to 'AdamWeightDecay'):
+            Optional name for the operations created when applying gradients.
+        kwargs:
+            Keyward arguments. Allowed to be {``clipnorm``, ``clipvalue``, ``lr``, ``decay``}. ``clipnorm`` is clip
+            gradients by norm; ``clipvalue`` is clip gradients by value, ``decay`` is included for backward
+            compatibility to allow time inverse decay of learning rate. ``lr`` is included for backward compatibility,
+            recommended to use ``learning_rate`` instead.
    """

    def __init__(
        self,
-        learning_rate=0.001,
-        beta_1=0.9,
-        beta_2=0.999,
-        epsilon=1e-7,
-        amsgrad=False,
-        weight_decay_rate=0.0,
-        include_in_weight_decay=None,
-        exclude_from_weight_decay=None,
-        name="AdamWeightDecay",
+        learning_rate: Union[float, tf.keras.optimizers.schedules.LearningRateSchedule] = 0.001,
+        beta_1: float = 0.9,
+        beta_2: float = 0.999,
+        epsilon: float = 1e-7,
+        amsgrad: bool = False,
+        weight_decay_rate: float = 0.0,
+        include_in_weight_decay: Optional[List[str]] = None,
+        exclude_from_weight_decay: Optional[List[str]] = None,
+        name: str = "AdamWeightDecay",
        **kwargs
    ):
        super().__init__(learning_rate, beta_1, beta_2, epsilon, amsgrad, name, **kwargs)
--- a/src/transformers/pipelines.py
+++ b/src/transformers/pipelines.py
@@ -87,6 +87,18 @@ def get_framework(model=None):
    return framework


+class PipelineException(Exception):
+    """
+    Raised by pipelines when handling __call__
+    """
+
+    def __init__(self, task: str, model: str, reason: str):
+        super().__init__(reason)
+
+        self.task = task
+        self.model = model
+
+
 class ArgumentHandler(ABC):
    """
    Base interface for handling varargs for each Pipeline
@@ -603,6 +615,28 @@ class TextGenerationPipeline(Pipeline):
        "TFCTRLLMHeadModel",
    ]

+    # overriding _parse_and_tokenize to allow for unusual language-modeling tokenizer arguments
+
+    def _parse_and_tokenize(self, *args, padding=True, add_special_tokens=True, **kwargs):
+        """
+        Parse arguments and tokenize
+        """
+        # Parse arguments
+        if self.model.__class__.__name__ in ["TransfoXLLMHeadModel"]:
+            tokenizer_kwargs = {"add_space_before_punct_symbol": True}
+        else:
+            tokenizer_kwargs = {}
+        inputs = self._args_parser(*args, **kwargs)
+        inputs = self.tokenizer(
+            inputs,
+            add_special_tokens=add_special_tokens,
+            return_tensors=self.framework,
+            padding=padding,
+            **tokenizer_kwargs,
+        )
+
+        return inputs
+
    def __call__(
        self, *args, return_tensors=False, return_text=True, clean_up_tokenization_spaces=False, **generate_kwargs
    ):
@@ -808,6 +842,21 @@ class FillMaskPipeline(Pipeline):

        self.topk = topk

+    def ensure_exactly_one_mask_token(self, masked_index: np.ndarray):
+        numel = np.prod(masked_index.shape)
+        if numel > 1:
+            raise PipelineException(
+                "fill-mask",
+                self.model.base_model_prefix,
+                f"More than one mask_token ({self.tokenizer.mask_token}) is not supported",
+            )
+        elif numel < 1:
+            raise PipelineException(
+                "fill-mask",
+                self.model.base_model_prefix,
+                f"No mask_token ({self.tokenizer.mask_token}) found on the input",
+            )
+
    def __call__(self, *args, **kwargs):
        inputs = self._parse_and_tokenize(*args, **kwargs)
        outputs = self._forward(inputs, return_tensors=True)
@@ -820,14 +869,22 @@ class FillMaskPipeline(Pipeline):
            result = []

            if self.framework == "tf":
-                masked_index = tf.where(input_ids == self.tokenizer.mask_token_id).numpy().item()
-                logits = outputs[i, masked_index, :]
+                masked_index = tf.where(input_ids == self.tokenizer.mask_token_id).numpy()
+
+                # Fill mask pipeline supports only one ${mask_token} per sample
+                self.ensure_exactly_one_mask_token(masked_index)
+
+                logits = outputs[i, masked_index.item(), :]
                probs = tf.nn.softmax(logits)
                topk = tf.math.top_k(probs, k=self.topk)
                values, predictions = topk.values.numpy(), topk.indices.numpy()
            else:
-                masked_index = (input_ids == self.tokenizer.mask_token_id).nonzero().item()
-                logits = outputs[i, masked_index, :]
+                masked_index = (input_ids == self.tokenizer.mask_token_id).nonzero()
+
+                # Fill mask pipeline supports only one ${mask_token} per sample
+                self.ensure_exactly_one_mask_token(masked_index.numpy())
+
+                logits = outputs[i, masked_index.item(), :]
                probs = logits.softmax(dim=0)
                values, predictions = probs.topk(self.topk)

@@ -987,6 +1044,10 @@ class TokenClassificationPipeline(Pipeline):

                entities += [entity]

+            # Ensure if an entity is the latest one in the sequence it gets appended to the output
+            if len(entity_group_disagg) > 0:
+                entity_groups.append(self.group_entities(entity_group_disagg))
+
            # Append
            if self.grouped_entities:
                answers += [entity_groups]
@@ -1211,33 +1272,34 @@ class QuestionAnsweringPipeline(Pipeline):
            with self.device_placement():
                if self.framework == "tf":
                    fw_args = {k: tf.constant(v) for (k, v) in fw_args.items()}
-                    start, end = self.model(fw_args)
+                    start, end = self.model(fw_args)[:2]
                    start, end = start.numpy(), end.numpy()
                else:
                    with torch.no_grad():
                        # Retrieve the score for the context tokens only (removing question tokens)
                        fw_args = {k: torch.tensor(v, device=self.device) for (k, v) in fw_args.items()}
-                        start, end = self.model(**fw_args)
+                        start, end = self.model(**fw_args)[:2]
                        start, end = start.cpu().numpy(), end.cpu().numpy()

            min_null_score = 1000000  # large and positive
            answers = []
            for (feature, start_, end_) in zip(features, start, end):
-                # Normalize logits and spans to retrieve the answer
-                start_ = np.exp(start_) / np.sum(np.exp(start_))
-                end_ = np.exp(end_) / np.sum(np.exp(end_))
-
                # Mask padding and question
                start_, end_ = (
                    start_ * np.abs(np.array(feature.p_mask) - 1),
                    end_ * np.abs(np.array(feature.p_mask) - 1),
                )

+                # Mask CLS
+                start_[0] = end_[0] = 0
+
+                # Normalize logits and spans to retrieve the answer
+                start_ = np.exp(start_ - np.log(np.sum(np.exp(start_), axis=-1, keepdims=True)))
+                end_ = np.exp(end_ - np.log(np.sum(np.exp(end_), axis=-1, keepdims=True)))
+
                if kwargs["handle_impossible_answer"]:
                    min_null_score = min(min_null_score, (start_[0] * end_[0]).item())

-                start_[0] = end_[0] = 0
-
                starts, ends, scores = self.decode(start_, end_, kwargs["topk"], kwargs["max_answer_len"])
                char_to_word = np.array(example.char_to_word_offset)

--- a/src/transformers/testing_utils.py
+++ b/src/transformers/testing_utils.py
--- a/src/transformers/tokenization_bert.py
+++ b/src/transformers/tokenization_bert.py
@@ -606,7 +606,7 @@ class BertTokenizerFast(PreTrainedTokenizerFast):
        mask_token="[MASK]",
        clean_text=True,
        tokenize_chinese_chars=True,
-        strip_accents=True,
+        strip_accents=None,
        wordpieces_prefix="##",
        **kwargs
    ):
--- a/src/transformers/tokenization_gpt2.py
+++ b/src/transformers/tokenization_gpt2.py
@@ -102,17 +102,26 @@ def get_pairs(word):

 class GPT2Tokenizer(PreTrainedTokenizer):
    """
-    GPT-2 BPE tokenizer. Peculiarities:
+    GPT-2 BPE tokenizer, using byte-level Byte-Pair-Encoding.

-    - Byte-level Byte-Pair-Encoding
-    - Requires a space to start the input string => the encoding methods should be called with the
-      ``add_prefix_space`` flag set to ``True``.
-      Otherwise, this tokenizer ``encode`` and ``decode`` method will not conserve
-      the absence of a space at the beginning of a string:
+    This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will
+    be encoded differently whether it is at the beginning of the sentence (without space) or not:

    ::

-        tokenizer.decode(tokenizer.encode("Hello")) = " Hello"
+        >>> from transformers import GPT2Tokenizer
+        >>> tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
+        >>> tokenizer("Hello world")['input_ids']
+        [15496, 995]
+        >>> tokenizer(" Hello world")['input_ids']
+        [18435, 995]
+
+    You can get around that behavior by passing ``add_prefix_space=True`` when instantiating this tokenizer or when you
+    call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.
+
+    .. note::
+
+        When used with ``is_pretokenized=True``, this tokenizer will add a space before each word (even the first one).

    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the methods. Users
    should refer to the superclass for more information regarding methods.
@@ -137,6 +146,7 @@ class GPT2Tokenizer(PreTrainedTokenizer):
    vocab_files_names = VOCAB_FILES_NAMES
    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
+    model_input_names = ["attention_mask"]

    def __init__(
        self,
@@ -287,21 +297,30 @@ class GPT2Tokenizer(PreTrainedTokenizer):

 class GPT2TokenizerFast(PreTrainedTokenizerFast):
    """
-    Constructs a "Fast" GPT-2 BPE tokenizer (backed by HuggingFace's `tokenizers` library).
+    Constructs a "Fast" GPT-2 BPE tokenizer (backed by HuggingFace's `tokenizers` library), using byte-level
+    Byte-Pair-Encoding.

-    Peculiarities:
-
-    - Byte-level Byte-Pair-Encoding
-    - Requires a space to start the input string => the encoding methods should be called with the
-      ``add_prefix_space`` flag set to ``True``.
-      Otherwise, this tokenizer ``encode`` and ``decode`` method will not conserve
-      the absence of a space at the beginning of a string:
+    This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will
+    be encoded differently whether it is at the beginning of the sentence (without space) or not:

    ::

-        tokenizer.decode(tokenizer.encode("Hello")) = " Hello"
+        >>> from transformers import GPT2TokenizerFast
+        >>> tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
+        >>> tokenizer("Hello world")['input_ids']
+        [15496, 995]
+        >>> tokenizer(" Hello world")['input_ids']
+        [18435, 995]

-    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizerFast` which contains most of the methods. Users
+    You can get around that behavior by passing ``add_prefix_space=True`` when instantiating this tokenizer or when you
+    call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.
+
+    .. note::
+
+        When used with ``is_pretokenized=True``, this tokenizer needs to be instantiated with
+        ``add_prefix_space=True``.
+
+    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the methods. Users
    should refer to the superclass for more information regarding methods.

    Args:
@@ -330,6 +349,7 @@ class GPT2TokenizerFast(PreTrainedTokenizerFast):
    vocab_files_names = VOCAB_FILES_NAMES
    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
+    model_input_names = ["attention_mask"]

    def __init__(
        self,
--- a/src/transformers/tokenization_marian.py
+++ b/src/transformers/tokenization_marian.py
@@ -129,6 +129,8 @@ class MarianTokenizer(PreTrainedTokenizer):
        max_length: Optional[int] = None,
        pad_to_max_length: bool = True,
        return_tensors: str = "pt",
+        truncation_strategy="only_first",
+        padding="longest",
    ) -> BatchEncoding:
        """Prepare model inputs for translation. For best performance, translate one sentence at a time.
        Arguments:
@@ -147,24 +149,21 @@ class MarianTokenizer(PreTrainedTokenizer):
            raise ValueError(f"found empty string in src_texts: {src_texts}")
        self.current_spm = self.spm_source
        src_texts = [self.normalize(t) for t in src_texts]  # this does not appear to do much
-        model_inputs: BatchEncoding = self.batch_encode_plus(
-            src_texts,
+        tokenizer_kwargs = dict(
            add_special_tokens=True,
            return_tensors=return_tensors,
            max_length=max_length,
            pad_to_max_length=pad_to_max_length,
+            truncation_strategy=truncation_strategy,
+            padding=padding,
        )
+        model_inputs: BatchEncoding = self(src_texts, **tokenizer_kwargs)
+
        if tgt_texts is None:
            return model_inputs

        self.current_spm = self.spm_target
-        decoder_inputs: BatchEncoding = self.batch_encode_plus(
-            tgt_texts,
-            add_special_tokens=True,
-            return_tensors=return_tensors,
-            max_length=max_length,
-            pad_to_max_length=pad_to_max_length,
-        )
+        decoder_inputs: BatchEncoding = self(tgt_texts, **tokenizer_kwargs)
        for k, v in decoder_inputs.items():
            model_inputs[f"decoder_{k}"] = v
        self.current_spm = self.spm_source
--- a/src/transformers/tokenization_openai.py
+++ b/src/transformers/tokenization_openai.py
@@ -96,6 +96,7 @@ class OpenAIGPTTokenizer(PreTrainedTokenizer):
    vocab_files_names = VOCAB_FILES_NAMES
    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
+    model_input_names = ["attention_mask"]

    def __init__(self, vocab_file, merges_file, unk_token="<unk>", **kwargs):
        super().__init__(unk_token=unk_token, **kwargs)
@@ -261,6 +262,7 @@ class OpenAIGPTTokenizerFast(PreTrainedTokenizerFast):
    vocab_files_names = VOCAB_FILES_NAMES
    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
+    model_input_names = ["attention_mask"]

    def __init__(self, vocab_file, merges_file, unk_token="<unk>", **kwargs):
        kwargs.setdefault("unk_token", unk_token)
--- a/src/transformers/tokenization_roberta.py
+++ b/src/transformers/tokenization_roberta.py
@@ -62,17 +62,26 @@ PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {

 class RobertaTokenizer(GPT2Tokenizer):
    """
-    Constructs a RoBERTa BPE tokenizer, derived from the GPT-2 tokenizer. Peculiarities:
+    Constructs a RoBERTa BPE tokenizer, derived from the GPT-2 tokenizer, using byte-level Byte-Pair-Encoding.

-    - Byte-level Byte-Pair-Encoding
-    - Requires a space to start the input string => the encoding methods should be called with the
-      ``add_prefix_space`` flag set to ``True``.
-      Otherwise, this tokenizer ``encode`` and ``decode`` method will not conserve
-      the absence of a space at the beginning of a string:
+    This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will
+    be encoded differently whether it is at the beginning of the sentence (without space) or not:

    ::

-        tokenizer.decode(tokenizer.encode("Hello")) = " Hello"
+        >>> from transformers import RobertaTokenizer
+        >>> tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
+        >>> tokenizer("Hello world")['input_ids']
+        [0, 31414, 232, 328, 2]
+        >>> tokenizer(" Hello world")['input_ids']
+        [0, 20920, 232, 2]
+
+    You can get around that behavior by passing ``add_prefix_space=True`` when instantiating this tokenizer or when you
+    call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.
+
+    .. note::
+
+        When used with ``is_pretokenized=True``, this tokenizer will add a space before each word (even the first one).

    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the methods. Users
    should refer to the superclass for more information regarding methods.
@@ -251,19 +260,28 @@ class RobertaTokenizer(GPT2Tokenizer):

 class RobertaTokenizerFast(GPT2TokenizerFast):
    """
-    Constructs a "Fast" RoBERTa BPE tokenizer (backed by HuggingFace's `tokenizers` library).
+    Constructs a "Fast" RoBERTa BPE tokenizer (backed by HuggingFace's `tokenizers` library), derived from the GPT-2
+    tokenizer, using byte-level Byte-Pair-Encoding.

-    Peculiarities:
-
-    - Byte-level Byte-Pair-Encoding
-    - Requires a space to start the input string => the encoding methods should be called with the
-      ``add_prefix_space`` flag set to ``True``.
-      Otherwise, this tokenizer ``encode`` and ``decode`` method will not conserve
-      the absence of a space at the beginning of a string:
+    This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will
+    be encoded differently whether it is at the beginning of the sentence (without space) or not:

    ::

-        tokenizer.decode(tokenizer.encode("Hello")) = " Hello"
+        >>> from transformers import RobertaTokenizerFast
+        >>> tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
+        >>> tokenizer("Hello world")['input_ids']
+        [0, 31414, 232, 328, 2]
+        >>> tokenizer(" Hello world")['input_ids']
+        [0, 20920, 232, 2]
+
+    You can get around that behavior by passing ``add_prefix_space=True`` when instantiating this tokenizer or when you
+    call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.
+
+    .. note::
+
+        When used with ``is_pretokenized=True``, this tokenizer needs to be instantiated with
+        ``add_prefix_space=True``.

    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizerFast` which contains most of the methods. Users
    should refer to the superclass for more information regarding methods.
--- a/src/transformers/tokenization_utils.py
+++ b/src/transformers/tokenization_utils.py
@@ -454,12 +454,12 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
        first_ids = get_input_ids(text)
        second_ids = get_input_ids(text_pair) if text_pair is not None else None

-        return self._prepare_for_model(
+        return self.prepare_for_model(
            first_ids,
            pair_ids=second_ids,
            add_special_tokens=add_special_tokens,
-            padding_strategy=padding_strategy,
-            truncation_strategy=truncation_strategy,
+            padding=padding_strategy.value,
+            truncation=truncation_strategy.value,
            max_length=max_length,
            stride=stride,
            pad_to_multiple_of=pad_to_multiple_of,
@@ -584,12 +584,12 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):

        batch_outputs = {}
        for first_ids, second_ids in batch_ids_pairs:
-            outputs = self._prepare_for_model(
+            outputs = self.prepare_for_model(
                first_ids,
                second_ids,
                add_special_tokens=add_special_tokens,
-                padding_strategy=PaddingStrategy.DO_NOT_PAD,  # we pad in batch afterward
-                truncation_strategy=truncation_strategy,
+                padding=PaddingStrategy.DO_NOT_PAD.value,  # we pad in batch afterward
+                truncation=truncation_strategy.value,
                max_length=max_length,
                stride=stride,
                pad_to_multiple_of=None,  # we pad in batch afterward
@@ -620,109 +620,6 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):

        return batch_outputs

-    @add_end_docstrings(ENCODE_KWARGS_DOCSTRING, ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING)
-    def _prepare_for_model(
-        self,
-        ids: List[int],
-        pair_ids: Optional[List[int]] = None,
-        add_special_tokens: bool = True,
-        padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
-        truncation_strategy: TruncationStrategy = TruncationStrategy.DO_NOT_TRUNCATE,
-        max_length: Optional[int] = None,
-        stride: int = 0,
-        pad_to_multiple_of: Optional[int] = None,
-        return_tensors: Optional[str] = None,
-        prepend_batch_axis: bool = False,
-        return_token_type_ids: Optional[bool] = None,
-        return_attention_mask: Optional[bool] = None,
-        return_overflowing_tokens: bool = False,
-        return_special_tokens_mask: bool = False,
-        return_length: bool = False,
-        verbose: bool = True,
-    ) -> BatchEncoding:
-        """ Prepares a sequence of input id, or a pair of sequences of inputs ids so that it can be used by the model.
-        It adds special tokens, truncates sequences if overflowing while taking into account the special tokens and
-        manages a moving window (with user defined stride) for overflowing tokens
-
-        Args:
-            ids: list of tokenized input ids. Can be obtained from a string by chaining the
-                `tokenize` and `convert_tokens_to_ids` methods.
-            pair_ids: Optional second list of input ids. Can be obtained from a string by chaining the
-                `tokenize` and `convert_tokens_to_ids` methods.
-        """
-        pair = bool(pair_ids is not None)
-        len_ids = len(ids)
-        len_pair_ids = len(pair_ids) if pair else 0
-
-        # Load from model defaults
-        if return_token_type_ids is None:
-            return_token_type_ids = "token_type_ids" in self.model_input_names
-        if return_attention_mask is None:
-            return_attention_mask = "attention_mask" in self.model_input_names
-
-        encoded_inputs = {}
-
-        # Compute the total size of the returned encodings
-        total_len = len_ids + len_pair_ids + (self.num_special_tokens_to_add(pair=pair) if add_special_tokens else 0)
-
-        # Truncation: Handle max sequence length
-        if truncation_strategy != TruncationStrategy.DO_NOT_TRUNCATE and max_length and total_len > max_length:
-            ids, pair_ids, overflowing_tokens = self.truncate_sequences(
-                ids,
-                pair_ids=pair_ids,
-                num_tokens_to_remove=total_len - max_length,
-                truncation_strategy=truncation_strategy,
-                stride=stride,
-            )
-            if return_overflowing_tokens:
-                encoded_inputs["overflowing_tokens"] = overflowing_tokens
-                encoded_inputs["num_truncated_tokens"] = total_len - max_length
-
-        # Add special tokens
-        if add_special_tokens:
-            sequence = self.build_inputs_with_special_tokens(ids, pair_ids)
-            token_type_ids = self.create_token_type_ids_from_sequences(ids, pair_ids)
-        else:
-            sequence = ids + pair_ids if pair else ids
-            token_type_ids = [0] * len(ids) + ([1] * len(pair_ids) if pair else [])
-
-        # Build output dictionnary
-        encoded_inputs["input_ids"] = sequence
-        if return_token_type_ids:
-            encoded_inputs["token_type_ids"] = token_type_ids
-        if return_special_tokens_mask:
-            if add_special_tokens:
-                encoded_inputs["special_tokens_mask"] = self.get_special_tokens_mask(ids, pair_ids)
-            else:
-                encoded_inputs["special_tokens_mask"] = [0] * len(sequence)
-
-        # Check lengths
-        if max_length is None and len(encoded_inputs["input_ids"]) > self.model_max_length and verbose:
-            logger.warning(
-                "Token indices sequence length is longer than the specified maximum sequence length "
-                "for this model ({} > {}). Running this sequence through the model will result in "
-                "indexing errors".format(len(ids), self.model_max_length)
-            )
-
-        # Padding
-        if padding_strategy != PaddingStrategy.DO_NOT_PAD or return_attention_mask:
-            encoded_inputs = self.pad(
-                encoded_inputs,
-                max_length=max_length,
-                padding=padding_strategy.value,
-                pad_to_multiple_of=pad_to_multiple_of,
-                return_attention_mask=return_attention_mask,
-            )
-
-        if return_length:
-            encoded_inputs["length"] = len(encoded_inputs["input_ids"])
-
-        batch_outputs = BatchEncoding(
-            encoded_inputs, tensor_type=return_tensors, prepend_batch_axis=prepend_batch_axis
-        )
-
-        return batch_outputs
-
    def prepare_for_tokenization(self, text: str, is_pretokenized=False, **kwargs) -> (str, dict):
        """ Performs any necessary transformations before tokenization.

@@ -731,90 +628,6 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):
        """
        return (text, kwargs)

-    def truncate_sequences(
-        self,
-        ids: List[int],
-        pair_ids: Optional[List[int]] = None,
-        num_tokens_to_remove: int = 0,
-        truncation_strategy: Union[str, TruncationStrategy] = "only_first",
-        stride: int = 0,
-    ) -> Tuple[List[int], List[int], List[int]]:
-        """ Truncates a sequence pair in place to the maximum length.
-
-        Args:
-            ids: list of tokenized input ids. Can be obtained from a string by chaining the
-                `tokenize` and `convert_tokens_to_ids` methods.
-            pair_ids: Optional second list of input ids. Can be obtained from a string by chaining the
-                `tokenize` and `convert_tokens_to_ids` methods.
-            num_tokens_to_remove (:obj:`int`, `optional`, defaults to ``0``):
-                number of tokens to remove using the truncation strategy
-            truncation_strategy (:obj:`string`, `optional`, defaults to "only_first"):
-                String selected in the following options:
-
-                - 'only_first' (default): Only truncate the first sequence. raise an error if the first sequence is shorter or equal to than num_tokens_to_remove.
-                - 'only_second': Only truncate the second sequence
-                - 'longest_first': Iteratively reduce the inputs sequence until the input is under max_length
-                  starting from the longest one at each token (when there is a pair of input sequences).
-                  Overflowing tokens only contains overflow from the first sequence.
-                - 'do_not_truncate'
-            stride (:obj:`int`, `optional`, defaults to ``0``):
-                If set to a number along with max_length, the overflowing tokens returned will contain some tokens
-                from the main sequence returned. The value of this argument defines the number of additional tokens.
-        """
-        if num_tokens_to_remove <= 0:
-            return ids, pair_ids, []
-
-        if not isinstance(truncation_strategy, TruncationStrategy):
-            truncation_strategy = TruncationStrategy(truncation_strategy)
-
-        overflowing_tokens = []
-        if truncation_strategy == TruncationStrategy.LONGEST_FIRST:
-            for _ in range(num_tokens_to_remove):
-                if pair_ids is None or len(ids) > len(pair_ids):
-                    ids = ids[:-1]
-                else:
-                    pair_ids = pair_ids[:-1]
-        elif truncation_strategy == TruncationStrategy.ONLY_FIRST:
-            if len(ids) > num_tokens_to_remove:
-                window_len = min(len(ids), stride + num_tokens_to_remove)
-                overflowing_tokens = ids[-window_len:]
-                ids = ids[:-num_tokens_to_remove]
-            else:
-                logger.error(
-                    f"We need to remove {num_tokens_to_remove} to truncate the input"
-                    f"but the first sequence has a length {len(ids)}. "
-                    f"Please select another truncation strategy than {truncation_strategy}, "
-                    f"for instance 'longest_first' or 'only_second'."
-                )
-        elif truncation_strategy == TruncationStrategy.ONLY_SECOND and pair_ids is not None:
-            if len(pair_ids) > num_tokens_to_remove:
-                window_len = min(len(pair_ids), stride + num_tokens_to_remove)
-                overflowing_tokens = pair_ids[-window_len:]
-                pair_ids = pair_ids[:-num_tokens_to_remove]
-            else:
-                logger.error(
-                    f"We need to remove {num_tokens_to_remove} to truncate the input"
-                    f"but the second sequence has a length {len(pair_ids)}. "
-                    f"Please select another truncation strategy than {truncation_strategy}, "
-                    f"for instance 'longest_first' or 'only_first'."
-                )
-
-        return (ids, pair_ids, overflowing_tokens)
-
-    def create_token_type_ids_from_sequences(self, token_ids_0: List, token_ids_1: Optional[List] = None) -> List[int]:
-        if token_ids_1 is None:
-            return len(token_ids_0) * [0]
-        return [0] * len(token_ids_0) + [1] * len(token_ids_1)
-
-    def build_inputs_with_special_tokens(self, token_ids_0: List, token_ids_1: Optional[List] = None) -> List:
-        """
-        Build model inputs from a sequence or a pair of sequence for sequence classification tasks
-        by concatenating and adding special tokens. This implementation does not add special tokens.
-        """
-        if token_ids_1 is None:
-            return token_ids_0
-        return token_ids_0 + token_ids_1
-
    def get_special_tokens_mask(
        self, token_ids_0: List, token_ids_1: Optional[List] = None, already_has_special_tokens: bool = False
    ) -> List[int]:
@@ -836,7 +649,7 @@ class PreTrainedTokenizer(PreTrainedTokenizerBase):

    def convert_ids_to_tokens(
        self, ids: Union[int, List[int]], skip_special_tokens: bool = False
-    ) -> Union[int, List[int]]:
+    ) -> Union[str, List[str]]:
        """ Converts a single index or a sequence of indices (integers) in a token "
            (resp.) a sequence of tokens (str), using the vocabulary and added tokens.

--- a/src/transformers/tokenization_utils_base.py
+++ b/src/transformers/tokenization_utils_base.py
@@ -945,9 +945,9 @@ ENCODE_KWARGS_DOCSTRING = r"""
            `truncation` (:obj:`Union[bool, str]`, `optional`, defaults to :obj:`False`):
                Activate and control truncation. Accepts the following values:

-                * `True` or `'only_first'`: truncate to a max length specified in `max_length` or to the max acceptable input length for the model if no length is provided (`max_length=None`). This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided,
+                * `True` or `'longest_first'`: truncate to a max length specified in `max_length` or to the max acceptable input length for the model if no length is provided (`max_length=None`). This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided,
+                * `'only_first'`: truncate to a max length specified in `max_length` or to the max acceptable input length for the model if no length is provided (`max_length=None`). This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided,
                * `'only_second'`: truncate to a max length specified in `max_length` or to the max acceptable input length for the model if no length is provided (`max_length=None`). This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided,
-                * `'longest_first'`: truncate to a max length specified in `max_length` or to the max acceptable input length for the model if no length is provided (`max_length=None`). This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided,
                * `False` or `'do_not_truncate'` (default): No truncation (i.e. can output batch with sequences length greater than the model max admissible input size)
            `max_length` (:obj:`Union[int, None]`, `optional`, defaults to :obj:`None`):
                Control the length for padding/truncation. Accepts the following values
@@ -1446,10 +1446,11 @@ class PreTrainedTokenizerBase(SpecialTokensMixin):
                logger.warning(
                    "Truncation was not explicitely activated but `max_length` is provided a specific value, "
                    "please use `truncation=True` to explicitely truncate examples to max length. "
-                    "Defaulting to 'only_first' truncation strategy. "
-                    "If you encode pairs of sequences (GLUE-style) with the tokenizer you may want to check this is the right behavior."
+                    "Defaulting to 'longest_first' truncation strategy. "
+                    "If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy "
+                    "more precisely by providing a specific strategy to `truncation`."
                )
-            truncation = "only_first"
+            truncation = "longest_first"

        # Get padding strategy
        if padding is False and old_pad_to_max_length:
@@ -1469,7 +1470,7 @@ class PreTrainedTokenizerBase(SpecialTokensMixin):
        elif padding is not False:
            if padding is True:
                padding_strategy = PaddingStrategy.LONGEST  # Default to pad to the longest sequence in the batch
-            else:
+            elif not isinstance(padding, PaddingStrategy):
                padding_strategy = PaddingStrategy(padding)
        else:
            padding_strategy = PaddingStrategy.DO_NOT_PAD
@@ -1492,9 +1493,9 @@ class PreTrainedTokenizerBase(SpecialTokensMixin):
        elif truncation is not False:
            if truncation is True:
                truncation_strategy = (
-                    TruncationStrategy.ONLY_FIRST
-                )  # Default to truncate the first sequences in pairs of inputs
-            else:
+                    TruncationStrategy.LONGEST_FIRST
+                )  # Default to truncate the longest sequences in pairs of inputs
+            elif not isinstance(truncation, TruncationStrategy):
                truncation_strategy = TruncationStrategy(truncation)
        else:
            truncation_strategy = TruncationStrategy.DO_NOT_TRUNCATE
@@ -1960,6 +1961,225 @@ class PreTrainedTokenizerBase(SpecialTokensMixin):

        return BatchEncoding(batch_outputs, tensor_type=return_tensors)

+    def create_token_type_ids_from_sequences(self, token_ids_0: List, token_ids_1: Optional[List] = None) -> List[int]:
+        if token_ids_1 is None:
+            return len(token_ids_0) * [0]
+        return [0] * len(token_ids_0) + [1] * len(token_ids_1)
+
+    def build_inputs_with_special_tokens(self, token_ids_0: List, token_ids_1: Optional[List] = None) -> List:
+        """
+        Build model inputs from a sequence or a pair of sequence for sequence classification tasks
+        by concatenating and adding special tokens. This implementation does not add special tokens.
+        """
+        if token_ids_1 is None:
+            return token_ids_0
+        return token_ids_0 + token_ids_1
+
+    @add_end_docstrings(ENCODE_KWARGS_DOCSTRING, ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING)
+    def prepare_for_model(
+        self,
+        ids: List[int],
+        pair_ids: Optional[List[int]] = None,
+        add_special_tokens: bool = True,
+        padding: Union[bool, str] = False,
+        truncation: Union[bool, str] = False,
+        max_length: Optional[int] = None,
+        stride: int = 0,
+        pad_to_multiple_of: Optional[int] = None,
+        return_tensors: Optional[Union[str, TensorType]] = None,
+        return_token_type_ids: Optional[bool] = None,
+        return_attention_mask: Optional[bool] = None,
+        return_overflowing_tokens: bool = False,
+        return_special_tokens_mask: bool = False,
+        return_offsets_mapping: bool = False,
+        return_length: bool = False,
+        verbose: bool = True,
+        prepend_batch_axis: bool = False,
+        **kwargs
+    ) -> BatchEncoding:
+        """ Prepares a sequence of input id, or a pair of sequences of inputs ids so that it can be used by the model.
+        It adds special tokens, truncates sequences if overflowing while taking into account the special tokens and
+        manages a moving window (with user defined stride) for overflowing tokens
+
+        Args:
+            ids: list of tokenized input ids. Can be obtained from a string by chaining the
+                `tokenize` and `convert_tokens_to_ids` methods.
+            pair_ids: Optional second list of input ids. Can be obtained from a string by chaining the
+                `tokenize` and `convert_tokens_to_ids` methods.
+        """
+
+        if "return_lengths" in kwargs:
+            if verbose:
+                warnings.warn(
+                    "The PreTrainedTokenizerBase.prepare_for_model `return_lengths` parameter is deprecated. "
+                    "Please use `return_length` instead.",
+                    FutureWarning,
+                )
+            return_length = kwargs["return_lengths"]
+
+        # Backward compatibility for 'truncation_strategy', 'pad_to_max_length'
+        padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
+            padding=padding,
+            truncation=truncation,
+            max_length=max_length,
+            pad_to_multiple_of=pad_to_multiple_of,
+            verbose=verbose,
+            **kwargs,
+        )
+
+        pair = bool(pair_ids is not None)
+        len_ids = len(ids)
+        len_pair_ids = len(pair_ids) if pair else 0
+
+        # Load from model defaults
+        if return_token_type_ids is None:
+            return_token_type_ids = "token_type_ids" in self.model_input_names
+        if return_attention_mask is None:
+            return_attention_mask = "attention_mask" in self.model_input_names
+
+        encoded_inputs = {}
+
+        # Compute the total size of the returned encodings
+        total_len = len_ids + len_pair_ids + (self.num_special_tokens_to_add(pair=pair) if add_special_tokens else 0)
+
+        # Truncation: Handle max sequence length
+        if truncation_strategy != TruncationStrategy.DO_NOT_TRUNCATE and max_length and total_len > max_length:
+            ids, pair_ids, overflowing_tokens = self.truncate_sequences(
+                ids,
+                pair_ids=pair_ids,
+                num_tokens_to_remove=total_len - max_length,
+                truncation_strategy=truncation_strategy,
+                stride=stride,
+            )
+            if return_overflowing_tokens:
+                encoded_inputs["overflowing_tokens"] = overflowing_tokens
+                encoded_inputs["num_truncated_tokens"] = total_len - max_length
+
+        # Add special tokens
+        if add_special_tokens:
+            sequence = self.build_inputs_with_special_tokens(ids, pair_ids)
+            token_type_ids = self.create_token_type_ids_from_sequences(ids, pair_ids)
+        else:
+            sequence = ids + pair_ids if pair else ids
+            token_type_ids = [0] * len(ids) + ([1] * len(pair_ids) if pair else [])
+
+        # Build output dictionnary
+        encoded_inputs["input_ids"] = sequence
+        if return_token_type_ids:
+            encoded_inputs["token_type_ids"] = token_type_ids
+        if return_special_tokens_mask:
+            if add_special_tokens:
+                encoded_inputs["special_tokens_mask"] = self.get_special_tokens_mask(ids, pair_ids)
+            else:
+                encoded_inputs["special_tokens_mask"] = [0] * len(sequence)
+
+        # Check lengths
+        if max_length is None and len(encoded_inputs["input_ids"]) > self.model_max_length and verbose:
+            logger.warning(
+                "Token indices sequence length is longer than the specified maximum sequence length "
+                "for this model ({} > {}). Running this sequence through the model will result in "
+                "indexing errors".format(len(ids), self.model_max_length)
+            )
+
+        # Padding
+        if padding_strategy != PaddingStrategy.DO_NOT_PAD or return_attention_mask:
+            encoded_inputs = self.pad(
+                encoded_inputs,
+                max_length=max_length,
+                padding=padding_strategy.value,
+                pad_to_multiple_of=pad_to_multiple_of,
+                return_attention_mask=return_attention_mask,
+            )
+
+        if return_length:
+            encoded_inputs["length"] = len(encoded_inputs["input_ids"])
+
+        batch_outputs = BatchEncoding(
+            encoded_inputs, tensor_type=return_tensors, prepend_batch_axis=prepend_batch_axis
+        )
+
+        return batch_outputs
+
+    def truncate_sequences(
+        self,
+        ids: List[int],
+        pair_ids: Optional[List[int]] = None,
+        num_tokens_to_remove: int = 0,
+        truncation_strategy: Union[str, TruncationStrategy] = "longest_first",
+        stride: int = 0,
+    ) -> Tuple[List[int], List[int], List[int]]:
+        """ Truncates a sequence pair in place to the maximum length.
+
+        Args:
+            ids: list of tokenized input ids. Can be obtained from a string by chaining the
+                `tokenize` and `convert_tokens_to_ids` methods.
+            pair_ids: Optional second list of input ids. Can be obtained from a string by chaining the
+                `tokenize` and `convert_tokens_to_ids` methods.
+            num_tokens_to_remove (:obj:`int`, `optional`, defaults to ``0``):
+                number of tokens to remove using the truncation strategy
+            truncation_strategy (:obj:`string`, `optional`, defaults to "longest_first"):
+                String selected in the following options:
+
+                - 'longest_first' (default): Iteratively reduce the inputs sequence until the input is under max_length
+                  starting from the longest one at each token (when there is a pair of input sequences).
+                  Overflowing tokens only contains overflow from the first sequence.
+                - 'only_first': Only truncate the first sequence. raise an error if the first sequence is shorter or equal to than num_tokens_to_remove.
+                - 'only_second': Only truncate the second sequence
+                - 'do_not_truncate'
+            stride (:obj:`int`, `optional`, defaults to ``0``):
+                If set to a number along with max_length, the overflowing tokens returned will contain some tokens
+                from the main sequence returned. The value of this argument defines the number of additional tokens.
+        """
+        if num_tokens_to_remove <= 0:
+            return ids, pair_ids, []
+
+        if not isinstance(truncation_strategy, TruncationStrategy):
+            truncation_strategy = TruncationStrategy(truncation_strategy)
+
+        overflowing_tokens = []
+        if truncation_strategy == TruncationStrategy.LONGEST_FIRST:
+            for _ in range(num_tokens_to_remove):
+                if pair_ids is None or len(ids) > len(pair_ids):
+                    if not overflowing_tokens:
+                        window_len = min(len(ids), stride + 1)
+                    else:
+                        window_len = 1
+                    overflowing_tokens.extend(ids[-window_len:])
+                    ids = ids[:-1]
+                else:
+                    if not overflowing_tokens:
+                        window_len = min(len(pair_ids), stride + 1)
+                    else:
+                        window_len = 1
+                    overflowing_tokens.extend(pair_ids[-window_len:])
+                    pair_ids = pair_ids[:-1]
+        elif truncation_strategy == TruncationStrategy.ONLY_FIRST:
+            if len(ids) > num_tokens_to_remove:
+                window_len = min(len(ids), stride + num_tokens_to_remove)
+                overflowing_tokens = ids[-window_len:]
+                ids = ids[:-num_tokens_to_remove]
+            else:
+                logger.error(
+                    f"We need to remove {num_tokens_to_remove} to truncate the input"
+                    f"but the first sequence has a length {len(ids)}. "
+                    f"Please select another truncation strategy than {truncation_strategy}, "
+                    f"for instance 'longest_first' or 'only_second'."
+                )
+        elif truncation_strategy == TruncationStrategy.ONLY_SECOND and pair_ids is not None:
+            if len(pair_ids) > num_tokens_to_remove:
+                window_len = min(len(pair_ids), stride + num_tokens_to_remove)
+                overflowing_tokens = pair_ids[-window_len:]
+                pair_ids = pair_ids[:-num_tokens_to_remove]
+            else:
+                logger.error(
+                    f"We need to remove {num_tokens_to_remove} to truncate the input"
+                    f"but the second sequence has a length {len(pair_ids)}. "
+                    f"Please select another truncation strategy than {truncation_strategy}, "
+                    f"for instance 'longest_first' or 'only_first'."
+                )
+
+        return (ids, pair_ids, overflowing_tokens)
+
    def _pad(
        self,
        encoded_inputs: Union[Dict[str, EncodedInput], BatchEncoding],
--- a/src/transformers/trainer.py
+++ b/src/transformers/trainer.py
@@ -1,7 +1,6 @@
 import logging
 import math
 import os
-import random
 import re
 import shutil
 import warnings
@@ -23,7 +22,14 @@ from .data.data_collator import DataCollator, default_data_collator
 from .file_utils import is_apex_available, is_torch_tpu_available
 from .modeling_utils import PreTrainedModel
 from .optimization import AdamW, get_linear_schedule_with_warmup
-from .trainer_utils import PREFIX_CHECKPOINT_DIR, EvalPrediction, PredictionOutput, TrainOutput, is_wandb_available
+from .trainer_utils import (
+    PREFIX_CHECKPOINT_DIR,
+    EvalPrediction,
+    PredictionOutput,
+    TrainOutput,
+    is_wandb_available,
+    set_seed,
+)
 from .training_args import TrainingArguments


@@ -60,18 +66,13 @@ if is_wandb_available():
 logger = logging.getLogger(__name__)


-def set_seed(seed: int):
-    random.seed(seed)
-    np.random.seed(seed)
-    torch.manual_seed(seed)
-    torch.cuda.manual_seed_all(seed)
-    # ^^ safe to call this function even if cuda is not available
-
-
@contextmanager
 def torch_distributed_zero_first(local_rank: int):
    """
    Decorator to make all processes in distributed training wait for each local_master to do something.
+
+    Args:
+        local_rank (:obj:`int`): The rank of the local process.
    """
    if local_rank not in [-1, 0]:
        torch.distributed.barrier()
@@ -133,7 +134,31 @@ def get_tpu_sampler(dataset: Dataset):
 class Trainer:
    """
    Trainer is a simple but feature-complete training and eval loop for PyTorch,
-    optimized for Transformers.
+    optimized for 🤗 Transformers.
+
+    Args:
+        model (:class:`~transformers.PreTrainedModel`):
+            The model to train, evaluate or use for predictions.
+        args (:class:`~transformers.TrainingArguments`):
+            The arguments to tweak training.
+        data_collator (:obj:`DataCollator`, `optional`, defaults to :func:`~transformers.default_data_collator`):
+            The function to use to from a batch from a list of elements of :obj:`train_dataset` or
+            :obj:`eval_dataset`.
+        train_dataset (:obj:`Dataset`, `optional`):
+            The dataset to use for training.
+        eval_dataset (:obj:`Dataset`, `optional`):
+            The dataset to use for evaluation.
+        compute_metrics (:obj:`Callable[[EvalPrediction], Dict]`, `optional`):
+            The function that will be used to compute metrics at evaluation. Must take a
+            :class:`~transformers.EvalPrediction` and return a dictionary string to metric values.
+        prediction_loss_only (:obj:`bool`, `optional`, defaults to `False`):
+            When performing evaluation and predictions, only returns the loss.
+        tb_writer (:obj:`SummaryWriter`, `optional`):
+            Object to write to TensorBoard.
+        optimizers (:obj:`Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR`, `optional`):
+            A tuple containing the optimizer and the scheduler to use. Will default to an instance of
+            :class:`~transformers.AdamW` on your model and a scheduler given by
+            :func:`~transformers.get_linear_schedule_with_warmup` controlled by :obj:`args`.
    """

    model: PreTrainedModel
@@ -160,14 +185,6 @@ class Trainer:
        tb_writer: Optional["SummaryWriter"] = None,
        optimizers: Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR] = None,
    ):
-        """
-        Trainer is a simple but feature-complete training and eval loop for PyTorch,
-        optimized for Transformers.
-
-        Args:
-            prediction_loss_only:
-                (Optional) in evaluation and prediction, only return the loss
-        """
        self.model = model.to(args.device)
        self.args = args
        self.data_collator = data_collator if data_collator is not None else default_data_collator
@@ -210,6 +227,9 @@ class Trainer:
            )

    def get_train_dataloader(self) -> DataLoader:
+        """
+        Returns the training :class:`~torch.utils.data.DataLoader`.
+        """
        if self.train_dataset is None:
            raise ValueError("Trainer: training requires a train_dataset.")
        if is_torch_tpu_available():
@@ -232,6 +252,13 @@ class Trainer:
        return data_loader

    def get_eval_dataloader(self, eval_dataset: Optional[Dataset] = None) -> DataLoader:
+        """
+        Returns the evaluation :class:`~torch.utils.data.DataLoader`.
+
+        Args:
+            eval_dataset (:obj:`Dataset`, `optional`):
+                If provided, will override `self.eval_dataset`.
+        """
        if eval_dataset is None and self.eval_dataset is None:
            raise ValueError("Trainer: evaluation requires an eval_dataset.")

@@ -257,6 +284,12 @@ class Trainer:
        return data_loader

    def get_test_dataloader(self, test_dataset: Dataset) -> DataLoader:
+        """
+        Returns the test :class:`~torch.utils.data.DataLoader`.
+
+        Args:
+            test_dataset (obj:`Dataset`): The test dataset to use.
+        """
        # We use the same batch_size as for eval.
        if is_torch_tpu_available():
            sampler = SequentialDistributedSampler(
@@ -283,9 +316,8 @@ class Trainer:
        """
        Setup the optimizer and the learning rate scheduler.

-        We provide a reasonable default that works well.
-        If you want to use something else, you can pass a tuple in the Trainer's init,
-        or override this method in a subclass.
+        We provide a reasonable default that works well. If you want to use something else, you can pass a tuple in the
+        Trainer's init through :obj:`optimizers`, or override this method in a subclass.
        """
        if self.optimizers is not None:
            return self.optimizers
@@ -336,7 +368,7 @@ class Trainer:

    def num_examples(self, dataloader: DataLoader) -> int:
        """
-        Helper to get num of examples from a DataLoader, by accessing its Dataset.
+        Helper to get number of samples in a :class:`~torch.utils.data.DataLoader` by accessing its Dataset.
        """
        return len(dataloader.dataset)

@@ -345,9 +377,9 @@ class Trainer:
        Main training entry point.

        Args:
-            model_path:
-                (Optional) Local path to model if model to train has been instantiated from a local path
-                If present, we will try reloading the optimizer/scheduler states from there.
+            model_path (:obj:`str`, `optional`):
+                Local path to the model if the model to train has been instantiated from a local path. If present,
+                training will resume from the optimizer/scheduler states loaded here.
        """
        train_dataloader = self.get_train_dataloader()
        if self.args.max_steps > 0:
@@ -453,6 +485,10 @@ class Trainer:
            else:
                epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=not self.is_local_master())

+            # Reset the past mems state at the beginning of each epoch if necessary.
+            if self.args.past_index >= 0:
+                self._past = None
+
            for step, inputs in enumerate(epoch_iterator):

                # Skip past any already trained steps if resuming training
@@ -497,8 +533,8 @@ class Trainer:

                        self._log(logs)

-                        if self.args.evaluate_during_training:
-                            self.evaluate()
+                    if self.args.evaluate_during_training and self.global_step % self.args.eval_steps == 0:
+                        self.evaluate()

                    if self.args.save_steps > 0 and self.global_step % self.args.save_steps == 0:
                        # In all cases (even distributed/parallel), self.model is always a reference
@@ -529,12 +565,15 @@ class Trainer:
            if self.args.max_steps > 0 and self.global_step > self.args.max_steps:
                train_iterator.close()
                break
-            if self.args.tpu_metrics_debug:
+            if self.args.tpu_metrics_debug or self.args.debug:
                # tpu-comment: Logging debug metrics for PyTorch/XLA (compile, execute times, ops, etc.)
                xm.master_print(met.metrics_report())

        if self.tb_writer:
            self.tb_writer.close()
+        if self.args.past_index and hasattr(self, "_past"):
+            # Clean the state at the end of training
+            delattr(self, "_past")

        logger.info("\n\nTraining completed. Do not forget to share your model on huggingface.co/models =)\n\n")
        return TrainOutput(self.global_step, tr_loss / self.global_step)
@@ -577,9 +616,15 @@ class Trainer:
            if isinstance(v, torch.Tensor):
                inputs[k] = v.to(self.args.device)

+        if self.args.past_index >= 0 and self._past is not None:
+            inputs["mems"] = self._past
+
        outputs = model(**inputs)
        loss = outputs[0]  # model outputs are always tuple in transformers (see doc)

+        if self.args.past_index >= 0:
+            self._past = outputs[self.args.past_index]
+
        if self.args.n_gpu > 1:
            loss = loss.mean()  # mean() to average on multi-gpu parallel training
        if self.args.gradient_accumulation_steps > 1:
@@ -611,8 +656,7 @@ class Trainer:

    def save_model(self, output_dir: Optional[str] = None):
        """
-        Saving best-practices: if you use default names for the model,
-        you can reload it using from_pretrained().
+        Will save the model, so you can reload it using :obj:`from_pretrained()`.

        Will only save from the world_master process (unless in TPUs).
        """
@@ -683,22 +727,18 @@ class Trainer:
            logger.info("Deleting older checkpoint [{}] due to args.save_total_limit".format(checkpoint))
            shutil.rmtree(checkpoint)

-    def evaluate(
-        self, eval_dataset: Optional[Dataset] = None, prediction_loss_only: Optional[bool] = None,
-    ) -> Dict[str, float]:
+    def evaluate(self, eval_dataset: Optional[Dataset] = None) -> Dict[str, float]:
        """
-        Run evaluation and return metrics.
+        Run evaluation and returns metrics.

        The calling script will be responsible for providing a method to compute metrics, as they are
-        task-dependent.
+        task-dependent (pass it to the init :obj:`compute_metrics` argument).

        Args:
-            eval_dataset: (Optional) Pass a dataset if you wish to override
-            the one on the instance.
+            eval_dataset (:obj:`Dataset`, `optional`):
+                Pass a dataset if you wish to override :obj:`self.eval_dataset`.
        Returns:
-            A dict containing:
-                - the eval loss
-                - the potential metrics computed from the predictions
+            A dictionary containing the evaluation loss and the potential metrics computed from the predictions.
        """
        eval_dataloader = self.get_eval_dataloader(eval_dataset)

@@ -706,7 +746,7 @@ class Trainer:

        self._log(output.metrics)

-        if self.args.tpu_metrics_debug:
+        if self.args.tpu_metrics_debug or self.args.debug:
            # tpu-comment: Logging debug metrics for PyTorch/XLA (compile, execute times, ops, etc.)
            xm.master_print(met.metrics_report())

@@ -714,10 +754,22 @@ class Trainer:

    def predict(self, test_dataset: Dataset) -> PredictionOutput:
        """
-        Run prediction and return predictions and potential metrics.
+        Run prediction and returns predictions and potential metrics.

        Depending on the dataset and your use case, your test dataset may contain labels.
-        In that case, this method will also return metrics, like in evaluate().
+        In that case, this method will also return metrics, like in :obj:`evaluate()`.
+
+        Args:
+            test_dataset (:obj:`Dataset`):
+                Dataset to run the predictions on.
+        Returns:
+            `NamedTuple`:
+            predictions (:obj:`np.ndarray`):
+                The predictions on :obj:`test_dataset`.
+            label_ids (:obj:`np.ndarray`, `optional`):
+                The labels (if the dataset contained some).
+            metrics (:obj:`Dict[str, float]`, `optional`):
+                The potential dictionary of metrics (if the dataset contained labels).
        """
        test_dataloader = self.get_test_dataloader(test_dataset)

@@ -755,12 +807,17 @@ class Trainer:
        if is_torch_tpu_available():
            dataloader = pl.ParallelLoader(dataloader, [self.args.device]).per_device_loader(self.args.device)

+        if self.args.past_index >= 0:
+            past = None
+
        for inputs in tqdm(dataloader, desc=description):
            has_labels = any(inputs.get(k) is not None for k in ["labels", "lm_labels", "masked_lm_labels"])

            for k, v in inputs.items():
                if isinstance(v, torch.Tensor):
                    inputs[k] = v.to(self.args.device)
+            if self.args.past_index >= 0:
+                inputs["mems"] = past

            with torch.no_grad():
                outputs = model(**inputs)
@@ -769,6 +826,8 @@ class Trainer:
                    eval_losses += [step_eval_loss.mean().item()]
                else:
                    logits = outputs[0]
+                if self.args.past_index >= 0:
+                    past = outputs[self.args.past_index if has_labels else self.args.past_index - 1]

            if not prediction_loss_only:
                if preds is None:
--- a/src/transformers/trainer_tf.py
+++ b/src/transformers/trainer_tf.py
@@ -3,7 +3,6 @@
 import logging
 import math
 import os
-import random
 from typing import Callable, Dict, Optional, Tuple

 import numpy as np
@@ -11,7 +10,7 @@ import tensorflow as tf

 from .modeling_tf_utils import TFPreTrainedModel
 from .optimization_tf import GradientAccumulator, create_optimizer
-from .trainer_utils import PREFIX_CHECKPOINT_DIR, EvalPrediction, PredictionOutput, is_wandb_available
+from .trainer_utils import PREFIX_CHECKPOINT_DIR, EvalPrediction, PredictionOutput, is_wandb_available, set_seed
 from .training_args_tf import TFTrainingArguments


@@ -22,13 +21,35 @@ if is_wandb_available():
 logger = logging.getLogger(__name__)


-def set_seed(seed: int):
-    random.seed(seed)
-    np.random.seed(seed)
-    tf.random.set_seed(seed)
-
-
 class TFTrainer:
+    """
+    TFTrainer is a simple but feature-complete training and eval loop for TensorFlow,
+    optimized for 🤗 Transformers.
+
+    Args:
+        model (:class:`~transformers.TFPreTrainedModel`):
+            The model to train, evaluate or use for predictions.
+        args (:class:`~transformers.TFTrainingArguments`):
+            The arguments to tweak training.
+        train_dataset (:class:`~tf.data.Dataset`, `optional`):
+            The dataset to use for training.
+        eval_dataset (:class:`~tf.data.Dataset`, `optional`):
+            The dataset to use for evaluation.
+        compute_metrics (:obj:`Callable[[EvalPrediction], Dict]`, `optional`):
+            The function that will be used to compute metrics at evaluation. Must take a
+            :class:`~transformers.EvalPrediction` and return a dictionary string to metric values.
+        prediction_loss_only (:obj:`bool`, `optional`, defaults to `False`):
+            When performing evaluation and predictions, only returns the loss.
+        tb_writer (:obj:`tf.summary.SummaryWriter`, `optional`):
+            Object to write to TensorBoard.
+        optimizers (:obj:`Tuple[tf.keras.optimizers.Optimizer, tf.keras.optimizers.schedules.LearningRateSchedule]`, `optional`):
+            A tuple containing the optimizer and the scheduler to use. The optimizer default to an instance of
+            :class:`tf.keras.optimizers.Adam` if :obj:`args.weight_decay_rate` is 0 else an instance of
+            :class:`~transformers.AdamWeightDecay`. The scheduler will default to an instance of
+            :class:`tf.keras.optimizers.schedules.PolynomialDecay` if :obj:`args.num_warmup_steps` is 0 else
+            an instance of :class:`~transformers.WarmUp`.
+    """
+
    model: TFPreTrainedModel
    args: TFTrainingArguments
    train_dataset: Optional[tf.data.Dataset]
@@ -78,6 +99,9 @@ class TFTrainer:
        set_seed(self.args.seed)

    def get_train_tfdataset(self) -> tf.data.Dataset:
+        """
+        Returns the training :class:`~tf.data.Dataset`.
+        """
        if self.train_dataset is None:
            raise ValueError("Trainer: training requires a train_dataset.")

@@ -101,6 +125,13 @@ class TFTrainer:
        return self.args.strategy.experimental_distribute_dataset(ds)

    def get_eval_tfdataset(self, eval_dataset: Optional[tf.data.Dataset] = None) -> tf.data.Dataset:
+        """
+        Returns the evaluation :class:`~tf.data.Dataset`.
+
+        Args:
+            eval_dataset (:class:`~tf.data.Dataset`, `optional`):
+                If provided, will override `self.eval_dataset`.
+        """
        if eval_dataset is None and self.eval_dataset is None:
            raise ValueError("Trainer: evaluation requires an eval_dataset.")

@@ -114,6 +145,12 @@ class TFTrainer:
        return self.args.strategy.experimental_distribute_dataset(ds)

    def get_test_tfdataset(self, test_dataset: tf.data.Dataset) -> tf.data.Dataset:
+        """
+        Returns a test :class:`~tf.data.Dataset`.
+
+        Args:
+            test_dataset (:class:`~tf.data.Dataset`): The dataset to use.
+        """
        ds = test_dataset.batch(self.args.eval_batch_size, drop_remainder=self.args.dataloader_drop_last)

        return self.args.strategy.experimental_distribute_dataset(ds)
@@ -124,9 +161,8 @@ class TFTrainer:
        """
        Setup the optimizer and the learning rate scheduler.

-        We provide a reasonable default that works well.
-        If you want to use something else, you can pass a tuple in the Trainer's init,
-        or override this method in a subclass.
+        We provide a reasonable default that works well. If you want to use something else, you can pass a tuple in the
+        TFTrainer's init through :obj:`optimizers`, or override this method in a subclass.
        """
        if self.optimizers is not None:
            return self.optimizers
@@ -197,6 +233,10 @@ class TFTrainer:

        step: int = 1

+        # Reset the past mems state at the beginning of the evaluation if necessary.
+        if self.args.past_index >= 0:
+            self._past = None
+
        for features, labels in dataset:
            step = tf.convert_to_tensor(step, dtype=tf.int64)
            loss, logits = self._evaluate_steps(features, labels)
@@ -209,7 +249,7 @@ class TFTrainer:
                if isinstance(labels, tuple):
                    labels = labels[0]

-                if self.args.n_gpu > 1:
+                if self.args.n_replicas > 1:
                    for val in logits.values:
                        if preds is None:
                            preds = val.numpy()
@@ -245,6 +285,10 @@ class TFTrainer:
            if not key.startswith("eval_"):
                metrics[f"eval_{key}"] = metrics.pop(key)

+        if self.args.past_index and hasattr(self, "_past"):
+            # Clean the state at the end of training
+            delattr(self, "_past")
+
        return PredictionOutput(predictions=preds, label_ids=label_ids, metrics=metrics)

    def _log(self, logs: Dict[str, float]) -> None:
@@ -263,11 +307,18 @@ class TFTrainer:

        logger.info(output)

-    def evaluate(
-        self, eval_dataset: Optional[tf.data.Dataset] = None, prediction_loss_only: Optional[bool] = None
-    ) -> Dict[str, float]:
+    def evaluate(self, eval_dataset: Optional[tf.data.Dataset] = None) -> Dict[str, float]:
        """
-        Prediction/evaluation loop, shared by `evaluate()` and `predict()`.
+        Run evaluation and returns metrics.
+
+        The calling script will be responsible for providing a method to compute metrics, as they are
+        task-dependent (pass it to the init :obj:`compute_metrics` argument).
+
+        Args:
+            eval_dataset (:class:`~tf.data.Dataset`, `optional`):
+                Pass a dataset if you wish to override :obj:`self.eval_dataset`.
+        Returns:
+            A dictionary containing the evaluation loss and the potential metrics computed from the predictions.
        """
        eval_ds = self.get_eval_tfdataset(eval_dataset)

@@ -355,6 +406,9 @@ class TFTrainer:
        logger.info("  Total optimization steps = %d", t_total)

        for epoch_iter in range(epochs_trained, int(epochs + 1)):
+            # Reset the past mems state at the beginning of each epoch if necessary.
+            if self.args.past_index >= 0:
+                self._past = None
            for step, training_loss in enumerate(self._training_steps(train_ds, optimizer)):
                self.global_step = iterations.numpy()
                self.epoch_logging = epoch_iter - 1 + (step + 1) / steps_per_epoch
@@ -394,6 +448,10 @@ class TFTrainer:
                if self.args.max_steps > 0 and self.global_step % self.args.max_steps == 0:
                    break

+        if self.args.past_index and hasattr(self, "_past"):
+            # Clean the state at the end of training
+            delattr(self, "_past")
+
    def _training_steps(self, ds, optimizer):
        """
        Returns a generator over training steps (i.e. parameters update).
@@ -468,22 +526,37 @@ class TFTrainer:
          labels: the batched labels.
          training: run the model in training mode or not
        """
+        if self.args.past_index >= 0 and getattr(self, "_past", None) is not None:
+            features["mems"] = self._past
        if isinstance(labels, (dict)):
-            loss, logits = self.model(features, training=training, **labels)[:2]
+            outputs = self.model(features, training=training, **labels)[:2]
        else:
-            loss, logits = self.model(features, labels=labels, training=training)[:2]
-        loss += sum(self.model.losses) * (1.0 / self.args.n_gpu)
+            outputs = self.model(features, labels=labels, training=training)[:2]
+        loss, logits = outputs[:2]
+        if self.args.past_index >= 0:
+            self._past = outputs[self.args.past_index]
+        loss += sum(self.model.losses) * (1.0 / self.args.n_replicas)

        return loss, logits

    def predict(self, test_dataset: tf.data.Dataset) -> PredictionOutput:
        """
-        Run prediction and return predictions and potential metrics.
+        Run prediction and returns predictions and potential metrics.
+
        Depending on the dataset and your use case, your test dataset may contain labels.
-        In that case, this method will also return metrics, like in evaluate().
+        In that case, this method will also return metrics, like in :obj:`evaluate()`.
+
        Args:
-          test_dataset: something similar to a PT Dataset. This is just
-            temporary before to have a framework-agnostic approach for datasets.
+            test_dataset (:class:`~tf.data.Dataset`):
+                Dataset to run the predictions on.
+        Returns:
+            `NamedTuple`:
+            predictions (:obj:`np.ndarray`):
+                The predictions on :obj:`test_dataset`.
+            label_ids (:obj:`np.ndarray`, `optional`):
+                The labels (if the dataset contained some).
+            metrics (:obj:`Dict[str, float]`, `optional`):
+                The potential dictionary of metrics (if the dataset contained labels).
        """
        test_ds = self.get_test_tfdataset(test_dataset)

@@ -491,7 +564,7 @@ class TFTrainer:

    def save_model(self, output_dir: Optional[str] = None):
        """
-        Save the pretrained model.
+        Will save the model, so you can reload it using :obj:`from_pretrained()`.
        """
        output_dir = output_dir if output_dir is not None else self.args.output_dir

--- a/src/transformers/trainer_utils.py
+++ b/src/transformers/trainer_utils.py
@@ -1,8 +1,11 @@
 import os
+import random
 from typing import Dict, NamedTuple, Optional

 import numpy as np

+from .file_utils import is_tf_available, is_torch_available
+

 try:
    import wandb
@@ -21,10 +24,35 @@ def is_wandb_available():
    return _has_wandb


+def set_seed(seed: int):
+    """
+    Helper function for reproducible behavior to set the seed in ``random``, ``numpy``, ``torch`` and/or ``tf``
+    (if installed).
+
+    Args:
+        seed (:obj:`int`): The seed to set.
+    """
+    random.seed(seed)
+    np.random.seed(seed)
+    if is_torch_available():
+        import torch
+
+        torch.manual_seed(seed)
+        torch.cuda.manual_seed_all(seed)
+        # ^^ safe to call this function even if cuda is not available
+    if is_tf_available():
+        import tensorflow as tf
+
+        tf.random.set_seed(seed)
+
+
 class EvalPrediction(NamedTuple):
    """
-    Evaluation output (always contains labels), to be used
-    to compute metrics.
+    Evaluation output (always contains labels), to be used to compute metrics.
+
+    Parameters:
+        predictions (:obj:`np.ndarray`): Predictions of the model.
+        label_ids (:obj:`np.ndarray`): Targets to be matched.
    """

    predictions: np.ndarray
--- a/src/transformers/training_args.py
+++ b/src/transformers/training_args.py
@@ -35,9 +35,80 @@ class TrainingArguments:
    TrainingArguments is the subset of the arguments we use in our example scripts
    **which relate to the training loop itself**.

-    Using `HfArgumentParser` we can turn this class
-    into argparse arguments to be able to specify them on
-    the command line.
+    Using :class:`~transformers.HfArgumentParser` we can turn this class
+    into argparse arguments to be able to specify them on the command line.
+
+    Parameters:
+        output_dir (:obj:`str`):
+            The output directory where the model predictions and checkpoints will be written.
+        overwrite_output_dir (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            If :obj:`True`, overwrite the content of the output directory. Use this to continue training if
+            :obj:`output_dir` points to a checkpoint directory.
+        do_train (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether to run training or not.
+        do_eval (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether to run evaluation on the dev set or not.
+        do_predict (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether to run predictions on the test set or not.
+        evaluate_during_training (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether to run evaluation during training at each logging step or not.
+        per_device_train_batch_size (:obj:`int`, `optional`, defaults to 8):
+            The batch size per GPU/TPU core/CPU for training.
+        per_device_eval_batch_size (:obj:`int`, `optional`, defaults to 8):
+            The batch size per GPU/TPU core/CPU for evaluation.
+        gradient_accumulation_steps: (:obj:`int`, `optional`, defaults to 1):
+            Number of updates steps to accumulate the gradients for, before performing a backward/update pass.
+        learning_rate (:obj:`float`, `optional`, defaults to 5e-5):
+            The initial learning rate for Adam.
+        weight_decay (:obj:`float`, `optional`, defaults to 0):
+            The weight decay to apply (if not zero).
+        adam_epsilon (:obj:`float`, `optional`, defaults to 1e-8):
+            Epsilon for the Adam optimizer.
+        max_grad_norm (:obj:`float`, `optional`, defaults to 1.0):
+            Maximum gradient norm (for gradient clipping).
+        num_train_epochs(:obj:`float`, `optional`, defaults to 3.0):
+            Total number of training epochs to perform.
+        max_steps (:obj:`int`, `optional`, defaults to -1):
+            If set to a positive number, the total number of training steps to perform. Overrides
+            :obj:`num_train_epochs`.
+        warmup_steps (:obj:`int`, `optional`, defaults to 0):
+            Number of steps used for a linear warmup from 0 to :obj:`learning_rate`.
+        logging_dir (:obj:`str`, `optional`):
+            Tensorboard log directory. Will default to `runs/**CURRENT_DATETIME_HOSTNAME**`.
+        logging_first_step (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Wheter to log and evalulate the first :obj:`global_step` or not.
+        logging_steps (:obj:`int`, `optional`, defaults to 500):
+            Number of update steps between two logs.
+        save_steps (:obj:`int`, `optional`, defaults to 500):
+            Number of updates steps before two checkpoint saves.
+        save_total_limit (:obj:`int`, `optional`):
+            If a value is passed, will limit the total amount of checkpoints. Deletes the older checkpoints in
+            :obj:`output_dir`.
+        no_cuda (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Wherher to not use CUDA even when it is available or not.
+        seed (:obj:`int`, `optional`, defaults to 42):
+            Random seed for initialization.
+        fp16 (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether to use 16-bit (mixed) precision training (through NVIDIA apex) instead of 32-bit training.
+        fp16_opt_level (:obj:`str`, `optional`, defaults to 'O1'):
+            For :obj:`fp16` training, apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']. See details
+            on the `apex documentation <https://nvidia.github.io/apex/amp.html>`__.
+        local_rank (:obj:`int`, `optional`, defaults to -1):
+            During distributed training, the rank of the process.
+        tpu_num_cores (:obj:`int`, `optional`):
+            When training on TPU, the mumber of TPU cores (automatically passed by launcher script).
+        debug (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            When training on TPU, whether to print debug metrics or not.
+        dataloader_drop_last (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether to drop the last incomplete batch (if the length of the dataset is not divisible by the batch size)
+            or not.
+        eval_steps (:obj:`int`, `optional`, defaults to 1000):
+            Number of update steps between two evaluations.
+        past_index (:obj:`int`, `optional`, defaults to -1):
+            Some models like :doc:`TransformerXL <../model_doc/transformerxl>` or :doc`XLNet <../model_doc/xlnet>` can
+            make use of the past hidden states for their predictions. If this argument is set to a positive int, the
+            ``Trainer`` will use the corresponding output (usually index 2) as the past state and feed it to the model
+            at the next training step under the keyword argument ``mems``.
    """

    output_dir: str = field(
@@ -133,14 +204,27 @@ class TrainingArguments:
    tpu_num_cores: Optional[int] = field(
        default=None, metadata={"help": "TPU: Number of TPU cores (automatically passed by launcher script)"}
    )
-    tpu_metrics_debug: bool = field(default=False, metadata={"help": "TPU: Whether to print debug metrics"})
+    tpu_metrics_debug: bool = field(
+        default=False,
+        metadata={"help": "Deprecated, the use of `--debug` is preferred. TPU: Whether to print debug metrics"},
+    )
+    debug: bool = field(default=False, metadata={"help": "Whether to print debug metrics on TPU"})

    dataloader_drop_last: bool = field(
        default=False, metadata={"help": "Drop the last incomplete batch if it is not divisible by the batch size."}
    )
+    eval_steps: int = field(default=1000, metadata={"help": "Run an evaluation every X steps."})
+
+    past_index: int = field(
+        default=-1,
+        metadata={"help": "If >=0, uses the corresponding part of the output as the past state for next step."},
+    )

    @property
    def train_batch_size(self) -> int:
+        """
+        The actual batch size for training (may differ from :obj:`per_gpu_train_batch_size` in distributed training).
+        """
        if self.per_gpu_train_batch_size:
            logger.warning(
                "Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future "
@@ -151,6 +235,9 @@ class TrainingArguments:

    @property
    def eval_batch_size(self) -> int:
+        """
+        The actual batch size for evaluation (may differ from :obj:`per_gpu_eval_batch_size` in distributed training).
+        """
        if self.per_gpu_eval_batch_size:
            logger.warning(
                "Using deprecated `--per_gpu_eval_batch_size` argument which will be removed in a future "
@@ -193,11 +280,21 @@ class TrainingArguments:
    @property
    @torch_required
    def device(self) -> "torch.device":
+        """
+        The device used by this process.
+        """
        return self._setup_devices[0]

    @property
    @torch_required
    def n_gpu(self):
+        """
+        The number of GPUs used by this process.
+
+        Note:
+            This will only be greater than one when you have multiple GPUs available but are not using distributed
+            training. For distributed training, it will always be 1.
+        """
        return self._setup_devices[1]

    def to_json_string(self):
--- a/src/transformers/training_args_tf.py
+++ b/src/transformers/training_args_tf.py
@@ -1,4 +1,5 @@
 import logging
+import warnings
 from dataclasses import dataclass, field
 from typing import Tuple

@@ -14,13 +15,91 @@ if is_tf_available():

@dataclass
 class TFTrainingArguments(TrainingArguments):
+    """
+    TrainingArguments is the subset of the arguments we use in our example scripts
+    **which relate to the training loop itself**.
+
+    Using :class:`~transformers.HfArgumentParser` we can turn this class
+    into argparse arguments to be able to specify them on the command line.
+
+    Parameters:
+        output_dir (:obj:`str`):
+            The output directory where the model predictions and checkpoints will be written.
+        overwrite_output_dir (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            If :obj:`True`, overwrite the content of the output directory. Use this to continue training if
+            :obj:`output_dir` points to a checkpoint directory.
+        do_train (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether to run training or not.
+        do_eval (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether to run evaluation on the dev set or not.
+        do_predict (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether to run predictions on the test set or not.
+        evaluate_during_training (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether to run evaluation during training at each logging step or not.
+        per_device_train_batch_size (:obj:`int`, `optional`, defaults to 8):
+            The batch size per GPU/TPU core/CPU for training.
+        per_device_eval_batch_size (:obj:`int`, `optional`, defaults to 8):
+            The batch size per GPU/TPU core/CPU for evaluation.
+        gradient_accumulation_steps: (:obj:`int`, `optional`, defaults to 1):
+            Number of updates steps to accumulate the gradients for, before performing a backward/update pass.
+        learning_rate (:obj:`float`, `optional`, defaults to 5e-5):
+            The initial learning rate for Adam.
+        weight_decay (:obj:`float`, `optional`, defaults to 0):
+            The weight decay to apply (if not zero).
+        adam_epsilon (:obj:`float`, `optional`, defaults to 1e-8):
+            Epsilon for the Adam optimizer.
+        max_grad_norm (:obj:`float`, `optional`, defaults to 1.0):
+            Maximum gradient norm (for gradient clipping).
+        num_train_epochs(:obj:`float`, `optional`, defaults to 3.0):
+            Total number of training epochs to perform.
+        max_steps (:obj:`int`, `optional`, defaults to -1):
+            If set to a positive number, the total number of training steps to perform. Overrides
+            :obj:`num_train_epochs`.
+        warmup_steps (:obj:`int`, `optional`, defaults to 0):
+            Number of steps used for a linear warmup from 0 to :obj:`learning_rate`.
+        logging_dir (:obj:`str`, `optional`):
+            Tensorboard log directory. Will default to `runs/**CURRENT_DATETIME_HOSTNAME**`.
+        logging_first_step (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Wheter to log and evalulate the first :obj:`global_step` or not.
+        logging_steps (:obj:`int`, `optional`, defaults to 500):
+            Number of update steps between two logs.
+        save_steps (:obj:`int`, `optional`, defaults to 500):
+            Number of updates steps before two checkpoint saves.
+        save_total_limit (:obj:`int`, `optional`):
+            If a value is passed, will limit the total amount of checkpoints. Deletes the older checkpoints in
+            :obj:`output_dir`.
+        no_cuda (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Wherher to not use CUDA even when it is available or not.
+        seed (:obj:`int`, `optional`, defaults to 42):
+            Random seed for initialization.
+        fp16 (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether to use 16-bit (mixed) precision training (through NVIDIA apex) instead of 32-bit training.
+        fp16_opt_level (:obj:`str`, `optional`, defaults to 'O1'):
+            For :obj:`fp16` training, apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']. See details
+            on the `apex documentation <https://nvidia.github.io/apex/amp.html>`__.
+        local_rank (:obj:`int`, `optional`, defaults to -1):
+            During distributed training, the rank of the process.
+        tpu_num_cores (:obj:`int`, `optional`):
+            When training on TPU, the mumber of TPU cores (automatically passed by launcher script).
+        debug (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Wheter to activate the trace to record computation graphs and profiling information or not.
+        dataloader_drop_last (:obj:`bool`, `optional`, defaults to :obj:`False`):
+            Whether to drop the last incomplete batch (if the length of the dataset is not divisible by the batch size)
+            or not.
+        eval_steps (:obj:`int`, `optional`, defaults to 1000):
+            Number of update steps before two evaluations.
+        past_index (:obj:`int`, `optional`, defaults to -1):
+            Some models like :doc:`TransformerXL <../model_doc/transformerxl>` or :doc`XLNet <../model_doc/xlnet>` can
+            make use of the past hidden states for their predictions. If this argument is set to a positive int, the
+            ``Trainer`` will use the corresponding output (usually index 2) as the past state and feed it to the model
+            at the next training step under the keyword argument ``mems``.
+        tpu_name (:obj:`str`, `optional`):
+            The name of the TPU the process is running on.
+    """
+
    tpu_name: str = field(
        default=None, metadata={"help": "Name of TPU"},
    )
-    eval_steps: int = field(default=1000, metadata={"help": "Run an evaluation every X steps."})
-    debug: bool = field(
-        default=False, metadata={"help": "Activate the trace to record computation graphs and profiling information"}
-    )

    @cached_property
    @tf_required
@@ -59,9 +138,53 @@ class TFTrainingArguments(TrainingArguments):
    @property
    @tf_required
    def strategy(self) -> "tf.distribute.Strategy":
+        """
+        The strategy used for distributed training.
+        """
        return self._setup_strategy

    @property
    @tf_required
-    def n_gpu(self) -> int:
+    def n_replicas(self) -> int:
+        """
+        The number of replicas (CPUs, GPUs or TPU cores) used in this training.
+        """
+        return self._setup_strategy.num_replicas_in_sync
+
+    @property
+    def train_batch_size(self) -> int:
+        """
+        The actual batch size for training (may differ from :obj:`per_gpu_train_batch_size` in distributed training).
+        """
+        if self.per_gpu_train_batch_size:
+            logger.warning(
+                "Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future "
+                "version. Using `--per_device_train_batch_size` is preferred."
+            )
+        per_device_batch_size = self.per_gpu_train_batch_size or self.per_device_train_batch_size
+        return per_device_batch_size * max(1, self.n_replicas)
+
+    @property
+    def eval_batch_size(self) -> int:
+        """
+        The actual batch size for evaluation (may differ from :obj:`per_gpu_eval_batch_size` in distributed training).
+        """
+        if self.per_gpu_eval_batch_size:
+            logger.warning(
+                "Using deprecated `--per_gpu_eval_batch_size` argument which will be removed in a future "
+                "version. Using `--per_device_eval_batch_size` is preferred."
+            )
+        per_device_batch_size = self.per_gpu_eval_batch_size or self.per_device_eval_batch_size
+        return per_device_batch_size * max(1, self.n_replicas)
+
+    @property
+    @tf_required
+    def n_gpu(self) -> int:
+        """
+        The number of replicas (CPUs, GPUs or TPU cores) used in this training.
+        """
+        warnings.warn(
+            "The n_gpu argument is deprecated and will be removed in a future version, use n_replicas instead.",
+            FutureWarning,
+        )
        return self._setup_strategy.num_replicas_in_sync
--- a/tests/test_activations.py
+++ b/tests/test_activations.py
@@ -1,8 +1,7 @@
 import unittest

 from transformers import is_torch_available
-
-from .utils import require_torch
+from transformers.testing_utils import require_torch


 if is_torch_available():
--- a/tests/test_benchmark.py
+++ b/tests/test_benchmark.py
@@ -4,8 +4,7 @@ import unittest
 from pathlib import Path

 from transformers import AutoConfig, is_torch_available
-
-from .utils import require_torch, torch_device
+from transformers.testing_utils import require_torch, torch_device


 if is_torch_available():
--- a/tests/test_benchmark_tf.py
+++ b/tests/test_benchmark_tf.py
@@ -4,8 +4,7 @@ import unittest
 from pathlib import Path

 from transformers import AutoConfig, is_tf_available
-
-from .utils import require_tf
+from transformers.testing_utils import require_tf


 if is_tf_available():
--- a/tests/test_configuration_auto.py
+++ b/tests/test_configuration_auto.py
@@ -19,8 +19,7 @@ import unittest
 from transformers.configuration_auto import CONFIG_MAPPING, AutoConfig
 from transformers.configuration_bert import BertConfig
 from transformers.configuration_roberta import RobertaConfig
-
-from .utils import DUMMY_UNKWOWN_IDENTIFIER
+from transformers.testing_utils import DUMMY_UNKWOWN_IDENTIFIER


 SAMPLE_ROBERTA_CONFIG = os.path.join(os.path.dirname(os.path.abspath(__file__)), "fixtures/dummy-config.json")
--- a/tests/test_doc_samples.py
+++ b/tests/test_doc_samples.py
@@ -21,8 +21,7 @@ from pathlib import Path
 from typing import List, Union

 import transformers
-
-from .utils import require_tf, require_torch, slow
+from transformers.testing_utils import require_tf, require_torch, slow


 logger = logging.getLogger()
--- a/tests/test_modeling_albert.py
+++ b/tests/test_modeling_albert.py
@@ -17,10 +17,10 @@
 import unittest

 from transformers import is_torch_available
+from transformers.testing_utils import require_torch, slow, torch_device

 from .test_configuration_common import ConfigTester
 from .test_modeling_common import ModelTesterMixin, ids_tensor
-from .utils import require_torch, slow, torch_device


 if is_torch_available():
--- a/tests/test_modeling_auto.py
+++ b/tests/test_modeling_auto.py
@@ -17,8 +17,7 @@
 import unittest

 from transformers import is_torch_available
-
-from .utils import DUMMY_UNKWOWN_IDENTIFIER, SMALL_MODEL_IDENTIFIER, require_torch, slow
+from transformers.testing_utils import DUMMY_UNKWOWN_IDENTIFIER, SMALL_MODEL_IDENTIFIER, require_torch, slow


 if is_torch_available():
--- a/tests/test_modeling_bart.py
+++ b/tests/test_modeling_bart.py
@@ -20,10 +20,10 @@ import timeout_decorator  # noqa

 from transformers import is_torch_available
 from transformers.file_utils import cached_property
+from transformers.testing_utils import require_torch, slow, torch_device

 from .test_configuration_common import ConfigTester
 from .test_modeling_common import ModelTesterMixin, ids_tensor
-from .utils import require_torch, slow, torch_device


 if is_torch_available():
--- a/tests/test_modeling_bert.py
+++ b/tests/test_modeling_bert.py
@@ -17,10 +17,10 @@
 import unittest

 from transformers import is_torch_available
+from transformers.testing_utils import require_torch, slow, torch_device

 from .test_configuration_common import ConfigTester
 from .test_modeling_common import ModelTesterMixin, floats_tensor, ids_tensor
-from .utils import require_torch, slow, torch_device


 if is_torch_available():
--- a/tests/test_modeling_camembert.py
+++ b/tests/test_modeling_camembert.py
@@ -16,8 +16,7 @@
 import unittest

 from transformers import is_torch_available
-
-from .utils import require_torch, slow, torch_device
+from transformers.testing_utils import require_torch, slow, torch_device


 if is_torch_available():
--- a/tests/test_modeling_common.py
+++ b/tests/test_modeling_common.py
@@ -21,8 +21,7 @@ import unittest
 from typing import List

 from transformers import is_torch_available
-
-from .utils import require_multigpu, require_torch, slow, torch_device
+from transformers.testing_utils import require_multigpu, require_torch, slow, torch_device


 if is_torch_available():
@@ -812,7 +811,7 @@ class ModelTesterMixin:
            # Wrap model in nn.DataParallel
            model = torch.nn.DataParallel(model)
            with torch.no_grad():
-                _ = model(**inputs_dict)
+                _ = model(**self._prepare_for_class(inputs_dict, model_class))


 global_rng = random.Random()
--- a/tests/test_modeling_ctrl.py
+++ b/tests/test_modeling_ctrl.py
@@ -16,10 +16,10 @@
 import unittest

 from transformers import is_torch_available
+from transformers.testing_utils import require_torch, slow, torch_device

 from .test_configuration_common import ConfigTester
 from .test_modeling_common import ModelTesterMixin, ids_tensor
-from .utils import require_torch, slow, torch_device


 if is_torch_available():
--- a/tests/test_modeling_distilbert.py
+++ b/tests/test_modeling_distilbert.py
@@ -17,10 +17,10 @@
 import unittest

 from transformers import is_torch_available
+from transformers.testing_utils import require_torch, torch_device

 from .test_configuration_common import ConfigTester
 from .test_modeling_common import ModelTesterMixin, ids_tensor
-from .utils import require_torch, torch_device


 if is_torch_available():
--- a/tests/test_modeling_electra.py
+++ b/tests/test_modeling_electra.py
@@ -17,10 +17,10 @@
 import unittest

 from transformers import is_torch_available
+from transformers.testing_utils import require_torch, slow, torch_device

 from .test_configuration_common import ConfigTester
 from .test_modeling_common import ModelTesterMixin, ids_tensor
-from .utils import require_torch, slow, torch_device


 if is_torch_available():
--- a/tests/test_modeling_encoder_decoder.py
+++ b/tests/test_modeling_encoder_decoder.py
@@ -18,12 +18,12 @@ import tempfile
 import unittest

 from transformers import is_torch_available
+from transformers.testing_utils import require_torch, slow, torch_device

 # TODO(PVP): this line reruns all the tests in BertModelTest; not sure whether this can be prevented
 # for now only run module with pytest tests/test_modeling_encoder_decoder.py::EncoderDecoderModelTest
 from .test_modeling_bert import BertModelTester
 from .test_modeling_common import ids_tensor
-from .utils import require_torch, slow, torch_device


 if is_torch_available():
--- a/tests/test_modeling_flaubert.py
+++ b/tests/test_modeling_flaubert.py
@@ -17,10 +17,10 @@
 import unittest

 from transformers import is_torch_available
+from transformers.testing_utils import require_torch, slow, torch_device

 from .test_configuration_common import ConfigTester
 from .test_modeling_common import ModelTesterMixin, ids_tensor
-from .utils import require_torch, slow, torch_device


 if is_torch_available():
--- a/tests/test_modeling_gpt2.py
+++ b/tests/test_modeling_gpt2.py
@@ -17,10 +17,10 @@
 import unittest

 from transformers import is_torch_available
+from transformers.testing_utils import require_torch, slow, torch_device

 from .test_configuration_common import ConfigTester
 from .test_modeling_common import ModelTesterMixin, ids_tensor
-from .utils import require_torch, slow, torch_device


 if is_torch_available():
--- a/tests/test_modeling_longformer.py
+++ b/tests/test_modeling_longformer.py
@@ -17,10 +17,10 @@
 import unittest

 from transformers import is_torch_available
+from transformers.testing_utils import require_torch, slow, torch_device

 from .test_configuration_common import ConfigTester
 from .test_modeling_common import ModelTesterMixin, ids_tensor
-from .utils import require_torch, slow, torch_device


 if is_torch_available():
@@ -115,6 +115,18 @@ class LongformerModelTester:
    def check_loss_output(self, result):
        self.parent.assertListEqual(list(result["loss"].size()), [])

+    def create_and_check_attention_mask_determinism(
+        self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
+    ):
+        model = LongformerModel(config=config)
+        model.to(torch_device)
+        model.eval()
+
+        attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=torch_device)
+        output_with_mask = model(input_ids, attention_mask=attention_mask)[0]
+        output_without_mask = model(input_ids)[0]
+        self.parent.assertTrue(torch.allclose(output_with_mask[0, 0, :5], output_without_mask[0, 0, :5], atol=1e-4))
+
    def create_and_check_longformer_model(
        self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
    ):
@@ -134,6 +146,36 @@ class LongformerModelTester:
        )
        self.parent.assertListEqual(list(result["pooled_output"].size()), [self.batch_size, self.hidden_size])

+    def create_and_check_longformer_model_with_global_attention_mask(
+        self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
+    ):
+        model = LongformerModel(config=config)
+        model.to(torch_device)
+        model.eval()
+        global_attention_mask = input_mask.clone()
+        global_attention_mask[:, input_mask.shape[-1] // 2] = 0
+        global_attention_mask = global_attention_mask.to(torch_device)
+
+        sequence_output, pooled_output = model(
+            input_ids,
+            attention_mask=input_mask,
+            global_attention_mask=global_attention_mask,
+            token_type_ids=token_type_ids,
+        )
+        sequence_output, pooled_output = model(
+            input_ids, token_type_ids=token_type_ids, global_attention_mask=global_attention_mask
+        )
+        sequence_output, pooled_output = model(input_ids, global_attention_mask=global_attention_mask)
+
+        result = {
+            "sequence_output": sequence_output,
+            "pooled_output": pooled_output,
+        }
+        self.parent.assertListEqual(
+            list(result["sequence_output"].size()), [self.batch_size, self.seq_length, self.hidden_size]
+        )
+        self.parent.assertListEqual(list(result["pooled_output"].size()), [self.batch_size, self.hidden_size])
+
    def create_and_check_longformer_for_masked_lm(
        self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
    ):
@@ -243,7 +285,13 @@ class LongformerModelTester:
            token_labels,
            choice_labels,
        ) = config_and_inputs
-        inputs_dict = {"input_ids": input_ids, "token_type_ids": token_type_ids, "attention_mask": input_mask}
+        global_attention_mask = torch.zeros_like(input_ids)
+        inputs_dict = {
+            "input_ids": input_ids,
+            "token_type_ids": token_type_ids,
+            "attention_mask": input_mask,
+            "global_attention_mask": global_attention_mask,
+        }
        return config, inputs_dict

    def prepare_config_and_inputs_for_question_answering(self):
@@ -277,11 +325,10 @@ class LongformerModelTest(ModelTesterMixin, unittest.TestCase):
        (
            LongformerModel,
            LongformerForMaskedLM,
-            # TODO: make tests pass for those models
-            # LongformerForSequenceClassification,
-            # LongformerForQuestionAnswering,
-            # LongformerForTokenClassification,
-            # LongformerForMultipleChoice,
+            LongformerForSequenceClassification,
+            LongformerForQuestionAnswering,
+            LongformerForTokenClassification,
+            LongformerForMultipleChoice,
        )
        if is_torch_available()
        else ()
@@ -298,6 +345,14 @@ class LongformerModelTest(ModelTesterMixin, unittest.TestCase):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
        self.model_tester.create_and_check_longformer_model(*config_and_inputs)

+    def test_longformer_model_attention_mask_determinism(self):
+        config_and_inputs = self.model_tester.prepare_config_and_inputs()
+        self.model_tester.create_and_check_attention_mask_determinism(*config_and_inputs)
+
+    def test_longformer_model_global_attention_mask(self):
+        config_and_inputs = self.model_tester.prepare_config_and_inputs()
+        self.model_tester.create_and_check_longformer_model_with_global_attention_mask(*config_and_inputs)
+
    def test_longformer_for_masked_lm(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
        self.model_tester.create_and_check_longformer_for_masked_lm(*config_and_inputs)
@@ -325,15 +380,31 @@ class LongformerModelIntegrationTest(unittest.TestCase):
        model = LongformerModel.from_pretrained("allenai/longformer-base-4096")
        model.to(torch_device)

+        # 'Hello world!'
+        input_ids = torch.tensor([[0, 20920, 232, 328, 1437, 2]], dtype=torch.long, device=torch_device)
+        attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=torch_device)
+        output = model(input_ids, attention_mask=attention_mask)[0]
+        output_without_mask = model(input_ids)[0]
+
+        expected_output_slice = torch.tensor([0.0549, 0.1087, -0.1119, -0.0368, 0.0250], device=torch_device)
+        self.assertTrue(torch.allclose(output[0, 0, -5:], expected_output_slice, atol=1e-4))
+        self.assertTrue(torch.allclose(output_without_mask[0, 0, -5:], expected_output_slice, atol=1e-4))
+
+    @slow
+    def test_inference_no_head_long(self):
+        model = LongformerModel.from_pretrained("allenai/longformer-base-4096")
+        model.to(torch_device)
+
        # 'Hello world! ' repeated 1000 times
        input_ids = torch.tensor(
            [[0] + [20920, 232, 328, 1437] * 1000 + [2]], dtype=torch.long, device=torch_device
        )  # long input

        attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device)
-        attention_mask[:, [1, 4, 21]] = 2  # Set global attention on a few random positions
+        global_attention_mask = torch.zeros(input_ids.shape, dtype=torch.long, device=input_ids.device)
+        global_attention_mask[:, [1, 4, 21]] = 1  # Set global attention on a few random positions

-        output = model(input_ids, attention_mask=attention_mask)[0]
+        output = model(input_ids, attention_mask=attention_mask, global_attention_mask=global_attention_mask)[0]

        expected_output_sum = torch.tensor(74585.8594, device=torch_device)
        expected_output_mean = torch.tensor(0.0243, device=torch_device)
@@ -341,7 +412,7 @@ class LongformerModelIntegrationTest(unittest.TestCase):
        self.assertTrue(torch.allclose(output.mean(), expected_output_mean, atol=1e-4))

    @slow
-    def test_inference_masked_lm(self):
+    def test_inference_masked_lm_long(self):
        model = LongformerForMaskedLM.from_pretrained("allenai/longformer-base-4096")
        model.to(torch_device)

@@ -352,9 +423,9 @@ class LongformerModelIntegrationTest(unittest.TestCase):

        loss, prediction_scores = model(input_ids, labels=input_ids)

-        expected_loss = torch.tensor(0.0620, device=torch_device)
-        expected_prediction_scores_sum = torch.tensor(-6.1599e08, device=torch_device)
-        expected_prediction_scores_mean = torch.tensor(-3.0622, device=torch_device)
+        expected_loss = torch.tensor(0.0074, device=torch_device)
+        expected_prediction_scores_sum = torch.tensor(-6.1048e08, device=torch_device)
+        expected_prediction_scores_mean = torch.tensor(-3.0348, device=torch_device)
        input_ids = input_ids.to(torch_device)

        self.assertTrue(torch.allclose(loss, expected_loss, atol=1e-4))
--- a/tests/test_modeling_marian.py
+++ b/tests/test_modeling_marian.py
@@ -19,8 +19,7 @@ import unittest
 from transformers import is_torch_available
 from transformers.file_utils import cached_property
 from transformers.hf_api import HfApi
-
-from .utils import require_torch, slow, torch_device
+from transformers.testing_utils import require_torch, slow, torch_device


 if is_torch_available():
--- a/tests/test_modeling_mobilebert.py
+++ b/tests/test_modeling_mobilebert.py
@@ -17,10 +17,10 @@
 import unittest

 from transformers import is_torch_available
+from transformers.testing_utils import require_torch, slow, torch_device

 from .test_configuration_common import ConfigTester
 from .test_modeling_common import ModelTesterMixin, floats_tensor, ids_tensor
-from .utils import require_torch, slow, torch_device


 if is_torch_available():
--- a/tests/test_modeling_openai.py
+++ b/tests/test_modeling_openai.py
@@ -17,10 +17,10 @@
 import unittest

 from transformers import is_torch_available
+from transformers.testing_utils import require_torch, slow, torch_device

 from .test_configuration_common import ConfigTester
 from .test_modeling_common import ModelTesterMixin, ids_tensor
-from .utils import require_torch, slow, torch_device


 if is_torch_available():
--- a/tests/test_modeling_reformer.py
+++ b/tests/test_modeling_reformer.py
@@ -16,19 +16,21 @@
 import unittest

 from transformers import is_torch_available
+from transformers.testing_utils import require_multigpu, require_torch, slow, torch_device

 from .test_configuration_common import ConfigTester
 from .test_modeling_common import ModelTesterMixin, floats_tensor, ids_tensor
-from .utils import require_multigpu, require_torch, slow, torch_device


 if is_torch_available():
    from transformers import (
        ReformerConfig,
+        ReformerForMaskedLM,
        ReformerModel,
        ReformerModelWithLMHead,
        ReformerTokenizer,
        ReformerLayer,
+        ReformerForQuestionAnswering,
        REFORMER_PRETRAINED_MODEL_ARCHIVE_LIST,
    )
    import torch
@@ -43,6 +45,7 @@ class ReformerModelTester:
        is_training=None,
        is_decoder=None,
        use_input_mask=None,
+        use_labels=None,
        vocab_size=None,
        attention_head_size=None,
        hidden_size=None,
@@ -81,6 +84,7 @@ class ReformerModelTester:
        self.is_training = is_training
        self.is_decoder = is_decoder
        self.use_input_mask = use_input_mask
+        self.use_labels = use_labels
        self.vocab_size = vocab_size
        self.attention_head_size = attention_head_size
        self.hidden_size = hidden_size
@@ -128,6 +132,10 @@ class ReformerModelTester:
        if self.use_input_mask:
            input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2)

+        choice_labels = None
+        if self.use_labels:
+            choice_labels = ids_tensor([self.batch_size], 2)
+
        config = ReformerConfig(
            vocab_size=self.vocab_size,
            hidden_size=self.hidden_size,
@@ -160,14 +168,13 @@ class ReformerModelTester:
            config,
            input_ids,
            input_mask,
+            choice_labels,
        )

    def check_loss_output(self, result):
        self.parent.assertListEqual(list(result["loss"].size()), [])

-    def create_and_check_reformer_model(
-        self, config, input_ids, input_mask,
-    ):
+    def create_and_check_reformer_model(self, config, input_ids, input_mask, choice_labels):
        model = ReformerModel(config=config)
        model.to(torch_device)
        model.eval()
@@ -182,18 +189,14 @@ class ReformerModelTester:
            list(result["sequence_output"].size()), [self.batch_size, self.seq_length, 2 * self.hidden_size],
        )

-    def create_and_check_reformer_model_with_lm_backward(
-        self, config, input_ids, input_mask,
-    ):
+    def create_and_check_reformer_model_with_lm_backward(self, config, input_ids, input_mask, choice_labels):
        model = ReformerModelWithLMHead(config=config)
        model.to(torch_device)
        model.eval()
        loss = model(input_ids, attention_mask=input_mask, labels=input_ids)[0]
        loss.backward()

-    def create_and_check_reformer_with_lm(
-        self, config, input_ids, input_mask,
-    ):
+    def create_and_check_reformer_with_lm(self, config, input_ids, input_mask, choice_labels):
        model = ReformerModelWithLMHead(config=config)
        model.to(torch_device)
        model.eval()
@@ -207,7 +210,24 @@ class ReformerModelTester:
        )
        self.check_loss_output(result)

-    def create_and_check_reformer_model_with_attn_mask(self, config, input_ids, input_mask, is_decoder):
+    def create_and_check_reformer_with_mlm(self, config, input_ids, input_mask, choice_labels):
+        config.is_decoder = False
+        model = ReformerForMaskedLM(config=config)
+        model.to(torch_device)
+        model.eval()
+        loss, prediction_scores = model(input_ids, attention_mask=input_mask, labels=input_ids)
+        result = {
+            "loss": loss,
+            "prediction_scores": prediction_scores,
+        }
+        self.parent.assertListEqual(
+            list(result["prediction_scores"].size()), [self.batch_size, self.seq_length, self.vocab_size],
+        )
+        self.check_loss_output(result)
+
+    def create_and_check_reformer_model_with_attn_mask(
+        self, config, input_ids, input_mask, choice_labels, is_decoder=False
+    ):
        # no special position embeddings
        config.axial_pos_embds = False
        config.is_decoder = is_decoder
@@ -248,7 +268,9 @@ class ReformerModelTester:

        self.parent.assertTrue(torch.allclose(output_padded, output_padded_rolled, atol=1e-3))

-    def create_and_check_reformer_layer_dropout_seed(self, config, input_ids, input_mask, is_decoder):
+    def create_and_check_reformer_layer_dropout_seed(
+        self, config, input_ids, input_mask, choice_labels, is_decoder=False
+    ):
        config.is_decoder = is_decoder
        layer = ReformerLayer(config).to(torch_device)
        layer.train()
@@ -281,7 +303,7 @@ class ReformerModelTester:
            torch.allclose(next_hidden_states, hidden_states + feed_forward_hidden_states, atol=1e-3,)
        )

-    def create_and_check_reformer_feed_forward_chunking(self, config, input_ids, input_mask):
+    def create_and_check_reformer_feed_forward_chunking(self, config, input_ids, input_mask, choice_labels):
        torch.manual_seed(0)
        model = ReformerModel(config=config)
        model.to(torch_device)
@@ -299,7 +321,7 @@ class ReformerModelTester:
        hidden_states_with_chunk = model(input_ids, attention_mask=input_mask)[0]
        self.parent.assertTrue(torch.allclose(hidden_states_no_chunk, hidden_states_with_chunk, atol=1e-3))

-    def create_and_check_reformer_feed_backward_chunking(self, config, input_ids, input_mask):
+    def create_and_check_reformer_feed_backward_chunking(self, config, input_ids, input_mask, choice_labels):
        if not self.is_training:
            return

@@ -341,7 +363,7 @@ class ReformerModelTester:
            torch.allclose(grad_slice_position_factor_2_chunk, grad_slice_position_factor_2_no_chunk, atol=1e-3)
        )

-    def create_and_check_reformer_random_seed(self, config, input_ids, input_mask):
+    def create_and_check_reformer_random_seed(self, config, input_ids, input_mask, choice_labels):
        layer = ReformerLayer(config).to(torch_device)
        layer.train()

@@ -372,7 +394,7 @@ class ReformerModelTester:
            seeds.append(layer.feed_forward_seed)
        self.parent.assertGreater(len(set(seeds)), 70)

-    def create_and_check_reformer_model_fp16_forward(self, config, input_ids, input_mask):
+    def create_and_check_reformer_model_fp16_forward(self, config, input_ids, input_mask, choice_labels):
        model = ReformerModel(config=config)
        model.to(torch_device)
        model.half()
@@ -380,7 +402,7 @@ class ReformerModelTester:
        output = model(input_ids, attention_mask=input_mask)[0]
        self.parent.assertFalse(torch.isnan(output).any().item())

-    def create_and_check_reformer_model_fp16_generate(self, config, input_ids, input_mask):
+    def create_and_check_reformer_model_fp16_generate(self, config, input_ids, input_mask, choice_labels):
        model = ReformerModelWithLMHead(config=config)
        model.to(torch_device)
        model.half()
@@ -388,7 +410,7 @@ class ReformerModelTester:
        output = model.generate(input_ids, attention_mask=input_mask, do_sample=False)
        self.parent.assertFalse(torch.isnan(output).any().item())

-    def create_and_check_reformer_no_chunking(self, config, input_ids, input_mask):
+    def create_and_check_reformer_no_chunking(self, config, input_ids, input_mask, choice_labels):
        # force chunk length to be bigger than input_ids
        config.lsh_attn_chunk_length = 2 * input_ids.shape[-1]
        config.local_attn_chunk_length = 2 * input_ids.shape[-1]
@@ -398,9 +420,25 @@ class ReformerModelTester:
        output_logits = model(input_ids, attention_mask=input_mask)[0]
        self.parent.assertTrue(output_logits.shape[1] == input_ids.shape[-1])

+    def create_and_check_longformer_for_question_answering(self, config, input_ids, input_mask, choice_labels):
+        model = ReformerForQuestionAnswering(config=config)
+        model.to(torch_device)
+        model.eval()
+        loss, start_logits, end_logits = model(
+            input_ids, attention_mask=input_mask, start_positions=choice_labels, end_positions=choice_labels,
+        )
+        result = {
+            "loss": loss,
+            "start_logits": start_logits,
+            "end_logits": end_logits,
+        }
+        self.parent.assertListEqual(list(result["start_logits"].size()), [self.batch_size, self.seq_length])
+        self.parent.assertListEqual(list(result["end_logits"].size()), [self.batch_size, self.seq_length])
+        self.check_loss_output(result)
+
    def prepare_config_and_inputs_for_common(self):
        config_and_inputs = self.prepare_config_and_inputs()
-        (config, input_ids, input_mask,) = config_and_inputs
+        (config, input_ids, input_mask, choice_labels) = config_and_inputs
        inputs_dict = {"input_ids": input_ids, "attention_mask": input_mask}
        return config, inputs_dict

@@ -423,17 +461,21 @@ class ReformerTesterMixin:

    def test_reformer_model_attn_masking(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
-        self.model_tester.create_and_check_reformer_model_with_attn_mask(*config_and_inputs, True)
-        self.model_tester.create_and_check_reformer_model_with_attn_mask(*config_and_inputs, False)
+        self.model_tester.create_and_check_reformer_model_with_attn_mask(*config_and_inputs, is_decoder=True)
+        self.model_tester.create_and_check_reformer_model_with_attn_mask(*config_and_inputs, is_decoder=False)

    def test_reformer_with_lm(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
        self.model_tester.create_and_check_reformer_with_lm(*config_and_inputs)

+    def test_reformer_with_mlm(self):
+        config_and_inputs = self.model_tester.prepare_config_and_inputs()
+        self.model_tester.create_and_check_reformer_with_mlm(*config_and_inputs)
+
    def test_reformer_layer_training_dropout(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
-        self.model_tester.create_and_check_reformer_layer_dropout_seed(*config_and_inputs, True)
-        self.model_tester.create_and_check_reformer_layer_dropout_seed(*config_and_inputs, False)
+        self.model_tester.create_and_check_reformer_layer_dropout_seed(*config_and_inputs, is_decoder=True)
+        self.model_tester.create_and_check_reformer_layer_dropout_seed(*config_and_inputs, is_decoder=False)

    def test_reformer_chunking_forward_equality(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
@@ -470,7 +512,9 @@ class ReformerTesterMixin:

@require_torch
 class ReformerLocalAttnModelTest(ReformerTesterMixin, ModelTesterMixin, unittest.TestCase):
-    all_model_classes = (ReformerModel, ReformerModelWithLMHead) if is_torch_available() else ()
+    all_model_classes = (
+        (ReformerModel, ReformerModelWithLMHead, ReformerForQuestionAnswering) if is_torch_available() else ()
+    )
    all_generative_model_classes = (ReformerModelWithLMHead,) if is_torch_available() else ()
    test_pruning = False
    test_headmasking = False
@@ -481,8 +525,9 @@ class ReformerLocalAttnModelTest(ReformerTesterMixin, ModelTesterMixin, unittest
            "batch_size": 13,
            "seq_length": 32,
            "is_training": True,
-            "is_decoder": False,
+            "is_decoder": True,
            "use_input_mask": True,
+            "use_labels": True,
            "vocab_size": 32,
            "attention_head_size": 16,
            "hidden_size": 32,
@@ -524,7 +569,9 @@ class ReformerLocalAttnModelTest(ReformerTesterMixin, ModelTesterMixin, unittest

@require_torch
 class ReformerLSHAttnModelTest(ReformerTesterMixin, ModelTesterMixin, unittest.TestCase):
-    all_model_classes = (ReformerModel, ReformerModelWithLMHead) if is_torch_available() else ()
+    all_model_classes = (
+        (ReformerModel, ReformerModelWithLMHead, ReformerForQuestionAnswering) if is_torch_available() else ()
+    )
    all_generative_model_classes = (ReformerModelWithLMHead,) if is_torch_available() else ()
    test_pruning = False
    test_headmasking = False
@@ -535,8 +582,9 @@ class ReformerLSHAttnModelTest(ReformerTesterMixin, ModelTesterMixin, unittest.T
            "batch_size": 13,
            "seq_length": 13,
            "use_input_mask": True,
+            "use_labels": True,
            "is_training": False,
-            "is_decoder": False,
+            "is_decoder": True,
            "vocab_size": 32,
            "attention_head_size": 16,
            "hidden_size": 64,
@@ -886,7 +934,7 @@ class ReformerIntegrationTests(unittest.TestCase):
        config["num_buckets"] = [2, 4]
        config["is_decoder"] = False
        torch.manual_seed(0)
-        model = ReformerModelWithLMHead(ReformerConfig(**config)).to(torch_device)
+        model = ReformerForMaskedLM(ReformerConfig(**config)).to(torch_device)
        model.eval()
        input_ids, attn_mask = self._get_input_ids_and_mask()
        hidden_states = model(input_ids=input_ids, attention_mask=attn_mask)[0]
--- a/tests/test_modeling_roberta.py
+++ b/tests/test_modeling_roberta.py
@@ -17,10 +17,10 @@
 import unittest

 from transformers import is_torch_available
+from transformers.testing_utils import require_torch, slow, torch_device

 from .test_configuration_common import ConfigTester
 from .test_modeling_common import ModelTesterMixin, ids_tensor
-from .utils import require_torch, slow, torch_device


 if is_torch_available():
--- a/tests/test_modeling_t5.py
+++ b/tests/test_modeling_t5.py
@@ -17,10 +17,10 @@
 import unittest

 from transformers import is_torch_available
+from transformers.testing_utils import require_torch, slow, torch_device

 from .test_configuration_common import ConfigTester
 from .test_modeling_common import ModelTesterMixin, ids_tensor
-from .utils import require_torch, slow, torch_device


 if is_torch_available():
--- a/tests/test_modeling_tf_albert.py
+++ b/tests/test_modeling_tf_albert.py
@@ -17,10 +17,10 @@
 import unittest

 from transformers import AlbertConfig, is_tf_available
+from transformers.testing_utils import require_tf, slow

 from .test_configuration_common import ConfigTester
 from .test_modeling_tf_common import TFModelTesterMixin, ids_tensor
-from .utils import require_tf, slow


 if is_tf_available():
--- a/tests/test_modeling_tf_auto.py
+++ b/tests/test_modeling_tf_auto.py
@@ -17,8 +17,7 @@
 import unittest

 from transformers import is_tf_available
-
-from .utils import DUMMY_UNKWOWN_IDENTIFIER, SMALL_MODEL_IDENTIFIER, require_tf, slow
+from transformers.testing_utils import DUMMY_UNKWOWN_IDENTIFIER, SMALL_MODEL_IDENTIFIER, require_tf, slow


 if is_tf_available():
--- a/Show More
+++ b/Show More
Author	SHA1	Message	Date
Lysandre	b0892fa0e8	Release: v3.0.2 Some checks failed GitHub-hosted runner / check_code_quality (push) Has been cancelled Details	2020-07-06 18:49:44 -04:00
Sylvain Gugger	f1e2e423ab	Fix fast tokenizers too (#5562 )	2020-07-06 18:45:01 -04:00
Anthony MOI	5787e4c159	Various tokenizers fixes (#5558 ) * BertTokenizerFast - Do not specify strip_accents by default * Bump tokenizers to new version * Add test for AddedToken serialization	2020-07-06 18:27:53 -04:00
Sylvain Gugger	21f28c34b7	Fix #5507 (#5559 ) * Fix #5507 * Fix formatting	2020-07-06 17:26:48 -04:00
Lysandre Debut	9d9b872b66	The `add_space_before_punct_symbol` is only for TransfoXL (#5549 )	2020-07-06 12:17:05 -04:00
Lysandre Debut	d6b0b9d451	GPT2 tokenizer should not output token type IDs (#5546 ) * GPT2 tokenizer should not output token type IDs * Same for OpenAIGPT	2020-07-06 11:33:57 -04:00
Sylvain Gugger	7833b21a5a	Fix #5544 (#5551 )	2020-07-06 11:22:24 -04:00
Thomas Wolf	c473484087	Fix the tokenization warning noted in #5505 (#5550 ) * fix warning * style and quality	2020-07-06 11:15:25 -04:00
Lysandre	1bbc28bee7	Imports organization	2020-07-06 10:27:10 -04:00
Mohamed Taher Alrefaie	1bc13697b1	Update convert_pytorch_checkpoint_to_tf2.py (#5531 ) fixed ImportError: cannot import name 'hf_bucket_url'	2020-07-06 09:55:10 -04:00
Arnav Sharma	b2309cc6bf	Typo fix in `training` doc (#5495 )	2020-07-06 09:15:22 -04:00
ELanning	7ecff0ccbb	Fix typo in training (#5510 )	2020-07-06 09:14:57 -04:00
Sam Shleifer	58cca47c16	[cleanup] TF T5 tests only init t5-base once. (#5410 )	2020-07-03 14:27:49 -04:00
Patrick von Platen	991172922f	better error message (#5497 )	2020-07-03 19:25:25 +02:00
Thomas Wolf	b58a15a31e	unpining specific git versions in setup.py	2020-07-03 17:38:39 +02:00
Thomas Wolf	fedabcd154	Release: 3.0.1 Some checks failed GitHub-hosted runner / check_code_quality (push) Has been cancelled Details	2020-07-03 17:02:44 +02:00
Lysandre Debut	17ade127b9	Exposing prepare_for_model for both slow & fast tokenizers (#5479 ) * Exposing prepare_for_model for both slow & fast tokenizers * Update method signature * The traditional style commit * Hide the warnings behind the verbose flag * update default truncation strategy and prepare_for_model * fix tests and prepare_for_models methods Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>	2020-07-03 16:51:21 +02:00
Manuel Romero	814ed7ee76	Create model card (#5396 ) Create model card for electicidad-small (Spanish Electra) fine-tuned on SQUAD-esv1	2020-07-03 08:29:09 -04:00
Moseli Motsoehli	49281ac939	grammar corrections and train data update (#5448 ) - fixed grammar and spelling - added an intro - updated Training data references	2020-07-03 08:25:57 -04:00
chrisliu	97355339f6	Update upstream (#5456 )	2020-07-03 08:16:27 -04:00
Manuel Romero	55b932a818	Create model card (#5464 ) Create model card for electra-small-discriminator fine-tuned on SQUAD v2.0	2020-07-03 06:19:49 -04:00
Funtowicz Morgan	21cd8c4086	QA Pipelines fixes (#5429 ) * Make QA pipeline supports models with more than 2 outputs such as BART assuming start/end are the two first outputs. Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com> * When using the new padding/truncation paradigm setting padding="max_length" + max_length=X actually pads the input up to max_length. This result in every sample going through QA pipelines to be of size 384 whatever the actual input size is making the overall pipeline very slow. Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com> * Mask padding & question before applying softmax. Softmax has been refactored to operate in log space for speed and stability. Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com> * Format. Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com> * Use PaddingStrategy.LONGEST instead of DO_NOT_PAD Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com> * Revert "When using the new padding/truncation paradigm setting padding="max_length" + max_length=X actually pads the input up to max_length." This reverts commit 1b00a9a2 Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com> * Trigger CI after unattended failure * Trigger CI	2020-07-03 10:29:20 +02:00
Pierric Cistac	8438bab38e	Fix roberta model ordering for TFAutoModel (#5414 )	2020-07-02 19:23:55 -04:00
Sylvain Gugger	6b735a7253	Tokenizer summary (#5467 ) * Work on tokenizer summary * Finish tutorial * Link to it * Apply suggestions from code review Co-authored-by: Anthony MOI <xn1t0x@gmail.com> Co-authored-by: Lysandre Debut <lysandre@huggingface.co> * Add vocab definition Co-authored-by: Anthony MOI <xn1t0x@gmail.com> Co-authored-by: Lysandre Debut <lysandre@huggingface.co>	2020-07-02 17:07:42 -04:00
Shen	ef0e9d806c	Update: ElectraDiscriminatorPredictions forward. (#5471 ) `ElectraDiscriminatorPredictions.forward` should not need `attention_mask`.	2020-07-02 13:57:33 -04:00
Manuel Romero	13a8588f2d	Create model card (#5432 ) Create model card for electra-base-discriminator fine-tuned on SQUAD v1.1	2020-07-02 10:16:30 -04:00
Julien Chaumond	a0a6387a0d	[model_cards] roberta-large-mnli: fix sep_token	2020-07-02 10:04:02 -04:00
Julien Chaumond	215db688da	Create roberta-large-mnli-README.md	2020-07-02 09:43:54 -04:00
Lysandre Debut	69d313e808	Bans SentencePiece 0.1.92 (#5418 )	2020-07-02 09:23:00 -04:00
George Ho	84e56669af	Fix typo in glossary (#5466 )	2020-07-02 09:19:33 -04:00
Teven	c6a510c6fa	Fixing missing arguments for TransfoXL tokenizer when using TextGenerationPipeline (#5465 ) * overriding _parse_and_tokenize in `TextGenerationPipeine` to allow for TransfoXl tokenizer arguments	2020-07-02 13:53:33 +02:00
Teven	6726416e4a	Changed expected_output_ids in TransfoXL generation test (#5462 ) * Changed expected_output_ids in TransfoXL generation test to match #4826 generation PR. * making black happy * making isort happy	2020-07-02 11:56:44 +02:00
tommccoy	812def00c9	fix use of mems in Transformer-XL (#4826 ) Fixed duplicated memory use in Transformer-XL generation leading to bad predictions and performance.	2020-07-02 11:19:07 +02:00
Patrick von Platen	306f1a2695	Add Reformer MLM notebook (#5450 ) * Add Reformer MLM notebook * Update notebooks/README.md	2020-07-02 00:20:49 +02:00
Patrick von Platen	d16e36c7e5	[Reformer] Add Masked LM Reformer (#5426 ) * fix conflicts * fix * happy rebasing	2020-07-01 22:43:18 +02:00
Funtowicz Morgan	f4323dbf8c	Don't discard entity_group when token is the latest in the sequence. (#5439 ) Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>	2020-07-01 20:30:42 +02:00
Joe Davison	35befd9ce3	Fix tensor label type inference in default collator (#5250 ) * allow tensor label inputs to default collator * replace try/except with type check	2020-07-01 10:40:14 -06:00
Patrick von Platen	fe81f7d12c	finish reformer qa head (#5433 )	2020-07-01 12:27:14 -04:00
Patrick von Platen	d697b6ca75	[Longformer] Major Refactor (#5219 ) * refactor naming * add small slow test * refactor * refactor naming * rename selected to extra * big global attention refactor * make style * refactor naming * save intermed * refactor functions * finish function refactor * fix tests * fix longformer * fix longformer * fix longformer * fix all tests but one * finish longformer * address sams and izs comments * fix transpose	2020-07-01 17:43:32 +02:00
Sam Shleifer	e0d58ddb65	[fix] Marian tests import (#5442 )	2020-07-01 11:42:22 -04:00
Funtowicz Morgan	608d5a7c44	Raises PipelineException on FillMaskPipeline when there are != 1 mask_token in the input (#5389 ) * Added PipelineException Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com> * fill-mask pipeline raises exception when more than one mask_token detected. Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com> * Put everything in a function. Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com> * Added tests on pipeline fill-mask when input has != 1 mask_token Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com> * Fix numel() computation for TF Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com> * Addressing PR comments. Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com> * Remove function typing to avoid import on specific framework. Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com> * Quality. Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com> * Retry typing with @julien-c tip. Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com> * Quality². Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com> * Simplify fill-mask mask_token checking. Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com> * Trigger CI	2020-07-01 17:27:47 +02:00
Sylvain Gugger	6c55e9fc32	Fix dropdown bug in searches (#5440 ) * Trigger CI * Fix dropdown bug in searches	2020-07-01 11:02:59 -04:00
Sylvain Gugger	734a28a767	Clean up diffs in Trainer/TFTrainer (#5417 ) * Cleanup and unify Trainer/TFTrainer * Forgot to adapt TFTrainingArgs * In tf scripts n_gpu -> n_replicas * Update src/transformers/training_args.py Co-authored-by: Lysandre Debut <lysandre@huggingface.co> * Address review comments * Formatting * Fix typo Co-authored-by: Lysandre Debut <lysandre@huggingface.co>	2020-07-01 11:00:20 -04:00
Sam Shleifer	43cb03a93d	MarianTokenizer.prepare_translation_batch uses new tokenizer API (#5182 )	2020-07-01 10:32:50 -04:00
Sam Shleifer	13deb95a40	Move tests/utils.py -> transformers/testing_utils.py (#5350 )	2020-07-01 10:31:17 -04:00
sgugger	9c219305f5	Trigger CI	2020-07-01 10:22:50 -04:00
Sylvain Gugger	64e3d966b1	Add support for past states (#5399 ) * Add support for past states * Style and forgotten self * You mean, documenting is not enough? I have to actually add it too? * Add memory support during evaluation * Fix tests in eval and add TF support * No need to change this line anymore	2020-07-01 08:11:55 -04:00
Sylvain Gugger	4ade7491f4	Fix examples titles and optimization doc page (#5408 )	2020-07-01 08:11:25 -04:00
Moseli Motsoehli	d60d231ea4	Create README.md (#5422 ) * Create README.md * Update model_cards/MoseliMotsoehli/TswanaBert/README.md Co-authored-by: Julien Chaumond <chaumond@gmail.com>	2020-07-01 05:01:51 -04:00
Jay	298bdab18a	Create model card for schmidek/electra-small-cased (#5400 )	2020-07-01 04:01:56 -04:00
Julien Plu	fcf0652460	Fix TensorFlow dataset generator (#4881 ) * fix TensorFlow generator * Better features handling * Apply style * Apply style * Fix squad as well * Apply style * Better factorization of TF Tensors creation	2020-06-30 19:49:11 -04:00
Hong Xu	501040fd30	In the run_ner.py example, give the optional label arg a default value (#5326 ) Otherwise, if label is not specified, the following error occurs: Traceback (most recent call last): File "run_ner.py", line 303, in <module> main() File "run_ner.py", line 101, in main model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1])) File "/home/user/anaconda3/envs/bert/lib/python3.7/site-packages/transformers/hf_argparser.py", line 159, in parse_json_file obj = dtype(**inputs) TypeError: __init__() missing 1 required positional argument: 'labels'	2020-06-30 19:45:35 -04:00
Sam Shleifer	b45e65efa0	Avoid deprecation warning for F.tanh (#5413 )	2020-06-30 16:41:43 -04:00
Sam Shleifer	23231c0f78	[GH Runner] fix yaml indent (#5412 )	2020-06-30 16:17:12 -04:00
Sam Shleifer	ac61114592	[CI] gh runner doesn't use -v, cats new result (#5409 )	2020-06-30 16:12:14 -04:00
Sam Shleifer	27a7fe7a8d	examples/seq2seq: never override $WANDB_PROJECT (#5407 )	2020-06-30 15:29:13 -04:00
Sam Shleifer	32d2031458	[fix] slow fill_mask test failure (#5406 )	2020-06-30 15:28:15 -04:00
Sam Shleifer	80aa4b8aa6	[CI] GH-runner stores artifacts like CircleCI (#5318 )	2020-06-30 15:01:53 -04:00
Sylvain Gugger	87716a6d07	Documentation for the Trainer API (#5383 ) * Documentation for the Trainer API * Address review comments * Address comments	2020-06-30 11:43:43 -04:00
Yacine Jernite	c4d4e8bdbd	Move GenerationMixin to separate file (#5254 ) * separate_generation_code * isort * renamed * rename_files * move_shapelit	2020-06-30 10:42:08 -04:00
Lysandre	90d13954c4	Repin versions	2020-06-30 09:16:36 -04:00
Sylvain Gugger	0607b88945	How to share model cards with the CLI (#5374 ) * How to share model cards * Switch the two options * Fix bad copy/cut * Julien's suggestion	2020-06-30 08:59:32 -04:00
Kevin Canwen Xu	331d8d2936	Upload DistilBART artwork (#5394 )	2020-06-30 18:11:11 +08:00
Manuel Romero	09e841490c	Model Card Fixing (#5369 ) - Fix missing ```-``` in language meta - T5 pic uploaded to a more permanent place	2020-06-30 18:02:24 +08:00
Manuel Romero	4c5bed192a	Model Card Fixing (#5373 ) - T5 pic uploaded to a more permanent place	2020-06-30 18:01:45 +08:00
Manuel Romero	02509d4b06	Model Card Fixing (#5371 ) - Model pic uploaded to a more permanent place	2020-06-30 18:01:11 +08:00
Manuel Romero	79f0118c72	Model Card Fixing (#5370 ) - Fix missing ```-``` in language meta - T5 pic uploaded to a more permanent place	2020-06-30 18:00:29 +08:00
MichaelJanz	9a473f1e43	Update Bertabs example to work again (#5355 ) * Fix the bug 'Attempted relative import with no known parent package' when using the bertabs example. Also change the used model from bertabs-finetuned-cnndm, since it seems not be accessible anymore * Update run_summarization.py Co-authored-by: Kevin Canwen Xu <canwenxu@126.com>	2020-06-30 14:05:01 +08:00
Sylvain Gugger	7f60e93ac5	Mention openAI model card and merge content (#5378 ) * Mention openAI model card and merge content * Fix sentence	2020-06-29 18:27:36 -04:00
chrisliu	482a5993c2	Fix model card folder name so that it is consistent with model hub (#5368 ) * Merge upstream * Merge upstream * Add generate.py link * Merge upstream * Merge upstream * Fix folder name	2020-06-29 12:54:30 -04:00
chrisliu	97f24303e8	Add link to file and fix typos in model card (#5367 ) * Merge upstream * Merge upstream * Add generate.py link	2020-06-29 11:34:52 -04:00
Lysandre Debut	b9ee87f5c7	Doc for v3.0.0 (#5366 ) * Doc for v3.0.0 * Update docs/source/_static/js/custom.js Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update docs/source/_static/js/custom.js Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>	2020-06-29 11:08:54 -04:00