Release: v4.3.1

Deprecate Wav2Vec2ForMaskedLM and add Wav2Vec2ForCTC (#10089)
* add wav2vec2CTC and deprecate for maskedlm * remove from docs
2021-02-09 09:55:55 +01:00 · 2021-02-09 09:55:55 +01:00 · 2021-02-08 18:31:49 +01:00 · 2021-02-08 18:29:16 +01:00 · 2021-02-08 18:18:26 +01:00 · 2021-02-04 15:41:19 -05:00
324 changed files with 23343 additions and 5344 deletions
--- a/.circleci/config.yml
+++ b/.circleci/config.yml
@@ -421,15 +421,15 @@ workflows:
            - run_tests_git_lfs
            - build_doc
            - deploy_doc: *workflow_filters
-    tpu_testing_jobs:
-        triggers:
-            - schedule:
-                # Set to run at the first minute of every hour.
-                cron: "0 8 * * *"
-                filters:
-                    branches:
-                        only:
-                            - master
-        jobs:
-            - cleanup-gke-jobs
-            - run_examples_tpu
+#    tpu_testing_jobs:
+#        triggers:
+#            - schedule:
+#                # Set to run at the first minute of every hour.
+#                cron: "0 8 * * *"
+#                filters:
+#                    branches:
+#                        only:
+#                            - master
+#        jobs:
+#            - cleanup-gke-jobs
+#            - run_examples_tpu
--- a/.circleci/deploy.sh
+++ b/.circleci/deploy.sh
@@ -54,4 +54,5 @@ deploy_doc "0613f05" v3.3.1
 deploy_doc "eb0e0ce" v3.4.0
 deploy_doc "818878d" v3.5.1
 deploy_doc "c781171" v4.0.0
-deploy_doc "bfa4ccf" # v4.1.1 Latest stable release
+deploy_doc "bfa4ccf" v4.1.1
+deploy_doc "7d9a9d0" # v4.2.0 Latest stable release
--- a/.github/ISSUE_TEMPLATE/bug-report.md
+++ b/.github/ISSUE_TEMPLATE/bug-report.md
@@ -25,32 +25,40 @@ assignees: ''
 If you know how to use git blame, that is the easiest way, otherwise, here is a rough guide of **who to tag**.
 Please tag fewer than 3 people.

- albert, bert, GPT2, XLM: @LysandreJik
- tokenizers: @mfuntowicz
- Trainer: @sgugger
- Speed and Memory Benchmarks: @patrickvonplaten
- Model Cards: @julien-c
- TextGeneration: @TevenLeScao
- examples/distillation: @VictorSanh
- nlp datasets: [different repo](https://github.com/huggingface/nlp)
- rust tokenizers: [different repo](https://github.com/huggingface/tokenizers)
- Text Generation: @patrickvonplaten @TevenLeScao
- Blenderbot: @patrickvonplaten
- Bart: @patrickvonplaten
- Marian: @patrickvonplaten
- Pegasus: @patrickvonplaten
- mBART: @patrickvonplaten
- T5: @patrickvonplaten
- Longformer/Reformer: @patrickvonplaten
- TransfoXL/XLNet: @TevenLeScao
- RAG: @patrickvonplaten, @lhoestq
- FSMT: @stas00
- examples/seq2seq: @patil-suraj
- examples/bert-loses-patience: @JetRunner
- ray/raytune: @richardliaw @amogkam
- tensorflow: @jplu
- examples/token-classification: @stefan-it
- documentation: @sgugger
+Models:
+
+- albert, bert, xlm: @LysandreJik
+- blenderbot, bart, marian, pegasus, encoderdecoder,  t5: @patrickvonplaten, @patil-suraj
+- longformer, reformer, transfoxl, xlnet: @patrickvonplaten
+- fsmt: @stas00
+- funnel: @sgugger
+- gpt2: @patrickvonplaten, @LysandreJik
+- rag: @patrickvonplaten, @lhoestq
+- tensorflow: @jplu
+
+Library:
+
+- benchmarks: @patrickvonplaten
+- deepspeed: @stas00
+- ray/raytune: @richardliaw, @amogkam
+- text generation: @patrickvonplaten
+- tokenizers: @n1t0, @LysandreJik
+- trainer: @sgugger
+- pipelines: @LysandreJik
+
+Documentation: @sgugger
+
+HF projects:
+
+- nlp datasets: [different repo](https://github.com/huggingface/nlp)
+- rust tokenizers: [different repo](https://github.com/huggingface/tokenizers)
+
+Examples:
+
+- maintained examples (not research project or legacy): @sgugger, @patil-suraj
+- research_projects/bert-loses-patience: @JetRunner
+- research_projects/distillation: @VictorSanh
+
 -->

 ## Information
--- a/.github/PULL_REQUEST_TEMPLATE.md
+++ b/.github/PULL_REQUEST_TEMPLATE.md
@@ -37,26 +37,38 @@ members/contributors which may be interested in your PR.
 If you know how to use git blame, that is the easiest way, otherwise, here is a rough guide of **who to tag**.
 Please tag fewer than 3 people.

- albert, bert, XLM: @LysandreJik
- GPT2: @LysandreJik, @patrickvonplaten
- tokenizers: @mfuntowicz
- Trainer: @sgugger
- Benchmarks: @patrickvonplaten
- Model Cards: @julien-c
- examples/distillation: @VictorSanh
- nlp datasets: [different repo](https://github.com/huggingface/nlp)
- rust tokenizers: [different repo](https://github.com/huggingface/tokenizers)
- Text Generation: @patrickvonplaten, @TevenLeScao
- Blenderbot, Bart, Marian, Pegasus: @patrickvonplaten
- T5: @patrickvonplaten
- Rag: @patrickvonplaten, @lhoestq
- EncoderDecoder: @patrickvonplaten
- Longformer, Reformer: @patrickvonplaten
- TransfoXL, XLNet: @TevenLeScao, @patrickvonplaten
- examples/seq2seq: @patil-suraj
- examples/bert-loses-patience: @JetRunner
- tensorflow: @jplu
- examples/token-classification: @stefan-it
- documentation: @sgugger
- FSMT: @stas00
+Models:
+
+- albert, bert, xlm: @LysandreJik
+- blenderbot, bart, marian, pegasus, encoderdecoder,  t5: @patrickvonplaten, @patil-suraj
+- longformer, reformer, transfoxl, xlnet: @patrickvonplaten
+- fsmt: @stas00
+- funnel: @sgugger
+- gpt2: @patrickvonplaten, @LysandreJik
+- rag: @patrickvonplaten, @lhoestq
+- tensorflow: @jplu
+
+Library:
+
+- benchmarks: @patrickvonplaten
+- deepspeed: @stas00
+- ray/raytune: @richardliaw, @amogkam
+- text generation: @patrickvonplaten
+- tokenizers: @n1t0, @LysandreJik
+- trainer: @sgugger
+- pipelines: @LysandreJik
+
+Documentation: @sgugger
+
+HF projects:
+
+- nlp datasets: [different repo](https://github.com/huggingface/nlp)
+- rust tokenizers: [different repo](https://github.com/huggingface/tokenizers)
+
+Examples:
+
+- maintained examples (not research project or legacy): @sgugger, @patil-suraj
+- research_projects/bert-loses-patience: @JetRunner
+- research_projects/distillation: @VictorSanh
+
 -->
--- a/.github/workflows/model-templates.yml
+++ b/.github/workflows/model-templates.yml
@@ -7,6 +7,9 @@ on:
      - "tests/**"
      - ".github/**"
      - "templates/**"
+  pull_request_target:
+    branches:
+      - master

 jobs:
  run_tests_templates:
--- a/.github/workflows/release-conda.yml
+++ b/.github/workflows/release-conda.yml
@@ -37,7 +37,8 @@ jobs:
      - name: Build conda packages
        run: |
          conda info
-          conda build .github/conda
+          conda list
+          conda-build .github/conda

      - name: Upload to Anaconda
-        run: anaconda upload `conda build .github/conda --output` --force
+        run: anaconda upload `conda-build .github/conda --output` --force
--- a/README.md
+++ b/README.md
@@ -167,7 +167,7 @@ When TensorFlow 2.0 and/or PyTorch has been installed, 🤗 Transformers can be
 pip install transformers
 ```

-If you'd like to play with the examples, you must [install the library from source](https://huggingface.co/transformers/installation.html#installing-from-source).
+If you'd like to play with the examples or need the bleeding edge of the code and can't wait for a new release, you must [install the library from source](https://huggingface.co/transformers/installation.html#installing-from-source).

 ### With conda

@@ -179,7 +179,7 @@ Since Transformers version v4.0.0, we now have a conda channel: `huggingface`.
 conda install -c huggingface transformers
 ```

-Follow the installation pages of TensorFlow, PyTorch or Flax to see how to install them with conda. 
+Follow the installation pages of TensorFlow, PyTorch or Flax to see how to install them with conda.

 ## Models architectures

@@ -196,7 +196,9 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
 1. **[BERT For Sequence Generation](https://huggingface.co/transformers/model_doc/bertgeneration.html)** (from Google) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
 1. **[Blenderbot](https://huggingface.co/transformers/model_doc/blenderbot.html)** (from Facebook) released with the paper [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
 1. **[BlenderbotSmall](https://huggingface.co/transformers/model_doc/blenderbot_small.html)** (from Facebook) released with the paper [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
+1. **[BORT](https://huggingface.co/transformers/model_doc/bort.html)** (from Alexa) released with the paper [Optimal Subarchitecture Extraction For BERT](https://arxiv.org/abs/2010.10499) by Adrian de Wynter and Daniel J. Perry.
 1. **[CamemBERT](https://huggingface.co/transformers/model_doc/camembert.html)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
+1. **[ConvBERT](https://huggingface.co/transformers/model_doc/convbert.html)** (from YituTech) released with the paper [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan.
 1. **[CTRL](https://huggingface.co/transformers/model_doc/ctrl.html)** (from Salesforce) released with the paper [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher.
 1. **[DeBERTa](https://huggingface.co/transformers/model_doc/deberta.html)** (from Microsoft Research) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
 1. **[DialoGPT](https://huggingface.co/transformers/model_doc/dialogpt.html)** (from Microsoft Research) released with the paper [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://arxiv.org/abs/1911.00536) by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
@@ -221,11 +223,11 @@ Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
 1. **[ProphetNet](https://huggingface.co/transformers/model_doc/prophetnet.html)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
 1. **[Reformer](https://huggingface.co/transformers/model_doc/reformer.html)** (from Google Research) released with the paper [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
 1. **[RoBERTa](https://huggingface.co/transformers/model_doc/roberta.html)** (from Facebook), released together with the paper a [Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
-ultilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/master/examples/distillation) and a German version of DistilBERT.
 1. **[SqueezeBert](https://huggingface.co/transformers/model_doc/squeezebert.html)** released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
 1. **[T5](https://huggingface.co/transformers/model_doc/t5.html)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
 1. **[TAPAS](https://huggingface.co/transformers/model_doc/tapas.html)** (from Google AI) released with the paper [TAPAS: Weakly Supervised Table Parsing via Pre-training](https://arxiv.org/abs/2004.02349) by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos.
 1. **[Transformer-XL](https://huggingface.co/transformers/model_doc/transformerxl.html)** (from Google/CMU) released with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
+1. **[Wav2Vec2](https://huggingface.co/transformers/model_doc/wav2vec2.html)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
 1. **[XLM](https://huggingface.co/transformers/model_doc/xlm.html)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
 1. **[XLM-ProphetNet](https://huggingface.co/transformers/model_doc/xlmprophetnet.html)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
 1. **[XLM-RoBERTa](https://huggingface.co/transformers/model_doc/xlmroberta.html)** (from Facebook AI), released together with the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov.
--- a/docs/source/_static/js/custom.js
+++ b/docs/source/_static/js/custom.js
@@ -1,10 +1,11 @@
 // These two things need to be updated at each release for the version selector.
 // Last stable version
-const stableVersion = "v4.1.1"
+const stableVersion = "v4.2.0"
 // Dictionary doc folder to label. The last stable version should have an empty key.
 const versionMapping = {
    "master": "master",
-    "": "v4.1.1 (stable)",
+    "": "v4.2.0/v4.2.1 (stable)",
+    "v4.1.1": "v4.1.0/v4.1.1",
    "v4.0.1": "v4.0.0/v4.0.1",
    "v3.5.1": "v3.5.0/v3.5.1",
    "v3.4.0": "v3.4.0",
--- a/docs/source/add_new_model.rst
+++ b/docs/source/add_new_model.rst
@@ -0,0 +1,844 @@
+.. 
+    Copyright 2020 The HuggingFace Team. All rights reserved.
+
+    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+    the License. You may obtain a copy of the License at
+
+        http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+
+How to add a model to 🤗 Transformers?
+=======================================================================================================================
+
+Adding a new model is often difficult and requires an in-depth knowledge of the 🤗 Transformers library and ideally also
+of the model's original repository. At Hugging Face, we are trying to empower the community more and more to add models
+independently. Thus, for some new models that the community wants to be added to 🤗 Transformers, we create a customized
+*call-for-model-addition* that explains step-by-step how to add the requested model. With this
+*call-for-model-addition*, we want to teach a motivated and experienced contributor of the community how to port a
+model to 🤗 Transformers.
+
+If this sounds like something you would be interested in, feel free to check out the currently open
+“calls-for-model-addition” `here
+<https://github.com/huggingface/transformers/tree/master/templates/adding_a_new_model/open_model_proposals/README.md>`__
+and to contact us.
+
+If selected, you will then work closely with one member of the Hugging Face team to integrate the model into 🤗
+Transformers. By doing so, you will both gain a theoretical and deep practical understanding of the proposed model. But
+more importantly, you will have made a major open-source contribution to 🤗 Transformers. Along the way, you will:
+
+-  get insights into open-source best practices
+-  understand the design principles of one of the most popular NLP libraries
+-  learn how to efficiently test large NLP models
+-  learn how to integrate Python utilities like ``black``, ``isort``, ``make fix-copies`` into a library to always
+   ensure clean and readable code
+
+We are also more than happy if you want to add a model that cannot be found in the “calls-for-model-addition” folder.
+The following sections explain in detail how to add a new model. It might also be very helpful to check out already
+added models to see if those resemble the model you would like to add `here
+<https://github.com/huggingface/transformers/pulls?q=is%3Apr+label%3A%22PR+for+Model+Addition%22+is%3Aclosed>`__.
+
+To start, let's try to get a general overview of the Transformers library.
+
+General overview of 🤗 Transformers
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+First, you should get a general overview of 🤗 Transformers. 🤗 Transformers is a very opinionated library, so there is a
+chance that you don't agree with some of the library's philosophies or design choices. From our experience, however, we
+found that the fundamental design choices and philosophies of the library are crucial to efficiently scale 🤗
+Transformers while keeping maintenance costs at a reasonable level.
+
+A good first starting point to better understand the library is to read the :doc:`documentation of our philosophy
+<philosophy>`. As a result of our way of working, there are some choices that we try to apply to all models:
+
+-  Composition is generally favored over abstraction
+-  Duplicating code is not always bad if it strongly improves the readability or accessibility of a model
+-  Model files are as self-contained as possible so that when you read the code of a specific model, you ideally only
+   have to look into the respective ``modeling_....py`` file.
+
+In our opinion, the library's code is not just a means to provide a product, *e.g.* the ability to use BERT for
+inference, but also the very product that we want to improve. Hence, when adding a model, the user is not only the
+person that will use your model, but also everybody that will read, try to understand, and possibly tweak your code.
+
+With this in mind, let's go a bit deeper into the general library design.
+
+Overview of models
+-----------------------------------------------------------------------------------------------------------------------
+
+To successfully add a model, it is important to understand the interaction between your model and its config,
+:class:`~transformers.PreTrainedModel`, and :class:`~transformers.PretrainedConfig`. For exemplary purposes, we will
+call the model to be added to 🤗 Transformers ``BrandNewBert``.
+
+Let's take a look:
+
+.. image:: ./imgs/transformers_overview.png
+
+As you can see, we do make use of inheritance in 🤗 Transformers, but we keep the level of abstraction to an absolute
+minimum. There are never more than two levels of abstraction for any model in the library. :obj:`BrandNewBertModel`
+inherits from :obj:`BrandNewBertPreTrainedModel` which in turn inherits from :class:`~transformers.PreTrainedModel` and
+that's it. As a general rule, we want to make sure that a new model only depends on
+:class:`~transformers.PreTrainedModel`. The important functionalities that are automatically provided to every new
+model are :meth:`~transformers.PreTrainedModel.from_pretrained` and
+:meth:`~transformers.PreTrainedModel.save_pretrained`, which are used for serialization and deserialization. All of the
+other important functionalities, such as :meth:`BrandNewBertModel.forward` should be completely defined in the new
+``modeling_brand_new_bert.py`` script. Next, we want to make sure that a model with a specific head layer, such as
+:obj:`BrandNewBertForMaskedLM` does not inherit from :obj:`BrandNewBertModel`, but rather uses :obj:`BrandNewBertModel`
+as a component that can be called in its forward pass to keep the level of abstraction low. Every new model requires a
+configuration class, called :obj:`BrandNewBertConfig`. This configuration is always stored as an attribute in
+:class:`~transformers.PreTrainedModel`, and thus can be accessed via the ``config`` attribute for all classes
+inheriting from :obj:`BrandNewBertPreTrainedModel`:
+
+   .. code:: python
+
+      model = BrandNewBertModel.from_pretrained("brandy/brand_new_bert")
+      model.config  # model has access to its config
+
+Similar to the model, the configuration inherits basic serialization and deserialization functionalities from
+:class:`~transformers.PretrainedConfig`. Note that the configuration and the model are always serialized into two
+different formats - the model to a ``pytorch_model.bin`` file and the configuration to a ``config.json`` file. Calling
+:meth:`~transformers.PreTrainedModel.save_pretrained` will automatically call
+:meth:`~transformers.PretrainedConfig.save_pretrained`, so that both model and configuration are saved.
+
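The relationship described above can be illustrated with a minimal, library-free sketch. All class names here (``ToyConfig``, ``ToyPreTrainedModel``, ``ToyModel``, ``ToyForMaskedLM``) and the ``weights.bin`` file name are hypothetical stand-ins chosen for illustration; they are not the 🤗 Transformers API. The sketch only shows the two design points: the head model uses the base model as a component instead of inheriting from it, and saving the model also saves its configuration to a separate file.

```python
import json
import os
import tempfile


class ToyConfig:
    """Stand-in for a configuration class (hypothetical, not the real API)."""

    def __init__(self, hidden_size=4, vocab_size=10):
        self.hidden_size = hidden_size
        self.vocab_size = vocab_size

    def save_pretrained(self, save_directory):
        # The configuration is written as JSON, separate from the weights.
        with open(os.path.join(save_directory, "config.json"), "w") as f:
            json.dump(self.__dict__, f)


class ToyPreTrainedModel:
    """Stand-in base class: stores the config and handles saving."""

    def __init__(self, config):
        self.config = config

    def save_pretrained(self, save_directory):
        os.makedirs(save_directory, exist_ok=True)
        # Saving the model also saves its configuration.
        self.config.save_pretrained(save_directory)
        # Placeholder weights file; a real model would serialize tensors here.
        with open(os.path.join(save_directory, "weights.bin"), "wb") as f:
            f.write(b"")


class ToyModel(ToyPreTrainedModel):
    def forward(self, input_ids):
        # Dummy "hidden states": one list of zeros per input id.
        return [[0.0] * self.config.hidden_size for _ in input_ids]


class ToyForMaskedLM(ToyPreTrainedModel):
    """Head model: inherits only from the base class and *uses* ToyModel."""

    def __init__(self, config):
        super().__init__(config)
        self.base = ToyModel(config)  # composition, not inheritance

    def forward(self, input_ids):
        hidden = self.base.forward(input_ids)
        # Dummy "logits": one score per vocabulary entry for each position.
        return [[sum(h)] * self.config.vocab_size for h in hidden]


model = ToyForMaskedLM(ToyConfig())
with tempfile.TemporaryDirectory() as tmp:
    model.save_pretrained(tmp)
    saved_files = sorted(os.listdir(tmp))
```

Because ``ToyForMaskedLM`` holds a ``ToyModel`` rather than subclassing it, a reader only needs to follow one level of inheritance to understand either class.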
+
+Overview of tokenizers
+-----------------------------------------------------------------------------------------------------------------------
+
+Not quite ready yet :-( This section will be added soon!
+
+Step-by-step recipe to add a model to 🤗 Transformers
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Everyone has different preferences of how to port a model so it can be very helpful for you to take a look at summaries
+of how other contributors ported models to Hugging Face. Here is a list of community blog posts on how to port a model:
+
+1. `Porting GPT2 Model <https://medium.com/huggingface/from-tensorflow-to-pytorch-265f40ef2a28>`__ by `Thomas
+   <https://huggingface.co/thomwolf>`__
+2. `Porting WMT19 MT Model <https://huggingface.co/blog/porting-fsmt>`__ by `Stas <https://huggingface.co/stas>`__
+
+From experience, we can tell you that the most important things to keep in mind when adding a model are:
+
+-  Don't reinvent the wheel! Most parts of the code you will add for the new 🤗 Transformers model already exist
+   somewhere in 🤗 Transformers. Take some time to find similar, already existing models and tokenizers you can copy
+   from. `grep <https://www.gnu.org/software/grep/>`__ and `rg <https://github.com/BurntSushi/ripgrep>`__ are your
+   friends. Note that it might very well happen that your model's tokenizer is based on one model implementation, and
+   your model's modeling code on another one. *E.g.* FSMT's modeling code is based on BART, while FSMT's tokenizer code
+   is based on XLM.
+-  It's more of an engineering challenge than a scientific challenge. You should spend more time on creating an
+   efficient debugging environment than trying to understand all theoretical aspects of the model in the paper.
+-  Ask for help when you're stuck! Models are the core component of 🤗 Transformers, so we at Hugging Face are more
+   than happy to help you at every step to add your model. Don't hesitate to ask if you notice you are not making
+   progress.
+
+In the following, we try to give you a general recipe that we found most useful when porting a model to 🤗 Transformers.
+
+The following list is a summary of everything that has to be done to add a model and can be used by you as a To-Do
+List:
+
+-  1. ☐ (Optional) Understood theoretical aspects
+-  2. ☐ Prepared transformers dev environment
+-  3. ☐ Set up debugging environment of the original repository
+-  4. ☐ Created script that successfully runs forward pass using original repository and checkpoint
+-  5. ☐ Successfully added the model skeleton to Transformers
+-  6. ☐ Successfully converted original checkpoint to Transformers checkpoint
+-  7. ☐ Successfully ran forward pass in Transformers that gives identical output to original checkpoint
+-  8. ☐ Finished model tests in Transformers
+-  9. ☐ Successfully added Tokenizer in Transformers
+-  10. ☐ Ran end-to-end integration tests
+-  11. ☐ Finished docs
+-  12. ☐ Uploaded model weights to the hub
+-  13. ☐ Submitted the pull request
+-  14. ☐ (Optional) Added a demo notebook
+
+To begin with, we usually recommend starting by getting a good theoretical understanding of ``BrandNewBert``. However,
+if you prefer to understand the theoretical aspects of the model *on-the-job*, then it is totally fine to directly dive
+into ``BrandNewBert``'s code-base. This option might suit you better if your engineering skills are better than
+your theoretical skills, if you have trouble understanding ``BrandNewBert``'s paper, or if you just enjoy programming
+much more than reading scientific papers.
+
+1. (Optional) Theoretical aspects of BrandNewBert
+-----------------------------------------------------------------------------------------------------------------------
+
+You should take some time to read *BrandNewBert's* paper, if such descriptive work exists. There might be large
+sections of the paper that are difficult to understand. If this is the case, this is fine - don't worry! The goal is
+not to get a deep theoretical understanding of the paper, but to extract the necessary information required to
+effectively re-implement the model in 🤗 Transformers. That being said, you don't have to spend too much time on the
+theoretical aspects, but rather focus on the practical ones, namely:
+
+-  What type of model is *brand_new_bert*? BERT-like encoder-only model? GPT2-like decoder-only model? BART-like
+   encoder-decoder model? Look at the :doc:`model_summary` if you're not familiar with the differences between those.
+-  What are the applications of *brand_new_bert*? Text classification? Text generation? Seq2Seq tasks, *e.g.,*
+   summarization?
+-  What is the novel feature of the model making it different from BERT/GPT-2/BART?
+-  Which of the already existing `🤗 Transformers models <https://huggingface.co/transformers/#contents>`__ is most
+   similar to *brand_new_bert*?
+-  What type of tokenizer is used? A sentencepiece tokenizer? Word piece tokenizer? Is it the same tokenizer as used
+   for BERT or BART?
+
+After you feel like you have gotten a good overview of the architecture of the model, you might want to write to the
+Hugging Face team with any questions you might have. This might include questions regarding the model's architecture,
+its attention layer, etc. We will be more than happy to help you.
+
+2. Next prepare your environment
+-----------------------------------------------------------------------------------------------------------------------
+
+1. Fork the `repository <https://github.com/huggingface/transformers>`__ by clicking on the 'Fork' button on the
+   repository's page. This creates a copy of the code under your GitHub user account.
+
+2. Clone your ``transformers`` fork to your local disk, and add the base repository as a remote:
+
+   .. code:: bash
+
+      git clone https://github.com/[your GitHub handle]/transformers.git
+      cd transformers
+      git remote add upstream https://github.com/huggingface/transformers.git
+
+3. Set up a development environment, for instance by running the following command:
+
+   .. code:: bash
+
+      python -m venv .env
+      source .env/bin/activate
+      pip install -e ".[dev]"
+
+   Then return to the parent directory:
+
+   .. code:: bash
+
+      cd ..
+
+4. We recommend adding the PyTorch version of *brand_new_bert* to Transformers. To install PyTorch, please follow the
+   instructions on https://pytorch.org/get-started/locally/.
+
+**Note:** You don't need to have CUDA installed. Making the new model work on CPU is sufficient.
+
+5. To port *brand_new_bert*, you will also need access to its original repository:
+
+.. code:: bash
+
+   git clone https://github.com/org_that_created_brand_new_bert_org/brand_new_bert.git
+   cd brand_new_bert
+   pip install -e .
+
+Now you have set up a development environment to port *brand_new_bert* to 🤗 Transformers.
+
+3.-4. Run a pretrained checkpoint using the original repository
+-----------------------------------------------------------------------------------------------------------------------
+
+At first, you will work on the original *brand_new_bert* repository. Often, the original implementation is very
+“researchy”, meaning that documentation might be lacking and the code can be difficult to understand. But this should
+be exactly your motivation to reimplement *brand_new_bert*. At Hugging Face, one of our main goals is to *make people
+stand on the shoulders of giants* which translates here very well into taking a working model and rewriting it to make
+it as **accessible, user-friendly, and beautiful** as possible. This is the number-one motivation to re-implement
+models into 🤗 Transformers - trying to make complex new NLP technology accessible to **everybody**.
+
+You should therefore start by diving into the original repository.
+
+Successfully running the official pretrained model in the original repository is often **the most difficult** step.
+From our experience, it is very important to spend some time getting familiar with the original code-base. You need to
+figure out the following:
+
+-  Where to find the pretrained weights?
+-  How to load the pretrained weights into the corresponding model?
+-  How to run the tokenizer independently from the model?
+-  Trace one forward pass so that you know which classes and functions are required for a simple forward pass. Usually,
+   you only have to reimplement those functions.
+-  Be able to locate the important components of the model: Where is the model's class? Are there model sub-classes,
+   *e.g.* EncoderModel, DecoderModel? Where is the self-attention layer? Are there multiple different attention layers,
+   *e.g.* *self-attention*, *cross-attention*...?
+-  How can you debug the model in the original environment of the repo? Do you have to add `print` statements, can you
+   work with an interactive debugger like `ipdb`, or should you use an efficient IDE to debug the model, like PyCharm?
+
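One way to trace a single forward pass, as suggested in the list above, is to record which Python functions get called while it runs. The sketch below uses the standard library's ``sys.setprofile`` for this. The helper ``trace_calls`` and the toy functions ``forward``, ``attention`` and ``feed_forward`` are hypothetical placeholders for the original repository's code, and an interactive debugger may well serve you better in practice.

```python
import sys


def trace_calls(fn, *args, **kwargs):
    """Run fn and return (result, names of Python functions called)."""
    called = []

    def profiler(frame, event, arg):
        # "call" fires for Python-level function calls only (not C builtins).
        if event == "call":
            called.append(frame.f_code.co_name)

    sys.setprofile(profiler)
    try:
        result = fn(*args, **kwargs)
    finally:
        # Always remove the profiler, even if fn raises.
        sys.setprofile(None)
    return result, called


# Toy stand-ins for the components of an original model.
def attention(x):
    return x * 2


def feed_forward(x):
    return x + 1


def forward(x):
    return feed_forward(attention(x))


result, called = trace_calls(forward, 3)
```

The resulting list of names gives a first approximation of which functions need to be re-implemented for a simple forward pass.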
+It is very important that, before you start the porting process, you can **efficiently** debug code in the original
+repository! Also, remember that you are working with an open-source library, so do not hesitate to open an issue, or
+even a pull request in the original repository. The maintainers of this repository are most likely very happy about
+someone looking into their code!
+
+At this point, it is really up to you which debugging environment and strategy you prefer to use to debug the original
+model. We strongly advise against setting up a costly GPU environment. Instead, simply work on a CPU, both when starting
+to dive into the original repository and when starting to write the 🤗 Transformers implementation of the model. Only at
+the very end, when the model has already been successfully ported to 🤗 Transformers, should you verify that the model
+also works as expected on GPU.
+
+In general, there are two possible debugging environments for running the original model:
+
+-  `Jupyter notebooks <https://jupyter.org/>`__ / `Google Colab
+   <https://colab.research.google.com/notebooks/intro.ipynb>`__
+-  Local Python scripts.
+
+Jupyter notebooks have the advantage that they allow for cell-by-cell execution which can be helpful to better split
+logical components from one another and to have faster debugging cycles as intermediate results can be stored. Also,
+notebooks are often easier to share with other contributors, which might be very helpful if you want to ask the Hugging
+Face team for help. If you are familiar with Jupyter notebooks, we strongly recommend that you work with them.
+
+The obvious disadvantage of Jupyter notebooks is that if you are not used to working with them you will have to spend
+some time adjusting to the new programming environment and that you might not be able to use your known debugging tools
+anymore, like ``ipdb``.
+
+For each code-base, a good first step is always to load a **small** pretrained checkpoint and to be able to reproduce a
+single forward pass using a dummy integer vector of input IDs as an input. Such a script could look like this (in
+pseudocode):
+
+.. code:: python
+
+   model = BrandNewBertModel.load_pretrained_checkpoint("/path/to/checkpoint/")
+   input_ids = [0, 4, 5, 2, 3, 7, 9]  # vector of input ids
+   original_output = model.predict(input_ids)
+
+Next, regarding the debugging strategy, there are generally a few to choose from:
+
+-  Decompose the original model into many small testable components and run a forward pass on each of those for
+   verification
+-  Decompose the original model only into the original *tokenizer* and the original *model*, run a forward pass on
+   those, and use intermediate print statements or breakpoints for verification
+
+Again, it is up to you which strategy to choose. Often, one or the other is advantageous depending on the original code
+base.
+
+If the original code-base allows you to decompose the model into smaller sub-components, *e.g.* if the original
+code-base can easily be run in eager mode, it is usually worth the effort to do so. There are some important advantages
+to taking the more difficult road in the beginning:
+
+- at a later stage when comparing the original model to the Hugging Face implementation, you can verify automatically
+  for each component individually that the corresponding component of the 🤗 Transformers implementation matches instead
+  of relying on visual comparison via print statements
+- it allows you to decompose the big problem of porting a model into smaller problems of just porting
+  individual components and thus structure your work better
+- separating the model into logically meaningful components will help you to get a better overview of the model's
+  design and thus to better understand the model
+- at a later stage those component-by-component tests help you to ensure that no regression occurs as you continue
+  changing your code
+
+`Lysandre's <https://gist.github.com/LysandreJik/db4c948f6b4483960de5cbac598ad4ed>`__ integration checks for ELECTRA
+give a nice example of how this can be done.
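+
+In a PyTorch code-base, one possible way to collect per-component outputs is to register forward hooks that record the
+output of every sub-module during a single forward pass. The sketch below is only an illustration under that
+assumption: ``TinyModel`` and ``capture_outputs`` are hypothetical names and are not part of any original repository or
+of 🤗 Transformers.
+
```python
import torch
import torch.nn as nn


class TinyModel(nn.Module):
    """Hypothetical stand-in for a model split into named components."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(20, 8)
        self.dense = nn.Linear(8, 8)
        self.layer_norm = nn.LayerNorm(8)

    def forward(self, input_ids):
        hidden = self.embed(input_ids)
        hidden = self.dense(hidden)
        return self.layer_norm(hidden)


def capture_outputs(model, input_ids):
    """Run one forward pass and return {component name: output tensor}."""
    captured = {}
    handles = []
    for name, module in model.named_children():
        # Bind `name` as a default argument so each hook keeps its own name.
        def hook(_module, _inputs, output, name=name):
            captured[name] = output.detach()

        handles.append(module.register_forward_hook(hook))
    model.eval()  # disable dropout so the pass is deterministic
    with torch.no_grad():
        model(input_ids)
    for handle in handles:
        handle.remove()  # leave the model without hooks afterwards
    return captured


torch.manual_seed(0)
model = TinyModel()
input_ids = torch.tensor([[0, 4, 5, 2, 3, 7, 9]])
outputs = capture_outputs(model, input_ids)
for name, tensor in outputs.items():
    print(name, tuple(tensor.shape))
```
+
+The same helper could be applied to the original model and to the ported one, after which the two dictionaries can be
+compared entry by entry, for instance with ``torch.allclose``.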
+
+However, if the original code-base is very complex or only allows intermediate components to be run in a compiled mode,
+it might be too time-consuming or even impossible to separate the model into smaller testable sub-components. A good
+example is `T5's MeshTensorFlow <https://github.com/tensorflow/mesh/tree/master/mesh_tensorflow>`__ library which is
+very complex and does not offer a simple way to decompose the model into its sub-components. For such libraries, one
+often relies on verifying print statements.
+
+No matter which strategy you choose, the recommended procedure is often the same in that you should start to debug the
+starting layers first and the ending layers last.
+
+It is recommended that you retrieve the output, either by print statements or sub-component functions, of the following
+layers in the following order:
+
+1.  Retrieve the input IDs passed to the model
+2.  Retrieve the word embeddings
+3.  Retrieve the input of the first Transformer layer
+4.  Retrieve the output of the first Transformer layer
+5.  Retrieve the output of the following n - 1 Transformer layers
+6.  Retrieve the output of the whole BrandNewBert Model
+
+Input IDs should thereby consist of an array of integers, *e.g.* ``input_ids = [0, 4, 4, 3, 2, 4, 1, 7, 19]``
+
+The outputs of the following layers often consist of multi-dimensional float arrays and can look like this:
+
+.. code:: bash
+
+   [[
+    [-0.1465, -0.6501,  0.1993,  ...,  0.1451,  0.3430,  0.6024],
+    [-0.4417, -0.5920,  0.3450,  ..., -0.3062,  0.6182,  0.7132],
+    [-0.5009, -0.7122,  0.4548,  ..., -0.3662,  0.6091,  0.7648],
+    ...,
+    [-0.5613, -0.6332,  0.4324,  ..., -0.3792,  0.7372,  0.9288],
+    [-0.5416, -0.6345,  0.4180,  ..., -0.3564,  0.6992,  0.9191],
+    [-0.5334, -0.6403,  0.4271,  ..., -0.3339,  0.6533,  0.8694]]]
+
+We expect that every model added to 🤗 Transformers passes a couple of integration tests, meaning that the original
+model and the reimplemented version in 🤗 Transformers have to give the same output up to a precision of 0.001!
+Since it is normal that the exact same model written in different libraries can give a slightly different output
+depending on the library framework, we accept an error tolerance of 1e-3 (0.001). It is not enough if the model gives
+roughly the same output; the outputs have to be almost identical. Therefore, you will certainly compare the
+intermediate outputs of the 🤗 Transformers version multiple times against the intermediate outputs of the original
+implementation of *brand_new_bert*, in which case an **efficient** debugging environment of the original repository is
+absolutely important. Here is some advice to make your debugging environment as efficient as possible.
+
+-  Find the best way of debugging intermediate results. Is the original repository written in PyTorch? Then you should
+   probably take the time to write a longer script that decomposes the original model into smaller sub-components to
+   retrieve intermediate values. Is the original repository written in TensorFlow 1? Then you might have to rely on
+   TensorFlow print operations like `tf.print <https://www.tensorflow.org/api_docs/python/tf/print>`__ to output
+   intermediate values. Is the original repository written in Jax? Then make sure that the model is **not jitted** when
+   running the forward pass, *e.g.* check-out `this link <https://github.com/google/jax/issues/196>`__.
+-  Use the smallest pretrained checkpoint you can find. The smaller the checkpoint, the faster your debug cycle
+   becomes. It is not efficient if your pretrained model is so big that your forward pass takes more than 10 seconds.
+   In case only very large checkpoints are available, it might make more sense to create a dummy model in the new
+   environment with randomly initialized weights and save those weights for comparison with the 🤗 Transformers version
+   of your model.
+-  Make sure you are using the easiest way of calling a forward pass in the original repository. Ideally, you want to
+   find the function in the original repository that **only** calls a single forward pass, *i.e.* that is often called
+   ``predict``, ``evaluate``, ``forward`` or ``__call__``. You don't want to debug a function that calls ``forward``
+   multiple times, *e.g.* to generate text, like ``autoregressive_sample``, ``generate``.
+-  Try to separate the tokenization from the model's ``forward`` pass. If the original repository shows examples where
+   you have to input a string, then try to find out where in the forward call the string input is changed to input ids
+   and start from this point. This might mean that you have to write a small script yourself or change the
+   original code so that you can directly input the ids instead of an input string.
+-  Make sure that the model in your debugging setup is **not** in training mode, which often causes the model to yield
+   random outputs due to multiple dropout layers in the model. Make sure that the forward pass in your debugging
+   environment is **deterministic** so that the dropout layers are not used. Or use ``transformers.set_seed``
+   if the old and new implementations are in the same framework.
+
+The following section gives you more specific details/tips on how you can do this for *brand_new_bert*.
+
+5.-14. Port BrandNewBert to 🤗 Transformers
+-----------------------------------------------------------------------------------------------------------------------
+
+Next, you can finally start adding new code to 🤗 Transformers. Go into the clone of your 🤗 Transformers' fork:
+
+::
+
+   cd transformers
+
+In the special case that you are adding a model whose architecture exactly matches the model architecture of an
+existing model you only have to add a conversion script as described in `this section <#write-a-conversion-script>`__.
+In this case, you can just re-use the whole model architecture of the already existing model.
+
+Otherwise, let's start generating a new model with the amazing Cookiecutter!
+
+**Use the Cookiecutter to automatically generate the model's code**
+
+To begin with, head over to the `🤗 Transformers templates
+<https://github.com/huggingface/transformers/tree/master/templates/adding_a_new_model>`__ to make use of our
+``cookiecutter`` implementation to automatically generate all the relevant files for your model. Again, we recommend
+only adding the PyTorch version of the model at first. Make sure you follow the instructions of the ``README.md`` on
+the `🤗 Transformers templates <https://github.com/huggingface/transformers/tree/master/templates/adding_a_new_model>`__
+carefully.
+
+**Open a Pull Request on the main huggingface/transformers repo**
+
+Before starting to adapt the automatically generated code, now is the time to open a “Work in progress (WIP)” pull
+request, *e.g.* “[WIP] Add *brand_new_bert*”, in 🤗 Transformers so that you and the Hugging Face team can work
+side-by-side on integrating the model into 🤗 Transformers.
+
+You should do the following:
+
+1. Create a branch with a descriptive name from your master branch:
+
+::
+
+   git checkout -b add_brand_new_bert
+
+2. Commit the automatically generated code:
+
+::
+
+   git add .
+   git commit
+
+3. Fetch and rebase to current master:
+
+::
+
+   git fetch upstream
+   git rebase upstream/master
+
+4. Push the changes to your account using:
+
+::
+
+   git push -u origin add_brand_new_bert
+
+5. Once you are satisfied, go to the webpage of your fork on GitHub. Click on “Pull request”. Make sure to add the
+   GitHub handle of some members of the Hugging Face team as reviewers, so that the Hugging Face team gets notified for
+   future changes.
+
+6. Change the PR into a draft by clicking on “Convert to draft” on the right of the GitHub pull request web page.
+
+In the following, whenever you have made some progress, don't forget to commit your work and push it to your account so
+that it shows in the pull request. Additionally, you should make sure to update your work with the current master from
+time to time by doing:
+
+::
+
+   git fetch upstream
+   git merge upstream/master
+
+In general, all questions you might have regarding the model or your implementation should be asked in your PR and
+discussed/solved in the PR. This way, the Hugging Face team will always be notified when you are committing new code or
+if you have a question. It is often very helpful to point the Hugging Face team to your added code so that the Hugging
+Face team can efficiently understand your problem or question.
+
+To do so, you can go to the “Files changed” tab where you see all of your changes, go to a line regarding which you
+want to ask a question, and click on the “+” symbol to add a comment. Whenever a question or problem has been solved,
+you can click on the “Resolve” button of the created comment.
+
+In the same way, the Hugging Face team will open comments when reviewing your code. We recommend asking most questions
+on GitHub on your PR. For some very general questions that are not very useful for the public, feel free to ping the
+Hugging Face team by Slack or email.
+
+**5. Adapt the generated model code for brand_new_bert**
+
+At first, we will focus only on the model itself and not care about the tokenizer. All the relevant code should be
+found in the generated files ``src/transformers/models/brand_new_bert/modeling_brand_new_bert.py`` and
+``src/transformers/models/brand_new_bert/configuration_brand_new_bert.py``.
+
+Now you can finally start coding :). The generated code in
+``src/transformers/models/brand_new_bert/modeling_brand_new_bert.py`` will either have the same architecture as BERT if
+it's an encoder-only model or BART if it's an encoder-decoder model. At this point, you should remind yourself what
+you've learned in the beginning about the theoretical aspects of the model: *How is the model different from BERT or
+BART?* Implement those changes, which often means changing the *self-attention* layer, the order of the normalization
+layer, etc… Again, it is often useful to look at the similar architecture of already existing models in Transformers to
+get a better feeling of how your model should be implemented.
+
+**Note** that at this point, you don't have to be very sure that your code is fully correct or clean. Rather, it is
+advised to add a first *unclean*, copy-pasted version of the original code to
+``src/transformers/models/brand_new_bert/modeling_brand_new_bert.py`` until you feel like all the necessary code is
+added. From our experience, it is much more efficient to quickly add a first version of the required code and
+improve/correct the code iteratively with the conversion script as described in the next section. The only thing that
+has to work at this point is that you can instantiate the 🤗 Transformers implementation of *brand_new_bert*, *i.e.* the
+following command should work:
+
+.. code:: python
+
+   from transformers import BrandNewBertModel, BrandNewBertConfig
+   model = BrandNewBertModel(BrandNewBertConfig())
+
+The above command will create a model according to the default parameters as defined in ``BrandNewBertConfig()`` with
+random weights, thus making sure that the ``__init__()`` methods of all components work.
+
+**6. Write a conversion script**
+
+Next, you should write a conversion script that lets you convert the checkpoint you used to debug *brand_new_bert* in
+the original repository to a checkpoint compatible with your just created 🤗 Transformers implementation of
+*brand_new_bert*. It is not advised to write the conversion script from scratch, but rather to look through already
+existing conversion scripts in 🤗 Transformers for one that has been used to convert a similar model that was written in
+the same framework as *brand_new_bert*. Usually, it is enough to copy an already existing conversion script and
+slightly adapt it for your use case. Don't hesitate to ask the Hugging Face team to point you to a similar already
+existing conversion script for your model.
+
+-  If you are porting a model from TensorFlow to PyTorch, a good starting point might be BERT's conversion script `here
+   <https://github.com/huggingface/transformers/blob/7acfa95afb8194f8f9c1f4d2c6028224dbed35a2/src/transformers/models/bert/modeling_bert.py#L91>`__
+-  If you are porting a model from PyTorch to PyTorch, a good starting point might be BART's conversion script `here
+   <https://github.com/huggingface/transformers/blob/master/src/transformers/models/bart/convert_bart_original_pytorch_checkpoint_to_pytorch.py>`__
+
+In the following, we'll quickly explain how PyTorch models store layer weights and define layer names. In PyTorch, the
+name of a layer is defined by the name of the class attribute you give the layer. Let's define a dummy model in
+PyTorch, called ``SimpleModel`` as follows:
+
+.. code:: python
+
+   import torch.nn as nn
+
+   class SimpleModel(nn.Module):
+       def __init__(self):
+           super().__init__()
+           self.dense = nn.Linear(10, 10)
+           self.intermediate = nn.Linear(10, 10)
+           self.layer_norm = nn.LayerNorm(10)
+
+Now we can create an instance of this model definition which will fill all weights: ``dense``, ``intermediate``,
+``layer_norm`` with random weights. We can print the model to see its architecture:
+
+.. code:: python
+
+   model = SimpleModel()
+
+   print(model)
+
+This will print out the following:
+
+.. code:: bash
+
+   SimpleModel(
+     (dense): Linear(in_features=10, out_features=10, bias=True)
+     (intermediate): Linear(in_features=10, out_features=10, bias=True)
+     (layer_norm): LayerNorm((10,), eps=1e-05, elementwise_affine=True)
+   )
+
+We can see that the layer names are defined by the name of the class attribute in PyTorch. You can print out the weight
+values of a specific layer:
+
+.. code:: python
+
+   print(model.dense.weight.data)
+
+to see that the weights were randomly initialized:
+
+.. code:: bash
+
+   tensor([[-0.0818,  0.2207, -0.0749, -0.0030,  0.0045, -0.1569, -0.1598,  0.0212,
+            -0.2077,  0.2157],
+           [ 0.1044,  0.0201,  0.0990,  0.2482,  0.3116,  0.2509,  0.2866, -0.2190,
+             0.2166, -0.0212],
+           [-0.2000,  0.1107, -0.1999, -0.3119,  0.1559,  0.0993,  0.1776, -0.1950,
+            -0.1023, -0.0447],
+           [-0.0888, -0.1092,  0.2281,  0.0336,  0.1817, -0.0115,  0.2096,  0.1415,
+            -0.1876, -0.2467],
+           [ 0.2208, -0.2352, -0.1426, -0.2636, -0.2889, -0.2061, -0.2849, -0.0465,
+             0.2577,  0.0402],
+           [ 0.1502,  0.2465,  0.2566,  0.0693,  0.2352, -0.0530,  0.1859, -0.0604,
+             0.2132,  0.1680],
+           [ 0.1733, -0.2407, -0.1721,  0.1484,  0.0358, -0.0633, -0.0721, -0.0090,
+             0.2707, -0.2509],
+           [-0.1173,  0.1561,  0.2945,  0.0595, -0.1996,  0.2988, -0.0802,  0.0407,
+             0.1829, -0.1568],
+           [-0.1164, -0.2228, -0.0403,  0.0428,  0.1339,  0.0047,  0.1967,  0.2923,
+             0.0333, -0.0536],
+           [-0.1492, -0.1616,  0.1057,  0.1950, -0.2807, -0.2710, -0.1586,  0.0739,
+             0.2220,  0.2358]])
+
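+To see all parameter names and shapes at once, which is what a conversion script ultimately has to match against the
+checkpoint, one can iterate over ``named_parameters()``. The following minimal sketch repeats the ``SimpleModel``
+definition from above so that it runs on its own; ``param_shapes`` is just an illustrative variable name.
+
```python
import torch.nn as nn


class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.dense = nn.Linear(10, 10)
        self.intermediate = nn.Linear(10, 10)
        self.layer_norm = nn.LayerNorm(10)


model = SimpleModel()

# Each parameter name has the form "<attribute name>.<parameter name>".
param_shapes = {name: tuple(p.shape) for name, p in model.named_parameters()}
for name, shape in param_shapes.items():
    print(f"{name}: {shape}")
```
+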
+In the conversion script, you should fill those randomly initialized weights with the exact weights of the
+corresponding layer in the checkpoint. *E.g.*
+
+.. code:: python
+
+   import torch
+
+   # retrieve matching layer weights, e.g. by a recursive algorithm
+   layer_name = "dense"
+   pretrained_weight = array_of_dense_layer  # NumPy array read from the checkpoint
+
+   model_pointer = getattr(model, layer_name)
+
+   model_pointer.weight.data = torch.from_numpy(pretrained_weight)
+
+While doing so, you must verify that each randomly initialized weight of your PyTorch model and its corresponding
+pretrained checkpoint weight exactly match in both **shape and name**. To do so, it is **necessary** to add assert
+statements for the shape and print out the names of the checkpoint weights. *E.g.* you should add statements like:
+
+.. code:: python
+
+   assert (
+        model_pointer.weight.shape == pretrained_weight.shape
+   ), f"Pointer shape of random weight {model_pointer.weight.shape} and array shape of checkpoint weight {pretrained_weight.shape} mismatched"
+
+Besides, you should also print out the names of both weights to make sure they match, *e.g.*
+
+.. code:: python
+
+   logger.info(f"Initialize PyTorch weight {layer_name} from {pretrained_weight.name}")
+
+If either the shape or the name doesn't match, you probably assigned the wrong checkpoint weight to a randomly
+initialized layer of the 🤗 Transformers implementation.
+
+An incorrect shape is most likely due to an incorrect setting of the config parameters in ``BrandNewBertConfig()`` that
+do not exactly match those that were used for the checkpoint you want to convert. However, it could also be that
+PyTorch's implementation of a layer requires the weight to be transposed beforehand.
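+
+The transposition case typically arises for dense layers: ``torch.nn.Linear`` stores its weight with shape
+``(out_features, in_features)``, whereas TensorFlow/Keras dense layers store their kernel as ``(in_features,
+out_features)``. The sketch below uses plain nested lists in place of real arrays; ``shape_of``, ``transpose`` and the
+example values are hypothetical and written only for this illustration.
+
```python
def shape_of(matrix):
    """Return (rows, columns) of a rectangular nested list."""
    return (len(matrix), len(matrix[0]))


def transpose(matrix):
    """Swap rows and columns of a rectangular nested list."""
    return [list(column) for column in zip(*matrix)]


# Hypothetical checkpoint kernel with in_features=3, out_features=2.
checkpoint_kernel = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
]
expected_shape = (2, 3)  # (out_features, in_features), as nn.Linear(3, 2) expects

weight = checkpoint_kernel
if shape_of(weight) != expected_shape:
    # A shape that only matches after transposing hints at a layout difference.
    weight = transpose(weight)

assert shape_of(weight) == expected_shape, f"{shape_of(weight)} != {expected_shape}"
print(weight)
```
+
+Note that for square weight matrices the shapes match either way, so a missing transpose only becomes visible when the
+outputs of the forward pass are compared.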
+
+Finally, you should also check that **all** required weights are initialized and print out all checkpoint weights that
+were not used for initialization to make sure the model is correctly converted. It is completely normal that the
+conversion trials fail with either a wrong shape statement or a wrong name assignment. This is most likely because
+you used incorrect parameters in ``BrandNewBertConfig()``, have a wrong architecture in the 🤗 Transformers
+implementation, have a bug in the ``__init__()`` functions of one of the components of the 🤗 Transformers
+implementation, or need to transpose one of the checkpoint weights.
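+
+One way to keep track of which weights were loaded is to compare the checkpoint names with the model's parameter names
+after applying the name mapping. The sketch below works on plain dictionaries of shapes; the weight names, the
+``rename`` rule and ``check_conversion`` are made up for illustration and will differ for a real checkpoint.
+
```python
def rename(checkpoint_name):
    """Hypothetical mapping from checkpoint names to PyTorch names."""
    return checkpoint_name.replace("/", ".").replace("kernel", "weight")


def check_conversion(checkpoint_shapes, model_shapes):
    """Return (unused checkpoint names, uninitialized model names)."""
    used = set()
    unused = []
    for ckpt_name, ckpt_shape in checkpoint_shapes.items():
        target = rename(ckpt_name)
        if target not in model_shapes:
            unused.append(ckpt_name)
            continue
        assert model_shapes[target] == ckpt_shape, (
            f"Shape mismatch for {target}: {model_shapes[target]} vs {ckpt_shape}"
        )
        used.add(target)
    missing = sorted(set(model_shapes) - used)
    return unused, missing


checkpoint_shapes = {
    "dense/kernel": (10, 10),
    "dense/bias": (10,),
    "optimizer/step": (),
}
model_shapes = {
    "dense.weight": (10, 10),
    "dense.bias": (10,),
    "layer_norm.weight": (10,),
}

unused, missing = check_conversion(checkpoint_shapes, model_shapes)
print("Checkpoint weights not used:", unused)
print("Model weights not initialized:", missing)
```
+
+Unused optimizer state is usually harmless, whereas an uninitialized model weight such as ``layer_norm.weight`` in this
+toy example would need to be investigated.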
+
+This step should be iterated with the previous step until all weights of the checkpoint are correctly loaded in the
+Transformers model. Having correctly loaded the checkpoint into the 🤗 Transformers implementation, you can then save
+the model under a folder of your choice ``/path/to/converted/checkpoint/folder`` that should then contain both a
+``pytorch_model.bin`` file and a ``config.json`` file:
+
+.. code:: python
+
+   model.save_pretrained("/path/to/converted/checkpoint/folder")
+
+**7. Implement the forward pass**
+
+Having managed to correctly load the pretrained weights into the 🤗 Transformers implementation, you should now make
+sure that the forward pass is correctly implemented. In `Get familiar with the original repository
+<#run-a-pretrained-checkpoint-using-the-original-repository>`__, you have already created a script that runs a forward
+pass of the model using the original repository. Now you should write an analogous script using the 🤗 Transformers
+implementation instead of the original one. It should look as follows:
+
+.. code:: python
+
+   import torch
+
+   model = BrandNewBertModel.from_pretrained("/path/to/converted/checkpoint/folder")
+   input_ids = torch.tensor([[0, 4, 4, 3, 2, 4, 1, 7, 19]])  # batch containing one sequence
+   output = model(input_ids).last_hidden_state
+
+It is very likely that the 🤗 Transformers implementation and the original model implementation don't give the exact
+same output the very first time or that the forward pass throws an error. Don't be disappointed - it's expected! First,
+you should make sure that the forward pass doesn't throw any errors. It often happens that the wrong dimensions are
+used leading to a `Dimensionality mismatch` error or that the wrong data type object is used, *e.g.* ``torch.long``
+instead of ``torch.float32``. Don't hesitate to ask the Hugging Face team for help, if you don't manage to solve
+certain errors.
+
+The final part to make sure the 🤗 Transformers implementation works correctly is to ensure that the outputs are
+equivalent to a precision of ``1e-3``. First, you should ensure that the output shapes are identical, *i.e.*
+``output.shape`` should yield the same value for the script of the 🤗 Transformers implementation and the original
+implementation. Next, you should make sure that the output values are identical as well. This is one of the most
+difficult parts of adding a new model. Common reasons why the outputs are not identical are:
+
+-  Some layers were not added, *e.g.* an *activation* layer was not added, or the residual connection was forgotten
+-  The word embedding matrix was not tied
+-  The wrong positional embeddings are used because the original implementation uses an offset
+-  Dropout is applied during the forward pass. To fix this make sure ``model.training`` is ``False`` and that no
+   dropout layer is falsely activated during the forward pass, *i.e.* pass ``self.training`` to `PyTorch's functional
+   dropout <https://pytorch.org/docs/stable/nn.functional.html?highlight=dropout#torch.nn.functional.dropout>`_
+
+The best way to fix the problem is usually to look at the forward pass of the original implementation and the 🤗
+Transformers implementation side-by-side and check if there are any differences. Ideally, you should debug/print out
+intermediate outputs of both implementations of the forward pass to find the exact position in the network where the 🤗
+Transformers implementation shows a different output than the original implementation. First, make sure that the
+hard-coded ``input_ids`` in both scripts are identical. Next, verify that the outputs of the first transformation of
+the ``input_ids`` (usually the word embeddings) are identical. And then work your way up to the very last layer of the
+network. At some point, you will notice a difference between the two implementations, which should point you to the bug
+in the 🤗 Transformers implementation. From our experience, a simple and efficient way is to add many print statements
+in both the original implementation and 🤗 Transformers implementation, at the same positions in the network
+respectively, and to successively remove print statements showing the same values for intermediate representations.
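+
+The layer-by-layer comparison described above can also be automated once the intermediate outputs of both
+implementations have been collected in the same order. A minimal sketch, assuming the outputs were flattened into
+plain lists of floats; ``max_abs_diff``, ``first_divergence`` and the example values are hypothetical.
+
```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two flat lists."""
    assert len(a) == len(b), f"Length mismatch: {len(a)} vs {len(b)}"
    return max(abs(x - y) for x, y in zip(a, b))


def first_divergence(original, ported, atol=1e-3):
    """Return the first layer name whose outputs differ by more than atol."""
    for name, reference in original.items():
        if max_abs_diff(reference, ported[name]) > atol:
            return name
    return None


# Outputs are stored in forward order: embeddings first, last layer last.
original = {
    "embeddings": [-0.1465, -0.6501, 0.1993],
    "layer_0": [-0.4417, -0.5920, 0.3450],
    "layer_1": [-0.5009, -0.7122, 0.4548],
}
ported = {
    "embeddings": [-0.1465, -0.6501, 0.1994],  # within tolerance
    "layer_0": [-0.4417, -0.5720, 0.3450],  # second value is off by 0.02
    "layer_1": [-0.5109, -0.7122, 0.4548],
}

print(first_divergence(original, ported))
```
+
+Because every later layer depends on the earlier ones, only the first reported layer is informative; differences
+further up the network are usually a consequence of it.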
+
+When you're confident that both implementations yield the same output, verifying the outputs with
+``torch.allclose(original_output, output, atol=1e-3)``, you're done with the most difficult part! Congratulations - the
+work left to be done should be a cakewalk 😊.
+
+**8. Adding all necessary model tests**
+
+At this point, you have successfully added a new model. However, it is very much possible that the model does not yet
+fully comply with the required design. To make sure the implementation is fully compatible with 🤗 Transformers, all
+common tests should pass. The Cookiecutter should have automatically added a test file for your model, probably under
+``tests/test_modeling_brand_new_bert.py``. Run this test file to verify that all common tests pass:
+
+.. code:: bash
+
+   pytest tests/test_modeling_brand_new_bert.py
+
+Having fixed all common tests, it is now crucial to ensure that all the nice work you have done is well tested, so that
+
+-  a) The community can easily understand your work by looking at specific tests of *brand_new_bert*
+-  b) Future changes to your model will not break any important feature of the model.
+
+At first, integration tests should be added. Those integration tests essentially do the same as the debugging scripts
+you used earlier to port the model to 🤗 Transformers. A template of those model tests is already added by the
+Cookiecutter, called ``BrandNewBertModelIntegrationTests`` and only has to be filled out by you. To ensure that those
+tests are passing, run
+
+.. code:: bash
+
+   RUN_SLOW=1 pytest -sv tests/test_modeling_brand_new_bert.py::BrandNewBertModelIntegrationTests
+
+.. note::
+
+  In case you are using Windows, you should replace ``RUN_SLOW=1`` with ``SET RUN_SLOW=1``
+
+Second, all features that are special to *brand_new_bert* should be tested additionally in a separate test under
+``BrandNewBertModelTester``/``BrandNewBertModelTest``. This part is often forgotten but is extremely useful in two
+ways:
+
+-  It helps to transfer the knowledge you have acquired during the model addition to the community by showing how the
+   special features of *brand_new_bert* should work.
+-  Future contributors can quickly test changes to the model by running those special tests.
+
+
+**9. Implement the tokenizer**
+
+Next, we should add the tokenizer of *brand_new_bert*. Usually, the tokenizer is equivalent or very similar to an
+already existing tokenizer of 🤗 Transformers.
+
+It is very important to find/extract the original tokenizer file and to manage to load this file into the 🤗
+Transformers' implementation of the tokenizer.
+
+To ensure that the tokenizer works correctly, it is recommended to first create a script in the original repository
+that inputs a string and returns the ``input_ids``. It could look similar to this (in pseudo-code):
+
+.. code:: python
+
+   input_str = "This is a long example input string containing special characters .$?-, numbers 2872 234 12 and words."
+   model = BrandNewBertModel.load_pretrained_checkpoint("/path/to/checkpoint/")
+   input_ids = model.tokenize(input_str)
+
+You might have to take a deeper look again into the original repository to find the correct tokenizer function or you
+might even have to make changes to your clone of the original repository to only output the ``input_ids``. Having written
+a functional tokenization script that uses the original repository, an analogous script for 🤗 Transformers should be
+created. It should look similar to this:
+
+.. code:: python
+
+   from transformers import BrandNewBertTokenizer
+   input_str = "This is a long example input string containing special characters .$?-, numbers 2872 234 12 and words."
+
+   tokenizer = BrandNewBertTokenizer.from_pretrained("/path/to/tokenizer/folder/")
+
+   input_ids = tokenizer(input_str).input_ids
+
+When both ``input_ids`` yield the same values, as a final step a tokenizer test file should also be added.
+
+Analogous to the modeling test files of *brand_new_bert*, the tokenization test files of *brand_new_bert* should
+contain a couple of hard-coded integration tests.
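+
+Such a hard-coded test can be as simple as comparing the ids produced by the ported tokenizer with ids that were
+recorded once from the original repository. The sketch below uses a toy whitespace tokenizer with a made-up vocabulary
+as a stand-in; ``toy_tokenize``, ``VOCAB`` and the expected ids are hypothetical and only show the structure of such a
+test.
+
```python
import unittest

# Made-up vocabulary; id 0 is reserved for unknown tokens.
VOCAB = {"<unk>": 0, "this": 1, "is": 2, "a": 3, "test": 4}


def toy_tokenize(text):
    """Lower-case, split on whitespace, and map tokens to ids."""
    return [VOCAB.get(token, VOCAB["<unk>"]) for token in text.lower().split()]


class ToyTokenizerIntegrationTest(unittest.TestCase):
    def test_known_input(self):
        # Ids recorded once from the reference implementation.
        expected_ids = [1, 2, 3, 4, 0]
        self.assertEqual(toy_tokenize("This is a test 2872"), expected_ids)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ToyTokenizerIntegrationTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```
+
+In a real test file, the toy function would be replaced by the ported tokenizer and the expected ids by the output of
+the original tokenization script.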
+
+**10. Run End-to-end integration tests**
+
+Having added the tokenizer, you should also add a couple of end-to-end integration tests using both the model and the
+tokenizer to ``tests/test_modeling_brand_new_bert.py`` in 🤗 Transformers. Such a test should show on a meaningful
+text-to-text sample that the 🤗 Transformers implementation works as expected. A meaningful text-to-text sample can
+include *e.g.* a source-to-target-translation pair, an article-to-summary pair, a question-to-answer pair, etc… If none
+of the ported checkpoints has been fine-tuned on a downstream task, it is enough to simply rely on the model tests. In a
+final step to ensure that the model is fully functional, it is advised that you also run all tests on GPU. It can
+happen that you forgot to add some ``.to(self.device)`` statements to internal tensors of the model, which in such a
+test would show in an error. In case you have no access to a GPU, the Hugging Face team can take care of running those
+tests for you.
+
+**11. Add Docstring**
+
+Now, all the necessary functionality for *brand_new_bert* is added - you're almost done! The only thing left to add is
+a nice docstring and a doc page. The Cookiecutter should have added a template file called
+``docs/source/model_doc/brand_new_bert.rst`` that you should fill out. Users of your model will usually first look at
+this page before using your model. Hence, the documentation must be understandable and concise. It is very useful for
+the community to add some *Tips* to show how the model should be used. Don't hesitate to ping the Hugging Face team
+regarding the docstrings.
+
+Next, make sure that the docstring added to ``src/transformers/models/brand_new_bert/modeling_brand_new_bert.py`` is
+correct and includes all necessary inputs and outputs. It is always good to remind oneself that documentation should
+be treated at least as carefully as the code in 🤗 Transformers since the documentation is usually the first contact
+point of the community with the model.
+
+**Code refactor**
+
+Great, now you have added all the necessary code for *brand_new_bert*. At this point, you should correct any
+code style issues by running:
+
+.. code:: bash
+
+   make style
+
+and verify that your coding style passes the quality check:
+
+.. code:: bash
+
+   make quality
+
+There are a couple of other very strict design tests in 🤗 Transformers that might still be failing, which shows up in
+the tests of your pull request. This is often because of some missing information in the docstring or some incorrect
+naming. The Hugging Face team will surely help you if you're stuck here.
+
+Lastly, it is always a good idea to refactor one's code after having ensured that the code works correctly. With all
+tests passing, now is a good time to go over the added code again and do some refactoring.
+
+You have now finished the coding part, congratulations! 🎉 You are awesome! 😎
+
+**12. Upload the models to the model hub**
+
+In this final part, you should convert and upload all checkpoints to the model hub and add a model card for each
+uploaded model checkpoint. You should work alongside the Hugging Face team here to decide on a fitting name for each
+checkpoint and to get the required access rights to be able to upload the model under the author's organization of
+*brand_new_bert*.
+
+It is worth spending some time to create fitting model cards for each checkpoint. The model cards should highlight the
+specific characteristics of this particular checkpoint, *e.g.* On which dataset was the checkpoint
+pretrained/fine-tuned? On what downstream task should the model be used? Also include some code showing how to
+correctly use the model.
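+
+The questions above can be turned into a simple skeleton. The following is only a sketch: the ``model_card`` helper,
+the section titles and the checkpoint name are assumptions made for illustration, not a format required by the model
+hub.
+

```python
def model_card(checkpoint, dataset, task, usage_lines):
    """Return a Markdown model card skeleton as a single string."""
    # Markdown renders lines indented by four spaces as a code block.
    usage = "\n".join("    " + line for line in usage_lines)
    sections = [
        f"# {checkpoint}",
        "## Training data",
        dataset,
        "## Intended use",
        task,
        "## How to use",
        usage,
    ]
    return "\n\n".join(sections) + "\n"


card = model_card(
    checkpoint="brand_new_bert-base",  # hypothetical checkpoint name
    dataset="Pretrained on <dataset name>.",
    task="Intended for <downstream task>.",
    usage_lines=[
        "from transformers import AutoModel",
        'model = AutoModel.from_pretrained("<organization>/brand_new_bert-base")',
    ],
)
print(card)
```

+
+Replace the placeholders in angle brackets with the actual dataset, task and organization of the checkpoint.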
+
+**13. (Optional) Add notebook**
+
+It is very helpful to add a notebook that showcases in detail how *brand_new_bert* can be used for inference and/or
+fine-tuned on a downstream task. This is not mandatory to merge your PR, but very useful for the community.
+
+**14. Submit your finished PR**
+
+You're done programming now and can move to the last step, which is getting your PR merged into master. Usually, the
+Hugging Face team should have helped you already at this point, but it is worth taking some time to give your finished
+PR a nice description and possibly add comments to your code, if you want to point out certain design choices to your
+reviewer.
+
+Share your work!!
+-----------------------------------------------------------------------------------------------------------------------
+
+Now, it's time to get some credit from the community for your work! Having completed a model addition is a major
+contribution to Transformers and the whole NLP community. Your code and the ported pre-trained models will certainly be
+used by hundreds and possibly even thousands of developers and researchers. You should be proud of your work and share
+your achievement with the community.
+
+**You have made another model that is super easy to access for everyone in the community! 🤯**
--- a/docs/source/community.md
+++ b/docs/source/community.md
@@ -0,0 +1,49 @@
+# Community
+
+This page gathers resources around 🤗 Transformers developed by the community.
+
+## Community resources:
+
+| Resource     |      Description      |      Author      |
+|:----------|:-------------|------:|
+| [Hugging Face Transformers Glossary Flashcards](https://www.darigovresearch.com/huggingface-transformers-glossary-flashcards) | A set of flashcards based on the [Transformers Docs Glossary](https://huggingface.co/transformers/master/glossary.html) that has been put into a form which can be easily learnt/revised using [Anki](https://apps.ankiweb.net/), an open source, cross platform app specifically designed for long term knowledge retention. See this [Introductory video on how to use the flashcards](https://www.youtube.com/watch?v=Dji_h7PILrw). | [Darigov Research](https://www.darigovresearch.com/) |
+
+## Community notebooks:
+
+| Notebook     |      Description      |      Author      |      |
+|:----------|:-------------|:-------------|------:|
+| [Train T5 in Tensorflow 2](https://github.com/snapthat/TF-T5-text-to-text) | How to train T5 for any task using Tensorflow 2. This notebook demonstrates a Question & Answer task implemented in Tensorflow 2 using SQUAD | [Muhammad Harris](https://github.com/HarrisDePerceptron) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snapthat/TF-T5-text-to-text/blob/master/snapthatT5/notebooks/TF-T5-Datasets%20Training.ipynb) |
+| [Train T5 on TPU](https://github.com/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb)  | How to train T5 on SQUAD with Transformers and Nlp | [Suraj Patil](https://github.com/patil-suraj) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb#scrollTo=QLGiFCDqvuil) |
+| [Fine-tune T5 for Classification and Multiple Choice](https://github.com/patil-suraj/exploring-T5/blob/master/t5_fine_tuning.ipynb)  | How to fine-tune T5 for classification and multiple choice tasks using a text-to-text format with PyTorch Lightning |  [Suraj Patil](https://github.com/patil-suraj) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/t5_fine_tuning.ipynb) |
+| [Fine-tune DialoGPT on New Datasets and Languages](https://github.com/ncoop57/i-am-a-nerd/blob/master/_notebooks/2020-05-12-chatbot-part-1.ipynb)  | How to fine-tune the DialoGPT model on a new dataset for open-dialog conversational chatbots |  [Nathan Cooper](https://github.com/ncoop57) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ncoop57/i-am-a-nerd/blob/master/_notebooks/2020-05-12-chatbot-part-1.ipynb) |
+| [Long Sequence Modeling with Reformer](https://github.com/patrickvonplaten/notebooks/blob/master/PyTorch_Reformer.ipynb)  | How to train on sequences as long as 500,000 tokens with Reformer |  [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/PyTorch_Reformer.ipynb)  |
+| [Fine-tune BART for Summarization](https://github.com/ohmeow/ohmeow_website/blob/master/_notebooks/2020-05-23-text-generation-with-blurr.ipynb) | How to fine-tune BART for summarization with fastai using blurr | [Wayde Gilliam](https://ohmeow.com/) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ohmeow/ohmeow_website/blob/master/_notebooks/2020-05-23-text-generation-with-blurr.ipynb) |
+| [Fine-tune a pre-trained Transformer on anyone's tweets](https://colab.research.google.com/github/borisdayma/huggingtweets/blob/master/huggingtweets-demo.ipynb)  | How to generate tweets in the style of your favorite Twitter account by fine-tuning a GPT-2 model |  [Boris Dayma](https://github.com/borisdayma) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/borisdayma/huggingtweets/blob/master/huggingtweets-demo.ipynb) |
+| [A Step by Step Guide to Tracking Hugging Face Model Performance](https://colab.research.google.com/drive/1NEiqNPhiouu2pPwDAVeFoN4-vTYMz9F8)  | A quick tutorial for training NLP models with HuggingFace and visualizing their performance with Weights & Biases |  [Jack Morris](https://github.com/jxmorris12) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1NEiqNPhiouu2pPwDAVeFoN4-vTYMz9F8) |
+| [Pretrain Longformer](https://github.com/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb)  | How to build a "long" version of existing pretrained models |  [Iz Beltagy](https://beltagy.net) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb) |
+| [Fine-tune Longformer for QA](https://github.com/patil-suraj/Notebooks/blob/master/longformer_qa_training.ipynb) | How to fine-tune longformer model for QA task | [Suraj Patil](https://github.com/patil-suraj) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patil-suraj/Notebooks/blob/master/longformer_qa_training.ipynb) |
+| [Evaluate Model with 🤗nlp](https://github.com/patrickvonplaten/notebooks/blob/master/How_to_evaluate_Longformer_on_TriviaQA_using_NLP.ipynb) | How to evaluate longformer on TriviaQA with `nlp` | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1m7eTGlPmLRgoPkkA7rkhQdZ9ydpmsdLE?usp=sharing) |
+| [Fine-tune T5 for Sentiment Span Extraction](https://github.com/enzoampil/t5-intro/blob/master/t5_qa_training_pytorch_span_extraction.ipynb)  | How to fine-tune T5 for sentiment span extraction using a text-to-text format with PyTorch Lightning |  [Lorenzo Ampil](https://github.com/enzoampil) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/enzoampil/t5-intro/blob/master/t5_qa_training_pytorch_span_extraction.ipynb) |
+| [Fine-tune DistilBert for Multiclass Classification](https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_multiclass_classification.ipynb) | How to fine-tune DistilBert for multiclass classification with PyTorch | [Abhishek Kumar Mishra](https://github.com/abhimishra91) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/abhimishra91/transformers-tutorials/blob/master/transformers_multiclass_classification.ipynb)|
+|[Fine-tune BERT for Multi-label Classification](https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_multi_label_classification.ipynb)|How to fine-tune BERT for multi-label classification using PyTorch|[Abhishek Kumar Mishra](https://github.com/abhimishra91) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/abhimishra91/transformers-tutorials/blob/master/transformers_multi_label_classification.ipynb)|
+|[Fine-tune T5 for Summarization](https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_summarization_wandb.ipynb)|How to fine-tune T5 for summarization in PyTorch and track experiments with WandB|[Abhishek Kumar Mishra](https://github.com/abhimishra91) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/abhimishra91/transformers-tutorials/blob/master/transformers_summarization_wandb.ipynb)|
+|[Speed up Fine-Tuning in Transformers with Dynamic Padding / Bucketing](https://github.com/ELS-RD/transformers-notebook/blob/master/Divide_Hugging_Face_Transformers_training_time_by_2_or_more.ipynb)|How to speed up fine-tuning by a factor of 2 using dynamic padding / bucketing|[Michael Benesty](https://github.com/pommedeterresautee) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1CBfRU1zbfu7-ijiOqAAQUA-RJaxfcJoO?usp=sharing)|
+|[Pretrain Reformer for Masked Language Modeling](https://github.com/patrickvonplaten/notebooks/blob/master/Reformer_For_Masked_LM.ipynb)| How to train a Reformer model with bi-directional self-attention layers | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1tzzh0i8PgDQGV3SMFUGxM7_gGae3K-uW?usp=sharing)|
+|[Expand and Fine Tune Sci-BERT](https://github.com/lordtt13/word-embeddings/blob/master/COVID-19%20Research%20Data/COVID-SciBERT.ipynb)| How to increase vocabulary of a pretrained SciBERT model from AllenAI on the CORD dataset and pipeline it. | [Tanmay Thakur](https://github.com/lordtt13) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1rqAR40goxbAfez1xvF3hBJphSCsvXmh8)|
+|[Fine-tune Electra and interpret with Integrated Gradients](https://github.com/elsanns/xai-nlp-notebooks/blob/master/electra_fine_tune_interpret_captum_ig.ipynb) | How to fine-tune Electra for sentiment analysis and interpret predictions with Captum Integrated Gradients | [Eliza Szczechla](https://elsanns.github.io) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elsanns/xai-nlp-notebooks/blob/master/electra_fine_tune_interpret_captum_ig.ipynb)|
+|[fine-tune a non-English GPT-2 Model with Trainer class](https://github.com/philschmid/fine-tune-GPT-2/blob/master/Fine_tune_a_non_English_GPT_2_Model_with_Huggingface.ipynb) | How to fine-tune a non-English GPT-2 Model with Trainer class | [Philipp Schmid](https://www.philschmid.de) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/philschmid/fine-tune-GPT-2/blob/master/Fine_tune_a_non_English_GPT_2_Model_with_Huggingface.ipynb)|
+|[Fine-tune a DistilBERT Model for Multi Label Classification task](https://github.com/DhavalTaunk08/Transformers_scripts/blob/master/Transformers_multilabel_distilbert.ipynb) | How to fine-tune a DistilBERT Model for Multi Label Classification task | [Dhaval Taunk](https://github.com/DhavalTaunk08) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DhavalTaunk08/Transformers_scripts/blob/master/Transformers_multilabel_distilbert.ipynb)|
+|[Fine-tune ALBERT for sentence-pair classification](https://github.com/NadirEM/nlp-notebooks/blob/master/Fine_tune_ALBERT_sentence_pair_classification.ipynb) | How to fine-tune an ALBERT model or another BERT-based model for the sentence-pair classification task | [Nadir El Manouzi](https://github.com/NadirEM) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NadirEM/nlp-notebooks/blob/master/Fine_tune_ALBERT_sentence_pair_classification.ipynb)|
+|[Fine-tune Roberta for sentiment analysis](https://github.com/DhavalTaunk08/NLP_scripts/blob/master/sentiment_analysis_using_roberta.ipynb) | How to fine-tune a RoBERTa model for sentiment analysis | [Dhaval Taunk](https://github.com/DhavalTaunk08) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DhavalTaunk08/NLP_scripts/blob/master/sentiment_analysis_using_roberta.ipynb)|
+|[Evaluating Question Generation Models](https://github.com/flexudy-pipe/qugeev) | How accurate are the answers to questions generated by your seq2seq transformer model? | [Pascal Zoleko](https://github.com/zolekode) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1bpsSqCQU-iw_5nNoRm_crPq6FRuJthq_?usp=sharing)|
+|[Classify text with DistilBERT and Tensorflow](https://github.com/peterbayerle/huggingface_notebook/blob/main/distilbert_tf.ipynb) | How to fine-tune DistilBERT for text classification in TensorFlow | [Peter Bayerle](https://github.com/peterbayerle) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/peterbayerle/huggingface_notebook/blob/main/distilbert_tf.ipynb)|
+|[Leverage BERT for Encoder-Decoder Summarization on CNN/Dailymail](https://github.com/patrickvonplaten/notebooks/blob/master/BERT2BERT_for_CNN_Dailymail.ipynb) | How to warm-start a *EncoderDecoderModel* with a *bert-base-uncased* checkpoint for summarization on CNN/Dailymail | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/BERT2BERT_for_CNN_Dailymail.ipynb)|
+|[Leverage RoBERTa for Encoder-Decoder Summarization on BBC XSum](https://github.com/patrickvonplaten/notebooks/blob/master/RoBERTaShared_for_BBC_XSum.ipynb) | How to warm-start a shared *EncoderDecoderModel* with a *roberta-base* checkpoint for summarization on BBC/XSum | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/RoBERTaShared_for_BBC_XSum.ipynb)|
+|[Fine-tune TAPAS on Sequential Question Answering (SQA)](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb) | How to fine-tune *TapasForQuestionAnswering* with a *tapas-base* checkpoint on the Sequential Question Answering (SQA) dataset | [Niels Rogge](https://github.com/nielsrogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb)|
+|[Evaluate TAPAS on Table Fact Checking (TabFact)](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Evaluating_TAPAS_on_the_Tabfact_test_set.ipynb) | How to evaluate a fine-tuned *TapasForSequenceClassification* with a *tapas-base-finetuned-tabfact* checkpoint using a combination of the 🤗 datasets and 🤗 transformers libraries | [Niels Rogge](https://github.com/nielsrogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Evaluating_TAPAS_on_the_Tabfact_test_set.ipynb)|
+|[Fine-tuning mBART for translation](https://colab.research.google.com/github/vasudevgupta7/huggingface-tutorials/blob/main/translation_training.ipynb) | How to fine-tune mBART using Seq2SeqTrainer for Hindi to English translation | [Vasudev Gupta](https://github.com/vasudevgupta7) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vasudevgupta7/huggingface-tutorials/blob/main/translation_training.ipynb)|
+|[Fine-tune LayoutLM on FUNSD (a form understanding dataset)](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLM/Fine_tuning_LayoutLMForTokenClassification_on_FUNSD.ipynb) | How to fine-tune *LayoutLMForTokenClassification* on the FUNSD dataset for information extraction from scanned documents | [Niels Rogge](https://github.com/nielsrogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLM/Fine_tuning_LayoutLMForTokenClassification_on_FUNSD.ipynb)|
+|[Fine-Tune DistilGPT2 and Generate Text](https://colab.research.google.com/github/tripathiaakash/DistilGPT2-Tutorial/blob/main/distilgpt2_fine_tuning.ipynb) | How to fine-tune DistilGPT2 and generate text | [Aakash Tripathi](https://github.com/tripathiaakash) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/tripathiaakash/DistilGPT2-Tutorial/blob/main/distilgpt2_fine_tuning.ipynb)|
+|[Fine-Tune LED on up to 8K tokens](https://github.com/patrickvonplaten/notebooks/blob/master/Fine_tune_Longformer_Encoder_Decoder_(LED)_for_Summarization_on_pubmed.ipynb) | How to fine-tune LED on pubmed for long-range summarization | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/Fine_tune_Longformer_Encoder_Decoder_(LED)_for_Summarization_on_pubmed.ipynb)|
+|[Evaluate LED on Arxiv](https://github.com/patrickvonplaten/notebooks/blob/master/LED_on_Arxiv.ipynb) | How to effectively evaluate LED on long-range summarization | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/LED_on_Arxiv.ipynb)|
+|[Fine-tune LayoutLM on RVL-CDIP (a document image classification dataset)](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLM/Fine_tuning_LayoutLMForSequenceClassification_on_RVL_CDIP.ipynb) | How to fine-tune *LayoutLMForSequenceClassification* on the RVL-CDIP dataset for scanned document classification | [Niels Rogge](https://github.com/nielsrogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLM/Fine_tuning_LayoutLMForSequenceClassification_on_RVL_CDIP.ipynb)|
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -26,7 +26,7 @@ author = u'huggingface'
 # The short X.Y version
 version = u''
 # The full version, including alpha/beta/rc tags
-release = u'4.2.0'
+release = u'4.3.0'
 # Prefix link to point to master, comment this during version release and uncomment below line
 extlinks = {'prefix_link': ('https://github.com/huggingface/transformers/blob/master/%s', '')}
 # Prefix link to always point to corresponding version, uncomment this during version release
--- a/docs/source/converting_tensorflow_models.rst
+++ b/docs/source/converting_tensorflow_models.rst
@@ -28,17 +28,13 @@ BERT
 You can convert any TensorFlow checkpoint for BERT (in particular `the pre-trained models released by Google
 <https://github.com/google-research/bert#pre-trained-models>`_\ ) in a PyTorch save file by using the
 :prefix_link:`convert_bert_original_tf_checkpoint_to_pytorch.py
-<src/transformers/convert_bert_original_tf_checkpoint_to_pytorch.py>` script.
+<src/transformers/models/bert/convert_bert_original_tf_checkpoint_to_pytorch.py>` script.

 This CLI takes as input a TensorFlow checkpoint (three files starting with ``bert_model.ckpt``\ ) and the associated
 configuration file (\ ``bert_config.json``\ ), and creates a PyTorch model for this configuration, loads the weights
 from the TensorFlow checkpoint in the PyTorch model and saves the resulting model in a standard PyTorch save file that
-can be imported using ``torch.load()`` (see examples in `run_bert_extract_features.py
-<https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_extract_features.py>`_\ ,
-`run_bert_classifier.py
-<https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_classifier.py>`_ and
-`run_bert_squad.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_squad.py>`_\
-).
+can be imported using ``from_pretrained()`` (see example in :doc:`quicktour` , `run_glue.py
+<https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_glue.py>`_\ ).

 You only need to run this conversion script **once** to get a PyTorch model. You can then disregard the TensorFlow
 checkpoint (the three files starting with ``bert_model.ckpt``\ ) but be sure to keep the configuration file (\
@@ -66,7 +62,7 @@ ALBERT

 Convert TensorFlow model checkpoints of ALBERT to PyTorch using the
 :prefix_link:`convert_albert_original_tf_checkpoint_to_pytorch.py
-<src/transformers/convert_bert_original_tf_checkpoint_to_pytorch.py>` script.
+<src/transformers/models/albert/convert_albert_original_tf_checkpoint_to_pytorch.py>` script.

 The CLI takes as input a TensorFlow checkpoint (three files starting with ``model.ckpt-best``\ ) and the accompanying
 configuration file (\ ``albert_config.json``\ ), then creates and saves a PyTorch model. To run this conversion you
@@ -168,3 +164,18 @@ Here is an example of the conversion process for a pre-trained XLM model:
     --pytorch_dump_output $PYTORCH_DUMP_OUTPUT
    [--config XML_CONFIG] \
    [--finetuning_task_name XML_FINETUNED_TASK]
+
+
+T5
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Here is an example of the conversion process for a pre-trained T5 model:
+
+.. code-block:: shell
+
+   export T5=/path/to/t5/uncased_L-12_H-768_A-12
+
+   transformers-cli convert --model_type t5 \
+     --tf_checkpoint $T5/t5_model.ckpt \
+     --config $T5/t5_config.json \
+     --pytorch_dump_output $T5/pytorch_model.bin
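+
+When the conversion has to be scripted, the same command can be assembled in Python. This is a minimal sketch that
+only reuses the flags and file names from the shell example above; the ``t5_convert_command`` helper is a hypothetical
+name, and actually running the command assumes that ``transformers-cli`` and TensorFlow are installed.
+

```python
import shlex
from pathlib import PurePosixPath


def t5_convert_command(checkpoint_dir):
    """Build the argument list for the T5 conversion command shown above."""
    root = PurePosixPath(checkpoint_dir)
    return [
        "transformers-cli", "convert",
        "--model_type", "t5",
        "--tf_checkpoint", str(root / "t5_model.ckpt"),
        "--config", str(root / "t5_config.json"),
        "--pytorch_dump_output", str(root / "pytorch_model.bin"),
    ]


cmd = t5_convert_command("/path/to/t5")
# Print the command as a single shell-quoted line for inspection.
print(shlex.join(cmd))
# To execute it instead, pass the list to subprocess.run(cmd, check=True).
```

+
+Building a list of arguments rather than one string avoids quoting problems when the checkpoint path contains spaces.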
--- a/docs/source/custom_datasets.rst
+++ b/docs/source/custom_datasets.rst
@@ -75,7 +75,7 @@ read this in.
    test_texts, test_labels = read_imdb_split('aclImdb/test')

 We now have a train and test dataset, but let's also create a validation set which we can use for evaluation
-and tuning without training our test set results. Sklearn has a convenient utility for creating such splits:
+and tuning without tainting our test set results. Sklearn has a convenient utility for creating such splits:

 .. code-block:: python

--- a/docs/source/glossary.rst
+++ b/docs/source/glossary.rst
@@ -24,11 +24,11 @@ General terms
 - MLM: masked language modeling, a pretraining task where the model sees a corrupted version of the texts, usually done
  by masking some tokens randomly, and has to predict the original text.
 - multimodal: a task that combines texts with another kind of inputs (for instance images).
- NLG: natural language generation, all tasks related to generating text ( for instance talk with transformers,
-  translation)
+- NLG: natural language generation, all tasks related to generating text (for instance talk with transformers,
+  translation).
 - NLP: natural language processing, a generic way to say "deal with texts".
 - NLU: natural language understanding, all tasks related to understanding what is in a text (for instance classifying
-  the whole text, individual words)
+  the whole text, individual words).
 - pretrained model: a model that has been pretrained on some data (for instance all of Wikipedia). Pretraining methods
  involve a self-supervised objective, which can be reading the text and trying to predict the next word (see CLM) or
  masking some words and trying to predict them (see MLM).
--- a/docs/source/imgs/transformers_overview.png
+++ b/docs/source/imgs/transformers_overview.png
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -103,100 +103,106 @@ and conversion utilities for the following models:
 7. :doc:`BlenderbotSmall <model_doc/blenderbot_small>` (from Facebook) released with the paper `Recipes for building an
   open-domain chatbot <https://arxiv.org/abs/2004.13637>`__ by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary
   Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
-8. :doc:`CamemBERT <model_doc/camembert>` (from Inria/Facebook/Sorbonne) released with the paper `CamemBERT: a Tasty
+8. :doc:`BORT <model_doc/bort>` (from Alexa) released with the paper `Optimal Subarchitecture Extraction For BERT
+   <https://arxiv.org/abs/2010.10499>`__ by Adrian de Wynter and Daniel J. Perry.
+9. :doc:`CamemBERT <model_doc/camembert>` (from Inria/Facebook/Sorbonne) released with the paper `CamemBERT: a Tasty
   French Language Model <https://arxiv.org/abs/1911.03894>`__ by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz
   Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
-9. :doc:`CTRL <model_doc/ctrl>` (from Salesforce) released with the paper `CTRL: A Conditional Transformer Language
-   Model for Controllable Generation <https://arxiv.org/abs/1909.05858>`__ by Nitish Shirish Keskar*, Bryan McCann*,
-   Lav R. Varshney, Caiming Xiong and Richard Socher.
-10. :doc:`DeBERTa <model_doc/deberta>` (from Microsoft Research) released with the paper `DeBERTa: Decoding-enhanced
+10. :doc:`ConvBERT <model_doc/convbert>` (from YituTech) released with the paper `ConvBERT: Improving BERT with
+    Span-based Dynamic Convolution <https://arxiv.org/abs/2008.02496>`__ by Zihang Jiang, Weihao Yu, Daquan Zhou,
+    Yunpeng Chen, Jiashi Feng, Shuicheng Yan.
+11. :doc:`CTRL <model_doc/ctrl>` (from Salesforce) released with the paper `CTRL: A Conditional Transformer Language
+    Model for Controllable Generation <https://arxiv.org/abs/1909.05858>`__ by Nitish Shirish Keskar*, Bryan McCann*,
+    Lav R. Varshney, Caiming Xiong and Richard Socher.
+12. :doc:`DeBERTa <model_doc/deberta>` (from Microsoft Research) released with the paper `DeBERTa: Decoding-enhanced
    BERT with Disentangled Attention <https://arxiv.org/abs/2006.03654>`__ by Pengcheng He, Xiaodong Liu, Jianfeng Gao,
    Weizhu Chen.
-11. :doc:`DialoGPT <model_doc/dialogpt>` (from Microsoft Research) released with the paper `DialoGPT: Large-Scale
+13. :doc:`DialoGPT <model_doc/dialogpt>` (from Microsoft Research) released with the paper `DialoGPT: Large-Scale
    Generative Pre-training for Conversational Response Generation <https://arxiv.org/abs/1911.00536>`__ by Yizhe
    Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
-12. :doc:`DistilBERT <model_doc/distilbert>` (from HuggingFace), released together with the paper `DistilBERT, a
+14. :doc:`DistilBERT <model_doc/distilbert>` (from HuggingFace), released together with the paper `DistilBERT, a
    distilled version of BERT: smaller, faster, cheaper and lighter <https://arxiv.org/abs/1910.01108>`__ by Victor
    Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into `DistilGPT2
    <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__, RoBERTa into `DistilRoBERTa
    <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__, Multilingual BERT into
    `DistilmBERT <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__ and a German
    version of DistilBERT.
-13. :doc:`DPR <model_doc/dpr>` (from Facebook) released with the paper `Dense Passage Retrieval for Open-Domain
+15. :doc:`DPR <model_doc/dpr>` (from Facebook) released with the paper `Dense Passage Retrieval for Open-Domain
    Question Answering <https://arxiv.org/abs/2004.04906>`__ by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick
    Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
-14. :doc:`ELECTRA <model_doc/electra>` (from Google Research/Stanford University) released with the paper `ELECTRA:
+16. :doc:`ELECTRA <model_doc/electra>` (from Google Research/Stanford University) released with the paper `ELECTRA:
    Pre-training text encoders as discriminators rather than generators <https://arxiv.org/abs/2003.10555>`__ by Kevin
    Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
-15. :doc:`FlauBERT <model_doc/flaubert>` (from CNRS) released with the paper `FlauBERT: Unsupervised Language Model
+17. :doc:`FlauBERT <model_doc/flaubert>` (from CNRS) released with the paper `FlauBERT: Unsupervised Language Model
    Pre-training for French <https://arxiv.org/abs/1912.05372>`__ by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne,
    Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab.
-16. :doc:`Funnel Transformer <model_doc/funnel>` (from CMU/Google Brain) released with the paper `Funnel-Transformer:
+18. :doc:`Funnel Transformer <model_doc/funnel>` (from CMU/Google Brain) released with the paper `Funnel-Transformer:
    Filtering out Sequential Redundancy for Efficient Language Processing <https://arxiv.org/abs/2006.03236>`__ by
    Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
-17. :doc:`GPT <model_doc/gpt>` (from OpenAI) released with the paper `Improving Language Understanding by Generative
+19. :doc:`GPT <model_doc/gpt>` (from OpenAI) released with the paper `Improving Language Understanding by Generative
    Pre-Training <https://blog.openai.com/language-unsupervised/>`__ by Alec Radford, Karthik Narasimhan, Tim Salimans
    and Ilya Sutskever.
-18. :doc:`GPT-2 <model_doc/gpt2>` (from OpenAI) released with the paper `Language Models are Unsupervised Multitask
+20. :doc:`GPT-2 <model_doc/gpt2>` (from OpenAI) released with the paper `Language Models are Unsupervised Multitask
    Learners <https://blog.openai.com/better-language-models/>`__ by Alec Radford*, Jeffrey Wu*, Rewon Child, David
    Luan, Dario Amodei** and Ilya Sutskever**.
-19. :doc:`LayoutLM <model_doc/layoutlm>` (from Microsoft Research Asia) released with the paper `LayoutLM: Pre-training
+21. :doc:`LayoutLM <model_doc/layoutlm>` (from Microsoft Research Asia) released with the paper `LayoutLM: Pre-training
    of Text and Layout for Document Image Understanding <https://arxiv.org/abs/1912.13318>`__ by Yiheng Xu, Minghao Li,
    Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
-20. :doc:`LED <model_doc/led>` (from AllenAI) released with the paper `Longformer: The Long-Document Transformer
+22. :doc:`LED <model_doc/led>` (from AllenAI) released with the paper `Longformer: The Long-Document Transformer
    <https://arxiv.org/abs/2004.05150>`__ by Iz Beltagy, Matthew E. Peters, Arman Cohan.
-21. :doc:`Longformer <model_doc/longformer>` (from AllenAI) released with the paper `Longformer: The Long-Document
+23. :doc:`Longformer <model_doc/longformer>` (from AllenAI) released with the paper `Longformer: The Long-Document
    Transformer <https://arxiv.org/abs/2004.05150>`__ by Iz Beltagy, Matthew E. Peters, Arman Cohan.
-22. :doc:`LXMERT <model_doc/lxmert>` (from UNC Chapel Hill) released with the paper `LXMERT: Learning Cross-Modality
+24. :doc:`LXMERT <model_doc/lxmert>` (from UNC Chapel Hill) released with the paper `LXMERT: Learning Cross-Modality
    Encoder Representations from Transformers for Open-Domain Question Answering <https://arxiv.org/abs/1908.07490>`__
    by Hao Tan and Mohit Bansal.
-23. :doc:`MarianMT <model_doc/marian>` Machine translation models trained using `OPUS <http://opus.nlpl.eu/>`__ data by
+25. :doc:`MarianMT <model_doc/marian>` Machine translation models trained using `OPUS <http://opus.nlpl.eu/>`__ data by
    Jörg Tiedemann. The `Marian Framework <https://marian-nmt.github.io/>`__ is being developed by the Microsoft
    Translator Team.
-24. :doc:`MBart <model_doc/mbart>` (from Facebook) released with the paper `Multilingual Denoising Pre-training for
+26. :doc:`MBart <model_doc/mbart>` (from Facebook) released with the paper `Multilingual Denoising Pre-training for
    Neural Machine Translation <https://arxiv.org/abs/2001.08210>`__ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li,
    Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
-25. :doc:`MPNet <model_doc/mpnet>` (from Microsoft Research) released with the paper `MPNet: Masked and Permuted
+27. :doc:`MPNet <model_doc/mpnet>` (from Microsoft Research) released with the paper `MPNet: Masked and Permuted
    Pre-training for Language Understanding <https://arxiv.org/abs/2004.09297>`__ by Kaitao Song, Xu Tan, Tao Qin,
    Jianfeng Lu, Tie-Yan Liu.
-26. :doc:`MT5 <model_doc/mt5>` (from Google AI) released with the paper `mT5: A massively multilingual pre-trained
+28. :doc:`MT5 <model_doc/mt5>` (from Google AI) released with the paper `mT5: A massively multilingual pre-trained
    text-to-text transformer <https://arxiv.org/abs/2010.11934>`__ by Linting Xue, Noah Constant, Adam Roberts, Mihir
    Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
-27. :doc:`Pegasus <model_doc/pegasus>` (from Google) released with the paper `PEGASUS: Pre-training with Extracted
+29. :doc:`Pegasus <model_doc/pegasus>` (from Google) released with the paper `PEGASUS: Pre-training with Extracted
    Gap-sentences for Abstractive Summarization <https://arxiv.org/abs/1912.08777>`__> by Jingqing Zhang, Yao Zhao,
    Mohammad Saleh and Peter J. Liu.
-28. :doc:`ProphetNet <model_doc/prophetnet>` (from Microsoft Research) released with the paper `ProphetNet: Predicting
+30. :doc:`ProphetNet <model_doc/prophetnet>` (from Microsoft Research) released with the paper `ProphetNet: Predicting
    Future N-gram for Sequence-to-Sequence Pre-training <https://arxiv.org/abs/2001.04063>`__ by Yu Yan, Weizhen Qi,
    Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
-29. :doc:`Reformer <model_doc/reformer>` (from Google Research) released with the paper `Reformer: The Efficient
+31. :doc:`Reformer <model_doc/reformer>` (from Google Research) released with the paper `Reformer: The Efficient
    Transformer <https://arxiv.org/abs/2001.04451>`__ by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
-30. :doc:`RoBERTa <model_doc/roberta>` (from Facebook), released together with the paper a `Robustly Optimized BERT
+32. :doc:`RoBERTa <model_doc/roberta>` (from Facebook), released together with the paper a `Robustly Optimized BERT
    Pretraining Approach <https://arxiv.org/abs/1907.11692>`__ by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar
-    Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. ultilingual BERT into `DistilmBERT
-    <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__ and a German version of
-    DistilBERT.
-31. :doc:`SqueezeBert <model_doc/squeezebert>` released with the paper `SqueezeBERT: What can computer vision teach NLP
+    Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
+33. :doc:`SqueezeBert <model_doc/squeezebert>` released with the paper `SqueezeBERT: What can computer vision teach NLP
    about efficient neural networks? <https://arxiv.org/abs/2006.11316>`__ by Forrest N. Iandola, Albert E. Shaw, Ravi
    Krishna, and Kurt W. Keutzer.
-32. :doc:`T5 <model_doc/t5>` (from Google AI) released with the paper `Exploring the Limits of Transfer Learning with a
+34. :doc:`T5 <model_doc/t5>` (from Google AI) released with the paper `Exploring the Limits of Transfer Learning with a
    Unified Text-to-Text Transformer <https://arxiv.org/abs/1910.10683>`__ by Colin Raffel and Noam Shazeer and Adam
    Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
-33. :doc:`TAPAS <model_doc/tapas>` (from Google AI) released with the paper `TAPAS: Weakly Supervised Table Parsing via
+35. :doc:`TAPAS <model_doc/tapas>` (from Google AI) released with the paper `TAPAS: Weakly Supervised Table Parsing via
    Pre-training <https://arxiv.org/abs/2004.02349>`__ by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller,
    Francesco Piccinno and Julian Martin Eisenschlos.
-34. :doc:`Transformer-XL <model_doc/transformerxl>` (from Google/CMU) released with the paper `Transformer-XL:
+36. :doc:`Transformer-XL <model_doc/transformerxl>` (from Google/CMU) released with the paper `Transformer-XL:
    Attentive Language Models Beyond a Fixed-Length Context <https://arxiv.org/abs/1901.02860>`__ by Zihang Dai*,
    Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
-35. :doc:`XLM <model_doc/xlm>` (from Facebook) released together with the paper `Cross-lingual Language Model
+37. :doc:`Wav2Vec2 <model_doc/wav2vec2>` (from Facebook AI) released with the paper `wav2vec 2.0: A Framework for
+    Self-Supervised Learning of Speech Representations <https://arxiv.org/abs/2006.11477>`__ by Alexei Baevski, Henry
+    Zhou, Abdelrahman Mohamed, Michael Auli.
+38. :doc:`XLM <model_doc/xlm>` (from Facebook) released together with the paper `Cross-lingual Language Model
    Pretraining <https://arxiv.org/abs/1901.07291>`__ by Guillaume Lample and Alexis Conneau.
-36. :doc:`XLM-ProphetNet <model_doc/xlmprophetnet>` (from Microsoft Research) released with the paper `ProphetNet:
+39. :doc:`XLM-ProphetNet <model_doc/xlmprophetnet>` (from Microsoft Research) released with the paper `ProphetNet:
    Predicting Future N-gram for Sequence-to-Sequence Pre-training <https://arxiv.org/abs/2001.04063>`__ by Yu Yan,
    Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
-37. :doc:`XLM-RoBERTa <model_doc/xlmroberta>` (from Facebook AI), released together with the paper `Unsupervised
+40. :doc:`XLM-RoBERTa <model_doc/xlmroberta>` (from Facebook AI), released together with the paper `Unsupervised
    Cross-lingual Representation Learning at Scale <https://arxiv.org/abs/1911.02116>`__ by Alexis Conneau*, Kartikay
    Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke
    Zettlemoyer and Veselin Stoyanov.
-38. :doc:`XLNet <model_doc/xlnet>` (from Google/CMU) released with the paper `XLNet: Generalized Autoregressive
+41. :doc:`XLNet <model_doc/xlnet>` (from Google/CMU) released with the paper `XLNet: Generalized Autoregressive
    Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`__ by Zhilin Yang*, Zihang Dai*, Yiming
    Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.

@@ -231,6 +237,8 @@ TensorFlow and/or Flax.
 +-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
 |          CamemBERT          |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
 +-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
+|          ConvBERT           |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
 |             DPR             |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
 +-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
 |           DeBERTa           |       ✅       |       ❌       |       ✅        |         ❌         |      ❌      |
@@ -285,6 +293,8 @@ TensorFlow and/or Flax.
 +-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
 |       Transformer-XL        |       ✅       |       ❌       |       ✅        |         ✅         |      ❌      |
 +-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
+|          Wav2Vec2           |       ✅       |       ❌       |       ✅        |         ❌         |      ❌      |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
 |             XLM             |       ✅       |       ❌       |       ✅        |         ✅         |      ❌      |
 +-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
 |         XLM-RoBERTa         |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
@@ -327,9 +337,11 @@ TensorFlow and/or Flax.
    examples
    custom_datasets
    notebooks
+    community
    converting_tensorflow_models
    migration
    contributing
+    add_new_model
    testing
    serialization

@@ -369,7 +381,9 @@ TensorFlow and/or Flax.
    model_doc/bertgeneration
    model_doc/blenderbot
    model_doc/blenderbot_small
+    model_doc/bort
    model_doc/camembert
+    model_doc/convbert
    model_doc/ctrl
    model_doc/deberta
    model_doc/dialogpt
@@ -403,6 +417,7 @@ TensorFlow and/or Flax.
    model_doc/t5
    model_doc/tapas
    model_doc/transformerxl
+    model_doc/wav2vec2
    model_doc/xlm
    model_doc/xlmprophetnet
    model_doc/xlmroberta
--- a/docs/source/installation.md
+++ b/docs/source/installation.md
@@ -19,7 +19,7 @@ limitations under the License.
 🤗 Transformers is tested on Python 3.6+, and PyTorch 1.1.0+ or TensorFlow 2.0+.

 You should install 🤗 Transformers in a [virtual environment](https://docs.python.org/3/library/venv.html). If you're
-unfamiliar with Python virtual environments, check out the [user guide](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/). Create a virtual environment with the version of Python you're going 
+unfamiliar with Python virtual environments, check out the [user guide](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/). Create a virtual environment with the version of Python you're going
 to use and activate it.

 Now, if you want to use 🤗 Transformers, you can install it with pip. If you'd like to play with the examples, you
@@ -28,8 +28,8 @@ must install it from source.
 ## Installation with pip

 First you need to install one of, or both, TensorFlow 2.0 and PyTorch.
-Please refer to [TensorFlow installation page](https://www.tensorflow.org/install/pip#tensorflow-2.0-rc-is-available), 
-[PyTorch installation page](https://pytorch.org/get-started/locally/#start-locally) and/or 
+Please refer to [TensorFlow installation page](https://www.tensorflow.org/install/pip#tensorflow-2.0-rc-is-available),
+[PyTorch installation page](https://pytorch.org/get-started/locally/#start-locally) and/or
 [Flax installation page](https://github.com/google/flax#quick-install)
 regarding the specific install command for your platform.

@@ -73,7 +73,27 @@ It should download a pretrained model then print something like

 ## Installing from source

-To install from source, clone the repository and install with the following commands:
+Here is how to quickly install `transformers` from source:
+
+```bash
+pip install git+https://github.com/huggingface/transformers
+```
+
+Note that this will install not the latest released version, but the bleeding edge `master` version, which you may want to use in case a bug has been fixed since the last official release and a new release hasn't been rolled out yet.
+
+While we strive to keep `master` operational at all times, if you notice some issues, they usually get fixed within a few hours or a day. You're more than welcome to help us detect any problems by opening an [Issue](https://github.com/huggingface/transformers/issues); this way, things will get fixed even sooner.
+
+Again, you can run:
+
+```bash
+python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('I hate you'))"
+```
+
+to check 🤗 Transformers is properly installed.
+
+## Editable install
+
+If you want to constantly use the bleeding edge `master` version of the source code, or if you want to contribute to the library and need to test the changes in the code you're making, you will need an editable install. This is done by cloning the repository and installing with the following commands:

 ``` bash
 git clone https://github.com/huggingface/transformers.git
@@ -81,13 +101,22 @@ cd transformers
 pip install -e .
 ```

-Again, you can run 
+This command links the folder you cloned the repository to with your Python library paths, so Python will look inside this folder in addition to the normal library-wide paths. So if normally your Python packages get installed into:
+```
+~/anaconda3/envs/main/lib/python3.7/site-packages/
+```
+now this editable install will reside where you clone the folder to, e.g. `~/transformers/` and python will search it too.

-```bash
-python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('I hate you'))"
+Do note that you have to keep that `transformers` folder around and not delete it to continue using the `transformers` library.
+
+Now, let's get to the real benefit of this installation approach. Say you saw that some new feature has just been committed into `master`. If you have already performed all the steps above, to update your transformers to include all the latest commits, all you need to do is to `cd` into that cloned repository folder and update the clone to the latest version:
+
+```
+cd ~/transformers/
+git pull
 ```

-to check 🤗 Transformers is properly installed.
+There is nothing else to do. Your python environment will find the bleeding edge version of `transformers` on the next run.


 ## With conda
@@ -100,7 +129,7 @@ Since Transformers version v4.0.0, we now have a conda channel: `huggingface`.
 conda install -c huggingface transformers
 ```

-Follow the installation pages of TensorFlow, PyTorch or Flax to see how to install them with conda. 
+Follow the installation pages of TensorFlow, PyTorch or Flax to see how to install them with conda.

 ## Caching models

@@ -109,7 +138,7 @@ This library provides pretrained models that will be downloaded and cached local
 folder given by the shell environment variable ``TRANSFORMERS_CACHE``. The default value for it will be the Hugging
 Face cache home followed by ``/transformers/``. This is (by order of priority):

-  * shell environment variable ``HF_HOME`` 
+  * shell environment variable ``HF_HOME``
  * shell environment variable ``XDG_CACHE_HOME`` + ``/huggingface/``
  * default: ``~/.cache/huggingface/``

@@ -130,7 +159,7 @@ faster, and cheaper. Feel free to contact us privately if you need any help.

 You should check out our [swift-coreml-transformers](https://github.com/huggingface/swift-coreml-transformers) repo.

-It contains a set of tools to convert PyTorch or TensorFlow 2.0 trained Transformer models (currently contains `GPT-2`, 
+It contains a set of tools to convert PyTorch or TensorFlow 2.0 trained Transformer models (currently contains `GPT-2`,
 `DistilGPT-2`, `BERT`, and `DistilBERT`) to CoreML models that run on iOS devices.

 At some point in the future, you'll be able to seamlessly move from pretraining or fine-tuning models in PyTorch or
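
The cache lookup order described in the `installation.md` changes above (`TRANSFORMERS_CACHE` first, then `HF_HOME`, then `XDG_CACHE_HOME` + `/huggingface/`, then `~/.cache/huggingface/`) can be sketched in Python. This is only an illustration of the documented priority, not the library's actual implementation; the helper name `resolve_transformers_cache` and its `env` parameter are hypothetical.

```python
import os


def resolve_transformers_cache(env=None):
    """Return the cache folder implied by the documented priority order.

    `env` is a mapping of environment variables; it defaults to os.environ.
    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env

    # An explicit TRANSFORMERS_CACHE always wins.
    explicit = env.get("TRANSFORMERS_CACHE")
    if explicit:
        return explicit

    # Otherwise work out the Hugging Face cache home, by order of priority.
    hf_home = env.get("HF_HOME")
    if not hf_home:
        xdg_cache_home = env.get("XDG_CACHE_HOME")
        if xdg_cache_home:
            hf_home = os.path.join(xdg_cache_home, "huggingface")
        else:
            hf_home = os.path.join(os.path.expanduser("~"), ".cache", "huggingface")

    # The default cache is the cache home followed by "transformers".
    return os.path.join(hf_home, "transformers")


if __name__ == "__main__":
    print(resolve_transformers_cache())
```

Passing a plain dictionary instead of the real environment makes it easy to check which variable takes precedence on a given machine.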
--- a/docs/source/main_classes/tokenizer.rst
+++ b/docs/source/main_classes/tokenizer.rst
@@ -56,6 +56,8 @@ PreTrainedTokenizer
    :special-members: __call__
    :members:

+    .. automethod:: encode
+

 PreTrainedTokenizerFast
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -64,6 +66,8 @@ PreTrainedTokenizerFast
    :special-members: __call__
    :members:

+    .. automethod:: encode
+

 BatchEncoding
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
--- a/docs/source/main_classes/trainer.rst
+++ b/docs/source/main_classes/trainer.rst
@@ -113,7 +113,125 @@ Toward Training Trillion Parameter Models, by Samyam Rajbhandari, Jeff Rasley, O

 This provided support is new and experimental as of this writing.

-You will need at least 2 GPUs to benefit from these features.
+Installation Notes
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+As of this writing, both FairScale and Deepspeed require compilation of CUDA C++ code before they can be used.
+
+While all installation issues should be dealt with through the corresponding GitHub Issues of `FairScale
+<https://github.com/facebookresearch/fairscale/issues>`__ and `Deepspeed
+<https://github.com/microsoft/DeepSpeed/issues>`__, there are a few common issues that one may encounter while building
+any PyTorch extension that needs to build CUDA extensions.
+
+Therefore, if you encounter a CUDA-related build issue while doing one of the following or both:
+
+.. code-block:: bash
+
+    pip install fairscale
+    pip install deepspeed
+
+please, read the following notes first.
+
+In these notes we give examples for what to do when ``pytorch`` has been built with CUDA ``10.2``. If your situation is
+different remember to adjust the version number to the one you are after.
+
+**Possible problem #1:**
+
+While PyTorch comes with its own CUDA toolkit, to build these two projects you must have an identical version of CUDA
+installed system-wide.
+
+For example, if you installed ``pytorch`` with ``cudatoolkit==10.2`` in the Python environment, you also need to have
+CUDA ``10.2`` installed system-wide.
+
+The exact location may vary from system to system, but ``/usr/local/cuda-10.2`` is the most common location on many
+Unix systems. When CUDA is correctly set up and added to the ``PATH`` environment variable, one can find the
+installation location by doing:
+
+.. code-block:: bash
+
+    which nvcc
+
+If you don't have CUDA installed system-wide, install it first. You will find the instructions by using your favorite
+search engine. For example, if you're on Ubuntu you may want to search for: `ubuntu cuda 10.2 install
+<https://www.google.com/search?q=ubuntu+cuda+10.2+install>`__.
+
+**Possible problem #2:**
+
+Another possible common problem is that you may have more than one CUDA toolkit installed system-wide. For example you
+may have:
+
+.. code-block:: bash
+
+    /usr/local/cuda-10.2
+    /usr/local/cuda-11.0
+
+Now, in this situation you need to make sure that your ``PATH`` and ``LD_LIBRARY_PATH`` environment variables contain
+the correct paths to the desired CUDA version. Typically, package installers will set these to point to whichever
+version was installed last. If the package build fails because it can't find the right CUDA version despite you
+having it installed system-wide, it means that you need to adjust the 2 aforementioned environment variables.
+
+First, you may look at their contents:
+
+.. code-block:: bash
+
+    echo $PATH
+    echo $LD_LIBRARY_PATH
+
+so you get an idea of what is inside.
+
+It's possible that ``LD_LIBRARY_PATH`` is empty.
+
+``PATH`` lists the locations where executables can be found and ``LD_LIBRARY_PATH`` lists where shared libraries
+are to be looked for. In both cases, earlier entries have priority over the later ones. ``:`` is used to separate
+multiple entries.
+
+Now, to tell the build program where to find the specific CUDA toolkit, insert the desired paths to be listed first by
+doing:
+
+.. code-block:: bash
+
+    export PATH=/usr/local/cuda-10.2/bin:$PATH
+    export LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64:$LD_LIBRARY_PATH
+
+Note that we aren't overwriting the existing values, but prepending instead.
+
+Of course, adjust the version number and the full path if need be. Check that the directories you assign actually do
+exist. The ``lib64`` sub-directory is where the various CUDA ``.so`` objects, like ``libcudart.so``, reside. It's
+unlikely that your system will have it named differently, but if it does, adjust the path to reflect your setup.
+
+
+**Possible problem #3:**
+
+Some older CUDA versions may refuse to build with newer compilers. For example, you may have ``gcc-9`` but it wants
+``gcc-7``.
+
+There are various ways to go about it.
+
+If you can install the latest CUDA toolkit it typically should support the newer compiler.
+
+Alternatively, you could install the lower version of the compiler in addition to the one you already have, or you may
+already have it but it's not the default one, so the build system can't see it. If you have ``gcc-7`` installed but the
+build system complains it can't find it, the following might do the trick:
+
+.. code-block:: bash
+
+    sudo ln -s /usr/bin/gcc-7  /usr/local/cuda-10.2/bin/gcc
+    sudo ln -s /usr/bin/g++-7  /usr/local/cuda-10.2/bin/g++
+
+
+Here, we are making a symlink to ``gcc-7`` from ``/usr/local/cuda-10.2/bin/gcc`` and since
+``/usr/local/cuda-10.2/bin/`` should be in the ``PATH`` environment variable (see the previous problem's solution), it
+should find ``gcc-7`` (and ``g++-7``) and then the build will succeed.
+
+As always make sure to edit the paths in the example to match your situation.
+
+**If still unsuccessful:**
+
+If after addressing these you still encounter build issues, please open an issue on the GitHub tracker of `FairScale
+<https://github.com/facebookresearch/fairscale/issues>`__ or `Deepspeed
+<https://github.com/microsoft/DeepSpeed/issues>`__, depending on the project you have the problem with.
+

 FairScale
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -124,6 +242,8 @@ provides support for the following features from `the ZeRO paper <https://arxiv.
 1. Optimizer State Sharding
 2. Gradient Sharding

+You will need at least two GPUs to use this feature.
+
 To deploy this feature:

 1. Install the library via pypi:
@@ -132,7 +252,7 @@ To deploy this feature:

       pip install fairscale

-   or find more details on `the FairScale's github page
+   or find more details on `the FairScale's GitHub page
   <https://github.com/facebookresearch/fairscale/#installation>`__.

 2. Add ``--sharded_ddp`` to the command line arguments, and make sure you have added the distributed launcher ``-m
@@ -164,7 +284,6 @@ Notes:
 DeepSpeed
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-
 `DeepSpeed <https://github.com/microsoft/DeepSpeed>`__ implements everything described in the `ZeRO paper
 <https://arxiv.org/abs/1910.02054>`__, except ZeRO's stage 3. "Parameter Partitioning (Pos+g+p)". Currently it provides
 full support for:
@@ -172,58 +291,119 @@ full support for:
 1. Optimizer State Partitioning (ZeRO stage 1)
 2. Add Gradient Partitioning (ZeRO stage 2)

-To deploy this feature:
+Installation
+=======================================================================================================================

-1. Install the library via pypi:
+Install the library via pypi:

-   .. code-block:: bash
+.. code-block:: bash

-       pip install deepspeed
+    pip install deepspeed

-   or find more details on `the DeepSpeed's github page <https://github.com/microsoft/deepspeed#installation>`__.
+or find more details on `the DeepSpeed's GitHub page <https://github.com/microsoft/deepspeed#installation>`__.

-2. Adjust the :class:`~transformers.Trainer` command line arguments as following:
+Deployment with multiple GPUs
+=======================================================================================================================

-   1. replace ``python -m torch.distributed.launch`` with ``deepspeed``.
-   2. add a new argument ``--deepspeed ds_config.json``, where ``ds_config.json`` is the DeepSpeed configuration file
-      as documented `here <https://www.deepspeed.ai/docs/config-json/>`__. The file naming is up to you.
+To deploy this feature with multiple GPUs, adjust the :class:`~transformers.Trainer` command line arguments as
+follows:

-   Therefore, if your original command line looked as following:
+1. replace ``python -m torch.distributed.launch`` with ``deepspeed``.
+2. add a new argument ``--deepspeed ds_config.json``, where ``ds_config.json`` is the DeepSpeed configuration file as
+   documented `here <https://www.deepspeed.ai/docs/config-json/>`__. The file naming is up to you.

-   .. code-block:: bash
+Therefore, if your original command line looked like this:

-       python -m torch.distributed.launch --nproc_per_node=2 your_program.py <normal cl args>
+.. code-block:: bash

-   Now it should be:
+    python -m torch.distributed.launch --nproc_per_node=2 your_program.py <normal cl args>

-   .. code-block:: bash
+Now it should be:

-       deepspeed --num_gpus=2 your_program.py <normal cl args> --deepspeed ds_config.json
+.. code-block:: bash

-   Unlike, ``torch.distributed.launch`` where you have to specify how many GPUs to use with ``--nproc_per_node``, with
-   the ``deepspeed`` launcher you don't have to use the corresponding ``--num_gpus`` if you want all of your GPUs used.
-   The full details on how to configure various nodes and GPUs can be found `here
-   <https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node>`__.
+    deepspeed --num_gpus=2 your_program.py <normal cl args> --deepspeed ds_config.json

-   Here is an example of running ``finetune_trainer.py`` under DeepSpeed deploying all available GPUs:
+Unlike ``torch.distributed.launch``, where you have to specify how many GPUs to use with ``--nproc_per_node``, with the
+``deepspeed`` launcher you don't have to use the corresponding ``--num_gpus`` if you want all of your GPUs used. The
+full details on how to configure various nodes and GPUs can be found `here
+<https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node>`__.

-   .. code-block:: bash
+Here is an example of running ``finetune_trainer.py`` under DeepSpeed deploying all available GPUs:

-       cd examples/seq2seq
-       deepspeed ./finetune_trainer.py --deepspeed ds_config.json \
-       --model_name_or_path sshleifer/distill-mbart-en-ro-12-4 --data_dir wmt_en_ro \
-       --output_dir output_dir --overwrite_output_dir \
-       --do_train --n_train 500 --num_train_epochs 1 \
-       --per_device_train_batch_size 1  --freeze_embeds \
-       --src_lang en_XX --tgt_lang ro_RO --task translation
+.. code-block:: bash

-   Note that in the DeepSpeed documentation you are likely to see ``--deepspeed --deepspeed_config ds_config.json`` -
-   i.e. two DeepSpeed-related arguments, but for the sake of simplicity, and since there are already so many arguments
-   to deal with, we combined the two into a single argument.
+    cd examples/seq2seq
+    deepspeed ./finetune_trainer.py --deepspeed ds_config.json \
+    --model_name_or_path sshleifer/distill-mbart-en-ro-12-4 --data_dir wmt_en_ro \
+    --output_dir output_dir --overwrite_output_dir \
+    --do_train --n_train 500 --num_train_epochs 1 \
+    --per_device_train_batch_size 1  --freeze_embeds \
+    --src_lang en_XX --tgt_lang ro_RO --task translation

-Before you can deploy DeepSpeed, let's discuss its configuration.
+Note that in the DeepSpeed documentation you are likely to see ``--deepspeed --deepspeed_config ds_config.json`` - i.e.
+two DeepSpeed-related arguments, but for the sake of simplicity, and since there are already so many arguments to deal
+with, we combined the two into a single argument.

-**Configuration:**
+For some practical usage examples, please, see this `post
+<https://github.com/huggingface/transformers/issues/8771#issuecomment-759248400>`__.
+
+
+
+Deployment with one GPU
+=======================================================================================================================
+
+To deploy DeepSpeed with one GPU, adjust the :class:`~transformers.Trainer` command line arguments as follows:
+
+.. code-block:: bash
+
+    cd examples/seq2seq
+    deepspeed --num_gpus=1 ./finetune_trainer.py --deepspeed ds_config.json \
+    --model_name_or_path sshleifer/distill-mbart-en-ro-12-4 --data_dir wmt_en_ro \
+    --output_dir output_dir --overwrite_output_dir \
+    --do_train --n_train 500 --num_train_epochs 1 \
+    --per_device_train_batch_size 1  --freeze_embeds \
+    --src_lang en_XX --tgt_lang ro_RO --task translation
+
+This is almost the same as with multiple GPUs, but here we tell DeepSpeed explicitly to use just one GPU. By default,
+DeepSpeed deploys all GPUs it can see. If you have only 1 GPU to start with, then you don't need this argument. The
+following `documentation <https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node>`__ discusses the
+launcher options.
+
+Why would you want to use DeepSpeed with just one GPU?
+
+1. It has a ZeRO-offload feature which can delegate some computations and memory to the host's CPU and RAM, and thus
+   leave more GPU resources for the model's needs - e.g. a larger batch size, or fitting a very big model which
+   normally wouldn't fit.
+2. It provides a smart GPU memory management system that minimizes memory fragmentation, which again allows you to fit
+   bigger models and data batches.
+
+While we are going to discuss the configuration in detail next, the key to getting a huge improvement on a single GPU
+with DeepSpeed is to have at least the following configuration in the configuration file:
+
+.. code-block:: json
+
+  {
+    "zero_optimization": {
+       "stage": 2,
+       "allgather_partitions": true,
+       "allgather_bucket_size": 2e8,
+       "reduce_scatter": true,
+       "reduce_bucket_size": 2e8,
+       "overlap_comm": true,
+       "contiguous_gradients": true,
+       "cpu_offload": true
+    }
+  }
+
+which enables ``cpu_offload`` and some other important features. You may experiment with the buffer sizes; you will
+find more details in the discussion below.
+
+For a practical usage example of this type of deployment, please, see this `post
+<https://github.com/huggingface/transformers/issues/8771#issuecomment-759176685>`__.
+
+Configuration
+=======================================================================================================================

 For the complete guide to the DeepSpeed configuration options that can be used in its configuration file please refer
 to the `following documentation <https://www.deepspeed.ai/docs/config-json/>`__.
@@ -314,7 +494,8 @@ to achieve the same configuration as provided by the longer json file in the fir
 When you execute the program, DeepSpeed will log the configuration it received from the :class:`~transformers.Trainer`
 to the console, so you can see exactly what the final configuration was passed to it.

-**Shared Configuration:**
+Shared Configuration
+=======================================================================================================================

 Some configuration information is required by both the :class:`~transformers.Trainer` and DeepSpeed to function
 correctly, therefore, to prevent conflicting definitions, which could lead to hard to detect errors, we chose to
@@ -338,7 +519,8 @@ Of course, you will need to adjust the values in this example to your situation.



-**ZeRO:**
+ZeRO
+=======================================================================================================================

 The ``zero_optimization`` section of the configuration file is the most important part (`docs
 <https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training>`__), since that is where you define
@@ -372,7 +554,8 @@ no equivalent command line arguments.



-**Optimizer:**
+Optimizer
+=======================================================================================================================


 DeepSpeed's main optimizers are Adam, OneBitAdam, and Lamb. These have been thoroughly tested with ZeRO and are thus
@@ -407,7 +590,8 @@ If you want to use one of the officially supported optimizers, configure them ex
 make sure to adjust the values. e.g. if use Adam you will want ``weight_decay`` around ``0.01``.


-**Scheduler:**
+Scheduler
+=======================================================================================================================

 DeepSpeed supports LRRangeTest, OneCycle, WarmupLR and WarmupDecayLR LR schedulers. The full documentation is `here
 <https://www.deepspeed.ai/docs/config-json/#scheduler-parameters>`__.
@@ -456,7 +640,8 @@ Here is an example of the pre-configured ``scheduler`` entry for WarmupLR (``con
         }
    }

-**Automatic Mixed Precision:**
+Automatic Mixed Precision
+=======================================================================================================================

 You can work with FP16 in one of the following ways:

@@ -464,7 +649,7 @@ You can work with FP16 in one of the following ways:
 2. NVIDIA's apex, as documented `here
   <https://www.deepspeed.ai/docs/config-json/#automatic-mixed-precision-amp-training-options>`__.

-If you want to use an equivalent of the pytorch native amp, you can either configure the ``fp16`` entry in the
+If you want to use an equivalent of the PyTorch native amp, you can either configure the ``fp16`` entry in the
 configuration file, or use the following command line arguments: ``--fp16 --fp16_backend amp``.

 Here is an example of the ``fp16`` configuration:
@@ -497,7 +682,8 @@ Here is an example of the ``amp`` configuration:



-**Gradient Clipping:**
+Gradient Clipping
+=======================================================================================================================

 If you don't configure the ``gradient_clipping`` entry in the configuration file, the :class:`~transformers.Trainer`
 will use the value of the ``--max_grad_norm`` command line argument to set it.
@@ -512,7 +698,8 @@ Here is an example of the ``gradient_clipping`` configuration:



-**Notes:**
+Notes
+=======================================================================================================================

 * DeepSpeed works with the PyTorch :class:`~transformers.Trainer` but not TF :class:`~transformers.TFTrainer`.
 * While DeepSpeed has a pip installable PyPI package, it is highly recommended that it gets installed from `source
@@ -522,12 +709,14 @@ Here is an example of the ``gradient_clipping`` configuration:
  use any model with your own trainer, and you will have to adapt the latter according to `the DeepSpeed integration
  instructions <https://www.deepspeed.ai/getting-started/#writing-deepspeed-models>`__.

-**Main DeepSpeed Resources:**
+Main DeepSpeed Resources
+=======================================================================================================================

- `github <https://github.com/microsoft/deepspeed>`__
+- `Project's GitHub <https://github.com/microsoft/deepspeed>`__
 - `Usage docs <https://www.deepspeed.ai/getting-started/>`__
 - `API docs <https://deepspeed.readthedocs.io/en/latest/index.html>`__
+- `Blog posts <https://www.microsoft.com/en-us/research/search/?q=deepspeed>`__

 Finally, please, remember that, HuggingFace :class:`~transformers.Trainer` only integrates DeepSpeed, therefore if you
-have any problems or questions with regards to DeepSpeed usage, please, file an issue with `DeepSpeed github
+have any problems or questions with regards to DeepSpeed usage, please, file an issue with `DeepSpeed GitHub
 <https://github.com/microsoft/DeepSpeed/issues>`__.
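The single-GPU ``zero_optimization`` block added in the DeepSpeed documentation above can also be generated from Python rather than written by hand. The following is only a minimal sketch: the helper name ``build_single_gpu_zero_config`` and the integer bucket size are assumptions for illustration, not part of the Trainer or DeepSpeed API. Serializing with the standard ``json`` module guarantees the result is syntactically valid JSON.

```python
import json


def build_single_gpu_zero_config(bucket_size=int(2e8)):
    """Return the minimal ZeRO stage 2 + CPU offload settings as a dict.

    Hypothetical helper: the keys mirror the documentation example, the
    function itself is not provided by transformers or DeepSpeed.
    """
    return {
        "zero_optimization": {
            "stage": 2,
            "allgather_partitions": True,
            "allgather_bucket_size": bucket_size,
            "reduce_scatter": True,
            "reduce_bucket_size": bucket_size,
            "overlap_comm": True,
            "contiguous_gradients": True,
            "cpu_offload": True,
        }
    }


# Serialize to text; this string could be written to a configuration file.
config_text = json.dumps(build_single_gpu_zero_config(), indent=2)
print(config_text)
```

The bucket sizes are exposed as a parameter because the documentation suggests experimenting with the buffer sizes.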
--- a/docs/source/model_doc/bart.rst
+++ b/docs/source/model_doc/bart.rst
@@ -130,6 +130,12 @@ BartForQuestionAnswering
 .. autoclass:: transformers.BartForQuestionAnswering
    :members: forward

+BartForCausalLM
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.BartForCausalLM
+    :members: forward
+


 TFBartModel
--- a/docs/source/model_doc/blenderbot.rst
+++ b/docs/source/model_doc/blenderbot.rst
@@ -98,6 +98,13 @@ See :obj:`transformers.BartForConditionalGeneration` for arguments to `forward`
    :members: forward


+BlenderbotForCausalLM
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.BlenderbotForCausalLM
+    :members: forward
+
+
 TFBlenderbotModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

--- a/docs/source/model_doc/blenderbot_small.rst
+++ b/docs/source/model_doc/blenderbot_small.rst
@@ -70,6 +70,13 @@ BlenderbotSmallForConditionalGeneration
    :members: forward


+BlenderbotSmallForCausalLM
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.BlenderbotSmallForCausalLM
+    :members: forward
+
+
 TFBlenderbotSmallModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

--- a/docs/source/model_doc/bort.rst
+++ b/docs/source/model_doc/bort.rst
@@ -0,0 +1,46 @@
+.. 
+    Copyright 2020 The HuggingFace Team. All rights reserved.
+
+    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+    the License. You may obtain a copy of the License at
+
+        http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+    specific language governing permissions and limitations under the License.
+
+BORT
+-----------------------------------------------------------------------------------------------------------------------
+
+Overview
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The BORT model was proposed in `Optimal Subarchitecture Extraction for BERT <https://arxiv.org/abs/2010.10499>`__ by
+Adrian de Wynter and Daniel J. Perry. It is an optimal subset of architectural parameters for BERT, which the
+authors refer to as "Bort".
+
+The abstract from the paper is the following:
+
+*We extract an optimal subset of architectural parameters for the BERT architecture from Devlin et al. (2018) by
+applying recent breakthroughs in algorithms for neural architecture search. This optimal subset, which we refer to as
+"Bort", is demonstrably smaller, having an effective (that is, not counting the embedding layer) size of 5.5% the
+original BERT-large architecture, and 16% of the net size. Bort is also able to be pretrained in 288 GPU hours, which
+is 1.2% of the time required to pretrain the highest-performing BERT parametric architectural variant, RoBERTa-large
+(Liu et al., 2019), and about 33% of that of the world-record, in GPU hours, required to train BERT-large on the same
+hardware. It is also 7.9x faster on a CPU, as well as being better performing than other compressed variants of the
+architecture, and some of the non-compressed variants: it obtains performance improvements of between 0.3% and 31%,
+absolute, with respect to BERT-large, on multiple public natural language understanding (NLU) benchmarks.*
+
+Tips:
+
+- BORT's model architecture is based on BERT, so one can refer to :doc:`BERT's documentation page <bert>` for the
+  model's API as well as usage examples.
+- BORT uses the RoBERTa tokenizer instead of the BERT tokenizer, so one can refer to :doc:`RoBERTa's documentation page
+  <roberta>` for the tokenizer's API as well as usage examples.
+- BORT requires a specific fine-tuning algorithm, called `Agora
+  <https://adewynter.github.io/notes/bort_algorithms_and_applications.html#fine-tuning-with-algebraic-topology>`__ ,
+  that is sadly not open-sourced yet. It would be very useful for the community if someone implemented the
+  algorithm to make BORT fine-tuning work.
+
+The original code can be found `here <https://github.com/alexa/bort/>`__.
--- a/docs/source/model_doc/convbert.rst
+++ b/docs/source/model_doc/convbert.rst
@@ -0,0 +1,144 @@
+.. 
+    Copyright 2020 The HuggingFace Team. All rights reserved.
+
+    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+    the License. You may obtain a copy of the License at
+
+        http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+    specific language governing permissions and limitations under the License.
+
+ConvBERT
+-----------------------------------------------------------------------------------------------------------------------
+
+Overview
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ConvBERT model was proposed in `ConvBERT: Improving BERT with Span-based Dynamic Convolution
+<https://arxiv.org/abs/2008.02496>`__ by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng
+Yan.
+
+The abstract from the paper is the following:
+
+*Pre-trained language models like BERT and its variants have recently achieved impressive performance in various
+natural language understanding tasks. However, BERT heavily relies on the global self-attention block and thus suffers
+large memory footprint and computation cost. Although all its attention heads query on the whole input sequence for
+generating the attention map from a global perspective, we observe some heads only need to learn local dependencies,
+which means the existence of computation redundancy. We therefore propose a novel span-based dynamic convolution to
+replace these self-attention heads to directly model local dependencies. The novel convolution heads, together with the
+rest self-attention heads, form a new mixed attention block that is more efficient at both global and local context
+learning. We equip BERT with this mixed attention design and build a ConvBERT model. Experiments have shown that
+ConvBERT significantly outperforms BERT and its variants in various downstream tasks, with lower training cost and
+fewer model parameters. Remarkably, ConvBERTbase model achieves 86.4 GLUE score, 0.7 higher than ELECTRAbase, while
+using less than 1/4 training cost. Code and pre-trained models will be released.*
+
+ConvBERT training tips are similar to those of BERT. The original implementation can be found here:
+https://github.com/yitu-opensource/ConvBert
+
+ConvBertConfig
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.ConvBertConfig
+    :members:
+
+
+ConvBertTokenizer
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.ConvBertTokenizer
+    :members: build_inputs_with_special_tokens, get_special_tokens_mask,
+        create_token_type_ids_from_sequences, save_vocabulary
+
+
+ConvBertTokenizerFast
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.ConvBertTokenizerFast
+    :members: build_inputs_with_special_tokens, get_special_tokens_mask,
+        create_token_type_ids_from_sequences, save_vocabulary
+
+
+ConvBertModel
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.ConvBertModel
+    :members: forward
+
+
+ConvBertForMaskedLM
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.ConvBertForMaskedLM
+    :members: forward
+
+
+ConvBertForSequenceClassification
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.ConvBertForSequenceClassification
+    :members: forward
+
+
+ConvBertForMultipleChoice
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.ConvBertForMultipleChoice
+    :members: forward
+
+
+ConvBertForTokenClassification
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.ConvBertForTokenClassification
+    :members: forward
+
+
+ConvBertForQuestionAnswering
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.ConvBertForQuestionAnswering
+    :members: forward
+
+
+TFConvBertModel
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.TFConvBertModel
+    :members: call
+
+
+TFConvBertForMaskedLM
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.TFConvBertForMaskedLM
+    :members: call
+
+
+TFConvBertForSequenceClassification
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.TFConvBertForSequenceClassification
+    :members: call
+
+
+TFConvBertForMultipleChoice
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.TFConvBertForMultipleChoice
+    :members: call
+
+
+TFConvBertForTokenClassification
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.TFConvBertForTokenClassification
+    :members: call
+
+
+TFConvBertForQuestionAnswering
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.TFConvBertForQuestionAnswering
+    :members: call
--- a/docs/source/model_doc/deberta.rst
+++ b/docs/source/model_doc/deberta.rst
@@ -70,8 +70,29 @@ DebertaPreTrainedModel
    :members:


+DebertaForMaskedLM
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.DebertaForMaskedLM
+    :members:
+
+
 DebertaForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: transformers.DebertaForSequenceClassification
    :members:
+
+
+DebertaForTokenClassification
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.DebertaForTokenClassification
+    :members:
+
+
+DebertaForQuestionAnswering
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.DebertaForQuestionAnswering
+    :members:
--- a/docs/source/model_doc/dialogpt.rst
+++ b/docs/source/model_doc/dialogpt.rst
@@ -48,7 +48,6 @@ modeling. We first concatenate all dialog turns within a dialogue session into a
 sequence length), ended by the end-of-text token.* For more information please confer to the original paper.


-DialoGPT's architecture is based on the GPT2 model, so one can refer to GPT2's `docstring
-<https://huggingface.co/transformers/model_doc/gpt2.html>`_.
+DialoGPT's architecture is based on the GPT2 model, so one can refer to :doc:`GPT2's documentation page <gpt2>`.

 The original code can be found `here <https://github.com/microsoft/DialoGPT>`_.
--- a/docs/source/model_doc/marian.rst
+++ b/docs/source/model_doc/marian.rst
@@ -193,6 +193,13 @@ MarianMTModel
    :members: forward


+MarianForCausalLM
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.MarianForCausalLM
+    :members: forward
+
+
 TFMarianModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

--- a/docs/source/model_doc/mbart.rst
+++ b/docs/source/model_doc/mbart.rst
@@ -124,6 +124,13 @@ MBartForSequenceClassification
 .. autoclass:: transformers.MBartForSequenceClassification


+MBartForCausalLM
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.MBartForCausalLM
+    :members: forward
+
+
 TFMBartModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

--- a/docs/source/model_doc/pegasus.rst
+++ b/docs/source/model_doc/pegasus.rst
@@ -131,6 +131,13 @@ PegasusForConditionalGeneration
    :members: forward


+PegasusForCausalLM
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.PegasusForCausalLM
+    :members: forward
+
+
 TFPegasusModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

--- a/docs/source/model_doc/wav2vec2.rst
+++ b/docs/source/model_doc/wav2vec2.rst
@@ -0,0 +1,65 @@
+.. 
+    Copyright 2021 The HuggingFace Team. All rights reserved.
+
+    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+    the License. You may obtain a copy of the License at
+
+        http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+    specific language governing permissions and limitations under the License.
+
+Wav2Vec2
+-----------------------------------------------------------------------------------------------------------------------
+
+Overview
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The Wav2Vec2 model was proposed in `wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
+<https://arxiv.org/abs/2006.11477>`__ by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
+
+The abstract from the paper is the following:
+
+*We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on
+transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks
+the speech input in the latent space and solves a contrastive task defined over a quantization of the latent
+representations which are jointly learned. Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the
+clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state
+of the art on the 100 hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and
+pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech
+recognition with limited amounts of labeled data.*
+
+Tips:
+
+- Wav2Vec2 is a speech model that accepts a float array corresponding to the raw waveform of the speech signal.
+- The Wav2Vec2 model was trained using connectionist temporal classification (CTC), so the model output has to be
+  decoded using :class:`~transformers.Wav2Vec2Tokenizer`.
+
+
+Wav2Vec2Config
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.Wav2Vec2Config
+    :members:
+
+
+Wav2Vec2Tokenizer
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.Wav2Vec2Tokenizer
+    :members: __call__, save_vocabulary
+
+
+Wav2Vec2Model
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.Wav2Vec2Model
+    :members: forward
+
+
+Wav2Vec2ForCTC
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.Wav2Vec2ForCTC
+    :members: forward
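The Wav2Vec2 tips above note that CTC output has to be decoded. As a rough illustration of what greedy CTC decoding involves, the sketch below collapses repeated frame predictions and drops the blank symbol. It is a toy under stated assumptions, not the actual :class:`~transformers.Wav2Vec2Tokenizer` implementation: the function name, the five-entry vocabulary, and the choice of id ``0`` as the blank are all hypothetical.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse consecutive repeats, then remove blanks (greedy CTC decoding).

    frame_ids: per-frame argmax token ids (assumed already computed).
    id_to_char: hypothetical mapping from token id to character.
    blank_id: id of the CTC blank symbol (assumed to be 0 here).
    """
    chars = []
    previous = None
    for token_id in frame_ids:
        # Emit a character only when the prediction changes and is not blank.
        if token_id != previous and token_id != blank_id:
            chars.append(id_to_char[token_id])
        previous = token_id
    return "".join(chars)


# Toy vocabulary and frame predictions, invented for this example.
vocab = {0: "<pad>", 1: "h", 2: "e", 3: "l", 4: "o"}
frames = [1, 1, 0, 2, 2, 3, 0, 3, 3, 4, 0]
decoded = ctc_greedy_decode(frames, vocab)
print(decoded)
```

The blank between the two runs of ``3`` is what lets the double "l" survive the repeat-collapsing step.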
--- a/docs/source/model_sharing.rst
+++ b/docs/source/model_sharing.rst
@@ -60,7 +60,7 @@ Basic steps
 In order to upload a model, you'll need to first create a git repo. This repo will live on the model hub, allowing
 users to clone it and you (and your organization members) to push to it.

-You can create a model repo **directly from `the /new page on the website <https://huggingface.co/new>`__.**
+You can create a model repo directly from `the /new page on the website <https://huggingface.co/new>`__.

 Alternatively, you can use the ``transformers-cli``. The next steps describe that process:

--- a/docs/source/model_summary.rst
+++ b/docs/source/model_summary.rst
@@ -330,6 +330,36 @@ the same probabilities as the larger model. The actual objective is a combinatio
 The library provides a version of the model for masked language modeling, token classification, sentence classification
 and question answering.

+ConvBERT
+-----------------------------------------------------------------------------------------------------------------------
+
+.. raw:: html
+
+   <a href="https://huggingface.co/models?filter=convbert">
+       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-convbert-blueviolet">
+   </a>
+   <a href="model_doc/convbert.html">
+       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-convbert-blueviolet">
+   </a>
+
+`ConvBERT: Improving BERT with Span-based Dynamic Convolution <https://arxiv.org/abs/2008.02496>`_, Zihang Jiang,
+Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan.
+
+Pre-trained language models like BERT and its variants have recently achieved impressive performance in various natural
+language understanding tasks. However, BERT heavily relies on the global self-attention block and thus suffers large
+memory footprint and computation cost. Although all its attention heads query on the whole input sequence for
+generating the attention map from a global perspective, we observe some heads only need to learn local dependencies,
+which means the existence of computation redundancy. We therefore propose a novel span-based dynamic convolution to
+replace these self-attention heads to directly model local dependencies. The novel convolution heads, together with the
+rest self-attention heads, form a new mixed attention block that is more efficient at both global and local context
+learning. We equip BERT with this mixed attention design and build a ConvBERT model. Experiments have shown that
+ConvBERT significantly outperforms BERT and its variants in various downstream tasks, with lower training cost and
+fewer model parameters. Remarkably, ConvBERTbase model achieves 86.4 GLUE score, 0.7 higher than ELECTRAbase, while
+using less than 1/4 training cost.
+
+The library provides a version of the model for masked language modeling, token classification, sentence classification
+and question answering.
+
 XLM
 -----------------------------------------------------------------------------------------------------------------------

--- a/examples/README.md
+++ b/examples/README.md
@@ -55,11 +55,11 @@ Coming soon!
 |---|---|:---:|:---:|:---:|:---:|
 | [**`language-modeling`**](https://github.com/huggingface/transformers/tree/master/examples/language-modeling)       | Raw text        | ✅ | -  | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)
 | [**`multiple-choice`**](https://github.com/huggingface/transformers/tree/master/examples/multiple-choice)           | SWAG, RACE, ARC | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ViktorAlm/notebooks/blob/master/MPC_GPU_Demo_for_TF_and_PT.ipynb)
-| [**`question-answering`**](https://github.com/huggingface/transformers/tree/master/examples/question-answering)     | SQuAD           | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/huggingface/notebooks/blob/master/examples/question_answering.ipynb)
+| [**`question-answering`**](https://github.com/huggingface/transformers/tree/master/examples/question-answering)     | SQuAD           | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/question_answering.ipynb)
 | [**`summarization`**](https://github.com/huggingface/transformers/tree/master/examples/seq2seq)                     | CNN/Daily Mail  | ✅  | - | - | -
-| [**`text-classification`**](https://github.com/huggingface/transformers/tree/master/examples/text-classification)   | GLUE, XNLI      | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/huggingface/notebooks/blob/master/examples/text_classification.ipynb)
+| [**`text-classification`**](https://github.com/huggingface/transformers/tree/master/examples/text-classification)   | GLUE, XNLI      | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/text_classification.ipynb)
 | [**`text-generation`**](https://github.com/huggingface/transformers/tree/master/examples/text-generation)           | -               | n/a | n/a | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)
-| [**`token-classification`**](https://github.com/huggingface/transformers/tree/master/examples/token-classification) | CoNLL NER       | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/huggingface/notebooks/blob/master/examples/token_classification.ipynb)
+| [**`token-classification`**](https://github.com/huggingface/transformers/tree/master/examples/token-classification) | CoNLL NER       | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification.ipynb)
 | [**`translation`**](https://github.com/huggingface/transformers/tree/master/examples/seq2seq)                       | WMT             | ✅  | - | - | -


--- a/examples/language-modeling/README.md
+++ b/examples/language-modeling/README.md
@@ -100,72 +100,7 @@ sure all your batches have the same length.

 ### Whole word masking

-The BERT authors released a new version of BERT using Whole Word Masking in May 2019. Instead of masking randomly
-selected tokens (which may be part of words), they mask randomly selected words (masking all the tokens corresponding
-to that word). This technique has been refined for Chinese in [this paper](https://arxiv.org/abs/1906.08101).
-
-To fine-tune a model using whole word masking, use the following script:
-```bash
-python run_mlm_wwm.py \
-    --model_name_or_path roberta-base \
-    --dataset_name wikitext \
-    --dataset_config_name wikitext-2-raw-v1 \
-    --do_train \
-    --do_eval \
-    --output_dir /tmp/test-mlm-wwm
-```
-
-For Chinese models, we need to generate a reference files (which requires the ltp library), because it's tokenized at
-the character level.
-
-**Q :** Why a reference file?
-
-**A :** Suppose we have a Chinese sentence like: `我喜欢你` The original Chinese-BERT will tokenize it as
-`['我','喜','欢','你']` (character level). But `喜欢` is a whole word. For whole word masking proxy, we need a result
-like `['我','喜','##欢','你']`, so we need a reference file to tell the model which position of the BERT original token
-should be added `##`.
-
-**Q :** Why LTP ?
-
-**A :** Cause the best known Chinese WWM BERT is [Chinese-BERT-wwm](https://github.com/ymcui/Chinese-BERT-wwm) by HIT.
-It works well on so many Chines Task like CLUE (Chinese GLUE). They use LTP, so if we want to fine-tune their model,
-we need LTP.
-
-Now LTP only only works well on `transformers==3.2.0`. So we don't add it to requirements.txt.
-You need to create a separate environment with this version of Transformers to run the `run_chinese_ref.py` script that
-will create the reference files. The script is in `examples/contrib`. Once in the proper environment, run the
-following:
-
-
-```bash
-export TRAIN_FILE=/path/to/dataset/wiki.train.raw
-export LTP_RESOURCE=/path/to/ltp/tokenizer
-export BERT_RESOURCE=/path/to/bert/tokenizer
-export SAVE_PATH=/path/to/data/ref.txt
-
-python examples/contrib/run_chinese_ref.py \
-    --file_name=path_to_train_or_eval_file \
-    --ltp=path_to_ltp_tokenizer \
-    --bert=path_to_bert_tokenizer \
-    --save_path=path_to_reference_file
-```
-
-Then you can run the script like this: 
-
-
-```bash
-python run_mlm_wwm.py \
-    --model_name_or_path roberta-base \
-    --train_file path_to_train_file \
-    --validation_file path_to_validation_file \
-    --train_ref_file path_to_train_chinese_ref_file \
-    --validation_ref_file path_to_validation_chinese_ref_file \
-    --do_train \
-    --do_eval \
-    --output_dir /tmp/test-mlm-wwm
-```
-
-**Note:** On TPU, you should the flag `--pad_to_max_length` to make sure all your batches have the same length.
+This part was moved to `examples/research_projects/mlm_wwm`. 

 ### XLNet and permutation language modeling

--- a/examples/language-modeling/run_clm.py
+++ b/examples/language-modeling/run_clm.py
@@ -42,7 +42,7 @@ from transformers import (
    default_data_collator,
    set_seed,
 )
-from transformers.trainer_utils import is_main_process
+from transformers.trainer_utils import get_last_checkpoint, is_main_process


 logger = logging.getLogger(__name__)
@@ -160,23 +160,28 @@ def main():
    else:
        model_args, data_args, training_args = parser.parse_args_into_dataclasses()

-    if (
-        os.path.exists(training_args.output_dir)
-        and os.listdir(training_args.output_dir)
-        and training_args.do_train
-        and not training_args.overwrite_output_dir
-    ):
-        raise ValueError(
-            f"Output directory ({training_args.output_dir}) already exists and is not empty."
-            "Use --overwrite_output_dir to overcome."
-        )
+    # Detecting last checkpoint.
+    last_checkpoint = None
+    if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
+        last_checkpoint = get_last_checkpoint(training_args.output_dir)
+        if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
+            raise ValueError(
+                f"Output directory ({training_args.output_dir}) already exists and is not empty. "
+                "Use --overwrite_output_dir to overcome."
+            )
+        elif last_checkpoint is not None:
+            logger.info(
+                f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
+                "the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
+            )

    # Setup logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
-        level=logging.INFO if is_main_process(training_args.local_rank) else logging.WARN,
+        handlers=[logging.StreamHandler(sys.stdout)],
    )
+    logger.setLevel(logging.INFO if is_main_process(training_args.local_rank) else logging.WARN)

    # Log on each process the small summary:
    logger.warning(
@@ -222,7 +227,11 @@ def main():
            data_files["train"] = data_args.train_file
        if data_args.validation_file is not None:
            data_files["validation"] = data_args.validation_file
-        extension = data_args.train_file.split(".")[-1]
+        extension = (
+            data_args.train_file.split(".")[-1]
+            if data_args.train_file is not None
+            else data_args.validation_file.split(".")[-1]
+        )
        if extension == "txt":
            extension = "text"
        datasets = load_dataset(extension, data_files=data_files)
@@ -356,12 +365,13 @@ def main():

    # Training
    if training_args.do_train:
-        model_path = (
-            model_args.model_name_or_path
-            if (model_args.model_name_or_path is not None and os.path.isdir(model_args.model_name_or_path))
-            else None
-        )
-        train_result = trainer.train(model_path=model_path)
+        if last_checkpoint is not None:
+            checkpoint = last_checkpoint
+        elif model_args.model_name_or_path is not None and os.path.isdir(model_args.model_name_or_path):
+            checkpoint = model_args.model_name_or_path
+        else:
+            checkpoint = None
+        train_result = trainer.train(resume_from_checkpoint=checkpoint)
        trainer.save_model()  # Saves the tokenizer too for easy upload

        output_train_file = os.path.join(training_args.output_dir, "train_results.txt")
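
The checkpoint detection added to `run_clm.py` above can be read as a small decision function. The sketch below is a self-contained illustration, not the library's code: `find_last_checkpoint` is a hypothetical stand-in for `get_last_checkpoint`, and it assumes checkpoints are saved in subfolders named `checkpoint-<step>`. `detect_resume_point` is likewise an illustrative name.

```python
import os
import re

# Assumed naming scheme for checkpoint folders: checkpoint-<step>
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(folder):
    """Hypothetical stand-in for get_last_checkpoint.

    Returns the checkpoint subfolder with the highest step number, or None.
    """
    steps = []
    for name in os.listdir(folder):
        match = _CHECKPOINT_RE.match(name)
        if match and os.path.isdir(os.path.join(folder, name)):
            # Compare steps as integers so that 1000 ranks above 500.
            steps.append((int(match.group(1)), name))
    if not steps:
        return None
    return os.path.join(folder, max(steps)[1])


def detect_resume_point(output_dir, do_train=True, overwrite_output_dir=False):
    """Mirror the branch structure of the diff: resume, refuse, or start fresh."""
    last_checkpoint = None
    if os.path.isdir(output_dir) and do_train and not overwrite_output_dir:
        last_checkpoint = find_last_checkpoint(output_dir)
        if last_checkpoint is None and len(os.listdir(output_dir)) > 0:
            # Non-empty directory without any checkpoint: refuse to clobber it.
            raise ValueError(
                f"Output directory ({output_dir}) already exists and is not empty. "
                "Use --overwrite_output_dir to overcome."
            )
    return last_checkpoint
```

Under these assumptions, a missing or empty output directory yields `None`, a directory holding checkpoints yields the most recent one, and a non-empty directory without checkpoints still raises as before.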
--- a/examples/language-modeling/run_mlm.py
+++ b/examples/language-modeling/run_mlm.py
@@ -42,7 +42,7 @@ from transformers import (
    TrainingArguments,
    set_seed,
 )
-from transformers.trainer_utils import is_main_process
+from transformers.trainer_utils import get_last_checkpoint, is_main_process


 logger = logging.getLogger(__name__)
@@ -171,23 +171,28 @@ def main():
    else:
        model_args, data_args, training_args = parser.parse_args_into_dataclasses()

-    if (
-        os.path.exists(training_args.output_dir)
-        and os.listdir(training_args.output_dir)
-        and training_args.do_train
-        and not training_args.overwrite_output_dir
-    ):
-        raise ValueError(
-            f"Output directory ({training_args.output_dir}) already exists and is not empty."
-            "Use --overwrite_output_dir to overcome."
-        )
+    # Detecting last checkpoint.
+    last_checkpoint = None
+    if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
+        last_checkpoint = get_last_checkpoint(training_args.output_dir)
+        if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
+            raise ValueError(
+                f"Output directory ({training_args.output_dir}) already exists and is not empty. "
+                "Use --overwrite_output_dir to overcome."
+            )
+        elif last_checkpoint is not None:
+            logger.info(
+                f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
+                "the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
+            )

    # Setup logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
-        level=logging.INFO if is_main_process(training_args.local_rank) else logging.WARN,
+        handlers=[logging.StreamHandler(sys.stdout)],
    )
+    logger.setLevel(logging.INFO if is_main_process(training_args.local_rank) else logging.WARN)

    # Log on each process the small summary:
    logger.warning(
@@ -338,6 +343,12 @@ def main():

        if data_args.max_seq_length is None:
            max_seq_length = tokenizer.model_max_length
+            if max_seq_length > 1024:
+                logger.warn(
+                    f"The tokenizer picked seems to have a very large `model_max_length` ({tokenizer.model_max_length}). "
+                    "Picking 1024 instead. You can change that default value by passing --max_seq_length xxx."
+                )
+                max_seq_length = 1024
        else:
            if data_args.max_seq_length > tokenizer.model_max_length:
                logger.warn(
@@ -391,12 +402,13 @@ def main():

    # Training
    if training_args.do_train:
-        model_path = (
-            model_args.model_name_or_path
-            if (model_args.model_name_or_path is not None and os.path.isdir(model_args.model_name_or_path))
-            else None
-        )
-        train_result = trainer.train(model_path=model_path)
+        if last_checkpoint is not None:
+            checkpoint = last_checkpoint
+        elif model_args.model_name_or_path is not None and os.path.isdir(model_args.model_name_or_path):
+            checkpoint = model_args.model_name_or_path
+        else:
+            checkpoint = None
+        train_result = trainer.train(resume_from_checkpoint=checkpoint)
        trainer.save_model()  # Saves the tokenizer too for easy upload

        output_train_file = os.path.join(training_args.output_dir, "train_results.txt")
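
The `run_mlm.py` change above caps the default sequence length at 1024 when the tokenizer reports a very large `model_max_length`. Some tokenizers report a huge placeholder value when no real limit is known. A minimal sketch of the resulting selection rule follows. The function name `resolve_max_seq_length` is hypothetical, and the second branch (clipping an explicit request to the tokenizer limit) is an assumption, since that part of the script is only partly visible in the diff.

```python
def resolve_max_seq_length(requested, model_max_length, cap=1024):
    """Pick the sequence length to use for grouping or truncating texts."""
    if requested is None:
        # No --max_seq_length given: use the tokenizer limit, but never more than the cap.
        return min(model_max_length, cap)
    # Assumption: an explicit request is clipped to what the tokenizer supports.
    return min(requested, model_max_length)
```

With this rule, a tokenizer limit of 512 is used unchanged, while a placeholder limit such as `int(1e30)` falls back to 1024 unless the user passes an explicit value.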
--- a/examples/language-modeling/run_plm.py
+++ b/examples/language-modeling/run_plm.py
@@ -38,7 +38,7 @@ from transformers import (
    XLNetLMHeadModel,
    set_seed,
 )
-from transformers.trainer_utils import is_main_process
+from transformers.trainer_utils import get_last_checkpoint, is_main_process


 logger = logging.getLogger(__name__)
@@ -168,23 +168,28 @@ def main():
    else:
        model_args, data_args, training_args = parser.parse_args_into_dataclasses()

-    if (
-        os.path.exists(training_args.output_dir)
-        and os.listdir(training_args.output_dir)
-        and training_args.do_train
-        and not training_args.overwrite_output_dir
-    ):
-        raise ValueError(
-            f"Output directory ({training_args.output_dir}) already exists and is not empty."
-            "Use --overwrite_output_dir to overcome."
-        )
+    # Detecting last checkpoint.
+    last_checkpoint = None
+    if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
+        last_checkpoint = get_last_checkpoint(training_args.output_dir)
+        if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
+            raise ValueError(
+                f"Output directory ({training_args.output_dir}) already exists and is not empty. "
+                "Use --overwrite_output_dir to overcome."
+            )
+        elif last_checkpoint is not None:
+            logger.info(
+                f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
+                "the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
+            )

    # Setup logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
-        level=logging.INFO if is_main_process(training_args.local_rank) else logging.WARN,
+        handlers=[logging.StreamHandler(sys.stdout)],
    )
+    logger.setLevel(logging.INFO if is_main_process(training_args.local_rank) else logging.WARN)

    # Log on each process the small summary:
    logger.warning(
@@ -378,12 +383,13 @@ def main():

    # Training
    if training_args.do_train:
-        model_path = (
-            model_args.model_name_or_path
-            if (model_args.model_name_or_path is not None and os.path.isdir(model_args.model_name_or_path))
-            else None
-        )
-        train_result = trainer.train(model_path=model_path)
+        if last_checkpoint is not None:
+            checkpoint = last_checkpoint
+        elif model_args.model_name_or_path is not None and os.path.isdir(model_args.model_name_or_path):
+            checkpoint = model_args.model_name_or_path
+        else:
+            checkpoint = None
+        train_result = trainer.train(resume_from_checkpoint=checkpoint)
        trainer.save_model()  # Saves the tokenizer too for easy upload

        output_train_file = os.path.join(training_args.output_dir, "train_results.txt")
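
The logging change repeated across these scripts moves the level from `logging.basicConfig` to the script's own logger and sends output to stdout. A rough sketch of that pattern, using only the standard library, is shown below. The helper name `setup_script_logger` is hypothetical, and treating ranks `-1` and `0` as the main process is an assumption standing in for `is_main_process`.

```python
import logging
import sys


def setup_script_logger(name, local_rank=-1):
    """Configure stdout logging and set the level on the named logger only."""
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        handlers=[logging.StreamHandler(sys.stdout)],
    )
    logger = logging.getLogger(name)
    # Assumption: -1 (no distributed training) and 0 count as the main process.
    is_main = local_rank in (-1, 0)
    logger.setLevel(logging.INFO if is_main else logging.WARN)
    return logger
```

Because the level is set on the named logger rather than on the root logger, other libraries keep their own verbosity while the script still reports at INFO on the main process.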
--- a/examples/legacy/multiple_choice/run_multiple_choice.py
+++ b/examples/legacy/multiple_choice/run_multiple_choice.py
@@ -28,6 +28,7 @@ from transformers import (
    AutoConfig,
    AutoModelForMultipleChoice,
    AutoTokenizer,
+    DataCollatorWithPadding,
    EvalPrediction,
    HfArgumentParser,
    Trainer,
@@ -188,6 +189,9 @@ def main():
        preds = np.argmax(p.predictions, axis=1)
        return {"acc": simple_accuracy(preds, p.label_ids)}

+    # Data collator
+    data_collator = DataCollatorWithPadding(tokenizer, pad_to_multiple_of=8) if training_args.fp16 else None
+
    # Initialize our Trainer
    trainer = Trainer(
        model=model,
@@ -195,6 +199,7 @@ def main():
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics,
+        data_collator=data_collator,
    )

    # Training
--- a/examples/legacy/question-answering/run_squad_trainer.py
+++ b/examples/legacy/question-answering/run_squad_trainer.py
@@ -23,7 +23,14 @@ from dataclasses import dataclass, field
 from typing import Optional

 import transformers
-from transformers import AutoConfig, AutoModelForQuestionAnswering, AutoTokenizer, HfArgumentParser, SquadDataset
+from transformers import (
+    AutoConfig,
+    AutoModelForQuestionAnswering,
+    AutoTokenizer,
+    DataCollatorWithPadding,
+    HfArgumentParser,
+    SquadDataset,
+)
 from transformers import SquadDataTrainingArguments as DataTrainingArguments
 from transformers import Trainer, TrainingArguments
 from transformers.trainer_utils import is_main_process
@@ -145,12 +152,16 @@ def main():
        else None
    )

+    # Data collator
+    data_collator = DataCollatorWithPadding(tokenizer, pad_to_multiple_of=8) if training_args.fp16 else None
+
    # Initialize our Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
+        data_collator=data_collator,
    )

    # Training
--- a/examples/legacy/token-classification/run_ner.py
+++ b/examples/legacy/token-classification/run_ner.py
@@ -30,6 +30,7 @@ from transformers import (
    AutoConfig,
    AutoModelForTokenClassification,
    AutoTokenizer,
+    DataCollatorWithPadding,
    EvalPrediction,
    HfArgumentParser,
    Trainer,
@@ -237,6 +238,9 @@ def main():
            "f1": f1_score(out_label_list, preds_list),
        }

+    # Data collator
+    data_collator = DataCollatorWithPadding(tokenizer, pad_to_multiple_of=8) if training_args.fp16 else None
+
    # Initialize our Trainer
    trainer = Trainer(
        model=model,
@@ -244,6 +248,7 @@ def main():
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics,
+        data_collator=data_collator,
    )

    # Training
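
The three legacy scripts above build a `DataCollatorWithPadding` with `pad_to_multiple_of=8` only when `training_args.fp16` is set. The commonly cited motivation is that half-precision matrix kernels on recent NVIDIA GPUs tend to run faster when dimensions are multiples of 8. The sketch below illustrates the rounding itself in plain Python. `padded_length` and `pad_batch` are hypothetical helpers, and right-side padding with a pad id of 0 is an assumption.

```python
def padded_length(longest, pad_to_multiple_of=None):
    """Round the longest sequence length up to the next multiple, if one is requested."""
    if not pad_to_multiple_of:
        return longest
    return ((longest + pad_to_multiple_of - 1) // pad_to_multiple_of) * pad_to_multiple_of


def pad_batch(sequences, pad_id=0, pad_to_multiple_of=None):
    """Right-pad every sequence in the batch to a common, optionally rounded, length."""
    target = padded_length(max(len(seq) for seq in sequences), pad_to_multiple_of)
    return [list(seq) + [pad_id] * (target - len(seq)) for seq in sequences]
```

For example, a batch whose longest sequence has 13 tokens is padded to 16 when `pad_to_multiple_of=8`, and to 13 when it is left as `None`.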
--- a/examples/multiple-choice/run_swag.py
+++ b/examples/multiple-choice/run_swag.py
@@ -39,7 +39,7 @@ from transformers import (
    set_seed,
 )
 from transformers.tokenization_utils_base import PaddingStrategy, PreTrainedTokenizerBase
-from transformers.trainer_utils import is_main_process
+from transformers.trainer_utils import get_last_checkpoint, is_main_process


 logger = logging.getLogger(__name__)
@@ -194,23 +194,28 @@ def main():
    else:
        model_args, data_args, training_args = parser.parse_args_into_dataclasses()

-    if (
-        os.path.exists(training_args.output_dir)
-        and os.listdir(training_args.output_dir)
-        and training_args.do_train
-        and not training_args.overwrite_output_dir
-    ):
-        raise ValueError(
-            f"Output directory ({training_args.output_dir}) already exists and is not empty."
-            "Use --overwrite_output_dir to overcome."
-        )
+    # Detecting last checkpoint.
+    last_checkpoint = None
+    if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
+        last_checkpoint = get_last_checkpoint(training_args.output_dir)
+        if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
+            raise ValueError(
+                f"Output directory ({training_args.output_dir}) already exists and is not empty. "
+                "Use --overwrite_output_dir to overcome."
+            )
+        elif last_checkpoint is not None:
+            logger.info(
+                f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
+                "the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
+            )

    # Setup logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
-        level=logging.INFO if is_main_process(training_args.local_rank) else logging.WARN,
+        handlers=[logging.StreamHandler(sys.stdout)],
    )
+    logger.setLevel(logging.INFO if is_main_process(training_args.local_rank) else logging.WARN)

    # Log on each process the small summary:
    logger.warning(
@@ -312,7 +317,9 @@ def main():

    # Data collator
    data_collator = (
-        default_data_collator if data_args.pad_to_max_length else DataCollatorForMultipleChoice(tokenizer=tokenizer)
+        default_data_collator
+        if data_args.pad_to_max_length
+        else DataCollatorForMultipleChoice(tokenizer=tokenizer, pad_to_multiple_of=8 if training_args.fp16 else None)
    )

    # Metric
@@ -334,9 +341,13 @@ def main():

    # Training
    if training_args.do_train:
-        train_result = trainer.train(
-            model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
-        )
+        if last_checkpoint is not None:
+            checkpoint = last_checkpoint
+        elif os.path.isdir(model_args.model_name_or_path):
+            checkpoint = model_args.model_name_or_path
+        else:
+            checkpoint = None
+        train_result = trainer.train(resume_from_checkpoint=checkpoint)
        trainer.save_model()  # Saves the tokenizer too for easy upload

        output_train_file = os.path.join(training_args.output_dir, "train_results.txt")
--- a/examples/question-answering/requirements.txt
+++ b/examples/question-answering/requirements.txt
@@ -1 +1 @@
-datasets >= 1.1.3
+datasets >= 1.2.1
--- a/examples/question-answering/run_qa.py
+++ b/examples/question-answering/run_qa.py
@@ -39,7 +39,7 @@ from transformers import (
    default_data_collator,
    set_seed,
 )
-from transformers.trainer_utils import is_main_process
+from transformers.trainer_utils import get_last_checkpoint, is_main_process
 from utils_qa import postprocess_qa_predictions


@@ -169,21 +169,26 @@ def main():
    else:
        model_args, data_args, training_args = parser.parse_args_into_dataclasses()

-    if (
-        os.path.exists(training_args.output_dir)
-        and os.listdir(training_args.output_dir)
-        and training_args.do_train
-        and not training_args.overwrite_output_dir
-    ):
-        raise ValueError(
-            f"Output directory ({training_args.output_dir}) already exists and is not empty."
-            "Use --overwrite_output_dir to overcome."
-        )
+    # Detecting last checkpoint.
+    last_checkpoint = None
+    if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
+        last_checkpoint = get_last_checkpoint(training_args.output_dir)
+        if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
+            raise ValueError(
+                f"Output directory ({training_args.output_dir}) already exists and is not empty. "
+                "Use --overwrite_output_dir to overcome."
+            )
+        elif last_checkpoint is not None:
+            logger.info(
+                f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
+                "the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
+            )

    # Setup logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
+        handlers=[logging.StreamHandler(sys.stdout)],
    )
    logger.setLevel(logging.INFO if is_main_process(training_args.local_rank) else logging.WARN)

@@ -407,7 +412,11 @@ def main():
    # Data collator
    # We have already padded to max length if the corresponding flag is True, otherwise we need to pad in the data
    # collator.
-    data_collator = default_data_collator if data_args.pad_to_max_length else DataCollatorWithPadding(tokenizer)
+    data_collator = (
+        default_data_collator
+        if data_args.pad_to_max_length
+        else DataCollatorWithPadding(tokenizer, pad_to_multiple_of=8 if training_args.fp16 else None)
+    )

    # Post-processing:
    def post_processing_function(examples, features, predictions):
@@ -433,9 +442,7 @@ def main():
        references = [{"id": ex["id"], "answers": ex[answer_column_name]} for ex in datasets["validation"]]
        return EvalPrediction(predictions=formatted_predictions, label_ids=references)

-    # TODO: Once the fix lands in a Datasets release, remove the _local here and the squad_v2_local folder.
-    current_dir = os.path.sep.join(os.path.join(__file__).split(os.path.sep)[:-1])
-    metric = load_metric(os.path.join(current_dir, "squad_v2_local") if data_args.version_2_with_negative else "squad")
+    metric = load_metric("squad_v2" if data_args.version_2_with_negative else "squad")

    def compute_metrics(p: EvalPrediction):
        return metric.compute(predictions=p.predictions, references=p.label_ids)
@@ -455,9 +462,13 @@ def main():

    # Training
    if training_args.do_train:
-        train_result = trainer.train(
-            model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
-        )
+        if last_checkpoint is not None:
+            checkpoint = last_checkpoint
+        elif os.path.isdir(model_args.model_name_or_path):
+            checkpoint = model_args.model_name_or_path
+        else:
+            checkpoint = None
+        train_result = trainer.train(resume_from_checkpoint=checkpoint)
        trainer.save_model()  # Saves the tokenizer too for easy upload

        output_train_file = os.path.join(training_args.output_dir, "train_results.txt")
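
The training block above chooses what to pass as `resume_from_checkpoint` in a fixed order of precedence: a checkpoint detected in the output directory first, then `model_name_or_path` if it is a local directory, and otherwise nothing. A compact sketch of that ordering is given below. The helper name `choose_checkpoint` is hypothetical, and the explicit `None` guard follows the language-modeling scripts rather than this one.

```python
import os


def choose_checkpoint(last_checkpoint, model_name_or_path):
    """Return the path to resume from, or None to start from the loaded weights."""
    if last_checkpoint is not None:
        # A checkpoint found in the output directory always wins.
        return last_checkpoint
    if model_name_or_path is not None and os.path.isdir(model_name_or_path):
        # A local model directory may itself be a saved checkpoint.
        return model_name_or_path
    # A hub identifier is not a local directory, so there is nothing to resume from.
    return None
```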
--- a/examples/question-answering/run_qa_beam_search.py
+++ b/examples/question-answering/run_qa_beam_search.py
@@ -38,7 +38,7 @@ from transformers import (
    default_data_collator,
    set_seed,
 )
-from transformers.trainer_utils import is_main_process
+from transformers.trainer_utils import get_last_checkpoint, is_main_process
 from utils_qa import postprocess_qa_predictions_with_beam_search


@@ -168,21 +168,26 @@ def main():
    else:
        model_args, data_args, training_args = parser.parse_args_into_dataclasses()

-    if (
-        os.path.exists(training_args.output_dir)
-        and os.listdir(training_args.output_dir)
-        and training_args.do_train
-        and not training_args.overwrite_output_dir
-    ):
-        raise ValueError(
-            f"Output directory ({training_args.output_dir}) already exists and is not empty."
-            "Use --overwrite_output_dir to overcome."
-        )
+    # Detecting last checkpoint.
+    last_checkpoint = None
+    if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
+        last_checkpoint = get_last_checkpoint(training_args.output_dir)
+        if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
+            raise ValueError(
+                f"Output directory ({training_args.output_dir}) already exists and is not empty. "
+                "Use --overwrite_output_dir to overcome."
+            )
+        elif last_checkpoint is not None:
+            logger.info(
+                f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
+                "the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
+            )

    # Setup logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
+        handlers=[logging.StreamHandler(sys.stdout)],
    )
    logger.setLevel(logging.INFO if is_main_process(training_args.local_rank) else logging.WARN)

@@ -444,7 +449,11 @@ def main():
    # Data collator
    # We have already padded to max length if the corresponding flag is True, otherwise we need to pad in the data
    # collator.
-    data_collator = default_data_collator if data_args.pad_to_max_length else DataCollatorWithPadding(tokenizer)
+    data_collator = (
+        default_data_collator
+        if data_args.pad_to_max_length
+        else DataCollatorWithPadding(tokenizer, pad_to_multiple_of=8 if training_args.fp16 else None)
+    )

    # Post-processing:
    def post_processing_function(examples, features, predictions):
@@ -472,9 +481,7 @@ def main():
        references = [{"id": ex["id"], "answers": ex[answer_column_name]} for ex in datasets["validation"]]
        return EvalPrediction(predictions=formatted_predictions, label_ids=references)

-    # TODO: Once the fix lands in a Datasets release, remove the _local here and the squad_v2_local folder.
-    current_dir = os.path.sep.join(os.path.join(__file__).split(os.path.sep)[:-1])
-    metric = load_metric(os.path.join(current_dir, "squad_v2_local") if data_args.version_2_with_negative else "squad")
+    metric = load_metric("squad_v2" if data_args.version_2_with_negative else "squad")

    def compute_metrics(p: EvalPrediction):
        return metric.compute(predictions=p.predictions, references=p.label_ids)
@@ -494,9 +501,13 @@ def main():

    # Training
    if training_args.do_train:
-        train_result = trainer.train(
-            model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
-        )
+        if last_checkpoint is not None:
+            checkpoint = last_checkpoint
+        elif os.path.isdir(model_args.model_name_or_path):
+            checkpoint = model_args.model_name_or_path
+        else:
+            checkpoint = None
+        train_result = trainer.train(resume_from_checkpoint=checkpoint)
        trainer.save_model()  # Saves the tokenizer too for easy upload

        output_train_file = os.path.join(training_args.output_dir, "train_results.txt")
--- a/examples/question-answering/squad_v2_local/evaluate.py
+++ b/examples/question-answering/squad_v2_local/evaluate.py
@@ -1,322 +0,0 @@
-"""Official evaluation script for SQuAD version 2.0.
-
-In addition to basic functionality, we also compute additional statistics and
-plot precision-recall curves if an additional na_prob.json file is provided.
-This file is expected to map question ID's to the model's predicted probability
-that a question is unanswerable.
-"""
-import argparse
-import collections
-import json
-import os
-import re
-import string
-import sys
-
-import numpy as np
-
-
-OPTS = None
-
-
-def parse_args():
-    parser = argparse.ArgumentParser("Official evaluation script for SQuAD version 2.0.")
-    parser.add_argument("data_file", metavar="data.json", help="Input data JSON file.")
-    parser.add_argument("pred_file", metavar="pred.json", help="Model predictions.")
-    parser.add_argument(
-        "--out-file", "-o", metavar="eval.json", help="Write accuracy metrics to file (default is stdout)."
-    )
-    parser.add_argument(
-        "--na-prob-file", "-n", metavar="na_prob.json", help="Model estimates of probability of no answer."
-    )
-    parser.add_argument(
-        "--na-prob-thresh",
-        "-t",
-        type=float,
-        default=1.0,
-        help='Predict "" if no-answer probability exceeds this (default = 1.0).',
-    )
-    parser.add_argument(
-        "--out-image-dir", "-p", metavar="out_images", default=None, help="Save precision-recall curves to directory."
-    )
-    parser.add_argument("--verbose", "-v", action="store_true")
-    if len(sys.argv) == 1:
-        parser.print_help()
-        sys.exit(1)
-    return parser.parse_args()
-
-
-def make_qid_to_has_ans(dataset):
-    qid_to_has_ans = {}
-    for article in dataset:
-        for p in article["paragraphs"]:
-            for qa in p["qas"]:
-                qid_to_has_ans[qa["id"]] = bool(qa["answers"]["text"])
-    return qid_to_has_ans
-
-
-def normalize_answer(s):
-    """Lower text and remove punctuation, articles and extra whitespace."""
-
-    def remove_articles(text):
-        regex = re.compile(r"\b(a|an|the)\b", re.UNICODE)
-        return re.sub(regex, " ", text)
-
-    def white_space_fix(text):
-        return " ".join(text.split())
-
-    def remove_punc(text):
-        exclude = set(string.punctuation)
-        return "".join(ch for ch in text if ch not in exclude)
-
-    def lower(text):
-        return text.lower()
-
-    return white_space_fix(remove_articles(remove_punc(lower(s))))
-
-
-def get_tokens(s):
-    if not s:
-        return []
-    return normalize_answer(s).split()
-
-
-def compute_exact(a_gold, a_pred):
-    return int(normalize_answer(a_gold) == normalize_answer(a_pred))
-
-
-def compute_f1(a_gold, a_pred):
-    gold_toks = get_tokens(a_gold)
-    pred_toks = get_tokens(a_pred)
-    common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
-    num_same = sum(common.values())
-    if len(gold_toks) == 0 or len(pred_toks) == 0:
-        # If either is no-answer, then F1 is 1 if they agree, 0 otherwise
-        return int(gold_toks == pred_toks)
-    if num_same == 0:
-        return 0
-    precision = 1.0 * num_same / len(pred_toks)
-    recall = 1.0 * num_same / len(gold_toks)
-    f1 = (2 * precision * recall) / (precision + recall)
-    return f1
-
-
-def get_raw_scores(dataset, preds):
-    exact_scores = {}
-    f1_scores = {}
-    for article in dataset:
-        for p in article["paragraphs"]:
-            for qa in p["qas"]:
-                qid = qa["id"]
-                gold_answers = [t for t in qa["answers"]["text"] if normalize_answer(t)]
-                if not gold_answers:
-                    # For unanswerable questions, only correct answer is empty string
-                    gold_answers = [""]
-                if qid not in preds:
-                    print("Missing prediction for %s" % qid)
-                    continue
-                a_pred = preds[qid]
-                # Take max over all gold answers
-                exact_scores[qid] = max(compute_exact(a, a_pred) for a in gold_answers)
-                f1_scores[qid] = max(compute_f1(a, a_pred) for a in gold_answers)
-    return exact_scores, f1_scores
-
-
-def apply_no_ans_threshold(scores, na_probs, qid_to_has_ans, na_prob_thresh):
-    new_scores = {}
-    for qid, s in scores.items():
-        pred_na = na_probs[qid] > na_prob_thresh
-        if pred_na:
-            new_scores[qid] = float(not qid_to_has_ans[qid])
-        else:
-            new_scores[qid] = s
-    return new_scores
-
-
-def make_eval_dict(exact_scores, f1_scores, qid_list=None):
-    if not qid_list:
-        total = len(exact_scores)
-        return collections.OrderedDict(
-            [
-                ("exact", 100.0 * sum(exact_scores.values()) / total),
-                ("f1", 100.0 * sum(f1_scores.values()) / total),
-                ("total", total),
-            ]
-        )
-    else:
-        total = len(qid_list)
-        return collections.OrderedDict(
-            [
-                ("exact", 100.0 * sum(exact_scores[k] for k in qid_list) / total),
-                ("f1", 100.0 * sum(f1_scores[k] for k in qid_list) / total),
-                ("total", total),
-            ]
-        )
-
-
-def merge_eval(main_eval, new_eval, prefix):
-    for k in new_eval:
-        main_eval["%s_%s" % (prefix, k)] = new_eval[k]
-
-
-def plot_pr_curve(precisions, recalls, out_image, title):
-    plt.step(recalls, precisions, color="b", alpha=0.2, where="post")
-    plt.fill_between(recalls, precisions, step="post", alpha=0.2, color="b")
-    plt.xlabel("Recall")
-    plt.ylabel("Precision")
-    plt.xlim([0.0, 1.05])
-    plt.ylim([0.0, 1.05])
-    plt.title(title)
-    plt.savefig(out_image)
-    plt.clf()
-
-
-def make_precision_recall_eval(scores, na_probs, num_true_pos, qid_to_has_ans, out_image=None, title=None):
-    qid_list = sorted(na_probs, key=lambda k: na_probs[k])
-    true_pos = 0.0
-    cur_p = 1.0
-    cur_r = 0.0
-    precisions = [1.0]
-    recalls = [0.0]
-    avg_prec = 0.0
-    for i, qid in enumerate(qid_list):
-        if qid_to_has_ans[qid]:
-            true_pos += scores[qid]
-        cur_p = true_pos / float(i + 1)
-        cur_r = true_pos / float(num_true_pos)
-        if i == len(qid_list) - 1 or na_probs[qid] != na_probs[qid_list[i + 1]]:
-            # i.e., if we can put a threshold after this point
-            avg_prec += cur_p * (cur_r - recalls[-1])
-            precisions.append(cur_p)
-            recalls.append(cur_r)
-    if out_image:
-        plot_pr_curve(precisions, recalls, out_image, title)
-    return {"ap": 100.0 * avg_prec}
-
-
-def run_precision_recall_analysis(main_eval, exact_raw, f1_raw, na_probs, qid_to_has_ans, out_image_dir):
-    if out_image_dir and not os.path.exists(out_image_dir):
-        os.makedirs(out_image_dir)
-    num_true_pos = sum(1 for v in qid_to_has_ans.values() if v)
-    if num_true_pos == 0:
-        return
-    pr_exact = make_precision_recall_eval(
-        exact_raw,
-        na_probs,
-        num_true_pos,
-        qid_to_has_ans,
-        out_image=os.path.join(out_image_dir, "pr_exact.png"),
-        title="Precision-Recall curve for Exact Match score",
-    )
-    pr_f1 = make_precision_recall_eval(
-        f1_raw,
-        na_probs,
-        num_true_pos,
-        qid_to_has_ans,
-        out_image=os.path.join(out_image_dir, "pr_f1.png"),
-        title="Precision-Recall curve for F1 score",
-    )
-    oracle_scores = {k: float(v) for k, v in qid_to_has_ans.items()}
-    pr_oracle = make_precision_recall_eval(
-        oracle_scores,
-        na_probs,
-        num_true_pos,
-        qid_to_has_ans,
-        out_image=os.path.join(out_image_dir, "pr_oracle.png"),
-        title="Oracle Precision-Recall curve (binary task of HasAns vs. NoAns)",
-    )
-    merge_eval(main_eval, pr_exact, "pr_exact")
-    merge_eval(main_eval, pr_f1, "pr_f1")
-    merge_eval(main_eval, pr_oracle, "pr_oracle")
-
-
-def histogram_na_prob(na_probs, qid_list, image_dir, name):
-    if not qid_list:
-        return
-    x = [na_probs[k] for k in qid_list]
-    weights = np.ones_like(x) / float(len(x))
-    plt.hist(x, weights=weights, bins=20, range=(0.0, 1.0))
-    plt.xlabel("Model probability of no-answer")
-    plt.ylabel("Proportion of dataset")
-    plt.title("Histogram of no-answer probability: %s" % name)
-    plt.savefig(os.path.join(image_dir, "na_prob_hist_%s.png" % name))
-    plt.clf()
-
-
-def find_best_thresh(preds, scores, na_probs, qid_to_has_ans):
-    num_no_ans = sum(1 for k in qid_to_has_ans if not qid_to_has_ans[k])
-    cur_score = num_no_ans
-    best_score = cur_score
-    best_thresh = 0.0
-    qid_list = sorted(na_probs, key=lambda k: na_probs[k])
-    for i, qid in enumerate(qid_list):
-        if qid not in scores:
-            continue
-        if qid_to_has_ans[qid]:
-            diff = scores[qid]
-        else:
-            if preds[qid]:
-                diff = -1
-            else:
-                diff = 0
-        cur_score += diff
-        if cur_score > best_score:
-            best_score = cur_score
-            best_thresh = na_probs[qid]
-    return 100.0 * best_score / len(scores), best_thresh
-
-
-def find_all_best_thresh(main_eval, preds, exact_raw, f1_raw, na_probs, qid_to_has_ans):
-    best_exact, exact_thresh = find_best_thresh(preds, exact_raw, na_probs, qid_to_has_ans)
-    best_f1, f1_thresh = find_best_thresh(preds, f1_raw, na_probs, qid_to_has_ans)
-    main_eval["best_exact"] = best_exact
-    main_eval["best_exact_thresh"] = exact_thresh
-    main_eval["best_f1"] = best_f1
-    main_eval["best_f1_thresh"] = f1_thresh
-
-
-def main():
-    with open(OPTS.data_file) as f:
-        dataset_json = json.load(f)
-        dataset = dataset_json["data"]
-    with open(OPTS.pred_file) as f:
-        preds = json.load(f)
-    if OPTS.na_prob_file:
-        with open(OPTS.na_prob_file) as f:
-            na_probs = json.load(f)
-    else:
-        na_probs = {k: 0.0 for k in preds}
-    qid_to_has_ans = make_qid_to_has_ans(dataset)  # maps qid to True/False
-    has_ans_qids = [k for k, v in qid_to_has_ans.items() if v]
-    no_ans_qids = [k for k, v in qid_to_has_ans.items() if not v]
-    exact_raw, f1_raw = get_raw_scores(dataset, preds)
-    exact_thresh = apply_no_ans_threshold(exact_raw, na_probs, qid_to_has_ans, OPTS.na_prob_thresh)
-    f1_thresh = apply_no_ans_threshold(f1_raw, na_probs, qid_to_has_ans, OPTS.na_prob_thresh)
-    out_eval = make_eval_dict(exact_thresh, f1_thresh)
-    if has_ans_qids:
-        has_ans_eval = make_eval_dict(exact_thresh, f1_thresh, qid_list=has_ans_qids)
-        merge_eval(out_eval, has_ans_eval, "HasAns")
-    if no_ans_qids:
-        no_ans_eval = make_eval_dict(exact_thresh, f1_thresh, qid_list=no_ans_qids)
-        merge_eval(out_eval, no_ans_eval, "NoAns")
-    if OPTS.na_prob_file:
-        find_all_best_thresh(out_eval, preds, exact_raw, f1_raw, na_probs, qid_to_has_ans)
-    if OPTS.na_prob_file and OPTS.out_image_dir:
-        run_precision_recall_analysis(out_eval, exact_raw, f1_raw, na_probs, qid_to_has_ans, OPTS.out_image_dir)
-        histogram_na_prob(na_probs, has_ans_qids, OPTS.out_image_dir, "hasAns")
-        histogram_na_prob(na_probs, no_ans_qids, OPTS.out_image_dir, "noAns")
-    if OPTS.out_file:
-        with open(OPTS.out_file, "w") as f:
-            json.dump(out_eval, f)
-    else:
-        print(json.dumps(out_eval, indent=2))
-
-
-if __name__ == "__main__":
-    OPTS = parse_args()
-    if OPTS.out_image_dir:
-        import matplotlib
-
-        matplotlib.use("Agg")
-        import matplotlib.pyplot as plt
-    main()
--- a/examples/question-answering/squad_v2_local/squad_v2_local.py
+++ b/examples/question-answering/squad_v2_local/squad_v2_local.py
@@ -1,128 +0,0 @@
-# coding=utf-8
-# Copyright 2020 The HuggingFace Datasets Authors.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-""" SQuAD v2 metric. """
-
-import datasets
-
-from .evaluate import (
-    apply_no_ans_threshold,
-    find_all_best_thresh,
-    get_raw_scores,
-    make_eval_dict,
-    make_qid_to_has_ans,
-    merge_eval,
-)
-
-
-_CITATION = """\
-@inproceedings{Rajpurkar2016SQuAD10,
-  title={SQuAD: 100, 000+ Questions for Machine Comprehension of Text},
-  author={Pranav Rajpurkar and Jian Zhang and Konstantin Lopyrev and Percy Liang},
-  booktitle={EMNLP},
-  year={2016}
-}
-"""
-
-_DESCRIPTION = """
-This metric wraps the official scoring script for version 2 of the Stanford Question
-Answering Dataset (SQuAD).
-
-Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by
-crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span,
-from the corresponding reading passage, or the question might be unanswerable.
-
-SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions
-written adversarially by crowdworkers to look similar to answerable ones.
-To do well on SQuAD2.0, systems must not only answer questions when possible, but also
-determine when no answer is supported by the paragraph and abstain from answering.
-"""
-
-_KWARGS_DESCRIPTION = """
-Computes SQuAD v2 scores (F1 and EM).
-Args:
-    predictions: List of triple for question-answers to score with the following elements:
-        - the question-answer 'id' field as given in the references (see below)
-        - the text of the answer
-        - the probability that the question has no answer
-    references: List of question-answers dictionaries with the following key-values:
-            - 'id': id of the question-answer pair (see above),
-            - 'answers': a list of Dict {'text': text of the answer as a string}
-    no_answer_threshold: float
-        Probability threshold to decide that a question has no answer.
-Returns:
-    'exact': Exact match (the normalized answer exactly matches the gold answer)
-    'f1': The F-score of predicted tokens versus the gold answer
-    'total': Number of scores considered
-    'HasAns_exact': Exact match (the normalized answer exactly matches the gold answer)
-    'HasAns_f1': The F-score of predicted tokens versus the gold answer
-    'HasAns_total': Number of scores considered
-    'NoAns_exact': Exact match (the normalized answer exactly matches the gold answer)
-    'NoAns_f1': The F-score of predicted tokens versus the gold answer
-    'NoAns_total': Number of scores considered
-    'best_exact': Best exact match (with varying threshold)
-    'best_exact_thresh': No-answer probability threshold associated to the best exact match
-    'best_f1': Best F1 (with varying threshold)
-    'best_f1_thresh': No-answer probability threshold associated to the best F1
-"""
-
-
-class SquadV2(datasets.Metric):
-    def _info(self):
-        return datasets.MetricInfo(
-            description=_DESCRIPTION,
-            citation=_CITATION,
-            inputs_description=_KWARGS_DESCRIPTION,
-            features=datasets.Features(
-                {
-                    "predictions": {
-                        "id": datasets.Value("string"),
-                        "prediction_text": datasets.Value("string"),
-                        "no_answer_probability": datasets.Value("float32"),
-                    },
-                    "references": {
-                        "id": datasets.Value("string"),
-                        "answers": datasets.features.Sequence(
-                            {"text": datasets.Value("string"), "answer_start": datasets.Value("int32")}
-                        ),
-                    },
-                }
-            ),
-            codebase_urls=["https://rajpurkar.github.io/SQuAD-explorer/"],
-            reference_urls=["https://rajpurkar.github.io/SQuAD-explorer/"],
-        )
-
-    def _compute(self, predictions, references, no_answer_threshold=1.0):
-        no_answer_probabilities = dict((p["id"], p["no_answer_probability"]) for p in predictions)
-        dataset = [{"paragraphs": [{"qas": references}]}]
-        predictions = dict((p["id"], p["prediction_text"]) for p in predictions)
-
-        qid_to_has_ans = make_qid_to_has_ans(dataset)  # maps qid to True/False
-        has_ans_qids = [k for k, v in qid_to_has_ans.items() if v]
-        no_ans_qids = [k for k, v in qid_to_has_ans.items() if not v]
-
-        exact_raw, f1_raw = get_raw_scores(dataset, predictions)
-        exact_thresh = apply_no_ans_threshold(exact_raw, no_answer_probabilities, qid_to_has_ans, no_answer_threshold)
-        f1_thresh = apply_no_ans_threshold(f1_raw, no_answer_probabilities, qid_to_has_ans, no_answer_threshold)
-        out_eval = make_eval_dict(exact_thresh, f1_thresh)
-
-        if has_ans_qids:
-            has_ans_eval = make_eval_dict(exact_thresh, f1_thresh, qid_list=has_ans_qids)
-            merge_eval(out_eval, has_ans_eval, "HasAns")
-        if no_ans_qids:
-            no_ans_eval = make_eval_dict(exact_thresh, f1_thresh, qid_list=no_ans_qids)
-            merge_eval(out_eval, no_ans_eval, "NoAns")
-        find_all_best_thresh(out_eval, predictions, exact_raw, f1_raw, no_answer_probabilities, qid_to_has_ans)
-
-        return out_eval
--- a/examples/research_projects/lxmert/requirements.txt
+++ b/examples/research_projects/lxmert/requirements.txt
@@ -3,7 +3,6 @@ argon2-cffi==20.1.0
 async-generator==1.10
 attrs==20.2.0
 backcall==0.2.0
-bleach==3.1.5
 CacheControl==0.12.6
 certifi==2020.6.20
 cffi==1.14.2
--- a/examples/research_projects/mlm_wwm/README.md
+++ b/examples/research_projects/mlm_wwm/README.md
@@ -0,0 +1,92 @@
+<!---
+Copyright 2020 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+## Whole Word Mask Language Model
+
+
+These scripts leverage the 🤗 Datasets library and the Trainer API. You can easily customize them to your needs if you
+need extra processing on your datasets.
+
+The following examples will run on datasets hosted on our [hub](https://huggingface.co/datasets) or with your own
+text files for training and validation. We give examples of both below.
+
+
+
+The BERT authors released a new version of BERT using Whole Word Masking in May 2019. Instead of masking randomly
+selected tokens (which may be part of words), they mask randomly selected words (masking all the tokens corresponding
+to that word). This technique has been refined for Chinese in [this paper](https://arxiv.org/abs/1906.08101).
+
+To fine-tune a model using whole word masking, use the following script:
+```bash
+python run_mlm_wwm.py \
+    --model_name_or_path roberta-base \
+    --dataset_name wikitext \
+    --dataset_config_name wikitext-2-raw-v1 \
+    --do_train \
+    --do_eval \
+    --output_dir /tmp/test-mlm-wwm
+```
+
+For Chinese models, we need to generate a reference file (which requires the ltp library), because the text is tokenized at
+the character level.
+
+**Q :** Why a reference file?
+
+**A :** Suppose we have a Chinese sentence like `我喜欢你`. The original Chinese-BERT will tokenize it as
+`['我','喜','欢','你']` (character level). But `喜欢` is a whole word. For whole word masking, we need a result
+like `['我','喜','##欢','你']`, so we need a reference file to tell the model which positions of the original BERT
+tokens should be prefixed with `##`.
+
+**Q :** Why LTP?
+
+**A :** Because the best known Chinese WWM BERT is [Chinese-BERT-wwm](https://github.com/ymcui/Chinese-BERT-wwm) by HIT.
+It works well on many Chinese tasks such as CLUE (Chinese GLUE). They use LTP, so if we want to fine-tune their model,
+we need LTP.
+
+You can generate the reference file by running the following:
+
+
+```bash
+export TRAIN_FILE=/path/to/dataset/wiki.train.raw
+export LTP_RESOURCE=/path/to/ltp/tokenizer
+export BERT_RESOURCE=/path/to/bert/tokenizer
+export SAVE_PATH=/path/to/data/ref.txt
+
+python run_chinese_ref.py \
+    --file_name=$TRAIN_FILE \
+    --ltp=$LTP_RESOURCE \
+    --bert=$BERT_RESOURCE \
+    --save_path=$SAVE_PATH
+```
+
+Then you can run the script like this: 
+
+
+```bash
+python run_mlm_wwm.py \
+    --model_name_or_path roberta-base \
+    --train_file path_to_train_file \
+    --validation_file path_to_validation_file \
+    --train_ref_file path_to_train_chinese_ref_file \
+    --validation_ref_file path_to_validation_chinese_ref_file \
+    --do_train \
+    --do_eval \
+    --output_dir /tmp/test-mlm-wwm
+```
+
+**Note1:** On TPU, you should use the flag `--pad_to_max_length` to make sure all your batches have the same length.
+
+**Note2:** If you have any questions or something goes wrong when running this code, don't hesitate to ping @wlhgtc.
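To make the reference-file idea from the README above concrete, here is a minimal sketch, not the script's actual implementation. It takes character-level tokens and a set of segmented words and returns the positions that would receive a `##` prefix. The function name `continuation_positions` is hypothetical, and the word set is hard-coded here as an assumption; in practice it would come from a segmenter such as LTP.

```python
from typing import List, Set


def continuation_positions(tokens: List[str], words: Set[str]) -> List[int]:
    """Return indices of tokens that continue a multi-character word.

    Greedy longest match from left to right. `words` stands in for the
    segmenter output (an assumption made for this sketch).
    """
    if not words:
        return []
    max_len = max(len(w) for w in words)
    positions: List[int] = []
    start = 0
    while start < len(tokens):
        matched = False
        # Try the longest candidate first, down to length 2.
        for size in range(min(max_len, len(tokens) - start), 1, -1):
            if "".join(tokens[start : start + size]) in words:
                # Every token after the first one in the word is a continuation.
                positions.extend(range(start + 1, start + size))
                start += size
                matched = True
                break
        if not matched:
            start += 1
    return positions


tokens = ["我", "喜", "欢", "你"]
print(continuation_positions(tokens, {"喜欢"}))  # [2]
```

Only the index of `欢` is reported, which matches the `['我','喜','##欢','你']` result described in the README.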
--- a/examples/research_projects/mlm_wwm/requirements.txt
+++ b/examples/research_projects/mlm_wwm/requirements.txt
@@ -0,0 +1,4 @@
+datasets >= 1.1.3
+sentencepiece != 0.1.92
+protobuf
+ltp
--- a/examples/research_projects/mlm_wwm/run_chinese_ref.py
+++ b/examples/research_projects/mlm_wwm/run_chinese_ref.py
@@ -0,0 +1,147 @@
+import argparse
+import json
+from typing import List
+
+from ltp import LTP
+from transformers.models.bert.tokenization_bert import BertTokenizer
+
+
+def _is_chinese_char(cp):
+    """Checks whether CP is the codepoint of a CJK character."""
+    # This defines a "chinese character" as anything in the CJK Unicode block:
+    #   https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
+    #
+    # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
+    # despite its name. The modern Korean Hangul alphabet is a different block,
+    # as is Japanese Hiragana and Katakana. Those alphabets are used to write
+    # space-separated words, so they are not treated specially and handled
+    # like all of the other languages.
+    if (
+        (cp >= 0x4E00 and cp <= 0x9FFF)
+        or (cp >= 0x3400 and cp <= 0x4DBF)  #
+        or (cp >= 0x20000 and cp <= 0x2A6DF)  #
+        or (cp >= 0x2A700 and cp <= 0x2B73F)  #
+        or (cp >= 0x2B740 and cp <= 0x2B81F)  #
+        or (cp >= 0x2B820 and cp <= 0x2CEAF)  #
+        or (cp >= 0xF900 and cp <= 0xFAFF)
+        or (cp >= 0x2F800 and cp <= 0x2FA1F)  #
+    ):  #
+        return True
+
+    return False
+
+
+def is_chinese(word: str):
+    # word like '180' or '身高' or '神'
+    for char in word:
+        char = ord(char)
+        if not _is_chinese_char(char):
+            return 0
+    return 1
+
+
+def get_chinese_word(tokens: List[str]):
+    word_set = set()
+
+    for token in tokens:
+        chinese_word = len(token) > 1 and is_chinese(token)
+        if chinese_word:
+            word_set.add(token)
+    word_list = list(word_set)
+    return word_list
+
+
+def add_sub_symbol(bert_tokens: List[str], chinese_word_set: set()):
+    if not chinese_word_set:
+        return bert_tokens
+    max_word_len = max([len(w) for w in chinese_word_set])
+
+    bert_word = bert_tokens
+    start, end = 0, len(bert_word)
+    while start < end:
+        single_word = True
+        if is_chinese(bert_word[start]):
+            l = min(end - start, max_word_len)
+            for i in range(l, 1, -1):
+                whole_word = "".join(bert_word[start : start + i])
+                if whole_word in chinese_word_set:
+                    for j in range(start + 1, start + i):
+                        bert_word[j] = "##" + bert_word[j]
+                    start = start + i
+                    single_word = False
+                    break
+        if single_word:
+            start += 1
+    return bert_word
+
+
+def prepare_ref(lines: List[str], ltp_tokenizer: LTP, bert_tokenizer: BertTokenizer):
+    ltp_res = []
+
+    for i in range(0, len(lines), 100):
+        res = ltp_tokenizer.seg(lines[i : i + 100])[0]
+        res = [get_chinese_word(r) for r in res]
+        ltp_res.extend(res)
+    assert len(ltp_res) == len(lines)
+
+    bert_res = []
+    for i in range(0, len(lines), 100):
+        res = bert_tokenizer(lines[i : i + 100], add_special_tokens=True, truncation=True, max_length=512)
+        bert_res.extend(res["input_ids"])
+    assert len(bert_res) == len(lines)
+
+    ref_ids = []
+    for input_ids, chinese_word in zip(bert_res, ltp_res):
+
+        input_tokens = []
+        for id in input_ids:
+            token = bert_tokenizer._convert_id_to_token(id)
+            input_tokens.append(token)
+        input_tokens = add_sub_symbol(input_tokens, chinese_word)
+        ref_id = []
+        # We only save the positions of Chinese subwords that start with ##, meaning they are part of a whole word.
+        for i, token in enumerate(input_tokens):
+            if token[:2] == "##":
+                clean_token = token[2:]
+                # save chinese tokens' pos
+                if len(clean_token) == 1 and _is_chinese_char(ord(clean_token)):
+                    ref_id.append(i)
+        ref_ids.append(ref_id)
+
+    assert len(ref_ids) == len(bert_res)
+
+    return ref_ids
+
+
+def main(args):
+    # For Chinese (Ro)BERT, the best results come from RoBERTa-wwm-ext (https://github.com/ymcui/Chinese-BERT-wwm).
+    # To fine-tune these models, we have to use the same tokenizer: LTP (https://github.com/HIT-SCIR/ltp).
+    with open(args.file_name, "r", encoding="utf-8") as f:
+        data = f.readlines()
+    data = [line.strip() for line in data if len(line) > 0 and not line.isspace()]  # avoid delimiter like '\u2029'
+    ltp_tokenizer = LTP(args.ltp)  # faster in GPU device
+    bert_tokenizer = BertTokenizer.from_pretrained(args.bert)
+
+    ref_ids = prepare_ref(data, ltp_tokenizer, bert_tokenizer)
+
+    with open(args.save_path, "w", encoding="utf-8") as f:
+        data = [json.dumps(ref) + "\n" for ref in ref_ids]
+        f.writelines(data)
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="prepare_chinese_ref")
+    parser.add_argument(
+        "--file_name",
+        type=str,
+        default="./resources/chinese-demo.txt",
+        help="file to process, same as the training data for the LM",
+    )
+    parser.add_argument(
+        "--ltp", type=str, default="./resources/ltp", help="resources for LTP tokenizer, usually a path"
+    )
+    parser.add_argument("--bert", type=str, default="./resources/robert", help="resources for Bert tokenizer")
+    parser.add_argument("--save_path", type=str, default="./resources/ref.txt", help="path to save res")
+
+    args = parser.parse_args()
+    main(args)
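The script above writes one JSON list of positions per input line. As a hedged illustration of how such a file might be consumed downstream, the sketch below reads it back and groups token indices so that each group could be masked as a unit. The helper names `load_refs` and `group_whole_words` are hypothetical and not part of this diff, and the grouping rule is an assumption about how the positions are meant to be used.

```python
import json
import os
import tempfile
from typing import List


def load_refs(path: str) -> List[List[int]]:
    """Read one JSON list of continuation positions per line."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def group_whole_words(num_tokens: int, ref: List[int]) -> List[List[int]]:
    """Group token indices; a position listed in `ref` joins the previous group."""
    continuation = set(ref)
    groups: List[List[int]] = []
    for i in range(num_tokens):
        if i in continuation and groups:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups


# Write a tiny reference file in the same line-per-example layout.
with tempfile.TemporaryDirectory() as tmp:
    ref_path = os.path.join(tmp, "ref.txt")
    with open(ref_path, "w", encoding="utf-8") as f:
        f.write(json.dumps([3]) + "\n")
    refs = load_refs(ref_path)

# Tokens: [CLS] 我 喜 欢 你 [SEP]; position 3 ("欢") continues "喜".
print(group_whole_words(6, refs[0]))  # [[0], [1], [2, 3], [4], [5]]
```

Positions are counted with the special tokens included, because the script tokenizes with `add_special_tokens=True`.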
--- a/examples/research_projects/mlm_wwm/run_mlm_wwm.py
+++ b/examples/research_projects/mlm_wwm/run_mlm_wwm.py
@@ -44,7 +44,7 @@ from transformers import (
    TrainingArguments,
    set_seed,
 )
-from transformers.trainer_utils import is_main_process
+from transformers.trainer_utils import get_last_checkpoint, is_main_process


 logger = logging.getLogger(__name__)
@@ -184,23 +184,28 @@ def main():
    else:
        model_args, data_args, training_args = parser.parse_args_into_dataclasses()

-    if (
-        os.path.exists(training_args.output_dir)
-        and os.listdir(training_args.output_dir)
-        and training_args.do_train
-        and not training_args.overwrite_output_dir
-    ):
-        raise ValueError(
-            f"Output directory ({training_args.output_dir}) already exists and is not empty."
-            "Use --overwrite_output_dir to overcome."
-        )
+    # Detecting last checkpoint.
+    last_checkpoint = None
+    if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
+        last_checkpoint = get_last_checkpoint(training_args.output_dir)
+        if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
+            raise ValueError(
+                f"Output directory ({training_args.output_dir}) already exists and is not empty. "
+                "Use --overwrite_output_dir to overcome."
+            )
+        elif last_checkpoint is not None:
+            logger.info(
+                f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
+                "the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
+            )

    # Setup logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
-        level=logging.INFO if is_main_process(training_args.local_rank) else logging.WARN,
+        handlers=[logging.StreamHandler(sys.stdout)],
    )
+    logger.setLevel(logging.INFO if is_main_process(training_args.local_rank) else logging.WARN)

    # Log on each process the small summary:
    logger.warning(
@@ -332,6 +337,10 @@ def main():
        tokenized_datasets["validation"] = add_chinese_references(
            tokenized_datasets["validation"], data_args.validation_ref_file
        )
+    # If we have ref files, keep the Trainer from removing their column as unused
+    has_ref = data_args.train_ref_file or data_args.validation_ref_file
+    if has_ref:
+        training_args.remove_unused_columns = False

    # Data collator
    # This one will take care of randomly masking the tokens.
@@ -349,12 +358,13 @@ def main():

    # Training
    if training_args.do_train:
-        model_path = (
-            model_args.model_name_or_path
-            if (model_args.model_name_or_path is not None and os.path.isdir(model_args.model_name_or_path))
-            else None
-        )
-        train_result = trainer.train(model_path=model_path)
+        if last_checkpoint is not None:
+            checkpoint = last_checkpoint
+        elif model_args.model_name_or_path is not None and os.path.isdir(model_args.model_name_or_path):
+            checkpoint = model_args.model_name_or_path
+        else:
+            checkpoint = None
+        train_result = trainer.train(resume_from_checkpoint=checkpoint)
        trainer.save_model()  # Saves the tokenizer too for easy upload

        output_train_file = os.path.join(training_args.output_dir, "train_results.txt")
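The checkpoint handling in the diff above can be sketched without the library. This is an illustration under stated assumptions, not the `get_last_checkpoint` implementation: the stand-in `find_last_checkpoint` assumes checkpoints live in subfolders named `checkpoint-<step>`, and `resolve_resume` is a hypothetical helper that mirrors the three outcomes (resume, refuse a non-empty directory, or start fresh).

```python
import os
import re
import tempfile
from typing import Optional

# Assumption: checkpoint folders are named "checkpoint-<global step>".
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(folder: str) -> Optional[str]:
    """Hypothetical stand-in: return the checkpoint subfolder with the highest step."""
    candidates = [
        name
        for name in os.listdir(folder)
        if _CHECKPOINT_RE.match(name) and os.path.isdir(os.path.join(folder, name))
    ]
    if not candidates:
        return None
    latest = max(candidates, key=lambda name: int(_CHECKPOINT_RE.match(name).group(1)))
    return os.path.join(folder, latest)


def resolve_resume(output_dir: str, do_train: bool, overwrite: bool) -> Optional[str]:
    """Resume from a checkpoint, fail on a non-empty directory, or start fresh."""
    if not (os.path.isdir(output_dir) and do_train and not overwrite):
        return None
    last = find_last_checkpoint(output_dir)
    if last is None and os.listdir(output_dir):
        raise ValueError(f"Output directory ({output_dir}) already exists and is not empty.")
    return last


with tempfile.TemporaryDirectory() as out:
    for step in (500, 1000, 90):
        os.mkdir(os.path.join(out, f"checkpoint-{step}"))
    resumed = resolve_resume(out, do_train=True, overwrite=False)
    print(os.path.basename(resumed))  # checkpoint-1000
```

Steps are compared as integers, so `checkpoint-1000` wins over `checkpoint-90` even though it sorts earlier as a string.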
--- a/examples/seq2seq/README.md
+++ b/examples/seq2seq/README.md
@@ -22,14 +22,98 @@ For deprecated `bertabs` instructions, see [`bertabs/README.md`](https://github.

 ### Supported Architectures

- `BartForConditionalGeneration` (and anything that inherits from it)
+- `BartForConditionalGeneration`
 - `MarianMTModel`
 - `PegasusForConditionalGeneration`
 - `MBartForConditionalGeneration`
 - `FSMTForConditionalGeneration`
 - `T5ForConditionalGeneration`

-## Datasets
+This directory is in a somewhat messy state and is undergoing some cleanup; please bear with us in the meantime :-) Here are the instructions to use the new and old scripts for fine-tuning sequence-to-sequence models.
+
+## New script
+
+The new script for fine-tuning a model on a summarization or translation task is `run_seq2seq.py`. It is a lightweight example of how to download and preprocess a dataset from the [🤗 Datasets](https://github.com/huggingface/datasets) library or use your own files (json or csv), then fine-tune one of the architectures above on it.
+
+Here is an example on a summarization task:
+```bash
+python examples/seq2seq/run_seq2seq.py \
+    --model_name_or_path t5-small \
+    --do_train \
+    --do_eval \
+    --task summarization \
+    --dataset_name xsum \
+    --output_dir ~/tmp/tst-summarization \
+    --per_device_train_batch_size=4 \
+    --per_device_eval_batch_size=4 \
+    --overwrite_output_dir \
+    --predict_with_generate
+```
+
+And here is how you would use it on your own files (replace `path_to_csv_or_json_file`, `text_column_name` and `summary_column_name` by the relevant values):
+```bash
+python examples/seq2seq/run_seq2seq.py \
+    --model_name_or_path t5-small \
+    --do_train \
+    --do_eval \
+    --task summarization \
+    --train_file path_to_csv_or_json_file \
+    --validation_file path_to_csv_or_json_file \
+    --output_dir ~/tmp/tst-summarization \
+    --overwrite_output_dir \
+    --per_device_train_batch_size=4 \
+    --per_device_eval_batch_size=4 \
+    --predict_with_generate \
+    --text_column text_column_name \
+    --summary_column summary_column_name 
+```
+The training and validation files should have a column for the input texts and a column for the summaries.
+
+Here is an example of a translation fine-tuning:
+```bash
+python examples/seq2seq/run_seq2seq.py \
+    --model_name_or_path sshleifer/student_marian_en_ro_6_1 \
+    --do_train \
+    --do_eval \
+    --task translation_en_to_ro \
+    --dataset_name wmt16 \
+    --dataset_config_name ro-en \
+    --source_lang en_XX \
+    --target_lang ro_RO \
+    --output_dir ~/tmp/tst-translation \
+    --per_device_train_batch_size=4 \
+    --per_device_eval_batch_size=4 \
+    --overwrite_output_dir \
+    --predict_with_generate
+```
+
+And here is how you would use it on your own files (replace `path_to_json_file` by the relevant values):
+```bash
+python examples/seq2seq/run_seq2seq.py \
+    --model_name_or_path sshleifer/student_marian_en_ro_6_1 \
+    --do_train \
+    --do_eval \
+    --task translation_en_to_ro \
+    --dataset_name wmt16 \
+    --dataset_config_name ro-en \
+    --source_lang en_XX \
+    --target_lang ro_RO \
+    --train_file path_to_json_file \
+    --validation_file path_to_json_file \
+    --output_dir ~/tmp/tst-translation \
+    --per_device_train_batch_size=4 \
+    --per_device_eval_batch_size=4 \
+    --overwrite_output_dir \
+    --predict_with_generate
+```
+Here the files are expected to be JSON files, with each input being a dictionary with a key `"translation"` containing one key per language (here `"en"` and `"ro"`).
+
+## Old script
+
+The new script is very new and hasn't been widely tested yet. It also lacks a few features offered by the old
+script, which is why we are leaving the old script here for now.
+
+### Download the Datasets

 #### XSUM
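The translation instructions in the seq2seq README above say that each input in a custom file is a dictionary with a `"translation"` key holding one key per language. A small sketch of producing such a file follows. The sentence pairs are made-up toy data, and writing one JSON object per line is an assumption on my part, since the README only says the files are JSON.

```python
import json
import os
import tempfile

# Hypothetical toy pairs; real data would come from your own corpus.
pairs = [
    ("Hello, world.", "Salut, lume."),
    ("Thank you.", "Mulțumesc."),
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.json")
    # Assumption: one JSON object per line (JSON Lines layout).
    with open(path, "w", encoding="utf-8") as f:
        for en, ro in pairs:
            record = {"translation": {"en": en, "ro": ro}}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

    # Read the file back to check the structure of each record.
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f]

print(records[0]["translation"]["en"])  # Hello, world.
```

The language keys (`"en"`, `"ro"`) follow the README's example; other language pairs would use their own codes.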

--- a/examples/seq2seq/finetune_trainer.py
+++ b/examples/seq2seq/finetune_trainer.py
@@ -20,14 +20,15 @@ from dataclasses import dataclass, field
 from typing import Optional

 import transformers
+from seq2seq_trainer import Seq2SeqTrainer
+from seq2seq_training_args import Seq2SeqTrainingArguments
 from transformers import (
    AutoConfig,
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    HfArgumentParser,
    MBartTokenizer,
-    Seq2SeqTrainer,
-    Seq2SeqTrainingArguments,
+    MBartTokenizerFast,
    set_seed,
 )
 from transformers.trainer_utils import EvaluationStrategy, is_main_process
@@ -174,11 +175,11 @@ def main():
        bool(training_args.parallel_mode == ParallelMode.DISTRIBUTED),
        training_args.fp16,
    )
+    transformers.utils.logging.enable_default_handler()
+    transformers.utils.logging.enable_explicit_format()
    # Set the verbosity to info of the Transformers logger (on main process only):
    if is_main_process(training_args.local_rank):
        transformers.utils.logging.set_verbosity_info()
-        transformers.utils.logging.enable_default_handler()
-        transformers.utils.logging.enable_explicit_format()
    logger.info("Training/evaluation parameters %s", training_args)

    # Set seed
@@ -220,11 +221,14 @@ def main():
        data_args.eval_beams = model.config.num_beams

    # set decoder_start_token_id for MBart
-    if model.config.decoder_start_token_id is None and isinstance(tokenizer, MBartTokenizer):
+    if model.config.decoder_start_token_id is None and isinstance(tokenizer, (MBartTokenizer, MBartTokenizerFast)):
        assert (
            data_args.tgt_lang is not None and data_args.src_lang is not None
        ), "mBart requires --tgt_lang and --src_lang"
-        model.config.decoder_start_token_id = tokenizer.lang_code_to_id[data_args.tgt_lang]
+        if isinstance(tokenizer, MBartTokenizer):
+            model.config.decoder_start_token_id = tokenizer.lang_code_to_id[data_args.tgt_lang]
+        else:
+            model.config.decoder_start_token_id = tokenizer.convert_tokens_to_ids(data_args.tgt_lang)

    if model_args.freeze_embeds:
        freeze_embeds(model)
@@ -282,9 +286,12 @@ def main():
    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
+        data_args=data_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
-        data_collator=Seq2SeqDataCollator(tokenizer, data_args, training_args.tpu_num_cores),
+        data_collator=Seq2SeqDataCollator(
+            tokenizer, data_args, model.config.decoder_start_token_id, training_args.tpu_num_cores
+        ),
        compute_metrics=compute_metrics_fn,
        tokenizer=tokenizer,
    )
@@ -317,9 +324,7 @@ def main():
    if training_args.do_eval:
        logger.info("*** Evaluate ***")

-        metrics = trainer.evaluate(
-            metric_key_prefix="val", max_length=data_args.val_max_target_length, num_beams=data_args.eval_beams
-        )
+        metrics = trainer.evaluate(metric_key_prefix="val")
        metrics["val_n_objs"] = data_args.n_val
        metrics["val_loss"] = round(metrics["val_loss"], 4)

@@ -331,12 +336,7 @@ def main():
    if training_args.do_predict:
        logger.info("*** Predict ***")

-        test_output = trainer.predict(
-            test_dataset=test_dataset,
-            metric_key_prefix="test",
-            max_length=data_args.val_max_target_length,
-            num_beams=data_args.eval_beams,
-        )
+        test_output = trainer.predict(test_dataset=test_dataset, metric_key_prefix="test")
        metrics = test_output.metrics
        metrics["test_n_objs"] = data_args.n_test

--- a/examples/seq2seq/run_seq2seq.py
+++ b/examples/seq2seq/run_seq2seq.py
@@ -0,0 +1,537 @@
+# coding=utf-8
+# Copyright The HuggingFace Team and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Fine-tuning the library models for sequence to sequence.
+"""
+# You can also adapt this script to your own sequence-to-sequence task. Pointers for this are left as comments.
+
+import logging
+import os
+import re
+import sys
+from dataclasses import dataclass, field
+from typing import Optional
+
+import numpy as np
+from datasets import load_dataset, load_metric
+
+import transformers
+from transformers import (
+    AutoConfig,
+    AutoModelForSeq2SeqLM,
+    AutoTokenizer,
+    DataCollatorForSeq2Seq,
+    HfArgumentParser,
+    MBartTokenizer,
+    Seq2SeqTrainer,
+    Seq2SeqTrainingArguments,
+    default_data_collator,
+    set_seed,
+)
+from transformers.trainer_utils import get_last_checkpoint, is_main_process
+
+
+logger = logging.getLogger(__name__)
+
+
+@dataclass
+class ModelArguments:
+    """
+    Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
+    """
+
+    model_name_or_path: str = field(
+        metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
+    )
+    config_name: Optional[str] = field(
+        default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
+    )
+    tokenizer_name: Optional[str] = field(
+        default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
+    )
+    cache_dir: Optional[str] = field(
+        default=None,
+        metadata={"help": "Where to store the pretrained models downloaded from huggingface.co"},
+    )
+    use_fast_tokenizer: bool = field(
+        default=True,
+        metadata={"help": "Whether to use one of the fast tokenizers (backed by the tokenizers library) or not."},
+    )
+    model_revision: str = field(
+        default="main",
+        metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
+    )
+    use_auth_token: bool = field(
+        default=False,
+        metadata={
+            "help": "Will use the token generated when running `transformers-cli login` (necessary to use this script "
+            "with private models)."
+        },
+    )
+
+
+@dataclass
+class DataTrainingArguments:
+    """
+    Arguments pertaining to what data we are going to input to our model for training and eval.
+    """
+
+    task: str = field(
+        default="summarization",
+        metadata={
+            "help": "The name of the task, should be summarization (or summarization_{dataset} for evaluating "
+            "pegasus) or translation (or translation_{xx}_to_{yy})."
+        },
+    )
+    dataset_name: Optional[str] = field(
+        default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."}
+    )
+    dataset_config_name: Optional[str] = field(
+        default=None, metadata={"help": "The configuration name of the dataset to use (via the datasets library)."}
+    )
+    text_column: Optional[str] = field(
+        default=None,
+        metadata={"help": "The name of the column in the datasets containing the full texts (for summarization)."},
+    )
+    summary_column: Optional[str] = field(
+        default=None,
+        metadata={"help": "The name of the column in the datasets containing the summaries (for summarization)."},
+    )
+    train_file: Optional[str] = field(default=None, metadata={"help": "The input training data file (csv or json)."})
+    validation_file: Optional[str] = field(
+        default=None,
+        metadata={"help": "An optional input evaluation data file to compute the metrics on (a csv or json file)."},
+    )
+    overwrite_cache: bool = field(
+        default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
+    )
+    preprocessing_num_workers: Optional[int] = field(
+        default=None,
+        metadata={"help": "The number of processes to use for the preprocessing."},
+    )
+    max_source_length: Optional[int] = field(
+        default=1024,
+        metadata={
+            "help": "The maximum total input sequence length after tokenization. Sequences longer "
+            "than this will be truncated, sequences shorter will be padded."
+        },
+    )
+    max_target_length: Optional[int] = field(
+        default=128,
+        metadata={
+            "help": "The maximum total sequence length for target text after tokenization. Sequences longer "
+            "than this will be truncated, sequences shorter will be padded."
+        },
+    )
+    val_max_target_length: Optional[int] = field(
+        default=None,
+        metadata={
+            "help": "The maximum total sequence length for validation target text after tokenization. Sequences longer "
+            "than this will be truncated, sequences shorter will be padded. Will default to `max_target_length`. "
+            "This argument is also used to override the ``max_length`` param of ``model.generate``, which is used "
+            "during ``evaluate`` and ``predict``."
+        },
+    )
+    pad_to_max_length: bool = field(
+        default=False,
+        metadata={
+            "help": "Whether to pad all samples to model maximum sentence length. "
+            "If False, will pad the samples dynamically when batching to the maximum length in the batch. More "
+            "efficient on GPU but very bad for TPU."
+        },
+    )
+    max_train_samples: Optional[int] = field(
+        default=None,
+        metadata={
+            "help": "For debugging purposes or quicker training, truncate the number of training examples to this "
+            "value if set."
+        },
+    )
+    max_val_samples: Optional[int] = field(
+        default=None,
+        metadata={
+            "help": "For debugging purposes or quicker training, truncate the number of validation examples to this "
+            "value if set."
+        },
+    )
+    source_lang: Optional[str] = field(default=None, metadata={"help": "Source language id for translation."})
+    target_lang: Optional[str] = field(default=None, metadata={"help": "Target language id for translation."})
+    eval_beams: Optional[int] = field(default=None, metadata={"help": "Number of beams to use for evaluation."})
+    ignore_pad_token_for_loss: bool = field(
+        default=True,
+        metadata={
+            "help": "Whether to ignore the tokens corresponding to padded labels in the loss computation or not."
+        },
+    )
+    source_prefix: Optional[str] = field(
+        default=None, metadata={"help": "A prefix to add before every source text (useful for T5 models)."}
+    )
+
+    def __post_init__(self):
+        if self.dataset_name is None and self.train_file is None and self.validation_file is None:
+            raise ValueError("Need either a dataset name or a training/validation file.")
+        else:
+            if self.train_file is not None:
+                extension = self.train_file.split(".")[-1]
+                assert extension in ["csv", "json"], "`train_file` should be a csv or a json file."
+            if self.validation_file is not None:
+                extension = self.validation_file.split(".")[-1]
+                assert extension in ["csv", "json"], "`validation_file` should be a csv or a json file."
+        if not self.task.startswith("summarization") and not self.task.startswith("translation"):
+            raise ValueError(
+                "`task` should be summarization, summarization_{dataset}, translation or translation_{xx}_to_{yy}."
+            )
+        if self.val_max_target_length is None:
+            self.val_max_target_length = self.max_target_length
+
+
+summarization_name_mapping = {
+    "amazon_reviews_multi": ("review_body", "review_title"),
+    "big_patent": ("description", "abstract"),
+    "cnn_dailymail": ("article", "highlights"),
+    "orange_sum": ("text", "summary"),
+    "pn_summary": ("article", "summary"),
+    "psc": ("extract_text", "summary_text"),
+    "samsum": ("dialogue", "summary"),
+    "thaisum": ("body", "summary"),
+    "xglue": ("news_body", "news_title"),
+    "xsum": ("document", "summary"),
+    "wiki_summary": ("article", "highlights"),
+}
+
+
+def main():
+    # See all possible arguments in src/transformers/training_args.py
+    # or by passing the --help flag to this script.
+    # We now keep distinct sets of args, for a cleaner separation of concerns.
+
+    parser = HfArgumentParser((ModelArguments, DataTrainingArguments, Seq2SeqTrainingArguments))
+    if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
+        # If we pass only one argument to the script and it's the path to a json file,
+        # let's parse it to get our arguments.
+        model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))
+    else:
+        model_args, data_args, training_args = parser.parse_args_into_dataclasses()
+
+    # Detecting last checkpoint.
+    last_checkpoint = None
+    if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
+        last_checkpoint = get_last_checkpoint(training_args.output_dir)
+        if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
+            raise ValueError(
+                f"Output directory ({training_args.output_dir}) already exists and is not empty. "
+                "Use --overwrite_output_dir to overcome."
+            )
+        elif last_checkpoint is not None:
+            logger.info(
+                f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
+                "the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
+            )
+
+    # Setup logging
+    logging.basicConfig(
+        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
+        datefmt="%m/%d/%Y %H:%M:%S",
+        handlers=[logging.StreamHandler(sys.stdout)],
+    )
+    logger.setLevel(logging.INFO if is_main_process(training_args.local_rank) else logging.WARN)
+
+    # Log on each process the small summary:
+    logger.warning(
+        f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
+        + f", distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
+    )
+    # Set the verbosity to info of the Transformers logger (on main process only):
+    if is_main_process(training_args.local_rank):
+        transformers.utils.logging.set_verbosity_info()
+    logger.info("Training/evaluation parameters %s", training_args)
+
+    # Set seed before initializing model.
+    set_seed(training_args.seed)
+
+    # Get the datasets: you can either provide your own CSV/JSON training and evaluation files (see below)
+    # or just provide the name of one of the public datasets available on the hub at https://huggingface.co/datasets/
+    # (the dataset will be downloaded automatically from the datasets Hub).
+    #
+    # For CSV/JSON files in the summarization task, this script will use the first column for the full texts and the
+    # second column for the summaries (unless you specify column names for this with the `text_column` and
+    # `summary_column` arguments).
+    # For translation, only JSON files are supported, with one field named "translation" containing two keys for the
+    # source and target languages (unless you adapt what follows).
+    #
+    # In distributed training, the load_dataset function guarantees that only one local process can concurrently
+    # download the dataset.
+    if data_args.dataset_name is not None:
+        # Downloading and loading a dataset from the hub.
+        datasets = load_dataset(data_args.dataset_name, data_args.dataset_config_name)
+    else:
+        data_files = {}
+        if data_args.train_file is not None:
+            data_files["train"] = data_args.train_file
+            extension = data_args.train_file.split(".")[-1]
+        if data_args.validation_file is not None:
+            data_files["validation"] = data_args.validation_file
+            extension = data_args.validation_file.split(".")[-1]
+        datasets = load_dataset(extension, data_files=data_files)
+    # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at
+    # https://huggingface.co/docs/datasets/loading_datasets.html.
+
+    # Load pretrained model and tokenizer
+    #
+    # Distributed training:
+    # The .from_pretrained methods guarantee that only one local process can concurrently
+    # download model & vocab.
+    config = AutoConfig.from_pretrained(
+        model_args.config_name if model_args.config_name else model_args.model_name_or_path,
+        cache_dir=model_args.cache_dir,
+        revision=model_args.model_revision,
+        use_auth_token=True if model_args.use_auth_token else None,
+    )
+    tokenizer = AutoTokenizer.from_pretrained(
+        model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
+        cache_dir=model_args.cache_dir,
+        use_fast=model_args.use_fast_tokenizer,
+        revision=model_args.model_revision,
+        use_auth_token=True if model_args.use_auth_token else None,
+    )
+    model = AutoModelForSeq2SeqLM.from_pretrained(
+        model_args.model_name_or_path,
+        from_tf=bool(".ckpt" in model_args.model_name_or_path),
+        config=config,
+        cache_dir=model_args.cache_dir,
+        revision=model_args.model_revision,
+        use_auth_token=True if model_args.use_auth_token else None,
+    )
+
+    # Set decoder_start_token_id
+    if model.config.decoder_start_token_id is None and isinstance(tokenizer, MBartTokenizer):
+        model.config.decoder_start_token_id = tokenizer.lang_code_to_id[data_args.target_lang]
+    if model.config.decoder_start_token_id is None:
+        raise ValueError("Make sure that `config.decoder_start_token_id` is correctly defined")
+
+    # Get the default prefix if None is passed.
+    if data_args.source_prefix is None:
+        task_specific_params = model.config.task_specific_params
+        if task_specific_params is not None:
+            prefix = task_specific_params.get("prefix", "")
+        else:
+            prefix = ""
+    else:
+        prefix = data_args.source_prefix
+
+    # Preprocessing the datasets.
+    # We need to tokenize inputs and targets.
+    if training_args.do_train:
+        column_names = datasets["train"].column_names
+    else:
+        column_names = datasets["validation"].column_names
+
+    # For translation we set the codes of our source and target languages (only useful for mBART, the others will
+    # ignore those attributes).
+    if data_args.task.startswith("translation"):
+        if data_args.source_lang is not None:
+            tokenizer.src_lang = data_args.source_lang
+        if data_args.target_lang is not None:
+            tokenizer.tgt_lang = data_args.target_lang
+
+    # To serialize preprocess_function below, each of those four variables needs to be defined (even if we won't use
+    # them all).
+    source_lang, target_lang, text_column, summary_column = None, None, None, None
+
+    if data_args.task.startswith("summarization"):
+        # Get the column names for input/target.
+        dataset_columns = summarization_name_mapping.get(data_args.dataset_name, None)
+        if data_args.text_column is None:
+            text_column = dataset_columns[0] if dataset_columns is not None else column_names[0]
+        else:
+            text_column = data_args.text_column
+        if data_args.summary_column is None:
+            summary_column = dataset_columns[1] if dataset_columns is not None else column_names[1]
+        else:
+            summary_column = data_args.summary_column
+    else:
+        # Get the language codes for input/target.
+        lang_search = re.match("translation_([a-z]+)_to_([a-z]+)", data_args.task)
+        if data_args.source_lang is not None:
+            source_lang = data_args.source_lang.split("_")[0]
+        else:
+            assert (
+                lang_search is not None
+            ), "Provide a source language via --source_lang or rename your task 'translation_xx_to_yy'."
+            source_lang = lang_search.groups()[0]
+
+        if data_args.target_lang is not None:
+            target_lang = data_args.target_lang.split("_")[0]
+        else:
+            assert (
+                lang_search is not None
+            ), "Provide a target language via --target_lang or rename your task 'translation_xx_to_yy'."
+            target_lang = lang_search.groups()[1]
+
+    # Temporarily set max_target_length for training.
+    max_target_length = data_args.max_target_length
+    padding = "max_length" if data_args.pad_to_max_length else False
+
+    def preprocess_function(examples):
+        if data_args.task.startswith("translation"):
+            inputs = [ex[source_lang] for ex in examples["translation"]]
+            targets = [ex[target_lang] for ex in examples["translation"]]
+        else:
+            inputs = examples[text_column]
+            targets = examples[summary_column]
+        inputs = [prefix + inp for inp in inputs]
+        model_inputs = tokenizer(inputs, max_length=data_args.max_source_length, padding=padding, truncation=True)
+
+        # Setup the tokenizer for targets
+        with tokenizer.as_target_tokenizer():
+            labels = tokenizer(targets, max_length=max_target_length, padding=padding, truncation=True)
+
+        # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
+        # padding in the loss.
+        if padding == "max_length" and data_args.ignore_pad_token_for_loss:
+            labels["input_ids"] = [
+                [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
+            ]
+
+        model_inputs["labels"] = labels["input_ids"]
+        return model_inputs
+
+    if training_args.do_train:
+        train_dataset = datasets["train"]
+        if data_args.max_train_samples is not None:
+            train_dataset = train_dataset.select(range(data_args.max_train_samples))
+        train_dataset = train_dataset.map(
+            preprocess_function,
+            batched=True,
+            num_proc=data_args.preprocessing_num_workers,
+            remove_columns=column_names,
+            load_from_cache_file=not data_args.overwrite_cache,
+        )
+
+    if training_args.do_eval:
+        max_target_length = data_args.val_max_target_length
+        eval_dataset = datasets["validation"]
+        if data_args.max_val_samples is not None:
+            eval_dataset = eval_dataset.select(range(data_args.max_val_samples))
+        eval_dataset = eval_dataset.map(
+            preprocess_function,
+            batched=True,
+            num_proc=data_args.preprocessing_num_workers,
+            remove_columns=column_names,
+            load_from_cache_file=not data_args.overwrite_cache,
+        )
+
+    # Data collator
+    label_pad_token_id = -100 if data_args.ignore_pad_token_for_loss else tokenizer.pad_token_id
+    if data_args.pad_to_max_length:
+        data_collator = default_data_collator
+    else:
+        data_collator = DataCollatorForSeq2Seq(
+            tokenizer,
+            label_pad_token_id=label_pad_token_id,
+            pad_to_multiple_of=8 if training_args.fp16 else None,
+        )
+
+    # Metric
+    metric_name = "rouge" if data_args.task.startswith("summarization") else "sacrebleu"
+    metric = load_metric(metric_name)
+
+    def compute_metrics(eval_preds):
+        preds, labels = eval_preds
+        if isinstance(preds, tuple):
+            preds = preds[0]
+        decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
+        if data_args.ignore_pad_token_for_loss:
+            # Replace -100 in the labels as we can't decode them.
+            labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
+        decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
+
+        # Some simple post-processing
+        decoded_preds = [pred.strip() for pred in decoded_preds]
+        decoded_labels = [label.strip() for label in decoded_labels]
+        if metric_name == "sacrebleu":
+            decoded_labels = [[label] for label in decoded_labels]
+
+        result = metric.compute(predictions=decoded_preds, references=decoded_labels)
+
+        # Extract a few results from ROUGE
+        if metric_name == "rouge":
+            result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
+        else:
+            result = {"bleu": result["score"]}
+
+        prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
+        result["gen_len"] = np.mean(prediction_lens)
+
+        return result
+
+    # Initialize our Trainer
+    trainer = Seq2SeqTrainer(
+        model=model,
+        args=training_args,
+        train_dataset=train_dataset if training_args.do_train else None,
+        eval_dataset=eval_dataset if training_args.do_eval else None,
+        tokenizer=tokenizer,
+        data_collator=data_collator,
+        compute_metrics=compute_metrics if training_args.predict_with_generate else None,
+    )
+
+    # Training
+    if training_args.do_train:
+        if last_checkpoint is not None:
+            checkpoint = last_checkpoint
+        elif os.path.isdir(model_args.model_name_or_path):
+            checkpoint = model_args.model_name_or_path
+        else:
+            checkpoint = None
+        train_result = trainer.train(resume_from_checkpoint=checkpoint)
+        trainer.save_model()  # Saves the tokenizer too for easy upload
+
+        output_train_file = os.path.join(training_args.output_dir, "train_results.txt")
+        if trainer.is_world_process_zero():
+            with open(output_train_file, "w") as writer:
+                logger.info("***** Train results *****")
+                for key, value in sorted(train_result.metrics.items()):
+                    logger.info(f"  {key} = {value}")
+                    writer.write(f"{key} = {value}\n")
+
+            # Need to save the state, since Trainer.save_model saves only the tokenizer with the model
+            trainer.state.save_to_json(os.path.join(training_args.output_dir, "trainer_state.json"))
+
+    # Evaluation
+    results = {}
+    if training_args.do_eval:
+        logger.info("*** Evaluate ***")
+
+        results = trainer.evaluate()
+
+        output_eval_file = os.path.join(training_args.output_dir, "eval_results_seq2seq.txt")
+        if trainer.is_world_process_zero():
+            with open(output_eval_file, "w") as writer:
+                logger.info("***** Eval results *****")
+                for key, value in sorted(results.items()):
+                    logger.info(f"  {key} = {value}")
+                    writer.write(f"{key} = {value}\n")
+
+    return results
+
+
+def _mp_fn(index):
+    # For xla_spawn (TPUs)
+    main()
+
+
+if __name__ == "__main__":
+    main()
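Note on the new `run_seq2seq.py` above: the script replaces padding ids in the labels with -100 so that padded positions are ignored in the loss, and `compute_metrics` swaps the pad id back in before decoding. The following is only a minimal sketch of that round trip, using plain Python lists instead of tokenizer output; the pad id of 0 and both helper names are hypothetical and not part of the script.

```python
IGNORE_INDEX = -100  # value the loss ignores, as used in the script above
PAD_TOKEN_ID = 0     # hypothetical pad id; the script reads tokenizer.pad_token_id


def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Replace pad ids with IGNORE_INDEX so they do not contribute to the loss."""
    return [[(tok if tok != pad_token_id else IGNORE_INDEX) for tok in seq] for seq in label_ids]


def unmask_padding(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Put pad ids back so the sequences can be decoded again."""
    return [[(tok if tok != IGNORE_INDEX else pad_token_id) for tok in seq] for seq in label_ids]


# Two label sequences padded to length 5 with the hypothetical pad id.
labels = [[5, 6, 7, 0, 0], [8, 9, 0, 0, 0]]
masked = mask_padding(labels)
restored = unmask_padding(masked)
print(masked)
print(restored == labels)
```

The round trip is only lossless when the pad id never appears as a real target token, which is the usual assumption for these tokenizers.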
--- a/examples/seq2seq/seq2seq_trainer.py
+++ b/examples/seq2seq/seq2seq_trainer.py
@@ -60,7 +60,7 @@ class Seq2SeqTrainer(Trainer):
            assert isinstance(
                self.model, PreTrainedModel
            ), f"If no `config` is passed the model to be trained has to be of type `PreTrainedModel`, but is {self.model.__class__}"
-            self.config = self._actual_model(self.model).config
+            self.config = self.model.config
        else:
            self.config = config

--- a/examples/seq2seq/test_finetune_trainer.py
+++ b/examples/seq2seq/test_finetune_trainer.py
@@ -112,6 +112,11 @@ class TestFinetuneTrainer(TestCasePlus):
    def test_finetune_trainer_deepspeed(self):
        self.finetune_trainer_quick(deepspeed=True)

+    @require_torch_multi_gpu
+    @require_deepspeed
+    def test_finetune_trainer_deepspeed_grad_acum(self):
+        self.finetune_trainer_quick(deepspeed=True, extra_args_str="--gradient_accumulation_steps 2")
+
    @slow
    def test_finetune_trainer_slow(self):
        # There is a missing call to __init__process_group somewhere
@@ -169,7 +174,7 @@ class TestFinetuneTrainer(TestCasePlus):
            --logging_steps 0
            --save_steps {str(eval_steps)}
            --eval_steps {str(eval_steps)}
-            --sortish_sampler
+            --group_by_length
            --label_smoothing_factor 0.1
            --adafactor
            --task translation
--- a/examples/seq2seq/utils.py
+++ b/examples/seq2seq/utils.py
@@ -33,8 +33,9 @@ from torch import nn
 from torch.utils.data import Dataset, Sampler

 from sentence_splitter import add_newline_to_end_of_each_sentence
-from transformers import BartTokenizer, EvalPrediction, PreTrainedTokenizer
+from transformers import BartTokenizer, EvalPrediction, PreTrainedTokenizer, T5Tokenizer
 from transformers.file_utils import cached_property
+from transformers.models.bart.modeling_bart import shift_tokens_right


 try:
@@ -274,9 +275,10 @@ class Seq2SeqDataset(AbstractSeq2SeqDataset):


 class Seq2SeqDataCollator:
-    def __init__(self, tokenizer, data_args, tpu_num_cores=None):
+    def __init__(self, tokenizer, data_args, decoder_start_token_id, tpu_num_cores=None):
        self.tokenizer = tokenizer
        self.pad_token_id = tokenizer.pad_token_id
+        self.decoder_start_token_id = decoder_start_token_id
        assert (
            self.pad_token_id is not None
        ), f"pad_token_id is not defined for ({self.tokenizer.__class__.__name__}), it must be defined."
@@ -304,9 +306,15 @@ class Seq2SeqDataCollator:
            labels = trim_batch(labels, self.pad_token_id)
            input_ids, attention_mask = trim_batch(input_ids, self.pad_token_id, attention_mask=attention_mask)

+        if isinstance(self.tokenizer, T5Tokenizer):
+            decoder_input_ids = self._shift_right_t5(labels)
+        else:
+            decoder_input_ids = shift_tokens_right(labels, self.pad_token_id, self.decoder_start_token_id)
+
        batch = {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
+            "decoder_input_ids": decoder_input_ids,
            "labels": labels,
        }
        return batch
@@ -555,7 +563,7 @@ def freeze_embeds(model):
    """Freeze token embeddings and positional embeddings for bart, just token embeddings for t5."""
    model_type = model.config.model_type

-    if model_type == "t5":
+    if model_type in ["t5", "mt5"]:
        freeze_params(model.shared)
        for d in [model.encoder, model.decoder]:
            freeze_params(d.embed_tokens)
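Note on the `Seq2SeqDataCollator` change above: the collator now builds `decoder_input_ids` by shifting the labels one position to the right and prepending the decoder start token. The sketch below illustrates that idea on plain lists. It is not the library's `shift_tokens_right` (which operates on tensors); the function name `shift_right` and the token ids are hypothetical, and replacing -100 with the pad id is an assumption about how masked labels should be handled.

```python
def shift_right(labels, pad_token_id, decoder_start_token_id):
    """Shift each label sequence one step right and prepend the start token.

    Any -100 (ignored-label marker) left in the shifted sequence is replaced
    by the pad id, since -100 is not a valid input token id.
    """
    shifted = []
    for seq in labels:
        row = [decoder_start_token_id] + list(seq[:-1])
        row = [pad_token_id if tok == -100 else tok for tok in row]
        shifted.append(row)
    return shifted


# Hypothetical ids: pad = 1, decoder start = 2.
decoder_input_ids = shift_right([[5, 6, 7, 1], [8, -100, -100, -100]], 1, 2)
print(decoder_input_ids)
```

The last label token of each row is dropped by the shift, so the decoder input at step t is the target at step t-1.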
--- a/examples/test_examples.py
+++ b/examples/test_examples.py
@@ -23,7 +23,7 @@ from unittest.mock import patch
 import torch

 from transformers.file_utils import is_apex_available
-from transformers.testing_utils import TestCasePlus, require_torch_non_multi_gpu_but_fix_me, torch_device
+from transformers.testing_utils import TestCasePlus, require_torch_non_multi_gpu_but_fix_me, slow, torch_device


 SRC_DIRS = [
@@ -35,6 +35,7 @@ SRC_DIRS = [
        "language-modeling",
        "multiple-choice",
        "question-answering",
+        "seq2seq",
    ]
 ]
 sys.path.extend(SRC_DIRS)
@@ -47,6 +48,7 @@ if SRC_DIRS is not None:
    import run_mlm
    import run_ner
    import run_qa as run_squad
+    import run_seq2seq
    import run_swag


@@ -184,7 +186,7 @@ class ExamplesTests(TestCasePlus):

        with patch.object(sys, "argv", testargs):
            result = run_ner.main()
-            self.assertGreaterEqual(result["eval_accuracy_score"], 0.75)
+            self.assertGreaterEqual(result["eval_accuracy"], 0.75)
            self.assertGreaterEqual(result["eval_precision"], 0.75)
            self.assertLess(result["eval_loss"], 0.5)

@@ -259,3 +261,67 @@ class ExamplesTests(TestCasePlus):
        with patch.object(sys, "argv", testargs + [model_type, model_name]):
            result = run_generation.main()
            self.assertGreaterEqual(len(result[0]), 10)
+
+    @slow
+    @require_torch_non_multi_gpu_but_fix_me
+    def test_run_seq2seq_summarization(self):
+        stream_handler = logging.StreamHandler(sys.stdout)
+        logger.addHandler(stream_handler)
+
+        tmp_dir = self.get_auto_remove_tmp_dir()
+        testargs = f"""
+            run_seq2seq.py
+            --model_name_or_path t5-small
+            --task summarization
+            --train_file tests/fixtures/tests_samples/xsum/sample.json
+            --validation_file tests/fixtures/tests_samples/xsum/sample.json
+            --output_dir {tmp_dir}
+            --overwrite_output_dir
+            --max_steps=50
+            --warmup_steps=8
+            --do_train
+            --do_eval
+            --learning_rate=2e-4
+            --per_device_train_batch_size=2
+            --per_device_eval_batch_size=1
+            --predict_with_generate
+        """.split()
+
+        with patch.object(sys, "argv", testargs):
+            result = run_seq2seq.main()
+
+            self.assertGreaterEqual(result["eval_rouge1"], 10)
+            self.assertGreaterEqual(result["eval_rouge2"], 2)
+            self.assertGreaterEqual(result["eval_rougeL"], 7)
+            self.assertGreaterEqual(result["eval_rougeLsum"], 7)
+
+    @slow
+    @require_torch_non_multi_gpu_but_fix_me
+    def test_run_seq2seq_translation(self):
+        stream_handler = logging.StreamHandler(sys.stdout)
+        logger.addHandler(stream_handler)
+
+        tmp_dir = self.get_auto_remove_tmp_dir()
+        testargs = f"""
+            run_seq2seq.py
+            --model_name_or_path sshleifer/student_marian_en_ro_6_1
+            --task translation_en_to_ro
+            --train_file tests/fixtures/tests_samples/wmt16/sample.json
+            --validation_file tests/fixtures/tests_samples/wmt16/sample.json
+            --output_dir {tmp_dir}
+            --overwrite_output_dir
+            --max_steps=50
+            --warmup_steps=8
+            --do_train
+            --do_eval
+            --learning_rate=3e-3
+            --per_device_train_batch_size=2
+            --per_device_eval_batch_size=1
+            --predict_with_generate
+            --source_lang en_XX
+            --target_lang ro_RO
+        """.split()
+
+        with patch.object(sys, "argv", testargs):
+            result = run_seq2seq.main()
+            self.assertGreaterEqual(result["eval_bleu"], 30)
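Note on the new tests above: they drive `run_seq2seq.main()` by temporarily replacing `sys.argv` with `patch.object`, so the script parses the test arguments as if they came from the command line. A minimal standalone sketch of that pattern follows; the `main` function and the script name `demo_script.py` are hypothetical stand-ins, not part of the test suite.

```python
import sys
from unittest.mock import patch


def main():
    # Hypothetical stand-in for a script entry point that reads sys.argv.
    args = sys.argv[1:]
    return {"n_args": len(args), "first": args[0] if args else None}


# Same style as the tests: a multi-line string split on whitespace.
testargs = """
    demo_script.py
    --task summarization
    --max_steps=50
""".split()

original_argv = list(sys.argv)
with patch.object(sys, "argv", testargs):
    result = main()

# sys.argv is restored automatically when the with-block exits.
print(result)
print(sys.argv == original_argv)
```

Because the argument string is split on whitespace, this style only works when no single argument value contains spaces.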
--- a/examples/text-classification/README.md
+++ b/examples/text-classification/README.md
@@ -147,7 +147,7 @@ This example code fine-tunes mBERT (multi-lingual BERT) on the XNLI dataset. It
 on a single tesla V100 16GB. The data for XNLI can be downloaded with the following links and should be both saved (and un-zipped) in a
 `$XNLI_DIR` directory.

-* [XNLI 1.0](https://www.nyu.edu/projects/bowman/xnli/XNLI-1.0.zip)
+* [XNLI 1.0](https://cims.nyu.edu/~sbowman/xnli/XNLI-1.0.zip)
 * [XNLI-MT 1.0](https://dl.fbaipublicfiles.com/XNLI/XNLI-MT-1.0.zip)

 ```bash
--- a/examples/text-classification/run_glue.py
+++ b/examples/text-classification/run_glue.py
@@ -30,6 +30,7 @@ from transformers import (
    AutoConfig,
    AutoModelForSequenceClassification,
    AutoTokenizer,
+    DataCollatorWithPadding,
    EvalPrediction,
    HfArgumentParser,
    PretrainedConfig,
@@ -38,7 +39,7 @@ from transformers import (
    default_data_collator,
    set_seed,
 )
-from transformers.trainer_utils import is_main_process
+from transformers.trainer_utils import get_last_checkpoint, is_main_process


 task_to_keys = {
@@ -160,23 +161,28 @@ def main():
    else:
        model_args, data_args, training_args = parser.parse_args_into_dataclasses()

-    if (
-        os.path.exists(training_args.output_dir)
-        and os.listdir(training_args.output_dir)
-        and training_args.do_train
-        and not training_args.overwrite_output_dir
-    ):
-        raise ValueError(
-            f"Output directory ({training_args.output_dir}) already exists and is not empty. "
-            "Use --overwrite_output_dir to overcome."
-        )
+    # Detecting last checkpoint.
+    last_checkpoint = None
+    if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
+        last_checkpoint = get_last_checkpoint(training_args.output_dir)
+        if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
+            raise ValueError(
+                f"Output directory ({training_args.output_dir}) already exists and is not empty. "
+                "Use --overwrite_output_dir to overcome."
+            )
+        elif last_checkpoint is not None:
+            logger.info(
+                f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
+                "the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
+            )

    # Setup logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
-        level=logging.INFO if is_main_process(training_args.local_rank) else logging.WARN,
+        handlers=[logging.StreamHandler(sys.stdout)],
    )
+    logger.setLevel(logging.INFO if is_main_process(training_args.local_rank) else logging.WARN)

    # Log on each process the small summary:
    logger.warning(
@@ -312,7 +318,7 @@ def main():
    if (
        model.config.label2id != PretrainedConfig(num_labels=num_labels).label2id
        and data_args.task_name is not None
-        and is_regression
+        and not is_regression
    ):
        # Some have all caps in their config, some don't.
        label_name_to_id = {k.lower(): v for k, v in model.config.label2id.items()}
@@ -371,6 +377,14 @@ def main():
        else:
            return {"accuracy": (preds == p.label_ids).astype(np.float32).mean().item()}

+    # Data collator will default to DataCollatorWithPadding, so we change it if we already did the padding.
+    if data_args.pad_to_max_length:
+        data_collator = default_data_collator
+    elif training_args.fp16:
+        data_collator = DataCollatorWithPadding(tokenizer, pad_to_multiple_of=8)
+    else:
+        data_collator = None
+
    # Initialize our Trainer
    trainer = Trainer(
        model=model,
@@ -379,15 +393,18 @@ def main():
        eval_dataset=eval_dataset if training_args.do_eval else None,
        compute_metrics=compute_metrics,
        tokenizer=tokenizer,
-        # Data collator will default to DataCollatorWithPadding, so we change it if we already did the padding.
-        data_collator=default_data_collator if data_args.pad_to_max_length else None,
+        data_collator=data_collator,
    )

    # Training
    if training_args.do_train:
-        train_result = trainer.train(
-            model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
-        )
+        if last_checkpoint is not None:
+            checkpoint = last_checkpoint
+        elif os.path.isdir(model_args.model_name_or_path):
+            checkpoint = model_args.model_name_or_path
+        else:
+            checkpoint = None
+        train_result = trainer.train(resume_from_checkpoint=checkpoint)
        metrics = train_result.metrics

        trainer.save_model()  # Saves the tokenizer too for easy upload
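
Note on the checkpoint handling in `run_glue.py` above: the new branch resumes from the latest checkpoint in `--output_dir`, refuses to run when the directory is non-empty but holds no checkpoint, and starts fresh when `--overwrite_output_dir` is set. The sketch below reproduces that decision with the standard library only. `find_last_checkpoint` and `resolve_resume_point` are hypothetical helpers, and the `checkpoint-<step>` folder naming is an assumption here, not a statement about how `get_last_checkpoint` is implemented.

```python
import os
import re
import tempfile

# Assumption: checkpoint folders are named "checkpoint-<step>".
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(folder):
    """Return the path of the highest-numbered checkpoint sub-folder, or None."""
    candidates = [
        name
        for name in os.listdir(folder)
        if _CHECKPOINT_RE.match(name) and os.path.isdir(os.path.join(folder, name))
    ]
    if not candidates:
        return None
    latest = max(candidates, key=lambda name: int(_CHECKPOINT_RE.match(name).group(1)))
    return os.path.join(folder, latest)


def resolve_resume_point(output_dir, do_train, overwrite_output_dir):
    """Decide whether to resume (path), start fresh (None), or refuse (ValueError)."""
    if not (os.path.isdir(output_dir) and do_train and not overwrite_output_dir):
        return None
    last = find_last_checkpoint(output_dir)
    if last is None and len(os.listdir(output_dir)) > 0:
        raise ValueError(f"Output directory ({output_dir}) already exists and is not empty.")
    return last


with tempfile.TemporaryDirectory() as tmp:
    for step in (500, 1000, 90):
        os.mkdir(os.path.join(tmp, f"checkpoint-{step}"))
    # Steps are compared numerically, so 1000 wins over 500 and 90.
    resumed = os.path.basename(resolve_resume_point(tmp, do_train=True, overwrite_output_dir=False))
    # With overwrite requested, existing checkpoints are ignored.
    fresh = resolve_resume_point(tmp, do_train=True, overwrite_output_dir=True)
```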
--- a/examples/text-classification/run_tf_text_classification.py
+++ b/examples/text-classification/run_tf_text_classification.py
@@ -64,7 +64,7 @@ def get_tfds(
    label_name = features_name.pop(label_column_id)
    label_list = list(set(ds[list(files.keys())[0]][label_name]))
    label2id = {label: i for i, label in enumerate(label_list)}
-    input_names = ["input_ids"] + tokenizer.model_input_names
+    input_names = tokenizer.model_input_names
    transformed_ds = {}

    if len(features_name) == 1:
--- a/examples/token-classification/run_ner.py
+++ b/examples/token-classification/run_ner.py
@@ -25,8 +25,7 @@ from dataclasses import dataclass, field
 from typing import Optional

 import numpy as np
-from datasets import ClassLabel, load_dataset
-from seqeval.metrics import accuracy_score, f1_score, precision_score, recall_score
+from datasets import ClassLabel, load_dataset, load_metric

 import transformers
 from transformers import (
@@ -40,7 +39,7 @@ from transformers import (
    TrainingArguments,
    set_seed,
 )
-from transformers.trainer_utils import is_main_process
+from transformers.trainer_utils import get_last_checkpoint, is_main_process


 logger = logging.getLogger(__name__)
@@ -124,6 +123,10 @@ class DataTrainingArguments:
            "one (in which case the other tokens will have a padding index)."
        },
    )
+    return_entity_level_metrics: bool = field(
+        default=False,
+        metadata={"help": "Whether to return entity-level metrics during evaluation or only the overall ones."},
+    )

    def __post_init__(self):
        if self.dataset_name is None and self.train_file is None and self.validation_file is None:
@@ -151,23 +154,28 @@ def main():
    else:
        model_args, data_args, training_args = parser.parse_args_into_dataclasses()

-    if (
-        os.path.exists(training_args.output_dir)
-        and os.listdir(training_args.output_dir)
-        and training_args.do_train
-        and not training_args.overwrite_output_dir
-    ):
-        raise ValueError(
-            f"Output directory ({training_args.output_dir}) already exists and is not empty."
-            "Use --overwrite_output_dir to overcome."
-        )
+    # Detecting last checkpoint.
+    last_checkpoint = None
+    if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
+        last_checkpoint = get_last_checkpoint(training_args.output_dir)
+        if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
+            raise ValueError(
+                f"Output directory ({training_args.output_dir}) already exists and is not empty. "
+                "Use --overwrite_output_dir to overcome."
+            )
+        elif last_checkpoint is not None:
+            logger.info(
+                f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
+                "the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
+            )

    # Setup logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
-        level=logging.INFO if is_main_process(training_args.local_rank) else logging.WARN,
+        handlers=[logging.StreamHandler(sys.stdout)],
    )
+    logger.setLevel(logging.INFO if is_main_process(training_args.local_rank) else logging.WARN)

    # Log on each process the small summary:
    logger.warning(
@@ -320,9 +328,11 @@ def main():
    )

    # Data collator
-    data_collator = DataCollatorForTokenClassification(tokenizer)
+    data_collator = DataCollatorForTokenClassification(tokenizer, pad_to_multiple_of=8 if training_args.fp16 else None)

    # Metrics
+    metric = load_metric("seqeval")
+
    def compute_metrics(p):
        predictions, labels = p
        predictions = np.argmax(predictions, axis=2)
@@ -337,12 +347,24 @@ def main():
            for prediction, label in zip(predictions, labels)
        ]

-        return {
-            "accuracy_score": accuracy_score(true_labels, true_predictions),
-            "precision": precision_score(true_labels, true_predictions),
-            "recall": recall_score(true_labels, true_predictions),
-            "f1": f1_score(true_labels, true_predictions),
-        }
+        results = metric.compute(predictions=true_predictions, references=true_labels)
+        if data_args.return_entity_level_metrics:
+            # Unpack nested dictionaries
+            final_results = {}
+            for key, value in results.items():
+                if isinstance(value, dict):
+                    for n, v in value.items():
+                        final_results[f"{key}_{n}"] = v
+                else:
+                    final_results[key] = value
+            return final_results
+        else:
+            return {
+                "precision": results["overall_precision"],
+                "recall": results["overall_recall"],
+                "f1": results["overall_f1"],
+                "accuracy": results["overall_accuracy"],
+            }

    # Initialize our Trainer
    trainer = Trainer(
@@ -357,9 +379,13 @@ def main():

    # Training
    if training_args.do_train:
-        train_result = trainer.train(
-            model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
-        )
+        if last_checkpoint is not None:
+            checkpoint = last_checkpoint
+        elif os.path.isdir(model_args.model_name_or_path):
+            checkpoint = model_args.model_name_or_path
+        else:
+            checkpoint = None
+        train_result = trainer.train(resume_from_checkpoint=checkpoint)
        trainer.save_model()  # Saves the tokenizer too for easy upload

        output_train_file = os.path.join(training_args.output_dir, "train_results.txt")
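
Note on `compute_metrics` in `run_ner.py` above: with `--return_entity_level_metrics`, nested per-entity dictionaries are flattened into `<entity>_<metric>` keys so that every value is a scalar. A small sketch of that flattening follows. `flatten_metrics` is a hypothetical name, and the sample dictionary uses made-up numbers in the shape the hunk assumes (per-entity dictionaries next to `overall_*` scalars).

```python
def flatten_metrics(results):
    """Flatten one level of nesting: {"PER": {"f1": x}} becomes {"PER_f1": x}."""
    flat = {}
    for key, value in results.items():
        if isinstance(value, dict):
            for name, score in value.items():
                flat[f"{key}_{name}"] = score
        else:
            flat[key] = value
    return flat


# Illustrative values only; not real evaluation results.
sample = {
    "PER": {"precision": 0.9, "recall": 0.8, "f1": 0.85, "number": 10},
    "LOC": {"precision": 0.7, "recall": 0.6, "f1": 0.65, "number": 4},
    "overall_precision": 0.84,
    "overall_f1": 0.79,
}

flat = flatten_metrics(sample)
```

Scalar entries pass through unchanged, so the overall scores keep their original keys.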
--- a/notebooks/README.md
+++ b/notebooks/README.md
@@ -44,37 +44,4 @@ Pull Request so it can be included under the Community notebooks.

 ## Community notebooks:

-| Notebook     |      Description      |      Author      |      |
-|:----------|:-------------|:-------------|------:|
-| [Train T5 in Tensorflow 2 ](https://github.com/snapthat/TF-T5-text-to-text) | How to train T5 for any task using Tensorflow 2. This notebook demonstrates a Question & Answer task implemented in Tensorflow 2 using SQUAD | [Muhammad Harris](https://github.com/HarrisDePerceptron) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snapthat/TF-T5-text-to-text/blob/master/snapthatT5/notebooks/TF-T5-Datasets%20Training.ipynb) |
-| [Train T5 on TPU](https://github.com/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb)  | How to train T5 on SQUAD with Transformers and Nlp | [Suraj Patil](https://github.com/patil-suraj) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb#scrollTo=QLGiFCDqvuil) |
-| [Fine-tune T5 for Classification and Multiple Choice](https://github.com/patil-suraj/exploring-T5/blob/master/t5_fine_tuning.ipynb)  | How to fine-tune T5 for classification and multiple choice tasks using a text-to-text format with PyTorch Lightning |  [Suraj Patil](https://github.com/patil-suraj) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/t5_fine_tuning.ipynb) |
-| [Fine-tune DialoGPT on New Datasets and Languages](https://github.com/ncoop57/i-am-a-nerd/blob/master/_notebooks/2020-05-12-chatbot-part-1.ipynb)  | How to fine-tune the DialoGPT model on a new dataset for open-dialog conversational chatbots |  [Nathan Cooper](https://github.com/ncoop57) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ncoop57/i-am-a-nerd/blob/master/_notebooks/2020-05-12-chatbot-part-1.ipynb) |
-| [Long Sequence Modeling with Reformer](https://github.com/patrickvonplaten/notebooks/blob/master/PyTorch_Reformer.ipynb)  | How to train on sequences as long as 500,000 tokens with Reformer |  [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/PyTorch_Reformer.ipynb)  |
-| [Fine-tune BART for Summarization](https://github.com/ohmeow/ohmeow_website/blob/master/_notebooks/2020-05-23-text-generation-with-blurr.ipynb) | How to fine-tune BART for summarization with fastai using blurr | [Wayde Gilliam](https://ohmeow.com/) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ohmeow/ohmeow_website/blob/master/_notebooks/2020-05-23-text-generation-with-blurr.ipynb) |
-| [Fine-tune a pre-trained Transformer on anyone's tweets](https://colab.research.google.com/github/borisdayma/huggingtweets/blob/master/huggingtweets-demo.ipynb)  | How to generate tweets in the style of your favorite Twitter account by fine-tune a GPT-2 model |  [Boris Dayma](https://github.com/borisdayma) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/borisdayma/huggingtweets/blob/master/huggingtweets-demo.ipynb) |
-| [A Step by Step Guide to Tracking Hugging Face Model Performance](https://colab.research.google.com/drive/1NEiqNPhiouu2pPwDAVeFoN4-vTYMz9F8)  | A quick tutorial for training NLP models with HuggingFace and & visualizing their performance with Weights & Biases |  [Jack Morris](https://github.com/jxmorris12) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1NEiqNPhiouu2pPwDAVeFoN4-vTYMz9F8) |
-| [Pretrain Longformer](https://github.com/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb)  | How to build a "long" version of existing pretrained models |  [Iz Beltagy](https://beltagy.net) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb) |
-| [Fine-tune Longformer for QA](https://github.com/patil-suraj/Notebooks/blob/master/longformer_qa_training.ipynb) | How to fine-tune longformer model for QA task | [Suraj Patil](https://github.com/patil-suraj) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patil-suraj/Notebooks/blob/master/longformer_qa_training.ipynb) |
-| [Evaluate Model with 🤗nlp](https://github.com/patrickvonplaten/notebooks/blob/master/How_to_evaluate_Longformer_on_TriviaQA_using_NLP.ipynb) | How to evaluate longformer on TriviaQA with `nlp` | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1m7eTGlPmLRgoPkkA7rkhQdZ9ydpmsdLE?usp=sharing) |
-| [Fine-tune T5 for Sentiment Span Extraction](https://github.com/enzoampil/t5-intro/blob/master/t5_qa_training_pytorch_span_extraction.ipynb)  | How to fine-tune T5 for sentiment span extraction using a text-to-text format with PyTorch Lightning |  [Lorenzo Ampil](https://github.com/enzoampil) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/enzoampil/t5-intro/blob/master/t5_qa_training_pytorch_span_extraction.ipynb) |
-| [Fine-tune DistilBert for Multiclass Classification](https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_multiclass_classification.ipynb) | How to fine-tune DistilBert for multiclass classification with PyTorch | [Abhishek Kumar Mishra](https://github.com/abhimishra91) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/abhimishra91/transformers-tutorials/blob/master/transformers_multiclass_classification.ipynb)|
-|[Fine-tune BERT for Multi-label Classification](https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_multi_label_classification.ipynb)|How to fine-tune BERT for multi-label classification using PyTorch|[Abhishek Kumar Mishra](https://github.com/abhimishra91) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/abhimishra91/transformers-tutorials/blob/master/transformers_multi_label_classification.ipynb)|
-|[Fine-tune T5 for Summarization](https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_summarization_wandb.ipynb)|How to fine-tune T5 for summarization in PyTorch and track experiments with WandB|[Abhishek Kumar Mishra](https://github.com/abhimishra91) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/abhimishra91/transformers-tutorials/blob/master/transformers_summarization_wandb.ipynb)|
-|[Speed up Fine-Tuning in Transformers with Dynamic Padding / Bucketing](https://github.com/ELS-RD/transformers-notebook/blob/master/Divide_Hugging_Face_Transformers_training_time_by_2_or_more.ipynb)|How to speed up fine-tuning by a factor of 2 using dynamic padding / bucketing|[Michael Benesty](https://github.com/pommedeterresautee) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1CBfRU1zbfu7-ijiOqAAQUA-RJaxfcJoO?usp=sharing)|
-|[Pretrain Reformer for Masked Language Modeling](https://github.com/patrickvonplaten/notebooks/blob/master/Reformer_For_Masked_LM.ipynb)| How to train a Reformer model with bi-directional self-attention layers | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1tzzh0i8PgDQGV3SMFUGxM7_gGae3K-uW?usp=sharing)|
-|[Expand and Fine Tune Sci-BERT](https://github.com/lordtt13/word-embeddings/blob/master/COVID-19%20Research%20Data/COVID-SciBERT.ipynb)| How to increase vocabulary of a pretrained SciBERT model from AllenAI on the CORD dataset and pipeline it. | [Tanmay Thakur](https://github.com/lordtt13) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1rqAR40goxbAfez1xvF3hBJphSCsvXmh8)|
-|[Fine-tune Electra and interpret with Integrated Gradients](https://github.com/elsanns/xai-nlp-notebooks/blob/master/electra_fine_tune_interpret_captum_ig.ipynb) | How to fine-tune Electra for sentiment analysis and interpret predictions with Captum Integrated Gradients | [Eliza Szczechla](https://elsanns.github.io) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elsanns/xai-nlp-notebooks/blob/master/electra_fine_tune_interpret_captum_ig.ipynb)|
-|[fine-tune a non-English GPT-2 Model with Trainer class](https://github.com/philschmid/fine-tune-GPT-2/blob/master/Fine_tune_a_non_English_GPT_2_Model_with_Huggingface.ipynb) | How to fine-tune a non-English GPT-2 Model with Trainer class | [Philipp Schmid](https://www.philschmid.de) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/philschmid/fine-tune-GPT-2/blob/master/Fine_tune_a_non_English_GPT_2_Model_with_Huggingface.ipynb)|
-|[Fine-tune a DistilBERT Model for Multi Label Classification task](https://github.com/DhavalTaunk08/Transformers_scripts/blob/master/Transformers_multilabel_distilbert.ipynb) | How to fine-tune a DistilBERT Model for Multi Label Classification task | [Dhaval Taunk](https://github.com/DhavalTaunk08) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DhavalTaunk08/Transformers_scripts/blob/master/Transformers_multilabel_distilbert.ipynb)|
-|[Fine-tune ALBERT for sentence-pair classification](https://github.com/NadirEM/nlp-notebooks/blob/master/Fine_tune_ALBERT_sentence_pair_classification.ipynb) | How to fine-tune an ALBERT model or another BERT-based model for the sentence-pair classification task | [Nadir El Manouzi](https://github.com/NadirEM) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NadirEM/nlp-notebooks/blob/master/Fine_tune_ALBERT_sentence_pair_classification.ipynb)|
-|[Fine-tune Roberta for sentiment analysis](https://github.com/DhavalTaunk08/NLP_scripts/blob/master/sentiment_analysis_using_roberta.ipynb) | How to fine-tune an Roberta model for sentiment analysis | [Dhaval Taunk](https://github.com/DhavalTaunk08) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DhavalTaunk08/NLP_scripts/blob/master/sentiment_analysis_using_roberta.ipynb)|
-|[Evaluating Question Generation Models](https://github.com/flexudy-pipe/qugeev) | How accurate are the answers to questions generated by your seq2seq transformer model? | [Pascal Zoleko](https://github.com/zolekode) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1bpsSqCQU-iw_5nNoRm_crPq6FRuJthq_?usp=sharing)|
-|[Classify text with DistilBERT and Tensorflow](https://github.com/peterbayerle/huggingface_notebook/blob/main/distilbert_tf.ipynb) | How to fine-tune DistilBERT for text classification in TensorFlow | [Peter Bayerle](https://github.com/peterbayerle) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/peterbayerle/huggingface_notebook/blob/main/distilbert_tf.ipynb)|
-|[Leverage BERT for Encoder-Decoder Summarization on CNN/Dailymail](https://github.com/patrickvonplaten/notebooks/blob/master/BERT2BERT_for_CNN_Dailymail.ipynb) | How to warm-start a *EncoderDecoderModel* with a *bert-base-uncased* checkpoint for summarization on CNN/Dailymail | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/BERT2BERT_for_CNN_Dailymail.ipynb)|
-|[Leverage RoBERTa for Encoder-Decoder Summarization on BBC XSum](https://github.com/patrickvonplaten/notebooks/blob/master/RoBERTaShared_for_BBC_XSum.ipynb) | How to warm-start a shared *EncoderDecoderModel* with a *roberta-base* checkpoint for summarization on BBC/XSum | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/RoBERTaShared_for_BBC_XSum.ipynb)|
-|[Fine-tuning TAPAS on Sequential Question Answering (SQA)](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb) | How to fine-tune *TapasForQuestionAnswering* with a *tapas-base* checkpoint on the Sequential Question Answering (SQA) dataset | [Niels Rogge](https://github.com/nielsrogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb)|
-|[Evaluating TAPAS on Table Fact Checking (TabFact)](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Evaluating_TAPAS_on_the_Tabfact_test_set.ipynb) | How to evaluate a fine-tuned *TapasForSequenceClassification* with a *tapas-base-finetuned-tabfact* checkpoint using a combination of the 🤗 datasets and 🤗 transformers libraries | [Niels Rogge](https://github.com/nielsrogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Evaluating_TAPAS_on_the_Tabfact_test_set.ipynb)|
-|[Fine-tuning mBART for translation](https://colab.research.google.com/github/vasudevgupta7/huggingface-tutorials/blob/main/translation_training.ipynb) | How to fine-tune mBART using Seq2SeqTrainer for Hindi to English translation | [Vasudev Gupta](https://github.com/vasudevgupta7) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vasudevgupta7/huggingface-tutorials/blob/main/translation_training.ipynb)|
-|[Fine-tuning LayoutLM on FUNSD (a form understanding dataset)](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLM/Fine_tuning_LayoutLMForTokenClassification_on_FUNSD.ipynb) | How to fine-tune *LayoutLMForTokenClassification* on the FUNSD dataset | [Niels Rogge](https://github.com/nielsrogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLM/Fine_tuning_LayoutLMForTokenClassification_on_FUNSD.ipynb)|
-|[Fine-Tune DistilGPT2 and Generate Text](https://colab.research.google.com/github/tripathiaakash/DistilGPT2-Tutorial/blob/main/distilgpt2_fine_tuning.ipynb) | How to fine-tune DistilGPT2 and generate text | [Aakash Tripathi](https://github.com/tripathiaakash) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/tripathiaakash/DistilGPT2-Tutorial/blob/main/distilgpt2_fine_tuning.ipynb)|
+More notebooks developed by the community are available [here](https://huggingface.co/transformers/master/community.html#community-notebooks).
--- a/scripts/check_tokenizers.py
+++ b/scripts/check_tokenizers.py
@@ -0,0 +1,169 @@
+from collections import Counter
+import datasets
+import transformers
+from transformers.convert_slow_tokenizer import SLOW_TO_FAST_CONVERTERS
+
+from transformers.utils import logging
+
+logging.set_verbosity_info()
+
+TOKENIZER_CLASSES = {
+    name: (getattr(transformers, name), getattr(transformers, name + "Fast")) for name in SLOW_TO_FAST_CONVERTERS
+}
+
+dataset = datasets.load_dataset("xnli", split="test+validation")
+
+total = 0
+perfect = 0
+imperfect = 0
+wrong = 0
+
+
+def check_diff(spm_diff, tok_diff, slow, fast):
+    if spm_diff == list(reversed(tok_diff)):
+        # AAA -> AA+A vs A+AA case.
+        return True
+    elif len(spm_diff) == len(tok_diff) and fast.decode(spm_diff) == fast.decode(tok_diff):
+        # Second order OK
+        # Barrich -> Barr + ich vs Bar + rich
+        return True
+    spm_reencoded = slow.encode(slow.decode(spm_diff))
+    tok_reencoded = fast.encode(fast.decode(spm_diff))
+    if spm_reencoded != spm_diff and spm_reencoded == tok_reencoded:
+        # Type 3 error.
+        # Snehagatha ->
+        #       Sne, h, aga, th, a
+        #       Sne, ha, gat, ha
+        # Re-encoding the wrong span with spm does not even recover what spm gave us.
+        # It does match the fast tokenizer's output, however.
+        return True
+    return False
+
+
+def check_LTR_mark(line, idx, fast):
+    enc = fast.encode_plus(line)[0]
+    offsets = enc.offsets
+    curr, prev = offsets[idx], offsets[idx - 1]
+    if curr is not None and line[curr[0] : curr[1]] == "\u200f":
+        return True
+    if prev is not None and line[prev[0] : prev[1]] == "\u200f":
+        return True
+
+
+def check_details(line, spm_ids, tok_ids, slow, fast):
+    # Two encodings can differ yet decode to the same text, e.g. AAA -> A + AA vs AA + A.
+    # We can at least check that both use exactly the same number of tokens.
+    for i, (spm_id, tok_id) in enumerate(zip(spm_ids, tok_ids)):
+        if spm_id != tok_id:
+            break
+    first = i
+    for i, (spm_id, tok_id) in enumerate(zip(reversed(spm_ids), reversed(tok_ids))):
+        if spm_id != tok_id:
+            break
+    last = len(spm_ids) - i
+
+    spm_diff = spm_ids[first:last]
+    tok_diff = tok_ids[first:last]
+
+    if check_diff(spm_diff, tok_diff, slow, fast):
+        return True
+
+    if check_LTR_mark(line, first, fast):
+        return True
+
+    if last - first > 5:
+        # The same problem may occur twice; try to split the disjoint tokens into smaller problems.
+        spms = Counter(spm_ids[first:last])
+        toks = Counter(tok_ids[first:last])
+
+        removable_tokens = {spm_ for (spm_, si) in spms.items() if toks.get(spm_, 0) == si}
+        min_width = 3
+        for i in range(last - first - min_width):
+            if all(spm_ids[first + i + j] in removable_tokens for j in range(min_width)):
+                possible_matches = [
+                    k
+                    for k in range(last - first - min_width)
+                    if tok_ids[first + k : first + k + min_width] == spm_ids[first + i : first + i + min_width]
+                ]
+                for j in possible_matches:
+                    if check_diff(spm_ids[first : first + i], tok_ids[first : first + j], slow, fast) and check_details(
+                        line,
+                        spm_ids[first + i : last],
+                        tok_ids[first + j : last],
+                        slow,
+                        fast,
+                    ):
+                        return True
+
+    print(f"Spm: {[fast.decode([spm_ids[i]]) for i in range(first, last)]}")
+    try:
+        print(f"Tok: {[fast.decode([tok_ids[i]]) for i in range(first, last)]}")
+    except Exception:
+        pass
+
+    ok_start = fast.decode(spm_ids[:first])
+    ok_end = fast.decode(spm_ids[last:])
+    wrong = fast.decode(spm_ids[first:last])
+    print()
+    print(wrong)
+    return False
+
+
+def test_string(slow, fast, text):
+    global perfect
+    global imperfect
+    global wrong
+    global total
+
+    slow_ids = slow.encode(text)
+    fast_ids = fast.encode(text)
+
+    skip_assert = False
+    total += 1
+
+    if slow_ids != fast_ids:
+        if check_details(text, slow_ids, fast_ids, slow, fast):
+            skip_assert = True
+            imperfect += 1
+        else:
+            wrong += 1
+    else:
+        perfect += 1
+
+    if total % 10000 == 0:
+        print(f"({perfect} / {imperfect} / {wrong} ----- {perfect + imperfect + wrong})")
+
+    if skip_assert:
+        return
+
+    assert (
+        slow_ids == fast_ids
+    ), f"line {text} : \n\n{slow_ids}\n{fast_ids}\n\n{slow.tokenize(text)}\n{fast.tokenize(text)}"
+
+
+def test_tokenizer(slow, fast):
+    global batch_total
+    for i in range(len(dataset)):
+        # premise, all languages
+        for text in dataset[i]["premise"].values():
+            test_string(slow, fast, text)
+
+        # hypothesis, all languages
+        for text in dataset[i]["hypothesis"]["translation"]:
+            test_string(slow, fast, text)
+
+
+if __name__ == "__main__":
+    for name, (slow_class, fast_class) in TOKENIZER_CLASSES.items():
+        checkpoint_names = list(slow_class.max_model_input_sizes.keys())
+        for checkpoint in checkpoint_names:
+            imperfect = 0
+            perfect = 0
+            wrong = 0
+            total = 0
+
+            print(f"========================== Checking {name}: {checkpoint} ==========================")
+            slow = slow_class.from_pretrained(checkpoint, force_download=True)
+            fast = fast_class.from_pretrained(checkpoint, force_download=True)
+            test_tokenizer(slow, fast)
+            print(f"Accuracy {perfect * 100 / total:.2f}")
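
Note on `check_details` in `scripts/check_tokenizers.py` above: it narrows two id sequences down to the window where they disagree before classifying the mismatch. The sketch below shows one way to compute such a window by trimming the common prefix and suffix. `differing_spans` is a hypothetical helper, not the script's code; unlike the script, it also tolerates sequences of different lengths.

```python
def differing_spans(a, b):
    """Strip the common prefix and suffix of two lists and return what is left of each."""
    limit = min(len(a), len(b))

    # Length of the shared prefix.
    prefix = 0
    while prefix < limit and a[prefix] == b[prefix]:
        prefix += 1

    # Length of the shared suffix, never overlapping the prefix.
    suffix = 0
    while suffix < limit - prefix and a[len(a) - 1 - suffix] == b[len(b) - 1 - suffix]:
        suffix += 1

    return a[prefix : len(a) - suffix], b[prefix : len(b) - suffix]


slow_ids = [1, 2, 3, 4, 5, 9]
fast_ids = [1, 2, 4, 3, 5, 9]

slow_diff, fast_diff = differing_spans(slow_ids, fast_ids)

# The first acceptance rule in check_diff: the two windows are mirror images.
is_mirrored = slow_diff == list(reversed(fast_diff))
```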
--- a/setup.py
+++ b/setup.py
@@ -102,10 +102,10 @@ _deps = [
    "importlib_metadata",
    "ipadic>=1.0.0,<2.0",
    "isort>=5.5.4",
-    "jax>=0.2.0",
-    "jaxlib==0.1.55",
+    "jax>=0.2.8",
+    "jaxlib>=0.1.59",
    "keras2onnx",
-    "numpy",
+    "numpy>=1.17",
    "onnxconverter-common",
    "onnxruntime-tools>=1.4.2",
    "onnxruntime>=1.4.0",
@@ -123,6 +123,7 @@ _deps = [
    "sacremoses",
    "scikit-learn",
    "sentencepiece==0.1.91",
+    "soundfile",
    "sphinx-copybutton",
    "sphinx-markdown-tables",
    "sphinx-rtd-theme==0.4.3",  # sphinx-rtd-theme==0.5.0 introduced big changes in the style.
@@ -131,7 +132,7 @@ _deps = [
    "tensorflow-cpu>=2.3",
    "tensorflow>=2.3",
    "timeout-decorator",
-    "tokenizers==0.9.4",
+    "tokenizers>=0.10.1,<0.11",
    "torch>=1.0",
    "tqdm>=4.27",
    "unidic>=1.0.2",
@@ -140,10 +141,30 @@ _deps = [
 ]


-# tokenizers: "tokenizers==0.9.4" lookup table
-# support non-versions file too so that they can be checked at run time
+# this is a lookup table with items like:
+#
+# tokenizers: "tokenizers==0.9.4"
+# packaging: "packaging"
+#
+# some of the values are versioned whereas others aren't.
 deps = {b: a for a, b in (re.findall(r"^(([^!=<>]+)(?:[!=<>].*)?$)", x)[0] for x in _deps)}

+# since we save this data in src/transformers/dependency_versions_table.py it can be easily accessed from
+# anywhere. If you need to quickly access the data from this table in a shell, you can do so easily with:
+#
+# python -c 'import sys; from transformers.dependency_versions_table import deps; \
+# print(" ".join([ deps[x] for x in sys.argv[1:]]))' tokenizers datasets
+#
+# Pass the desired package names as arguments, as shown with the 2 packages above.
+#
+# If transformers is not yet installed and you are working from the cloned repo, remember to prefix the command
+# above with `PYTHONPATH=src`.
+#
+# You can then feed this for example to `pip`:
+#
+# pip install -U $(python -c 'import sys; from transformers.dependency_versions_table import deps; \
+# print(" ".join([ deps[x] for x in sys.argv[1:]]))' tokenizers datasets)
+#
+

 def deps_list(*pkgs):
    return [deps[pkg] for pkg in pkgs]
@@ -206,13 +227,14 @@ extras["onnxruntime"] = deps_list("onnxruntime", "onnxruntime-tools")
 extras["modelcreation"] = deps_list("cookiecutter")

 extras["serving"] = deps_list("pydantic", "uvicorn", "fastapi", "starlette")
+extras["speech"] = deps_list("soundfile")

 extras["sentencepiece"] = deps_list("sentencepiece", "protobuf")
-extras["retrieval"] = deps_list("faiss-cpu", "datasets")
 extras["testing"] = (
-    deps_list("pytest", "pytest-xdist", "timeout-decorator", "parameterized", "psutil")
+    deps_list("pytest", "pytest-xdist", "timeout-decorator", "parameterized", "psutil", "datasets")
    + extras["retrieval"]
    + extras["modelcreation"]
+    + extras["speech"]
 )
 extras["docs"] = deps_list("recommonmark", "sphinx", "sphinx-markdown-tables", "sphinx-rtd-theme", "sphinx-copybutton")
 extras["quality"] = deps_list("black", "isort", "flake8")
@@ -229,8 +251,20 @@ extras["dev"] = (
    + extras["modelcreation"]
 )

-extras["torchhub"] = deps_list("filelock", "importlib_metadata", "numpy", "packaging", "protobuf", "regex",
-                               "requests", "sacremoses", "sentencepiece", "torch", "tokenizers", "tqdm")
+extras["torchhub"] = deps_list(
+    "filelock",
+    "importlib_metadata",
+    "numpy",
+    "packaging",
+    "protobuf",
+    "regex",
+    "requests",
+    "sacremoses",
+    "sentencepiece",
+    "torch",
+    "tokenizers",
+    "tqdm",
+)

 # when modifying the following list, make sure to update src/transformers/dependency_versions_check.py
 install_requires = [
@@ -248,7 +282,7 @@ install_requires = [

 setup(
    name="transformers",
-    version="4.2.0", # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots)
+    version="4.3.1",  # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots)
    author="Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Sam Shleifer, Patrick von Platen, Sylvain Gugger, Google AI Language Team Authors, Open AI team Authors, Facebook AI Authors, Carnegie Mellon University Authors",
    author_email="thomas@huggingface.co",
    description="State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch",
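The lookup-table construction in the `setup.py` hunks above can be tried in isolation. The following is a minimal sketch, assuming a shortened, illustrative `_deps` list (the three entries are copied from the diff; the real list is much longer). The regex and `deps_list` are taken from the diff as-is.

```python
import re

# Illustrative subset of the `_deps` list from setup.py (assumption: not the full list).
_deps = [
    "numpy>=1.17",
    "soundfile",
    "tokenizers>=0.10.1,<0.11",
]

# Same comprehension as in setup.py. For each requirement string the regex
# captures (whole requirement, bare package name); the dict maps name -> requirement.
deps = {b: a for a, b in (re.findall(r"^(([^!=<>]+)(?:[!=<>].*)?$)", x)[0] for x in _deps)}


def deps_list(*pkgs):
    # Resolve bare package names to their (possibly versioned) requirement strings.
    return [deps[pkg] for pkg in pkgs]


print(deps_list("tokenizers", "soundfile"))
```

Unversioned entries such as `soundfile` map to themselves, which is what lets `extras["speech"] = deps_list("soundfile")` work without a version pin.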
--- a/src/transformers/__init__.py
+++ b/src/transformers/__init__.py
@@ -22,7 +22,7 @@
 # to defer the actual importing for when the objects are requested. This way `import transformers` provides the names
 # in the namespace without actually importing anything (and especially none of the backends).

-__version__ = "4.2.0"
+__version__ = "4.3.1"

 # Work around to update TensorFlow's absl.logging threshold which alters the
 # default Python logging output behavior when present.
@@ -125,6 +125,8 @@ _import_structure = {
    ],
    "models": [],
    # Models
+    "models.wav2vec2": ["WAV_2_VEC_2_PRETRAINED_CONFIG_ARCHIVE_MAP", "Wav2Vec2Config", "Wav2Vec2Tokenizer"],
+    "models.convbert": ["CONVBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "ConvBertConfig", "ConvBertTokenizer"],
    "models.albert": ["ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "AlbertConfig"],
    "models.auto": [
        "ALL_PRETRAINED_CONFIG_ARCHIVE_MAP",
@@ -275,6 +277,7 @@ else:
 # tokenziers-backed objects
 if is_tokenizers_available():
    # Fast tokenizers
+    _import_structure["models.convbert"].append("ConvBertTokenizerFast")
    _import_structure["models.albert"].append("AlbertTokenizerFast")
    _import_structure["models.bart"].append("BartTokenizerFast")
    _import_structure["models.barthez"].append("BarthezTokenizerFast")
@@ -324,6 +327,7 @@ if is_torch_available():
        "DataCollator",
        "DataCollatorForLanguageModeling",
        "DataCollatorForPermutationLanguageModeling",
+        "DataCollatorForSeq2Seq",
        "DataCollatorForSOP",
        "DataCollatorForTokenClassification",
        "DataCollatorForWholeWordMask",
@@ -359,6 +363,30 @@ if is_torch_available():
    _import_structure["generation_utils"] = ["top_k_top_p_filtering"]
    _import_structure["modeling_utils"] = ["Conv1D", "PreTrainedModel", "apply_chunking_to_forward", "prune_layer"]
    # PyTorch models structure
+
+    _import_structure["models.wav2vec2"].extend(
+        [
+            "WAV_2_VEC_2_PRETRAINED_MODEL_ARCHIVE_LIST",
+            "Wav2Vec2ForCTC",
+            "Wav2Vec2ForMaskedLM",
+            "Wav2Vec2Model",
+            "Wav2Vec2PreTrainedModel",
+        ]
+    )
+    _import_structure["models.convbert"].extend(
+        [
+            "CONVBERT_PRETRAINED_MODEL_ARCHIVE_LIST",
+            "ConvBertForMaskedLM",
+            "ConvBertForMultipleChoice",
+            "ConvBertForQuestionAnswering",
+            "ConvBertForSequenceClassification",
+            "ConvBertForTokenClassification",
+            "ConvBertLayer",
+            "ConvBertModel",
+            "ConvBertPreTrainedModel",
+            "load_tf_weights_in_convbert",
+        ]
+    )
    _import_structure["models.albert"].extend(
        [
            "ALBERT_PRETRAINED_MODEL_ARCHIVE_LIST",
@@ -404,6 +432,7 @@ if is_torch_available():
    _import_structure["models.bart"].extend(
        [
            "BART_PRETRAINED_MODEL_ARCHIVE_LIST",
+            "BartForCausalLM",
            "BartForConditionalGeneration",
            "BartForQuestionAnswering",
            "BartForSequenceClassification",
@@ -441,6 +470,7 @@ if is_torch_available():
            "BLENDERBOT_PRETRAINED_MODEL_ARCHIVE_LIST",
            "BlenderbotForConditionalGeneration",
            "BlenderbotModel",
+            "BlenderbotForCausalLM",
        ]
    )
    _import_structure["models.blenderbot_small"].extend(
@@ -448,6 +478,7 @@ if is_torch_available():
            "BLENDERBOT_SMALL_PRETRAINED_MODEL_ARCHIVE_LIST",
            "BlenderbotSmallForConditionalGeneration",
            "BlenderbotSmallModel",
+            "BlenderbotSmallForCausalLM",
        ]
    )
    _import_structure["models.camembert"].extend(
@@ -476,7 +507,10 @@ if is_torch_available():
            "DEBERTA_PRETRAINED_MODEL_ARCHIVE_LIST",
            "DebertaForSequenceClassification",
            "DebertaModel",
+            "DebertaForMaskedLM",
            "DebertaPreTrainedModel",
+            "DebertaForTokenClassification",
+            "DebertaForQuestionAnswering",
        ]
    )
    _import_structure["models.distilbert"].extend(
@@ -598,9 +632,10 @@ if is_torch_available():
            "LxmertXLayer",
        ]
    )
-    _import_structure["models.marian"].extend(["MarianModel", "MarianMTModel"])
+    _import_structure["models.marian"].extend(["MarianModel", "MarianMTModel", "MarianForCausalLM"])
    _import_structure["models.mbart"].extend(
        [
+            "MBartForCausalLM",
            "MBartForConditionalGeneration",
            "MBartForQuestionAnswering",
            "MBartForSequenceClassification",
@@ -649,7 +684,9 @@ if is_torch_available():
            "load_tf_weights_in_openai_gpt",
        ]
    )
-    _import_structure["models.pegasus"].extend(["PegasusForConditionalGeneration", "PegasusModel"])
+    _import_structure["models.pegasus"].extend(
+        ["PegasusForConditionalGeneration", "PegasusModel", "PegasusForCausalLM"]
+    )
    _import_structure["models.prophetnet"].extend(
        [
            "PROPHETNET_PRETRAINED_MODEL_ARCHIVE_LIST",
@@ -812,6 +849,20 @@ if is_tf_available():
        "shape_list",
    ]
    # TensorFlow models structure
+
+    _import_structure["models.convbert"].extend(
+        [
+            "TF_CONVBERT_PRETRAINED_MODEL_ARCHIVE_LIST",
+            "TFConvBertForMaskedLM",
+            "TFConvBertForMultipleChoice",
+            "TFConvBertForQuestionAnswering",
+            "TFConvBertForSequenceClassification",
+            "TFConvBertForTokenClassification",
+            "TFConvBertLayer",
+            "TFConvBertModel",
+            "TFConvBertPreTrainedModel",
+        ]
+    )
    _import_structure["models.albert"].extend(
        [
            "TF_ALBERT_PRETRAINED_MODEL_ARCHIVE_LIST",
@@ -1231,6 +1282,7 @@ if TYPE_CHECKING:
        BlenderbotSmallTokenizer,
    )
    from .models.camembert import CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, CamembertConfig
+    from .models.convbert import CONVBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, ConvBertConfig, ConvBertTokenizer
    from .models.ctrl import CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP, CTRLConfig, CTRLTokenizer
    from .models.deberta import DEBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, DebertaConfig, DebertaTokenizer
    from .models.distilbert import DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, DistilBertConfig, DistilBertTokenizer
@@ -1276,6 +1328,7 @@ if TYPE_CHECKING:
        TransfoXLCorpus,
        TransfoXLTokenizer,
    )
+    from .models.wav2vec2 import WAV_2_VEC_2_PRETRAINED_CONFIG_ARCHIVE_MAP, Wav2Vec2Config, Wav2Vec2Tokenizer
    from .models.xlm import XLM_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMConfig, XLMTokenizer
    from .models.xlm_prophetnet import XLM_PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMProphetNetConfig
    from .models.xlm_roberta import XLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMRobertaConfig
@@ -1355,6 +1408,7 @@ if TYPE_CHECKING:
        from .models.barthez import BarthezTokenizerFast
        from .models.bert import BertTokenizerFast
        from .models.camembert import CamembertTokenizerFast
+        from .models.convbert import ConvBertTokenizerFast
        from .models.distilbert import DistilBertTokenizerFast
        from .models.dpr import DPRContextEncoderTokenizerFast, DPRQuestionEncoderTokenizerFast, DPRReaderTokenizerFast
        from .models.electra import ElectraTokenizerFast
@@ -1395,6 +1449,7 @@ if TYPE_CHECKING:
            DataCollator,
            DataCollatorForLanguageModeling,
            DataCollatorForPermutationLanguageModeling,
+            DataCollatorForSeq2Seq,
            DataCollatorForSOP,
            DataCollatorForTokenClassification,
            DataCollatorForWholeWordMask,
@@ -1469,6 +1524,7 @@ if TYPE_CHECKING:
        )
        from .models.bart import (
            BART_PRETRAINED_MODEL_ARCHIVE_LIST,
+            BartForCausalLM,
            BartForConditionalGeneration,
            BartForQuestionAnswering,
            BartForSequenceClassification,
@@ -1498,11 +1554,13 @@ if TYPE_CHECKING:
        )
        from .models.blenderbot import (
            BLENDERBOT_PRETRAINED_MODEL_ARCHIVE_LIST,
+            BlenderbotForCausalLM,
            BlenderbotForConditionalGeneration,
            BlenderbotModel,
        )
        from .models.blenderbot_small import (
            BLENDERBOT_SMALL_PRETRAINED_MODEL_ARCHIVE_LIST,
+            BlenderbotSmallForCausalLM,
            BlenderbotSmallForConditionalGeneration,
            BlenderbotSmallModel,
        )
@@ -1516,6 +1574,18 @@ if TYPE_CHECKING:
            CamembertForTokenClassification,
            CamembertModel,
        )
+        from .models.convbert import (
+            CONVBERT_PRETRAINED_MODEL_ARCHIVE_LIST,
+            ConvBertForMaskedLM,
+            ConvBertForMultipleChoice,
+            ConvBertForQuestionAnswering,
+            ConvBertForSequenceClassification,
+            ConvBertForTokenClassification,
+            ConvBertLayer,
+            ConvBertModel,
+            ConvBertPreTrainedModel,
+            load_tf_weights_in_convbert,
+        )
        from .models.ctrl import (
            CTRL_PRETRAINED_MODEL_ARCHIVE_LIST,
            CTRLForSequenceClassification,
@@ -1525,7 +1595,10 @@ if TYPE_CHECKING:
        )
        from .models.deberta import (
            DEBERTA_PRETRAINED_MODEL_ARCHIVE_LIST,
+            DebertaForMaskedLM,
+            DebertaForQuestionAnswering,
            DebertaForSequenceClassification,
+            DebertaForTokenClassification,
            DebertaModel,
            DebertaPreTrainedModel,
        )
@@ -1628,8 +1701,9 @@ if TYPE_CHECKING:
            LxmertVisualFeatureEncoder,
            LxmertXLayer,
        )
-        from .models.marian import MarianModel, MarianMTModel
+        from .models.marian import MarianForCausalLM, MarianModel, MarianMTModel
        from .models.mbart import (
+            MBartForCausalLM,
            MBartForConditionalGeneration,
            MBartForQuestionAnswering,
            MBartForSequenceClassification,
@@ -1671,7 +1745,7 @@ if TYPE_CHECKING:
            OpenAIGPTPreTrainedModel,
            load_tf_weights_in_openai_gpt,
        )
-        from .models.pegasus import PegasusForConditionalGeneration, PegasusModel
+        from .models.pegasus import PegasusForCausalLM, PegasusForConditionalGeneration, PegasusModel
        from .models.prophetnet import (
            PROPHETNET_PRETRAINED_MODEL_ARCHIVE_LIST,
            ProphetNetDecoder,
@@ -1738,6 +1812,13 @@ if TYPE_CHECKING:
            TransfoXLPreTrainedModel,
            load_tf_weights_in_transfo_xl,
        )
+        from .models.wav2vec2 import (
+            WAV_2_VEC_2_PRETRAINED_MODEL_ARCHIVE_LIST,
+            Wav2Vec2ForCTC,
+            Wav2Vec2ForMaskedLM,
+            Wav2Vec2Model,
+            Wav2Vec2PreTrainedModel,
+        )
        from .models.xlm import (
            XLM_PRETRAINED_MODEL_ARCHIVE_LIST,
            XLMForMultipleChoice,
@@ -1871,6 +1952,17 @@ if TYPE_CHECKING:
            TFCamembertForTokenClassification,
            TFCamembertModel,
        )
+        from .models.convbert import (
+            TF_CONVBERT_PRETRAINED_MODEL_ARCHIVE_LIST,
+            TFConvBertForMaskedLM,
+            TFConvBertForMultipleChoice,
+            TFConvBertForQuestionAnswering,
+            TFConvBertForSequenceClassification,
+            TFConvBertForTokenClassification,
+            TFConvBertLayer,
+            TFConvBertModel,
+            TFConvBertPreTrainedModel,
+        )
        from .models.ctrl import (
            TF_CTRL_PRETRAINED_MODEL_ARCHIVE_LIST,
            TFCTRLForSequenceClassification,
--- a/src/transformers/activations_tf.py
+++ b/src/transformers/activations_tf.py
@@ -15,9 +15,10 @@
 import math

 import tensorflow as tf
+from packaging import version


-def gelu(x):
+def _gelu(x):
    """
    Gaussian Error Linear Unit. Original Implementation of the gelu activation function in Google Bert repo when
    initially created. For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):
@@ -25,12 +26,12 @@ def gelu(x):
    https://arxiv.org/abs/1606.08415
    """
    x = tf.convert_to_tensor(x)
-    cdf = 0.5 * (1.0 + tf.math.erf(x / tf.math.sqrt(2.0)))
+    cdf = 0.5 * (1.0 + tf.math.erf(x / tf.cast(tf.sqrt(2.0), x.dtype)))

    return x * cdf


-def gelu_new(x):
+def _gelu_new(x):
    """
    Gaussian Error Linear Unit. This is a smoother version of the GELU. Original paper: https://arxiv.org/abs/1606.0841

@@ -56,21 +57,33 @@ def mish(x):

 def gelu_fast(x):
    x = tf.convert_to_tensor(x)
-    coeff1 = tf.cast(7978845608, x.dtype)
+    coeff1 = tf.cast(0.7978845608, x.dtype)
    coeff2 = tf.cast(0.044715, x.dtype)

    return 0.5 * x * (1.0 + tf.tanh(x * coeff2 * (1.0 + coeff1 * x * x)))


+if version.parse(tf.version.VERSION) >= version.parse("2.4"):
+
+    def approximate_gelu_wrap(x):
+        return tf.keras.activations.gelu(x, approximate=True)
+
+    gelu = tf.keras.activations.gelu
+    gelu_new = approximate_gelu_wrap
+else:
+    gelu = _gelu
+    gelu_new = _gelu_new
+
+
 ACT2FN = {
-    "gelu": tf.keras.layers.Activation(gelu),
+    "gelu": gelu,
    "relu": tf.keras.activations.relu,
    "swish": tf.keras.activations.swish,
    "silu": tf.keras.activations.swish,
-    "gelu_new": tf.keras.layers.Activation(gelu_new),
-    "mish": tf.keras.layers.Activation(mish),
+    "gelu_new": gelu_new,
+    "mish": mish,
    "tanh": tf.keras.activations.tanh,
-    "gelu_fast": tf.keras.layers.Activation(gelu_fast),
+    "gelu_fast": gelu_fast,
 }
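The constant corrected in the `gelu_fast` hunk (`7978845608` to `0.7978845608`) is the decimal value of sqrt(2/pi). The sketch below compares the exact GELU with the tanh approximation as formulated in the GELU paper. It is an illustration only: it uses the standard library `math` module instead of TensorFlow so that it stays self-contained, and the function names `gelu_exact` and `gelu_tanh` are hypothetical, not part of the library.

```python
import math


def gelu_exact(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU as given in the GELU paper."""
    k = math.sqrt(2.0 / math.pi)  # approximately 0.7978845608
    return 0.5 * x * (1.0 + math.tanh(k * (x + 0.044715 * x ** 3)))


# Print both variants side by side for a few sample inputs.
for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"{x:+.1f}  exact={gelu_exact(x):+.6f}  tanh={gelu_tanh(x):+.6f}")
```

With the uncorrected integer constant the argument of `tanh` saturates for almost any nonzero input, so the function would degenerate to roughly `x` for positive inputs and `0` for negative ones.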


--- a/src/transformers/commands/convert.py
+++ b/src/transformers/commands/convert.py
@@ -110,6 +110,13 @@ class ConvertCommand(BaseTransformersCLICommand):
            except ImportError:
                raise ImportError(IMPORT_ERROR_MESSAGE)

+            convert_tf_checkpoint_to_pytorch(self._tf_checkpoint, self._config, self._pytorch_dump_output)
+        elif self._model_type == "t5":
+            try:
+                from ..models.t5.convert_t5_original_tf_checkpoint_to_pytorch import convert_tf_checkpoint_to_pytorch
+            except ImportError:
+                raise ImportError(IMPORT_ERROR_MESSAGE)
+
            convert_tf_checkpoint_to_pytorch(self._tf_checkpoint, self._config, self._pytorch_dump_output)
        elif self._model_type == "gpt":
            from ..models.openai.convert_openai_original_tf_checkpoint_to_pytorch import (
@@ -168,5 +175,5 @@ class ConvertCommand(BaseTransformersCLICommand):
            convert_lxmert_checkpoint_to_pytorch(self._tf_checkpoint, self._pytorch_dump_output)
        else:
            raise ValueError(
-                "--model_type should be selected in the list [bert, gpt, gpt2, transfo_xl, xlnet, xlm, lxmert]"
+                "--model_type should be selected in the list [bert, gpt, gpt2, t5, transfo_xl, xlnet, xlm, lxmert]"
            )
--- a/src/transformers/configuration_utils.py
+++ b/src/transformers/configuration_utils.py
@@ -117,6 +117,9 @@ class PretrainedConfig(object):
        - **no_repeat_ngram_size** (:obj:`int`, `optional`, defaults to 0) -- Value that will be used by default in the
          :obj:`generate` method of the model for ``no_repeat_ngram_size``. If set to int > 0, all ngrams of that size
          can only occur once.
+        - **encoder_no_repeat_ngram_size** (:obj:`int`, `optional`, defaults to 0) -- Value that will be used by
+          default in the :obj:`generate` method of the model for ``encoder_no_repeat_ngram_size``. If set to int > 0,
+          all ngrams of that size that occur in the ``encoder_input_ids`` cannot occur in the ``decoder_input_ids``.
        - **bad_words_ids** (:obj:`List[int]`, `optional`) -- List of token ids that are not allowed to be generated
          that will be used by default in the :obj:`generate` method of the model. In order to get the tokens of the
          words that should not appear in the generated text, use :obj:`tokenizer.encode(bad_word,
@@ -205,6 +208,7 @@ class PretrainedConfig(object):
        self.repetition_penalty = kwargs.pop("repetition_penalty", 1.0)
        self.length_penalty = kwargs.pop("length_penalty", 1.0)
        self.no_repeat_ngram_size = kwargs.pop("no_repeat_ngram_size", 0)
+        self.encoder_no_repeat_ngram_size = kwargs.pop("encoder_no_repeat_ngram_size", 0)
        self.bad_words_ids = kwargs.pop("bad_words_ids", None)
        self.num_return_sequences = kwargs.pop("num_return_sequences", 1)
        self.chunk_size_feed_forward = kwargs.pop("chunk_size_feed_forward", 0)
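The docstring for `encoder_no_repeat_ngram_size` states that n-grams occurring in the encoder input may not occur in the decoder output. One way to read that rule is sketched below in plain Python. This is not the library's implementation; `banned_next_tokens` is a hypothetical helper that, for a single sequence, returns the tokens that would complete an n-gram already present in the encoder input.

```python
from typing import List, Set


def banned_next_tokens(encoder_ids: List[int], decoder_ids: List[int], ngram_size: int) -> Set[int]:
    """Tokens that, if generated next, would reproduce an encoder n-gram (hypothetical helper)."""
    # Nothing can be banned if the feature is off or the decoder prefix is too short.
    if ngram_size <= 0 or len(decoder_ids) + 1 < ngram_size:
        return set()

    # The last (n - 1) generated tokens form the prefix of the candidate n-gram.
    prefix_len = ngram_size - 1
    tail = tuple(decoder_ids[len(decoder_ids) - prefix_len:])

    banned = set()
    for i in range(len(encoder_ids) - ngram_size + 1):
        ngram = tuple(encoder_ids[i:i + ngram_size])
        # If the encoder n-gram starts with the decoder tail, its last token is banned.
        if ngram[:-1] == tail:
            banned.add(ngram[-1])
    return banned


# Encoder contains the 3-gram (5, 6, 7); the decoder has just produced 5, 6.
print(banned_next_tokens([5, 6, 7, 8], [1, 5, 6], ngram_size=3))
```

In a generation loop, the scores of the returned token ids would typically be set to negative infinity before sampling or taking the argmax.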
--- a/src/transformers/convert_slow_tokenizer.py
+++ b/src/transformers/convert_slow_tokenizer.py
@@ -21,7 +21,7 @@

 from typing import Dict, List, Tuple

-from tokenizers import Tokenizer, decoders, normalizers, pre_tokenizers, processors
+from tokenizers import Regex, Tokenizer, decoders, normalizers, pre_tokenizers, processors
 from tokenizers.models import BPE, Unigram, WordPiece

 from .file_utils import requires_protobuf, requires_sentencepiece
@@ -340,7 +340,12 @@ class SpmConverter(Converter):

    def normalizer(self, proto):
        precompiled_charsmap = proto.normalizer_spec.precompiled_charsmap
-        return normalizers.Precompiled(precompiled_charsmap)
+        return normalizers.Sequence(
+            [normalizers.Precompiled(precompiled_charsmap), normalizers.Replace(Regex(" {2,}"), " ")]
+        )
+
+    def pre_tokenizer(self, replacement, add_prefix_space):
+        return pre_tokenizers.Metaspace(replacement=replacement, add_prefix_space=add_prefix_space)

    def post_processor(self):
        return None
@@ -353,12 +358,7 @@ class SpmConverter(Converter):

        replacement = "▁"
        add_prefix_space = True
-        tokenizer.pre_tokenizer = pre_tokenizers.Sequence(
-            [
-                pre_tokenizers.WhitespaceSplit(),
-                pre_tokenizers.Metaspace(replacement=replacement, add_prefix_space=add_prefix_space),
-            ]
-        )
+        tokenizer.pre_tokenizer = self.pre_tokenizer(replacement, add_prefix_space)
        tokenizer.decoder = decoders.Metaspace(replacement=replacement, add_prefix_space=add_prefix_space)
        post_processor = self.post_processor()
        if post_processor:
@@ -375,7 +375,11 @@ class AlbertConverter(SpmConverter):
        ]

    def normalizer(self, proto):
-        list_normalizers = [normalizers.Replace("``", '"'), normalizers.Replace("''", '"')]
+        list_normalizers = [
+            normalizers.Replace("``", '"'),
+            normalizers.Replace("''", '"'),
+            normalizers.Replace(Regex(" {2,}"), " "),
+        ]
        if not self.original_tokenizer.keep_accents:
            list_normalizers.append(normalizers.NFKD())
            list_normalizers.append(normalizers.StripAccents())
@@ -529,7 +533,11 @@ class XLNetConverter(SpmConverter):
        ]

    def normalizer(self, proto):
-        list_normalizers = [normalizers.Replace("``", '"'), normalizers.Replace("''", '"')]
+        list_normalizers = [
+            normalizers.Replace("``", '"'),
+            normalizers.Replace("''", '"'),
+            normalizers.Replace(Regex(" {2,}"), " "),
+        ]
        if not self.original_tokenizer.keep_accents:
            list_normalizers.append(normalizers.NFKD())
            list_normalizers.append(normalizers.StripAccents())
@@ -574,6 +582,14 @@ class PegasusConverter(SpmConverter):
    def unk_id(self, proto):
        return proto.trainer_spec.unk_id + self.original_tokenizer.offset

+    def pre_tokenizer(self, replacement, add_prefix_space):
+        return pre_tokenizers.Sequence(
+            [
+                pre_tokenizers.WhitespaceSplit(),
+                pre_tokenizers.Metaspace(replacement=replacement, add_prefix_space=add_prefix_space),
+            ]
+        )
+
    def post_processor(self):
        eos = self.original_tokenizer.eos_token
        special_tokens = [
@@ -605,6 +621,7 @@ SLOW_TO_FAST_CONVERTERS = {
    "BarthezTokenizer": BarthezConverter,
    "BertTokenizer": BertConverter,
    "CamembertTokenizer": CamembertConverter,
+    "ConvBertTokenizer": BertConverter,
    "DistilBertTokenizer": BertConverter,
    "DPRReaderTokenizer": BertConverter,
    "DPRQuestionEncoderTokenizer": BertConverter,
--- a/src/transformers/data/data_collator.py
+++ b/src/transformers/data/data_collator.py
@@ -224,6 +224,63 @@ def tolist(x: Union[List[Any], torch.Tensor]):
    return x.tolist() if isinstance(x, torch.Tensor) else x


+@dataclass
+class DataCollatorForSeq2Seq:
+    """
+    Data collator that will dynamically pad the inputs received, as well as the labels.
+
+    Args:
+        tokenizer (:class:`~transformers.PreTrainedTokenizer` or :class:`~transformers.PreTrainedTokenizerFast`):
+            The tokenizer used for encoding the data.
+        padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
+            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
+            among:
+
+            * :obj:`True` or :obj:`'longest'` (default): Pad to the longest sequence in the batch (or no padding if
+              only a single sequence is provided).
+            * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
+              maximum acceptable input length for the model if that argument is not provided.
+            * :obj:`False` or :obj:`'do_not_pad'`: No padding (i.e., can output a batch with sequences of
+              different lengths).
+        max_length (:obj:`int`, `optional`):
+            Maximum length of the returned list and optionally padding length (see above).
+        pad_to_multiple_of (:obj:`int`, `optional`):
+            If set will pad the sequence to a multiple of the provided value.
+
+            This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >=
+            7.0 (Volta).
+        label_pad_token_id (:obj:`int`, `optional`, defaults to -100):
+            The id to use when padding the labels (-100 will be automatically ignored by PyTorch loss functions).
+    """
+
+    tokenizer: PreTrainedTokenizerBase
+    padding: Union[bool, str, PaddingStrategy] = True
+    max_length: Optional[int] = None
+    pad_to_multiple_of: Optional[int] = None
+    label_pad_token_id: int = -100
+
+    def __call__(self, features):
+        labels = [feature["labels"] for feature in features] if "labels" in features[0].keys() else None
+        # We have to pad the labels before calling `tokenizer.pad` as this method won't pad them and needs them of the
+        # same length to return tensors.
+        if labels is not None:
+            max_label_length = max(len(l) for l in labels)
+            padding_side = self.tokenizer.padding_side
+            for feature in features:
+                remainder = [self.label_pad_token_id] * (max_label_length - len(feature["labels"]))
+                feature["labels"] = (
+                    feature["labels"] + remainder if padding_side == "right" else remainder + feature["labels"]
+                )
+
+        return self.tokenizer.pad(
+            features,
+            padding=self.padding,
+            max_length=self.max_length,
+            pad_to_multiple_of=self.pad_to_multiple_of,
+            return_tensors="pt",
+        )
+
+
 @dataclass
 class DataCollatorForLanguageModeling:
    """
@@ -345,7 +402,7 @@ class DataCollatorForWholeWordMask(DataCollatorForLanguageModeling):
            # For Chinese tokens, we need extra inf to mark sub-word, e.g [喜,欢]-> [喜，##欢]
            if "chinese_ref" in e:
                ref_pos = tolist(e["chinese_ref"])
-                len_seq = e["input_ids"].size(0)
+                len_seq = len(e["input_ids"])
                for i in range(len_seq):
                    if i in ref_pos:
                        ref_tokens[i] = "##" + ref_tokens[i]
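The label-padding step that `DataCollatorForSeq2Seq.__call__` performs before handing the batch to `tokenizer.pad` can be illustrated without a tokenizer. The sketch below is a simplified stand-in, not the class itself: `pad_labels` is a hypothetical helper, and unlike the code in the diff it returns new dictionaries instead of modifying the features in place.

```python
from typing import Any, Dict, List


def pad_labels(
    features: List[Dict[str, Any]],
    label_pad_token_id: int = -100,
    padding_side: str = "right",
) -> List[Dict[str, Any]]:
    """Pad every `labels` list to the longest one in the batch (hypothetical helper)."""
    max_label_length = max(len(f["labels"]) for f in features)
    padded = []
    for f in features:
        remainder = [label_pad_token_id] * (max_label_length - len(f["labels"]))
        # Mirror the tokenizer's padding side so labels stay aligned with the inputs.
        labels = f["labels"] + remainder if padding_side == "right" else remainder + f["labels"]
        padded.append({**f, "labels": labels})
    return padded


batch = [
    {"input_ids": [0, 11, 12, 2], "labels": [7, 8, 9]},
    {"input_ids": [0, 13, 2], "labels": [7]},
]
print(pad_labels(batch))
```

The value -100 matters because PyTorch's cross-entropy loss uses it as its default `ignore_index`, so padded label positions do not contribute to the loss.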
--- a/src/transformers/data/processors/glue.py
+++ b/src/transformers/data/processors/glue.py
@@ -98,7 +98,7 @@ if is_tf_available():
                label = d.pop("label")
                yield (d, label)

-        input_names = ["input_ids"] + tokenizer.model_input_names
+        input_names = tokenizer.model_input_names

        return tf.data.Dataset.from_generator(
            gen,
--- a/src/transformers/dependency_versions_table.py
+++ b/src/transformers/dependency_versions_table.py
@@ -15,10 +15,10 @@ deps = {
    "importlib_metadata": "importlib_metadata",
    "ipadic": "ipadic>=1.0.0,<2.0",
    "isort": "isort>=5.5.4",
-    "jax": "jax>=0.2.0",
-    "jaxlib": "jaxlib==0.1.55",
+    "jax": "jax>=0.2.8",
+    "jaxlib": "jaxlib>=0.1.59",
    "keras2onnx": "keras2onnx",
-    "numpy": "numpy",
+    "numpy": "numpy>=1.17",
    "onnxconverter-common": "onnxconverter-common",
    "onnxruntime-tools": "onnxruntime-tools>=1.4.2",
    "onnxruntime": "onnxruntime>=1.4.0",
@@ -36,6 +36,7 @@ deps = {
    "sacremoses": "sacremoses",
    "scikit-learn": "scikit-learn",
    "sentencepiece": "sentencepiece==0.1.91",
+    "soundfile": "soundfile",
    "sphinx-copybutton": "sphinx-copybutton",
    "sphinx-markdown-tables": "sphinx-markdown-tables",
    "sphinx-rtd-theme": "sphinx-rtd-theme==0.4.3",
@@ -44,7 +45,7 @@ deps = {
    "tensorflow-cpu": "tensorflow-cpu>=2.3",
    "tensorflow": "tensorflow>=2.3",
    "timeout-decorator": "timeout-decorator",
-    "tokenizers": "tokenizers==0.9.4",
+    "tokenizers": "tokenizers>=0.10.1,<0.11",
    "torch": "torch>=1.0",
    "tqdm": "tqdm>=4.27",
    "unidic": "unidic>=1.0.2",
--- a/src/transformers/file_utils.py
+++ b/src/transformers/file_utils.py
@@ -59,7 +59,7 @@ else:

 logger = logging.get_logger(__name__)  # pylint: disable=invalid-name

-ENV_VARS_TRUE_VALUES = {"1", "ON", "YES"}
+ENV_VARS_TRUE_VALUES = {"1", "ON", "YES", "TRUE"}
 ENV_VARS_TRUE_AND_AUTO_VALUES = ENV_VARS_TRUE_VALUES.union({"AUTO"})

 USE_TF = os.environ.get("USE_TF", "AUTO").upper()
@@ -89,8 +89,20 @@ if USE_TF in ENV_VARS_TRUE_AND_AUTO_VALUES and USE_TORCH not in ENV_VARS_TRUE_VA
            try:
                _tf_version = importlib_metadata.version("tensorflow-cpu")
            except importlib_metadata.PackageNotFoundError:
-                _tf_version = None
-                _tf_available = False
+                try:
+                    _tf_version = importlib_metadata.version("tensorflow-gpu")
+                except importlib_metadata.PackageNotFoundError:
+                    try:
+                        _tf_version = importlib_metadata.version("tf-nightly")
+                    except importlib_metadata.PackageNotFoundError:
+                        try:
+                            _tf_version = importlib_metadata.version("tf-nightly-cpu")
+                        except importlib_metadata.PackageNotFoundError:
+                            try:
+                                _tf_version = importlib_metadata.version("tf-nightly-gpu")
+                            except importlib_metadata.PackageNotFoundError:
+                                _tf_version = None
+                                _tf_available = False
    if _tf_available:
        if version.parse(_tf_version) < version.parse("2"):
            logger.info(f"TensorFlow found but with version {_tf_version}. Transformers requires version 2 minimum.")
@@ -143,6 +155,14 @@ except importlib_metadata.PackageNotFoundError:
    _scatter_available = False


+_soundfile_available = importlib.util.find_spec("soundfile") is not None
+try:
+    _soundfile_version = importlib_metadata.version("soundfile")
+    logger.debug(f"Successfully imported soundfile version {_soundfile_version}")
+except importlib_metadata.PackageNotFoundError:
+    _soundfile_available = False
+
+
 torch_cache_home = os.getenv("TORCH_HOME", os.path.join(os.getenv("XDG_CACHE_HOME", "~/.cache"), "torch"))
 old_default_cache_path = os.path.join(torch_cache_home, "transformers")
 # New default cache, shared with the Datasets library
@@ -285,6 +305,24 @@ def is_pandas_available():
    return importlib.util.find_spec("pandas") is not None


+def is_sagemaker_distributed_available():
+    # Get the sagemaker specific env variable.
+    sagemaker_params = os.getenv("SM_FRAMEWORK_PARAMS", "{}")
+    try:
+        # Parse it and check the field "sagemaker_distributed_dataparallel_enabled".
+        sagemaker_params = json.loads(sagemaker_params)
+        if not sagemaker_params.get("sagemaker_distributed_dataparallel_enabled", False):
+            return False
+    except json.JSONDecodeError:
+        return False
+    # Lastly, check if the `smdistributed` module is present.
+    return importlib.util.find_spec("smdistributed") is not None
+
+
+def is_soundfile_availble():
+    return _soundfile_available
+
+
 def torch_only_method(fn):
    def wrapper(*args, **kwargs):
        if not _torch_available:
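
Reviewer note: the environment check in `is_sagemaker_distributed_available` can be exercised without SageMaker by passing the environment in explicitly. The sketch below is an illustration under assumptions: `dataparallel_requested` is a hypothetical name, the `isinstance` guard is an extra safety check that the commit does not have, and the final `smdistributed` module lookup is left out.

```python
import json
from typing import Mapping


def dataparallel_requested(environ: Mapping[str, str]) -> bool:
    """Return True if SM_FRAMEWORK_PARAMS enables distributed data parallel."""
    raw = environ.get("SM_FRAMEWORK_PARAMS", "{}")
    try:
        params = json.loads(raw)
    except json.JSONDecodeError:
        # Malformed JSON is treated as "not enabled".
        return False
    if not isinstance(params, dict):
        # Extra guard (not in the commit): valid JSON that is not an object.
        return False
    return bool(params.get("sagemaker_distributed_dataparallel_enabled", False))


enabled = dataparallel_requested(
    {"SM_FRAMEWORK_PARAMS": '{"sagemaker_distributed_dataparallel_enabled": true}'}
)
```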
@@ -1213,7 +1251,7 @@ def get_from_cache(
                # the models might've been found if local_files_only=False
                # Notify the user about that
                if local_files_only:
-                    raise ValueError(
+                    raise FileNotFoundError(
                        "Cannot find the requested files in the cached path and outgoing traffic has been"
                        " disabled. To enable model look-ups and downloads online, set 'local_files_only'"
                        " to False."
--- a/src/transformers/generation_logits_process.py
+++ b/src/transformers/generation_logits_process.py
@@ -155,13 +155,12 @@ class RepetitionPenaltyLogitsProcessor(LogitsProcessor):
        self.penalty = penalty

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
-        ranges = torch.arange(scores.shape[0])
-        score = scores[ranges[:, None], input_ids]
+        score = torch.gather(scores, 1, input_ids)

        # if score < 0 then repetition penalty has to be multiplied to reduce the previous token probability
        score = torch.where(score < 0, score * self.penalty, score / self.penalty)

-        scores[ranges[:, None], input_ids] = score
+        scores.scatter_(1, input_ids, score)
        return scores


@@ -236,6 +235,41 @@ class TopKLogitsWarper(LogitsWarper):
        return scores


+def _get_ngrams(ngram_size: int, prev_input_ids: torch.Tensor, num_hypos: int):
+    generated_ngrams = [{} for _ in range(num_hypos)]
+    for idx in range(num_hypos):
+        gen_tokens = prev_input_ids[idx].tolist()
+        generated_ngram = generated_ngrams[idx]
+        for ngram in zip(*[gen_tokens[i:] for i in range(ngram_size)]):
+            prev_ngram_tuple = tuple(ngram[:-1])
+            generated_ngram[prev_ngram_tuple] = generated_ngram.get(prev_ngram_tuple, []) + [ngram[-1]]
+    return generated_ngrams
+
+
+def _get_generated_ngrams(banned_ngrams, prev_input_ids, ngram_size, cur_len):
+    # Before decoding the next token, prevent decoding of ngrams that have already appeared
+    start_idx = cur_len + 1 - ngram_size
+    ngram_idx = tuple(prev_input_ids[start_idx:cur_len].tolist())
+    return banned_ngrams.get(ngram_idx, [])
+
+
+def _calc_banned_ngram_tokens(
+    ngram_size: int, prev_input_ids: torch.Tensor, num_hypos: int, cur_len: int
+) -> List[Iterable[int]]:
+    """Copied from fairseq for no_repeat_ngram in beam_search"""
+    if cur_len + 1 < ngram_size:
+        # return no banned tokens if we haven't generated no_repeat_ngram_size tokens yet
+        return [[] for _ in range(num_hypos)]
+
+    generated_ngrams = _get_ngrams(ngram_size, prev_input_ids, num_hypos)
+
+    banned_tokens = [
+        _get_generated_ngrams(generated_ngrams[hypo_idx], prev_input_ids[hypo_idx], ngram_size, cur_len)
+        for hypo_idx in range(num_hypos)
+    ]
+    return banned_tokens
+
+
 class NoRepeatNGramLogitsProcessor(LogitsProcessor):
    r"""
    :class:`transformers.LogitsProcessor` that enforces no repetition of n-grams. See `Fairseq
@@ -254,36 +288,53 @@ class NoRepeatNGramLogitsProcessor(LogitsProcessor):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        num_batch_hypotheses = scores.shape[0]
        cur_len = input_ids.shape[-1]
-        banned_batch_tokens = self._calc_banned_ngram_tokens(input_ids, num_batch_hypotheses, cur_len)
+        banned_batch_tokens = _calc_banned_ngram_tokens(self.ngram_size, input_ids, num_batch_hypotheses, cur_len)

        for i, banned_tokens in enumerate(banned_batch_tokens):
            scores[i, banned_tokens] = -float("inf")

        return scores

-    def _calc_banned_ngram_tokens(
-        self, prev_input_ids: torch.Tensor, num_hypos: int, cur_len: int
-    ) -> List[Iterable[int]]:
-        """Copied from fairseq for no_repeat_ngram in beam_search"""
-        if cur_len + 1 < self.ngram_size:
-            # return no banned tokens if we haven't generated no_repeat_ngram_size tokens yet
-            return [[] for _ in range(num_hypos)]
-        generated_ngrams = [{} for _ in range(num_hypos)]
-        for idx in range(num_hypos):
-            gen_tokens = prev_input_ids[idx].tolist()
-            generated_ngram = generated_ngrams[idx]
-            for ngram in zip(*[gen_tokens[i:] for i in range(self.ngram_size)]):
-                prev_ngram_tuple = tuple(ngram[:-1])
-                generated_ngram[prev_ngram_tuple] = generated_ngram.get(prev_ngram_tuple, []) + [ngram[-1]]

-        def _get_generated_ngrams(hypo_idx):
-            # Before decoding the next token, prevent decoding of ngrams that have already appeared
-            start_idx = cur_len + 1 - self.ngram_size
-            ngram_idx = tuple(prev_input_ids[hypo_idx, start_idx:cur_len].tolist())
-            return generated_ngrams[hypo_idx].get(ngram_idx, [])
+class EncoderNoRepeatNGramLogitsProcessor(LogitsProcessor):
+    r"""
+    :class:`transformers.LogitsProcessor` that enforces no repetition of encoder input ids n-grams for the decoder ids.
+    See `ParlAI <https://github.com/facebookresearch/ParlAI/blob/master/parlai/core/torch_generator_agent.py#L1350>`__.

-        banned_tokens = [_get_generated_ngrams(hypo_idx) for hypo_idx in range(num_hypos)]
-        return banned_tokens
+    Args:
+        encoder_ngram_size (:obj:`int`):
+            All ngrams of size :obj:`encoder_ngram_size` that occur in the encoder input ids cannot be generated in the decoder ids.
+        encoder_input_ids (:obj:`torch.LongTensor`):
+            The encoder_input_ids that should not be repeated within the decoder ids.
+    """
+
+    def __init__(self, encoder_ngram_size: int, encoder_input_ids: torch.LongTensor):
+        if not isinstance(encoder_ngram_size, int) or encoder_ngram_size <= 0:
+            raise ValueError(
+                f"`encoder_ngram_size` has to be a strictly positive integer, but is {encoder_ngram_size}"
+            )
+        self.ngram_size = encoder_ngram_size
+        if len(encoder_input_ids.shape) == 1:
+            encoder_input_ids = encoder_input_ids.unsqueeze(0)
+        self.batch_size = encoder_input_ids.shape[0]
+        self.generated_ngrams = _get_ngrams(encoder_ngram_size, encoder_input_ids, self.batch_size)
+
+    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
+        # B x num_beams
+        num_hypos = scores.shape[0]
+        num_beams = num_hypos // self.batch_size
+        cur_len = input_ids.shape[-1]
+        banned_batch_tokens = [
+            _get_generated_ngrams(
+                self.generated_ngrams[hypo_idx // num_beams], input_ids[hypo_idx], self.ngram_size, cur_len
+            )
+            for hypo_idx in range(num_hypos)
+        ]
+
+        for i, banned_tokens in enumerate(banned_batch_tokens):
+            scores[i, banned_tokens] = -float("inf")
+
+        return scores


 class NoBadWordsLogitsProcessor(LogitsProcessor):
@@ -440,8 +491,6 @@ class HammingDiversityLogitsProcessor(LogitsProcessor):
            raise ValueError("`num_beam_groups` should be an integer strictly larger than 1.")
        if num_beam_groups > num_beams:
            raise ValueError("`beam_groups` has to be smaller or equal to `num_beams`.")
-        if num_beam_groups > num_beams:
-            raise ValueError("`beam_groups` has to be smaller or equal to `num_beams`")
        self._num_sub_beams = num_beams // num_beam_groups

    def __call__(
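
Reviewer note: the n-gram bookkeeping factored out into `_get_ngrams` and `_get_generated_ngrams` in this file can be followed on plain Python lists. The sketch below mirrors that logic for a single hypothesis and drops the tensor handling; `get_ngrams` and `banned_next_tokens` are hypothetical names chosen for the illustration.

```python
from typing import Dict, List, Tuple


def get_ngrams(ngram_size: int, tokens: List[int]) -> Dict[Tuple[int, ...], List[int]]:
    """Map every (ngram_size - 1)-token prefix to the tokens that followed it."""
    ngrams: Dict[Tuple[int, ...], List[int]] = {}
    for ngram in zip(*[tokens[i:] for i in range(ngram_size)]):
        prefix = tuple(ngram[:-1])
        ngrams.setdefault(prefix, []).append(ngram[-1])
    return ngrams


def banned_next_tokens(ngram_size: int, tokens: List[int]) -> List[int]:
    """Tokens that would complete an n-gram already present in `tokens`."""
    cur_len = len(tokens)
    if cur_len + 1 < ngram_size:
        # Not enough tokens yet to form a full n-gram.
        return []
    ngrams = get_ngrams(ngram_size, tokens)
    start_idx = cur_len + 1 - ngram_size
    prefix = tuple(tokens[start_idx:cur_len])
    return ngrams.get(prefix, [])


# With tokens 1 2 3 1 2 and trigrams, the prefix (1, 2) was followed by 3 before.
print(banned_next_tokens(3, [1, 2, 3, 1, 2]))  # prints [3]
```

For the encoder variant, the dictionary would be built once from the encoder input and only the prefix lookup would run at each decoding step.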
--- a/src/transformers/generation_utils.py
+++ b/src/transformers/generation_utils.py
@@ -23,6 +23,7 @@ from torch.nn import functional as F
 from .file_utils import ModelOutput
 from .generation_beam_search import BeamScorer, BeamSearchScorer
 from .generation_logits_process import (
+    EncoderNoRepeatNGramLogitsProcessor,
    HammingDiversityLogitsProcessor,
    LogitsProcessorList,
    MinLengthLogitsProcessor,
@@ -503,18 +504,10 @@ class GenerationMixin:

        return model_kwargs

-    @staticmethod
-    def _reorder_cache(past: Tuple[torch.Tensor], beam_idx: torch.Tensor) -> Tuple[torch.Tensor]:
-        """
-        This function is used to re-order the :obj:`past_key_values` or :obj:`mems` cache if
-        :meth:`~transformers.PretrainedModel.beam_search` or :meth:`~transformers.PretrainedModel.beam_sample` is
-        called. This is required to match :obj:`past_key_values` or :obj:`mems` with the correct beam_idx at every
-        generation step.
-
-        For custom re-ordering of :obj:`past_key_values` or :obj:`mems`, the function should be implemented in
-        subclasses of :class:`~transformers.PreTrainedModel`.
-        """
-        return tuple(layer_past.index_select(1, beam_idx.to(layer_past.device)) for layer_past in past)
+    def _reorder_cache(self, past, beam_idx):
+        raise NotImplementedError(
+            f"Make sure that a `_reorder_cache` function is correctly implemented in {self.__class__.__module__} to enable beam search for {self.__class__}"
+        )

    def _get_logits_warper(
        self, top_k: int = None, top_p: float = None, temperature: float = None, num_beams: int = None
@@ -545,6 +538,8 @@ class GenerationMixin:
        self,
        repetition_penalty: float,
        no_repeat_ngram_size: int,
+        encoder_no_repeat_ngram_size: int,
+        encoder_input_ids: torch.LongTensor,
        bad_words_ids: List[List[int]],
        min_length: int,
        eos_token_id: int,
@@ -563,6 +558,11 @@ class GenerationMixin:
        no_repeat_ngram_size = (
            no_repeat_ngram_size if no_repeat_ngram_size is not None else self.config.no_repeat_ngram_size
        )
+        encoder_no_repeat_ngram_size = (
+            encoder_no_repeat_ngram_size
+            if encoder_no_repeat_ngram_size is not None
+            else self.config.encoder_no_repeat_ngram_size
+        )
        bad_words_ids = bad_words_ids if bad_words_ids is not None else self.config.bad_words_ids
        min_length = min_length if min_length is not None else self.config.min_length
        eos_token_id = eos_token_id if eos_token_id is not None else self.config.eos_token_id
@@ -582,6 +582,13 @@ class GenerationMixin:
            processors.append(RepetitionPenaltyLogitsProcessor(penalty=repetition_penalty))
        if no_repeat_ngram_size is not None and no_repeat_ngram_size > 0:
            processors.append(NoRepeatNGramLogitsProcessor(no_repeat_ngram_size))
+        if encoder_no_repeat_ngram_size is not None and encoder_no_repeat_ngram_size > 0:
+            if self.config.is_encoder_decoder:
+                processors.append(EncoderNoRepeatNGramLogitsProcessor(encoder_no_repeat_ngram_size, encoder_input_ids))
+            else:
+                raise ValueError(
+                    "It's impossible to use `encoder_no_repeat_ngram_size` with decoder-only architecture"
+                )
        if bad_words_ids is not None:
            processors.append(NoBadWordsLogitsProcessor(bad_words_ids, eos_token_id))
        if min_length is not None and eos_token_id is not None and min_length > -1:
@@ -609,6 +616,7 @@ class GenerationMixin:
        eos_token_id: Optional[int] = None,
        length_penalty: Optional[float] = None,
        no_repeat_ngram_size: Optional[int] = None,
+        encoder_no_repeat_ngram_size: Optional[int] = None,
        num_return_sequences: Optional[int] = None,
        decoder_start_token_id: Optional[int] = None,
        use_cache: Optional[bool] = None,
@@ -669,6 +677,9 @@ class GenerationMixin:
                sequences.
            no_repeat_ngram_size (:obj:`int`, `optional`, defaults to 0):
                If set to int > 0, all ngrams of that size can only occur once.
+            encoder_no_repeat_ngram_size (:obj:`int`, `optional`, defaults to 0):
+                If set to int > 0, all ngrams of that size that occur in the ``encoder_input_ids`` cannot occur in the
+                ``decoder_input_ids``.
            bad_words_ids(:obj:`List[List[int]]`, `optional`):
                List of token ids that are not allowed to be generated. In order to get the tokens of the words that
                should not appear in the generated text, use :obj:`tokenizer(bad_word,
@@ -828,6 +839,9 @@ class GenerationMixin:
            logger.warning(f"Setting `pad_token_id` to `eos_token_id`:{eos_token_id} for open-end generation.")
            pad_token_id = eos_token_id

+        # Storing encoder_input_ids for logits_processor that could use them
+        encoder_input_ids = input_ids if self.config.is_encoder_decoder else None
+
        if self.config.is_encoder_decoder:
            # add encoder_outputs to model_kwargs
            model_kwargs = self._prepare_encoder_decoder_kwargs_for_generation(input_ids, model_kwargs)
@@ -870,6 +884,8 @@ class GenerationMixin:
        logits_processor = self._get_logits_processor(
            repetition_penalty=repetition_penalty,
            no_repeat_ngram_size=no_repeat_ngram_size,
+            encoder_no_repeat_ngram_size=encoder_no_repeat_ngram_size,
+            encoder_input_ids=encoder_input_ids,
            bad_words_ids=bad_words_ids,
            min_length=min_length,
            eos_token_id=eos_token_id,
@@ -1646,6 +1662,7 @@ class GenerationMixin:
            beam_idx = beam_outputs["next_beam_indices"]

            input_ids = torch.cat([input_ids[beam_idx, :], beam_next_tokens.unsqueeze(-1)], dim=-1)
+
            cur_len = cur_len + 1

            model_kwargs = self._update_model_kwargs_for_generation(
--- a/src/transformers/hf_argparser.py
+++ b/src/transformers/hf_argparser.py
@@ -15,7 +15,7 @@
 import dataclasses
 import json
 import sys
-from argparse import ArgumentParser
+from argparse import ArgumentParser, ArgumentTypeError
 from enum import Enum
 from pathlib import Path
 from typing import Any, Iterable, List, NewType, Optional, Tuple, Union
@@ -25,6 +25,20 @@ DataClass = NewType("DataClass", Any)
 DataClassType = NewType("DataClassType", Any)


+# From https://stackoverflow.com/questions/15008758/parsing-boolean-values-with-argparse
+def string_to_bool(v):
+    if isinstance(v, bool):
+        return v
+    if v.lower() in ("yes", "true", "t", "y", "1"):
+        return True
+    elif v.lower() in ("no", "false", "f", "n", "0"):
+        return False
+    else:
+        raise ArgumentTypeError(
+            f"Truthy value expected: got {v} but expected one of yes/no, true/false, t/f, y/n, 1/0 (case insensitive)."
+        )
+
+
 class HfArgumentParser(ArgumentParser):
    """
    This subclass of `argparse.ArgumentParser` uses type hints on dataclasses to generate arguments.
@@ -80,16 +94,25 @@ class HfArgumentParser(ArgumentParser):
                    field.type = prim_type

            if isinstance(field.type, type) and issubclass(field.type, Enum):
-                kwargs["choices"] = list(field.type)
-                kwargs["type"] = field.type
+                kwargs["choices"] = [x.value for x in field.type]
+                kwargs["type"] = type(kwargs["choices"][0])
                if field.default is not dataclasses.MISSING:
                    kwargs["default"] = field.default
            elif field.type is bool or field.type is Optional[bool]:
-                if field.type is bool or (field.default is not None and field.default is not dataclasses.MISSING):
-                    kwargs["action"] = "store_false" if field.default is True else "store_true"
                if field.default is True:
-                    field_name = f"--no_{field.name}"
-                    kwargs["dest"] = field.name
+                    self.add_argument(f"--no_{field.name}", action="store_false", dest=field.name, **kwargs)
+
+                # Hack because type=bool in argparse does not behave as we want.
+                kwargs["type"] = string_to_bool
+                if field.type is bool or (field.default is not None and field.default is not dataclasses.MISSING):
+                    # For a bool field declared without a default, fall back to True.
+                    default = True if field.default is dataclasses.MISSING else field.default
+                    # This is the value that will get picked if we don't include --field_name in any way
+                    kwargs["default"] = default
+                    # This tells argparse we accept 0 or 1 value after --field_name
+                    kwargs["nargs"] = "?"
+                    # This is the value that will get picked if we do --field_name (without value)
+                    kwargs["const"] = True
            elif hasattr(field.type, "__origin__") and issubclass(field.type.__origin__, List):
                kwargs["nargs"] = "+"
                kwargs["type"] = field.type.__args__[0]
@@ -175,7 +198,7 @@ class HfArgumentParser(ArgumentParser):
        data = json.loads(Path(json_file).read_text())
        outputs = []
        for dtype in self.dataclass_types:
-            keys = {f.name for f in dataclasses.fields(dtype)}
+            keys = {f.name for f in dataclasses.fields(dtype) if f.init}
            inputs = {k: v for k, v in data.items() if k in keys}
            obj = dtype(**inputs)
            outputs.append(obj)
@@ -188,7 +211,7 @@ class HfArgumentParser(ArgumentParser):
        """
        outputs = []
        for dtype in self.dataclass_types:
-            keys = {f.name for f in dataclasses.fields(dtype)}
+            keys = {f.name for f in dataclasses.fields(dtype) if f.init}
            inputs = {k: v for k, v in args.items() if k in keys}
            obj = dtype(**inputs)
            outputs.append(obj)
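
Reviewer note: the boolean handling added to `HfArgumentParser` combines a custom `type` converter with `nargs="?"` and `const=True`. The sketch below reproduces that combination on a plain `argparse.ArgumentParser`; the flag names `--do_eval` and `--use_cache` are placeholders for illustration and are not taken from this diff.

```python
from argparse import ArgumentParser, ArgumentTypeError


def string_to_bool(v):
    """Convert common textual spellings of a boolean to bool."""
    if isinstance(v, bool):
        return v
    if v.lower() in ("yes", "true", "t", "y", "1"):
        return True
    if v.lower() in ("no", "false", "f", "n", "0"):
        return False
    raise ArgumentTypeError(f"Boolean value expected, got {v!r}.")


parser = ArgumentParser()
# Accepts `--do_eval`, `--do_eval true` or `--do_eval false`.
parser.add_argument("--do_eval", type=string_to_bool, default=False, nargs="?", const=True)
# A field that defaults to True also gets a `--no_` switch writing to the same dest.
parser.add_argument("--no_use_cache", action="store_false", dest="use_cache")
parser.add_argument("--use_cache", type=string_to_bool, default=True, nargs="?", const=True)

args = parser.parse_args(["--do_eval", "--use_cache", "false"])
print(args.do_eval, args.use_cache)  # prints True False
```

`nargs="?"` makes the value optional, `const` supplies the value for the bare flag, and `default` covers the case where the flag is absent.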
--- a/src/transformers/integrations.py
+++ b/src/transformers/integrations.py
@@ -54,7 +54,8 @@ from .trainer_utils import PREFIX_CHECKPOINT_DIR, BestRun, EvaluationStrategy  #

 # Integration functions:
 def is_wandb_available():
-    if os.getenv("WANDB_DISABLED"):
+    # only a truthy value of WANDB_DISABLED (one listed in ENV_VARS_TRUE_VALUES) disables wandb
+    if os.getenv("WANDB_DISABLED", "").upper() in ENV_VARS_TRUE_VALUES:
        return False
    return importlib.util.find_spec("wandb") is not None

@@ -125,13 +126,13 @@ def run_hp_search_optuna(trainer, n_trials: int, direction: str, **kwargs) -> Be
    import optuna

    def _objective(trial, checkpoint_dir=None):
-        model_path = None
+        checkpoint = None
        if checkpoint_dir:
            for subdir in os.listdir(checkpoint_dir):
                if subdir.startswith(PREFIX_CHECKPOINT_DIR):
-                    model_path = os.path.join(checkpoint_dir, subdir)
+                    checkpoint = os.path.join(checkpoint_dir, subdir)
        trainer.objective = None
-        trainer.train(model_path=model_path, trial=trial)
+        trainer.train(resume_from_checkpoint=checkpoint, trial=trial)
        # If there hasn't been any evaluation during the training loop.
        if getattr(trainer, "objective", None) is None:
            metrics = trainer.evaluate()
@@ -149,20 +150,20 @@ def run_hp_search_optuna(trainer, n_trials: int, direction: str, **kwargs) -> Be
 def run_hp_search_ray(trainer, n_trials: int, direction: str, **kwargs) -> BestRun:
    import ray

-    def _objective(trial, checkpoint_dir=None):
-        model_path = None
+    def _objective(trial, local_trainer, checkpoint_dir=None):
+        checkpoint = None
        if checkpoint_dir:
            for subdir in os.listdir(checkpoint_dir):
                if subdir.startswith(PREFIX_CHECKPOINT_DIR):
-                    model_path = os.path.join(checkpoint_dir, subdir)
-        trainer.objective = None
-        trainer.train(model_path=model_path, trial=trial)
+                    checkpoint = os.path.join(checkpoint_dir, subdir)
+        local_trainer.objective = None
+        local_trainer.train(resume_from_checkpoint=checkpoint, trial=trial)
        # If there hasn't been any evaluation during the training loop.
-        if getattr(trainer, "objective", None) is None:
-            metrics = trainer.evaluate()
-            trainer.objective = trainer.compute_objective(metrics)
-            trainer._tune_save_checkpoint()
-            ray.tune.report(objective=trainer.objective, **metrics, done=True)
+        if getattr(local_trainer, "objective", None) is None:
+            metrics = local_trainer.evaluate()
+            local_trainer.objective = local_trainer.compute_objective(metrics)
+            local_trainer._tune_save_checkpoint()
+            ray.tune.report(objective=local_trainer.objective, **metrics, done=True)

    # The model and TensorBoard writer do not pickle, so we have to remove them (if they exist)
    # while doing the ray hp search.
@@ -217,7 +218,12 @@ def run_hp_search_ray(trainer, n_trials: int, direction: str, **kwargs) -> BestR
                "Trainer `args`.".format(cls=type(kwargs["scheduler"]).__name__)
            )

-    analysis = ray.tune.run(_objective, config=trainer.hp_space(None), num_samples=n_trials, **kwargs)
+    analysis = ray.tune.run(
+        ray.tune.with_parameters(_objective, local_trainer=trainer),
+        config=trainer.hp_space(None),
+        num_samples=n_trials,
+        **kwargs,
+    )
    best_trial = analysis.get_best_trial(metric="objective", mode=direction[:3])
    best_run = BestRun(best_trial.trial_id, best_trial.last_result["objective"], best_trial.config)
    if _tb_writer is not None:
@@ -225,6 +231,21 @@ def run_hp_search_ray(trainer, n_trials: int, direction: str, **kwargs) -> BestR
    return best_run


+def get_available_reporting_integrations():
+    integrations = []
+    if is_azureml_available():
+        integrations.append("azure_ml")
+    if is_comet_available():
+        integrations.append("comet_ml")
+    if is_mlflow_available():
+        integrations.append("mlflow")
+    if is_tensorboard_available():
+        integrations.append("tensorboard")
+    if is_wandb_available():
+        integrations.append("wandb")
+    return integrations
+
+
 def rewrite_logs(d):
    new_d = {}
    eval_prefix = "eval_"
@@ -415,8 +436,9 @@ class TensorBoardCallback(TrainerCallback):
                    self._SummaryWriter = SummaryWriter
                except ImportError:
                    self._SummaryWriter = None
+        else:
+            self._SummaryWriter = None
        self.tb_writer = tb_writer
-        self._SummaryWriter = SummaryWriter

    def _init_summary_writer(self, args, log_dir=None):
        log_dir = log_dir or args.logging_dir
@@ -495,6 +517,8 @@ class WandbCallback(TrainerCallback):
            else:
                self._wandb = wandb
        self._initialized = False
+        # log outputs
+        self._log_model = os.getenv("WANDB_LOG_MODEL", "FALSE").upper() in ENV_VARS_TRUE_VALUES.union({"TRUE"})

    def setup(self, args, state, model, reinit, **kwargs):
        """
@@ -512,7 +536,7 @@ class WandbCallback(TrainerCallback):
            WANDB_PROJECT (:obj:`str`, `optional`, defaults to :obj:`"huggingface"`):
                Set this to a custom string to store results in a different project.
            WANDB_DISABLED (:obj:`bool`, `optional`, defaults to :obj:`False`):
-                Whether or not to disable wandb entirely.
+                Whether or not to disable wandb entirely. Set `WANDB_DISABLED=true` to disable.
        """
        if self._wandb is None:
            return
@@ -548,9 +572,6 @@ class WandbCallback(TrainerCallback):
                    model, log=os.getenv("WANDB_WATCH", "gradients"), log_freq=max(100, args.logging_steps)
                )

-            # log outputs
-            self._log_model = os.getenv("WANDB_LOG_MODEL", "FALSE").upper() in ENV_VARS_TRUE_VALUES.union({"TRUE"})
-
    def on_train_begin(self, args, state, control, model=None, **kwargs):
        if self._wandb is None:
            return
@@ -562,7 +583,8 @@ class WandbCallback(TrainerCallback):
        if self._wandb is None:
            return
        # commit last step
-        self._wandb.log({})
+        if state.is_world_process_zero:
+            self._wandb.log({})
        if self._log_model and self._initialized and state.is_world_process_zero:
            from .trainer import Trainer

@@ -756,3 +778,21 @@ class MLflowCallback(TrainerCallback):
        # not let you start a new run before the previous one is killed
        if self._ml_flow.active_run is not None:
            self._ml_flow.end_run(status="KILLED")
+
+
+INTEGRATION_TO_CALLBACK = {
+    "azure_ml": AzureMLCallback,
+    "comet_ml": CometCallback,
+    "mlflow": MLflowCallback,
+    "tensorboard": TensorBoardCallback,
+    "wandb": WandbCallback,
+}
+
+
+def get_reporting_integration_callbacks(report_to):
+    for integration in report_to:
+        if integration not in INTEGRATION_TO_CALLBACK:
+            raise ValueError(
+                f"{integration} is not supported, only {', '.join(INTEGRATION_TO_CALLBACK.keys())} are supported."
+            )
+    return [INTEGRATION_TO_CALLBACK[integration] for integration in report_to]
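
Reviewer note: both `WANDB_DISABLED` and `WANDB_LOG_MODEL` are now read through an upper-cased membership test against a set of accepted strings. The sketch below shows that pattern in isolation. The contents of `TRUE_VALUES` are an assumption made for the example (the real `ENV_VARS_TRUE_VALUES` is defined elsewhere in the library and is not shown in this diff), and `env_flag` is a hypothetical helper.

```python
import os
from typing import Mapping, Optional

# Assumed set of accepted spellings, for illustration only.
TRUE_VALUES = {"1", "ON", "YES", "TRUE"}


def env_flag(name: str, default: str = "", environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only if the variable is set to one of TRUE_VALUES (case-insensitive)."""
    environ = os.environ if environ is None else environ
    return environ.get(name, default).upper() in TRUE_VALUES


# Under this scheme a value such as "false" or "0" no longer counts as "disabled".
print(env_flag("WANDB_DISABLED", environ={"WANDB_DISABLED": "false"}))  # prints False
print(env_flag("WANDB_DISABLED", environ={"WANDB_DISABLED": "true"}))   # prints True
```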
--- a/src/transformers/modeling_tf_outputs.py
+++ b/src/transformers/modeling_tf_outputs.py
@@ -175,7 +175,7 @@ class TFCausalLMOutput(ModelOutput):
    Base class for causal language model (or autoregressive) outputs.

    Args:
-        loss (:obj:`tf.Tensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):
+        loss (:obj:`tf.Tensor` of shape :obj:`(n,)`, `optional`, where n is the number of non-masked labels, returned when :obj:`labels` is provided):
            Language modeling loss (for next-token prediction).
        logits (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):
            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
@@ -204,7 +204,7 @@ class TFCausalLMOutputWithPast(ModelOutput):
    Base class for causal language model (or autoregressive) outputs.

    Args:
-        loss (:obj:`tf.Tensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):
+        loss (:obj:`tf.Tensor` of shape :obj:`(n,)`, `optional`, where n is the number of non-masked labels, returned when :obj:`labels` is provided):
            Language modeling loss (for next-token prediction).
        logits (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):
            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
@@ -240,7 +240,7 @@ class TFMaskedLMOutput(ModelOutput):
    Base class for masked language models outputs.

    Args:
-        loss (:obj:`tf.Tensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):
+        loss (:obj:`tf.Tensor` of shape :obj:`(n,)`, `optional`, where n is the number of non-masked labels, returned when :obj:`labels` is provided):
            Masked language modeling (MLM) loss.
        logits (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):
            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
@@ -269,7 +269,7 @@ class TFSeq2SeqLMOutput(ModelOutput):
    Base class for sequence-to-sequence language models outputs.

    Args:
-        loss (:obj:`tf.Tensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):
+        loss (:obj:`tf.Tensor` of shape :obj:`(n,)`, `optional`, where n is the number of non-masked labels, returned when :obj:`labels` is provided):
            Language modeling loss.
        logits (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):
            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
@@ -321,7 +321,7 @@ class TFNextSentencePredictorOutput(ModelOutput):
    Base class for outputs of models predicting if two sentences are consecutive or not.

    Args:
-        loss (:obj:`tf.Tensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`next_sentence_label` is provided):
+        loss (:obj:`tf.Tensor` of shape :obj:`(n,)`, `optional`, where n is the number of non-masked labels, returned when :obj:`next_sentence_label` is provided):
            Next sentence prediction loss.
        logits (:obj:`tf.Tensor` of shape :obj:`(batch_size, 2)`):
            Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation
@@ -351,7 +351,7 @@ class TFSequenceClassifierOutput(ModelOutput):
    Base class for outputs of sentence classification models.

    Args:
-        loss (:obj:`tf.Tensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):
+        loss (:obj:`tf.Tensor` of shape :obj:`(batch_size, )`, `optional`, returned when :obj:`labels` is provided):
            Classification (or regression if config.num_labels==1) loss.
        logits (:obj:`tf.Tensor` of shape :obj:`(batch_size, config.num_labels)`):
            Classification (or regression if config.num_labels==1) scores (before SoftMax).
@@ -432,7 +432,7 @@ class TFMultipleChoiceModelOutput(ModelOutput):
    Base class for outputs of multiple choice models.

    Args:
-        loss (:obj:`tf.Tensor` of shape `(1,)`, `optional`, returned when :obj:`labels` is provided):
+        loss (:obj:`tf.Tensor` of shape `(batch_size, )`, `optional`, returned when :obj:`labels` is provided):
            Classification loss.
        logits (:obj:`tf.Tensor` of shape :obj:`(batch_size, num_choices)`):
            `num_choices` is the second dimension of the input tensors. (see `input_ids` above).
@@ -463,7 +463,7 @@ class TFTokenClassifierOutput(ModelOutput):
    Base class for outputs of token classification models.

    Args:
-        loss (:obj:`tf.Tensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided) :
+        loss (:obj:`tf.Tensor` of shape :obj:`(n,)`, `optional`, where n is the number of unmasked labels, returned when ``labels`` is provided) :
            Classification loss.
        logits (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`):
            Classification scores (before SoftMax).
@@ -492,7 +492,7 @@ class TFQuestionAnsweringModelOutput(ModelOutput):
    Base class for outputs of question answering models.

    Args:
-        loss (:obj:`tf.Tensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):
+        loss (:obj:`tf.Tensor` of shape :obj:`(batch_size, )`, `optional`, returned when :obj:`start_positions` and :obj:`end_positions` are provided):
            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.
        start_logits (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`):
            Span-start scores (before SoftMax).
@@ -579,7 +579,7 @@ class TFSequenceClassifierOutputWithPast(ModelOutput):
    Base class for outputs of sentence classification models.

    Args:
-        loss (:obj:`tf.Tensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):
+        loss (:obj:`tf.Tensor` of shape :obj:`(batch_size, )`, `optional`, returned when :obj:`labels` is provided):
            Classification (or regression if config.num_labels==1) loss.
        logits (:obj:`tf.Tensor` of shape :obj:`(batch_size, config.num_labels)`):
            Classification (or regression if config.num_labels==1) scores (before SoftMax).
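
Reviewer note: the `(n,)` loss shape documented in this file follows from computing an unreduced cross-entropy only over labels that are not masked with `-100`. The pure-Python sketch below illustrates where `n` comes from; it is a stand-in for the TensorFlow computation, not a copy of it, and `per_label_loss` is a hypothetical name.

```python
import math
from typing import List


def per_label_loss(logits: List[List[float]], labels: List[int], ignore_index: int = -100) -> List[float]:
    """Unreduced cross-entropy, one entry per label that is not `ignore_index`."""
    losses = []
    for row, label in zip(logits, labels):
        if label == ignore_index:
            # Masked positions contribute no entry to the loss vector.
            continue
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        losses.append(log_sum_exp - row[label])
    return losses


logits = [[0.0, 0.0], [1.0, 2.0], [3.0, 3.0]]
labels = [0, -100, 1]
losses = per_label_loss(logits, labels)
print(len(losses))  # prints 2: three positions, one of them masked
```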
--- a/src/transformers/modeling_tf_pytorch_utils.py
+++ b/src/transformers/modeling_tf_pytorch_utils.py
@@ -64,6 +64,10 @@ def convert_tf_weight_name_to_pt_weight_name(tf_name, start_prefix_to_remove="")
    if tf_name[-1] == "beta":
        tf_name[-1] = "bias"

+    # The SeparableConv1D TF layer contains two weights that are translated to PyTorch Conv1D here
+    if tf_name[-1] == "pointwise_kernel" or tf_name[-1] == "depthwise_kernel":
+        tf_name[-1] = tf_name[-1].replace("_kernel", ".weight")
+
    # Remove prefix if needed
    tf_name = ".".join(tf_name)
    if start_prefix_to_remove:
@@ -127,7 +131,6 @@ def load_pytorch_weights_in_tf2_model(tf_model, pt_state_dict, tf_inputs=None, a

    if tf_inputs is not None:
        tf_model(tf_inputs, training=False)  # Make sure model is built
-
    # Adapt state dict - TODO remove this and update the AWS weights files instead
    # Convert old format to new format if needed from a PyTorch state_dict
    old_keys = []
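Note on the `SeparableConv1D` renaming added above: the last component of the TF weight name is rewritten so that `pointwise_kernel` and `depthwise_kernel` become `pointwise.weight` and `depthwise.weight` once the parts are joined with dots. The sketch below isolates only that step, assuming the name has already been split into parts; `rename_separable_conv_suffix` is a hypothetical helper and skips every other rule the real function applies.

```python
def rename_separable_conv_suffix(tf_name_parts):
    """Apply only the SeparableConv1D suffix rule, then join the parts (hypothetical helper)."""
    parts = list(tf_name_parts)  # copy so the caller's list is not mutated
    if parts[-1] in ("pointwise_kernel", "depthwise_kernel"):
        # "pointwise_kernel" -> "pointwise.weight", "depthwise_kernel" -> "depthwise.weight"
        parts[-1] = parts[-1].replace("_kernel", ".weight")
    return ".".join(parts)


pointwise = rename_separable_conv_suffix(["layer", "conv", "pointwise_kernel"])
depthwise = rename_separable_conv_suffix(["layer", "conv", "depthwise_kernel"])
untouched = rename_separable_conv_suffix(["layer", "dense", "bias"])
```

Because the replacement inserts a dot, one TF layer maps onto two separately named PyTorch weights.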
--- a/src/transformers/modeling_tf_utils.py
+++ b/src/transformers/modeling_tf_utils.py
@@ -44,6 +44,11 @@ from .utils import logging


 logger = logging.get_logger(__name__)
+tf_logger = tf.get_logger()
+
+TFModelInputType = Union[
+    List[tf.Tensor], List[np.ndarray], Dict[str, tf.Tensor], Dict[str, np.ndarray], np.ndarray, tf.Tensor
+]


 class TFModelUtilsMixin:
@@ -147,7 +152,7 @@ class TFCausalLanguageModelingLoss:
        loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
            from_logits=True, reduction=tf.keras.losses.Reduction.NONE
        )
-        # make sure only labels that are not equal to -100 do not affect loss
+        # make sure only labels that are not equal to -100 affect the loss
        active_loss = tf.not_equal(tf.reshape(labels, (-1,)), -100)
        reduced_logits = tf.boolean_mask(tf.reshape(logits, (-1, shape_list(logits)[2])), active_loss)
        labels = tf.boolean_mask(tf.reshape(labels, (-1,)), active_loss)
@@ -285,7 +290,7 @@ def booleans_processing(config, **kwargs):
            or kwargs["output_hidden_states"] is not None
            or ("use_cache" in kwargs and kwargs["use_cache"] is not None)
        ):
-            logger.warning(
+            tf_logger.warn(
                "The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model."
                "They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`)."
            )
@@ -294,7 +299,7 @@ def booleans_processing(config, **kwargs):
        final_booleans["output_hidden_states"] = config.output_hidden_states

        if kwargs["return_dict"] is not None:
-            logger.warning("The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.")
+            tf_logger.warn("The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.")
        final_booleans["return_dict"] = True

        if "use_cache" in kwargs:
@@ -322,6 +327,7 @@ def input_processing(func, config, input_ids, **kwargs):
    """
    signature = dict(inspect.signature(func).parameters)
    signature.pop("kwargs", None)
+    signature.pop("self", None)
    parameter_names = list(signature.keys())
    output = {}
    allowed_types = (tf.Tensor, bool, int, ModelOutput, tuple, list, dict, np.ndarray)
@@ -346,6 +352,8 @@ def input_processing(func, config, input_ids, **kwargs):
            f"The following keyword arguments are not supported by this model: {list(kwargs['kwargs_call'].keys())}."
        )

+    kwargs.pop("kwargs_call")
+
    for k, v in kwargs.items():
        if isinstance(v, allowed_types) or v is None:
            output[k] = v
@@ -356,8 +364,8 @@ def input_processing(func, config, input_ids, **kwargs):
        for i, input in enumerate(input_ids):
            # EagerTensors don't allow to use the .name property so we check for a real Tensor
            if type(input) == tf.Tensor:
-                # Tensor names have always the pattern name:device_id then we check only the
-                # name and not the device id
+                # Tensor names always follow the pattern `name:id`, so we check only the
+                # `name` part
                tensor_name = input.name.split(":")[0]

                if tensor_name in parameter_names:
@@ -809,25 +817,29 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin, TFGenerationMixin):

        return model_embeds

-    def _get_word_embedding_weight(self, embedding_layer):
-        if hasattr(embedding_layer, "word_embeddings"):
-            return embedding_layer.word_embeddings
-        elif hasattr(embedding_layer, "weight"):
-            return embedding_layer.weight
-        elif hasattr(embedding_layer, "decoder"):
-            return embedding_layer.decoder
-        else:
-            # Here we build the word embeddings weights if not exists.
-            # And then we retry to get the attribute once built.
-            self(self.dummy_inputs)
-            if hasattr(embedding_layer, "word_embeddings"):
-                return embedding_layer.word_embeddings
-            elif hasattr(embedding_layer, "weight"):
-                return embedding_layer.weight
-            elif hasattr(embedding_layer, "decoder"):
-                return embedding_layer.decoder
-            else:
-                return None
+    def _get_word_embedding_weight(model, embedding_layer):
+        embeds = getattr(embedding_layer, "weight", None)
+        if embeds is not None:
+            return embeds
+
+        embeds = getattr(embedding_layer, "decoder", None)
+        if embeds is not None:
+            return embeds
+
+        # The attributes might not exist because the model
+        # has not been built yet, so build the model and
+        # retry getting the attribute
+        model(model.dummy_inputs)
+
+        embeds = getattr(embedding_layer, "weight", None)
+        if embeds is not None:
+            return embeds
+
+        embeds = getattr(embedding_layer, "decoder", None)
+        if embeds is not None:
+            return embeds
+
+        return None

    def _resize_token_embeddings(self, new_num_tokens):
        old_embeddings = self._get_word_embedding_weight(self.get_input_embeddings())
@@ -1319,6 +1331,119 @@ class TFConv1D(tf.keras.layers.Layer):
        return x


+class WordEmbeddings(tf.keras.layers.Layer):
+    def __init__(self, vocab_size: int, hidden_size: int, initializer_range: float, **kwargs):
+        super().__init__(**kwargs)
+
+        self.vocab_size = vocab_size
+        self.hidden_size = hidden_size
+        self.initializer_range = initializer_range
+
+    def build(self, input_shape):
+        self.word_embeddings = self.add_weight(
+            name="weight",
+            shape=[self.vocab_size, self.hidden_size],
+            initializer=get_initializer(initializer_range=self.initializer_range),
+        )
+
+        super().build(input_shape=input_shape)
+
+    def get_config(self):
+        config = {
+            "vocab_size": self.vocab_size,
+            "hidden_size": self.hidden_size,
+            "initializer_range": self.initializer_range,
+        }
+        base_config = super().get_config()
+
+        return dict(list(base_config.items()) + list(config.items()))
+
+    def call(self, input_ids):
+        flat_input_ids = tf.reshape(tensor=input_ids, shape=[-1])
+        embeddings = tf.gather(params=self.word_embeddings, indices=flat_input_ids)
+        embeddings = tf.reshape(
+            tensor=embeddings, shape=tf.concat(values=[shape_list(tensor=input_ids), [self.hidden_size]], axis=0)
+        )
+
+        embeddings.set_shape(shape=input_ids.shape.as_list() + [self.hidden_size])
+
+        return embeddings
+
+
+class TokenTypeEmbeddings(tf.keras.layers.Layer):
+    def __init__(self, type_vocab_size: int, hidden_size: int, initializer_range: float, **kwargs):
+        super().__init__(**kwargs)
+
+        self.type_vocab_size = type_vocab_size
+        self.hidden_size = hidden_size
+        self.initializer_range = initializer_range
+
+    def build(self, input_shape):
+        self.token_type_embeddings = self.add_weight(
+            name="embeddings",
+            shape=[self.type_vocab_size, self.hidden_size],
+            initializer=get_initializer(initializer_range=self.initializer_range),
+        )
+
+        super().build(input_shape=input_shape)
+
+    def get_config(self):
+        config = {
+            "type_vocab_size": self.type_vocab_size,
+            "hidden_size": self.hidden_size,
+            "initializer_range": self.initializer_range,
+        }
+        base_config = super().get_config()
+
+        return dict(list(base_config.items()) + list(config.items()))
+
+    def call(self, token_type_ids):
+        flat_token_type_ids = tf.reshape(tensor=token_type_ids, shape=[-1])
+        one_hot_data = tf.one_hot(indices=flat_token_type_ids, depth=self.type_vocab_size, dtype=self._compute_dtype)
+        embeddings = tf.matmul(a=one_hot_data, b=self.token_type_embeddings)
+        embeddings = tf.reshape(
+            tensor=embeddings, shape=tf.concat(values=[shape_list(tensor=token_type_ids), [self.hidden_size]], axis=0)
+        )
+
+        embeddings.set_shape(shape=token_type_ids.shape.as_list() + [self.hidden_size])
+
+        return embeddings
+
+
+class PositionEmbeddings(tf.keras.layers.Layer):
+    def __init__(self, max_position_embeddings: int, hidden_size: int, initializer_range: float, **kwargs):
+        super().__init__(**kwargs)
+
+        self.max_position_embeddings = max_position_embeddings
+        self.hidden_size = hidden_size
+        self.initializer_range = initializer_range
+
+    def build(self, input_shape):
+        self.position_embeddings = self.add_weight(
+            name="embeddings",
+            shape=[self.max_position_embeddings, self.hidden_size],
+            initializer=get_initializer(initializer_range=self.initializer_range),
+        )
+
+        super().build(input_shape)
+
+    def get_config(self):
+        config = {
+            "max_position_embeddings": self.max_position_embeddings,
+            "hidden_size": self.hidden_size,
+            "initializer_range": self.initializer_range,
+        }
+        base_config = super().get_config()
+
+        return dict(list(base_config.items()) + list(config.items()))
+
+    def call(self, position_ids):
+        input_shape = shape_list(tensor=position_ids)
+        position_embeddings = self.position_embeddings[: input_shape[1], :]
+
+        return tf.broadcast_to(input=position_embeddings, shape=input_shape)
+
+
 class TFSharedEmbeddings(tf.keras.layers.Layer):
    r"""
    Construct shared token embeddings.
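Note on the `signature.pop("self", None)` line added to `input_processing` above: when `inspect.signature` receives a plain function taken from the class rather than a bound method, `self` appears among the parameters, so it has to be dropped before the names are treated as model inputs. The sketch below shows the effect using only the standard library; `ToyModel` and `parameter_names` are hypothetical stand-ins, not library code.

```python
import inspect


class ToyModel:
    # Hypothetical stand-in for a model's `call` method.
    def call(self, input_ids=None, attention_mask=None, training=False, **kwargs):
        return input_ids


def parameter_names(func):
    """Collect parameter names the way the patched code does, minus `kwargs` and `self`."""
    signature = dict(inspect.signature(func).parameters)
    signature.pop("kwargs", None)
    signature.pop("self", None)
    return list(signature.keys())


# Unbound function: `self` is part of the raw signature and gets removed.
raw_names = list(inspect.signature(ToyModel.call).parameters)
unbound_names = parameter_names(ToyModel.call)

# Bound method: `inspect.signature` already omits `self`, so the pop is a no-op.
bound_names = parameter_names(ToyModel().call)
```

Passing `None` as the default to `pop` keeps the call safe in both cases.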
--- a/src/transformers/models/__init__.py
+++ b/src/transformers/models/__init__.py
@@ -28,6 +28,7 @@ from . import (
    blenderbot,
    blenderbot_small,
    camembert,
+    convbert,
    ctrl,
    deberta,
    dialogpt,
@@ -62,6 +63,7 @@ from . import (
    t5,
    tapas,
    transfo_xl,
+    wav2vec2,
    xlm,
    xlm_roberta,
    xlnet,
--- a/src/transformers/models/albert/convert_albert_original_tf_checkpoint_to_pytorch.py
+++ b/src/transformers/models/albert/convert_albert_original_tf_checkpoint_to_pytorch.py
@@ -19,8 +19,8 @@ import argparse

 import torch

-from ...utils import logging
-from . import AlbertConfig, AlbertForPreTraining, load_tf_weights_in_albert
+from transformers import AlbertConfig, AlbertForPreTraining, load_tf_weights_in_albert
+from transformers.utils import logging


 logging.set_verbosity_info()
--- a/src/transformers/models/albert/modeling_tf_albert.py
+++ b/src/transformers/models/albert/modeling_tf_albert.py
@@ -17,7 +17,7 @@


 from dataclasses import dataclass
-from typing import Optional, Tuple
+from typing import Any, Dict, Optional, Tuple

 import tensorflow as tf

@@ -73,124 +73,185 @@ TF_ALBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [
 ]


+# Copied from transformers.models.bert.modeling_tf_bert.TFBertWordEmbeddings
+class TFAlbertWordEmbeddings(tf.keras.layers.Layer):
+    def __init__(self, vocab_size: int, hidden_size: int, initializer_range: float, **kwargs):
+        super().__init__(**kwargs)
+
+        self.vocab_size = vocab_size
+        self.hidden_size = hidden_size
+        self.initializer_range = initializer_range
+
+    def build(self, input_shape: tf.TensorShape):
+        self.weight = self.add_weight(
+            name="weight",
+            shape=[self.vocab_size, self.hidden_size],
+            initializer=get_initializer(self.initializer_range),
+        )
+
+        super().build(input_shape)
+
+    def get_config(self) -> Dict[str, Any]:
+        config = {
+            "vocab_size": self.vocab_size,
+            "hidden_size": self.hidden_size,
+            "initializer_range": self.initializer_range,
+        }
+        base_config = super().get_config()
+
+        return dict(list(base_config.items()) + list(config.items()))
+
+    def call(self, input_ids: tf.Tensor) -> tf.Tensor:
+        flat_input_ids = tf.reshape(tensor=input_ids, shape=[-1])
+        embeddings = tf.gather(params=self.weight, indices=flat_input_ids)
+        embeddings = tf.reshape(
+            tensor=embeddings, shape=tf.concat(values=[shape_list(input_ids), [self.hidden_size]], axis=0)
+        )
+
+        embeddings.set_shape(input_ids.shape.as_list() + [self.hidden_size])
+
+        return embeddings
+
+
+# Copied from transformers.models.bert.modeling_tf_bert.TFBertTokenTypeEmbeddings
+class TFAlbertTokenTypeEmbeddings(tf.keras.layers.Layer):
+    def __init__(self, type_vocab_size: int, hidden_size: int, initializer_range: float, **kwargs):
+        super().__init__(**kwargs)
+
+        self.type_vocab_size = type_vocab_size
+        self.hidden_size = hidden_size
+        self.initializer_range = initializer_range
+
+    def build(self, input_shape: tf.TensorShape):
+        self.token_type_embeddings = self.add_weight(
+            name="embeddings",
+            shape=[self.type_vocab_size, self.hidden_size],
+            initializer=get_initializer(self.initializer_range),
+        )
+
+        super().build(input_shape)
+
+    def get_config(self) -> Dict[str, Any]:
+        config = {
+            "type_vocab_size": self.type_vocab_size,
+            "hidden_size": self.hidden_size,
+            "initializer_range": self.initializer_range,
+        }
+        base_config = super().get_config()
+
+        return dict(list(base_config.items()) + list(config.items()))
+
+    def call(self, token_type_ids: tf.Tensor) -> tf.Tensor:
+        flat_token_type_ids = tf.reshape(tensor=token_type_ids, shape=[-1])
+        one_hot_data = tf.one_hot(indices=flat_token_type_ids, depth=self.type_vocab_size, dtype=self._compute_dtype)
+        embeddings = tf.matmul(a=one_hot_data, b=self.token_type_embeddings)
+        embeddings = tf.reshape(
+            tensor=embeddings, shape=tf.concat(values=[shape_list(token_type_ids), [self.hidden_size]], axis=0)
+        )
+
+        embeddings.set_shape(token_type_ids.shape.as_list() + [self.hidden_size])
+
+        return embeddings
+
+
+# Copied from transformers.models.bert.modeling_tf_bert.TFBertPositionEmbeddings
+class TFAlbertPositionEmbeddings(tf.keras.layers.Layer):
+    def __init__(self, max_position_embeddings: int, hidden_size: int, initializer_range: float, **kwargs):
+        super().__init__(**kwargs)
+
+        self.max_position_embeddings = max_position_embeddings
+        self.hidden_size = hidden_size
+        self.initializer_range = initializer_range
+
+    def build(self, input_shape: tf.TensorShape):
+        self.position_embeddings = self.add_weight(
+            name="embeddings",
+            shape=[self.max_position_embeddings, self.hidden_size],
+            initializer=get_initializer(self.initializer_range),
+        )
+
+        super().build(input_shape)
+
+    def get_config(self) -> Dict[str, Any]:
+        config = {
+            "max_position_embeddings": self.max_position_embeddings,
+            "hidden_size": self.hidden_size,
+            "initializer_range": self.initializer_range,
+        }
+        base_config = super().get_config()
+
+        return dict(list(base_config.items()) + list(config.items()))
+
+    def call(self, position_ids: tf.Tensor) -> tf.Tensor:
+        input_shape = shape_list(position_ids)
+        position_embeddings = self.position_embeddings[: input_shape[1], :]
+
+        return tf.broadcast_to(input=position_embeddings, shape=input_shape)
+
+
 class TFAlbertEmbeddings(tf.keras.layers.Layer):
    """Construct the embeddings from word, position and token_type embeddings."""

    def __init__(self, config, **kwargs):
        super().__init__(**kwargs)

-        self.vocab_size = config.vocab_size
-        self.embedding_size = config.embedding_size
-        self.initializer_range = config.initializer_range
-        self.max_position_embeddings = config.max_position_embeddings
-        self.type_vocab_size = config.type_vocab_size
-        self.layer_norm_eps = config.layer_norm_eps
-        self.hidden_dropout_prob = config.hidden_dropout_prob
-
-        self.position_embeddings = tf.keras.layers.Embedding(
-            self.max_position_embeddings,
-            self.embedding_size,
-            embeddings_initializer=get_initializer(self.initializer_range),
+        self.word_embeddings = TFAlbertWordEmbeddings(
+            vocab_size=config.vocab_size,
+            hidden_size=config.embedding_size,
+            initializer_range=config.initializer_range,
+            name="word_embeddings",
+        )
+        self.position_embeddings = TFAlbertPositionEmbeddings(
+            max_position_embeddings=config.max_position_embeddings,
+            hidden_size=config.embedding_size,
+            initializer_range=config.initializer_range,
            name="position_embeddings",
        )
-        self.token_type_embeddings = tf.keras.layers.Embedding(
-            self.type_vocab_size,
-            self.embedding_size,
-            embeddings_initializer=get_initializer(self.initializer_range),
+        self.token_type_embeddings = TFAlbertTokenTypeEmbeddings(
+            type_vocab_size=config.type_vocab_size,
+            hidden_size=config.embedding_size,
+            initializer_range=config.initializer_range,
            name="token_type_embeddings",
        )
+        self.embeddings_sum = tf.keras.layers.Add()
+        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
+        self.dropout = tf.keras.layers.Dropout(rate=config.hidden_dropout_prob)

-        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
-        # any TensorFlow checkpoint file
-        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=self.layer_norm_eps, name="LayerNorm")
-        self.dropout = tf.keras.layers.Dropout(self.hidden_dropout_prob)
-
-    def build(self, input_shape):
-        """Build shared word embedding layer """
-        with tf.name_scope("word_embeddings"):
-            # Create and initialize weights. The random normal initializer was chosen
-            # arbitrarily, and works well.
-            self.word_embeddings = self.add_weight(
-                "weight",
-                shape=[self.vocab_size, self.embedding_size],
-                initializer=get_initializer(self.initializer_range),
-            )
-        super().build(input_shape)
-
+    # Copied from transformers.models.bert.modeling_tf_bert.TFBertEmbeddings.call
    def call(
        self,
-        input_ids=None,
-        position_ids=None,
-        token_type_ids=None,
-        inputs_embeds=None,
-        mode="embedding",
-        training=False,
-    ):
+        input_ids: tf.Tensor,
+        position_ids: tf.Tensor,
+        token_type_ids: tf.Tensor,
+        inputs_embeds: tf.Tensor,
+        training: bool = False,
+    ) -> tf.Tensor:
        """
-        Get token embeddings of inputs
-
-        Args:
-            inputs: list of three int64 tensors with shape [batch_size, length]: (input_ids, position_ids, token_type_ids)
-            mode: string, a valid value is one of "embedding" and "linear"
+        Applies embedding based on inputs tensor.

        Returns:
-            outputs: (1) If mode == "embedding", output embedding tensor, float32 with shape [batch_size, length,
-            embedding_size]; (2) mode == "linear", output linear tensor, float32 with shape [batch_size, length,
-            vocab_size]
-
-        Raises:
-            ValueError: if mode is not valid.
-
-        Shared weights logic adapted from
-        https://github.com/tensorflow/models/blob/a009f4fb9d2fc4949e32192a944688925ef78659/official/transformer/v2/embedding_layer.py#L24
+            final_embeddings (:obj:`tf.Tensor`): output embedding tensor.
        """
-        if mode == "embedding":
-            return self._embedding(input_ids, position_ids, token_type_ids, inputs_embeds, training=training)
-        elif mode == "linear":
-            return self._linear(input_ids)
-        else:
-            raise ValueError("mode {} is not valid.".format(mode))
-
-    def _embedding(self, input_ids, position_ids, token_type_ids, inputs_embeds, training=False):
-        """Applies embedding based on inputs tensor."""
        assert not (input_ids is None and inputs_embeds is None)

        if input_ids is not None:
-            input_shape = shape_list(input_ids)
-        else:
-            input_shape = shape_list(inputs_embeds)[:-1]
+            inputs_embeds = self.word_embeddings(input_ids=input_ids)

-        seq_length = input_shape[1]
-        if position_ids is None:
-            position_ids = tf.range(seq_length, dtype=tf.int32)[tf.newaxis, :]
        if token_type_ids is None:
-            token_type_ids = tf.fill(input_shape, 0)
+            input_shape = shape_list(tensor=inputs_embeds)[:-1]
+            token_type_ids = tf.fill(dims=input_shape, value=0)

-        if inputs_embeds is None:
-            inputs_embeds = tf.gather(self.word_embeddings, input_ids)
-        position_embeddings = self.position_embeddings(position_ids)
-        token_type_embeddings = self.token_type_embeddings(token_type_ids)
+        if position_ids is None:
+            position_embeds = self.position_embeddings(position_ids=inputs_embeds)
+        else:
+            position_embeds = self.position_embeddings(position_ids=position_ids)

-        embeddings = inputs_embeds + position_embeddings + token_type_embeddings
-        embeddings = self.LayerNorm(embeddings)
-        embeddings = self.dropout(embeddings, training=training)
-        return embeddings
+        token_type_embeds = self.token_type_embeddings(token_type_ids=token_type_ids)
+        final_embeddings = self.embeddings_sum(inputs=[inputs_embeds, position_embeds, token_type_embeds])
+        final_embeddings = self.LayerNorm(inputs=final_embeddings)
+        final_embeddings = self.dropout(inputs=final_embeddings, training=training)

-    def _linear(self, inputs):
-        """
-        Computes logits by running inputs through a linear layer
-
-        Args:
-            inputs: A float32 tensor with shape [batch_size, length, embedding_size
-
-        Returns:
-            float32 tensor with shape [batch_size, length, vocab_size].
-        """
-        batch_size = shape_list(inputs)[0]
-        length = shape_list(inputs)[1]
-        x = tf.reshape(inputs, [-1, self.embedding_size])
-        logits = tf.matmul(x, self.word_embeddings, transpose_b=True)
-        return tf.reshape(logits, [batch_size, length, self.vocab_size])
+        return final_embeddings


 class TFAlbertSelfOutput(tf.keras.layers.Layer):
@@ -446,8 +507,9 @@ class TFAlbertPreTrainedModel(TFPreTrainedModel):
 class TFAlbertMLMHead(tf.keras.layers.Layer):
    def __init__(self, config, input_embeddings, **kwargs):
        super().__init__(**kwargs)
-        self.vocab_size = config.vocab_size

+        self.vocab_size = config.vocab_size
+        self.embedding_size = config.embedding_size
        self.dense = tf.keras.layers.Dense(
            config.embedding_size, kernel_initializer=get_initializer(config.initializer_range), name="dense"
        )
@@ -474,7 +536,7 @@ class TFAlbertMLMHead(tf.keras.layers.Layer):
        return self.decoder

    def set_output_embeddings(self, value):
-        self.decoder.word_embeddings = value
+        self.decoder.weight = value
        self.decoder.vocab_size = shape_list(value)[0]

    def get_bias(self):
@@ -486,10 +548,15 @@ class TFAlbertMLMHead(tf.keras.layers.Layer):
        self.vocab_size = shape_list(value["bias"])[0]

    def call(self, hidden_states):
-        hidden_states = self.dense(hidden_states)
+        hidden_states = self.dense(inputs=hidden_states)
        hidden_states = self.activation(hidden_states)
-        hidden_states = self.LayerNorm(hidden_states)
-        hidden_states = self.decoder(hidden_states, mode="linear") + self.decoder_bias
+        hidden_states = self.LayerNorm(inputs=hidden_states)
+        seq_length = shape_list(tensor=hidden_states)[1]
+        hidden_states = tf.reshape(tensor=hidden_states, shape=[-1, self.embedding_size])
+        hidden_states = tf.matmul(a=hidden_states, b=self.decoder.weight, transpose_b=True)
+        hidden_states = tf.reshape(tensor=hidden_states, shape=[-1, seq_length, self.vocab_size])
+        hidden_states = tf.nn.bias_add(value=hidden_states, bias=self.decoder_bias)
+
        return hidden_states


@@ -516,11 +583,11 @@ class TFAlbertMainLayer(tf.keras.layers.Layer):
        )

    def get_input_embeddings(self):
-        return self.embeddings
+        return self.embeddings.word_embeddings

    def set_input_embeddings(self, value):
-        self.embeddings.word_embeddings = value
-        self.embeddings.vocab_size = shape_list(value)[0]
+        self.embeddings.word_embeddings.weight = value
+        self.embeddings.word_embeddings.vocab_size = shape_list(value)[0]

    def _prune_heads(self, heads_to_prune):
        """
@@ -742,12 +809,15 @@ ALBERT_INPUTS_DOCSTRING = r"""
            vectors than the model's internal embedding lookup matrix.
        output_attentions (:obj:`bool`, `optional`):
            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
-            tensors for more detail.
+            tensors for more detail. This argument can be used only in eager mode; in graph mode the value in the
+            config will be used instead.
        output_hidden_states (:obj:`bool`, `optional`):
            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
-            more detail.
+            more detail. This argument can be used only in eager mode; in graph mode the value in the config will be
+            used instead.
        return_dict (:obj:`bool`, `optional`):
-            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple. This
+            argument can be used in eager mode; in graph mode the value will always be set to True.
        training (:obj:`bool`, `optional`, defaults to :obj:`False`):
            Whether or not to use the model in training mode (some modules like dropout modules have different
            behaviors between training and evaluation).
@@ -816,7 +886,7 @@ class TFAlbertModel(TFAlbertPreTrainedModel):
        return outputs

    # Copied from transformers.models.bert.modeling_tf_bert.TFBertModel.serving_output
-    def serving_output(self, output):
+    def serving_output(self, output: TFBaseModelOutputWithPooling) -> TFBaseModelOutputWithPooling:
        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None

@@ -844,7 +914,7 @@ class TFAlbertForPreTraining(TFAlbertPreTrainedModel):
        self.num_labels = config.num_labels

        self.albert = TFAlbertMainLayer(config, name="albert")
-        self.predictions = TFAlbertMLMHead(config, self.albert.embeddings, name="predictions")
+        self.predictions = TFAlbertMLMHead(config, self.albert.embeddings.word_embeddings, name="predictions")
        self.sop_classifier = TFAlbertSOPHead(config, name="sop_classifier")

    def get_lm_head(self):
@@ -964,7 +1034,7 @@ class TFAlbertForMaskedLM(TFAlbertPreTrainedModel, TFMaskedLanguageModelingLoss)
        super().__init__(config, *inputs, **kwargs)

        self.albert = TFAlbertMainLayer(config, add_pooling_layer=False, name="albert")
-        self.predictions = TFAlbertMLMHead(config, self.albert.embeddings, name="predictions")
+        self.predictions = TFAlbertMLMHead(config, self.albert.embeddings.word_embeddings, name="predictions")

    def get_lm_head(self):
        return self.predictions
@@ -1042,7 +1112,7 @@ class TFAlbertForMaskedLM(TFAlbertPreTrainedModel, TFMaskedLanguageModelingLoss)
        )

    # Copied from transformers.models.bert.modeling_tf_bert.TFBertForMaskedLM.serving_output
-    def serving_output(self, output):
+    def serving_output(self, output: TFMaskedLMOutput) -> TFMaskedLMOutput:
        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None

@@ -1145,7 +1215,7 @@ class TFAlbertForSequenceClassification(TFAlbertPreTrainedModel, TFSequenceClass
        )

    # Copied from transformers.models.bert.modeling_tf_bert.TFBertForSequenceClassification.serving_output
-    def serving_output(self, output):
+    def serving_output(self, output: TFSequenceClassifierOutput) -> TFSequenceClassifierOutput:
        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None

@@ -1247,7 +1317,7 @@ class TFAlbertForTokenClassification(TFAlbertPreTrainedModel, TFTokenClassificat
        )

    # Copied from transformers.models.bert.modeling_tf_bert.TFBertForTokenClassification.serving_output
-    def serving_output(self, output):
+    def serving_output(self, output: TFTokenClassifierOutput) -> TFTokenClassifierOutput:
        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None

@@ -1362,7 +1432,7 @@ class TFAlbertForQuestionAnswering(TFAlbertPreTrainedModel, TFQuestionAnsweringL
        )

    # Copied from transformers.models.bert.modeling_tf_bert.TFBertForQuestionAnswering.serving_output
-    def serving_output(self, output):
+    def serving_output(self, output: TFQuestionAnsweringModelOutput) -> TFQuestionAnsweringModelOutput:
        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None

@@ -1509,13 +1579,14 @@ class TFAlbertForMultipleChoice(TFAlbertPreTrainedModel, TFMultipleChoiceLoss):
            }
        ]
    )
-    def serving(self, inputs):
-        output = self.call(inputs)
+    # Copied from transformers.models.bert.modeling_tf_bert.TFBertForMultipleChoice.serving
+    def serving(self, inputs: Dict[str, tf.Tensor]):
+        output = self.call(input_ids=inputs)

        return self.serving_output(output)

    # Copied from transformers.models.bert.modeling_tf_bert.TFBertForMultipleChoice.serving_output
-    def serving_output(self, output):
+    def serving_output(self, output: TFMultipleChoiceModelOutput) -> TFMultipleChoiceModelOutput:
        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None

--- a/src/transformers/models/albert/tokenization_albert.py
+++ b/src/transformers/models/albert/tokenization_albert.py
@@ -22,7 +22,7 @@ from typing import List, Optional, Tuple

 import sentencepiece as spm

-from ...tokenization_utils import PreTrainedTokenizer
+from ...tokenization_utils import AddedToken, PreTrainedTokenizer
 from ...utils import logging


@@ -103,8 +103,9 @@ class AlbertTokenizer(PreTrainedTokenizer):
            The token used for masking values. This is the token used when training this model with masked language
            modeling. This is the token which the model will try to predict.

-    Attributes: sp_model (:obj:`SentencePieceProcessor`): The `SentencePiece` processor that is used for every
-    conversion (string, tokens and IDs).
+    Attributes:
+        sp_model (:obj:`SentencePieceProcessor`):
+            The `SentencePiece` processor that is used for every conversion (string, tokens and IDs).
    """

    vocab_files_names = VOCAB_FILES_NAMES
@@ -126,6 +127,9 @@ class AlbertTokenizer(PreTrainedTokenizer):
        mask_token="[MASK]",
        **kwargs
    ):
+        # The mask token behaves like a normal word, i.e. it includes the space before it
+        mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
+
        super().__init__(
            do_lower_case=do_lower_case,
            remove_space=remove_space,
--- a/src/transformers/models/albert/tokenization_albert_fast.py
+++ b/src/transformers/models/albert/tokenization_albert_fast.py
@@ -20,6 +20,7 @@ from shutil import copyfile
 from typing import List, Optional, Tuple

 from ...file_utils import is_sentencepiece_available
+from ...tokenization_utils import AddedToken
 from ...tokenization_utils_fast import PreTrainedTokenizerFast
 from ...utils import logging

@@ -88,8 +89,11 @@ class AlbertTokenizerFast(PreTrainedTokenizerFast):
            Whether or not to keep accents when tokenizing.
        bos_token (:obj:`str`, `optional`, defaults to :obj:`"[CLS]"`):
            The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
-            .. note:: When building a sequence using special tokens, this is not the token that is used for the
-            beginning of sequence. The token used is the :obj:`cls_token`.
+
+            .. note::
+
+               When building a sequence using special tokens, this is not the token that is used for the beginning of
+               sequence. The token used is the :obj:`cls_token`.
        eos_token (:obj:`str`, `optional`, defaults to :obj:`"[SEP]"`):
            The end of sequence token. .. note:: When building a sequence using special tokens, this is not the token
            that is used for the end of sequence. The token used is the :obj:`sep_token`.
@@ -107,9 +111,7 @@ class AlbertTokenizerFast(PreTrainedTokenizerFast):
            instead of per-token classification). It is the first token of the sequence when built with special tokens.
        mask_token (:obj:`str`, `optional`, defaults to :obj:`"[MASK]"`):
            The token used for masking values. This is the token used when training this model with masked language
-            modeling. This is the token which the model will try to predict. Attributes:
-        sp_model (:obj:`SentencePieceProcessor`):
-            The `SentencePiece` processor that is used for every conversion (string, tokens and IDs).
+            modeling. This is the token which the model will try to predict.
    """

    vocab_files_names = VOCAB_FILES_NAMES
@@ -133,6 +135,9 @@ class AlbertTokenizerFast(PreTrainedTokenizerFast):
        mask_token="[MASK]",
        **kwargs
    ):
+        # The mask token behaves like a normal word, i.e. it includes the space before it
+        mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
+
        super().__init__(
            vocab_file,
            tokenizer_file=tokenizer_file,
--- a/src/transformers/models/auto/configuration_auto.py
+++ b/src/transformers/models/auto/configuration_auto.py
@@ -28,6 +28,7 @@ from ..blenderbot_small.configuration_blenderbot_small import (
    BlenderbotSmallConfig,
 )
 from ..camembert.configuration_camembert import CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, CamembertConfig
+from ..convbert.configuration_convbert import CONVBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, ConvBertConfig
 from ..ctrl.configuration_ctrl import CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP, CTRLConfig
 from ..deberta.configuration_deberta import DEBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, DebertaConfig
 from ..distilbert.configuration_distilbert import DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, DistilBertConfig
@@ -58,6 +59,7 @@ from ..squeezebert.configuration_squeezebert import SQUEEZEBERT_PRETRAINED_CONFI
 from ..t5.configuration_t5 import T5_PRETRAINED_CONFIG_ARCHIVE_MAP, T5Config
 from ..tapas.configuration_tapas import TAPAS_PRETRAINED_CONFIG_ARCHIVE_MAP, TapasConfig
 from ..transfo_xl.configuration_transfo_xl import TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP, TransfoXLConfig
+from ..wav2vec2.configuration_wav2vec2 import WAV_2_VEC_2_PRETRAINED_CONFIG_ARCHIVE_MAP, Wav2Vec2Config
 from ..xlm.configuration_xlm import XLM_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMConfig
 from ..xlm_prophetnet.configuration_xlm_prophetnet import (
    XLM_PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP,
@@ -71,6 +73,8 @@ ALL_PRETRAINED_CONFIG_ARCHIVE_MAP = dict(
    (key, value)
    for pretrained_map in [
        # Add archive maps here
+        WAV_2_VEC_2_PRETRAINED_CONFIG_ARCHIVE_MAP,
+        CONVBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
        LED_PRETRAINED_CONFIG_ARCHIVE_MAP,
        BLENDERBOT_SMALL_PRETRAINED_CONFIG_ARCHIVE_MAP,
        BERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
@@ -112,6 +116,8 @@ ALL_PRETRAINED_CONFIG_ARCHIVE_MAP = dict(
 CONFIG_MAPPING = OrderedDict(
    [
        # Add configs here
+        ("wav2vec2", Wav2Vec2Config),
+        ("convbert", ConvBertConfig),
        ("led", LEDConfig),
        ("blenderbot-small", BlenderbotSmallConfig),
        ("retribert", RetriBertConfig),
@@ -159,6 +165,8 @@ CONFIG_MAPPING = OrderedDict(
 MODEL_NAMES_MAPPING = OrderedDict(
    [
        # Add full (and cased) model names here
+        ("wav2vec2", "Wav2Vec2"),
+        ("convbert", "ConvBERT"),
        ("led", "LED"),
        ("blenderbot-small", "BlenderbotSmall"),
        ("retribert", "RetriBERT"),
--- a/src/transformers/models/auto/modeling_auto.py
+++ b/src/transformers/models/auto/modeling_auto.py
@@ -33,6 +33,7 @@ from ..albert.modeling_albert import (
    AlbertModel,
 )
 from ..bart.modeling_bart import (
+    BartForCausalLM,
    BartForConditionalGeneration,
    BartForQuestionAnswering,
    BartForSequenceClassification,
@@ -50,8 +51,12 @@ from ..bert.modeling_bert import (
    BertModel,
 )
 from ..bert_generation.modeling_bert_generation import BertGenerationDecoder, BertGenerationEncoder
-from ..blenderbot.modeling_blenderbot import BlenderbotForConditionalGeneration, BlenderbotModel
-from ..blenderbot_small.modeling_blenderbot_small import BlenderbotSmallForConditionalGeneration, BlenderbotSmallModel
+from ..blenderbot.modeling_blenderbot import BlenderbotForCausalLM, BlenderbotForConditionalGeneration, BlenderbotModel
+from ..blenderbot_small.modeling_blenderbot_small import (
+    BlenderbotSmallForCausalLM,
+    BlenderbotSmallForConditionalGeneration,
+    BlenderbotSmallModel,
+)
 from ..camembert.modeling_camembert import (
    CamembertForCausalLM,
    CamembertForMaskedLM,
@@ -61,8 +66,24 @@ from ..camembert.modeling_camembert import (
    CamembertForTokenClassification,
    CamembertModel,
 )
+
+# Add modeling imports here
+from ..convbert.modeling_convbert import (
+    ConvBertForMaskedLM,
+    ConvBertForMultipleChoice,
+    ConvBertForQuestionAnswering,
+    ConvBertForSequenceClassification,
+    ConvBertForTokenClassification,
+    ConvBertModel,
+)
 from ..ctrl.modeling_ctrl import CTRLForSequenceClassification, CTRLLMHeadModel, CTRLModel
-from ..deberta.modeling_deberta import DebertaForSequenceClassification, DebertaModel
+from ..deberta.modeling_deberta import (
+    DebertaForMaskedLM,
+    DebertaForQuestionAnswering,
+    DebertaForSequenceClassification,
+    DebertaForTokenClassification,
+    DebertaModel,
+)
 from ..distilbert.modeling_distilbert import (
    DistilBertForMaskedLM,
    DistilBertForMultipleChoice,
@@ -122,8 +143,9 @@ from ..longformer.modeling_longformer import (
    LongformerModel,
 )
 from ..lxmert.modeling_lxmert import LxmertForPreTraining, LxmertForQuestionAnswering, LxmertModel
-from ..marian.modeling_marian import MarianModel, MarianMTModel
+from ..marian.modeling_marian import MarianForCausalLM, MarianModel, MarianMTModel
 from ..mbart.modeling_mbart import (
+    MBartForCausalLM,
    MBartForConditionalGeneration,
    MBartForQuestionAnswering,
    MBartForSequenceClassification,
@@ -149,7 +171,7 @@ from ..mpnet.modeling_mpnet import (
 )
 from ..mt5.modeling_mt5 import MT5ForConditionalGeneration, MT5Model
 from ..openai.modeling_openai import OpenAIGPTForSequenceClassification, OpenAIGPTLMHeadModel, OpenAIGPTModel
-from ..pegasus.modeling_pegasus import PegasusForConditionalGeneration, PegasusModel
+from ..pegasus.modeling_pegasus import PegasusForCausalLM, PegasusForConditionalGeneration, PegasusModel
 from ..prophetnet.modeling_prophetnet import ProphetNetForCausalLM, ProphetNetForConditionalGeneration, ProphetNetModel
 from ..rag.modeling_rag import (  # noqa: F401 - need to import all RagModels to be in globals() function
    RagModel,
@@ -189,6 +211,7 @@ from ..tapas.modeling_tapas import (
    TapasModel,
 )
 from ..transfo_xl.modeling_transfo_xl import TransfoXLForSequenceClassification, TransfoXLLMHeadModel, TransfoXLModel
+from ..wav2vec2.modeling_wav2vec2 import Wav2Vec2ForMaskedLM, Wav2Vec2Model
 from ..xlm.modeling_xlm import (
    XLMForMultipleChoice,
    XLMForQuestionAnsweringSimple,
@@ -228,6 +251,7 @@ from .configuration_auto import (
    BlenderbotConfig,
    BlenderbotSmallConfig,
    CamembertConfig,
+    ConvBertConfig,
    CTRLConfig,
    DebertaConfig,
    DistilBertConfig,
@@ -257,6 +281,7 @@ from .configuration_auto import (
    T5Config,
    TapasConfig,
    TransfoXLConfig,
+    Wav2Vec2Config,
    XLMConfig,
    XLMProphetNetConfig,
    XLMRobertaConfig,
@@ -271,6 +296,8 @@ logger = logging.get_logger(__name__)
 MODEL_MAPPING = OrderedDict(
    [
        # Base model mapping
+        (Wav2Vec2Config, Wav2Vec2Model),
+        (ConvBertConfig, ConvBertModel),
        (LEDConfig, LEDModel),
        (BlenderbotSmallConfig, BlenderbotSmallModel),
        (RetriBertConfig, RetriBertModel),
@@ -349,6 +376,8 @@ MODEL_FOR_PRETRAINING_MAPPING = OrderedDict(
 MODEL_WITH_LM_HEAD_MAPPING = OrderedDict(
    [
        # Model with LM heads mapping
+        (Wav2Vec2Config, Wav2Vec2ForMaskedLM),
+        (ConvBertConfig, ConvBertForMaskedLM),
        (LEDConfig, LEDForConditionalGeneration),
        (BlenderbotSmallConfig, BlenderbotSmallForConditionalGeneration),
        (LayoutLMConfig, LayoutLMForMaskedLM),
@@ -378,6 +407,7 @@ MODEL_WITH_LM_HEAD_MAPPING = OrderedDict(
        (FunnelConfig, FunnelForMaskedLM),
        (MPNetConfig, MPNetForMaskedLM),
        (TapasConfig, TapasForMaskedLM),
+        (DebertaConfig, DebertaForMaskedLM),
    ]
 )

@@ -401,16 +431,25 @@ MODEL_FOR_CAUSAL_LM_MAPPING = OrderedDict(
        (BertGenerationConfig, BertGenerationDecoder),
        (XLMProphetNetConfig, XLMProphetNetForCausalLM),
        (ProphetNetConfig, ProphetNetForCausalLM),
+        (BartConfig, BartForCausalLM),
+        (MBartConfig, MBartForCausalLM),
+        (PegasusConfig, PegasusForCausalLM),
+        (MarianConfig, MarianForCausalLM),
+        (BlenderbotConfig, BlenderbotForCausalLM),
+        (BlenderbotSmallConfig, BlenderbotSmallForCausalLM),
    ]
 )

 MODEL_FOR_MASKED_LM_MAPPING = OrderedDict(
    [
        # Model for Masked LM mapping
+        (Wav2Vec2Config, Wav2Vec2ForMaskedLM),
+        (ConvBertConfig, ConvBertForMaskedLM),
        (LayoutLMConfig, LayoutLMForMaskedLM),
        (DistilBertConfig, DistilBertForMaskedLM),
        (AlbertConfig, AlbertForMaskedLM),
        (BartConfig, BartForConditionalGeneration),
+        (MBartConfig, MBartForConditionalGeneration),
        (CamembertConfig, CamembertForMaskedLM),
        (XLMRobertaConfig, XLMRobertaForMaskedLM),
        (LongformerConfig, LongformerForMaskedLM),
@@ -425,6 +464,7 @@ MODEL_FOR_MASKED_LM_MAPPING = OrderedDict(
        (FunnelConfig, FunnelForMaskedLM),
        (MPNetConfig, MPNetForMaskedLM),
        (TapasConfig, TapasForMaskedLM),
+        (DebertaConfig, DebertaForMaskedLM),
    ]
 )

@@ -450,6 +490,7 @@ MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING = OrderedDict(
 MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING = OrderedDict(
    [
        # Model for Sequence Classification mapping
+        (ConvBertConfig, ConvBertForSequenceClassification),
        (LEDConfig, LEDForSequenceClassification),
        (DistilBertConfig, DistilBertForSequenceClassification),
        (AlbertConfig, AlbertForSequenceClassification),
@@ -482,6 +523,7 @@ MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING = OrderedDict(
 MODEL_FOR_QUESTION_ANSWERING_MAPPING = OrderedDict(
    [
        # Model for Question Answering mapping
+        (ConvBertConfig, ConvBertForQuestionAnswering),
        (LEDConfig, LEDForQuestionAnswering),
        (DistilBertConfig, DistilBertForQuestionAnswering),
        (AlbertConfig, AlbertForQuestionAnswering),
@@ -502,6 +544,7 @@ MODEL_FOR_QUESTION_ANSWERING_MAPPING = OrderedDict(
        (FunnelConfig, FunnelForQuestionAnswering),
        (LxmertConfig, LxmertForQuestionAnswering),
        (MPNetConfig, MPNetForQuestionAnswering),
+        (DebertaConfig, DebertaForQuestionAnswering),
    ]
 )

@@ -515,6 +558,7 @@ MODEL_FOR_TABLE_QUESTION_ANSWERING_MAPPING = OrderedDict(
 MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING = OrderedDict(
    [
        # Model for Token Classification mapping
+        (ConvBertConfig, ConvBertForTokenClassification),
        (LayoutLMConfig, LayoutLMForTokenClassification),
        (DistilBertConfig, DistilBertForTokenClassification),
        (CamembertConfig, CamembertForTokenClassification),
@@ -532,12 +576,14 @@ MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING = OrderedDict(
        (FlaubertConfig, FlaubertForTokenClassification),
        (FunnelConfig, FunnelForTokenClassification),
        (MPNetConfig, MPNetForTokenClassification),
+        (DebertaConfig, DebertaForTokenClassification),
    ]
 )

 MODEL_FOR_MULTIPLE_CHOICE_MAPPING = OrderedDict(
    [
        # Model for Multiple Choice mapping
+        (ConvBertConfig, ConvBertForMultipleChoice),
        (CamembertConfig, CamembertForMultipleChoice),
        (ElectraConfig, ElectraForMultipleChoice),
        (XLMRobertaConfig, XLMRobertaForMultipleChoice),
@@ -649,7 +695,7 @@ class AutoModel:
    r"""
    This is a generic model class that will be instantiated as one of the base model classes of the library when
    created with the :meth:`~transformers.AutoModel.from_pretrained` class method or the
-    :meth:`~transformers.AutoModel.from_config` class methods.
+    :meth:`~transformers.AutoModel.from_config` class method.

    This class cannot be instantiated directly using ``__init__()`` (throws an error).
    """
@@ -739,7 +785,7 @@ class AutoModel:
 class AutoModelForPreTraining:
    r"""
    This is a generic model class that will be instantiated as one of the model classes of the library---with the
-    architecture used for pretraining this model---when created with the when created with the
+    architecture used for pretraining this model---when created with the
    :meth:`~transformers.AutoModelForPreTraining.from_pretrained` class method or the
    :meth:`~transformers.AutoModelForPreTraining.from_config` class method.

@@ -833,9 +879,8 @@ class AutoModelForPreTraining:
 class AutoModelWithLMHead:
    r"""
    This is a generic model class that will be instantiated as one of the model classes of the library---with a
-    language modeling head---when created with the when created with the
-    :meth:`~transformers.AutoModelWithLMHead.from_pretrained` class method or the
-    :meth:`~transformers.AutoModelWithLMHead.from_config` class method.
+    language modeling head---when created with the :meth:`~transformers.AutoModelWithLMHead.from_pretrained` class
+    method or the :meth:`~transformers.AutoModelWithLMHead.from_config` class method.

    This class cannot be instantiated directly using ``__init__()`` (throws an error).

@@ -945,9 +990,8 @@ class AutoModelWithLMHead:
 class AutoModelForCausalLM:
    r"""
    This is a generic model class that will be instantiated as one of the model classes of the library---with a causal
-    language modeling head---when created with the when created with the
-    :meth:`~transformers.AutoModelForCausalLM.from_pretrained` class method or the
-    :meth:`~transformers.AutoModelForCausalLM.from_config` class method.
+    language modeling head---when created with the :meth:`~transformers.AutoModelForCausalLM.from_pretrained` class
+    method or the :meth:`~transformers.AutoModelForCausalLM.from_config` class method.

    This class cannot be instantiated directly using ``__init__()`` (throws an error).
    """
@@ -1039,9 +1083,8 @@ class AutoModelForCausalLM:
 class AutoModelForMaskedLM:
    r"""
    This is a generic model class that will be instantiated as one of the model classes of the library---with a masked
-    language modeling head---when created with the when created with the
-    :meth:`~transformers.AutoModelForMaskedLM.from_pretrained` class method or the
-    :meth:`~transformers.AutoModelForMaskedLM.from_config` class method.
+    language modeling head---when created with the :meth:`~transformers.AutoModelForMaskedLM.from_pretrained` class
+    method or the :meth:`~transformers.AutoModelForMaskedLM.from_config` class method.

    This class cannot be instantiated directly using ``__init__()`` (throws an error).
    """
@@ -1133,7 +1176,7 @@ class AutoModelForMaskedLM:
 class AutoModelForSeq2SeqLM:
    r"""
    This is a generic model class that will be instantiated as one of the model classes of the library---with a
-    sequence-to-sequence language modeling head---when created with the when created with the
+    sequence-to-sequence language modeling head---when created with the
    :meth:`~transformers.AutoModelForSeq2SeqLM.from_pretrained` class method or the
    :meth:`~transformers.AutoModelForSeq2SeqLM.from_config` class method.

@@ -1231,7 +1274,7 @@ class AutoModelForSeq2SeqLM:
 class AutoModelForSequenceClassification:
    r"""
    This is a generic model class that will be instantiated as one of the model classes of the library---with a
-    sequence classification head---when created with the when created with the
+    sequence classification head---when created with the
    :meth:`~transformers.AutoModelForSequenceClassification.from_pretrained` class method or the
    :meth:`~transformers.AutoModelForSequenceClassification.from_config` class method.

@@ -1329,9 +1372,8 @@ class AutoModelForSequenceClassification:
 class AutoModelForQuestionAnswering:
    r"""
    This is a generic model class that will be instantiated as one of the model classes of the library---with a
-    question answering head---when created with the when created with the
-    :meth:`~transformers.AutoModeForQuestionAnswering.from_pretrained` class method or the
-    :meth:`~transformers.AutoModelForQuestionAnswering.from_config` class method.
+    question answering head---when created with the :meth:`~transformers.AutoModelForQuestionAnswering.from_pretrained`
+    class method or the :meth:`~transformers.AutoModelForQuestionAnswering.from_config` class method.

    This class cannot be instantiated directly using ``__init__()`` (throws an error).
    """
@@ -1428,7 +1470,7 @@ class AutoModelForQuestionAnswering:
 class AutoModelForTableQuestionAnswering:
    r"""
    This is a generic model class that will be instantiated as one of the model classes of the library---with a table
-    question answering head---when created with the when created with the
+    question answering head---when created with the
    :meth:`~transformers.AutoModeForTableQuestionAnswering.from_pretrained` class method or the
    :meth:`~transformers.AutoModelForTableQuestionAnswering.from_config` class method.

@@ -1528,9 +1570,8 @@ class AutoModelForTableQuestionAnswering:
 class AutoModelForTokenClassification:
    r"""
    This is a generic model class that will be instantiated as one of the model classes of the library---with a token
-    classification head---when created with the when created with the
-    :meth:`~transformers.AutoModelForTokenClassification.from_pretrained` class method or the
-    :meth:`~transformers.AutoModelForTokenClassification.from_config` class method.
+    classification head---when created with the :meth:`~transformers.AutoModelForTokenClassification.from_pretrained`
+    class method or the :meth:`~transformers.AutoModelForTokenClassification.from_config` class method.

    This class cannot be instantiated directly using ``__init__()`` (throws an error).
    """
@@ -1627,7 +1668,7 @@ class AutoModelForTokenClassification:
 class AutoModelForMultipleChoice:
    r"""
    This is a generic model class that will be instantiated as one of the model classes of the library---with a
-    multiple choice classification head---when created with the when created with the
+    multiple choice classification head---when created with the
    :meth:`~transformers.AutoModelForMultipleChoice.from_pretrained` class method or the
    :meth:`~transformers.AutoModelForMultipleChoice.from_config` class method.

@@ -1726,8 +1767,8 @@ class AutoModelForMultipleChoice:

 class AutoModelForNextSentencePrediction:
    r"""
-    This is a generic model class that will be instantiated as one of the model classes of the library---with a
-    multiple choice classification head---when created with the when created with the
+    This is a generic model class that will be instantiated as one of the model classes of the library---with a next
+    sentence prediction head---when created with the
    :meth:`~transformers.AutoModelForNextSentencePrediction.from_pretrained` class method or the
    :meth:`~transformers.AutoModelForNextSentencePrediction.from_config` class method.

--- a/src/transformers/models/auto/modeling_tf_auto.py
+++ b/src/transformers/models/auto/modeling_tf_auto.py
@@ -57,6 +57,14 @@ from ..camembert.modeling_tf_camembert import (
    TFCamembertForTokenClassification,
    TFCamembertModel,
 )
+from ..convbert.modeling_tf_convbert import (
+    TFConvBertForMaskedLM,
+    TFConvBertForMultipleChoice,
+    TFConvBertForQuestionAnswering,
+    TFConvBertForSequenceClassification,
+    TFConvBertForTokenClassification,
+    TFConvBertModel,
+)
 from ..ctrl.modeling_tf_ctrl import TFCTRLForSequenceClassification, TFCTRLLMHeadModel, TFCTRLModel
 from ..distilbert.modeling_tf_distilbert import (
    TFDistilBertForMaskedLM,
@@ -173,6 +181,7 @@ from .configuration_auto import (
    BlenderbotConfig,
    BlenderbotSmallConfig,
    CamembertConfig,
+    ConvBertConfig,
    CTRLConfig,
    DistilBertConfig,
    DPRConfig,
@@ -206,6 +215,7 @@ logger = logging.get_logger(__name__)
 TF_MODEL_MAPPING = OrderedDict(
    [
        # Base model mapping
+        (ConvBertConfig, TFConvBertModel),
        (LEDConfig, TFLEDModel),
        (LxmertConfig, TFLxmertModel),
        (MT5Config, TFMT5Model),
@@ -268,6 +278,7 @@ TF_MODEL_FOR_PRETRAINING_MAPPING = OrderedDict(
 TF_MODEL_WITH_LM_HEAD_MAPPING = OrderedDict(
    [
        # Model with LM heads mapping
+        (ConvBertConfig, TFConvBertForMaskedLM),
        (LEDConfig, TFLEDForConditionalGeneration),
        (T5Config, TFT5ForConditionalGeneration),
        (DistilBertConfig, TFDistilBertForMaskedLM),
@@ -312,6 +323,7 @@ TF_MODEL_FOR_CAUSAL_LM_MAPPING = OrderedDict(
 TF_MODEL_FOR_MASKED_LM_MAPPING = OrderedDict(
    [
        # Model for Masked LM mapping
+        (ConvBertConfig, TFConvBertForMaskedLM),
        (DistilBertConfig, TFDistilBertForMaskedLM),
        (AlbertConfig, TFAlbertForMaskedLM),
        (CamembertConfig, TFCamembertForMaskedLM),
@@ -347,6 +359,7 @@ TF_MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING = OrderedDict(
 TF_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING = OrderedDict(
    [
        # Model for Sequence Classification mapping
+        (ConvBertConfig, TFConvBertForSequenceClassification),
        (DistilBertConfig, TFDistilBertForSequenceClassification),
        (AlbertConfig, TFAlbertForSequenceClassification),
        (CamembertConfig, TFCamembertForSequenceClassification),
@@ -371,6 +384,7 @@ TF_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING = OrderedDict(
 TF_MODEL_FOR_QUESTION_ANSWERING_MAPPING = OrderedDict(
    [
        # Model for Question Answering mapping
+        (ConvBertConfig, TFConvBertForQuestionAnswering),
        (DistilBertConfig, TFDistilBertForQuestionAnswering),
        (AlbertConfig, TFAlbertForQuestionAnswering),
        (CamembertConfig, TFCamembertForQuestionAnswering),
@@ -391,6 +405,7 @@ TF_MODEL_FOR_QUESTION_ANSWERING_MAPPING = OrderedDict(
 TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING = OrderedDict(
    [
        # Model for Token Classification mapping
+        (ConvBertConfig, TFConvBertForTokenClassification),
        (DistilBertConfig, TFDistilBertForTokenClassification),
        (AlbertConfig, TFAlbertForTokenClassification),
        (CamembertConfig, TFCamembertForTokenClassification),
@@ -411,6 +426,7 @@ TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING = OrderedDict(
 TF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING = OrderedDict(
    [
        # Model for Multiple Choice mapping
+        (ConvBertConfig, TFConvBertForMultipleChoice),
        (CamembertConfig, TFCamembertForMultipleChoice),
        (XLMConfig, TFXLMForMultipleChoice),
        (XLMRobertaConfig, TFXLMRobertaForMultipleChoice),
@@ -523,7 +539,7 @@ class TFAutoModel(object):
    r"""
    This is a generic model class that will be instantiated as one of the base model classes of the library when
    created with the when created with the :meth:`~transformers.TFAutoModel.from_pretrained` class method or the
-    :meth:`~transformers.TFAutoModel.from_config` class methods.
+    :meth:`~transformers.TFAutoModel.from_config` class method.

    This class cannot be instantiated directly using ``__init__()`` (throws an error).
    """
@@ -613,7 +629,7 @@ class TFAutoModel(object):
 class TFAutoModelForPreTraining(object):
    r"""
    This is a generic model class that will be instantiated as one of the model classes of the library---with the
-    architecture used for pretraining this model---when created with the when created with the
+    architecture used for pretraining this model---when created with the
    :meth:`~transformers.TFAutoModelForPreTraining.from_pretrained` class method or the
    :meth:`~transformers.TFAutoModelForPreTraining.from_config` class method.

@@ -707,9 +723,8 @@ class TFAutoModelForPreTraining(object):
 class TFAutoModelWithLMHead(object):
    r"""
    This is a generic model class that will be instantiated as one of the model classes of the library---with a
-    language modeling head---when created with the when created with the
-    :meth:`~transformers.TFAutoModelWithLMHead.from_pretrained` class method or the
-    :meth:`~transformers.TFAutoModelWithLMHead.from_config` class method.
+    language modeling head---when created with the :meth:`~transformers.TFAutoModelWithLMHead.from_pretrained` class
+    method or the :meth:`~transformers.TFAutoModelWithLMHead.from_config` class method.

    This class cannot be instantiated directly using ``__init__()`` (throws an error).

@@ -820,9 +835,8 @@ class TFAutoModelWithLMHead(object):
 class TFAutoModelForCausalLM:
    r"""
    This is a generic model class that will be instantiated as one of the model classes of the library---with a causal
-    language modeling head---when created with the when created with the
-    :meth:`~transformers.TFAutoModelForCausalLM.from_pretrained` class method or the
-    :meth:`~transformers.TFAutoModelForCausalLM.from_config` class method.
+    language modeling head---when created with the :meth:`~transformers.TFAutoModelForCausalLM.from_pretrained` class
+    method or the :meth:`~transformers.TFAutoModelForCausalLM.from_config` class method.

    This class cannot be instantiated directly using ``__init__()`` (throws an error).
    """
@@ -914,9 +928,8 @@ class TFAutoModelForCausalLM:
 class TFAutoModelForMaskedLM:
    r"""
    This is a generic model class that will be instantiated as one of the model classes of the library---with a masked
-    language modeling head---when created with the when created with the
-    :meth:`~transformers.TFAutoModelForMaskedLM.from_pretrained` class method or the
-    :meth:`~transformers.TFAutoModelForMaskedLM.from_config` class method.
+    language modeling head---when created with the :meth:`~transformers.TFAutoModelForMaskedLM.from_pretrained` class
+    method or the :meth:`~transformers.TFAutoModelForMaskedLM.from_config` class method.

    This class cannot be instantiated directly using ``__init__()`` (throws an error).
    """
@@ -1008,7 +1021,7 @@ class TFAutoModelForMaskedLM:
 class TFAutoModelForSeq2SeqLM:
    r"""
    This is a generic model class that will be instantiated as one of the model classes of the library---with a
-    sequence-to-sequence language modeling head---when created with the when created with the
+    sequence-to-sequence language modeling head---when created with the
    :meth:`~transformers.TFAutoModelForSeq2SeqLM.from_pretrained` class method or the
    :meth:`~transformers.TFAutoModelForSeq2SeqLM.from_config` class method.

@@ -1106,7 +1119,7 @@ class TFAutoModelForSeq2SeqLM:
 class TFAutoModelForSequenceClassification(object):
    r"""
    This is a generic model class that will be instantiated as one of the model classes of the library---with a
-    sequence classification head---when created with the when created with the
+    sequence classification head---when created with the
    :meth:`~transformers.TFAutoModelForSequenceClassification.from_pretrained` class method or the
    :meth:`~transformers.TFAutoModelForSequenceClassification.from_config` class method.

@@ -1204,7 +1217,7 @@ class TFAutoModelForSequenceClassification(object):
 class TFAutoModelForQuestionAnswering(object):
    r"""
    This is a generic model class that will be instantiated as one of the model classes of the library---with a
-    question answering head---when created with the when created with the
+    question answering head---when created with the
    :meth:`~transformers.TFAutoModeForQuestionAnswering.from_pretrained` class method or the
    :meth:`~transformers.TFAutoModelForQuestionAnswering.from_config` class method.

@@ -1301,9 +1314,8 @@ class TFAutoModelForQuestionAnswering(object):
 class TFAutoModelForTokenClassification:
    r"""
    This is a generic model class that will be instantiated as one of the model classes of the library---with a token
-    classification head---when created with the when created with the
-    :meth:`~transformers.TFAutoModelForTokenClassification.from_pretrained` class method or the
-    :meth:`~transformers.TFAutoModelForTokenClassification.from_config` class method.
+    classification head---when created with the :meth:`~transformers.TFAutoModelForTokenClassification.from_pretrained`
+    class method or the :meth:`~transformers.TFAutoModelForTokenClassification.from_config` class method.

    This class cannot be instantiated directly using ``__init__()`` (throws an error).
    """
@@ -1398,7 +1410,7 @@ class TFAutoModelForTokenClassification:
 class TFAutoModelForMultipleChoice:
    r"""
    This is a generic model class that will be instantiated as one of the model classes of the library---with a
-    multiple choice classification head---when created with the when created with the
+    multiple choice classification head---when created with the
    :meth:`~transformers.TFAutoModelForMultipleChoice.from_pretrained` class method or the
    :meth:`~transformers.TFAutoModelForMultipleChoice.from_config` class method.

@@ -1495,8 +1507,8 @@ class TFAutoModelForMultipleChoice:

 class TFAutoModelForNextSentencePrediction:
    r"""
-    This is a generic model class that will be instantiated as one of the model classes of the library---with a
-    multiple choice classification head---when created with the when created with the
+    This is a generic model class that will be instantiated as one of the model classes of the library---with a next
+    sentence prediction head---when created with the
    :meth:`~transformers.TFAutoModelForNextSentencePrediction.from_pretrained` class method or the
    :meth:`~transformers.TFAutoModelForNextSentencePrediction.from_config` class method.

--- a/src/transformers/models/auto/tokenization_auto.py
+++ b/src/transformers/models/auto/tokenization_auto.py
@@ -26,6 +26,7 @@ from ..bert_japanese.tokenization_bert_japanese import BertJapaneseTokenizer
 from ..bertweet.tokenization_bertweet import BertweetTokenizer
 from ..blenderbot.tokenization_blenderbot import BlenderbotTokenizer
 from ..blenderbot_small.tokenization_blenderbot_small import BlenderbotSmallTokenizer
+from ..convbert.tokenization_convbert import ConvBertTokenizer
 from ..ctrl.tokenization_ctrl import CTRLTokenizer
 from ..deberta.tokenization_deberta import DebertaTokenizer
 from ..distilbert.tokenization_distilbert import DistilBertTokenizer
@@ -51,6 +52,7 @@ from ..roberta.tokenization_roberta import RobertaTokenizer
 from ..squeezebert.tokenization_squeezebert import SqueezeBertTokenizer
 from ..tapas.tokenization_tapas import TapasTokenizer
 from ..transfo_xl.tokenization_transfo_xl import TransfoXLTokenizer
+from ..wav2vec2.tokenization_wav2vec2 import Wav2Vec2Tokenizer
 from ..xlm.tokenization_xlm import XLMTokenizer
 from .configuration_auto import (
    AlbertConfig,
@@ -61,6 +63,7 @@ from .configuration_auto import (
    BlenderbotConfig,
    BlenderbotSmallConfig,
    CamembertConfig,
+    ConvBertConfig,
    CTRLConfig,
    DebertaConfig,
    DistilBertConfig,
@@ -91,6 +94,7 @@ from .configuration_auto import (
    T5Config,
    TapasConfig,
    TransfoXLConfig,
+    Wav2Vec2Config,
    XLMConfig,
    XLMProphetNetConfig,
    XLMRobertaConfig,
@@ -134,6 +138,7 @@ if is_tokenizers_available():
    from ..barthez.tokenization_barthez_fast import BarthezTokenizerFast
    from ..bert.tokenization_bert_fast import BertTokenizerFast
    from ..camembert.tokenization_camembert_fast import CamembertTokenizerFast
+    from ..convbert.tokenization_convbert_fast import ConvBertTokenizerFast
    from ..distilbert.tokenization_distilbert_fast import DistilBertTokenizerFast
    from ..dpr.tokenization_dpr_fast import DPRQuestionEncoderTokenizerFast
    from ..electra.tokenization_electra_fast import ElectraTokenizerFast
@@ -163,6 +168,7 @@ else:
    BarthezTokenizerFast = None
    BertTokenizerFast = None
    CamembertTokenizerFast = None
+    ConvBertTokenizerFast = None
    DistilBertTokenizerFast = None
    DPRQuestionEncoderTokenizerFast = None
    ElectraTokenizerFast = None
@@ -170,6 +176,7 @@ else:
    GPT2TokenizerFast = None
    HerbertTokenizerFast = None
    LayoutLMTokenizerFast = None
+    LEDTokenizerFast = None
    LongformerTokenizerFast = None
    LxmertTokenizerFast = None
    MBartTokenizerFast = None
@@ -232,6 +239,8 @@ TOKENIZER_MAPPING = OrderedDict(
        (MPNetConfig, (MPNetTokenizer, MPNetTokenizerFast)),
        (TapasConfig, (TapasTokenizer, None)),
        (LEDConfig, (LEDTokenizer, LEDTokenizerFast)),
+        (ConvBertConfig, (ConvBertTokenizer, ConvBertTokenizerFast)),
+        (Wav2Vec2Config, (Wav2Vec2Tokenizer, None)),
    ]
 )

@@ -243,6 +252,7 @@ NO_CONFIG_TOKENIZER = [
    HerbertTokenizerFast,
    PhobertTokenizer,
    BarthezTokenizer,
+    BarthezTokenizerFast,
 ]


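The `TOKENIZER_MAPPING` entries added above pair each config class with a `(slow, fast)` tokenizer tuple, where `None` marks a missing fast variant (as for `Wav2Vec2Tokenizer`). As a minimal sketch of how such a mapping can be consulted, assuming empty stand-in classes and a hypothetical helper `pick_tokenizer_class` that is not part of the library:

```python
from collections import OrderedDict


# Hypothetical stand-ins for the real config and tokenizer classes.
class ConvBertConfig: ...
class Wav2Vec2Config: ...
class ConvBertTokenizer: ...
class ConvBertTokenizerFast: ...
class Wav2Vec2Tokenizer: ...


# Same shape as the entries added in the diff: config -> (slow, fast or None).
TOKENIZER_MAPPING = OrderedDict(
    [
        (ConvBertConfig, (ConvBertTokenizer, ConvBertTokenizerFast)),
        (Wav2Vec2Config, (Wav2Vec2Tokenizer, None)),
    ]
)


def pick_tokenizer_class(config, use_fast=True):
    """Return the fast class when requested and available, else the slow one."""
    slow_cls, fast_cls = TOKENIZER_MAPPING[type(config)]
    if use_fast and fast_cls is not None:
        return fast_cls
    return slow_cls
```

With this sketch, a `Wav2Vec2Config` instance resolves to the slow tokenizer even when a fast one is requested, because its fast slot is `None`.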
--- a/src/transformers/models/bart/__init__.py
+++ b/src/transformers/models/bart/__init__.py
@@ -31,6 +31,7 @@ if is_tokenizers_available():
 if is_torch_available():
    _import_structure["modeling_bart"] = [
        "BART_PRETRAINED_MODEL_ARCHIVE_LIST",
+        "BartForCausalLM",
        "BartForConditionalGeneration",
        "BartForQuestionAnswering",
        "BartForSequenceClassification",
@@ -53,6 +54,7 @@ if TYPE_CHECKING:
    if is_torch_available():
        from .modeling_bart import (
            BART_PRETRAINED_MODEL_ARCHIVE_LIST,
+            BartForCausalLM,
            BartForConditionalGeneration,
            BartForQuestionAnswering,
            BartForSequenceClassification,
--- a/src/transformers/models/bart/convert_bart_original_pytorch_checkpoint_to_pytorch.py
+++ b/src/transformers/models/bart/convert_bart_original_pytorch_checkpoint_to_pytorch.py
@@ -22,10 +22,16 @@ from pathlib import Path
 import fairseq
 import torch
 from packaging import version
+from torch import nn

-from ...utils import logging
-from . import BartConfig, BartForConditionalGeneration, BartForSequenceClassification, BartModel, BartTokenizer
-from .modeling_bart import _make_linear_from_emb
+from transformers import (
+    BartConfig,
+    BartForConditionalGeneration,
+    BartForSequenceClassification,
+    BartModel,
+    BartTokenizer,
+)
+from transformers.utils import logging


 FAIRSEQ_MODELS = ["bart.large", "bart.large.mnli", "bart.large.cnn", "bart_xsum/model.pt"]
@@ -72,6 +78,13 @@ def load_xsum_checkpoint(checkpoint_path):
    return hub_interface


+def make_linear_from_emb(emb):
+    vocab_size, emb_size = emb.weight.shape
+    lin_layer = nn.Linear(vocab_size, emb_size, bias=False)
+    lin_layer.weight.data = emb.weight.data
+    return lin_layer
+
+
 @torch.no_grad()
 def convert_bart_checkpoint(checkpoint_path, pytorch_dump_folder_path, hf_checkpoint_name=None):
    """
@@ -113,7 +126,7 @@ def convert_bart_checkpoint(checkpoint_path, pytorch_dump_folder_path, hf_checkp
            model = BartForConditionalGeneration(config).eval()  # an existing summarization ckpt
            model.model.load_state_dict(state_dict)
            if hasattr(model, "lm_head"):
-                model.lm_head = _make_linear_from_emb(model.model.shared)
+                model.lm_head = make_linear_from_emb(model.model.shared)
            new_model_outputs = model.model(tokens)[0]

    # Check results
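The `make_linear_from_emb` helper added to the conversion script builds an output projection whose weight is the embedding matrix itself, so hidden states are scored against every vocabulary row. A pure-Python sketch of that idea, assuming nested lists in place of tensors and a hypothetical name `make_projection_from_embedding`:

```python
def make_projection_from_embedding(embedding):
    """Return a function mapping a hidden vector to vocabulary logits.

    `embedding` is a list of `vocab_size` rows, each of length `emb_size`.
    The rows are reused as the projection weights (weight tying), so no
    separate output matrix is stored.
    """

    def project(hidden):
        # One dot product per vocabulary row.
        return [sum(w * h for w, h in zip(row, hidden)) for row in embedding]

    return project


embedding = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # vocab_size=3, emb_size=2
lm_head = make_projection_from_embedding(embedding)
logits = lm_head([2.0, 3.0])
```

Because the projection holds a reference to the same rows, a later update to the embedding is reflected in the logits as well, which is the point of tying the two.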
--- a/src/transformers/models/bart/modeling_bart.py
+++ b/src/transformers/models/bart/modeling_bart.py
@@ -13,8 +13,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """ PyTorch BART model. """
-
-
+import copy
 import math
 import random
 import warnings
@@ -22,6 +21,7 @@ from typing import Optional, Tuple

 import torch
 import torch.nn.functional as F
+import torch.utils.checkpoint
 from torch import nn
 from torch.nn import CrossEntropyLoss

@@ -36,6 +36,7 @@ from ...file_utils import (
 from ...modeling_outputs import (
    BaseModelOutput,
    BaseModelOutputWithPastAndCrossAttentions,
+    CausalLMOutputWithCrossAttentions,
    Seq2SeqLMOutput,
    Seq2SeqModelOutput,
    Seq2SeqQuestionAnsweringModelOutput,
@@ -159,6 +160,7 @@ class BartAttention(nn.Module):
        key_value_states: Optional[torch.Tensor] = None,
        past_key_value: Optional[Tuple[torch.Tensor]] = None,
        attention_mask: Optional[torch.Tensor] = None,
+        layer_head_mask: Optional[torch.Tensor] = None,
        output_attentions: bool = False,
    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
        """Input shape: Batch x Time x Channel"""
@@ -226,6 +228,13 @@ class BartAttention(nn.Module):

        attn_weights = F.softmax(attn_weights, dim=-1)

+        if layer_head_mask is not None:
+            assert layer_head_mask.size() == (
+                self.num_heads,
+            ), f"Head mask for a single layer should be of size {(self.num_heads,)}, but is {layer_head_mask.size()}"
+            attn_weights = layer_head_mask.view(1, -1, 1, 1) * attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
+            attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)
+
        if output_attentions:
             # this operation is a bit awkward, but it's required to
            # make sure that attn_weights keeps its gradient.
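The `layer_head_mask` block in the hunk above scales the softmaxed attention weights of each head by one mask value, after checking that the mask has exactly one entry per head. A minimal sketch of the same step for a single batch element, assuming nested lists of shape `[num_heads][tgt_len][src_len]` and a hypothetical helper name:

```python
def apply_layer_head_mask(attn_weights, layer_head_mask):
    """Scale every weight of head i by layer_head_mask[i].

    attn_weights: nested list of shape [num_heads][tgt_len][src_len].
    layer_head_mask: list of num_heads floats (1.0 keeps a head, 0.0 nullifies it).
    """
    num_heads = len(attn_weights)
    if len(layer_head_mask) != num_heads:
        raise ValueError(
            f"Head mask for a single layer should have {num_heads} entries, "
            f"but has {len(layer_head_mask)}"
        )
    return [
        [[mask_value * weight for weight in row] for row in head]
        for mask_value, head in zip(layer_head_mask, attn_weights)
    ]


weights = [[[0.5, 0.5]], [[0.25, 0.75]]]  # 2 heads, tgt_len=1, src_len=2
masked = apply_layer_head_mask(weights, [1.0, 0.0])
```

Here the second head contributes nothing to the attention output, while the first head is left untouched.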
@@ -274,19 +283,30 @@ class BartEncoderLayer(nn.Module):
        self.fc2 = nn.Linear(config.encoder_ffn_dim, self.embed_dim)
        self.final_layer_norm = nn.LayerNorm(self.embed_dim)

-    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor, output_attentions: bool = False):
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: torch.Tensor,
+        layer_head_mask: torch.Tensor,
+        output_attentions: bool = False,
+    ):
        """
        Args:
            hidden_states (:obj:`torch.FloatTensor`): input to the layer of shape `(seq_len, batch, embed_dim)`
            attention_mask (:obj:`torch.FloatTensor`): attention mask of size
                `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
+            layer_head_mask (:obj:`torch.FloatTensor`): mask for attention heads in a given layer of size
+                `(config.encoder_attention_heads,)`.
            output_attentions (:obj:`bool`, `optional`):
                Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under
                returned tensors for more detail.
        """
        residual = hidden_states
        hidden_states, attn_weights, _ = self.self_attn(
-            hidden_states=hidden_states, attention_mask=attention_mask, output_attentions=output_attentions
+            hidden_states=hidden_states,
+            attention_mask=attention_mask,
+            layer_head_mask=layer_head_mask,
+            output_attentions=output_attentions,
        )
        hidden_states = F.dropout(hidden_states, p=self.dropout, training=self.training)
        hidden_states = residual + hidden_states
@@ -345,6 +365,8 @@ class BartDecoderLayer(nn.Module):
        attention_mask: Optional[torch.Tensor] = None,
        encoder_hidden_states: Optional[torch.Tensor] = None,
        encoder_attention_mask: Optional[torch.Tensor] = None,
+        layer_head_mask: Optional[torch.Tensor] = None,
+        encoder_layer_head_mask: Optional[torch.Tensor] = None,
        past_key_value: Optional[Tuple[torch.Tensor]] = None,
        output_attentions: Optional[bool] = False,
        use_cache: Optional[bool] = True,
@@ -357,6 +379,10 @@ class BartDecoderLayer(nn.Module):
            encoder_hidden_states (:obj:`torch.FloatTensor`): cross attention input to the layer of shape `(seq_len, batch, embed_dim)`
            encoder_attention_mask (:obj:`torch.FloatTensor`): encoder attention mask of size
                `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
+            layer_head_mask (:obj:`torch.FloatTensor`): mask for attention heads in a given layer of size
+                `(config.decoder_attention_heads,)`.
+            encoder_layer_head_mask (:obj:`torch.FloatTensor`): mask for encoder attention heads in a given layer of
+                size `(config.encoder_attention_heads,)`.
            past_key_value (:obj:`Tuple(torch.FloatTensor)`): cached past key and value projection states
            output_attentions (:obj:`bool`, `optional`):
                Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under
@@ -372,6 +398,7 @@ class BartDecoderLayer(nn.Module):
            hidden_states=hidden_states,
            past_key_value=self_attn_past_key_value,
            attention_mask=attention_mask,
+            layer_head_mask=layer_head_mask,
            output_attentions=output_attentions,
        )
        hidden_states = F.dropout(hidden_states, p=self.dropout, training=self.training)
@@ -390,6 +417,7 @@ class BartDecoderLayer(nn.Module):
                hidden_states=hidden_states,
                key_value_states=encoder_hidden_states,
                attention_mask=encoder_attention_mask,
+                layer_head_mask=encoder_layer_head_mask,
                past_key_value=cross_attn_past_key_value,
                output_attentions=output_attentions,
            )
@@ -567,6 +595,18 @@ BART_INPUTS_DOCSTRING = r"""
            If you want to change padding behavior, you should read :func:`modeling_bart._prepare_decoder_inputs` and
            modify to your needs. See diagram 1 in `the paper <https://arxiv.org/abs/1910.13461>`__ for more
            information on the default strategy.
+        head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
+            Mask to nullify selected heads of the attention modules in the encoder. Mask values selected in ``[0, 1]``:
+
+            - 1 indicates the head is **not masked**,
+            - 0 indicates the head is **masked**.
+
+        decoder_head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
+            Mask to nullify selected heads of the attention modules in the decoder. Mask values selected in ``[0, 1]``:
+
+            - 1 indicates the head is **not masked**,
+            - 0 indicates the head is **masked**.
+
        encoder_outputs (:obj:`tuple(tuple(torch.FloatTensor)`, `optional`):
            Tuple consists of (:obj:`last_hidden_state`, `optional`: :obj:`hidden_states`, `optional`:
            :obj:`attentions`) :obj:`last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)`,
@@ -644,6 +684,7 @@ class BartEncoder(BartPretrainedModel):
        self,
        input_ids=None,
        attention_mask=None,
+        head_mask=None,
        inputs_embeds=None,
        output_attentions=None,
        output_hidden_states=None,
@@ -667,6 +708,12 @@ class BartEncoder(BartPretrainedModel):
                - 0 for tokens that are **masked**.

                `What are attention masks? <../glossary.html#attention-mask>`__
+            head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
+                Mask to nullify selected heads of the attention modules. Mask values selected in ``[0, 1]``:
+
+                - 1 indicates the head is **not masked**,
+                - 0 indicates the head is **masked**.
+
            inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
                Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded
                representation. This is useful if you want more control over how to convert :obj:`input_ids` indices
@@ -713,7 +760,13 @@ class BartEncoder(BartPretrainedModel):

        encoder_states = () if output_hidden_states else None
        all_attentions = () if output_attentions else None
-        for encoder_layer in self.layers:
+
+        # check if head_mask has a correct number of layers specified if desired
+        if head_mask is not None:
+            assert head_mask.size()[0] == (
+                len(self.layers)
+            ), f"The head_mask should be specified for {len(self.layers)} layers, but it is for {head_mask.size()[0]}."
+        for idx, encoder_layer in enumerate(self.layers):
            if output_hidden_states:
                encoder_states = encoder_states + (hidden_states,)
            # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description)
@@ -721,7 +774,7 @@ class BartEncoder(BartPretrainedModel):
            if self.training and (dropout_probability < self.layerdrop):  # skip the layer
                layer_outputs = (None, None)
            else:
-                if getattr(self.config, "gradient_checkpointing", False):
+                if getattr(self.config, "gradient_checkpointing", False) and self.training:

                    def create_custom_forward(module):
                        def custom_forward(*inputs):
@@ -733,9 +786,15 @@ class BartEncoder(BartPretrainedModel):
                        create_custom_forward(encoder_layer),
                        hidden_states,
                        attention_mask,
+                        (head_mask[idx] if head_mask is not None else None),
                    )
                else:
-                    layer_outputs = encoder_layer(hidden_states, attention_mask, output_attentions=output_attentions)
+                    layer_outputs = encoder_layer(
+                        hidden_states,
+                        attention_mask,
+                        layer_head_mask=(head_mask[idx] if head_mask is not None else None),
+                        output_attentions=output_attentions,
+                    )

                hidden_states = layer_outputs[0]

@@ -784,12 +843,38 @@ class BartDecoder(BartPretrainedModel):

        self.init_weights()

+    def get_input_embeddings(self):
+        return self.embed_tokens
+
+    def set_input_embeddings(self, value):
+        self.embed_tokens = value
+
+    def _prepare_decoder_attention_mask(self, attention_mask, input_shape, inputs_embeds, past_key_values_length):
+        # create causal mask
+        # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
+        combined_attention_mask = None
+        if input_shape[-1] > 1:
+            combined_attention_mask = _make_causal_mask(
+                input_shape, inputs_embeds.dtype, past_key_values_length=past_key_values_length
+            ).to(self.device)
+
+        if attention_mask is not None:
+            # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
+            expanded_attn_mask = _expand_mask(attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1])
+            combined_attention_mask = (
+                expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask
+            )
+
+        return combined_attention_mask
+
    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
+        head_mask=None,
+        encoder_head_mask=None,
        past_key_values=None,
        inputs_embeds=None,
        use_cache=None,
@@ -826,6 +911,19 @@ class BartDecoder(BartPretrainedModel):
                - 0 for tokens that are **masked**.

                `What are attention masks? <../glossary.html#attention-mask>`__
+            head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
+                Mask to nullify selected heads of the attention modules. Mask values selected in ``[0, 1]``:
+
+                - 1 indicates the head is **not masked**,
+                - 0 indicates the head is **masked**.
+
+            encoder_head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
+                Mask to nullify selected heads of the attention modules in encoder to avoid performing cross-attention
+                on hidden heads. Mask values selected in ``[0, 1]``:
+
+                - 1 indicates the head is **not masked**,
+                - 0 indicates the head is **masked**.
+
            past_key_values (:obj:`Tuple[Tuple[torch.Tensor]]` of length :obj:`config.n_layers` with each tuple having 2 tuples each of which has 2 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
                Contains precomputed key and value hidden-states of the attention blocks. Can be used to speed up
                decoding.
@@ -871,19 +969,9 @@ class BartDecoder(BartPretrainedModel):
        if inputs_embeds is None:
            inputs_embeds = self.embed_tokens(input_ids) * self.embed_scale

-        # create causal mask
-        # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
-        combined_attention_mask = None
-        if input_shape[-1] > 1:
-            combined_attention_mask = _make_causal_mask(
-                input_shape, inputs_embeds.dtype, past_key_values_length=past_key_values_length
-            ).to(self.device)
-
-        if attention_mask is not None and combined_attention_mask is not None:
-            # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
-            combined_attention_mask = combined_attention_mask + _expand_mask(
-                attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]
-            )
+        attention_mask = self._prepare_decoder_attention_mask(
+            attention_mask, input_shape, inputs_embeds, past_key_values_length
+        )

        # expand encoder attention mask
        if encoder_hidden_states is not None and encoder_attention_mask is not None:
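The new `_prepare_decoder_attention_mask` differs from the removed inline code in one respect: the padding mask is now kept even when no causal mask is built (a single target position, as in cached decoding). A pure-Python sketch of that logic for one sequence, assuming `-inf` as the masking value where the real code uses a large negative number, and hypothetical helper names:

```python
NEG_INF = float("-inf")  # assumption: stands in for the large negative fill value


def make_causal_mask(tgt_len):
    """Additive mask: position i may attend only to positions j <= i."""
    return [[0.0 if j <= i else NEG_INF for j in range(tgt_len)] for i in range(tgt_len)]


def expand_padding_mask(attention_mask, tgt_len):
    """Turn a 1/0 keep-mask over source positions into [tgt_len][src_len] additive form."""
    row = [0.0 if keep else NEG_INF for keep in attention_mask]
    return [list(row) for _ in range(tgt_len)]


def prepare_decoder_attention_mask(attention_mask, tgt_len):
    # The causal mask is only needed when there is more than one target position.
    combined = make_causal_mask(tgt_len) if tgt_len > 1 else None
    if attention_mask is not None:
        expanded = expand_padding_mask(attention_mask, tgt_len)
        if combined is None:
            combined = expanded  # previously the padding mask was dropped here
        else:
            combined = [
                [a + b for a, b in zip(exp_row, causal_row)]
                for exp_row, causal_row in zip(expanded, combined)
            ]
    return combined
```

With a single target position and a padded source, the result is the expanded padding mask rather than `None`, which mirrors the `expanded_attn_mask if combined_attention_mask is None` branch in the diff.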
@@ -901,8 +989,14 @@ class BartDecoder(BartPretrainedModel):
        # decoder layers
        all_hidden_states = () if output_hidden_states else None
        all_self_attns = () if output_attentions else None
-        all_cross_attentions = () if output_attentions else None
+        all_cross_attentions = () if (output_attentions and encoder_hidden_states is not None) else None
        next_decoder_cache = () if use_cache else None
+
+        # check if head_mask has a correct number of layers specified if desired
+        if head_mask is not None:
+            assert head_mask.size()[0] == (
+                len(self.layers)
+            ), f"The head_mask should be specified for {len(self.layers)} layers, but it is for {head_mask.size()[0]}."
        for idx, decoder_layer in enumerate(self.layers):
            # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description)
            if output_hidden_states:
@@ -913,11 +1007,14 @@ class BartDecoder(BartPretrainedModel):

            past_key_value = past_key_values[idx] if past_key_values is not None else None

-            if getattr(self.config, "gradient_checkpointing", False):
+            if getattr(self.config, "gradient_checkpointing", False) and self.training:
+
                if use_cache:
-                    raise ValueError(
-                        "When using `gradient_checkpointing, make sure that `use_cache=False` and `config.use_cache=False`."
+                    logger.warning(
+                        "`use_cache=True` is incompatible with `config.gradient_checkpointing=True`. Setting "
+                        "`use_cache=False`..."
                    )
+                    use_cache = False

                def create_custom_forward(module):
                    def custom_forward(*inputs):
@@ -929,18 +1026,22 @@ class BartDecoder(BartPretrainedModel):
                layer_outputs = torch.utils.checkpoint.checkpoint(
                    create_custom_forward(decoder_layer),
                    hidden_states,
-                    combined_attention_mask,
+                    attention_mask,
                    encoder_hidden_states,
                    encoder_attention_mask,
+                    head_mask[idx] if head_mask is not None else None,
+                    encoder_head_mask[idx] if encoder_head_mask is not None else None,
                    None,
                )
            else:

                layer_outputs = decoder_layer(
                    hidden_states,
-                    attention_mask=combined_attention_mask,
+                    attention_mask=attention_mask,
                    encoder_hidden_states=encoder_hidden_states,
                    encoder_attention_mask=encoder_attention_mask,
+                    layer_head_mask=(head_mask[idx] if head_mask is not None else None),
+                    encoder_layer_head_mask=(encoder_head_mask[idx] if encoder_head_mask is not None else None),
                    past_key_value=past_key_value,
                    output_attentions=output_attentions,
                    use_cache=use_cache,
@@ -952,7 +1053,9 @@ class BartDecoder(BartPretrainedModel):

            if output_attentions:
                all_self_attns += (layer_outputs[1],)
-                all_cross_attentions += (layer_outputs[2],)
+
+                if encoder_hidden_states is not None:
+                    all_cross_attentions += (layer_outputs[2],)

        # add hidden states from the last decoder layer
        if output_hidden_states:
@@ -1017,6 +1120,8 @@ class BartModel(BartPretrainedModel):
        attention_mask=None,
        decoder_input_ids=None,
        decoder_attention_mask=None,
+        head_mask=None,
+        decoder_head_mask=None,
        encoder_outputs=None,
        past_key_values=None,
        inputs_embeds=None,
@@ -1045,6 +1150,7 @@ class BartModel(BartPretrainedModel):
            encoder_outputs = self.encoder(
                input_ids=input_ids,
                attention_mask=attention_mask,
+                head_mask=head_mask,
                inputs_embeds=inputs_embeds,
                output_attentions=output_attentions,
                output_hidden_states=output_hidden_states,
@@ -1064,6 +1170,8 @@ class BartModel(BartPretrainedModel):
            attention_mask=decoder_attention_mask,
            encoder_hidden_states=encoder_outputs[0],
            encoder_attention_mask=attention_mask,
+            head_mask=decoder_head_mask,
+            encoder_head_mask=head_mask,
            past_key_values=past_key_values,
            inputs_embeds=decoder_inputs_embeds,
            use_cache=use_cache,
@@ -1142,6 +1250,8 @@ class BartForConditionalGeneration(BartPretrainedModel):
        attention_mask=None,
        decoder_input_ids=None,
        decoder_attention_mask=None,
+        head_mask=None,
+        decoder_head_mask=None,
        encoder_outputs=None,
        past_key_values=None,
        inputs_embeds=None,
@@ -1174,6 +1284,8 @@ class BartForConditionalGeneration(BartPretrainedModel):
            decoder_input_ids=decoder_input_ids,
            encoder_outputs=encoder_outputs,
            decoder_attention_mask=decoder_attention_mask,
+            head_mask=head_mask,
+            decoder_head_mask=decoder_head_mask,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            decoder_inputs_embeds=decoder_inputs_embeds,
@@ -1206,7 +1318,14 @@ class BartForConditionalGeneration(BartPretrainedModel):
        )

    def prepare_inputs_for_generation(
-        self, decoder_input_ids, past=None, attention_mask=None, use_cache=None, encoder_outputs=None, **kwargs
+        self,
+        decoder_input_ids,
+        past=None,
+        attention_mask=None,
+        head_mask=None,
+        use_cache=None,
+        encoder_outputs=None,
+        **kwargs
    ):
        # cut decoder_input_ids if past is used
        if past is not None:
@@ -1218,6 +1337,7 @@ class BartForConditionalGeneration(BartPretrainedModel):
            "past_key_values": past,
            "decoder_input_ids": decoder_input_ids,
            "attention_mask": attention_mask,
+            "head_mask": head_mask,
            "use_cache": use_cache,  # change this to avoid caching (presumably for debugging)
        }

@@ -1277,6 +1397,8 @@ class BartForSequenceClassification(BartPretrainedModel):
        attention_mask=None,
        decoder_input_ids=None,
        decoder_attention_mask=None,
+        head_mask=None,
+        decoder_head_mask=None,
        encoder_outputs=None,
        inputs_embeds=None,
        decoder_inputs_embeds=None,
@@ -1305,6 +1427,8 @@ class BartForSequenceClassification(BartPretrainedModel):
            attention_mask=attention_mask,
            decoder_input_ids=decoder_input_ids,
            decoder_attention_mask=decoder_attention_mask,
+            head_mask=head_mask,
+            decoder_head_mask=decoder_head_mask,
            encoder_outputs=encoder_outputs,
            inputs_embeds=inputs_embeds,
            decoder_inputs_embeds=decoder_inputs_embeds,
@@ -1378,6 +1502,8 @@ class BartForQuestionAnswering(BartPretrainedModel):
        attention_mask=None,
        decoder_input_ids=None,
        decoder_attention_mask=None,
+        head_mask=None,
+        decoder_head_mask=None,
        encoder_outputs=None,
        start_positions=None,
        end_positions=None,
@@ -1407,6 +1533,8 @@ class BartForQuestionAnswering(BartPretrainedModel):
            attention_mask=attention_mask,
            decoder_input_ids=decoder_input_ids,
            decoder_attention_mask=decoder_attention_mask,
+            head_mask=head_mask,
+            decoder_head_mask=decoder_head_mask,
            encoder_outputs=encoder_outputs,
            inputs_embeds=inputs_embeds,
            decoder_inputs_embeds=decoder_inputs_embeds,
@@ -1459,3 +1587,208 @@ class BartForQuestionAnswering(BartPretrainedModel):
            encoder_hidden_states=outputs.encoder_hidden_states,
            encoder_attentions=outputs.encoder_attentions,
        )
+
+
+class BartDecoderWrapper(BartPretrainedModel):
+    """
+    This wrapper class is a helper class to correctly load pretrained checkpoints when the causal language model is
+    used in combination with the :class:`~transformers.EncoderDecoderModel` framework.
+    """
+
+    def __init__(self, config):
+        super().__init__(config)
+        self.decoder = BartDecoder(config)
+
+    def forward(self, *args, **kwargs):
+        return self.decoder(*args, **kwargs)
+
+
+class BartForCausalLM(BartPretrainedModel):
+    def __init__(self, config):
+        super().__init__(config)
+        config = copy.deepcopy(config)
+        config.is_decoder = True
+        config.is_encoder_decoder = False
+        self.model = BartDecoderWrapper(config)
+
+        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+
+        self.init_weights()
+
+    def get_input_embeddings(self):
+        return self.model.decoder.embed_tokens
+
+    def set_input_embeddings(self, value):
+        self.model.decoder.embed_tokens = value
+
+    def get_output_embeddings(self):
+        return self.lm_head
+
+    def set_output_embeddings(self, new_embeddings):
+        self.lm_head = new_embeddings
+
+    def set_decoder(self, decoder):
+        self.model.decoder = decoder
+
+    def get_decoder(self):
+        return self.model.decoder
+
+    @replace_return_docstrings(output_type=CausalLMOutputWithCrossAttentions, config_class=_CONFIG_FOR_DOC)
+    def forward(
+        self,
+        input_ids=None,
+        attention_mask=None,
+        encoder_hidden_states=None,
+        encoder_attention_mask=None,
+        head_mask=None,
+        encoder_head_mask=None,
+        past_key_values=None,
+        inputs_embeds=None,
+        labels=None,
+        use_cache=None,
+        output_attentions=None,
+        output_hidden_states=None,
+        return_dict=None,
+    ):
+        r"""
+        Args:
+            input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
+                Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you
+                provide it.
+
+                Indices can be obtained using :class:`~transformers.BartTokenizer`. See
+                :meth:`transformers.PreTrainedTokenizer.encode` and :meth:`transformers.PreTrainedTokenizer.__call__`
+                for details.
+
+                `What are input IDs? <../glossary.html#input-ids>`__
+            attention_mask (:obj:`torch.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+                Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``:
+
+                - 1 for tokens that are **not masked**,
+                - 0 for tokens that are **masked**.
+
+                `What are attention masks? <../glossary.html#attention-mask>`__
+            encoder_hidden_states  (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
+                Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention
+                if the model is configured as a decoder.
+            encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+                Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used
+                in the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``:
+
+                - 1 for tokens that are **not masked**,
+                - 0 for tokens that are **masked**.
+
+            head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
+                Mask to nullify selected heads of the attention modules. Mask values selected in ``[0, 1]``:
+
+                - 1 indicates the head is **not masked**,
+                - 0 indicates the head is **masked**.
+
+            encoder_head_mask (:obj:`torch.Tensor` of shape :obj:`(num_layers, num_heads)`, `optional`):
+                Mask to nullify selected heads of the attention modules in encoder to avoid performing cross-attention
+                on hidden heads. Mask values selected in ``[0, 1]``:
+
+                - 1 indicates the head is **not masked**,
+                - 0 indicates the head is **masked**.
+
+            past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
+                Contains precomputed key and value hidden-states of the attention blocks. Can be used to speed up
+                decoding.
+
+                If :obj:`past_key_values` are used, the user can optionally input only the last ``decoder_input_ids``
+                (those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)`
+                instead of all ``decoder_input_ids`` of shape :obj:`(batch_size, sequence_length)`.
+            labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+                Labels for computing the language modeling loss. Indices should either be in ``[0, ...,
+                config.vocab_size]`` or -100 (see ``input_ids`` docstring). Tokens with indices set to ``-100`` are
+                ignored (masked), the loss is only computed for the tokens with labels in ``[0, ...,
+                config.vocab_size]``.
+            use_cache (:obj:`bool`, `optional`):
+                If set to :obj:`True`, :obj:`past_key_values` key value states are returned and can be used to speed up
+                decoding (see :obj:`past_key_values`).
+            output_attentions (:obj:`bool`, `optional`):
+                Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under
+                returned tensors for more detail.
+            output_hidden_states (:obj:`bool`, `optional`):
+                Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors
+                for more detail.
+            return_dict (:obj:`bool`, `optional`):
+                Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
+
+        Returns:
+
+        Example::
+
+            >>> from transformers import BartTokenizer, BartForCausalLM
+
+            >>> tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
+            >>> model = BartForCausalLM.from_pretrained('facebook/bart-large', add_cross_attention=False)
+            >>> assert model.config.is_decoder, f"{model.__class__} has to be configured as a decoder."
+            >>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
+            >>> outputs = model(**inputs)
+
+            >>> logits = outputs.logits
+        """
+
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        # decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
+        outputs = self.model.decoder(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            encoder_hidden_states=encoder_hidden_states,
+            encoder_attention_mask=encoder_attention_mask,
+            head_mask=head_mask,
+            encoder_head_mask=encoder_head_mask,
+            past_key_values=past_key_values,
+            inputs_embeds=inputs_embeds,
+            use_cache=use_cache,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+
+        logits = self.lm_head(outputs[0])
+
+        loss = None
+        if labels is not None:
+            loss_fct = CrossEntropyLoss()
+            loss = loss_fct(logits.view(-1, self.config.vocab_size), labels.view(-1))
+
+        if not return_dict:
+            output = (logits,) + outputs[1:]
+            return (loss,) + output if loss is not None else output
+
+        return CausalLMOutputWithCrossAttentions(
+            loss=loss,
+            logits=logits,
+            past_key_values=outputs.past_key_values,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+            cross_attentions=outputs.cross_attentions,
+        )
+
+    def prepare_inputs_for_generation(self, input_ids, past=None, attention_mask=None, use_cache=None, **kwargs):
+        # if model is used as a decoder in encoder-decoder model, the decoder attention mask is created on the fly
+        if attention_mask is None:
+            attention_mask = input_ids.new_ones(input_ids.shape)
+
+        # with a cache, only the last token is fed; on the first step the cache is empty and all tokens are kept
+        if past:
+            input_ids = input_ids[:, -1:]
+        return {
+            "input_ids": input_ids,
+            "attention_mask": attention_mask,
+            "past_key_values": past,
+            "use_cache": use_cache,
+        }
+
+    @staticmethod
+    def _reorder_cache(past, beam_idx):
+        reordered_past = ()
+        for layer_past in past:
+            reordered_past += (tuple(past_state.index_select(0, beam_idx) for past_state in layer_past),)
+        return reordered_past
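
The two generation helpers that close the `modeling_bart.py` changes can be illustrated with a minimal, framework-free sketch. It is not the library code: nested Python lists stand in for `torch` tensors, list indexing stands in for `index_select(0, beam_idx)`, and the names `reorder_cache` and `last_token_only` are hypothetical.

```python
# Toy stand-in for `_reorder_cache` and the slicing in `prepare_inputs_for_generation`.
# Assumption: nested lists replace torch tensors; rows are indexed by beam.


def reorder_cache(past, beam_idx):
    """Reorder every cached state along its first (batch * beams) axis."""
    reordered_past = ()
    for layer_past in past:
        reordered_past += (
            tuple([past_state[i] for i in beam_idx] for past_state in layer_past),
        )
    return reordered_past


def last_token_only(input_ids, past=None):
    """Keep only the last token of each row when a non-empty cache exists."""
    if past:
        return [row[-1:] for row in input_ids]
    return input_ids


# One layer holding a (key, value) pair; each state has one row per beam.
past = ((["k_beam0", "k_beam1", "k_beam2"], ["v_beam0", "v_beam1", "v_beam2"]),)
beam_idx = [2, 0, 0]  # beam 2 is kept once, beam 0 is duplicated, beam 1 is dropped

reordered = reorder_cache(past, beam_idx)
print(reordered[0][0])  # ['k_beam2', 'k_beam0', 'k_beam0']
print(last_token_only([[5, 6, 7], [8, 9, 10]], past=reordered))  # [[7], [10]]
```

Reordering keeps each cached key/value row aligned with the beam hypothesis it belongs to after beam search reshuffles the hypotheses.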
--- a/src/transformers/models/bart/modeling_tf_bart.py
+++ b/src/transformers/models/bart/modeling_tf_bart.py
@@ -38,6 +38,7 @@ from ...modeling_tf_outputs import (
 # Public API
 from ...modeling_tf_utils import (
    DUMMY_INPUTS,
+    TFCausalLanguageModelingLoss,
    TFPreTrainedModel,
    TFSharedEmbeddings,
    TFWrappedEmbeddings,
@@ -164,6 +165,7 @@ class TFBartAttention(tf.keras.layers.Layer):
        key_value_states: Optional[tf.Tensor] = None,
        past_key_value: Optional[Tuple[Tuple[tf.Tensor]]] = None,
        attention_mask: Optional[tf.Tensor] = None,
+        layer_head_mask: Optional[tf.Tensor] = None,
        training=False,
    ) -> Tuple[tf.Tensor, Optional[tf.Tensor]]:
        """Input shape: Batch x Time x Channel"""
@@ -230,6 +232,17 @@ class TFBartAttention(tf.keras.layers.Layer):

        attn_weights = tf.nn.softmax(attn_weights, axis=-1)

+        if layer_head_mask is not None:
+            tf.debugging.assert_equal(
+                shape_list(layer_head_mask),
+                [self.num_heads],
+                message=f"Head mask for a single layer should be of size {(self.num_heads)}, but is {shape_list(layer_head_mask)}",
+            )
+            attn_weights = tf.reshape(layer_head_mask, (1, -1, 1, 1)) * tf.reshape(
+                attn_weights, (bsz, self.num_heads, tgt_len, src_len)
+            )
+            attn_weights = tf.reshape(attn_weights, (bsz * self.num_heads, tgt_len, src_len))
+
        attn_probs = self.dropout(attn_weights, training=training)

        attn_output = tf.matmul(attn_probs, value_states)
@@ -266,16 +279,18 @@ class TFBartEncoderLayer(tf.keras.layers.Layer):
        self.fc2 = tf.keras.layers.Dense(self.embed_dim, name="fc2")
        self.final_layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-5, name="final_layer_norm")

-    def call(self, hidden_states: tf.Tensor, attention_mask: tf.Tensor, training=False):
+    def call(self, hidden_states: tf.Tensor, attention_mask: tf.Tensor, layer_head_mask: tf.Tensor, training=False):
        """
        Args:
            hidden_states (:obj:`tf.Tensor`): input to the layer of shape `(seq_len, batch, embed_dim)`
            attention_mask (:obj:`tf.Tensor`): attention mask of size
                `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
+            layer_head_mask (:obj:`tf.Tensor`): mask for attention heads in a given layer of size
+                `(encoder_attention_heads,)`
        """
        residual = hidden_states
        hidden_states, self_attn_weights, _ = self.self_attn(
-            hidden_states=hidden_states, attention_mask=attention_mask
+            hidden_states=hidden_states, attention_mask=attention_mask, layer_head_mask=layer_head_mask
        )
        tf.debugging.assert_equal(
            shape_list(hidden_states),
@@ -331,6 +346,8 @@ class TFBartDecoderLayer(tf.keras.layers.Layer):
        attention_mask: Optional[tf.Tensor] = None,
        encoder_hidden_states: Optional[tf.Tensor] = None,
        encoder_attention_mask: Optional[tf.Tensor] = None,
+        layer_head_mask: Optional[tf.Tensor] = None,
+        encoder_layer_head_mask: Optional[tf.Tensor] = None,
        past_key_value: Optional[Tuple[tf.Tensor]] = None,
        training=False,
    ) -> Tuple[tf.Tensor, tf.Tensor, Tuple[Tuple[tf.Tensor]]]:
@@ -342,6 +359,10 @@ class TFBartDecoderLayer(tf.keras.layers.Layer):
            encoder_hidden_states (:obj:`tf.Tensor`): cross attention input to the layer of shape `(seq_len, batch, embed_dim)`
            encoder_attention_mask (:obj:`tf.Tensor`): encoder attention mask of size
                `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
+            layer_head_mask (:obj:`tf.Tensor`): mask for attention heads in a given layer of size
+                `(decoder_attention_heads,)`
+            encoder_layer_head_mask (:obj:`tf.Tensor`): mask for encoder attention heads in a given layer of size
+                `(encoder_attention_heads,)`
            past_key_value (:obj:`Tuple(tf.Tensor)`): cached past key and value projection states
        """
        residual = hidden_states
@@ -354,6 +375,7 @@ class TFBartDecoderLayer(tf.keras.layers.Layer):
            hidden_states=hidden_states,
            past_key_value=self_attn_past_key_value,
            attention_mask=attention_mask,
+            layer_head_mask=layer_head_mask,
        )
        hidden_states = self.dropout(hidden_states, training=training)
        hidden_states = residual + hidden_states
@@ -370,6 +392,7 @@ class TFBartDecoderLayer(tf.keras.layers.Layer):
                hidden_states=hidden_states,
                key_value_states=encoder_hidden_states,
                attention_mask=encoder_attention_mask,
+                layer_head_mask=encoder_layer_head_mask,
                past_key_value=cross_attn_past_key_value,
            )
            hidden_states = self.dropout(hidden_states, training=training)
@@ -411,29 +434,6 @@ class TFBartPretrainedModel(TFPreTrainedModel):
        }
        return dummy_inputs

-    def get_input_embeddings(self):
-        base_model = getattr(self, self.base_model_prefix, self)
-
-        return base_model.shared
-
-    def set_input_embeddings(self, value):
-        base_model = getattr(self, self.base_model_prefix, self)
-
-        try:
-            base_model.shared.weight = value
-        except AttributeError:
-            self(self.dummy_inputs)
-            base_model.shared.weight = value
-
-        base_model.shared.vocab_size = shape_list(base_model.shared.weight)[0]
-
-        with tf.compat.v1.variable_scope("model.shared") as shared_abs_scope_name:
-            pass
-
-        embed_tokens = TFWrappedEmbeddings(base_model.shared, abs_scope_name=shared_abs_scope_name)
-        base_model.encoder.set_embed_tokens(embed_tokens)
-        base_model.decoder.set_embed_tokens(embed_tokens)
-
    @tf.function(
        input_signature=[
            {
@@ -550,6 +550,18 @@ BART_INPUTS_DOCSTRING = r"""
            the right for denoising pre-training following the paper.
        decoder_attention_mask (:obj:`tf.Tensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`):
            will be made by default and ignore pad tokens. It is not recommended to set this for most use cases.
+        head_mask (:obj:`tf.Tensor` of shape :obj:`(encoder_layers, encoder_attention_heads)`, `optional`):
+            Mask to nullify selected heads of the attention modules in the encoder. Mask values selected in ``[0, 1]``:
+
+            - 1 indicates the head is **not masked**,
+            - 0 indicates the head is **masked**.
+
+        decoder_head_mask (:obj:`tf.Tensor` of shape :obj:`(decoder_layers, decoder_attention_heads)`, `optional`):
+            Mask to nullify selected heads of the attention modules in the decoder. Mask values selected in ``[0, 1]``:
+
+            - 1 indicates the head is **not masked**,
+            - 0 indicates the head is **masked**.
+
        encoder_outputs (:obj:`tf.FloatTensor`, `optional`):
            hidden states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
            of shape :obj:`(batch_size, sequence_length, hidden_size)` is a sequence of
@@ -563,12 +575,15 @@ BART_INPUTS_DOCSTRING = r"""
            decoding (see :obj:`past_key_values`). Set to :obj:`False` during training, :obj:`True` during generation
        output_attentions (:obj:`bool`, `optional`):
            Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned
-            tensors for more detail.
+            tensors for more detail. This argument can be used only in eager mode; in graph mode the value in the
+            config will be used instead.
        output_hidden_states (:obj:`bool`, `optional`):
            Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for
-            more detail.
+            more detail. This argument can be used only in eager mode; in graph mode the value in the config will be
+            used instead.
        return_dict (:obj:`bool`, `optional`):
-            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
+            Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple. This
+            argument can be used only in eager mode; in graph mode the value will always be set to :obj:`True`.
        training (:obj:`bool`, `optional`, defaults to :obj:`False`):
            Whether or not to use the model in training mode (some modules like dropout modules have different
            behaviors between training and evaluation).
@@ -605,6 +620,9 @@ class TFBartEncoder(tf.keras.layers.Layer):
        self.layers = [TFBartEncoderLayer(config, name=f"layers.{i}") for i in range(config.encoder_layers)]
        self.layernorm_embedding = tf.keras.layers.LayerNormalization(epsilon=1e-5, name="layernorm_embedding")

+    def get_embed_tokens(self):
+        return self.embed_tokens
+
    def set_embed_tokens(self, embed_tokens):
        self.embed_tokens = embed_tokens

@@ -613,6 +631,7 @@ class TFBartEncoder(tf.keras.layers.Layer):
        input_ids=None,
        inputs_embeds=None,
        attention_mask=None,
+        head_mask=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
@@ -637,6 +656,12 @@ class TFBartEncoder(tf.keras.layers.Layer):
                - 0 for tokens that are **masked**.

                `What are attention masks? <../glossary.html#attention-mask>`__
+            head_mask (:obj:`tf.Tensor` of shape :obj:`(encoder_layers, encoder_attention_heads)`, `optional`):
+                Mask to nullify selected heads of the attention modules. Mask values selected in ``[0, 1]``:
+
+                - 1 indicates the head is **not masked**,
+                - 0 indicates the head is **masked**.
+
            inputs_embeds (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
                Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded
                representation. This is useful if you want more control over how to convert :obj:`input_ids` indices
@@ -655,6 +680,7 @@ class TFBartEncoder(tf.keras.layers.Layer):
            config=self.config,
            input_ids=input_ids,
            attention_mask=attention_mask,
+            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
@@ -690,8 +716,15 @@ class TFBartEncoder(tf.keras.layers.Layer):
        encoder_states = () if inputs["output_hidden_states"] else None
        all_attentions = () if inputs["output_attentions"] else None

+        # check if head_mask has a correct number of layers specified if desired
+        if inputs["head_mask"] is not None:
+            tf.debugging.assert_equal(
+                shape_list(inputs["head_mask"])[0],
+                len(self.layers),
+                message=f"The head_mask should be specified for {len(self.layers)} layers, but it is for {shape_list(inputs['head_mask'])[0]}.",
+            )
        # encoder layers
-        for encoder_layer in self.layers:
+        for idx, encoder_layer in enumerate(self.layers):

            if inputs["output_hidden_states"]:
                encoder_states = encoder_states + (hidden_states,)
@@ -700,7 +733,11 @@ class TFBartEncoder(tf.keras.layers.Layer):
            if inputs["training"] and (dropout_probability < self.layerdrop):  # skip the layer
                continue

-            hidden_states, attn = encoder_layer(hidden_states, attention_mask)
+            hidden_states, attn = encoder_layer(
+                hidden_states,
+                attention_mask,
+                inputs["head_mask"][idx] if inputs["head_mask"] is not None else None,
+            )

            if inputs["output_attentions"]:
                all_attentions += (attn,)
@@ -744,6 +781,9 @@ class TFBartDecoder(tf.keras.layers.Layer):

        self.dropout = tf.keras.layers.Dropout(config.dropout)

+    def get_embed_tokens(self):
+        return self.embed_tokens
+
    def set_embed_tokens(self, embed_tokens):
        self.embed_tokens = embed_tokens

@@ -754,6 +794,8 @@ class TFBartDecoder(tf.keras.layers.Layer):
        attention_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
+        head_mask=None,
+        encoder_head_mask=None,
        past_key_values=None,
        use_cache=None,
        output_attentions=None,
@@ -791,6 +833,19 @@ class TFBartDecoder(tf.keras.layers.Layer):
                - 0 for tokens that are **masked**.

                `What are attention masks? <../glossary.html#attention-mask>`__
+            head_mask (:obj:`tf.Tensor` of shape :obj:`(decoder_layers, decoder_attention_heads)`, `optional`):
+                Mask to nullify selected heads of the attention modules. Mask values selected in ``[0, 1]``:
+
+                - 1 indicates the head is **not masked**,
+                - 0 indicates the head is **masked**.
+
+            encoder_head_mask (:obj:`tf.Tensor` of shape :obj:`(encoder_layers, encoder_attention_heads)`, `optional`):
+                Mask to nullify selected heads of the attention modules in encoder to avoid performing cross-attention
+                on hidden heads. Mask values selected in ``[0, 1]``:
+
+                - 1 indicates the head is **not masked**,
+                - 0 indicates the head is **masked**.
+
            past_key_values (:obj:`Tuple[Tuple[tf.Tensor]]` of length :obj:`config.n_layers` with each tuple having 2 tuples each of which has 2 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
                Contains precomputed key and value hidden-states of the attention blocks. Can be used to speed up
                decoding.
@@ -819,6 +874,8 @@ class TFBartDecoder(tf.keras.layers.Layer):
            attention_mask=attention_mask,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=encoder_attention_mask,
+            head_mask=head_mask,
+            encoder_head_mask=encoder_head_mask,
            inputs_embeds=inputs_embeds,
            past_key_values=past_key_values,
            use_cache=use_cache,
@@ -858,7 +915,7 @@ class TFBartDecoder(tf.keras.layers.Layer):
                tf.ones((input_shape[0], input_shape[1] + past_key_values_length)), tgt_len=input_shape[-1]
            )

-        if inputs["attention_mask"] is not None and input_shape[-1] > 1:
+        if inputs["attention_mask"] is not None:
            combined_attention_mask = combined_attention_mask + _expand_mask(
                inputs["attention_mask"], tgt_len=input_shape[-1]
            )
@@ -871,13 +928,22 @@ class TFBartDecoder(tf.keras.layers.Layer):
        hidden_states = self.dropout(hidden_states, training=inputs["training"])

        # decoder layers
-        all_hidden_states = ()
-        all_self_attns = ()
-        present_key_values = ()
+        all_hidden_states = () if inputs["output_hidden_states"] else None
+        all_self_attns = () if inputs["output_attentions"] else None
+        present_key_values = () if inputs["use_cache"] else None
+
+        # check if head_mask has a correct number of layers specified if desired
+        if inputs["head_mask"] is not None:
+            tf.debugging.assert_equal(
+                shape_list(inputs["head_mask"])[0],
+                len(self.layers),
+                message=f"The head_mask should be specified for {len(self.layers)} layers, but it is for {shape_list(inputs['head_mask'])[0]}.",
+            )
        for idx, decoder_layer in enumerate(self.layers):
            # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description)
            if inputs["output_hidden_states"]:
                all_hidden_states += (hidden_states,)
+
            dropout_probability = random.uniform(0, 1)

            if inputs["training"] and (dropout_probability < self.layerdrop):
@@ -890,6 +956,10 @@ class TFBartDecoder(tf.keras.layers.Layer):
                attention_mask=combined_attention_mask,
                encoder_hidden_states=inputs["encoder_hidden_states"],
                encoder_attention_mask=inputs["encoder_attention_mask"],
+                layer_head_mask=inputs["head_mask"][idx] if inputs["head_mask"] is not None else None,
+                encoder_layer_head_mask=inputs["encoder_head_mask"][idx]
+                if inputs["encoder_head_mask"] is not None
+                else None,
                past_key_value=past_key_value,
            )

@@ -901,12 +971,12 @@ class TFBartDecoder(tf.keras.layers.Layer):

        if inputs["output_hidden_states"]:
            all_hidden_states += (hidden_states,)
-        else:
-            all_hidden_states = None

-        all_self_attns = list(all_self_attns) if inputs["output_attentions"] else None
+        if inputs["output_attentions"]:
+            all_self_attns = list(all_self_attns)

-        present_key_values = (encoder_hidden_states, present_key_values) if inputs["use_cache"] else None
+        if inputs["use_cache"]:
+            present_key_values = (inputs["encoder_hidden_states"], present_key_values)

        if not inputs["return_dict"]:
            return hidden_states, present_key_values, all_hidden_states, all_self_attns
@@ -919,16 +989,14 @@ class TFBartDecoder(tf.keras.layers.Layer):
            )


-@add_start_docstrings(
-    "The bare BART Model outputting raw hidden-states without any specific head on top.",
-    BART_START_DOCSTRING,
-)
@keras_serializable
-class TFBartModel(TFBartPretrainedModel):
-    base_model_prefix = "model"
+class TFBartMainLayer(tf.keras.layers.Layer):
+    config_class = BartConfig

-    def __init__(self, config: BartConfig, *inputs, **kwargs):
-        super().__init__(config, *inputs, **kwargs)
+    def __init__(self, config: BartConfig, **kwargs):
+        super().__init__(**kwargs)
+
+        self.config = config
        self.shared = TFSharedEmbeddings(config.vocab_size, config.d_model, config.pad_token_id, name="model.shared")

        with tf.compat.v1.variable_scope("model.shared") as shared_abs_scope_name:
@@ -942,25 +1010,28 @@ class TFBartModel(TFBartPretrainedModel):
        self.encoder = TFBartEncoder(config, embed_tokens, name="encoder")
        self.decoder = TFBartDecoder(config, embed_tokens, name="decoder")

-    def get_encoder(self):
-        return self.encoder
+    def get_input_embeddings(self):
+        return self.shared

-    def get_decoder(self):
-        return self.decoder
+    def set_input_embeddings(self, new_embeddings):
+        self.shared.weight = new_embeddings
+        self.shared.vocab_size = self.shared.weight.shape[0]
+        # retrieve correct absolute scope for embed token wrapper
+        with tf.compat.v1.variable_scope("model.shared") as shared_abs_scope_name:
+            pass
+        # Wraps layer to avoid problems with weight restoring and ensuring we're in the correct TF scope.
+        embed_tokens = TFWrappedEmbeddings(self.shared, abs_scope_name=shared_abs_scope_name)
+        self.encoder.set_embed_tokens(embed_tokens)
+        self.decoder.set_embed_tokens(embed_tokens)

-    @add_start_docstrings_to_model_forward(BART_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
-    @add_code_sample_docstrings(
-        tokenizer_class=_TOKENIZER_FOR_DOC,
-        checkpoint="facebook/bart-large",
-        output_type=TFSeq2SeqModelOutput,
-        config_class=_CONFIG_FOR_DOC,
-    )
    def call(
        self,
        input_ids=None,
        attention_mask=None,
        decoder_input_ids=None,
        decoder_attention_mask=None,
+        head_mask=None,
+        decoder_head_mask=None,
        encoder_outputs: Optional[Union[Tuple, TFBaseModelOutput]] = None,
        past_key_values=None,
        inputs_embeds=None,
@@ -979,6 +1050,8 @@ class TFBartModel(TFBartPretrainedModel):
            attention_mask=attention_mask,
            decoder_input_ids=decoder_input_ids,
            decoder_attention_mask=decoder_attention_mask,
+            head_mask=head_mask,
+            decoder_head_mask=decoder_head_mask,
            encoder_outputs=encoder_outputs,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
@@ -1009,6 +1082,7 @@ class TFBartModel(TFBartPretrainedModel):
            inputs["encoder_outputs"] = self.encoder(
                input_ids=inputs["input_ids"],
                attention_mask=inputs["attention_mask"],
+                head_mask=inputs["head_mask"],
                inputs_embeds=inputs["inputs_embeds"],
                output_attentions=inputs["output_attentions"],
                output_hidden_states=inputs["output_hidden_states"],
@@ -1031,6 +1105,8 @@ class TFBartModel(TFBartPretrainedModel):
            attention_mask=inputs["decoder_attention_mask"],
            encoder_hidden_states=inputs["encoder_outputs"][0],
            encoder_attention_mask=inputs["attention_mask"],
+            head_mask=inputs["decoder_head_mask"],
+            encoder_head_mask=inputs["head_mask"],
            past_key_values=inputs["past_key_values"],
            inputs_embeds=inputs["decoder_inputs_embeds"],
            use_cache=inputs["use_cache"],
@@ -1053,8 +1129,92 @@ class TFBartModel(TFBartPretrainedModel):
            encoder_attentions=inputs["encoder_outputs"].attentions,
        )

+
+@add_start_docstrings(
+    "The bare BART Model outputting raw hidden-states without any specific head on top.",
+    BART_START_DOCSTRING,
+)
+class TFBartModel(TFBartPretrainedModel):
+    def __init__(self, config: BartConfig, *inputs, **kwargs):
+        super().__init__(config, *inputs, **kwargs)
+
+        self.model = TFBartMainLayer(config, name="model")
+
+    def get_encoder(self):
+        return self.model.encoder
+
+    def get_decoder(self):
+        return self.model.decoder
+
+    @add_start_docstrings_to_model_forward(BART_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
+    @add_code_sample_docstrings(
+        tokenizer_class=_TOKENIZER_FOR_DOC,
+        checkpoint="facebook/bart-large",
+        output_type=TFSeq2SeqModelOutput,
+        config_class=_CONFIG_FOR_DOC,
+    )
+    def call(
+        self,
+        input_ids=None,
+        attention_mask=None,
+        decoder_input_ids=None,
+        decoder_attention_mask=None,
+        head_mask=None,
+        decoder_head_mask=None,
+        encoder_outputs: Optional[Union[Tuple, TFBaseModelOutput]] = None,
+        past_key_values=None,
+        inputs_embeds=None,
+        decoder_inputs_embeds=None,
+        use_cache=None,
+        output_attentions=None,
+        output_hidden_states=None,
+        return_dict=None,
+        training=False,
+        **kwargs
+    ):
+        inputs = input_processing(
+            func=self.call,
+            config=self.config,
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            decoder_input_ids=decoder_input_ids,
+            decoder_attention_mask=decoder_attention_mask,
+            head_mask=head_mask,
+            decoder_head_mask=decoder_head_mask,
+            encoder_outputs=encoder_outputs,
+            past_key_values=past_key_values,
+            inputs_embeds=inputs_embeds,
+            decoder_inputs_embeds=decoder_inputs_embeds,
+            use_cache=use_cache,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+            training=training,
+            kwargs_call=kwargs,
+        )
+
+        outputs = self.model(
+            input_ids=inputs["input_ids"],
+            attention_mask=inputs["attention_mask"],
+            decoder_input_ids=inputs["decoder_input_ids"],
+            decoder_attention_mask=inputs["decoder_attention_mask"],
+            head_mask=inputs["head_mask"],
+            decoder_head_mask=inputs["decoder_head_mask"],
+            encoder_outputs=inputs["encoder_outputs"],
+            past_key_values=inputs["past_key_values"],
+            inputs_embeds=inputs["inputs_embeds"],
+            decoder_inputs_embeds=inputs["decoder_inputs_embeds"],
+            use_cache=inputs["use_cache"],
+            output_attentions=inputs["output_attentions"],
+            output_hidden_states=inputs["output_hidden_states"],
+            return_dict=inputs["return_dict"],
+            training=inputs["training"],
+        )
+
+        return outputs
+
    def serving_output(self, output):
-        pkv = (tf.tuple(output.past_key_values)[1] if self.config.use_cache else None,)
+        pkv = tf.tuple(output.past_key_values)[1] if self.config.use_cache else None
        dec_hs = tf.convert_to_tensor(output.decoder_hidden_states) if self.config.output_hidden_states else None
        dec_attns = tf.convert_to_tensor(output.decoder_attentions) if self.config.output_attentions else None
        enc_hs = tf.convert_to_tensor(output.encoder_hidden_states) if self.config.output_hidden_states else None
@@ -1075,7 +1235,7 @@ class TFBartModel(TFBartPretrainedModel):
    "The BART Model with a language modeling head. Can be used for summarization.",
    BART_START_DOCSTRING,
 )
-class TFBartForConditionalGeneration(TFBartPretrainedModel):
+class TFBartForConditionalGeneration(TFBartPretrainedModel, TFCausalLanguageModelingLoss):
    _keys_to_ignore_on_load_unexpected = [
        r"model.encoder.embed_tokens.weight",
        r"model.decoder.embed_tokens.weight",
@@ -1083,7 +1243,7 @@ class TFBartForConditionalGeneration(TFBartPretrainedModel):

    def __init__(self, config, *inputs, **kwargs):
        super().__init__(config, *inputs, **kwargs)
-        self.model = TFBartModel(config, name="model")
+        self.model = TFBartMainLayer(config, name="model")
        self.use_cache = config.use_cache
        # final_logits_bias is registered as a buffer in pytorch, so not trainable for the sake of consistency.
        self.final_logits_bias = self.add_weight(
@@ -1117,6 +1277,8 @@ class TFBartForConditionalGeneration(TFBartPretrainedModel):
        attention_mask=None,
        decoder_input_ids=None,
        decoder_attention_mask=None,
+        head_mask=None,
+        decoder_head_mask=None,
        encoder_outputs: Optional[TFBaseModelOutput] = None,
        past_key_values=None,
        inputs_embeds=None,
@@ -1145,6 +1307,8 @@ class TFBartForConditionalGeneration(TFBartPretrainedModel):
            attention_mask=attention_mask,
            decoder_input_ids=decoder_input_ids,
            decoder_attention_mask=decoder_attention_mask,
+            head_mask=head_mask,
+            decoder_head_mask=decoder_head_mask,
            encoder_outputs=encoder_outputs,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
@@ -1159,6 +1323,11 @@ class TFBartForConditionalGeneration(TFBartPretrainedModel):
        )

        if inputs["labels"] is not None:
+            inputs["labels"] = tf.where(
+                inputs["labels"] == self.config.pad_token_id,
+                tf.fill(shape_list(inputs["labels"]), -100),
+                inputs["labels"],
+            )
            inputs["use_cache"] = False
            if inputs["decoder_input_ids"] is None:
                inputs["decoder_input_ids"] = shift_tokens_right(
@@ -1171,6 +1340,8 @@ class TFBartForConditionalGeneration(TFBartPretrainedModel):
            decoder_input_ids=inputs["decoder_input_ids"],
            encoder_outputs=inputs["encoder_outputs"],
            decoder_attention_mask=inputs["decoder_attention_mask"],
+            head_mask=inputs["head_mask"],
+            decoder_head_mask=inputs["decoder_head_mask"],
            past_key_values=inputs["past_key_values"],
            inputs_embeds=inputs["inputs_embeds"],
            decoder_inputs_embeds=inputs["decoder_inputs_embeds"],
@@ -1193,13 +1364,13 @@ class TFBartForConditionalGeneration(TFBartPretrainedModel):
            past_key_values=outputs.past_key_values,  # index 1 of d outputs
            decoder_hidden_states=outputs.decoder_hidden_states,  # index 2 of d outputs
            decoder_attentions=outputs.decoder_attentions,  # index 3 of d outputs
-            encoder_last_hidden_state=outputs.last_hidden_state,  # index 0 of encoder outputs
+            encoder_last_hidden_state=outputs.encoder_last_hidden_state,  # index 0 of encoder outputs
            encoder_hidden_states=outputs.encoder_hidden_states,  # 1 of e out
            encoder_attentions=outputs.encoder_attentions,  # 2 of e out
        )

    def serving_output(self, output):
-        pkv = (tf.tuple(output.past_key_values)[1] if self.config.use_cache else None,)
+        pkv = tf.tuple(output.past_key_values)[1] if self.config.use_cache else None
        dec_hs = tf.convert_to_tensor(output.decoder_hidden_states) if self.config.output_hidden_states else None
        dec_attns = tf.convert_to_tensor(output.decoder_attentions) if self.config.output_attentions else None
        enc_hs = tf.convert_to_tensor(output.encoder_hidden_states) if self.config.output_hidden_states else None
@@ -1215,7 +1386,15 @@ class TFBartForConditionalGeneration(TFBartPretrainedModel):
            encoder_attentions=enc_attns,
        )

-    def prepare_inputs_for_generation(self, decoder_input_ids, past, attention_mask, use_cache, **kwargs) -> Dict:
+    def prepare_inputs_for_generation(
+        self,
+        decoder_input_ids,
+        past,
+        attention_mask,
+        head_mask=None,
+        use_cache=None,
+        **kwargs,
+    ) -> Dict:
        assert past is not None and len(past) in {1, 2}, f"past has to be an iterable of length 1,2 got {past}"
        if len(past) == 1:
            assert isinstance(past[0], tf.Tensor), f"`past[0]` has to be of type `tf.Tensor`, but is {type(past[0])}"
@@ -1247,6 +1426,7 @@ class TFBartForConditionalGeneration(TFBartPretrainedModel):
            "past_key_values": past_key_values,
            "decoder_input_ids": decoder_input_ids,
            "attention_mask": attention_mask,
+            "head_mask": head_mask,
            "use_cache": use_cache,  # change this to avoid caching (presumably for debugging)
        }

@@ -1274,15 +1454,3 @@ class TFBartForConditionalGeneration(TFBartPretrainedModel):
            return tf.where(vocab_range != self.config.eos_token_id, LARGE_NEGATIVE, logits)
        else:
            return logits
-
-    def compute_loss(self, labels, logits):
-        """CrossEntropyLoss that ignores pad tokens"""
-        loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
-            from_logits=True,
-            reduction=tf.keras.losses.Reduction.NONE,
-        )
-        melted_labels = tf.reshape(labels, (-1,))
-        active_loss = tf.not_equal(melted_labels, self.config.pad_token_id)
-        reduced_logits = tf.boolean_mask(tf.reshape(logits, (-1, shape_list(logits)[2])), active_loss)
-        labels = tf.boolean_mask(melted_labels, active_loss)
-        return loss_fn(labels, reduced_logits)
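Note on the hunks above: the model-specific `compute_loss`, which filtered out positions equal to `pad_token_id`, is removed, and `call` now rewrites pad labels to `-100` before the loss is computed. The following is a minimal pure-Python sketch of that idea, not the library code. It assumes the inherited loss skips every position labelled `-100`. The helper names `mask_pad_labels` and `token_losses` are hypothetical and exist only for illustration.

```python
import math

IGNORE_INDEX = -100  # assumed sentinel that the shared loss skips


def mask_pad_labels(labels, pad_token_id, ignore_index=IGNORE_INDEX):
    """Replace every pad label with the ignore index (mirrors the tf.where call in the diff)."""
    return [[ignore_index if tok == pad_token_id else tok for tok in row] for row in labels]


def token_losses(labels, logits, ignore_index=IGNORE_INDEX):
    """Per-token cross-entropy from raw logits, skipping ignored positions."""
    losses = []
    for row_labels, row_logits in zip(labels, logits):
        for label, scores in zip(row_labels, row_logits):
            if label == ignore_index:
                continue  # padding contributes nothing to the loss
            top = max(scores)
            # Numerically stable log-sum-exp over the vocabulary axis.
            log_z = top + math.log(sum(math.exp(s - top) for s in scores))
            losses.append(log_z - scores[label])
    return losses


# Hypothetical batch: one sequence of four tokens, vocabulary of size 3, pad id 1.
labels = [[2, 0, 1, 1]]
logits = [[[0.0, 0.0, 0.0]] * 4]
masked = mask_pad_labels(labels, pad_token_id=1)
losses = token_losses(masked, logits)
```

In this sketch only the two non-pad positions produce a loss value, which is the same effect the removed `compute_loss` achieved with `tf.boolean_mask`.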
--- a/src/transformers/models/bart/tokenization_bart.py
+++ b/src/transformers/models/bart/tokenization_bart.py
@@ -38,11 +38,9 @@ class BartTokenizer(RobertaTokenizer):
    r"""
    Construct a BART tokenizer.

-    :class:`~transformers.BartTokenizer` is identical to :class:`~transformers.RobertaTokenizer` and adds a new
-    :meth:`~transformers.BartTokenizer.prepare_seq2seq_batch`
-
-    Refer to superclass :class:`~transformers.RobertaTokenizer` for usage examples and documentation concerning the
-    initialization parameters and other methods.
+    :class:`~transformers.BartTokenizer` is identical to :class:`~transformers.RobertaTokenizer`. Refer to superclass
+    :class:`~transformers.RobertaTokenizer` for usage examples and documentation concerning the initialization
+    parameters and other methods.
    """
    # merges and vocab same as Roberta
    max_model_input_sizes = {m: 1024 for m in _all_bart_models}
--- a/src/transformers/models/bart/tokenization_bart_fast.py
+++ b/src/transformers/models/bart/tokenization_bart_fast.py
@@ -37,6 +37,13 @@ _all_bart_models = [


 class BartTokenizerFast(RobertaTokenizerFast):
+    r"""
+    Construct a "fast" BART tokenizer (backed by HuggingFace's `tokenizers` library).
+
+    :class:`~transformers.BartTokenizerFast` is identical to :class:`~transformers.RobertaTokenizerFast`. Refer to
+    superclass :class:`~transformers.RobertaTokenizerFast` for usage examples and documentation concerning the
+    initialization parameters and other methods.
+    """
    # merges and vocab same as Roberta
    max_model_input_sizes = {m: 1024 for m in _all_bart_models}
    pretrained_vocab_files_map = {
--- a/src/transformers/models/barthez/tokenization_barthez.py
+++ b/src/transformers/models/barthez/tokenization_barthez.py
@@ -21,7 +21,7 @@ from typing import List, Optional, Tuple

 import sentencepiece as spm

-from ...tokenization_utils import PreTrainedTokenizer
+from ...tokenization_utils import AddedToken, PreTrainedTokenizer
 from ...utils import logging


@@ -90,14 +90,15 @@ class BarthezTokenizer(PreTrainedTokenizer):
        additional_special_tokens (:obj:`List[str]`, `optional`, defaults to :obj:`["<s>NOTUSED", "</s>NOTUSED"]`):
            Additional special tokens used by the tokenizer.

-    Attributes: sp_model (:obj:`SentencePieceProcessor`): The `SentencePiece` processor that is used for every
-    conversion (string, tokens and IDs).
+    Attributes:
+        sp_model (:obj:`SentencePieceProcessor`):
+            The `SentencePiece` processor that is used for every conversion (string, tokens and IDs).
    """

    vocab_files_names = VOCAB_FILES_NAMES
    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
-    model_input_names = ["attention_mask"]
+    model_input_names = ["input_ids", "attention_mask"]

    def __init__(
        self,
@@ -111,6 +112,9 @@ class BarthezTokenizer(PreTrainedTokenizer):
        mask_token="<mask>",
        **kwargs
    ):
+        # The mask token behaves like a normal word, i.e. it includes the space before it
+        mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
+
        super().__init__(
            bos_token=bos_token,
            eos_token=eos_token,
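Note on the hunk above: wrapping the mask token in `AddedToken(..., lstrip=True, rstrip=False)` is meant to make the token absorb the whitespace on its left. The sketch below only illustrates that matching behaviour with a regular expression. It is not how the tokenizer is implemented, and `split_on_special` is a hypothetical helper.

```python
import re


def split_on_special(text, token, lstrip=False, rstrip=False):
    """Split text around a special token, optionally absorbing adjacent whitespace."""
    pattern = re.escape(token)
    if lstrip:
        pattern = r"\s*" + pattern  # swallow whitespace to the left of the token
    if rstrip:
        pattern = pattern + r"\s*"  # swallow whitespace to the right of the token
    pieces = []
    pos = 0
    for match in re.finditer(pattern, text):
        if match.start() > pos:
            pieces.append(text[pos:match.start()])
        pieces.append(token)
        pos = match.end()
    if pos < len(text):
        pieces.append(text[pos:])
    return pieces


sentence = "Paris est la <mask> de la France"
with_lstrip = split_on_special(sentence, "<mask>", lstrip=True)
without_lstrip = split_on_special(sentence, "<mask>")
```

With `lstrip=True` the text before the token no longer ends in a stray space, while the text after the token keeps its leading space because `rstrip=False`.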
--- a/src/transformers/models/barthez/tokenization_barthez_fast.py
+++ b/src/transformers/models/barthez/tokenization_barthez_fast.py
@@ -20,6 +20,7 @@ from shutil import copyfile
 from typing import List, Optional, Tuple

 from ...file_utils import is_sentencepiece_available
+from ...tokenization_utils import AddedToken
 from ...tokenization_utils_fast import PreTrainedTokenizerFast
 from ...utils import logging

@@ -98,15 +99,12 @@ class BarthezTokenizerFast(PreTrainedTokenizerFast):
            modeling. This is the token which the model will try to predict.
        additional_special_tokens (:obj:`List[str]`, `optional`, defaults to :obj:`["<s>NOTUSED", "</s>NOTUSED"]`):
            Additional special tokens used by the tokenizer.
-
-    Attributes: sp_model (:obj:`SentencePieceProcessor`): The `SentencePiece` processor that is used for every
-    conversion (string, tokens and IDs).
    """

    vocab_files_names = VOCAB_FILES_NAMES
    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
-    model_input_names = ["attention_mask"]
+    model_input_names = ["input_ids", "attention_mask"]
    slow_tokenizer_class = BarthezTokenizer

    def __init__(
@@ -122,6 +120,9 @@ class BarthezTokenizerFast(PreTrainedTokenizerFast):
        mask_token="<mask>",
        **kwargs
    ):
+        # The mask token behaves like a normal word, i.e. it includes the space before it
+        mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
+
        super().__init__(
            vocab_file,
            tokenizer_file=tokenizer_file,
--- a/src/transformers/models/bert/convert_bert_original_tf2_checkpoint_to_pytorch.py
+++ b/src/transformers/models/bert/convert_bert_original_tf2_checkpoint_to_pytorch.py
@@ -28,8 +28,8 @@ import re
 import tensorflow as tf
 import torch

-from ...utils import logging
-from . import BertConfig, BertModel
+from transformers import BertConfig, BertModel
+from transformers.utils import logging


 logging.set_verbosity_info()
--- a/src/transformers/models/bert/convert_bert_original_tf_checkpoint_to_pytorch.py
+++ b/src/transformers/models/bert/convert_bert_original_tf_checkpoint_to_pytorch.py
@@ -19,8 +19,8 @@ import argparse

 import torch

-from ...utils import logging
-from . import BertConfig, BertForPreTraining, load_tf_weights_in_bert
+from transformers import BertConfig, BertForPreTraining, load_tf_weights_in_bert
+from transformers.utils import logging


 logging.set_verbosity_info()