Release: v3.4.0

Update README.md
add summary (#7927)
2020-10-20 16:22:26 +02:00 · 2020-10-20 16:13:49 +02:00 · 2020-10-20 10:11:02 -04:00 · 2020-10-20 09:50:47 -04:00 · 2020-10-20 15:42:29 +02:00 · 2020-10-20 15:13:22 +02:00
437 changed files with 35833 additions and 6138 deletions
--- a/.circleci/config.yml
+++ b/.circleci/config.yml
@@ -84,7 +84,7 @@ jobs:
                key: v0.3-{{ checksum "setup.py" }}
                paths:
                    - '~/.cache/pip'
-            - run: python -m pytest -n 8 --dist=loadfile -rA -s ./tests/ --cov  | tee output.txt
+            - run: python -m pytest -n 8 --dist=loadfile -rA -s ./tests/ --cov --durations=0 | tee output.txt
            - run: codecov
            - store_artifacts:
                  path: ~/transformers/output.txt
@@ -139,6 +139,31 @@ jobs:
            - store_artifacts:
               path: ~/transformers/output.txt
               destination: test_output.txt
+    run_tests_flax:
+        working_directory: ~/transformers
+        docker:
+            - image: circleci/python:3.7
+        environment:
+            OMP_NUM_THREADS: 1
+        resource_class: xlarge
+        parallelism: 1
+        steps:
+            - checkout
+            - restore_cache:
+                keys:
+                    - v0.3-flax-{{ checksum "setup.py" }}
+                    - v0.3-{{ checksum "setup.py" }}
+            - run: pip install --upgrade pip
+            - run: pip install git+https://github.com/huggingface/datasets
+            - run: sudo pip install .[flax,sklearn,torch,testing]
+            - save_cache:
+                  key: v0.3-flax-{{ checksum "setup.py" }}
+                  paths:
+                      - '~/.cache/pip'
+            - run: python -m pytest -n 8 --dist=loadfile -rA -s ./tests/ | tee output.txt
+            - store_artifacts:
+                  path: ~/transformers/output.txt
+                  destination: test_output.txt
    run_tests_custom_tokenizers:
        working_directory: ~/transformers
        docker:
@@ -198,7 +223,7 @@ jobs:
                      - v0.3-build_doc-{{ checksum "setup.py" }}
                      - v0.3-{{ checksum "setup.py" }}
            - run: pip install --upgrade pip
-            - run: pip install .[tf,torch,docs]
+            - run: pip install .[tf,torch,sentencepiece,docs]
            - save_cache:
                  key: v0.3-build_doc-{{ checksum "setup.py" }}
                  paths:
@@ -219,7 +244,7 @@ jobs:
                  keys:
                      - v0.3-deploy_doc-{{ checksum "setup.py" }}
                      - v0.3-{{ checksum "setup.py" }}
-            - run: pip install .[tf,torch,docs]
+            - run: pip install .[tf,torch,sentencepiece,docs]
            - save_cache:
                  key: v0.3-deploy_doc-{{ checksum "setup.py" }}
                  paths:
@@ -239,7 +264,7 @@ jobs:
                      - v0.3-{{ checksum "setup.py" }}
            - run: pip install --upgrade pip
            - run: pip install isort
-            - run: pip install .[tf,torch,quality]
+            - run: pip install .[tf,torch,flax,quality]
            - save_cache:
                  key: v0.3-code_quality-{{ checksum "setup.py" }}
                  paths:
@@ -248,6 +273,7 @@ jobs:
            - run: isort --check-only examples templates tests src utils
            - run: flake8 examples templates tests src utils
            - run: python utils/check_copies.py
+            - run: python utils/check_dummies.py
            - run: python utils/check_repo.py
    check_repository_consistency:
        working_directory: ~/transformers
@@ -304,6 +330,7 @@ workflows:
            - run_tests_torch_and_tf
            - run_tests_torch
            - run_tests_tf
+            - run_tests_flax
            - build_doc
            - deploy_doc: *workflow_filters
    tpu_testing_jobs:
--- a/.circleci/deploy.sh
+++ b/.circleci/deploy.sh
@@ -49,4 +49,5 @@ deploy_doc "10d7239" v2.10.0
 deploy_doc "b42586e" v2.11.0
 deploy_doc "7fb8bdf" v3.0.2
 deploy_doc "4b3ee9c" v3.1.0
-deploy_doc "3ebb1b3" # v3.2.0 Latest stable release
+deploy_doc "3ebb1b3" v3.2.0
+deploy_doc "0613f05" # v3.3.0 Latest stable release
--- a/.github/PULL_REQUEST_TEMPLATE.md
+++ b/.github/PULL_REQUEST_TEMPLATE.md
@@ -16,15 +16,15 @@ Fixes # (issue)


 ## Before submitting
-- [ ] This PR fixes a typo or improves the docs (you can dimiss the other checks if that's the case).
-- [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/master/CONTRIBUTING.md#start-contributing-pull-requests), 
+- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
+- [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/master/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
 - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link
      to the it if that's the case.
 - [ ] Did you make sure to update the documentation with your changes? Here are the
      [documentation guidelines](https://github.com/huggingface/transformers/tree/master/docs), and
      [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/master/docs#writing-source-documentation).
-- [ ] Did you write any new necessary tests? 
+- [ ] Did you write any new necessary tests?


 ## Who can review?
@@ -37,25 +37,27 @@ members/contributors which may be interested in your PR.
 If you know how to use git blame, that is the easiest way, otherwise, here is a rough guide of **who to tag**.
 Please tag fewer than 3 people.

- albert, bert, GPT2, XLM: @LysandreJik 
+ albert, bert, XLM: @LysandreJik
+ GPT2: @LysandreJik, @patrickvonplaten
 tokenizers: @mfuntowicz
 Trainer: @sgugger
- Speed and Memory Benchmarks: @patrickvonplaten
+ Benchmarks: @patrickvonplaten
 Model Cards: @julien-c
 Translation: @sshleifer
 Summarization: @sshleifer
- TextGeneration: @TevenLeScao 
 examples/distillation: @VictorSanh
 nlp datasets: [different repo](https://github.com/huggingface/nlp)
 rust tokenizers: [different repo](https://github.com/huggingface/tokenizers)
- Text Generation: @TevenLeScao
+ Text Generation: @patrickvonplaten, @TevenLeScao
 Blenderbot, Bart, Marian, Pegasus: @sshleifer
 T5: @patrickvonplaten
- Longformer/Reformer: @patrickvonplaten
- TransfoXL/XLNet: @TevenLeScao 
+ Rag: @patrickvonplaten, @lhoestq
+ EncoderDecoder: @patrickvonplaten
+ Longformer, Reformer: @patrickvonplaten
+ TransfoXL, XLNet: @TevenLeScao, @patrickvonplaten
 examples/seq2seq: @sshleifer
 examples/bert-loses-patience: @JetRunner
 tensorflow: @jplu
 examples/token-classification: @stefan-it
 documentation: @sgugger
- -->
+ -->
--- a/.github/workflows/github-torch-hub.yml
+++ b/.github/workflows/github-torch-hub.yml
@@ -30,7 +30,7 @@ jobs:
      run: |
        pip install --upgrade pip
        pip install torch
-        pip install numpy tokenizers filelock requests tqdm regex sentencepiece sacremoses packaging
+        pip install numpy filelock protobuf requests tqdm regex sentencepiece sacremoses tokenizers packaging

    - name: Torch hub list
      run: |
--- a/.github/workflows/self-push.yml
+++ b/.github/workflows/self-push.yml
@@ -14,7 +14,7 @@ on:

 jobs:
  run_tests_torch_and_tf_gpu:
-    runs-on: self-hosted
+    runs-on: [self-hosted, single-gpu]
    steps:
    - uses: actions/checkout@v2
    - name: Python version
@@ -51,14 +51,64 @@ jobs:
    - name: Are GPUs recognized by our DL frameworks
      run: |
        source .env/bin/activate
-        python -c "import torch; print(torch.cuda.is_available())"
+        python -c "import torch; print('Cuda available:', torch.cuda.is_available())"
+        python -c "import torch; print('Number of GPUs available:', torch.cuda.device_count())"

    - name: Run all non-slow tests on GPU
      env:
        TF_FORCE_GPU_ALLOW_GROWTH: "true"
        # TF_GPU_MEMORY_LIMIT: 4096
        OMP_NUM_THREADS: 1
-        USE_CUDA: yes
      run: |
        source .env/bin/activate
        python -m pytest -n 2 --dist=loadfile -s ./tests/
+
+  run_tests_torch_and_tf_multiple_gpu:
+    runs-on: [self-hosted, multi-gpu]
+    steps:
+      - uses: actions/checkout@v2
+      - name: Python version
+        run: |
+          which python
+          python --version
+          pip --version
+      - name: Current dir
+        run: pwd
+      - run: nvidia-smi
+
+      - name: Loading cache.
+        uses: actions/cache@v2
+        id: cache
+        with:
+          path: .env
+          key: v0-tests_tf_torch_multiple_gpu-${{ hashFiles('setup.py') }}
+
+      - name: Create new python env (on self-hosted runners we have to handle isolation ourselves)
+        run: |
+          python -m venv .env
+          source .env/bin/activate
+          which python
+          python --version
+          pip --version
+      - name: Install dependencies
+        run: |
+          source .env/bin/activate
+          pip install --upgrade pip
+          pip install torch!=1.6.0
+          pip install .[sklearn,testing,onnxruntime]
+          pip install git+https://github.com/huggingface/datasets
+
+      - name: Are GPUs recognized by our DL frameworks
+        run: |
+          source .env/bin/activate
+          python -c "import torch; print('Cuda available:', torch.cuda.is_available())"
+          python -c "import torch; print('Number of GPUs available:', torch.cuda.device_count())"
+
+      - name: Run all non-slow tests on GPU
+        env:
+          TF_FORCE_GPU_ALLOW_GROWTH: "true"
+          # TF_GPU_MEMORY_LIMIT: 4096
+          OMP_NUM_THREADS: 1
+        run: |
+          source .env/bin/activate
+          python -m pytest -n 2 --dist=loadfile -s ./tests/
--- a/.github/workflows/self-scheduled.yml
+++ b/.github/workflows/self-scheduled.yml
@@ -10,7 +10,7 @@ on:

 jobs:
  run_all_tests_torch_and_tf_gpu:
-    runs-on: self-hosted
+    runs-on: [self-hosted, single-gpu]
    steps:
    - uses: actions/checkout@v2

@@ -48,25 +48,86 @@ jobs:
    - name: Are GPUs recognized by our DL frameworks
      run: |
        source .env/bin/activate
-        python -c "import torch; print(torch.cuda.is_available())"
+        python -c "import torch; print('Cuda available:', torch.cuda.is_available())"
+        python -c "import torch; print('Number of GPUs available:', torch.cuda.device_count())"
+

    - name: Run all tests on GPU
      env:
        TF_FORCE_GPU_ALLOW_GROWTH: "true"
        OMP_NUM_THREADS: 1
        RUN_SLOW: yes
-        USE_CUDA: yes
      run: |
        source .env/bin/activate
-        python -m pytest -n 1 --dist=loadfile -s ./tests/
+        python -m pytest -n 1 --dist=loadfile -s ./tests/ --durations=0

    - name: Run examples tests on GPU
      env:
        TF_FORCE_GPU_ALLOW_GROWTH: "true"
        OMP_NUM_THREADS: 1
        RUN_SLOW: yes
-        USE_CUDA: yes
      run: |
        source .env/bin/activate
        pip install -r examples/requirements.txt
-        python -m pytest -n 1 --dist=loadfile -s examples
+        python -m pytest -n 1 --dist=loadfile -s examples --durations=0
+
+  run_all_tests_torch_and_tf_multiple_gpu:
+    runs-on: [self-hosted, multi-gpu]
+    steps:
+      - uses: actions/checkout@v2
+
+      - name: Loading cache.
+        uses: actions/cache@v2
+        id: cache
+        with:
+          path: .env
+          key: v0-slow_tests_tf_torch_multi_gpu-${{ hashFiles('setup.py') }}
+
+      - name: Python version
+        run: |
+          which python
+          python --version
+          pip --version
+      - name: Current dir
+        run: pwd
+      - run: nvidia-smi
+      - name: Create new python env (on self-hosted runners we have to handle isolation ourselves)
+        if: steps.cache.outputs.cache-hit != 'true'
+        run: |
+          python -m venv .env
+          source .env/bin/activate
+          which python
+          python --version
+          pip --version
+      - name: Install dependencies
+        run: |
+          source .env/bin/activate
+          pip install --upgrade pip
+          pip install torch!=1.6.0
+          pip install .[sklearn,testing,onnxruntime]
+          pip install git+https://github.com/huggingface/datasets
+
+      - name: Are GPUs recognized by our DL frameworks
+        run: |
+          source .env/bin/activate
+          python -c "import torch; print('Cuda available:', torch.cuda.is_available())"
+          python -c "import torch; print('Number of GPUs available:', torch.cuda.device_count())"
+
+      - name: Run all tests on GPU
+        env:
+          TF_FORCE_GPU_ALLOW_GROWTH: "true"
+          OMP_NUM_THREADS: 1
+          RUN_SLOW: yes
+        run: |
+          source .env/bin/activate
+          python -m pytest -n 1 --dist=loadfile -s ./tests/ --durations=0
+
+      - name: Run examples tests on GPU
+        env:
+          TF_FORCE_GPU_ALLOW_GROWTH: "true"
+          OMP_NUM_THREADS: 1
+          RUN_SLOW: yes
+        run: |
+          source .env/bin/activate
+          pip install -r examples/requirements.txt
+          python -m pytest -n 1 --dist=loadfile -s examples --durations=0
--- a/.gitignore
+++ b/.gitignore
@@ -9,9 +9,11 @@ __pycache__/
 *.so

 # tests and logs
-tests/fixtures
+tests/fixtures/*
+!tests/fixtures/sample_text_no_unicode.txt
 logs/
 lightning_logs/
+lang_code_data/

 # Distribution / packaging
 .Python
@@ -155,3 +157,6 @@ debug.env

 #ctags
 tags
+
+# pre-commit
+.pre-commit*
--- a/CODE_OF_CONDUCT.md
+++ b/CODE_OF_CONDUCT.md
@@ -0,0 +1,129 @@
+
+# Contributor Covenant Code of Conduct
+
+## Our Pledge
+
+We as members, contributors, and leaders pledge to make participation in our
+community a harassment-free experience for everyone, regardless of age, body
+size, visible or invisible disability, ethnicity, sex characteristics, gender
+identity and expression, level of experience, education, socio-economic status,
+nationality, personal appearance, race, religion, or sexual identity
+and orientation.
+
+We pledge to act and interact in ways that contribute to an open, welcoming,
+diverse, inclusive, and healthy community.
+
+## Our Standards
+
+Examples of behavior that contributes to a positive environment for our
+community include:
+
+* Demonstrating empathy and kindness toward other people
+* Being respectful of differing opinions, viewpoints, and experiences
+* Giving and gracefully accepting constructive feedback
+* Accepting responsibility and apologizing to those affected by our mistakes,
+  and learning from the experience
+* Focusing on what is best not just for us as individuals, but for the
+  overall community
+
+Examples of unacceptable behavior include:
+
+* The use of sexualized language or imagery, and sexual attention or
+  advances of any kind
+* Trolling, insulting or derogatory comments, and personal or political attacks
+* Public or private harassment
+* Publishing others' private information, such as a physical or email
+  address, without their explicit permission
+* Other conduct which could reasonably be considered inappropriate in a
+  professional setting
+
+## Enforcement Responsibilities
+
+Community leaders are responsible for clarifying and enforcing our standards of
+acceptable behavior and will take appropriate and fair corrective action in
+response to any behavior that they deem inappropriate, threatening, offensive,
+or harmful.
+
+Community leaders have the right and responsibility to remove, edit, or reject
+comments, commits, code, wiki edits, issues, and other contributions that are
+not aligned to this Code of Conduct, and will communicate reasons for moderation
+decisions when appropriate.
+
+## Scope
+
+This Code of Conduct applies within all community spaces, and also applies when
+an individual is officially representing the community in public spaces.
+Examples of representing our community include using an official e-mail address,
+posting via an official social media account, or acting as an appointed
+representative at an online or offline event.
+
+## Enforcement
+
+Instances of abusive, harassing, or otherwise unacceptable behavior may be
+reported to the community leaders responsible for enforcement at
+feedback@huggingface.co.
+All complaints will be reviewed and investigated promptly and fairly.
+
+All community leaders are obligated to respect the privacy and security of the
+reporter of any incident.
+
+## Enforcement Guidelines
+
+Community leaders will follow these Community Impact Guidelines in determining
+the consequences for any action they deem in violation of this Code of Conduct:
+
+### 1. Correction
+
+**Community Impact**: Use of inappropriate language or other behavior deemed
+unprofessional or unwelcome in the community.
+
+**Consequence**: A private, written warning from community leaders, providing
+clarity around the nature of the violation and an explanation of why the
+behavior was inappropriate. A public apology may be requested.
+
+### 2. Warning
+
+**Community Impact**: A violation through a single incident or series
+of actions.
+
+**Consequence**: A warning with consequences for continued behavior. No
+interaction with the people involved, including unsolicited interaction with
+those enforcing the Code of Conduct, for a specified period of time. This
+includes avoiding interactions in community spaces as well as external channels
+like social media. Violating these terms may lead to a temporary or
+permanent ban.
+
+### 3. Temporary Ban
+
+**Community Impact**: A serious violation of community standards, including
+sustained inappropriate behavior.
+
+**Consequence**: A temporary ban from any sort of interaction or public
+communication with the community for a specified period of time. No public or
+private interaction with the people involved, including unsolicited interaction
+with those enforcing the Code of Conduct, is allowed during this period.
+Violating these terms may lead to a permanent ban.
+
+### 4. Permanent Ban
+
+**Community Impact**: Demonstrating a pattern of violation of community
+standards, including sustained inappropriate behavior,  harassment of an
+individual, or aggression toward or disparagement of classes of individuals.
+
+**Consequence**: A permanent ban from any sort of public interaction within
+the community.
+
+## Attribution
+
+This Code of Conduct is adapted from the [Contributor Covenant][homepage],
+version 2.0, available at
+https://www.contributor-covenant.org/version/2/0/code_of_conduct.html.
+
+Community Impact Guidelines were inspired by [Mozilla's code of conduct
+enforcement ladder](https://github.com/mozilla/diversity).
+
+[homepage]: https://www.contributor-covenant.org
+
+For answers to common questions about this code of conduct, see the FAQ at
+https://www.contributor-covenant.org/faq. Translations are available at
+https://www.contributor-covenant.org/translations.
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -9,6 +9,9 @@ It also helps us if you spread the word: reference the library from blog posts
 on the awesome projects it made possible, shout out on Twitter every time it has
 helped you, or simply star the repo to say "thank you".

+Whichever way you choose to contribute, please be mindful to respect our
+[code of conduct](https://github.com/huggingface/transformers/blob/master/CODE_OF_CONDUCT.md).
+
 ## You can contribute in so many ways!

 There are 4 ways you can contribute to transformers:
@@ -176,13 +179,14 @@ Follow these steps to start contributing:
   ```bash
   $ make quality
   ```
-
   You can do the automatic style corrections and code verifications that can't be automated in one go:

   ```bash
   $ make fixup
   ```

+   This target is also optimized to only work with files modified by the PR you're working on.
+
   If you're modifying documents under `docs/source`, make sure to validate that
   they can still be built. This check also runs in CI. To run a local check
   make sure you have installed the documentation builder requirements, by
--- a/Makefile
+++ b/Makefile
@@ -1,29 +1,53 @@
-.PHONY: quality_checks quality style fixup test test-examples docs
+.PHONY: modified_only_fixup extra_quality_checks quality style fixup fix-copies test test-examples docs
+
+
+check_dirs := examples templates tests src utils
+
+# get modified files since the branch was made
+fork_point_sha := $(shell git merge-base --fork-point master)
+joined_dirs := $(shell echo $(check_dirs) | tr " " "|")
+modified_py_files := $(shell git diff --name-only $(fork_point_sha) | egrep '^($(joined_dirs))' | egrep '\.py$$')
+#$(info modified files are: $(modified_py_files))
+
+modified_only_fixup:
+	@if [ -n "$(modified_py_files)" ]; then \
+		echo "Checking/fixing $(modified_py_files)"; \
+		black $(modified_py_files); \
+		isort $(modified_py_files); \
+		flake8 $(modified_py_files); \
+	else \
+		echo "No library .py files were modified"; \
+	fi

 # Check that source code meets quality standards

-quality_checks:
-	flake8 examples templates tests src utils
+extra_quality_checks:
 	python utils/check_copies.py
+	python utils/check_dummies.py
 	python utils/check_repo.py

+# this target runs checks on all files
 quality:
-	black --check examples templates tests src utils
-	isort --check-only examples templates tests src utils
-	${MAKE} quality_checks
+	black --check $(check_dirs)
+	isort --check-only $(check_dirs)
+	flake8 $(check_dirs)
+	${MAKE} extra_quality_checks

 # Format source code automatically and check is there are any problems left that need manual fixing

 style:
-	black examples templates tests src utils
-	isort examples templates tests src utils
+	black $(check_dirs)
+	isort $(check_dirs)

-fixup: style quality_checks
+# Super fast fix and check target that only works on relevant modified files since the branch was made
+
+fixup: modified_only_fixup extra_quality_checks

 # Make marked copies of snippets of codes conform to the original

 fix-copies:
 	python utils/check_copies.py --fix_and_overwrite
+	python utils/check_dummies.py --fix_and_overwrite

 # Run tests for the library
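
The `modified_only_fixup` target above selects files by piping `git diff --name-only` through two `egrep` filters. A minimal Python sketch of that selection logic follows, for readers who want to reproduce it outside of make. The function names `filter_modified_py_files` and `modified_py_files` are hypothetical and not part of the repository; only the directory list and the git commands are taken from the Makefile.

```python
import re
import subprocess

# Same directory list as `check_dirs` in the Makefile.
CHECK_DIRS = ["examples", "templates", "tests", "src", "utils"]


def filter_modified_py_files(changed_paths, check_dirs=CHECK_DIRS):
    """Keep paths that start with one of `check_dirs` and end in `.py`.

    Mirrors the two egrep filters: '^(examples|templates|...)' and '\\.py$'.
    Like the Makefile, this is a plain prefix match, so a path such as
    'srcfoo/x.py' would also be kept.
    """
    prefix = re.compile(r"^(" + "|".join(re.escape(d) for d in check_dirs) + r")")
    return [p for p in changed_paths if prefix.search(p) and p.endswith(".py")]


def modified_py_files(base="master"):
    """Hypothetical helper: list modified .py files since the fork point.

    Runs the same git commands as the Makefile and raises CalledProcessError
    if git cannot determine a fork point for `base`.
    """
    fork_point = subprocess.run(
        ["git", "merge-base", "--fork-point", base],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    diff = subprocess.run(
        ["git", "diff", "--name-only", fork_point],
        capture_output=True, text=True, check=True,
    ).stdout
    return filter_modified_py_files(diff.splitlines())
```

For example, `filter_modified_py_files(["src/transformers/file_utils.py", "docs/source/conf.py", "setup.py"])` keeps only the first path, since the other two fall outside the checked directories.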

--- a/README.md
+++ b/README.md
@@ -16,15 +16,18 @@
    <a href="https://github.com/huggingface/transformers/releases">
        <img alt="GitHub release" src="https://img.shields.io/github/release/huggingface/transformers.svg">
    </a>
+    <a href="https://github.com/huggingface/transformers/blob/master/CODE_OF_CONDUCT.md">
+        <img alt="Contributor Covenant" src="https://img.shields.io/badge/Contributor%20Covenant-v2.0%20adopted-ff69b4.svg">
+    </a>
 </p>

 <h3 align="center">
 <p>State-of-the-art Natural Language Processing for PyTorch and TensorFlow 2.0
 </h3>

-🤗 Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, text generation, etc in 100+ languages. Its aim is to make cutting-edge NLP easier to use for everyone. 
+🤗 Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, text generation, etc in 100+ languages. Its aim is to make cutting-edge NLP easier to use for everyone.

-🤗 Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets then share them with the community on our [model hub](https://huggingface.co/models). At the same time, each python module defining an architecture can be used as a standalone and modified to enable quick research experiments. 
+🤗 Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets then share them with the community on our [model hub](https://huggingface.co/models). At the same time, each python module defining an architecture can be used as a standalone and modified to enable quick research experiments.

 🤗 Transformers is backed by the two most popular deep learning libraries, [PyTorch](https://pytorch.org/) and [TensorFlow](https://www.tensorflow.org/), with a seamless integration between them, allowing you to train your models with one then load it for inference with the other.

@@ -35,7 +38,7 @@

 You can test most of our models directly on their pages from the [model hub](https://huggingface.co/models). We also offer an [inference API](https://huggingface.co/pricing) to use those models.

-Here are a few examples: 
+Here are a few examples:
 - [Masked word completion with BERT](https://huggingface.co/bert-base-uncased?text=Paris+is+the+%5BMASK%5D+of+France)
 - [Name Entity Recognition with Electra](https://huggingface.co/dbmdz/electra-large-discriminator-finetuned-conll03-english?text=My+name+is+Sarah+and+I+live+in+London+city)
 - [Text generation with GPT-2](https://huggingface.co/gpt2?text=A+long+time+ago%2C+)
@@ -48,7 +51,7 @@ Here are a few examples:

 ## Quick tour

-To immediately use a model on a given text, we provide the `pipeline` API. Pipelines group together a pretrained model with the preprocessing that was used during that model training. Here is how to quickly use a pipeline to classify positive versus negative texts 
+To immediately use a model on a given text, we provide the `pipeline` API. Pipelines group together a pretrained model with the preprocessing that was used during that model training. Here is how to quickly use a pipeline to classify positive versus negative texts

 ```python
 >>> from transformers import pipeline
@@ -59,7 +62,7 @@ To immediately use a model on a given text, we provide the `pipeline` API. Pipel
 [{'label': 'POSITIVE', 'score': 0.9978193640708923}]
 ```

-The second line of code downloads and caches the pretrained model used by the pipeline, the third line evaluates it on the given text. Here the answer is "positive" with a confidence of 99.8%. 
+The second line of code downloads and caches the pretrained model used by the pipeline, the third line evaluates it on the given text. Here the answer is "positive" with a confidence of 99.8%.

 This is another example of pipeline used for that can extract question answers from some context:

@@ -78,7 +81,7 @@ This is another example of pipeline used for that can extract question answers f

 On top of the answer, the pretrained model used here returned its confidence score, along with the start position and its end position in the tokenized sentence. You can learn more about the tasks supported by the `pipeline` API in [this tutorial](https://huggingface.co/transformers/task_summary.html).

-To download and use any of the pretrained models on your given task, you just need to use those three lines of codes (PyTorch verison):
+To download and use any of the pretrained models on your given task, you just need to use those three lines of codes (PyTorch version):
 ```python
 >>> from transformers import AutoTokenizer, AutoModel

@@ -108,7 +111,7 @@ The model itself is a regular [Pytorch `nn.Module`](https://pytorch.org/docs/sta
 1. Easy-to-use state-of-the-art models:
    - High performance on NLU and NLG tasks.
    - Low barrier to entry for educators and practitioners.
-    - Few user-facing abastractions with just three classes to learn.
+    - Few user-facing abstractions with just three classes to learn.
    - A unified API for using all our pretrained models.

 1. Lower compute costs, smaller carbon footprint:
@@ -124,7 +127,7 @@ The model itself is a regular [Pytorch `nn.Module`](https://pytorch.org/docs/sta
 1. Easily customize a model or an example to your needs:
    - Examples for each architecture to reproduce the results by the official authors of said architecture.
    - Expose the models internal as consistently as possible.
-    - Model files can be used independently of the library for quick experiments. 
+    - Model files can be used independently of the library for quick experiments.

 ## Why shouldn't I use transformers?

@@ -155,37 +158,43 @@ If you'd like to play with the examples, you must [install the library from sour

 🤗 Transformers currently provides the following architectures (see [here](https://huggingface.co/transformers/model_summary.html) for a high-level summary of each them):

+1. **[ALBERT](https://huggingface.co/transformers/model_doc/albert.html)** (from Google Research and the Toyota Technological Institute at Chicago) released with the paper [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942), by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.
+1. **[BART](https://huggingface.co/transformers/model_doc/bart.html)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/pdf/1910.13461.pdf) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
 1. **[BERT](https://huggingface.co/transformers/model_doc/bert.html)** (from Google) released with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
-2. **[GPT](https://huggingface.co/transformers/model_doc/gpt.html)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
-3. **[GPT-2](https://huggingface.co/transformers/model_doc/gpt2.html)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
-4. **[Transformer-XL](https://huggingface.co/transformers/model_doc/transformerxl.html)** (from Google/CMU) released with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
-5. **[XLNet](https://huggingface.co/transformers/model_doc/xlnet.html)** (from Google/CMU) released with the paper [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
-6. **[XLM](https://huggingface.co/transformers/model_doc/xlm.html)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
-7. **[RoBERTa](https://huggingface.co/transformers/model_doc/roberta.html)** (from Facebook), released together with the paper a [Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
-8. **[DistilBERT](https://huggingface.co/transformers/model_doc/distilbert.html)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/master/examples/distillation), RoBERTa into [DistilRoBERTa](https://github.com/huggingface/transformers/tree/master/examples/distillation), Multilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/master/examples/distillation) and a German version of DistilBERT.
-9. **[CTRL](https://huggingface.co/transformers/model_doc/ctrl.html)** (from Salesforce) released with the paper [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher.
-10. **[CamemBERT](https://huggingface.co/transformers/model_doc/camembert.html)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
-11. **[ALBERT](https://huggingface.co/transformers/model_doc/albert.html)** (from Google Research and the Toyota Technological Institute at Chicago) released with the paper [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942), by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.
-12. **[T5](https://huggingface.co/transformers/model_doc/t5.html)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
-13. **[XLM-RoBERTa](https://huggingface.co/transformers/model_doc/xlmroberta.html)** (from Facebook AI), released together with the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov.
-14. **[MMBT](https://github.com/facebookresearch/mmbt/)** (from Facebook), released together with the paper a [Supervised Multimodal Bitransformers for Classifying Images and Text](https://arxiv.org/pdf/1909.02950.pdf) by Douwe Kiela, Suvrat Bhooshan, Hamed Firooz, Davide Testuggine.
-15. **[FlauBERT](https://huggingface.co/transformers/model_doc/flaubert.html)** (from CNRS) released with the paper [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab.
-16. **[BART](https://huggingface.co/transformers/model_doc/bart.html)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/pdf/1910.13461.pdf) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
-17. **[ELECTRA](https://huggingface.co/transformers/model_doc/electra.html)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
-18. **[DialoGPT](https://huggingface.co/transformers/model_doc/dialogpt.html)** (from Microsoft Research) released with the paper [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://arxiv.org/abs/1911.00536) by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
-19. **[Reformer](https://huggingface.co/transformers/model_doc/reformer.html)** (from Google Research) released with the paper [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
-20. **[MarianMT](https://huggingface.co/transformers/model_doc/marian.html)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team.
-21. **[Longformer](https://huggingface.co/transformers/model_doc/longformer.html)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan.
-22. **[DPR](https://github.com/facebookresearch/DPR)** (from Facebook) released with the paper [Dense Passage Retrieval
+1. **[BERT For Sequence Generation](https://huggingface.co/transformers/model_doc/bertgeneration.html)** (from Google) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
+1. **[Blenderbot](https://huggingface.co/transformers/model_doc/blenderbot.html)** (from Facebook) released with the paper [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
+1. **[CamemBERT](https://huggingface.co/transformers/model_doc/camembert.html)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
+1. **[CTRL](https://huggingface.co/transformers/model_doc/ctrl.html)** (from Salesforce) released with the paper [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher.
+1. **[DeBERTa](https://huggingface.co/transformers/model_doc/deberta.html)** (from Microsoft Research) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
+1. **[DialoGPT](https://huggingface.co/transformers/model_doc/dialogpt.html)** (from Microsoft Research) released with the paper [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://arxiv.org/abs/1911.00536) by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
+1. **[DistilBERT](https://huggingface.co/transformers/model_doc/distilbert.html)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/master/examples/distillation), RoBERTa into [DistilRoBERTa](https://github.com/huggingface/transformers/tree/master/examples/distillation), Multilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/master/examples/distillation) and a German version of DistilBERT.
+1. **[DPR](https://huggingface.co/transformers/model_doc/dpr.html)** (from Facebook) released with the paper [Dense Passage Retrieval
 for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906) by Vladimir Karpukhin, Barlas Oğuz, Sewon
 Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
-23. **[Pegasus](https://github.com/google-research/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777)> by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
-24. **[MBart](https://github.com/pytorch/fairseq/tree/master/examples/mbart)** (from Facebook) released with the paper  [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.  
-25. **[LXMERT](https://github.com/airsplay/lxmert)** (from UNC Chapel Hill) released with the paper [LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering](https://arxiv.org/abs/1908.07490) by Hao Tan and Mohit Bansal.
-26. **[Funnel Transformer](https://github.com/laiguokun/Funnel-Transformer)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
-27. **[LayoutLM](https://github.com/microsoft/unilm/tree/master/layoutlm)** (from Microsoft Research Asia) released with the paper [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
-28. **[Other community models](https://huggingface.co/models)**, contributed by the [community](https://huggingface.co/users).
-29. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedbacks before starting your PR.
+1. **[ELECTRA](https://huggingface.co/transformers/model_doc/electra.html)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
+1. **[FlauBERT](https://huggingface.co/transformers/model_doc/flaubert.html)** (from CNRS) released with the paper [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab.
+1. **[Funnel Transformer](https://huggingface.co/transformers/model_doc/funnel.html)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
+1. **[GPT](https://huggingface.co/transformers/model_doc/gpt.html)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
+1. **[GPT-2](https://huggingface.co/transformers/model_doc/gpt2.html)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
+1. **[LayoutLM](https://huggingface.co/transformers/model_doc/layoutlm.html)** (from Microsoft Research Asia) released with the paper [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
+1. **[Longformer](https://huggingface.co/transformers/model_doc/longformer.html)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan.
+1. **[LXMERT](https://huggingface.co/transformers/model_doc/lxmert.html)** (from UNC Chapel Hill) released with the paper [LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering](https://arxiv.org/abs/1908.07490) by Hao Tan and Mohit Bansal.
+1. **[MarianMT](https://huggingface.co/transformers/model_doc/marian.html)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team.
+1. **[MBart](https://huggingface.co/transformers/model_doc/mbart.html)** (from Facebook) released with the paper [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
+1. **[Pegasus](https://huggingface.co/transformers/model_doc/pegasus.html)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
+1. **[ProphetNet](https://huggingface.co/transformers/model_doc/prophetnet.html)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
+1. **[Reformer](https://huggingface.co/transformers/model_doc/reformer.html)** (from Google Research) released with the paper [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
+1. **[RoBERTa](https://huggingface.co/transformers/model_doc/roberta.html)** (from Facebook), released together with the paper [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
+1. **[SqueezeBert](https://huggingface.co/transformers/model_doc/squeezebert.html)** released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
+1. **[T5](https://huggingface.co/transformers/model_doc/t5.html)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
+1. **[Transformer-XL](https://huggingface.co/transformers/model_doc/transformerxl.html)** (from Google/CMU) released with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
+1. **[XLM](https://huggingface.co/transformers/model_doc/xlm.html)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
+1. **[XLM-ProphetNet](https://huggingface.co/transformers/model_doc/xlmprophetnet.html)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
+1. **[XLM-RoBERTa](https://huggingface.co/transformers/model_doc/xlmroberta.html)** (from Facebook AI), released together with the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov.
+1. **[XLNet](https://huggingface.co/transformers/model_doc/xlnet.html)** (from Google/CMU) released with the paper [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
+1. **[Other community models](https://huggingface.co/models)**, contributed by the [community](https://huggingface.co/users).
+1. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedback before starting your PR.

 These implementations have been tested on several datasets (see the example scripts) and should match the performances of the original implementations. You can find more details on the performances in the Examples section of the [documentation](https://huggingface.co/transformers/examples.html).

--- a/codecov.yml
+++ b/codecov.yml
@@ -4,7 +4,4 @@ coverage:
      default:
        informational: true
    patch: off
-comment:
-  require_changes: true    # only comment if there was change in coverage
-  require_head: yes        # don't report if there is no head coverage report
-  require_base: yes        # don't report if there is no base coverage report
+comment: false
--- a/docker/transformers-gpu/Dockerfile
+++ b/docker/transformers-gpu/Dockerfile
@@ -1,4 +1,4 @@
-FROM nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
+FROM nvidia/cuda:10.2-cudnn7-devel-ubuntu18.04
 LABEL maintainer="Hugging Face"
 LABEL repository="transformers"

@@ -18,9 +18,14 @@ RUN python3 -m pip install --no-cache-dir --upgrade pip && \
    tensorflow \
    torch

+RUN git clone https://github.com/NVIDIA/apex
+RUN cd apex && \
+    python3 setup.py install && \
+    pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
+
 WORKDIR /workspace
 COPY . transformers/
 RUN cd transformers/ && \
    python3 -m pip install --no-cache-dir .

-CMD ["/bin/bash"]
+CMD ["/bin/bash"]
--- a/docker/transformers-pytorch-gpu/Dockerfile
+++ b/docker/transformers-pytorch-gpu/Dockerfile
@@ -1,4 +1,4 @@
-FROM nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
+FROM nvidia/cuda:10.2-cudnn7-devel-ubuntu18.04
 LABEL maintainer="Hugging Face"
 LABEL repository="transformers"

@@ -17,9 +17,14 @@ RUN python3 -m pip install --no-cache-dir --upgrade pip && \
    mkl \
    torch

+RUN git clone https://github.com/NVIDIA/apex
+RUN cd apex && \
+    python3 setup.py install && \
+    pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
+
 WORKDIR /workspace
 COPY . transformers/
 RUN cd transformers/ && \
    python3 -m pip install --no-cache-dir .

-CMD ["/bin/bash"]
+CMD ["/bin/bash"]
--- a/docs/source/_static/js/custom.js
+++ b/docs/source/_static/js/custom.js
@@ -1,10 +1,11 @@
 // These two things need to be updated at each release for the version selector.
 // Last stable version
-const stableVersion = "v3.2.0"
+const stableVersion = "v3.3.0"
 // Dictionary doc folder to label
 const versionMapping = {
    "master": "master",
-    "": "v3.2.0",
+    "": "v3.3.0/v3.3.1",
+    "v3.2.0": "v3.2.0",
    "v3.1.0": "v3.1.0 (stable)",
    "v3.0.2": "v3.0.0/v3.0.1/v3.0.2",
    "v2.11.0": "v2.11.0",
@@ -235,9 +236,11 @@ function platformToggle() {

    const createFrameworkButtons = sample => {
            const pytorchButton = document.createElement("button");
+            pytorchButton.classList.add('pytorch-button')
            pytorchButton.innerText = "PyTorch";

            const tensorflowButton = document.createElement("button");
+            tensorflowButton.classList.add('tensorflow-button')
            tensorflowButton.innerText = "TensorFlow";

            const selectorDiv = document.createElement("div");
@@ -252,22 +255,36 @@ function platformToggle() {
            tensorflowButton.classList.remove("selected");

            pytorchButton.addEventListener("click", () => {
-                sample.element.innerHTML = sample.pytorchSample;
-                pytorchButton.classList.add("selected");
-                tensorflowButton.classList.remove("selected");
+                for(const codeBlock of updatedCodeBlocks){
+                    codeBlock.element.innerHTML = codeBlock.pytorchSample;
+                }
+                Array.from(document.getElementsByClassName('pytorch-button')).forEach(button => {
+                    button.classList.add("selected");
+                })
+                Array.from(document.getElementsByClassName('tensorflow-button')).forEach(button => {
+                    button.classList.remove("selected");
+                })
            });
            tensorflowButton.addEventListener("click", () => {
-               sample.element.innerHTML = sample.tensorflowSample;
-                tensorflowButton.classList.add("selected");
-                pytorchButton.classList.remove("selected");
+                for(const codeBlock of updatedCodeBlocks){
+                    codeBlock.element.innerHTML = codeBlock.tensorflowSample;
+                }
+                Array.from(document.getElementsByClassName('tensorflow-button')).forEach(button => {
+                    button.classList.add("selected");
+                })
+                Array.from(document.getElementsByClassName('pytorch-button')).forEach(button => {
+                    button.classList.remove("selected");
+                })
            });
        };

-    codeBlocks
+    const updatedCodeBlocks = codeBlocks
        .map(element => {return {element: element.firstChild, innerText: element.innerText}})
        .filter(codeBlock => codeBlock.innerText.includes(pytorchIdentifier) && codeBlock.innerText.includes(tensorflowIdentifier))
        .map(getFrameworkSpans)
-        .forEach(createFrameworkButtons);
+
+    updatedCodeBlocks
+        .forEach(createFrameworkButtons)
 }


--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -26,7 +26,7 @@ author = u'huggingface'
 # The short X.Y version
 version = u''
 # The full version, including alpha/beta/rc tags
-release = u'3.3.0'
+release = u'3.4.0'


 # -- General configuration ---------------------------------------------------
--- a/docs/source/glossary.rst
+++ b/docs/source/glossary.rst
@@ -218,6 +218,52 @@ positional embeddings.
 Absolute positional embeddings are selected in the range ``[0, config.max_position_embeddings - 1]``. Some models
 use other types of positional embeddings, such as sinusoidal position embeddings or relative position embeddings.

+.. _labels:
+
+Labels
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The labels are an optional argument which can be passed in order for the model to compute the loss itself. These labels
+should be the expected prediction of the model: it will use the standard loss in order to compute the loss between
+its predictions and the expected value (the label).
+
+These labels are different according to the model head, for example:
+
+- For sequence classification models (e.g., :class:`~transformers.BertForSequenceClassification`), the model expects
+  a tensor of dimension :obj:`(batch_size)` with each value of the batch corresponding to the expected label of the
+  entire sequence.
+- For token classification models (e.g., :class:`~transformers.BertForTokenClassification`), the model expects
+  a tensor of dimension :obj:`(batch_size, seq_length)` with each value corresponding to the expected label of each
+  individual token.
+- For masked language modeling (e.g., :class:`~transformers.BertForMaskedLM`), the model expects
+  a tensor of dimension :obj:`(batch_size, seq_length)` with each value corresponding to the expected label of each
+  individual token: the labels being the token ID for the masked token, and values to be ignored for the rest (usually
+  -100).
+- For sequence to sequence tasks (e.g., :class:`~transformers.BartForConditionalGeneration`,
+  :class:`~transformers.MBartForConditionalGeneration`), the model expects a tensor of dimension
+  :obj:`(batch_size, tgt_seq_length)` with each value corresponding to the target sequences associated with each
+  input sequence. During training, both `BART` and `T5` will make the appropriate `decoder_input_ids` and decoder
+  attention masks internally. They usually do not need to be supplied. This does not apply to models leveraging the
+  Encoder-Decoder framework.
+  See the documentation of each model for more information on each specific model's labels.
+
+The base models (e.g., :class:`~transformers.BertModel`) do not accept labels, as these are the base transformer models,
+simply outputting features.
+
+.. _decoder-input-ids:
+
+Decoder input IDs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This input is specific to encoder-decoder models, and contains the input IDs that will be fed to the decoder.
+These inputs should be used for sequence to sequence tasks, such as translation or summarization, and are usually
+built in a way specific to each model.
+
+Most encoder-decoder models (BART, T5) create their :obj:`decoder_input_ids` on their own from the :obj:`labels`.
+In such models, passing the :obj:`labels` is the preferred way to handle training.
+
+Please check each model's docs to see how they handle these input IDs for sequence to sequence training.
+
 .. _feed-forward-chunking:

 Feed Forward Chunking
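
Note on the "Labels" and "Decoder input IDs" glossary entries added in the hunk above: the two ideas they describe (label positions set to -100 are skipped when computing the loss, and decoder inputs can be derived from the labels) can be illustrated with a small standalone sketch. This is plain Python and does not use the library; the function names `masked_cross_entropy` and `shift_right`, and the parameters `decoder_start_token_id` and `pad_token_id` as used here, are hypothetical choices for illustration. The shift shown is one simplified scheme, and each model may build its decoder inputs differently, so the model documentation remains the reference.

```python
import math

# The glossary says ignored positions are "usually" marked with -100.
IGNORE_INDEX = -100


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over the positions whose label is not ignore_index.

    logits: list of per-token score lists, shape (seq_length, vocab_size)
    labels: list of ints, shape (seq_length,)
    """
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # this position does not contribute to the loss
        # Numerically stable log-sum-exp for the softmax normaliser.
        top = max(scores)
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_z - scores[label]
        count += 1
    return total / count if count else 0.0


def shift_right(labels, decoder_start_token_id, pad_token_id,
                ignore_index=IGNORE_INDEX):
    """Derive decoder inputs from labels (assumed, simplified scheme).

    Prepends a start token, drops the last label, and replaces any
    ignored label with the padding token so every input is a valid ID.
    """
    shifted = [decoder_start_token_id] + labels[:-1]
    return [pad_token_id if tok == ignore_index else tok for tok in shifted]


if __name__ == "__main__":
    # Three positions, vocabulary of four tokens, uniform scores.
    logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
    labels = [2, IGNORE_INDEX, 1]  # the middle position is ignored
    print(masked_cross_entropy(logits, labels))
    print(shift_right([5, 6, 7, IGNORE_INDEX], decoder_start_token_id=0, pad_token_id=1))
```

With uniform scores over four tokens, each counted position contributes log(4) to the loss, so the ignored position changes neither the sum nor the average.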
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -54,97 +54,113 @@ The documentation is organized in five parts:
 The library currently contains PyTorch and Tensorflow implementations, pre-trained model weights, usage scripts and
 conversion utilities for the following models:

-1. `BERT <https://github.com/google-research/bert>`_ (from Google) released with the paper `BERT: Pre-training of Deep
-   Bidirectional Transformers for Language Understanding <https://arxiv.org/abs/1810.04805>`_ by Jacob Devlin, Ming-Wei
-   Chang, Kenton Lee, and Kristina Toutanova.
-2. `GPT <https://github.com/openai/finetune-transformer-lm>`_ (from OpenAI) released with the paper `Improving Language
-   Understanding by Generative Pre-Training <https://blog.openai.com/language-unsupervised>`_ by Alec Radford, Karthik
-   Narasimhan, Tim Salimans, and Ilya Sutskever.
-3. `GPT-2 <https://blog.openai.com/better-language-models>`_ (from OpenAI) released with the paper `Language Models are
-   Unsupervised Multitask Learners <https://blog.openai.com/better-language-models>`_ by Alec Radford, Jeffrey Wu,
-   Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever.
-4. `Transformer-XL <https://github.com/kimiyoung/transformer-xl>`_ (from Google/CMU) released with the paper
-   `Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context <https://arxiv.org/abs/1901.02860>`_ by
-   Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov.
-5. `XLNet <https://github.com/zihangdai/xlnet>`_ (from Google/CMU) released with the paper `XLNet: Generalized
-   Autoregressive Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`_ by Zhilin Yang, Zihang
-   Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le.
-6. `XLM <https://github.com/facebookresearch/XLM>`_ (from Facebook) released together with the paper `Cross-lingual
-   Language Model Pretraining <https://arxiv.org/abs/1901.07291>`_ by Guillaume Lample and Alexis Conneau.
-7. `RoBERTa <https://github.com/pytorch/fairseq/tree/master/examples/roberta>`_ (from Facebook), released together with
-   the paper a `Robustly Optimized BERT Pretraining Approach <https://arxiv.org/abs/1907.11692>`_ by Yinhan Liu, Myle
-   Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin
-   Stoyanov.
-8. `DistilBERT <https://huggingface.co/transformers/model_doc/distilbert.html>`_ (from HuggingFace) released together
-   with the paper `DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
-   <https://arxiv.org/abs/1910.01108>`_ by Victor Sanh, Lysandre Debut, and Thomas Wolf. The same method has been
-   applied to compress GPT2 into
-   `DistilGPT2 <https://github.com/huggingface/transformers/tree/master/examples/distillation>`_.
-9. `CTRL <https://github.com/pytorch/fairseq/tree/master/examples/ctrl>`_ (from Salesforce), released together with the
-   paper `CTRL: A Conditional Transformer Language Model for Controllable Generation
-   <https://www.github.com/salesforce/ctrl>`_ by Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong,
-   and Richard Socher.
-10. `CamemBERT <https://huggingface.co/transformers/model_doc/camembert.html>`_ (from FAIR, Inria, Sorbonne Université)
-    released together with the paper `CamemBERT: a Tasty French Language Model <https://arxiv.org/abs/1911.03894>`_ by
-    Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suarez, Yoann Dupont, Laurent Romary, Eric Villemonte de la
-    Clergerie, Djame Seddah, and Benoît Sagot.
-11. `ALBERT <https://github.com/google-research/ALBERT>`_ (from Google Research), released together with the paper
-    `ALBERT: A Lite BERT for Self-supervised Learning of Language Representations <https://arxiv.org/abs/1909.11942>`_
-    by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut.
-12. `T5 <https://github.com/google-research/text-to-text-transfer-transformer>`_ (from Google) released with the paper
-    `Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
-    <https://arxiv.org/abs/1910.10683>`_ by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang,
-    Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu.
-13. `XLM-RoBERTa <https://github.com/pytorch/fairseq/tree/master/examples/xlmr>`_ (from Facebook AI), released together
-    with the paper `Unsupervised Cross-lingual Representation Learning at Scale <https://arxiv.org/abs/1911.02116>`_ by
-    Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard
-    Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov.
-14. `MMBT <https://github.com/facebookresearch/mmbt/>`_ (from Facebook), released together with the paper a `Supervised
-    Multimodal Bitransformers for Classifying Images and Text <https://arxiv.org/pdf/1909.02950.pdf>`_ by Douwe Kiela,
-    Suvrat Bhooshan, Hamed Firooz, and Davide Testuggine.
-15. `FlauBERT <https://github.com/getalp/Flaubert>`_ (from CNRS) released with the paper `FlauBERT: Unsupervised
-    Language Model Pre-training for French <https://arxiv.org/abs/1912.05372>`_ by Hang Le, Loïc Vial, Jibril Frej,
-    Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, and
-    Didier Schwab.
-16. `BART <https://github.com/pytorch/fairseq/tree/master/examples/bart>`_ (from Facebook) released with the paper
-    `BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
-    <https://arxiv.org/pdf/1910.13461.pdf>`_ by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman
-    Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer.
-17. `ELECTRA <https://github.com/google-research/electra>`_ (from Google Research/Stanford University) released with
-    the paper `ELECTRA: Pre-training text encoders as discriminators rather than generators
-    <https://arxiv.org/abs/2003.10555>`_ by Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning.
-18. `DialoGPT <https://github.com/microsoft/DialoGPT>`_ (from Microsoft Research) released with the paper `DialoGPT:
-    Large-Scale Generative Pre-training for Conversational Response Generation <https://arxiv.org/abs/1911.00536>`_ by
-    Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu,
-    and Bill Dolan.
-19. `Reformer <https://github.com/google/trax/tree/master/trax/models/reformer>`_ (from Google Research) released with
-    the paper `Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451>`_ by Nikita Kitaev, Łukasz
-    Kaiser, and Anselm Levskaya.
-20. `MarianMT <https://marian-nmt.github.io/>`_ (developed by the Microsoft Translator Team) machine translation models
-    trained using `OPUS <http://opus.nlpl.eu/>`_ pretrained_models data by Jörg Tiedemann.
-21. `Longformer <https://github.com/allenai/longformer>`_ (from AllenAI) released with the paper `Longformer: The
-    Long-Document Transformer <https://arxiv.org/abs/2004.05150>`_ by Iz Beltagy, Matthew E. Peters, and Arman Cohan.
-22. `DPR <https://github.com/facebookresearch/DPR>`_ (from Facebook) released with the paper `Dense Passage Retrieval
-    for Open-Domain Question Answering <https://arxiv.org/abs/2004.04906>`_ by Vladimir Karpukhin, Barlas Oğuz, Sewon
-    Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
-23. `Pegasus <https://github.com/google-research/pegasus>`_ (from Google) released with the paper `PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization
-    <https://arxiv.org/abs/1912.08777>`_ by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
-24. `MBart <https://github.com/pytorch/fairseq/tree/master/examples/mbart>`_ (from Facebook) released with the paper  `Multilingual Denoising Pre-training for Neural Machine Translation <https://arxiv.org/abs/2001.08210>`_ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov,
-    Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
-25. `LXMERT <https://github.com/airsplay/lxmert>`_ (from UNC Chapel Hill) released with the paper `LXMERT: Learning
-    Cross-Modality Encoder Representations from Transformers for Open-Domain Question
-    Answering <https://arxiv.org/abs/1908.07490>`_ by Hao Tan and Mohit Bansal.
-26. `Funnel Transformer <https://github.com/laiguokun/Funnel-Transformer>`_ (from CMU/Google Brain) released with the paper
-    `Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing
-    <https://arxiv.org/abs/2006.03236>`_ by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
-27. `Bert For Sequence Generation <https://tfhub.dev/s?module-type=text-generation&subtype=module,placeholder>`_ (from Google) released with the paper
-    `Leveraging Pre-trained Checkpoints for Sequence Generation Tasks
-    <https://arxiv.org/abs/1907.12461>`_ by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
-28. `LayoutLM <https://github.com/microsoft/unilm/tree/master/layoutlm>`_ (from Microsoft Research Asia) released with the paper
-    `LayoutLM: Pre-training of Text and Layout for Document Image Understanding
-    <https://arxiv.org/abs/1912.13318>`_ by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
-29. `Other community models <https://huggingface.co/models>`_, contributed by the `community
-    <https://huggingface.co/users>`_.
+..
+    This list is updated automatically from the README with `make fix-copies`. Do not update manually!
+
+1. :doc:`ALBERT <model_doc/albert>` (from Google Research and the Toyota Technological Institute at Chicago) released
+   with the paper `ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
+   <https://arxiv.org/abs/1909.11942>`__, by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush
+   Sharma, Radu Soricut.
+2. :doc:`BART <model_doc/bart>` (from Facebook) released with the paper `BART: Denoising Sequence-to-Sequence
+   Pre-training for Natural Language Generation, Translation, and Comprehension
+   <https://arxiv.org/pdf/1910.13461.pdf>`__ by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman
+   Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
+3. :doc:`BERT <model_doc/bert>` (from Google) released with the paper `BERT: Pre-training of Deep Bidirectional
+   Transformers for Language Understanding <https://arxiv.org/abs/1810.04805>`__ by Jacob Devlin, Ming-Wei Chang,
+   Kenton Lee and Kristina Toutanova.
+4. :doc:`BERT For Sequence Generation <model_doc/bertgeneration>` (from Google) released with the paper `Leveraging
+   Pre-trained Checkpoints for Sequence Generation Tasks <https://arxiv.org/abs/1907.12461>`__ by Sascha Rothe, Shashi
+   Narayan, Aliaksei Severyn.
+5. :doc:`Blenderbot <model_doc/blenderbot>` (from Facebook) released with the paper `Recipes for building an
+   open-domain chatbot <https://arxiv.org/abs/2004.13637>`__ by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary
+   Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.
+6. :doc:`CamemBERT <model_doc/camembert>` (from Inria/Facebook/Sorbonne) released with the paper `CamemBERT: a Tasty
+   French Language Model <https://arxiv.org/abs/1911.03894>`__ by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz
+   Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
+7. :doc:`CTRL <model_doc/ctrl>` (from Salesforce) released with the paper `CTRL: A Conditional Transformer Language
+   Model for Controllable Generation <https://arxiv.org/abs/1909.05858>`__ by Nitish Shirish Keskar*, Bryan McCann*,
+   Lav R. Varshney, Caiming Xiong and Richard Socher.
+8. :doc:`DeBERTa <model_doc/deberta>` (from Microsoft Research) released with the paper `DeBERTa: Decoding-enhanced
+   BERT with Disentangled Attention <https://arxiv.org/abs/2006.03654>`__ by Pengcheng He, Xiaodong Liu, Jianfeng Gao,
+   Weizhu Chen.
+9. :doc:`DialoGPT <model_doc/dialogpt>` (from Microsoft Research) released with the paper `DialoGPT: Large-Scale
+   Generative Pre-training for Conversational Response Generation <https://arxiv.org/abs/1911.00536>`__ by Yizhe Zhang,
+   Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
+10. :doc:`DistilBERT <model_doc/distilbert>` (from HuggingFace), released together with the paper `DistilBERT, a
+    distilled version of BERT: smaller, faster, cheaper and lighter <https://arxiv.org/abs/1910.01108>`__ by Victor
+    Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into `DistilGPT2
+    <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__, RoBERTa into `DistilRoBERTa
+    <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__, Multilingual BERT into
+    `DistilmBERT <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__ and a German
+    version of DistilBERT.
+11. :doc:`DPR <model_doc/dpr>` (from Facebook) released with the paper `Dense Passage Retrieval for Open-Domain
+    Question Answering <https://arxiv.org/abs/2004.04906>`__ by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick
+    Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
+12. :doc:`ELECTRA <model_doc/electra>` (from Google Research/Stanford University) released with the paper `ELECTRA:
+    Pre-training text encoders as discriminators rather than generators <https://arxiv.org/abs/2003.10555>`__ by Kevin
+    Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
+13. :doc:`FlauBERT <model_doc/flaubert>` (from CNRS) released with the paper `FlauBERT: Unsupervised Language Model
+    Pre-training for French <https://arxiv.org/abs/1912.05372>`__ by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne,
+    Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab.
+14. :doc:`Funnel Transformer <model_doc/funnel>` (from CMU/Google Brain) released with the paper `Funnel-Transformer:
+    Filtering out Sequential Redundancy for Efficient Language Processing <https://arxiv.org/abs/2006.03236>`__ by
+    Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
+15. :doc:`GPT <model_doc/gpt>` (from OpenAI) released with the paper `Improving Language Understanding by Generative
+    Pre-Training <https://blog.openai.com/language-unsupervised/>`__ by Alec Radford, Karthik Narasimhan, Tim Salimans
+    and Ilya Sutskever.
+16. :doc:`GPT-2 <model_doc/gpt2>` (from OpenAI) released with the paper `Language Models are Unsupervised Multitask
+    Learners <https://blog.openai.com/better-language-models/>`__ by Alec Radford*, Jeffrey Wu*, Rewon Child, David
+    Luan, Dario Amodei** and Ilya Sutskever**.
+17. :doc:`LayoutLM <model_doc/layoutlm>` (from Microsoft Research Asia) released with the paper `LayoutLM: Pre-training
+    of Text and Layout for Document Image Understanding <https://arxiv.org/abs/1912.13318>`__ by Yiheng Xu, Minghao Li,
+    Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
+18. :doc:`Longformer <model_doc/longformer>` (from AllenAI) released with the paper `Longformer: The Long-Document
+    Transformer <https://arxiv.org/abs/2004.05150>`__ by Iz Beltagy, Matthew E. Peters, Arman Cohan.
+19. :doc:`LXMERT <model_doc/lxmert>` (from UNC Chapel Hill) released with the paper `LXMERT: Learning Cross-Modality
+    Encoder Representations from Transformers for Open-Domain Question Answering <https://arxiv.org/abs/1908.07490>`__
+    by Hao Tan and Mohit Bansal.
+20. :doc:`MarianMT <model_doc/marian>` Machine translation models trained using `OPUS <http://opus.nlpl.eu/>`__ data by
+    Jörg Tiedemann. The `Marian Framework <https://marian-nmt.github.io/>`__ is being developed by the Microsoft
+    Translator Team.
+21. :doc:`MBart <model_doc/mbart>` (from Facebook) released with the paper `Multilingual Denoising Pre-training for
+    Neural Machine Translation <https://arxiv.org/abs/2001.08210>`__ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li,
+    Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
+22. :doc:`Pegasus <model_doc/pegasus>` (from Google) released with the paper `PEGASUS: Pre-training with Extracted
+    Gap-sentences for Abstractive Summarization <https://arxiv.org/abs/1912.08777>`__ by Jingqing Zhang, Yao Zhao,
+    Mohammad Saleh and Peter J. Liu.
+23. :doc:`ProphetNet <model_doc/prophetnet>` (from Microsoft Research) released with the paper `ProphetNet: Predicting
+    Future N-gram for Sequence-to-Sequence Pre-training <https://arxiv.org/abs/2001.04063>`__ by Yu Yan, Weizhen Qi,
+    Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
+24. :doc:`Reformer <model_doc/reformer>` (from Google Research) released with the paper `Reformer: The Efficient
+    Transformer <https://arxiv.org/abs/2001.04451>`__ by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
+25. :doc:`RoBERTa <model_doc/roberta>` (from Facebook), released together with the paper `RoBERTa: A Robustly Optimized
+    BERT Pretraining Approach <https://arxiv.org/abs/1907.11692>`__ by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du,
+    Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
+26. :doc:`SqueezeBert <model_doc/squeezebert>` released with the paper `SqueezeBERT: What can computer vision teach NLP
+    about efficient neural networks? <https://arxiv.org/abs/2006.11316>`__ by Forrest N. Iandola, Albert E. Shaw, Ravi
+    Krishna, and Kurt W. Keutzer.
+27. :doc:`T5 <model_doc/t5>` (from Google AI) released with the paper `Exploring the Limits of Transfer Learning with a
+    Unified Text-to-Text Transformer <https://arxiv.org/abs/1910.10683>`__ by Colin Raffel and Noam Shazeer and Adam
+    Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
+28. :doc:`Transformer-XL <model_doc/transformerxl>` (from Google/CMU) released with the paper `Transformer-XL:
+    Attentive Language Models Beyond a Fixed-Length Context <https://arxiv.org/abs/1901.02860>`__ by Zihang Dai*,
+    Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
+29. :doc:`XLM <model_doc/xlm>` (from Facebook) released together with the paper `Cross-lingual Language Model
+    Pretraining <https://arxiv.org/abs/1901.07291>`__ by Guillaume Lample and Alexis Conneau.
+30. :doc:`XLM-ProphetNet <model_doc/xlmprophetnet>` (from Microsoft Research) released with the paper `ProphetNet:
+    Predicting Future N-gram for Sequence-to-Sequence Pre-training <https://arxiv.org/abs/2001.04063>`__ by Yu Yan,
+    Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
+31. :doc:`XLM-RoBERTa <model_doc/xlmroberta>` (from Facebook AI), released together with the paper `Unsupervised
+    Cross-lingual Representation Learning at Scale <https://arxiv.org/abs/1911.02116>`__ by Alexis Conneau*, Kartikay
+    Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke
+    Zettlemoyer and Veselin Stoyanov.
+32. :doc:`XLNet <model_doc/xlnet>` (from Google/CMU) released with the paper `XLNet: Generalized Autoregressive
+    Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`__ by Zhilin Yang*, Zihang Dai*, Yiming
+    Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
+33. `Other community models <https://huggingface.co/models>`__, contributed by the `community
+    <https://huggingface.co/users>`__.

 .. toctree::
    :maxdepth: 2
@@ -193,6 +209,7 @@ conversion utilities for the following models:
    :maxdepth: 2
    :caption: Main Classes

+    main_classes/callback
    main_classes/configuration
    main_classes/logging
    main_classes/model
@@ -212,8 +229,10 @@ conversion utilities for the following models:
    model_doc/bart
    model_doc/bert
    model_doc/bertgeneration
+    model_doc/blenderbot
    model_doc/camembert
    model_doc/ctrl
+    model_doc/deberta
    model_doc/dialogpt
    model_doc/distilbert
    model_doc/dpr
@@ -231,13 +250,16 @@ conversion utilities for the following models:
    model_doc/gpt
    model_doc/gpt2
    model_doc/pegasus
+    model_doc/prophetnet
    model_doc/rag
    model_doc/reformer
    model_doc/retribert
    model_doc/roberta
+    model_doc/squeezebert
    model_doc/t5
    model_doc/transformerxl
    model_doc/xlm
+    model_doc/xlmprophetnet
    model_doc/xlmroberta
    model_doc/xlnet

@@ -248,3 +270,4 @@ conversion utilities for the following models:
    internal/modeling_utils
    internal/pipelines_utils
    internal/tokenization_utils
+    internal/trainer_utils
--- a/docs/source/installation.md
+++ b/docs/source/installation.md
@@ -37,13 +37,13 @@ pip install transformers[tf-cpu]
 To check 🤗 Transformers is properly installed, run the following command:

 ```bash
-python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('I hate you'))"
+python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('we love you'))"
 ```

 It should download a pretrained model then print something like

 ```bash
-[{'label': 'NEGATIVE', 'score': 0.9991129040718079}]
+[{'label': 'POSITIVE', 'score': 0.9998704791069031}]
 ```

 (Note that TensorFlow will print additional stuff before that last statement.)
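
A lighter check that does not download a model is to read the installed version from the package metadata. The following is a minimal sketch, assuming Python 3.8+ (for `importlib.metadata`); the helper name `installed_version` is illustrative and not part of 🤗 Transformers.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("transformers")
    if version is None:
        print("transformers is not installed in this environment")
    else:
        print(f"transformers {version} is installed")
```

This only confirms that the package is present; the pipeline command above remains the end-to-end check that a model can be downloaded and run.
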
--- a/docs/source/internal/trainer_utils.rst
+++ b/docs/source/internal/trainer_utils.rst
@@ -0,0 +1,27 @@
+Utilities for Trainer
+-----------------------------------------------------------------------------------------------------------------------
+
+This page lists all the utility functions used by :class:`~transformers.Trainer`.
+
+Most of those are only useful if you are studying the code of the Trainer in the library.
+
+Utilities
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.EvalPrediction
+
+.. autofunction:: transformers.set_seed
+
+.. autofunction:: transformers.torch_distributed_zero_first
+
+
+Callbacks internals
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.trainer_callback.CallbackHandler
+
+Distributed Evaluation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.trainer_pt_utils.DistributedTensorGatherer
+    :members:
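
Of the utilities listed in this new page, `EvalPrediction` is the one most likely to appear in user code, since it is the object handed to a metrics function. Below is a minimal sketch of such a function. To keep it runnable without the library, it uses a hypothetical stand-in (`EvalPredictionSketch`) that only mirrors the `predictions` and `label_ids` fields; the function name `compute_accuracy` is likewise an assumption.

```python
from typing import Dict, NamedTuple, Sequence


class EvalPredictionSketch(NamedTuple):
    """Hypothetical stand-in mirroring the two fields of transformers.EvalPrediction."""

    predictions: Sequence[Sequence[float]]  # one row of class scores per example
    label_ids: Sequence[int]  # gold class index per example


def compute_accuracy(p: EvalPredictionSketch) -> Dict[str, float]:
    """Pick the highest-scoring class in each row and compare it with the label."""
    predicted = [max(range(len(row)), key=row.__getitem__) for row in p.predictions]
    correct = sum(int(pred == gold) for pred, gold in zip(predicted, p.label_ids))
    return {"accuracy": correct / len(p.label_ids)}


example = EvalPredictionSketch(
    predictions=[[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]],
    label_ids=[1, 0, 0],
)
print(compute_accuracy(example))
```

In real use the predictions and labels would normally be NumPy arrays, and a function of this shape would be passed as the `compute_metrics` argument of the Trainer.
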
--- a/docs/source/main_classes/callback.rst
+++ b/docs/source/main_classes/callback.rst
@@ -0,0 +1,68 @@
+Callbacks
+-----------------------------------------------------------------------------------------------------------------------
+
+Callbacks are objects that can customize the behavior of the training loop in the PyTorch
+:class:`~transformers.Trainer` (this feature is not yet implemented in TensorFlow). They can inspect the training loop
+state (for progress reporting, logging on TensorBoard or other ML platforms...) and take decisions (like early
+stopping).
+
+Callbacks are "read only" pieces of code: apart from the :class:`~transformers.TrainerControl` object they return, they
+cannot change anything in the training loop. For customizations that require changes in the training loop, you should
+subclass :class:`~transformers.Trainer` and override the methods you need (see :doc:`trainer` for examples).
+
+By default a :class:`~transformers.Trainer` will use the following callbacks:
+
+- :class:`~transformers.DefaultFlowCallback` which handles the default behavior for logging, saving and evaluation.
+- :class:`~transformers.PrinterCallback` or :class:`~transformers.ProgressCallback` to display progress and print the
+  logs (the first one is used if you deactivate tqdm through the :class:`~transformers.TrainingArguments`, otherwise
+  it's the second one).
+- :class:`~transformers.integrations.TensorBoardCallback` if tensorboard is accessible (either through PyTorch >= 1.4
+  or tensorboardX).
+- :class:`~transformers.integrations.WandbCallback` if `wandb <https://www.wandb.com/>`__ is installed.
+- :class:`~transformers.integrations.CometCallback` if `comet_ml <https://www.comet.ml/site/>`__ is installed.
+
+The main class that implements callbacks is :class:`~transformers.TrainerCallback`. It gets the
+:class:`~transformers.TrainingArguments` used to instantiate the :class:`~transformers.Trainer`, can access that
+Trainer's internal state via :class:`~transformers.TrainerState`, and can take some actions on the training loop via
+:class:`~transformers.TrainerControl`.
+
+
+Available Callbacks
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Here is the list of the available :class:`~transformers.TrainerCallback` in the library:
+
+.. autoclass:: transformers.integrations.CometCallback
+    :members: setup
+
+.. autoclass:: transformers.DefaultFlowCallback
+
+.. autoclass:: transformers.PrinterCallback
+
+.. autoclass:: transformers.ProgressCallback
+
+.. autoclass:: transformers.integrations.TensorBoardCallback
+
+.. autoclass:: transformers.integrations.WandbCallback
+    :members: setup
+
+
+TrainerCallback
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.TrainerCallback
+    :members:
+
+
+TrainerState
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.TrainerState
+    :members:
+
+
+TrainerControl
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.TrainerControl
+    :members:
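
The "read only" contract described in this new page can be illustrated with a small callback that asks for training to stop once the evaluation loss is low enough. This is a sketch under assumptions: the class name `StopOnLossCallback`, the threshold, and the `eval_loss` metric key are illustrative, and the `SimpleNamespace` object only stands in for `TrainerControl` so the callback can be exercised by hand.

```python
from types import SimpleNamespace

try:
    from transformers import TrainerCallback
except ImportError:
    # Fallback so the sketch still runs where transformers is not installed.
    TrainerCallback = object


class StopOnLossCallback(TrainerCallback):
    """Hypothetical callback: request a stop once the evaluation loss is low enough."""

    def __init__(self, loss_threshold: float):
        self.loss_threshold = loss_threshold

    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        # The callback only reads `metrics`; its single lever is the control object.
        eval_loss = (metrics or {}).get("eval_loss")
        if eval_loss is not None and eval_loss <= self.loss_threshold:
            control.should_training_stop = True
        return control


# Stand-in for TrainerControl, used here only to exercise the callback by hand.
control = SimpleNamespace(should_training_stop=False)
callback = StopOnLossCallback(loss_threshold=0.5)
callback.on_evaluate(args=None, state=None, control=control, metrics={"eval_loss": 0.42})
print(control.should_training_stop)
```

With a real training run, an instance would be handed to the Trainer through its `callbacks` argument rather than called directly.
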
--- a/docs/source/main_classes/trainer.rst
+++ b/docs/source/main_classes/trainer.rst
@@ -15,10 +15,9 @@ Both :class:`~transformers.Trainer` and :class:`~transformers.TFTrainer` contain
 previous features. To inject custom behavior you can subclass them and override the following methods:

 - **get_train_dataloader**/**get_train_tfdataset** -- Creates the training DataLoader (PyTorch) or TF Dataset.
- **get_eval_dataloader**/**get_eval_tfdataset** -- Creates the evaulation DataLoader (PyTorch) or TF Dataset.
+- **get_eval_dataloader**/**get_eval_tfdataset** -- Creates the evaluation DataLoader (PyTorch) or TF Dataset.
 - **get_test_dataloader**/**get_test_tfdataset** -- Creates the test DataLoader (PyTorch) or TF Dataset.
 - **log** -- Logs information on the various objects watching training.
- **setup_wandb** -- Setups wandb (see `here <https://docs.wandb.com/huggingface>`__ for more information).
 - **create_optimizer_and_scheduler** -- Setups the optimizer and learning rate scheduler if they were not passed at
  init.
 - **compute_loss** - Computes the loss on a batch of training inputs.
@@ -40,6 +39,10 @@ Here is an example of how to customize :class:`~transformers.Trainer` using a cu
            logits = outputs[0]
            return my_custom_loss(logits, labels)

+Another way to customize the training loop behavior for the PyTorch :class:`~transformers.Trainer` is to use
+:doc:`callbacks <callback>` that can inspect the training loop state (for progress reporting, logging on TensorBoard or
+other ML platforms...) and take decisions (like early stopping).
+

 Trainer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -47,29 +50,23 @@ Trainer
 .. autoclass:: transformers.Trainer
    :members:

+
 TFTrainer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: transformers.TFTrainer
    :members:

+
 TrainingArguments
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: transformers.TrainingArguments
    :members:

+
 TFTrainingArguments
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: transformers.TFTrainingArguments
    :members:
-
-Utilities
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-.. autoclass:: transformers.EvalPrediction
-
-.. autofunction:: transformers.set_seed
-
-.. autofunction:: transformers.torch_distributed_zero_first
--- a/docs/source/model_doc/bart.rst
+++ b/docs/source/model_doc/bart.rst
@@ -1,38 +1,46 @@
-Bart
+BART
 -----------------------------------------------------------------------------------------------------------------------
-**DISCLAIMER:** If you see something strange,
-file a `Github Issue <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ and assign
+
+**DISCLAIMER:** If you see something strange, file a `Github Issue
+<https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ and assign
@sshleifer

 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-The Bart model was `proposed <https://arxiv.org/abs/1910.13461>`_ by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer on 29 Oct, 2019.
+The Bart model was proposed in `BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation,
+Translation, and Comprehension <https://arxiv.org/abs/1910.13461>`__ by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan
+Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer on 29 Oct, 2019.
+
 According to the abstract,

- Bart uses a standard seq2seq/machine translation architecture with a bidirectional encoder (like BERT) and a left-to-right decoder (like GPT).
- The pretraining task involves randomly shuffling the order of the original sentences and a novel in-filling scheme, where spans of text are replaced with a single mask token.
- BART is particularly effective when fine tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE.
+- Bart uses a standard seq2seq/machine translation architecture with a bidirectional encoder (like BERT) and a
+  left-to-right decoder (like GPT).
+- The pretraining task involves randomly shuffling the order of the original sentences and a novel in-filling scheme,
+  where spans of text are replaced with a single mask token.
+- BART is particularly effective when fine tuned for text generation but also works well for comprehension tasks. It
+  matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, achieves new
+  state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains
+  of up to 6 ROUGE.

-The Authors' code can be found `here <https://github.com/pytorch/fairseq/tree/master/examples/bart>`_
+The Authors' code can be found `here <https://github.com/pytorch/fairseq/tree/master/examples/bart>`__.


 Implementation Notes
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Bart doesn't use :obj:`token_type_ids` for sequence classification. Use BartTokenizer.encode to get the proper splitting.
- The forward pass of ``BartModel`` will create decoder inputs (using the helper function ``transformers.modeling_bart._prepare_bart_decoder_inputs``)  if they are not passed. This is different than some other modeling APIs.
- Model predictions are intended to be identical to the original implementation. This only works, however, if the string you pass to ``fairseq.encode`` starts with a space.
- ``BartForConditionalGeneration.generate`` should be used for conditional generation tasks like summarization, see the example in that docstrings
- Models that load the ``"facebook/bart-large-cnn"`` weights will not have a ``mask_token_id``, or be able to perform mask filling tasks.
- for training/forward passes that don't involve beam search, pass ``use_cache=False``
-
-
-BartForConditionalGeneration
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-.. autoclass:: transformers.BartForConditionalGeneration
-    :members: forward
+- Bart doesn't use :obj:`token_type_ids` for sequence classification. Use :class:`~transformers.BartTokenizer` 
+  or :meth:`~transformers.BartTokenizer.encode` to get the proper splitting.
+- The forward pass of :class:`~transformers.BartModel` will create decoder inputs (using the helper function
+  :func:`transformers.modeling_bart._prepare_bart_decoder_inputs`) if they are not passed. This is different from some
+  other modeling APIs.
+- Model predictions are intended to be identical to the original implementation. This only works, however, if the
+  string you pass to :func:`fairseq.encode` starts with a space.
+- :meth:`~transformers.BartForConditionalGeneration.generate` should be used for conditional generation tasks like
+  summarization, see the example in its docstring.
+- Models that load the `facebook/bart-large-cnn` weights will not have a :obj:`mask_token_id`, or be able to perform
+  mask-filling tasks.
+- For training/forward passes that don't involve beam search, pass :obj:`use_cache=False`.


 BartConfig
@@ -59,6 +67,13 @@ BartModel
 .. autofunction:: transformers.modeling_bart._prepare_bart_decoder_inputs


+BartForConditionalGeneration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.BartForConditionalGeneration
+    :members: forward
+
+
 BartForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -71,5 +86,3 @@ BartForQuestionAnswering

 .. autoclass:: transformers.BartForQuestionAnswering
    :members: forward
-
-
--- a/docs/source/model_doc/blenderbot.rst
+++ b/docs/source/model_doc/blenderbot.rst
@@ -0,0 +1,75 @@
+Blenderbot
+-----------------------------------------------------------------------------------------------------------------------
+**DISCLAIMER:** If you see something strange,
+file a `Github Issue <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ .
+
+Overview
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The Blender chatbot model was proposed in `Recipes for building an open-domain chatbot <https://arxiv.org/pdf/2004.13637.pdf>`__ by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston on 30 Apr 2020.
+
+The abstract of the paper is the following:
+
+*Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that scaling neural models in the number of parameters and the size of the data they are trained on gives improved results, we show that other ingredients are important for a high-performing chatbot. Good conversation requires a number of skills that an expert conversationalist blends in a seamless way: providing engaging talking points and listening to their partners, and displaying knowledge, empathy and personality appropriately, while maintaining a consistent persona. We show that large scale models can learn these skills when given appropriate training data and choice of generation strategy. We build variants of these recipes with 90M, 2.7B and 9.4B parameter models, and make our models and code publicly available. Human evaluations show our best models are superior to existing approaches in multi-turn dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing failure cases of our models.*
+
+The authors' code can be found `here <https://github.com/facebookresearch/ParlAI>`__ .
+
+
+Implementation Notes
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+- Blenderbot uses a standard `seq2seq model transformer <https://arxiv.org/pdf/1706.03762.pdf>`__ based architecture.
+- It inherits completely from :class:`~transformers.BartForConditionalGeneration`.
+- Even though Blenderbot is one model, it uses two tokenizers: :class:`~transformers.BlenderbotSmallTokenizer` for the 90M checkpoint and :class:`~transformers.BlenderbotTokenizer` for all other checkpoints.
+- :class:`~transformers.BlenderbotSmallTokenizer` will always return :class:`~transformers.BlenderbotSmallTokenizer`, regardless of checkpoint. To use the 3B parameter checkpoint, you must call :class:`~transformers.BlenderbotTokenizer` directly.
+- Available checkpoints can be found in the `model hub <https://huggingface.co/models?search=blenderbot>`__.
+
+
+Usage
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Model Usage:
+
+        >>> from transformers import BlenderbotSmallTokenizer, BlenderbotForConditionalGeneration
+        >>> mname = 'facebook/blenderbot-90M'
+        >>> model = BlenderbotForConditionalGeneration.from_pretrained(mname)
+        >>> tokenizer = BlenderbotSmallTokenizer.from_pretrained(mname)
+        >>> UTTERANCE = "My friends are cool but they eat too many carbs."
+        >>> inputs = tokenizer([UTTERANCE], return_tensors='pt')
+        >>> reply_ids = model.generate(**inputs)
+        >>> print([tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in reply_ids])
+
+
+See Config Values:
+
+        >>> from transformers import BlenderbotConfig
+        >>> config_90 = BlenderbotConfig.from_pretrained("facebook/blenderbot-90M")
+        >>> config_90.to_diff_dict()  # show the values that differ from the defaults
+        >>> configuration_3B = BlenderbotConfig.from_pretrained("facebook/blenderbot-3B")
+        >>> configuration_3B.to_diff_dict()
+
+
+BlenderbotConfig
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+.. autoclass:: transformers.BlenderbotConfig
+    :members:
+
+BlenderbotTokenizer
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.BlenderbotTokenizer
+    :members: build_inputs_with_special_tokens
+
+BlenderbotSmallTokenizer
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.BlenderbotSmallTokenizer
+    :members:
+
+
+BlenderbotForConditionalGeneration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+See :class:`~transformers.BartForConditionalGeneration` for arguments to :obj:`forward` and :obj:`generate`.
+
+.. autoclass:: transformers.BlenderbotForConditionalGeneration
+    :members:
--- a/docs/source/model_doc/deberta.rst
+++ b/docs/source/model_doc/deberta.rst
@@ -0,0 +1,62 @@
+DeBERTa
+----------------------------------------------------
+
+Overview
+~~~~~~~~~~~~~~~~~~~~~
+
+The DeBERTa model was proposed in `DeBERTa: Decoding-enhanced BERT with Disentangled Attention <https://arxiv.org/abs/2006.03654>`__
+by Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen.
+It is based on Google's BERT model released in 2018 and Facebook's RoBERTa model released in 2019.
+
+It builds on RoBERTa by adding disentangled attention and an enhanced mask decoder, and is trained on half of the data used for RoBERTa.
+
+The abstract from the paper is the following:
+
+*Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. 
+In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa 
+models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode
+its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and 
+relative positions. Second, an enhanced mask decoder is used to replace the output softmax layer to predict the masked tokens for model pretraining.
+We show that these two techniques significantly improve the efficiency of model pre-training and performance of downstream tasks. Compared to 
+RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements 
+on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). The DeBERTa code and pre-trained 
+models will be made publicly available at https://github.com/microsoft/DeBERTa.*
+
+
+The original code can be found `here <https://github.com/microsoft/DeBERTa>`__.
+
+
+DebertaConfig
+~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.DebertaConfig
+    :members:
+
+
+DebertaTokenizer
+~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.DebertaTokenizer
+    :members: build_inputs_with_special_tokens, get_special_tokens_mask,
+        create_token_type_ids_from_sequences, save_vocabulary
+
+
+DebertaModel
+~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.DebertaModel
+    :members:
+
+
+DebertaPreTrainedModel
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.DebertaPreTrainedModel
+    :members:
+
+
+DebertaForSequenceClassification
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.DebertaForSequenceClassification
+    :members:
--- a/docs/source/model_doc/encoderdecoder.rst
+++ b/docs/source/model_doc/encoderdecoder.rst
@@ -27,4 +27,4 @@ EncoderDecoderModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: transformers.EncoderDecoderModel
-    :members: forward
+    :members: forward, from_encoder_decoder_pretrained
--- a/docs/source/model_doc/gpt.rst
+++ b/docs/source/model_doc/gpt.rst
@@ -104,6 +104,13 @@ OpenAIGPTDoubleHeadsModel
    :members: forward


+OpenAIGPTForSequenceClassification
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.OpenAIGPTForSequenceClassification
+    :members: forward
+
+
 TFOpenAIGPTModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

--- a/docs/source/model_doc/gpt2.rst
+++ b/docs/source/model_doc/gpt2.rst
@@ -88,6 +88,13 @@ GPT2DoubleHeadsModel
    :members: forward


+GPT2ForSequenceClassification
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.GPT2ForSequenceClassification
+    :members: forward
+
+
 TFGPT2Model
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

--- a/docs/source/model_doc/layoutlm.rst
+++ b/docs/source/model_doc/layoutlm.rst
@@ -4,8 +4,8 @@ LayoutLM
 Overview
 ~~~~~~~~~~~~~~~~~~~~~

-The LayoutLM model was proposed in `LayoutLM: Pre-training of Text and Layout for Document Image Understanding <https://arxiv.org/abs/1912.13318>`__
-by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. It's a simple but effective pre-training method 
+The LayoutLM model was proposed in the paper `LayoutLM: Pre-training of Text and Layout for Document Image Understanding <https://arxiv.org/abs/1912.13318>`__
+by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. It's a simple but effective pre-training method
 of text and layout for document image understanding and information extraction tasks, such as form understanding and receipt understanding.

 The abstract from the paper is the following:
--- a/docs/source/model_doc/marian.rst
+++ b/docs/source/model_doc/marian.rst
@@ -1,36 +1,51 @@
 MarianMT
 -----------------------------------------------------------------------------------------------------------------------
-**Bugs:** If you see something strange,
-file a `Github Issue <https://github.com/huggingface/transformers/issues/new?assignees=sshleifer&labels=&template=bug-report.md&title>`__ and assign
-@sshleifer. Translations should be similar, but not identical to, output in the test set linked to in each model card.
+
+**Bugs:** If you see something strange, file a `Github Issue
+<https://github.com/huggingface/transformers/issues/new?assignees=sshleifer&labels=&template=bug-report.md&title>`__
+and assign @sshleifer. 
+
+Translations should be similar, but not identical to, output in the test set linked to in each model card.

 Implementation Notes
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Each model is about 298 MB on disk, there are 1,000+ models.
+
+- Each model is about 298 MB on disk, and there are more than 1,000 models.
 - The list of supported language pairs can be found `here <https://huggingface.co/Helsinki-NLP>`__.
- models were originally trained by `Jörg Tiedemann <https://researchportal.helsinki.fi/en/persons/j%C3%B6rg-tiedemann>`__ using the `Marian <https://marian-nmt.github.io/>`_ C++ library, which supports fast training and translation.
- All models are transformer encoder-decoders with 6 layers in each component. Each model's performance is documented in a model card.
+- Models were originally trained by 
+  `Jörg Tiedemann <https://researchportal.helsinki.fi/en/persons/j%C3%B6rg-tiedemann>`__ using the
+  `Marian <https://marian-nmt.github.io/>`__ C++ library, which supports fast training and translation.
+- All models are transformer encoder-decoders with 6 layers in each component. Each model's performance is documented
+  in a model card.
 - The 80 opus models that require BPE preprocessing are not supported.
- The modeling code is the same as ``BartForConditionalGeneration`` with a few minor modifications:
-    - static (sinusoid) positional embeddings (``MarianConfig.static_position_embeddings=True``)
-    - a new final_logits_bias (``MarianConfig.add_bias_logits=True``)
-    - no layernorm_embedding (``MarianConfig.normalize_embedding=False``)
-    - the model starts generating with pad_token_id (which has 0 token_embedding) as the prefix. (Bart uses <s/>)
- Code to bulk convert models can be found in ``convert_marian_to_pytorch.py``
+- The modeling code is the same as :class:`~transformers.BartForConditionalGeneration` with a few minor modifications:
+    - static (sinusoid) positional embeddings (:obj:`MarianConfig.static_position_embeddings=True`)
+    - a new final_logits_bias (:obj:`MarianConfig.add_bias_logits=True`)
+    - no layernorm_embedding (:obj:`MarianConfig.normalize_embedding=False`)
+    - the model starts generating with :obj:`pad_token_id` (which has 0 as a token_embedding) as the prefix (Bart uses
+      :obj:`<s/>`).
+- Code to bulk convert models can be found in ``convert_marian_to_pytorch.py``.

 Naming
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- All  model names use the following format: ``Helsinki-NLP/opus-mt-{src}-{tgt}``
- The language codes used to name models are inconsistent. Two digit codes can usually be found `here <https://developers.google.com/admin-sdk/directory/v1/languages>`_, three digit codes require googling "language code {code}".
- Codes formatted like ``es_AR`` are usually ``code_{region}``. That one is spanish documents from Argentina.
+
+- All model names use the following format: :obj:`Helsinki-NLP/opus-mt-{src}-{tgt}`.
+- The language codes used to name models are inconsistent. Two-letter codes can usually be found `here
+  <https://developers.google.com/admin-sdk/directory/v1/languages>`__, while three-letter codes require searching
+  for "language code {code}".
+- Codes formatted like :obj:`es_AR` are usually :obj:`code_{region}`. That one is Spanish from Argentina.


 Multilingual Models
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-All  model names use the following format: ``Helsinki-NLP/opus-mt-{src}-{tgt}``:
-    - if ``src`` is in all caps, the model supports multiple input languages, you can figure out which ones by looking at the model card, or the Group Members `mapping <https://gist.github.com/sshleifer/6d20e7761931b08e73c3219027b97b8a>`_ .
-    - if ``tgt`` is in all caps, the model can output multiple languages, and you should specify a language code by prepending the desired output language to the src_text
+All model names use the following format: :obj:`Helsinki-NLP/opus-mt-{src}-{tgt}`:
+
+    - If :obj:`src` is in all caps, the model supports multiple input languages. You can figure out which ones by
+      looking at the model card, or the Group Members `mapping
+      <https://gist.github.com/sshleifer/6d20e7761931b08e73c3219027b97b8a>`_ .
+    - If :obj:`tgt` is in all caps, the model can output multiple languages, and you should specify a language code by
+      prepending the desired output language to the :obj:`src_text`.
    - You can see a tokenizer's supported language codes in ``tokenizer.supported_language_codes``

 Example of translating english to many romance languages, using language codes:
@@ -54,12 +69,20 @@ Example of translating english to many romance languages, using language codes:
    # 'Isto deve ir para o português.',
    # 'Y esto al español']

-Sometimes, models were trained on collections of languages that do not resolve to a group. In this case, _ is used as a separator for src or tgt, as in ``'Helsinki-NLP/opus-mt-en_el_es_fi-en_el_es_fi'``. These still require language codes.
-There are many supported regional language codes, like ``>>es_ES<<`` (Spain) and ``>>es_AR<<`` (Argentina), that do not seem to change translations. I have not found these to provide different results than just using ``>>es<<``.
+Sometimes, models were trained on collections of languages that do not resolve to a group. In this case, _ is used as a
+separator for src or tgt, as in :obj:`Helsinki-NLP/opus-mt-en_el_es_fi-en_el_es_fi`. These still require language
+codes.

-For Example:
-    - ``Helsinki-NLP/opus-mt-NORTH_EU-NORTH_EU``: translates from all NORTH_EU languages (see `mapping <https://gist.github.com/sshleifer/6d20e7761931b08e73c3219027b97b8a>`_) to all NORTH_EU languages. Use a special language code like ``>>de<<`` to specify output language.
-    - ``Helsinki-NLP/opus-mt-ROMANCE-en``: translates from many romance languages to english, no codes needed since there is only 1 tgt language.
+There are many supported regional language codes, like :obj:`>>es_ES<<` (Spain) and :obj:`>>es_AR<<` (Argentina), that
+do not seem to change translations. I have not found these to provide different results than just using :obj:`>>es<<`.
+
+For example:
+
+    - :obj:`Helsinki-NLP/opus-mt-NORTH_EU-NORTH_EU`: translates from all NORTH_EU languages (see
+      `mapping <https://gist.github.com/sshleifer/6d20e7761931b08e73c3219027b97b8a>`_) to all NORTH_EU languages.
+      Use a special language code like :obj:`>>de<<` to specify the output language.
+    - :obj:`Helsinki-NLP/opus-mt-ROMANCE-en`: translates from many Romance languages to English. No codes are needed
+      since there is only one target language.
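+
+The naming rules above can be sketched in plain Python. The helper names below are hypothetical and only illustrate
+the conventions described in this section; they are not part of the library. The sketch assumes that the language
+code is separated from the source text by a single space.
+
+.. code-block:: python
+
+    def split_marian_name(model_name: str):
+        """Split ``Helsinki-NLP/opus-mt-{src}-{tgt}`` into ``(src, tgt)`` (hypothetical helper)."""
+        prefix = "Helsinki-NLP/opus-mt-"
+        if not model_name.startswith(prefix):
+            raise ValueError(f"unexpected model name: {model_name}")
+        # Assumption: neither side contains a hyphen, which holds for the names shown on this page.
+        src, tgt = model_name[len(prefix):].split("-")
+        return src, tgt
+
+    def is_language_group(side: str) -> bool:
+        """A side written in all caps (for example ``ROMANCE``) names a group of languages."""
+        return side.isupper()
+
+    def add_language_code(src_text: str, code: str) -> str:
+        """Prepend the desired output language, formatted like ``>>de<<``, to the source text."""
+        return f">>{code}<< {src_text}"
+
+    src, tgt = split_marian_name("Helsinki-NLP/opus-mt-NORTH_EU-NORTH_EU")
+    if is_language_group(tgt):
+        # The target side is a group, so the output language has to be named explicitly.
+        print(add_language_code("text to translate", "de"))  # >>de<< text to translate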



@@ -86,13 +109,6 @@ Code to see available pretrained models:
    suffix = [x.split('/')[1] for x in model_ids]
    multi_models = [f'{org}/{s}' for s in suffix if s != s.lower()]

-MarianMTModel
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Pytorch version of marian-nmt's transformer.h (c++). Designed for the OPUS-NMT translation checkpoints.
-Model API is identical to BartForConditionalGeneration.
-Available models are listed at `Model List <https://huggingface.co/models?search=Helsinki-NLP>`__
-This class inherits nearly all functionality from ``BartForConditionalGeneration``, see that page for method signatures.

 MarianConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -107,5 +123,7 @@ MarianTokenizer
    :members: prepare_seq2seq_batch


+MarianMTModel
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-
+.. autoclass:: transformers.MarianMTModel
--- a/docs/source/model_doc/mbart.rst
+++ b/docs/source/model_doc/mbart.rst
@@ -1,15 +1,20 @@
 MBart
 -----------------------------------------------------------------------------------------------------------------------
-**DISCLAIMER:** If you see something strange,
-file a `Github Issue <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ and assign
+
+**DISCLAIMER:** If you see something strange, file a `Github Issue
+<https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ and assign
@sshleifer

 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-The MBart model was presented in `Multilingual Denoising Pre-training for Neural Machine Translation <https://arxiv.org/abs/2001.08210>`_ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov
-Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer. According to the abstract,
+The MBart model was presented in `Multilingual Denoising Pre-training for Neural Machine Translation
+<https://arxiv.org/abs/2001.08210>`_ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov,
+Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer.

-MBART is a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora in many languages using the BART objective. mBART is one of the first methods for pre-training a complete sequence-to-sequence model by denoising full texts in multiple languages, while previous approaches have focused only on the encoder, decoder, or reconstructing parts of the text.
+According to the abstract, mBART is a sequence-to-sequence denoising auto-encoder pretrained on large-scale monolingual
+corpora in many languages using the BART objective. mBART is one of the first methods for pre-training a complete
+sequence-to-sequence model by denoising full texts in multiple languages, while previous approaches have focused only
+on the encoder, decoder, or reconstructing parts of the text.

 The Authors' code can be found `here <https://github.com/pytorch/fairseq/tree/master/examples/mbart>`__

@@ -18,10 +23,11 @@ Training
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 MBart is a multilingual encoder-decoder (seq-to-seq) model primarily intended for translation task. 
 As the model is multilingual it expects the sequences in a different format. A special language id token 
-is added in both the source and target text. The source text format is ``X [eos, src_lang_code]`` 
-where ``X`` is the source text. The target text format is ```[tgt_lang_code] X [eos]```. ```bos``` is never used.
-The ```MBartTokenizer.prepare_seq2seq_batch``` handles this automatically and should be used to encode 
-the sequences for seq-2-seq fine-tuning.
+is added in both the source and target text. The source text format is :obj:`X [eos, src_lang_code]`
+where :obj:`X` is the source text. The target text format is :obj:`[tgt_lang_code] X [eos]`. :obj:`bos` is never used.
+
+The :meth:`~transformers.MBartTokenizer.prepare_seq2seq_batch` method handles this automatically and should be used to
+encode the sequences for sequence-to-sequence fine-tuning.
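+
+The two formats can be sketched with plain Python lists. This is only an illustration: the helper names are
+hypothetical, ``</s>`` stands in for the :obj:`eos` token, and ``en_XX`` / ``ro_RO`` are used as example language
+codes. In practice the tokenizer builds these sequences for you.
+
+.. code-block:: python
+
+    EOS = "</s>"  # assumed string form of the eos token
+
+    def format_source(tokens, src_lang_code):
+        """Source format: ``X [eos, src_lang_code]``."""
+        return list(tokens) + [EOS, src_lang_code]
+
+    def format_target(tokens, tgt_lang_code):
+        """Target format: ``[tgt_lang_code] X [eos]``; no bos token is added."""
+        return [tgt_lang_code] + list(tokens) + [EOS]
+
+    print(format_source(["UN", "Chief"], "en_XX"))  # ['UN', 'Chief', '</s>', 'en_XX']
+    print(format_target(["Şeful", "ONU"], "ro_RO"))  # ['ro_RO', 'Şeful', 'ONU', '</s>']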

 - Supervised training

@@ -38,8 +44,8 @@ the sequences for seq-2-seq fine-tuning.

 - Generation

-    While generating the target text set the `decoder_start_token_id` to the target language id. 
-    The following example shows how to translate English to Romanian using the ```facebook/mbart-large-en-ro``` model.
+    While generating the target text, set the :obj:`decoder_start_token_id` to the target language id.
+    The following example shows how to translate English to Romanian using the :obj:`facebook/mbart-large-en-ro` model.

 .. code-block::

@@ -71,6 +77,4 @@ MBartForConditionalGeneration
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: transformers.MBartForConditionalGeneration
-    :members: generate, forward
-
-
+    :members: forward
--- a/docs/source/model_doc/pegasus.rst
+++ b/docs/source/model_doc/pegasus.rst
@@ -1,30 +1,40 @@
 Pegasus
 -----------------------------------------------------------------------------------------------------------------------
-**DISCLAIMER:** If you see something strange,
-file a `Github Issue <https://github.com/huggingface/transformers/issues/new?assignees=sshleifer&labels=&template=bug-report.md&title>`__ and assign
-@sshleifer.
+
+**DISCLAIMER:** If you see something strange, file a `Github Issue
+<https://github.com/huggingface/transformers/issues/new?assignees=sshleifer&labels=&template=bug-report.md&title>`__
+and assign @sshleifer.


 Overview
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 The Pegasus model was proposed in `PEGASUS: Pre-training with Extracted Gap-sentences for
-Abstractive Summarization <https://arxiv.org/pdf/1912.08777.pdf>`_ by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu on Dec 18, 2019.
+Abstractive Summarization <https://arxiv.org/pdf/1912.08777.pdf>`__ by Jingqing Zhang, Yao Zhao, Mohammad Saleh and
+Peter J. Liu on Dec 18, 2019.
+
 According to the abstract,

- Pegasus' pretraining task is intentionally similar to summarization: important sentences are removed/masked from an input document and are generated together as one output sequence from the remaining sentences, similar to an extractive summary.
+- Pegasus' pretraining task is intentionally similar to summarization: important sentences are removed/masked from an
+  input document and are generated together as one output sequence from the remaining sentences, similar to an
+  extractive summary.
 - Pegasus achieves SOTA summarization performance on all 12 downstream tasks, as measured by ROUGE and human eval.

-The Authors' code can be found `here <https://github.com/google-research/pegasus>`_.
+The Authors' code can be found `here <https://github.com/google-research/pegasus>`__.


 Checkpoints
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-All the `checkpoints <https://huggingface.co/models?search=pegasus>`_ are finetuned for summarization, besides ``pegasus-large``, whence the other checkpoints are finetuned.
+
+All the `checkpoints <https://huggingface.co/models?search=pegasus>`__ are fine-tuned for summarization, except
+:obj:`pegasus-large`, from which the other checkpoints are fine-tuned:
+
 - Each checkpoint is 2.2 GB on disk and 568M parameters.
 - FP16 is not supported (help/ideas on this appreciated!).
 - Summarizing xsum in fp32 takes about 400ms/sample, with default parameters on a v100 GPU.
- For XSUM, The paper reports rouge1,rouge2, rougeL of paper: 47.21/24.56/39.25. As of Aug 9, this port scores 46.91/24.34/39.1.
+- For XSUM, the paper reports rouge1/rouge2/rougeL scores of 47.21/24.56/39.25. As of Aug 9, this port scores
+  46.91/24.34/39.1.
+
 The gap is likely because of different alpha/length_penalty implementations in beam search.


@@ -32,14 +42,16 @@ Implementation Notes
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 - All models are transformer encoder-decoders with 16 layers in each component.
- The implementation is completely inherited from ``BartForConditionalGeneration``
+- The implementation is completely inherited from :class:`~transformers.BartForConditionalGeneration`
 - Some key configuration differences:
    - static, sinusoidal position embeddings
-    - no ``layernorm_embedding`` (``PegasusConfig.normalize_embedding=False``)
+    - no :obj:`layernorm_embedding` (:obj:`PegasusConfig.normalize_embedding=False`)
    - the model starts generating with pad_token_id (which has 0 token_embedding) as the prefix.
-    - ``num_beams=8``
- All pretrained pegasus checkpoints are the same besides three attributes: ``tokenizer.model_max_length`` (max input size),  ``max_length`` (max num tokens to generate) and ``length_penalty``
- Code to convert checkpoints trained in the author's `repo <https://github.com/google-research/pegasus>`_ can be found in ``convert_pegasus_tf_to_pytorch.py``
+    - more beams are used (:obj:`num_beams=8`)
+- All pretrained pegasus checkpoints are the same besides three attributes: :obj:`tokenizer.model_max_length` (maximum
+  input size), :obj:`max_length` (the maximum number of tokens to generate) and :obj:`length_penalty`.
+- The code to convert checkpoints trained in the author's `repo <https://github.com/google-research/pegasus>`_ can be
+  found in ``convert_pegasus_tf_to_pytorch.py``.


 Usage Example
@@ -62,48 +74,12 @@ Usage Example
    tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
    assert tgt_text[0] == "California's largest electricity provider has turned off power to hundreds of thousands of customers."

-PegasusForConditionalGeneration
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-This class inherits all functionality from ``BartForConditionalGeneration``, see that page for method signatures.
-Available models are listed at `Model List <https://huggingface.co/models?search=pegasus>`__
-
-.. autoclass:: transformers.PegasusForConditionalGeneration
-    :members:


 PegasusConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-This config fully inherits from ``BartConfig``, but pegasus uses different default values:
-Up to date parameter values can be seen in `S3 <https://s3.amazonaws.com/models.huggingface.co/bert/google/pegasus-xsum/config.json>`_.
-As of Aug 10, 2020, they are:

-.. code-block:: python
-
-    dict(
-    vocab_size=96103,
-    max_position_embeddings=512,
-    d_model=1024,
-    encoder_ffn_dim=4096,
-    decoder_ffn_dim=4096,
-    encoder_attention_heads=16,
-    decoder_attention_heads=16,
-    encoder_layers=16,
-    decoder_layers=16,
-    dropout=0.1,
-    attention_dropout=0.1,
-    activation_dropout=0.1,
-    pad_token_id=0,
-    eos_token_id=1,
-    is_encoder_decoder=True,
-    normalize_before=True,
-    scale_embedding=True,
-    normalize_embedding=False,
-    add_final_layer_norm=True,
-    static_position_embeddings=True,
-    num_beams=8,
-    activation_function="relu",
-    )
+.. autoclass:: transformers.PegasusConfig


 PegasusTokenizer
@@ -114,4 +90,7 @@ warning: ``add_tokens`` does not work at the moment.
    :members: __call__, prepare_seq2seq_batch


+PegasusForConditionalGeneration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+.. autoclass:: transformers.PegasusForConditionalGeneration
--- a/docs/source/model_doc/prophetnet.rst
+++ b/docs/source/model_doc/prophetnet.rst
@@ -0,0 +1,83 @@
+ProphetNet
+-----------------------------------------------------------------------------------------------------------------------
+
+**DISCLAIMER:** If you see something strange, file a `Github Issue
+<https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ and assign
+@patrickvonplaten
+
+Overview
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ProphetNet model was proposed in `ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training <https://arxiv.org/abs/2001.04063>`__ by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, and Ming Zhou on 13 Jan, 2020.
+
+ProphetNet is an encoder-decoder model that can predict the next n tokens ("n-gram" language modeling) instead of just the next token.
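+
+As a toy sketch of what "the next n tokens" means, the snippet below lists, for every position, the tokens that
+would serve as prediction targets. It only illustrates the targets; it is not the model's implementation, and
+``future_ngram_targets`` is a hypothetical helper.
+
+.. code-block:: python
+
+    def future_ngram_targets(tokens, n):
+        """For each position t, return the (up to) n tokens that follow it."""
+        return [tokens[t + 1 : t + 1 + n] for t in range(len(tokens) - 1)]
+
+    tokens = ["the", "cat", "sat", "down"]
+    print(future_ngram_targets(tokens, 1))  # [['cat'], ['sat'], ['down']]
+    print(future_ngram_targets(tokens, 2))  # [['cat', 'sat'], ['sat', 'down'], ['down']]
+
+With ``n=1`` this reduces to ordinary next-token prediction.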
+
+The abstract from the paper is the following:
+
+*In this paper, we present a new sequence-to-sequence pre-training model called ProphetNet, which introduces a novel self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of the optimization of one-step ahead prediction in traditional sequence-to-sequence model, the ProphetNet is optimized by n-step ahead prediction which predicts the next n tokens simultaneously based on previous context tokens at each time step. The future n-gram prediction explicitly encourages the model to plan for the future tokens and prevent overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large scale dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new state-of-the-art results on all these datasets compared to the models using the same scale pre-training corpus.*
+
+The Authors' code can be found `here <https://github.com/microsoft/ProphetNet>`__.
+
+
+ProphetNetConfig
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.ProphetNetConfig
+    :members:
+
+
+ProphetNetTokenizer
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.ProphetNetTokenizer
+    :members:
+
+
+ProphetNet specific outputs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.modeling_prophetnet.ProphetNetSeq2SeqLMOutput
+    :members:
+
+.. autoclass:: transformers.modeling_prophetnet.ProphetNetSeq2SeqModelOutput
+    :members:
+
+.. autoclass:: transformers.modeling_prophetnet.ProphetNetDecoderModelOutput
+    :members:
+
+.. autoclass:: transformers.modeling_prophetnet.ProphetNetDecoderLMOutput
+    :members:
+
+ProphetNetModel
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.ProphetNetModel
+    :members: forward
+
+
+ProphetNetEncoder
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.ProphetNetEncoder
+    :members: forward
+
+
+ProphetNetDecoder
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.ProphetNetDecoder
+    :members: forward
+
+
+ProphetNetForConditionalGeneration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.ProphetNetForConditionalGeneration
+    :members: forward
+
+
+ProphetNetForCausalLM
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.ProphetNetForCausalLM
+    :members: forward
--- a/docs/source/model_doc/rag.rst
+++ b/docs/source/model_doc/rag.rst
@@ -62,8 +62,7 @@ Rag specific outputs
 .. autoclass:: transformers.modeling_rag.RetrievAugLMOutput
    :members:

-
-RAGRetriever
+RagRetriever
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: transformers.RagRetriever
--- a/docs/source/model_doc/squeezebert.rst
+++ b/docs/source/model_doc/squeezebert.rst
@@ -0,0 +1,103 @@
+SqueezeBERT
+-----------------------------------------------------------------------------------------------------------------------
+
+Overview
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The SqueezeBERT model was proposed in
+`SqueezeBERT: What can computer vision teach NLP about efficient neural networks?
+<https://arxiv.org/abs/2006.11316>`__
+by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, Kurt W. Keutzer.
+It's a bidirectional transformer similar to the BERT model.
+The key difference between the BERT architecture and the SqueezeBERT architecture
+is that SqueezeBERT uses `grouped convolutions <https://blog.yani.io/filter-group-tutorial>`__
+instead of fully-connected layers for the Q, K, V and FFN layers.
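+
+A rough way to see why this helps is to compare weight counts. The sketch below assumes a layer that maps 768
+channels to 768 channels and uses 4 groups; both numbers are illustrative and are not taken from the SqueezeBERT
+configuration. Biases are ignored.
+
+.. code-block:: python
+
+    def dense_weights(c_in, c_out):
+        """Weights in a fully-connected layer applied at every position."""
+        return c_in * c_out
+
+    def grouped_conv1d_weights(c_in, c_out, groups, kernel_size=1):
+        """Weights in a grouped 1D convolution: each group only connects its own slice of channels."""
+        if c_in % groups or c_out % groups:
+            raise ValueError("channel counts must be divisible by the number of groups")
+        return (c_in // groups) * c_out * kernel_size
+
+    print(dense_weights(768, 768))  # 589824
+    print(grouped_conv1d_weights(768, 768, groups=4))  # 147456
+
+With ``groups=1`` and ``kernel_size=1`` the two counts are identical.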
+
+The abstract from the paper is the following:
+
+*Humans read and write hundreds of billions of messages every day. Further, due to the availability of
+large datasets, large computing systems, and better neural network models, natural language processing (NLP)
+technology has made significant strides in understanding, proofreading, and organizing these messages.
+Thus, there is a significant opportunity to deploy NLP in myriad applications to help web users,
+social networks, and businesses. In particular, we consider smartphones and other mobile devices as
+crucial platforms for deploying NLP models at scale. However, today's highly-accurate NLP neural network
+models such as BERT and RoBERTa are extremely computationally expensive, with BERT-base taking 1.7 seconds
+to classify a text snippet on a Pixel 3 smartphone. In this work, we observe that methods such as grouped
+convolutions have yielded significant speedups for computer vision networks, but many of these techniques
+have not been adopted by NLP neural network designers. We demonstrate how to replace several operations in
+self-attention layers with grouped convolutions, and we use this technique in a novel network architecture
+called SqueezeBERT, which runs 4.3x faster than BERT-base on the Pixel 3 while achieving competitive
+accuracy on the GLUE test set. The SqueezeBERT code will be released.*
+
+Tips:
+
+- SqueezeBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on
+  the right rather than the left.
+- SqueezeBERT is similar to BERT and therefore relies on the masked language modeling (MLM) objective.
+  It is therefore efficient at predicting masked tokens and at NLU in general, but is not optimal for
+  text generation. Models trained with a causal language modeling (CLM) objective are better in that regard.
+- For best results when finetuning on sequence classification tasks, it is recommended to start with the
+  ``squeezebert/squeezebert-mnli-headless`` checkpoint.
+
+SqueezeBertConfig
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.SqueezeBertConfig
+    :members:
+
+
+SqueezeBertTokenizer
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.SqueezeBertTokenizer
+    :members: build_inputs_with_special_tokens, get_special_tokens_mask,
+        create_token_type_ids_from_sequences, save_vocabulary
+
+
+SqueezeBertTokenizerFast
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.SqueezeBertTokenizerFast
+    :members:
+
+
+SqueezeBertModel
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.SqueezeBertModel
+    :members:
+
+
+SqueezeBertForMaskedLM
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.SqueezeBertForMaskedLM
+    :members:
+
+
+SqueezeBertForSequenceClassification
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.SqueezeBertForSequenceClassification
+    :members:
+
+
+SqueezeBertForMultipleChoice
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.SqueezeBertForMultipleChoice
+    :members:
+
+
+SqueezeBertForTokenClassification
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.SqueezeBertForTokenClassification
+    :members:
+
+
+SqueezeBertForQuestionAnswering
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.SqueezeBertForQuestionAnswering
+    :members:
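The overview in the new squeezebert.rst says SqueezeBERT swaps fully-connected layers for grouped convolutions. A minimal sketch of why that saves work, in plain Python: a kernel-size-1 grouped convolution applied per token behaves like a block-diagonal linear layer, so each output channel only sees the input channels of its own group. The hidden size of 768 and the group count of 4 are assumptions chosen for illustration, and the function names are hypothetical; none of this is the library's implementation.

```python
def dense_weight_count(d_in, d_out):
    """Weights in a fully-connected layer (bias ignored)."""
    return d_in * d_out


def grouped_weight_count(d_in, d_out, groups):
    """Weights in a kernel-size-1 grouped convolution (bias ignored)."""
    if d_in % groups or d_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    return (d_in // groups) * (d_out // groups) * groups


def grouped_pointwise(x, group_weights):
    """Apply a grouped pointwise projection to one token vector.

    x: list of d_in floats.
    group_weights: one matrix per group, each with d_out/g rows and
    d_in/g columns. Each group reads only its own slice of x.
    """
    groups = len(group_weights)
    chunk = len(x) // groups
    out = []
    for g, matrix in enumerate(group_weights):
        x_slice = x[g * chunk:(g + 1) * chunk]
        for row in matrix:
            out.append(sum(w * v for w, v in zip(row, x_slice)))
    return out


# Illustrative sizes (assumed, not taken from the diff).
dense = dense_weight_count(768, 768)
grouped = grouped_weight_count(768, 768, groups=4)
print(dense, grouped, dense // grouped)
```

With g groups the weight count (and the multiply-adds per token) drops by a factor of g, at the cost of no mixing between groups inside that layer.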
--- a/docs/source/model_doc/t5.rst
+++ b/docs/source/model_doc/t5.rst
@@ -62,10 +62,10 @@ token. T5 can be trained / fine-tuned both in a supervised and unsupervised fash

 .. code-block::

-  input_ids = tokenizer.encode('The <extra_id_0> walks in <extra_id_1> park', return_tensors='pt')
-  labels = tokenizer.encode('<extra_id_0> cute dog <extra_id_1> the <extra_id_2> </s>', return_tensors='pt')
+  input_ids = tokenizer('The <extra_id_0> walks in <extra_id_1> park', return_tensors='pt').input_ids
+  labels = tokenizer('<extra_id_0> cute dog <extra_id_1> the <extra_id_2>', return_tensors='pt').input_ids
  # the forward function automatically creates the correct decoder_input_ids
-  model(input_ids=input_ids, labels=labels)
+  loss = model(input_ids=input_ids, labels=labels, return_dict=True).loss

 - Supervised training

@@ -75,10 +75,10 @@ token. T5 can be trained / fine-tuned both in a supervised and unsupervised fash
  
 .. code-block::

-  input_ids = tokenizer.encode('translate English to German: The house is wonderful. </s>', return_tensors='pt')
-  labels = tokenizer.encode('Das Haus ist wunderbar. </s>', return_tensors='pt')
+  input_ids = tokenizer('translate English to German: The house is wonderful.', return_tensors='pt').input_ids
+  labels = tokenizer('Das Haus ist wunderbar.', return_tensors='pt').input_ids
  # the forward function automatically creates the correct decoder_input_ids
-  model(input_ids=input_ids, labels=labels)
+  loss = model(input_ids=input_ids, labels=labels, return_dict=True).loss


 T5Config
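The unsupervised example in the first t5.rst hunk pairs a masked input with sentinel-delimited labels. A minimal sketch of how such a pair can be derived from a plain sentence, in plain Python with no tokenizer involved: the helper name `span_corrupt` and the original sentence "The cute dog walks in the park" are assumptions (the sentence is inferred from the input and labels shown in the hunk), and real pre-training picks spans at random rather than from a fixed list.

```python
def span_corrupt(tokens, spans):
    """Replace each span with a sentinel and collect the dropped tokens.

    tokens: list of word strings.
    spans: sorted, non-overlapping (start, end) index pairs, end exclusive.
    Returns (input_text, label_text) in the <extra_id_N> sentinel format.
    """
    inputs, labels = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])  # keep the text before the span
        inputs.append(sentinel)              # mark where the span was removed
        labels.append(sentinel)              # same sentinel introduces the span
        labels.extend(tokens[start:end])     # the removed tokens are the target
        cursor = end
    inputs.extend(tokens[cursor:])
    labels.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return " ".join(inputs), " ".join(labels)


tokens = "The cute dog walks in the park".split()
input_text, label_text = span_corrupt(tokens, [(1, 3), (5, 6)])
print(input_text)
print(label_text)
```

The two printed strings correspond to the text passed to the tokenizer for `input_ids` and `labels` in the updated snippet.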
--- a/docs/source/model_doc/transformerxl.rst
+++ b/docs/source/model_doc/transformerxl.rst
@@ -46,13 +46,6 @@ TransfoXLTokenizer
    :members: save_vocabulary


-TransfoXLTokenizerFast
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-.. autoclass:: transformers.TransfoXLTokenizerFast
-    :members:
-
-
 TransfoXL specific outputs
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

--- a/docs/source/model_doc/xlmprophetnet.rst
+++ b/docs/source/model_doc/xlmprophetnet.rst
@@ -0,0 +1,63 @@
+XLM-ProphetNet
+-----------------------------------------------------------------------------------------------------------------------
+
+**DISCLAIMER:** If you see something strange, file a `GitHub issue
+<https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ and assign
+@patrickvonplaten.
+
+
+Overview
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The XLM-ProphetNet model was proposed in `ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training <https://arxiv.org/abs/2001.04063>`__ by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou on 13 Jan, 2020.
+
+XLM-ProphetNet is an encoder-decoder model that can predict the next n tokens for "ngram" language modeling instead of just the next token. Its architecture is identical to ProphetNet, but the model was trained on the multi-lingual "wiki100" Wikipedia dump.
+
+The abstract from the paper is the following:
+
+*In this paper, we present a new sequence-to-sequence pre-training model called ProphetNet, which introduces a novel self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of the optimization of one-step ahead prediction in traditional sequence-to-sequence model, the ProphetNet is optimized by n-step ahead prediction which predicts the next n tokens simultaneously based on previous context tokens at each time step. The future n-gram prediction explicitly encourages the model to plan for the future tokens and prevent overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large scale dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new state-of-the-art results on all these datasets compared to the models using the same scale pre-training corpus.*
+
+The authors' code can be found `here <https://github.com/microsoft/ProphetNet>`__.
+
+XLMProphetNetConfig
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.XLMProphetNetConfig
+    :members:
+
+
+XLMProphetNetTokenizer
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.XLMProphetNetTokenizer
+    :members:
+
+
+XLMProphetNetModel
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.XLMProphetNetModel
+
+
+XLMProphetNetEncoder
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.XLMProphetNetEncoder
+
+
+XLMProphetNetDecoder
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.XLMProphetNetDecoder
+
+
+XLMProphetNetForConditionalGeneration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.XLMProphetNetForConditionalGeneration
+
+
+XLMProphetNetForCausalLM
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.XLMProphetNetForCausalLM
--- a/docs/source/model_summary.rst
+++ b/docs/source/model_summary.rst
@@ -612,6 +612,43 @@ The `mbart-large-cc25 <https://huggingface.co/facebook/mbart-large-cc25>`_ check

 .. _multimodal-models:

+ProphetNet
+-----------------------------------------------------------------------------------------------------------------------
+
+.. raw:: html
+
+   <a href="https://huggingface.co/models?filter=prophetnet">
+       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-prophetnet-blueviolet">
+   </a>
+   <a href="model_doc/prophetnet.html">
+       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-prophetnet-blueviolet">
+   </a>
+
+`ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training <https://arxiv.org/abs/2001.04063>`__, by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou.
+
+ProphetNet introduces a novel *sequence-to-sequence* pre-training objective, called *future n-gram prediction*. In future n-gram prediction, the model predicts the next n tokens simultaneously, based on the previous context tokens at each time step, instead of just the single next token. The future n-gram prediction explicitly encourages the model to plan for future tokens and prevents overfitting on strong local correlations.
+The model architecture is based on the original Transformer, but replaces the "standard" self-attention mechanism in the decoder with a main self-attention mechanism and an n-stream (predict) self-attention mechanism.
+
+The library provides a pre-trained version of this model for conditional generation and a fine-tuned version for summarization.
+
+XLM-ProphetNet
+-----------------------------------------------------------------------------------------------------------------------
+
+.. raw:: html
+
+   <a href="https://huggingface.co/models?filter=xprophetnet">
+       <img alt="Models" src="https://img.shields.io/badge/All_model_pages-xprophetnet-blueviolet">
+   </a>
+   <a href="model_doc/xlmprophetnet.html">
+       <img alt="Doc" src="https://img.shields.io/badge/Model_documentation-xprophetnet-blueviolet">
+   </a>
+
+`ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training <https://arxiv.org/abs/2001.04063>`__, by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou.
+
+XLM-ProphetNet's model architecture and pre-training objective are the same as ProphetNet's, but XLM-ProphetNet was pre-trained on the cross-lingual dataset `XGLUE <https://arxiv.org/abs/2004.01401>`__.
+
+The library provides a pre-trained version of this model for multi-lingual conditional generation and fine-tuned versions for headline generation and question generation, respectively.
+
 Multimodal models
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

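The ProphetNet summary added in the hunk above describes future n-gram prediction: at each time step the model predicts the next n tokens rather than only the next one. A minimal sketch of the target layout this implies, in plain Python: the function name `future_ngram_targets` and the `None` padding are assumptions for illustration, and the sketch covers only which tokens are predicted at each step, not the n-stream attention or how the per-stream losses are combined.

```python
def future_ngram_targets(tokens, n, pad=None):
    """For each position t, list the n tokens that follow it.

    Positions near the end of the sequence have fewer than n future
    tokens, so the remainder is filled with `pad`.
    """
    targets = []
    for t in range(len(tokens)):
        window = tokens[t + 1:t + 1 + n]
        window = window + [pad] * (n - len(window))
        targets.append(window)
    return targets


sequence = ["the", "cat", "sat", "down"]
for token, future in zip(sequence, future_ngram_targets(sequence, n=2)):
    print(token, "->", future)
```

Setting `n=1` reduces this to ordinary next-token prediction, which is the baseline the summary contrasts against.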
--- a/docs/source/philosophy.rst
+++ b/docs/source/philosophy.rst
@@ -66,7 +66,7 @@ The library is built around three types of classes for each model:
 All these classes can be instantiated from pretrained instances and saved locally using two methods:

 - :obj:`from_pretrained()` lets you instantiate a model/configuration/tokenizer from a pretrained version either
-  provided by the library itself (the suported models are provided in the list :doc:`here <pretrained_models>`
+  provided by the library itself (the supported models are provided in the list :doc:`here <pretrained_models>`)
  or stored locally (or on a server) by the user,
 - :obj:`save_pretrained()` lets you save a model/configuration/tokenizer locally so that it can be reloaded using
  :obj:`from_pretrained()`.
--- a/docs/source/pretrained_models.rst
+++ b/docs/source/pretrained_models.rst
@@ -11,26 +11,26 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
 | BERT               | ``bert-base-uncased``                                      | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
 |                    |                                                            | | Trained on lower-cased English text.                                                                                                |
 |                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``bert-large-uncased``                                     | | 24-layer, 1024-hidden, 16-heads, 340M parameters.                                                                                   |
+|                    | ``bert-large-uncased``                                     | | 24-layer, 1024-hidden, 16-heads, 336M parameters.                                                                                   |
 |                    |                                                            | | Trained on lower-cased English text.                                                                                                |
 |                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``bert-base-cased``                                        | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
+|                    | ``bert-base-cased``                                        | | 12-layer, 768-hidden, 12-heads, 109M parameters.                                                                                    |
 |                    |                                                            | | Trained on cased English text.                                                                                                      |
 |                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``bert-large-cased``                                       | | 24-layer, 1024-hidden, 16-heads, 340M parameters.                                                                                   |
+|                    | ``bert-large-cased``                                       | | 24-layer, 1024-hidden, 16-heads, 335M parameters.                                                                                   |
 |                    |                                                            | | Trained on cased English text.                                                                                                      |
 |                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``bert-base-multilingual-uncased``                         | | (Original, not recommended) 12-layer, 768-hidden, 12-heads, 110M parameters.                                                        |
+|                    | ``bert-base-multilingual-uncased``                         | | (Original, not recommended) 12-layer, 768-hidden, 12-heads, 168M parameters.                                                        |
 |                    |                                                            | | Trained on lower-cased text in the top 102 languages with the largest Wikipedias                                                    |
 |                    |                                                            |                                                                                                                                       |
 |                    |                                                            | (see `details <https://github.com/google-research/bert/blob/master/multilingual.md>`__).                                              |
 |                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``bert-base-multilingual-cased``                           | | (New, **recommended**) 12-layer, 768-hidden, 12-heads, 110M parameters.                                                             |
+|                    | ``bert-base-multilingual-cased``                           | | (New, **recommended**) 12-layer, 768-hidden, 12-heads, 179M parameters.                                                             |
 |                    |                                                            | | Trained on cased text in the top 104 languages with the largest Wikipedias                                                          |
 |                    |                                                            |                                                                                                                                       |
 |                    |                                                            | (see `details <https://github.com/google-research/bert/blob/master/multilingual.md>`__).                                              |
 |                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``bert-base-chinese``                                      | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
+|                    | ``bert-base-chinese``                                      | | 12-layer, 768-hidden, 12-heads, 103M parameters.                                                                                    |
 |                    |                                                            | | Trained on cased Chinese Simplified and Traditional text.                                                                           |
 |                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
 |                    | ``bert-base-german-cased``                                 | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
@@ -38,22 +38,22 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
 |                    |                                                            |                                                                                                                                       |
 |                    |                                                            | (see `details on deepset.ai website <https://deepset.ai/german-bert>`__).                                                             |
 |                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``bert-large-uncased-whole-word-masking``                  | | 24-layer, 1024-hidden, 16-heads, 340M parameters.                                                                                   |
+|                    | ``bert-large-uncased-whole-word-masking``                  | | 24-layer, 1024-hidden, 16-heads, 336M parameters.                                                                                   |
 |                    |                                                            | | Trained on lower-cased English text using Whole-Word-Masking                                                                        |
 |                    |                                                            |                                                                                                                                       |
 |                    |                                                            | (see `details <https://github.com/google-research/bert/#bert>`__).                                                                    |
 |                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``bert-large-cased-whole-word-masking``                    | | 24-layer, 1024-hidden, 16-heads, 340M parameters.                                                                                   |
+|                    | ``bert-large-cased-whole-word-masking``                    | | 24-layer, 1024-hidden, 16-heads, 335M parameters.                                                                                   |
 |                    |                                                            | | Trained on cased English text using Whole-Word-Masking                                                                              |
 |                    |                                                            |                                                                                                                                       |
 |                    |                                                            | (see `details <https://github.com/google-research/bert/#bert>`__).                                                                    |
 |                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``bert-large-uncased-whole-word-masking-finetuned-squad``  | | 24-layer, 1024-hidden, 16-heads, 340M parameters.                                                                                   |
+|                    | ``bert-large-uncased-whole-word-masking-finetuned-squad``  | | 24-layer, 1024-hidden, 16-heads, 336M parameters.                                                                                   |
 |                    |                                                            | | The ``bert-large-uncased-whole-word-masking`` model fine-tuned on SQuAD                                                             |
 |                    |                                                            |                                                                                                                                       |
 |                    |                                                            | (see details of fine-tuning in the `example section <https://github.com/huggingface/transformers/tree/master/examples>`__).           |
 |                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``bert-large-cased-whole-word-masking-finetuned-squad``    | | 24-layer, 1024-hidden, 16-heads, 340M parameters                                                                                    |
+|                    | ``bert-large-cased-whole-word-masking-finetuned-squad``    | | 24-layer, 1024-hidden, 16-heads, 335M parameters.                                                                                   |
 |                    |                                                            | | The ``bert-large-cased-whole-word-masking`` model fine-tuned on SQuAD                                                               |
 |                    |                                                            |                                                                                                                                       |
 |                    |                                                            | (see `details of fine-tuning in the example section <https://huggingface.co/transformers/examples.html>`__)                           |
@@ -73,31 +73,31 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
 |                    |                                                            |                                                                                                                                       |
 |                    |                                                            | (see `details on dbmdz repository <https://github.com/dbmdz/german-bert>`__).                                                         |
 |                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``cl-tohoku/bert-base-japanese``                           | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
+|                    | ``cl-tohoku/bert-base-japanese``                           | | 12-layer, 768-hidden, 12-heads, 111M parameters.                                                                                    |
 |                    |                                                            | | Trained on Japanese text. Text is tokenized with MeCab and WordPiece and this requires some extra dependencies,                     |
 |                    |                                                            | | `fugashi <https://github.com/polm/fugashi>`__ which is a wrapper around `MeCab <https://taku910.github.io/mecab/>`__.               |
 |                    |                                                            | | Use ``pip install transformers["ja"]`` (or ``pip install -e .["ja"]`` if you install from source) to install them.                  |
 |                    |                                                            |                                                                                                                                       |
 |                    |                                                            | (see `details on cl-tohoku repository <https://github.com/cl-tohoku/bert-japanese>`__).                                               |
 |                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``cl-tohoku/bert-base-japanese-whole-word-masking``        | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
+|                    | ``cl-tohoku/bert-base-japanese-whole-word-masking``        | | 12-layer, 768-hidden, 12-heads, 111M parameters.                                                                                    |
 |                    |                                                            | | Trained on Japanese text. Text is tokenized with MeCab and WordPiece and this requires some extra dependencies,                     |
 |                    |                                                            | | `fugashi <https://github.com/polm/fugashi>`__ which is a wrapper around `MeCab <https://taku910.github.io/mecab/>`__.               |
 |                    |                                                            | | Use ``pip install transformers["ja"]`` (or ``pip install -e .["ja"]`` if you install from source) to install them.                  |
 |                    |                                                            |                                                                                                                                       |
 |                    |                                                            | (see `details on cl-tohoku repository <https://github.com/cl-tohoku/bert-japanese>`__).                                               |
 |                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``cl-tohoku/bert-base-japanese-char``                      | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
+|                    | ``cl-tohoku/bert-base-japanese-char``                      | | 12-layer, 768-hidden, 12-heads, 90M parameters.                                                                                     |
 |                    |                                                            | | Trained on Japanese text. Text is tokenized into characters.                                                                        |
 |                    |                                                            |                                                                                                                                       |
 |                    |                                                            | (see `details on cl-tohoku repository <https://github.com/cl-tohoku/bert-japanese>`__).                                               |
 |                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``cl-tohoku/bert-base-japanese-char-whole-word-masking``   | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
+|                    | ``cl-tohoku/bert-base-japanese-char-whole-word-masking``   | | 12-layer, 768-hidden, 12-heads, 90M parameters.                                                                                     |
 |                    |                                                            | | Trained on Japanese text using Whole-Word-Masking. Text is tokenized into characters.                                               |
 |                    |                                                            |                                                                                                                                       |
 |                    |                                                            | (see `details on cl-tohoku repository <https://github.com/cl-tohoku/bert-japanese>`__).                                               |
 |                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``TurkuNLP/bert-base-finnish-cased-v1``                    | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
+|                    | ``TurkuNLP/bert-base-finnish-cased-v1``                    | | 12-layer, 768-hidden, 12-heads, 125M parameters.                                                                                    |
 |                    |                                                            | | Trained on cased Finnish text.                                                                                                      |
 |                    |                                                            |                                                                                                                                       |
 |                    |                                                            | (see `details on turkunlp.org <http://turkunlp.org/FinBERT/>`__).                                                                     |
@@ -294,10 +294,10 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
 |                    | ``t5-11B``                                                 | | ~11B parameters with 24-layers, 1024-hidden-state, 65536 feed-forward hidden-state, 128-heads,                                      |
 |                    |                                                            | | Trained on English text: the Colossal Clean Crawled Corpus (C4)                                                                     |
 +--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-| XLM-RoBERTa        | ``xlm-roberta-base``                                       | | ~125M parameters with 12-layers, 768-hidden-state, 3072 feed-forward hidden-state, 8-heads,                                         |
+| XLM-RoBERTa        | ``xlm-roberta-base``                                       | | ~270M parameters with 12-layers, 768-hidden-state, 3072 feed-forward hidden-state, 8-heads,                                         |
 |                    |                                                            | | Trained on 2.5 TB of newly created clean CommonCrawl data in 100 languages                                                          |
 |                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                    | ``xlm-roberta-large``                                      | | ~355M parameters with 24-layers, 1027-hidden-state, 4096 feed-forward hidden-state, 16-heads,                                       |
+|                    | ``xlm-roberta-large``                                      | | ~550M parameters with 24-layers, 1024-hidden-state, 4096 feed-forward hidden-state, 16-heads,                                       |
 |                    |                                                            | | Trained on 2.5 TB of newly created clean CommonCrawl data in 100 languages                                                          |
 +--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
 | FlauBERT           | ``flaubert/flaubert_small_cased``                          | | 6-layer, 512-hidden, 8-heads, 54M parameters                                                                                        |
@@ -415,4 +415,24 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
 |                    | ``microsoft/layoutlm-large-uncased``                       | | 24 layers, 1024-hidden, 16-heads, 343M parameters                                                                                   |
 |                    |                                                            |                                                                                                                                       |
 |                    |                                                            | (see `details <https://github.com/microsoft/unilm/tree/master/layoutlm>`__)                                                           |
-+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+| DeBERTa            | ``microsoft/deberta-base``                                 | | 12-layer, 768-hidden, 12-heads, ~125M parameters                                                                                    |
+|                    |                                                            | | DeBERTa using the BERT-base architecture                                                                                            |
+|                    |                                                            |                                                                                                                                       |
+|                    |                                                            | (see `details <https://github.com/microsoft/DeBERTa>`__)                                                                              |
+|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                    | ``microsoft/deberta-large``                                | | 24-layer, 1024-hidden, 16-heads, ~390M parameters                                                                                   |
+|                    |                                                            | | DeBERTa using the BERT-large architecture                                                                                           |
+|                    |                                                            |                                                                                                                                       |
+|                    |                                                            | (see `details <https://github.com/microsoft/DeBERTa>`__)                                                                              |
+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+| SqueezeBERT        | ``squeezebert/squeezebert-uncased``                        | | 12-layer, 768-hidden, 12-heads, 51M parameters, 4.3x faster than bert-base-uncased on a smartphone.                                 |
+|                    |                                                            | | SqueezeBERT architecture pretrained from scratch on masked language model (MLM) and sentence order prediction (SOP) tasks.          |
+|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                    | ``squeezebert/squeezebert-mnli``                           | | 12-layer, 768-hidden, 12-heads, 51M parameters, 4.3x faster than bert-base-uncased on a smartphone.                                 |
+|                    |                                                            | | This is the squeezebert-uncased model finetuned on MNLI sentence pair classification task with distillation from electra-base.      |
+|                    +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                    | ``squeezebert/squeezebert-mnli-headless``                  | | 12-layer, 768-hidden, 12-heads, 51M parameters, 4.3x faster than bert-base-uncased on a smartphone.                                 |
+|                    |                                                            | | This is the squeezebert-uncased model finetuned on MNLI sentence pair classification task with distillation from electra-base.      |
+|                    |                                                            | | The final classification layer is removed, so when you finetune, the final layer will be reinitialized.                             |
+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
--- a/docs/source/task_summary.rst
+++ b/docs/source/task_summary.rst
@@ -89,7 +89,7 @@ of each other. The process is the following:
    >>> import torch

    >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
-    >>> model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")
+    >>> model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc", return_dict=True)

    >>> classes = ["not paraphrase", "is paraphrase"]

@@ -122,7 +122,7 @@ of each other. The process is the following:
    >>> import tensorflow as tf

    >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
-    >>> model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")
+    >>> model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc", return_dict=True)

    >>> classes = ["not paraphrase", "is paraphrase"]

@@ -213,7 +213,7 @@ Here is an example of question answering using a model and a tokenizer. The proc
    >>> import torch

    >>> tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
-    >>> model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
+    >>> model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad", return_dict=True)

    >>> text = r"""
    ... 🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
@@ -255,7 +255,7 @@ Here is an example of question answering using a model and a tokenizer. The proc
    >>> import tensorflow as tf

    >>> tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
-    >>> model = TFAutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
+    >>> model = TFAutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad", return_dict=True)

    >>> text = r"""
    ... 🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
@@ -378,7 +378,7 @@ Here is an example of doing masked language modeling using a model and a tokeniz
    >>> import torch

    >>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
-    >>> model = AutoModelWithLMHead.from_pretrained("distilbert-base-cased")
+    >>> model = AutoModelWithLMHead.from_pretrained("distilbert-base-cased", return_dict=True)

    >>> sequence = f"Distilled models are smaller than the models they mimic. Using them instead of the large versions would help {tokenizer.mask_token} our carbon footprint."

@@ -394,7 +394,7 @@ Here is an example of doing masked language modeling using a model and a tokeniz
    >>> import tensorflow as tf

    >>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
-    >>> model = TFAutoModelWithLMHead.from_pretrained("distilbert-base-cased")
+    >>> model = TFAutoModelWithLMHead.from_pretrained("distilbert-base-cased", return_dict=True)

    >>> sequence = f"Distilled models are smaller than the models they mimic. Using them instead of the large versions would help {tokenizer.mask_token} our carbon footprint."

@@ -439,7 +439,7 @@ Here is an example of using the tokenizer and model and leveraging the :func:`~t
    >>> from torch.nn import functional as F

    >>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
-    >>> model = AutoModelWithLMHead.from_pretrained("gpt2")
+    >>> model = AutoModelWithLMHead.from_pretrained("gpt2", return_dict=True)

    >>> sequence = f"Hugging Face is based in DUMBO, New York City, and "

@@ -463,7 +463,7 @@ Here is an example of using the tokenizer and model and leveraging the :func:`~t
    >>> import tensorflow as tf

    >>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
-    >>> model = TFAutoModelWithLMHead.from_pretrained("gpt2")
+    >>> model = TFAutoModelWithLMHead.from_pretrained("gpt2", return_dict=True)

    >>> sequence = f"Hugging Face is based in DUMBO, New York City, and "

@@ -517,7 +517,7 @@ Here is an example of text generation using ``XLNet`` and its tokenzier.
    >>> ## PYTORCH CODE
    >>> from transformers import AutoModelWithLMHead, AutoTokenizer

-    >>> model = AutoModelWithLMHead.from_pretrained("xlnet-base-cased")
+    >>> model = AutoModelWithLMHead.from_pretrained("xlnet-base-cased", return_dict=True)
    >>> tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")

    >>> # Padding text helps XLNet with short prompts - proposed by Aman Rusia in https://github.com/rusiaaman/XLNet-gen#methodology
@@ -542,7 +542,7 @@ Here is an example of text generation using ``XLNet`` and its tokenzier.
    >>> ## TENSORFLOW CODE
    >>> from transformers import TFAutoModelWithLMHead, AutoTokenizer

-    >>> model = TFAutoModelWithLMHead.from_pretrained("xlnet-base-cased")
+    >>> model = TFAutoModelWithLMHead.from_pretrained("xlnet-base-cased", return_dict=True)
    >>> tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")

    >>> # Padding text helps XLNet with short prompts - proposed by Aman Rusia in https://github.com/rusiaaman/XLNet-gen#methodology
@@ -659,7 +659,7 @@ Here is an example of doing named entity recognition, using a model and a tokeni
    >>> from transformers import AutoModelForTokenClassification, AutoTokenizer
    >>> import torch

-    >>> model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
+    >>> model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english", return_dict=True)
    >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

    >>> label_list = [
@@ -687,7 +687,7 @@ Here is an example of doing named entity recognition, using a model and a tokeni
    >>> from transformers import TFAutoModelForTokenClassification, AutoTokenizer
    >>> import tensorflow as tf

-    >>> model = TFAutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
+    >>> model = TFAutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english", return_dict=True)
    >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

    >>> label_list = [
@@ -758,8 +758,8 @@ Here is an example of using the pipelines to do summarization. It leverages a Ba
    ... If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
    ... """

-Because the summarization pipeline depends on the ``PretrainedModel.generate()`` method, we can override the default arguments
-of ``PretrainedModel.generate()`` directly in the pipeline for ``max_length`` and ``min_length`` as shown below.
+Because the summarization pipeline depends on the ``PreTrainedModel.generate()`` method, we can override the default arguments
+of ``PreTrainedModel.generate()`` directly in the pipeline for ``max_length`` and ``min_length`` as shown below.
 This outputs the following summary:

 .. code-block::
@@ -772,7 +772,7 @@ Here is an example of doing summarization using a model and a tokenizer. The pro
 1. Instantiate a tokenizer and a model from the checkpoint name. Summarization is usually done using an encoder-decoder model, such as ``Bart`` or ``T5``.
 2. Define the article that should be summarized.
 3. Add the T5 specific prefix "summarize: ".
-4. Use the ``PretrainedModel.generate()`` method to generate the summary.
+4. Use the ``PreTrainedModel.generate()`` method to generate the summary.

 In this example we use Google's T5 model. Even though it was pre-trained only on a multi-task mixed dataset (including CNN / Daily Mail), it yields very good results.

@@ -781,7 +781,7 @@ In this example we use Google`s T5 model. Even though it was pre-trained only on
    >>> ## PYTORCH CODE
    >>> from transformers import AutoModelWithLMHead, AutoTokenizer

-    >>> model = AutoModelWithLMHead.from_pretrained("t5-base")
+    >>> model = AutoModelWithLMHead.from_pretrained("t5-base", return_dict=True)
    >>> tokenizer = AutoTokenizer.from_pretrained("t5-base")

    >>> # T5 uses a max_length of 512 so we cut the article to 512 tokens.
@@ -790,7 +790,7 @@ In this example we use Google`s T5 model. Even though it was pre-trained only on
    >>> ## TENSORFLOW CODE
    >>> from transformers import TFAutoModelWithLMHead, AutoTokenizer

-    >>> model = TFAutoModelWithLMHead.from_pretrained("t5-base")
+    >>> model = TFAutoModelWithLMHead.from_pretrained("t5-base", return_dict=True)
    >>> tokenizer = AutoTokenizer.from_pretrained("t5-base")

    >>> # T5 uses a max_length of 512 so we cut the article to 512 tokens.
@@ -819,22 +819,22 @@ translation results.
    >>> print(translator("Hugging Face is a technology company based in New York and Paris", max_length=40))
    [{'translation_text': 'Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.'}]

-Because the translation pipeline depends on the ``PretrainedModel.generate()`` method, we can override the default arguments
-of ``PretrainedModel.generate()`` directly in the pipeline as is shown for ``max_length`` above.
+Because the translation pipeline depends on the ``PreTrainedModel.generate()`` method, we can override the default arguments
+of ``PreTrainedModel.generate()`` directly in the pipeline as is shown for ``max_length`` above.

 Here is an example of doing translation using a model and a tokenizer. The process is the following:

 1. Instantiate a tokenizer and a model from the checkpoint name. Translation is usually done using an encoder-decoder model, such as ``Bart`` or ``T5``.
 2. Define the text that should be translated.
 3. Add the T5 specific prefix "translate English to German: "
-4. Use the ``PretrainedModel.generate()`` method to perform the translation.
+4. Use the ``PreTrainedModel.generate()`` method to perform the translation.

 .. code-block::

    >>> ## PYTORCH CODE
    >>> from transformers import AutoModelWithLMHead, AutoTokenizer

-    >>> model = AutoModelWithLMHead.from_pretrained("t5-base")
+    >>> model = AutoModelWithLMHead.from_pretrained("t5-base", return_dict=True)
    >>> tokenizer = AutoTokenizer.from_pretrained("t5-base")

    >>> inputs = tokenizer.encode("translate English to German: Hugging Face is a technology company based in New York and Paris", return_tensors="pt")
@@ -842,7 +842,7 @@ Here is an example of doing translation using a model and a tokenizer. The proce
    >>> ## TENSORFLOW CODE
    >>> from transformers import TFAutoModelWithLMHead, AutoTokenizer

-    >>> model = TFAutoModelWithLMHead.from_pretrained("t5-base")
+    >>> model = TFAutoModelWithLMHead.from_pretrained("t5-base", return_dict=True)
    >>> tokenizer = AutoTokenizer.from_pretrained("t5-base")

    >>> inputs = tokenizer.encode("translate English to German: Hugging Face is a technology company based in New York and Paris", return_tensors="tf")
--- a/docs/source/testing.rst
+++ b/docs/source/testing.rst
@@ -22,12 +22,12 @@ How transformers are tested

   * `self-hosted (push) <https://github.com/huggingface/transformers/blob/master/.github/workflows/self-push.yml>`__: runs fast tests on GPU only on commits on ``master``. It only runs if a commit on ``master`` has updated the code in one of the following folders: ``src``, ``tests``, ``.github`` (to prevent running on added model cards, notebooks, etc.)
     
-   * `self-hosted runner <https://github.com/huggingface/transformers/blob/master/.github/workflows/self-scheduled.yml>`__: runs slow tests on ``tests`` and ``examples``:
+   * `self-hosted runner <https://github.com/huggingface/transformers/blob/master/.github/workflows/self-scheduled.yml>`__: runs normal and slow tests on GPU in ``tests`` and ``examples``:

   .. code-block:: bash

-    RUN_SLOW=1 USE_CUDA=1 pytest tests/
-    RUN_SLOW=1 USE_CUDA=1 pytest examples/
+    RUN_SLOW=1 pytest tests/
+    RUN_SLOW=1 pytest examples/

   The results can be observed `here <https://github.com/huggingface/transformers/actions>`__.

@@ -393,36 +393,53 @@ On a GPU-enabled setup, to test in CPU-only mode add ``CUDA_VISIBLE_DEVICES=""``
                
    CUDA_VISIBLE_DEVICES="" pytest tests/test_logging.py

-or if you have multiple gpus, you can tell which one to use in this test session, e.g. to use only the second gpu if you have gpus ``0`` and ``1``, you can run:
+or if you have multiple GPUs, you can specify which one ``pytest`` should use. For example, to use only the second GPU when you have GPUs ``0`` and ``1``, run:

 .. code-block:: bash
                
    CUDA_VISIBLE_DEVICES="1" pytest tests/test_logging.py

 This is handy when you want to run different tasks on different GPUs.
-    
-And we have these decorators that require the condition described by the marker.

-``
-@require_torch
-@require_tf
-@require_multigpu
-@require_non_multigpu
-@require_torch_tpu
-@require_torch_and_cuda
-``
+Some tests must be run on CPU only, others on either CPU, GPU, or TPU, and yet others on multiple GPUs. The following skip decorators specify a test's CPU/GPU/TPU requirements:
+
+* ``require_torch`` - this test will run only under torch
+* ``require_torch_gpu`` - as ``require_torch`` plus requires at least 1 GPU
+* ``require_torch_multigpu`` - as ``require_torch`` plus requires at least 2 GPUs
+* ``require_torch_non_multigpu`` - as ``require_torch`` plus requires 0 or 1 GPUs
+* ``require_torch_tpu`` - as ``require_torch`` plus requires at least 1 TPU
+
+For example, here is a test that must be run only when there are 2 or more GPUs available and pytorch is installed:
+
+.. code-block:: python
+
+    @require_torch_multigpu
+    def test_example_with_multigpu():
+
+If a test requires ``tensorflow`` use the ``require_tf`` decorator. For example:
+
+.. code-block:: python
+
+    @require_tf
+    def test_tf_thing_with_tensorflow():
+
+These decorators can be stacked. For example, if a test is slow and requires at least one GPU under pytorch, here is how to set it up:
+
+.. code-block:: python
+
+    @require_torch_gpu
+    @slow
+    def test_example_slow_on_gpu():

 Some decorators like ``@parameterized`` rewrite test names, therefore ``@require_*`` skip decorators have to be listed last for them to work correctly. Here is an example of the correct usage:

 .. code-block:: python

    @parameterized.expand(...)
-    @require_multigpu
+    @require_torch_multigpu
    def test_integration_foo():
-    
-There is no problem whatsoever with ``@pytest.mark.parametrize`` (but it only works with non-unittests) - can use it in any order.

-This section will be expanded soon once our work in progress on those decorators is finished.
+This ordering problem doesn't exist with ``@pytest.mark.parametrize``: you can put it first or last and it will still work. However, it only works with non-unittests.

 Inside tests:

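The skip decorators described in this hunk can be understood as thin wrappers around ``unittest`` skipping. The following is a minimal sketch under that assumption, not the library's implementation: ``require_module`` and ``slow_sketch`` are hypothetical names introduced only for illustration, and the ``RUN_SLOW`` check mirrors the environment variable used in the commands above.

```python
import importlib.util
import os
import unittest


def require_module(name):
    """Hypothetical helper: skip the decorated test unless `name` is importable."""
    available = importlib.util.find_spec(name) is not None
    return unittest.skipUnless(available, f"test requires {name}")


def slow_sketch(test_case):
    """Hypothetical stand-in for a slow marker: run only when RUN_SLOW=1 is set."""
    run_slow = os.environ.get("RUN_SLOW", "0") == "1"
    return unittest.skipUnless(run_slow, "test is slow")(test_case)


class ExampleTest(unittest.TestCase):
    @require_module("json")  # stdlib module, always importable
    def test_runs(self):
        self.assertTrue(True)

    @require_module("surely_not_an_installed_module_xyz")
    def test_skipped_missing_module(self):
        self.fail("never reached, the test is skipped")

    @require_module("json")  # decorators stack: both conditions must hold
    @slow_sketch
    def test_stacked(self):
        self.assertTrue(True)


def run_example():
    """Run the example tests and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleTest)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Calling ``run_example()`` reports ``test_skipped_missing_module`` as skipped in every environment, while ``test_stacked`` runs only when ``RUN_SLOW=1`` is set.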
--- a/docs/source/training.rst
+++ b/docs/source/training.rst
@@ -109,9 +109,9 @@ The following is equivalent to the previous example:
 .. code-block:: python

    from torch.nn import functional as F
-    labels = torch.tensor([1,0]).unsqueeze(0)
+    labels = torch.tensor([1,0])
    outputs = model(input_ids, attention_mask=attention_mask)
-    loss = F.cross_entropy(labels, outputs.logitd)
+    loss = F.cross_entropy(outputs.logits, labels)
    loss.backward()
    optimizer.step()

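The corrected call passes the logits first and the labels second, and the labels are a flat vector with one class index per example. As a rough illustration of why those shapes line up, here is a dependency-free sketch of the same loss; the ``cross_entropy`` function below is a hypothetical pure-Python stand-in written for this note, not the ``torch`` implementation.

```python
import math


def cross_entropy(logits, labels):
    """Mean negative log-likelihood.

    logits: N rows of C unnormalized scores; labels: N class indices.
    """
    losses = []
    for row, label in zip(logits, labels):
        m = max(row)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(x - m) for x in row))
        losses.append(log_norm - row[label])
    return sum(losses) / len(losses)


# Two examples, two classes: logits are shaped (2, 2), labels are shaped (2,).
logits = [[0.0, 0.0], [2.0, 0.0]]
labels = [1, 0]
loss = cross_entropy(logits, labels)
```

Each label indexes into its own row of logits, which is why an extra leading dimension on ``labels`` no longer matches.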
--- a/examples/README.md
+++ b/examples/README.md
@@ -47,9 +47,7 @@ pip install -r ./examples/requirements.txt

 ## One-click Deploy to Cloud (wip)

-#### Azure
-
-[![Deploy to Azure](https://aka.ms/deploytoazurebutton)](https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2FAzure%2Fazure-quickstart-templates%2Fmaster%2F101-storage-account-create%2Fazuredeploy.json)
+**Coming soon!**

 ## Running on TPUs

--- a/examples/conftest.py
+++ b/examples/conftest.py
@@ -2,6 +2,7 @@
 # by pytest before any tests are run

 import sys
+import warnings
 from os.path import abspath, dirname, join


@@ -9,3 +10,7 @@ from os.path import abspath, dirname, join
 # 'pip install -e .[dev]' when switching between checkouts and running tests.
 git_repo_path = abspath(join(dirname(dirname(__file__)), "src"))
 sys.path.insert(1, git_repo_path)
+
+# silence FutureWarning warnings in tests since often we can't act on them until
+# they become normal warnings - i.e. the tests still need to test the current functionality
+warnings.simplefilter(action="ignore", category=FutureWarning)
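The effect of the filter added above can be checked in isolation with the standard library. This is a small sketch assuming nothing beyond the ``warnings`` module; ``emit_warnings`` is a hypothetical helper that stands in for code under test.

```python
import warnings


def emit_warnings():
    """Hypothetical code under test that raises two kinds of warnings."""
    warnings.warn("old API, will change", FutureWarning)
    warnings.warn("something else", UserWarning)


with warnings.catch_warnings(record=True) as caught:
    # Start from a known state: record every warning.
    warnings.simplefilter("always")
    # Same call as in conftest.py; it is inserted ahead of the filter above.
    warnings.simplefilter(action="ignore", category=FutureWarning)
    emit_warnings()

categories = [w.category for w in caught]
```

Only the ``UserWarning`` is recorded, so other warning categories remain visible in test output.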
--- a/examples/language-modeling/run_language_modeling.py
+++ b/examples/language-modeling/run_language_modeling.py
@@ -24,8 +24,11 @@ import logging
 import math
 import os
 from dataclasses import dataclass, field
+from glob import glob
 from typing import Optional

+from torch.utils.data import ConcatDataset
+
 from transformers import (
    CONFIG_MAPPING,
    MODEL_WITH_LM_HEAD_MAPPING,
@@ -87,6 +90,13 @@ class DataTrainingArguments:
    train_data_file: Optional[str] = field(
        default=None, metadata={"help": "The input training data file (a text file)."}
    )
+    train_data_files: Optional[str] = field(
+        default=None,
+        metadata={
+            "help": "The input training data files (multiple files in glob format). "
+            "Splitting large files into smaller ones often prevents the tokenizer from running out of memory."
+        },
+    )
    eval_data_file: Optional[str] = field(
        default=None,
        metadata={"help": "An optional input evaluation data file to evaluate the perplexity on (a text file)."},
@@ -131,17 +141,24 @@ def get_dataset(
    evaluate: bool = False,
    cache_dir: Optional[str] = None,
 ):
-    file_path = args.eval_data_file if evaluate else args.train_data_file
-    if args.line_by_line:
-        return LineByLineTextDataset(tokenizer=tokenizer, file_path=file_path, block_size=args.block_size)
+    def _dataset(file_path):
+        if args.line_by_line:
+            return LineByLineTextDataset(tokenizer=tokenizer, file_path=file_path, block_size=args.block_size)
+        else:
+            return TextDataset(
+                tokenizer=tokenizer,
+                file_path=file_path,
+                block_size=args.block_size,
+                overwrite_cache=args.overwrite_cache,
+                cache_dir=cache_dir,
+            )
+
+    if evaluate:
+        return _dataset(args.eval_data_file)
+    elif args.train_data_files:
+        return ConcatDataset([_dataset(f) for f in glob(args.train_data_files)])
    else:
-        return TextDataset(
-            tokenizer=tokenizer,
-            file_path=file_path,
-            block_size=args.block_size,
-            overwrite_cache=args.overwrite_cache,
-            cache_dir=cache_dir,
-        )
+        return _dataset(args.train_data_file)


 def main():
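The new ``train_data_files`` branch expands a glob pattern and concatenates one dataset per matching file. The sketch below shows the same pattern with the standard library only; ``load_lines`` and ``load_many`` are hypothetical stand-ins for the per-file dataset and for the concatenation step, and sorting the matches is an added assumption to make the order deterministic.

```python
import os
import tempfile
from glob import glob


def load_lines(file_path):
    """Hypothetical stand-in for a per-file dataset: one example per non-empty line."""
    with open(file_path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def load_many(pattern):
    """Expand a glob pattern and concatenate the per-file examples."""
    examples = []
    for path in sorted(glob(pattern)):  # glob does not guarantee an order
        examples.extend(load_lines(path))
    return examples


with tempfile.TemporaryDirectory() as tmp:
    for i, text in enumerate(["a\nb\n", "c\n"]):
        with open(os.path.join(tmp, f"shard-{i}.txt"), "w", encoding="utf-8") as f:
            f.write(text)
    examples = load_many(os.path.join(tmp, "shard-*.txt"))
    no_match = load_many(os.path.join(tmp, "missing-*.txt"))
```

A pattern that matches nothing produces an empty result, so a caller may want to check for that case. On the command line the pattern generally needs quoting so that the shell passes it through unexpanded.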
--- a/examples/lightning_base.py
+++ b/examples/lightning_base.py
@@ -119,7 +119,7 @@ class BaseTransformer(pl.LightningModule):
    def get_lr_scheduler(self):
        get_schedule_func = arg_to_scheduler[self.hparams.lr_scheduler]
        scheduler = get_schedule_func(
-            self.opt, num_warmup_steps=self.hparams.warmup_steps, num_training_steps=self.total_steps
+            self.opt, num_warmup_steps=self.hparams.warmup_steps, num_training_steps=self.total_steps()
        )
        scheduler = {"scheduler": scheduler, "interval": "step", "frequency": 1}
        return scheduler
@@ -159,19 +159,20 @@ class BaseTransformer(pl.LightningModule):
    def test_epoch_end(self, outputs):
        return self.validation_end(outputs)

-    @property
    def total_steps(self) -> int:
        """The number of total training steps that will be run. Used for lr scheduler purposes."""
        num_devices = max(1, self.hparams.gpus)  # TODO: consider num_tpu_cores
        effective_batch_size = self.hparams.train_batch_size * self.hparams.accumulate_grad_batches * num_devices
-        dataset_size = len(self.train_loader.dataset)
-        return (dataset_size / effective_batch_size) * self.hparams.max_epochs
+        return (self.dataset_size / effective_batch_size) * self.hparams.max_epochs

    def setup(self, mode):
-        if mode == "fit":
+        if mode == "test":
+            self.dataset_size = len(self.test_dataloader().dataset)
+        else:
            self.train_loader = self.get_dataloader("train", self.hparams.train_batch_size, shuffle=True)
+            self.dataset_size = len(self.train_loader.dataset)

-    def get_dataloader(self, type_path, batch_size, shuffle=False):
+    def get_dataloader(self, type_path: str, batch_size: int, shuffle: bool = False):
        raise NotImplementedError("You must implement this for your task")

    def train_dataloader(self):
@@ -290,7 +291,8 @@ class LoggingCallback(pl.Callback):


 def add_generic_args(parser, root_dir) -> None:
-    #  TODO(SS): allow all pl args? parser = pl.Trainer.add_argparse_args(parser)
+    #  To allow all pl args uncomment the following line
+    #  parser = pl.Trainer.add_argparse_args(parser)
    parser.add_argument(
        "--output_dir",
        default=None,
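The ``total_steps`` computation in this file is plain arithmetic over the hyperparameters. Here is a standalone sketch of the same formula, with the ``hparams`` fields passed as ordinary arguments (the example values are made up for illustration):

```python
def total_steps(dataset_size, train_batch_size, accumulate_grad_batches, gpus, max_epochs):
    """Number of optimizer steps over the whole run, as used for the lr scheduler."""
    num_devices = max(1, gpus)  # CPU-only runs (gpus=0) count as one device
    effective_batch_size = train_batch_size * accumulate_grad_batches * num_devices
    return (dataset_size / effective_batch_size) * max_epochs


steps = total_steps(
    dataset_size=10_000,
    train_batch_size=8,
    accumulate_grad_batches=4,
    gpus=2,
    max_epochs=3,
)
```

Because the formula uses true division, the result is a float and can be fractional; a caller that needs a whole number of steps would have to round it.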
--- a/examples/rag/README.md
+++ b/examples/rag/README.md
@@ -65,26 +65,41 @@ Does He Love You	Does He Love You	Red Sandy Spika dress of Reba McEntire	Greates
 We demonstrate how to evaluate retrieval against DPR evaluation data. You can download respective files from links listed [here](https://github.com/facebookresearch/DPR/blob/master/data/download_data.py#L39-L45).

 1. Download and unzip the gold data file. We use the `biencoder-nq-dev` from https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-nq-dev.json.gz.
+    ```bash
+    wget https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-nq-dev.json.gz && gzip -d biencoder-nq-dev.json.gz
+   ```
+
 2. Parse the unzipped file using `parse_dpr_relevance_data.py`:
    ```bash
+    mkdir output # or wherever you want to save this
    python examples/rag/parse_dpr_relevance_data.py \
-        --src_path path/to/unziped/biencoder-nq-dev.json \
-        --evaluation_set path/to/output/biencoder-nq-dev.questions \
-        --gold_data_path path/to/output/biencoder-nq-dev.pages
+        --src_path biencoder-nq-dev.json \
+        --evaluation_set output/biencoder-nq-dev.questions \
+        --gold_data_path output/biencoder-nq-dev.pages
    ```
 3. Run evaluation:
-    ```bash
+    ```bash    
+    python examples/rag/eval_rag.py \
+        --model_name_or_path facebook/rag-sequence-nq \
+        --model_type rag_sequence \
+        --evaluation_set output/biencoder-nq-dev.questions \
+        --gold_data_path output/biencoder-nq-dev.pages \
+        --predictions_path output/retrieval_preds.tsv  \
+        --eval_mode retrieval \
+        --k 1
+    ```
+   ```bash
+   # EXPLANATION
    python examples/rag/eval_rag.py \
        --model_name_or_path facebook/rag-sequence-nq \ # model name or path of the model we're evaluating
        --model_type rag_sequence \ # RAG model type (rag_token or rag_sequence)
-        --evaluation_set path/to/output/biencoder-nq-dev.questions \ # an input dataset for evaluation
-        --gold_data_path path/to/output/biencoder-nq-dev.pages \ # a dataset containing ground truth answers for samples from the evaluation_set
-        --predictions_path path/to/retrieval_preds.tsv  \ # name of file where predictions will be stored
+        --evaluation_set output/biencoder-nq-dev.questions \ # an input dataset for evaluation
+        --gold_data_path output/biencoder-nq-dev.pages \ # a dataset containing ground truth answers for samples from the evaluation_set
+        --predictions_path output/retrieval_preds.tsv  \ # name of file where predictions will be stored
        --eval_mode retrieval \ # indicates whether we're performing retrieval evaluation or e2e evaluation
        --k 1 # parameter k for the precision@k metric
+   
    ```
-
-
 ## End-to-end evaluation

 We support two formats of the gold data file (controlled by the `gold_data_mode` parameter):
@@ -97,7 +112,9 @@ who is the owner of reading football club	['Xiu Li Dai', 'Dai Yongge', 'Dai Xiul
 Xiu Li Dai
 ```

-Predictions of the model for the samples from the `evaluation_set` will be saved under the path specified by the `predictions_path` parameter. If this path already exists, the script will use saved predictions to calculate metrics. Add `--recalculate` parameter to force the script to perform inference from scratch.
+Predictions of the model for the samples from the `evaluation_set` will be saved under the path specified by the `predictions_path` parameter. 
+If this path already exists, the script will use saved predictions to calculate metrics. 
+Add `--recalculate` parameter to force the script to perform inference from scratch.

 An example e2e evaluation run could look as follows:
 ```bash
--- a/examples/rag/init.py
+++ b/examples/rag/init.py
@@ -0,0 +1,5 @@
+import os
+import sys
+
+
+sys.path.insert(1, os.path.dirname(os.path.realpath(__file__)))
--- a/examples/rag/distributed_retriever.py
+++ b/examples/rag/distributed_retriever.py
@@ -27,13 +27,18 @@ class RagPyTorchDistributedRetriever(RagRetriever):
            It is used to decode the question and then use the generator_tokenizer.
        generator_tokenizer (:class:`~transformers.PretrainedTokenizer`):
            The tokenizer used for the generator part of the RagModel.
+        index (:class:`~transformers.retrieval_rag.Index`, optional, defaults to the one defined by the configuration):
+            If specified, use this index instead of the one built using the configuration
    """

    _init_retrieval = False

-    def __init__(self, config, question_encoder_tokenizer, generator_tokenizer):
+    def __init__(self, config, question_encoder_tokenizer, generator_tokenizer, index=None):
        super().__init__(
-            config, question_encoder_tokenizer=question_encoder_tokenizer, generator_tokenizer=generator_tokenizer
+            config,
+            question_encoder_tokenizer=question_encoder_tokenizer,
+            generator_tokenizer=generator_tokenizer,
+            index=index,
        )

        self.process_group = None
--- a/examples/rag/eval_rag.py
+++ b/examples/rag/eval_rag.py
@@ -15,7 +15,7 @@ from transformers import logging as transformers_logging


 sys.path.append(os.path.join(os.getcwd()))  # noqa: E402 # isort:skip
-from examples.rag.utils import exact_match_score, f1_score  # noqa: E402 # isort:skip
+from utils import exact_match_score, f1_score  # noqa: E402 # isort:skip


 logger = logging.getLogger(__name__)
@@ -72,7 +72,7 @@ def get_precision_at_k(args, preds_path, gold_data_path):
    em = total = 0
    for hypo, reference in zip(hypos, references):
        hypo_provenance = set(hypo.split("\t")[:k])
-        ref_provenance = set(reference.split("\t")[1 : (k + 1)])
+        ref_provenance = set(reference.split("\t"))
        total += 1
        em += len(hypo_provenance & ref_provenance) / k
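Note on the hunk above: with the `[1 : (k + 1)]` slice removed, every tab-separated field of a gold line counts as a relevant title, while the prediction side is still cut to its first `k` entries. A minimal standalone sketch of that computation follows. The function name `precision_at_k`, the sample titles, and returning a plain fraction are assumptions made for illustration; they are not taken from `eval_rag.py`.

```python
from typing import Iterable


def precision_at_k(hypos: Iterable[str], references: Iterable[str], k: int) -> float:
    """Mean fraction of the top-k predicted titles found in the gold titles."""
    score = 0.0
    total = 0
    for hypo, reference in zip(hypos, references):
        # Predictions: keep only the first k tab-separated titles.
        hypo_provenance = set(hypo.split("\t")[:k])
        # Gold data: every tab-separated field is treated as a relevant title.
        ref_provenance = set(reference.split("\t"))
        total += 1
        score += len(hypo_provenance & ref_provenance) / k
    return score / total if total else 0.0


# Hypothetical sample lines, one per question.
hypos = ["Aaron\tMoses", "Pikachu\tPokémon"]
references = ["Aaron\tMiriam", "Pokémon"]

print(precision_at_k(hypos, references, k=1))
print(precision_at_k(hypos, references, k=2))
```

Because the overlap is divided by `k` and not by the number of gold titles, a question with a single gold page can score at most `1 / k`.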

--- a/examples/rag/finetune.py
+++ b/examples/rag/finetune.py
@@ -31,16 +31,13 @@ from transformers import (
 from transformers import logging as transformers_logging


-sys.path.append(os.path.join(os.getcwd()))  # noqa: E402 # noqa: E402 # isort:skip
-
-from examples.lightning_base import BaseTransformer, add_generic_args, generic_train  # noqa: E402 # isort:skip
-from examples.rag.callbacks import (  # noqa: E402 # isort:skip
+from callbacks import (  # noqa: E402 # isort:skip
    get_checkpoint_callback,
    get_early_stopping_callback,
    Seq2SeqLoggingCallback,
 )
-from examples.rag.distributed_retriever import RagPyTorchDistributedRetriever  # noqa: E402 # isort:skip
-from examples.rag.utils import (  # noqa: E402 # isort:skip
+from distributed_retriever import RagPyTorchDistributedRetriever  # noqa: E402 # isort:skip
+from utils import (  # noqa: E402 # isort:skip
    calculate_exact_match,
    flatten_list,
    get_git_info,
@@ -53,6 +50,11 @@ from examples.rag.utils import (  # noqa: E402 # isort:skip
    Seq2SeqDataset,
 )

+# need the parent dir module
+sys.path.insert(2, str(Path(__file__).resolve().parents[1]))
+from lightning_base import BaseTransformer, add_generic_args, generic_train  # noqa
+
+
 logging.basicConfig(level=logging.INFO)
 logger = logging.getLogger(__name__)

@@ -88,6 +90,11 @@ class GenerativeQAModule(BaseTransformer):
        config_class = RagConfig if self.is_rag_model else AutoConfig
        config = config_class.from_pretrained(hparams.model_name_or_path)

+        # set retriever parameters
+        config.index_name = args.index_name or config.index_name
+        config.passages_path = args.passages_path or config.passages_path
+        config.index_path = args.index_path or config.index_path
+
        # set extra_model_params for generator configs and load_model
        extra_model_params = ("encoder_layerdrop", "decoder_layerdrop", "attention_dropout", "dropout")
        if self.is_rag_model:
@@ -95,7 +102,7 @@ class GenerativeQAModule(BaseTransformer):
                config.generator.prefix = args.prefix
            config.label_smoothing = hparams.label_smoothing
            hparams, config.generator = set_extra_model_params(extra_model_params, hparams, config.generator)
-            retriever = RagPyTorchDistributedRetriever.from_pretrained(hparams.model_name_or_path)
+            retriever = RagPyTorchDistributedRetriever.from_pretrained(hparams.model_name_or_path, config=config)
            model = self.model_class.from_pretrained(hparams.model_name_or_path, config=config, retriever=retriever)
            prefix = config.question_encoder.prefix
        else:
@@ -403,6 +410,28 @@ class GenerativeQAModule(BaseTransformer):
        )
        return parser

+    @staticmethod
+    def add_retriever_specific_args(parser):
+        parser.add_argument(
+            "--index_name",
+            type=str,
+            default=None,
+            help="Name of the index to use: 'hf' for a canonical dataset from the datasets library (default), 'custom' for a local index, or 'legacy' for the original one",
+        )
+        parser.add_argument(
+            "--passages_path",
+            type=str,
+            default=None,
+            help="Path to the dataset of passages for custom index. More info about custom indexes in the RagRetriever documentation as well as in `examples/rag/use_own_knowledge_dataset.py`",
+        )
+        parser.add_argument(
+            "--index_path",
+            type=str,
+            default=None,
+            help="Path to the faiss index for custom index. More info about custom indexes in the RagRetriever documentation as well as in `examples/rag/use_own_knowledge_dataset.py`",
+        )
+        return parser
+

 def main(args, model=None) -> GenerativeQAModule:
    Path(args.output_dir).mkdir(exist_ok=True)
@@ -463,6 +492,7 @@ if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser = pl.Trainer.add_argparse_args(parser)
    parser = GenerativeQAModule.add_model_specific_args(parser, os.getcwd())
+    parser = GenerativeQAModule.add_retriever_specific_args(parser)

    args = parser.parse_args()
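Note on the retriever arguments: all three flags default to `None`, and the module applies them with `args.x or config.x`, so a flag that is not passed leaves the pretrained configuration untouched. The sketch below reproduces that fallback in isolation. The `SimpleNamespace` object is a hypothetical stand-in for `RagConfig`, and its initial values are placeholders, not the library defaults.

```python
import argparse
from types import SimpleNamespace

RETRIEVER_FIELDS = ("index_name", "passages_path", "index_path")

parser = argparse.ArgumentParser()
for name in RETRIEVER_FIELDS:
    parser.add_argument(f"--{name}", type=str, default=None)

# Hypothetical stand-in for a loaded RagConfig (placeholder values).
config = SimpleNamespace(index_name="hf", passages_path=None, index_path=None)

# Only two of the three flags are given on this simulated command line.
args = parser.parse_args(["--index_name", "custom", "--passages_path", "my_dataset"])

for name in RETRIEVER_FIELDS:
    # A missing flag is None, so `or` falls back to the configured value.
    setattr(config, name, getattr(args, name) or getattr(config, name))

print(config)
```

One consequence of using `or` is that an empty string passed on the command line also falls back to the configured value, since it is falsy.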

--- a/examples/rag/test_data/my_knowledge_dataset.csv
+++ b/examples/rag/test_data/my_knowledge_dataset.csv
@@ -0,0 +1,2 @@
+Aaron	Aaron Aaron ( or ; "Ahärôn") is a prophet, high priest, and the brother of Moses in the Abrahamic religions. Knowledge of Aaron, along with his brother Moses, comes exclusively from religious texts, such as the Bible and Quran. The Hebrew Bible relates that, unlike Moses, who grew up in the Egyptian royal court, Aaron and his elder sister Miriam remained with their kinsmen in the eastern border-land of Egypt (Goshen). When Moses first confronted the Egyptian king about the Israelites, Aaron served as his brother's spokesman ("prophet") to the Pharaoh. Part of the Law (Torah) that Moses received from God at Sinai granted Aaron the priesthood for himself and his male descendants, and he became the first High Priest of the Israelites. Aaron died before the Israelites crossed the North Jordan river and he was buried on Mount Hor (Numbers 33:39; Deuteronomy 10:6 says he died and was buried at Moserah). Aaron is also mentioned in the New Testament of the Bible. According to the Book of Exodus, Aaron first functioned as Moses' assistant. Because Moses complained that he could not speak well, God appointed Aaron as Moses' "prophet" (Exodus 4:10-17; 7:1). At the command of Moses, he let his rod turn into a snake. Then he stretched out his rod in order to bring on the first three plagues. After that, Moses tended to act and speak for himself. During the journey in the wilderness, Aaron was not always prominent or active. At the battle with Amalek, he was chosen with Hur to support the hand of Moses that held the "rod of God". When the revelation was given to Moses at biblical Mount Sinai, he headed the elders of Israel who accompanied Moses on the way to the summit.
+"Pokémon"	Pokémon , also known as in Japan, is a media franchise managed by The Pokémon Company, a Japanese consortium between Nintendo, Game Freak, and Creatures. The franchise copyright is shared by all three companies, but Nintendo is the sole owner of the trademark. The franchise was created by Satoshi Tajiri in 1995, and is centered on fictional creatures called "Pokémon", which humans, known as Pokémon Trainers, catch and train to battle each other for sport. The English slogan for the franchise is "Gotta Catch 'Em All". Works within the franchise are set in the Pokémon universe. The franchise began as "Pokémon Red" and "Green" (released outside of Japan as "Pokémon Red" and "Blue"), a pair of video games for the original Game Boy that were developed by Game Freak and published by Nintendo in February 1996. "Pokémon" has since gone on to become the highest-grossing media franchise of all time, with over in revenue up until March 2017. The original video game series is the second best-selling video game franchise (behind Nintendo's "Mario" franchise) with more than 300million copies sold and over 800million mobile downloads. In addition, the "Pokémon" franchise includes the world's top-selling toy brand, the top-selling trading card game with over 25.7billion cards sold, an anime television series that has become the most successful video game adaptation with over 20 seasons and 1,000 episodes in 124 countries, as well as an anime film series, a , books, manga comics, music, and merchandise. The franchise is also represented in other Nintendo media, such as the "Super Smash Bros." series. In November 2005, 4Kids Entertainment, which had managed the non-game related licensing of "Pokémon", announced that it had agreed not to renew the "Pokémon" representation agreement. The Pokémon Company International oversees all "Pokémon" licensing outside Asia.
--- a/examples/rag/test_distributed_retriever.py
+++ b/examples/rag/test_distributed_retriever.py
@@ -15,6 +15,7 @@ from transformers.configuration_bart import BartConfig
 from transformers.configuration_dpr import DPRConfig
 from transformers.configuration_rag import RagConfig
 from transformers.file_utils import is_datasets_available, is_faiss_available, is_psutil_available, is_torch_available
+from transformers.retrieval_rag import CustomHFIndex
 from transformers.tokenization_bart import BartTokenizer
 from transformers.tokenization_bert import VOCAB_FILES_NAMES as DPR_VOCAB_FILES_NAMES
 from transformers.tokenization_dpr import DPRQuestionEncoderTokenizer
@@ -23,7 +24,7 @@ from transformers.tokenization_roberta import VOCAB_FILES_NAMES as BART_VOCAB_FI

 sys.path.append(os.path.join(os.getcwd()))  # noqa: E402 # noqa: E402 # isort:skip

-from examples.rag.distributed_retriever import RagPyTorchDistributedRetriever  # noqa: E402 # isort:skip
+from distributed_retriever import RagPyTorchDistributedRetriever  # noqa: E402 # isort:skip


 def require_distributed_retrieval(test_case):
@@ -114,7 +115,7 @@ class RagRetrieverTest(TestCase):
    def tearDown(self):
        shutil.rmtree(self.tmpdirname)

-    def get_dummy_pytorch_distributed_retriever(self, init_retrieval, port=12345) -> RagPyTorchDistributedRetriever:
+    def get_dummy_dataset(self):
        dataset = Dataset.from_dict(
            {
                "id": ["0", "1"],
@@ -124,6 +125,12 @@ class RagRetrieverTest(TestCase):
            }
        )
        dataset.add_faiss_index("embeddings", string_factory="Flat", metric_type=faiss.METRIC_INNER_PRODUCT)
+        return dataset
+
+    def get_dummy_pytorch_distributed_retriever(
+        self, init_retrieval: bool, port=12345
+    ) -> RagPyTorchDistributedRetriever:
+        dataset = self.get_dummy_dataset()
        config = RagConfig(
            retrieval_vector_size=self.retrieval_vector_size,
            question_encoder=DPRConfig().to_dict(),
@@ -140,6 +147,37 @@ class RagRetrieverTest(TestCase):
                retriever.init_retrieval(port)
        return retriever

+    def get_dummy_custom_hf_index_retriever(self, init_retrieval: bool, from_disk: bool, port=12345):
+        dataset = self.get_dummy_dataset()
+        config = RagConfig(
+            retrieval_vector_size=self.retrieval_vector_size,
+            question_encoder=DPRConfig().to_dict(),
+            generator=BartConfig().to_dict(),
+            index_name="custom",
+        )
+        if from_disk:
+            config.passages_path = os.path.join(self.tmpdirname, "dataset")
+            config.index_path = os.path.join(self.tmpdirname, "index.faiss")
+            dataset.get_index("embeddings").save(os.path.join(self.tmpdirname, "index.faiss"))
+            dataset.drop_index("embeddings")
+            dataset.save_to_disk(os.path.join(self.tmpdirname, "dataset"))
+            del dataset
+            retriever = RagPyTorchDistributedRetriever(
+                config,
+                question_encoder_tokenizer=self.get_dpr_tokenizer(),
+                generator_tokenizer=self.get_bart_tokenizer(),
+            )
+        else:
+            retriever = RagPyTorchDistributedRetriever(
+                config,
+                question_encoder_tokenizer=self.get_dpr_tokenizer(),
+                generator_tokenizer=self.get_bart_tokenizer(),
+                index=CustomHFIndex(config.retrieval_vector_size, dataset),
+            )
+        if init_retrieval:
+            retriever.init_retrieval(port)
+        return retriever
+
    def test_pytorch_distributed_retriever_retrieve(self):
        n_docs = 1
        retriever = self.get_dummy_pytorch_distributed_retriever(init_retrieval=True)
@@ -154,3 +192,33 @@ class RagRetrieverTest(TestCase):
        self.assertEqual(doc_dicts[0]["id"][0], "1")  # max inner product is reached with second doc
        self.assertEqual(doc_dicts[1]["id"][0], "0")  # max inner product is reached with first doc
        self.assertListEqual(doc_ids.tolist(), [[1], [0]])
+
+    def test_custom_hf_index_retriever_retrieve(self):
+        n_docs = 1
+        retriever = self.get_dummy_custom_hf_index_retriever(init_retrieval=True, from_disk=False)
+        hidden_states = np.array(
+            [np.ones(self.retrieval_vector_size), -np.ones(self.retrieval_vector_size)], dtype=np.float32
+        )
+        retrieved_doc_embeds, doc_ids, doc_dicts = retriever.retrieve(hidden_states, n_docs=n_docs)
+        self.assertEqual(retrieved_doc_embeds.shape, (2, n_docs, self.retrieval_vector_size))
+        self.assertEqual(len(doc_dicts), 2)
+        self.assertEqual(sorted(doc_dicts[0]), ["embeddings", "id", "text", "title"])
+        self.assertEqual(len(doc_dicts[0]["id"]), n_docs)
+        self.assertEqual(doc_dicts[0]["id"][0], "1")  # max inner product is reached with second doc
+        self.assertEqual(doc_dicts[1]["id"][0], "0")  # max inner product is reached with first doc
+        self.assertListEqual(doc_ids.tolist(), [[1], [0]])
+
+    def test_custom_pytorch_distributed_retriever_retrieve_from_disk(self):
+        n_docs = 1
+        retriever = self.get_dummy_custom_hf_index_retriever(init_retrieval=True, from_disk=True)
+        hidden_states = np.array(
+            [np.ones(self.retrieval_vector_size), -np.ones(self.retrieval_vector_size)], dtype=np.float32
+        )
+        retrieved_doc_embeds, doc_ids, doc_dicts = retriever.retrieve(hidden_states, n_docs=n_docs)
+        self.assertEqual(retrieved_doc_embeds.shape, (2, n_docs, self.retrieval_vector_size))
+        self.assertEqual(len(doc_dicts), 2)
+        self.assertEqual(sorted(doc_dicts[0]), ["embeddings", "id", "text", "title"])
+        self.assertEqual(len(doc_dicts[0]["id"]), n_docs)
+        self.assertEqual(doc_dicts[0]["id"][0], "1")  # max inner product is reached with second doc
+        self.assertEqual(doc_dicts[1]["id"][0], "0")  # max inner product is reached with first doc
+        self.assertListEqual(doc_ids.tolist(), [[1], [0]])
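Note on the expected ids in these tests: the index is built with `faiss.METRIC_INNER_PRODUCT`, so the returned document is the one whose embedding has the largest dot product with the query. The dummy embeddings are elided in the hunk above, so the pure-Python sketch below assumes two documents with all-ones and all-twos vectors. Under that assumption it shows why a query of ones selects document 1 and a query of minus ones selects document 0. The helper `top1_by_inner_product` is illustrative and not part of the test suite.

```python
from typing import List, Sequence


def top1_by_inner_product(
    queries: Sequence[Sequence[float]], doc_embeddings: Sequence[Sequence[float]]
) -> List[int]:
    """Return, for each query, the index of the document with the largest dot product."""
    best = []
    for query in queries:
        scores = [sum(q * d for q, d in zip(query, doc)) for doc in doc_embeddings]
        best.append(max(range(len(scores)), key=scores.__getitem__))
    return best


dim = 8  # stands in for retrieval_vector_size
# Assumed dummy embeddings: document 0 is all ones, document 1 is all twos.
docs = [[1.0] * dim, [2.0] * dim]
queries = [[1.0] * dim, [-1.0] * dim]

print(top1_by_inner_product(queries, docs))
```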
--- a/examples/rag/use_own_knowledge_dataset.py
+++ b/examples/rag/use_own_knowledge_dataset.py
@@ -0,0 +1,199 @@
+import logging
+import os
+from dataclasses import dataclass, field
+from functools import partial
+from pathlib import Path
+from tempfile import TemporaryDirectory
+from typing import List, Optional
+
+import torch
+from datasets import load_dataset
+
+import faiss
+from transformers import (
+    DPRContextEncoder,
+    DPRContextEncoderTokenizerFast,
+    HfArgumentParser,
+    RagRetriever,
+    RagSequenceForGeneration,
+    RagTokenizer,
+)
+
+
+logger = logging.getLogger(__name__)
+torch.set_grad_enabled(False)
+device = "cuda" if torch.cuda.is_available() else "cpu"
+
+
+def split_text(text: str, n=100, character=" ") -> List[str]:
+    """Split the text every ``n``-th occurrence of ``character``"""
+    text = text.split(character)
+    return [character.join(text[i : i + n]).strip() for i in range(0, len(text), n)]
+
+
+def split_documents(documents: dict) -> dict:
+    """Split documents into passages"""
+    titles, texts = [], []
+    for title, text in zip(documents["title"], documents["text"]):
+        for passage in split_text(text):
+            titles.append(title)
+            texts.append(passage)
+    return {"title": titles, "text": texts}
+
+
+def embed(documents: dict, ctx_encoder: DPRContextEncoder, ctx_tokenizer: DPRContextEncoderTokenizerFast) -> dict:
+    """Compute the DPR embeddings of document passages"""
+    input_ids = ctx_tokenizer(
+        documents["title"], documents["text"], truncation=True, padding="longest", return_tensors="pt"
+    )["input_ids"]
+    embeddings = ctx_encoder(input_ids.to(device=device), return_dict=True).pooler_output
+    return {"embeddings": embeddings.detach().cpu().numpy()}
+
+
+def main(
+    rag_example_args: "RagExampleArguments",
+    processing_args: "ProcessingArguments",
+    index_hnsw_args: "IndexHnswArguments",
+):
+
+    ######################################
+    logger.info("Step 1 - Create the dataset")
+    ######################################
+
+    # The dataset needed for RAG must have three columns:
+    # - title (string): title of the document
+    # - text (string): text of a passage of the document
+    # - embeddings (array of dimension d): DPR representation of the passage
+
+    # Let's say you have documents in tab-separated csv files with columns "title" and "text"
+    assert os.path.isfile(rag_example_args.csv_path), "Please provide a valid path to a csv file"
+
+    # You can load a Dataset object this way
+    dataset = load_dataset(
+        "csv", data_files=[rag_example_args.csv_path], split="train", delimiter="\t", column_names=["title", "text"]
+    )
+
+    # More info about loading csv files in the documentation: https://huggingface.co/docs/datasets/loading_datasets.html?highlight=csv#csv-files
+
+    # Then split the documents into passages of 100 words
+    dataset = dataset.map(split_documents, batched=True, num_proc=processing_args.num_proc)
+
+    # And compute the embeddings
+    ctx_encoder = DPRContextEncoder.from_pretrained(rag_example_args.dpr_ctx_encoder_model_name).to(device=device)
+    ctx_tokenizer = DPRContextEncoderTokenizerFast.from_pretrained(rag_example_args.dpr_ctx_encoder_model_name)
+    dataset = dataset.map(
+        partial(embed, ctx_encoder=ctx_encoder, ctx_tokenizer=ctx_tokenizer),
+        batched=True,
+        batch_size=processing_args.batch_size,
+    )
+
+    # And finally save your dataset
+    passages_path = os.path.join(rag_example_args.output_dir, "my_knowledge_dataset")
+    dataset.save_to_disk(passages_path)
+    # from datasets import load_from_disk
+    # dataset = load_from_disk(passages_path)  # to reload the dataset
+
+    ######################################
+    logger.info("Step 2 - Index the dataset")
+    ######################################
+
+    # Let's use the Faiss implementation of HNSW for fast approximate nearest neighbor search
+    index = faiss.IndexHNSWFlat(index_hnsw_args.d, index_hnsw_args.m, faiss.METRIC_INNER_PRODUCT)
+    dataset.add_faiss_index("embeddings", custom_index=index)
+
+    # And save the index
+    index_path = os.path.join(rag_example_args.output_dir, "my_knowledge_dataset_hnsw_index.faiss")
+    dataset.get_index("embeddings").save(index_path)
+    # dataset.load_faiss_index("embeddings", index_path)  # to reload the index
+
+    ######################################
+    logger.info("Step 3 - Load RAG")
+    ######################################
+
+    # Easy way to load the model
+    retriever = RagRetriever.from_pretrained(
+        rag_example_args.rag_model_name, index_name="custom", indexed_dataset=dataset
+    )
+    model = RagSequenceForGeneration.from_pretrained(rag_example_args.rag_model_name, retriever=retriever)
+    tokenizer = RagTokenizer.from_pretrained(rag_example_args.rag_model_name)
+
+    # For distributed fine-tuning you'll need to provide the paths instead, as the dataset and the index are loaded separately.
+    # retriever = RagRetriever.from_pretrained(rag_model_name, index_name="custom", passages_path=passages_path, index_path=index_path)
+
+    ######################################
+    logger.info("Step 4 - Have fun")
+    ######################################
+
+    question = rag_example_args.question or "What does Moses' rod turn into ?"
+    input_ids = tokenizer.question_encoder(question, return_tensors="pt")["input_ids"]
+    generated = model.generate(input_ids)
+    generated_string = tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
+    logger.info("Q: " + question)
+    logger.info("A: " + generated_string)
+
+
+@dataclass
+class RagExampleArguments:
+    csv_path: str = field(
+        default=str(Path(__file__).parent / "test_data" / "my_knowledge_dataset.csv"),
+        metadata={"help": "Path to a tab-separated csv file with columns 'title' and 'text'"},
+    )
+    question: Optional[str] = field(
+        default=None,
+        metadata={"help": "Question that is passed as input to RAG. Default is 'What does Moses' rod turn into ?'."},
+    )
+    rag_model_name: str = field(
+        default="facebook/rag-sequence-nq",
+        metadata={"help": "The RAG model to use. Either 'facebook/rag-sequence-nq' or 'facebook/rag-token-nq'"},
+    )
+    dpr_ctx_encoder_model_name: str = field(
+        default="facebook/dpr-ctx_encoder-multiset-base",
+        metadata={
+            "help": "The DPR context encoder model to use. Either 'facebook/dpr-ctx_encoder-single-nq-base' or 'facebook/dpr-ctx_encoder-multiset-base'"
+        },
+    )
+    output_dir: Optional[str] = field(
+        default=None,
+        metadata={"help": "Path to a directory where the dataset passages and the index will be saved"},
+    )
+
+
+@dataclass
+class ProcessingArguments:
+    num_proc: Optional[int] = field(
+        default=None,
+        metadata={
+            "help": "The number of processes to use to split the documents into passages. Default is single process."
+        },
+    )
+    batch_size: int = field(
+        default=16,
+        metadata={
+            "help": "The batch size to use when computing the passages embeddings using the DPR context encoder."
+        },
+    )
+
+
+@dataclass
+class IndexHnswArguments:
+    d: int = field(
+        default=768,
+        metadata={"help": "The dimension of the embeddings to pass to the HNSW Faiss index."},
+    )
+    m: int = field(
+        default=128,
+        metadata={
+            "help": "The number of bi-directional links created for every new element during the HNSW index construction."
+        },
+    )
+
+
+if __name__ == "__main__":
+    logging.basicConfig(level=logging.WARNING)
+    logger.setLevel(logging.INFO)
+
+    parser = HfArgumentParser((RagExampleArguments, ProcessingArguments, IndexHnswArguments))
+    rag_example_args, processing_args, index_hnsw_args = parser.parse_args_into_dataclasses()
+    with TemporaryDirectory() as tmp_dir:
+        rag_example_args.output_dir = rag_example_args.output_dir or tmp_dir
+        main(rag_example_args, processing_args, index_hnsw_args)
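Note on the passage-splitting step of this script: `split_documents` is applied with `batched=True`, which lets the mapped function return more rows than it receives, one row per passage. The sketch below copies the two helpers and runs them on a tiny batch, without the `datasets` library. The extra `n` parameter on `split_documents` and the sample batch are additions for illustration only; the script always uses the default of 100 words.

```python
from typing import List


def split_text(text: str, n: int = 100, character: str = " ") -> List[str]:
    """Split the text every n-th occurrence of character."""
    words = text.split(character)
    return [character.join(words[i : i + n]).strip() for i in range(0, len(words), n)]


def split_documents(documents: dict, n: int = 100) -> dict:
    """Split each document into passages, repeating the title for every passage."""
    titles, texts = [], []
    for title, text in zip(documents["title"], documents["text"]):
        for passage in split_text(text, n=n):
            titles.append(title)
            texts.append(passage)
    return {"title": titles, "text": texts}


# Hypothetical batch of two short documents, split every 2 words.
batch = {"title": ["Aaron", "Pokémon"], "text": ["a b c d e", "f g"]}
passages = split_documents(batch, n=2)
print(passages)
```

Two input rows become four passage rows here, with the last passage of a document allowed to be shorter than `n` words.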
--- a/examples/requirements.txt
+++ b/examples/requirements.txt
@@ -5,7 +5,7 @@ psutil
 sacrebleu
 rouge-score
 tensorflow_datasets
-pytorch-lightning==0.8.5
+pytorch-lightning==0.9.0
 matplotlib
 git-python==1.0.3
 faiss-cpu
@@ -17,3 +17,4 @@ datasets
 fire
 pytest
 conllu
+sentencepiece != 0.1.92
--- a/examples/seq2seq/README.md
+++ b/examples/seq2seq/README.md
@@ -12,14 +12,13 @@ For `bertabs` instructions, see [`bertabs/README.md`](bertabs/README.md).
 - `MBartForConditionalGeneration`
 - `FSMTForConditionalGeneration`
 - `T5ForConditionalGeneration`
-    

 ## Datasets

 #### XSUM:
 ```bash
 cd examples/seq2seq
-wget https://s3.amazonaws.com/datasets.huggingface.co/summarization/xsum.tar.gz
+wget https://cdn-datasets.huggingface.co/summarization/xsum.tar.gz
 tar -xzvf xsum.tar.gz
 export XSUM_DIR=${PWD}/xsum
 ```
@@ -29,7 +28,7 @@ To use your own data, copy that files format. Each article to be summarized is o
 #### CNN/DailyMail
 ```bash
 cd examples/seq2seq
-wget https://s3.amazonaws.com/datasets.huggingface.co/summarization/cnn_dm_v2.tgz
+wget https://cdn-datasets.huggingface.co/summarization/cnn_dm_v2.tgz
 tar -xzvf cnn_dm_v2.tgz  # empty lines removed
 mv cnn_cln cnn_dm
 export CNN_DIR=${PWD}/cnn_dm
@@ -39,7 +38,7 @@ this should make a directory called `cnn_dm/` with 6 files.
 #### WMT16 English-Romanian Translation Data:
 download with this command:
 ```bash
-wget https://s3.amazonaws.com/datasets.huggingface.co/translation/wmt_en_ro.tar.gz
+wget https://cdn-datasets.huggingface.co/translation/wmt_en_ro.tar.gz
 tar -xzvf wmt_en_ro.tar.gz
 export ENRO_DIR=${PWD}/wmt_en_ro
 ```
@@ -47,7 +46,7 @@ this should make a directory called `wmt_en_ro/` with 6 files.

 #### WMT English-German:
 ```bash
-wget https://s3.amazonaws.com/datasets.huggingface.co/translation/wmt_en_de.tgz
+wget https://cdn-datasets.huggingface.co/translation/wmt_en_de.tgz
 tar -xzvf wmt_en_de.tgz
 export DATA_DIR=${PWD}/wmt_en_de
 ```
@@ -100,7 +99,7 @@ All finetuning bash scripts call finetune.py (or distillation.py) with reasonabl
 To see all the possible command line options, run:

 ```bash
- ./finetune.py --help 
+./finetune.py --help
 ```

 ### Finetuning Training Params
@@ -192,7 +191,7 @@ model = AutoModelForSeq2SeqLM.from_pretrained(f'{output_dir}/best_tfmr')
 ### Fine-tuning using Seq2SeqTrainer
 To use `Seq2SeqTrainer` for fine-tuning you should use the `finetune_trainer.py` script. It subclasses `Trainer` to extend it for seq2seq training. Except for the `Trainer`-related `TrainingArguments`, it shares the same argument names as `finetune.py`. One notable difference is that calculating generative metrics (BLEU, ROUGE) is optional and is controlled by the `--predict_with_generate` argument; set this argument to calculate BLEU and ROUGE metrics.

-With PyTorch 1.6+ it'll automatically use `native AMP` when `--fp16` is set. 
+With PyTorch 1.6+ it'll automatically use `native AMP` when `--fp16` is set.

 To see all the possible command line options, run:

@@ -265,6 +264,7 @@ export DATA_DIR=cnn_dm
    --fp16 \
    --bs 32
 ```
+
 ### Multi-GPU Evaluation
 here is a command to run xsum evaluation on 8 GPUS. It is more than linearly faster than run_eval.py in some cases 
 because it uses SortishSampler to minimize padding. You can also use it on 1 GPU. `data_dir` must have 
@@ -391,6 +391,17 @@ runtime: 13H on V-100 16GB GPU.
 pytest examples/seq2seq/
 ```

+### Converting pytorch-lightning checkpoints
+pytorch-lightning's `--do_predict` often fails, so once you are done training, the best way to evaluate your model is to convert the checkpoint.
+
+This should be done for you, with a file called `{save_dir}/best_tfmr`. 
+
+If that file doesn't exist but you have a lightning `.ckpt` file, you can run
+```bash
+python convert_pl_checkpoint_to_hf.py PATH_TO_CKPT  randomly_initialized_hf_model_path save_dir/best_tfmr
+```
+Then run either `run_eval` or `run_distributed_eval` with `save_dir/best_tfmr` (see previous sections).
+

 ## Experimental Features 
 These features are harder to use and not always useful.
@@ -419,4 +430,3 @@ uses 12,723 batches of length 48 and takes slightly more time 9.5 minutes.
 The feature is still experimental, because:
 + we can make it much more robust if we have memory mapped/preprocessed datasets.
 + The speedup over sortish sampler is not that large at the moment.
-
--- a/examples/seq2seq/bertabs/README.md
+++ b/examples/seq2seq/bertabs/README.md
@@ -39,7 +39,7 @@ python run_summarization.py \
    --compute_rouge true
 ```

-The scripts executes on GPU if one is available and if `no_cuda` is not set to `true`. Inference on multiple GPUs is not suported yet. The ROUGE scores will be displayed in the console at the end of evaluation and written in a `rouge_scores.txt` file. The script takes 30 hours to compute with a single Tesla V100 GPU and a batch size of 10 (300,000 texts to summarize).
+The script executes on GPU if one is available and if `no_cuda` is not set to `true`. Inference on multiple GPUs is not supported yet. The ROUGE scores will be displayed in the console at the end of evaluation and written in a `rouge_scores.txt` file. The script takes 30 hours to compute with a single Tesla V100 GPU and a batch size of 10 (300,000 texts to summarize).

 ## Summarize any text

--- a/examples/seq2seq/builtin_trainer/train_distil_marian_enro.sh
+++ b/examples/seq2seq/builtin_trainer/train_distil_marian_enro.sh
@@ -19,5 +19,4 @@ python finetune_trainer.py \
    --do_train --do_eval --do_predict --evaluate_during_training\
    --predict_with_generate --logging_first_step \
    --task translation --label_smoothing 0.1 \
-    --run_name marian_en_ro_6_3 \
    "$@"
--- a/examples/seq2seq/builtin_trainer/train_distil_marian_enro_tpu.sh
+++ b/examples/seq2seq/builtin_trainer/train_distil_marian_enro_tpu.sh
@@ -20,5 +20,4 @@ python xla_spawn.py --num_cores $TPU_NUM_CORES \
    --do_train --do_eval --evaluate_during_training \
    --prediction_loss_only \
    --task translation --label_smoothing 0.1 \
-    --run_name marian_en_ro_6_3 \
    "$@"
--- a/examples/seq2seq/builtin_trainer/train_distilbart_cnn.sh
+++ b/examples/seq2seq/builtin_trainer/train_distilbart_cnn.sh
@@ -19,8 +19,7 @@ python finetune_trainer.py \
    --num_train_epochs=2 \
    --save_steps 3000 --eval_steps 3000 \
    --logging_first_step \
-    --max_target_length $MAX_TGT_LEN --val_max_target_length $MAX_TGT_LEN --test_max_target_length $MAX_TGT_LEN \
+    --max_target_length 56 --val_max_target_length $MAX_TGT_LEN --test_max_target_length $MAX_TGT_LEN \
    --do_train --do_eval --do_predict --evaluate_during_training \
    --predict_with_generate \
-    --run_name distilbart-cnn-12-6 \
    "$@"
--- a/examples/seq2seq/builtin_trainer/train_mbart_cc25_enro.sh
+++ b/examples/seq2seq/builtin_trainer/train_mbart_cc25_enro.sh
@@ -18,5 +18,4 @@ python finetune_trainer.py \
    --do_train --do_eval --do_predict --evaluate_during_training \
    --predict_with_generate --logging_first_step \
    --task translation \
-    --run_name mbart_en_ro \
    "$@"
--- a/examples/seq2seq/distillation.py
+++ b/examples/seq2seq/distillation.py
@@ -17,7 +17,7 @@ from finetune import main as ft_main
 from make_student import create_student_by_copying_alternating_layers, get_layers_to_supervise
 from transformers import AutoModelForSeq2SeqLM, MBartTokenizer, T5ForConditionalGeneration
 from transformers.modeling_bart import shift_tokens_right
-from utils import calculate_bleu, freeze_params, label_smoothed_nll_loss, pickle_load, use_task_specific_params
+from utils import calculate_bleu, freeze_params, label_smoothed_nll_loss, use_task_specific_params


 # need the parent dir module
@@ -28,7 +28,7 @@ from lightning_base import generic_train  # noqa
 class BartSummarizationDistiller(SummarizationModule):
    """Supports Bart, Pegasus and other models that inherit from Bart."""

-    loss_names = ["loss", "ce_loss", "mlm_loss", "enc_mse_loss", "hid_loss_enc", "hid_loss_dec"]
+    loss_names = ["loss", "ce_loss", "mlm_loss", "hid_loss_enc", "hid_loss_dec"]

    def __init__(self, hparams):
        assert Path(hparams.data_dir).exists()
@@ -46,9 +46,19 @@ class BartSummarizationDistiller(SummarizationModule):
        if hparams.length_penalty != -1:
            student.config.length_penalty = hparams.length_penalty
        super().__init__(hparams, model=student, config=student.config)
+        model_type = student.config.model_type
        self.e_layer_ids, self.d_layer_ids = e_layer_ids, d_layer_ids  # type: List[int], List[int]
-        self.different_encoder = hparams.student_encoder_layers != teacher.config.encoder_layers
-        self.different_decoder = hparams.student_decoder_layers != teacher.config.decoder_layers
+
+        if model_type == "t5":
+            teacher_encoder_layers = len(teacher.get_encoder().block)
+            teacher_decoder_layers = len(teacher.get_decoder().block)
+        else:
+            teacher_encoder_layers = teacher.config.encoder_layers
+            teacher_decoder_layers = teacher.config.decoder_layers
+
+        self.different_encoder = hparams.student_encoder_layers != teacher_encoder_layers
+        self.different_decoder = hparams.student_decoder_layers != teacher_decoder_layers
+
        self.teacher = teacher
        freeze_params(self.teacher)

@@ -59,17 +69,17 @@ class BartSummarizationDistiller(SummarizationModule):
                del self.teacher.encoder
        # Intermediate supervision: Decide which layers to supervise
        if hparams.supervise_forward:
-            self.d_matches = get_layers_to_supervise(
-                n_student=len(self.d_layer_ids), n_teacher=self.teacher.config.decoder_layers
-            )
-        else:
+            self.e_matches = get_layers_to_supervise(n_student=len(self.e_layer_ids), n_teacher=teacher_encoder_layers)
+            self.d_matches = get_layers_to_supervise(n_student=len(self.d_layer_ids), n_teacher=teacher_decoder_layers)
+        else:  # student layer should emulate hidden states of the teacher layer it was copied from
+            self.e_matches = self.e_layer_ids
            self.d_matches = self.d_layer_ids
+
        self.ce_loss_fct = nn.KLDivLoss(reduction="batchmean")
        self.temperature = 2.0
        self.alpha_mlm = hparams.alpha_mlm
        self.alpha_ce = hparams.alpha_ce
        self.alpha_hid = hparams.alpha_hid
-        self.alpha_encoder_loss = hparams.alpha_encoder_loss
        gc.collect()
        torch.cuda.empty_cache()

@@ -129,7 +139,7 @@ class BartSummarizationDistiller(SummarizationModule):
            output_hidden_states=True,
            output_attentions=False,
            use_cache=False,
-        )  # TODO(@sshleifer): return_dict=True cleanup
+        )

        # Same cross entropy vs. label smoothing logic as finetune.py
        assert lm_logits.shape[-1] == self.model.config.vocab_size
@@ -146,30 +156,32 @@ class BartSummarizationDistiller(SummarizationModule):
        def zero_tensor():
            return torch.tensor(0.0).type_as(student_lm_loss)

-        loss_encoder, hid_loss_enc, hid_loss_dec = zero_tensor(), zero_tensor(), zero_tensor()
-        if self.different_encoder:
+        hid_loss_enc, hid_loss_dec = zero_tensor(), zero_tensor()
+        if self.different_encoder:  # compute encoder hidden state loss
            with torch.no_grad():
-                teacher_enc_outputs, teacher_enc_hid, _ = self.teacher.get_encoder()(
-                    input_ids, attention_mask=src_mask, output_hidden_states=True
-                )
-            # DEPRECATE THIS
-            if self.hparams.alpha_encoder_loss > 0:
-                loss_encoder = self.calc_mse_loss(enc_outputs, teacher_enc_outputs, src_mask)
+                teacher_enc_hid = self.teacher.get_encoder()(
+                    input_ids, attention_mask=src_mask, output_hidden_states=True, return_dict=True
+                ).hidden_states

-            hid_loss_enc = self.calc_hidden_loss(src_mask, enc_hidden_state, teacher_enc_hid, self.e_layer_ids)
-
-        teacher_enc_outputs = (enc_outputs,)
-        assert isinstance(teacher_enc_outputs, tuple), type(teacher_enc_outputs)
+            hid_loss_enc = self.calc_hidden_loss(
+                src_mask,
+                enc_hidden_state,
+                teacher_enc_hid,
+                self.e_matches,
+                normalize_hidden=self.hparams.normalize_hidden,
+            )

        with torch.no_grad():
-            tloss, tlogits, tdec_hidden, _ = self.teacher(
+            outputs = self.teacher(
                input_ids,
                attention_mask=src_mask,
-                encoder_outputs=teacher_enc_outputs,
+                encoder_outputs=(enc_outputs,),
                decoder_input_ids=decoder_input_ids,
                lm_labels=labels,
                output_hidden_states=True,
+                return_dict=True,
            )
+            tlogits, tdec_hidden = outputs.logits, outputs.decoder_hidden_states
        dec_mask = decoder_input_ids.ne(pad_token_id)
        loss_ce = self.calc_ce_loss(dec_mask, lm_logits, tlogits)
        if self.alpha_hid > 0:  # Intermediate supervision of decoder hidden states
@@ -180,10 +192,9 @@ class BartSummarizationDistiller(SummarizationModule):
        blended_loss = (
            self.alpha_ce * loss_ce
            + self.alpha_mlm * student_lm_loss
-            + self.hparams.alpha_encoder_loss * loss_encoder
            + self.hparams.alpha_hid * (hid_loss_enc + hid_loss_dec)
        )
-        return blended_loss, loss_ce, student_lm_loss, loss_encoder, hid_loss_enc, hid_loss_dec
+        return blended_loss, loss_ce, student_lm_loss, hid_loss_enc, hid_loss_dec

    @staticmethod
    def calc_hidden_loss(attention_mask, hidden_states, hidden_states_T, matches, normalize_hidden):
@@ -207,7 +218,6 @@ def add_distill_args(parser):
    parser.add_argument("--teacher", type=str)
    parser.add_argument("--alpha_ce", default=0.8, type=float)
    parser.add_argument("--alpha_mlm", default=0.2, type=float)
-    parser.add_argument("--alpha_encoder_loss", default=0.0, type=float)
    parser.add_argument("--alpha_hid", default=0.0, type=float, required=False)
    parser.add_argument("--student_decoder_layers", default=12, type=int, required=False)
    parser.add_argument("--student_encoder_layers", default=12, type=int, required=False)
@@ -254,30 +264,6 @@ def create_module(args):
    return model


-def evaluate_checkpoint(ckpt_path: Path, dest_dir=None):
-    # TODO(SS): DELETE? Better to convert_pl_ckpt_to_hf and run_eval.py
-    exp_dir = ckpt_path.parent
-    if dest_dir is None:
-        dest_dir = exp_dir
-    clash = list(dest_dir.glob("test_generations*"))
-    if clash:
-        print(f"SKIPPING to avoid overwriting {clash}")
-    ckpt = torch.load(ckpt_path, map_location="cpu")
-    if "hparams" in ckpt:
-        args = argparse.Namespace(**ckpt["hparams"])
-    else:
-        args = argparse.Namespace(**pickle_load(exp_dir / "hparams.pkl"))
-    args.resume_from_checkpoint = str(ckpt_path)
-    args.do_train = False
-    args.output_dir = str(dest_dir)
-    args.n_gpu = 1
-    args.eval_batch_size = 16
-    Path(args.output_dir).mkdir(exist_ok=True)
-    model = create_module(args)
-    trainer: pl.Trainer = generic_train(model, args, early_stopping_callback=False)
-    trainer.test(model)
-
-
 def distill_main(args):
    Path(args.output_dir).mkdir(exist_ok=True)
    if len(os.listdir(args.output_dir)) > 3 and args.do_train:
--- a/examples/seq2seq/finetune.py
+++ b/examples/seq2seq/finetune.py
@@ -26,12 +26,14 @@ from utils import (
    calculate_bleu,
    calculate_rouge,
    flatten_list,
+    freeze_embeds,
    freeze_params,
    get_git_info,
    label_smoothed_nll_loss,
    lmap,
    pickle_save,
    save_git_info,
+    save_json,
    use_task_specific_params,
 )

@@ -90,7 +92,7 @@ class SummarizationModule(BaseTransformer):
        assert self.target_lens["train"] <= self.target_lens["val"], f"target_lens: {self.target_lens}"
        assert self.target_lens["train"] <= self.target_lens["test"], f"target_lens: {self.target_lens}"
        if self.hparams.freeze_embeds:
-            self.freeze_embeds()
+            freeze_embeds(self.model)
        if self.hparams.freeze_encoder:
            freeze_params(self.model.get_encoder())
            assert_all_frozen(self.model.get_encoder())
@@ -104,29 +106,24 @@ class SummarizationModule(BaseTransformer):
        self.dataset_class = (
            Seq2SeqDataset if hasattr(self.tokenizer, "prepare_seq2seq_batch") else LegacySeq2SeqDataset
        )
+        self.already_saved_batch = False
        self.eval_beams = self.model.config.num_beams if self.hparams.eval_beams is None else self.hparams.eval_beams
-        assert self.eval_beams >= 1, f"got self.eval_beams={self.eval_beams}. Need an integer > 1"
        if self.hparams.eval_max_gen_length is not None:
            self.eval_max_length = self.hparams.eval_max_gen_length
        else:
            self.eval_max_length = self.model.config.max_length
        self.val_metric = self.default_val_metric if self.hparams.val_metric is None else self.hparams.val_metric

-    def freeze_embeds(self):
-        """Freeze token embeddings and positional embeddings for bart, just token embeddings for t5."""
-        if self.model_type == "t5":
-            freeze_params(self.model.shared)
-            for d in [self.model.encoder, self.model.decoder]:
-                freeze_params(d.embed_tokens)
-        elif self.model_type == "fsmt":
-            for d in [self.model.model.encoder, self.model.model.decoder]:
-                freeze_params(d.embed_positions)
-                freeze_params(d.embed_tokens)
-        else:
-            freeze_params(self.model.model.shared)
-            for d in [self.model.model.encoder, self.model.model.decoder]:
-                freeze_params(d.embed_positions)
-                freeze_params(d.embed_tokens)
+    def save_readable_batch(self, batch: Dict[str, torch.Tensor]) -> Dict[str, List[str]]:
+        """A debugging utility"""
+        readable_batch = {
+            k: self.tokenizer.batch_decode(v.tolist()) if "mask" not in k else v.shape for k, v in batch.items()
+        }
+        save_json(readable_batch, Path(self.output_dir) / "text_batch.json")
+        save_json({k: v.tolist() for k, v in batch.items()}, Path(self.output_dir) / "tok_batch.json")
+
+        self.already_saved_batch = True
+        return readable_batch

    def forward(self, input_ids, **kwargs):
        return self.model(input_ids, **kwargs)
@@ -145,6 +142,9 @@ class SummarizationModule(BaseTransformer):
            decoder_input_ids = self.model._shift_right(tgt_ids)
        else:
            decoder_input_ids = shift_tokens_right(tgt_ids, pad_token_id)
+        if not self.already_saved_batch:  # This would be slightly better if it only happened on rank zero
+            batch["decoder_input_ids"] = decoder_input_ids
+            self.save_readable_batch(batch)

        outputs = self(src_ids, attention_mask=src_mask, decoder_input_ids=decoder_input_ids, use_cache=False)
        lm_logits = outputs[0]
@@ -181,6 +181,7 @@ class SummarizationModule(BaseTransformer):
        return self._generative_step(batch)

    def validation_epoch_end(self, outputs, prefix="val") -> Dict:
+
        self.step_count += 1
        losses = {k: torch.stack([x[k] for x in outputs]).mean() for k in self.loss_names}
        loss = losses["loss"]
--- a/examples/seq2seq/finetune_bart_tiny.sh
+++ b/examples/seq2seq/finetune_bart_tiny.sh
@@ -1,7 +1,7 @@
 # Script for verifying that run_bart_sum can be invoked from its directory

 # Get tiny dataset with cnn_dm format (4 examples for train, val, test)
-wget https://s3.amazonaws.com/datasets.huggingface.co/summarization/cnn_tiny.tgz
+wget https://cdn-datasets.huggingface.co/summarization/cnn_tiny.tgz
 tar -xzvf cnn_tiny.tgz
 rm cnn_tiny.tgz

--- a/examples/seq2seq/finetune_trainer.py
+++ b/examples/seq2seq/finetune_trainer.py
@@ -1,111 +1,38 @@
-import json
 import logging
 import os
 import sys
 from dataclasses import dataclass, field
-from typing import Callable, Dict, List, Optional, Tuple
+from typing import Optional

-import numpy as np
-import torch
-
-from seq2seq_trainer import Seq2SeqTrainer
+from seq2seq_trainer import Seq2SeqTrainer, arg_to_scheduler_choices
 from transformers import (
    AutoConfig,
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
-    BartTokenizer,
-    EvalPrediction,
    HfArgumentParser,
    MBartTokenizer,
-    T5Tokenizer,
    TrainingArguments,
    set_seed,
 )
-from transformers.modeling_bart import shift_tokens_right
+from transformers.trainer_utils import EvaluationStrategy
 from utils import (
    LegacySeq2SeqDataset,
+    Seq2SeqDataCollator,
    Seq2SeqDataset,
    assert_all_frozen,
-    calculate_bleu,
-    calculate_rouge,
+    build_compute_metrics_fn,
+    freeze_embeds,
    freeze_params,
    lmap,
-    trim_batch,
+    save_json,
    use_task_specific_params,
+    write_txt_file,
 )


 logger = logging.getLogger(__name__)


-class Seq2SeqDataCollator:
-    def __init__(self, tokenizer, data_args, tpu_num_cores=None):
-        self.tokenizer = tokenizer
-        self.pad_token_id = tokenizer.pad_token_id
-        self.data_args = data_args
-        self.tpu_num_cores = tpu_num_cores
-        self.add_prefix_space = isinstance(tokenizer, BartTokenizer)
-
-    def __call__(self, batch) -> Dict[str, torch.Tensor]:
-        if hasattr(self.tokenizer, "prepare_seq2seq_batch"):
-            batch = self._encode(batch)
-            input_ids, attention_mask, labels = (
-                batch["input_ids"],
-                batch["attention_mask"],
-                batch["labels"],
-            )
-        else:
-            input_ids = torch.stack([x["input_ids"] for x in batch])
-            attention_mask = torch.stack([x["attention_mask"] for x in batch])
-            labels = torch.stack([x["labels"] for x in batch])
-
-            labels = trim_batch(labels, self.pad_token_id)
-            input_ids, attention_mask = trim_batch(input_ids, self.pad_token_id, attention_mask=attention_mask)
-
-        if isinstance(self.tokenizer, T5Tokenizer):
-            decoder_input_ids = self._shift_right_t5(labels)
-            labels = labels
-        else:
-            decoder_input_ids = shift_tokens_right(labels, self.pad_token_id)
-            labels = labels
-
-        batch = {
-            "input_ids": input_ids,
-            "attention_mask": attention_mask,
-            "decoder_input_ids": decoder_input_ids,
-            "labels": labels,
-        }
-        return batch
-
-    def _shift_right_t5(self, input_ids):
-        decoder_start_token_id = self.pad_token_id
-
-        assert (
-            decoder_start_token_id is not None
-        ), "self.model.config.decoder_start_token_id has to be defined. In T5 it is usually set to the pad_token_id. See T5 docs for more information"
-
-        # shift inputs to the right
-        shifted_input_ids = input_ids.new_zeros(input_ids.shape)
-        shifted_input_ids[..., 1:] = input_ids[..., :-1].clone()
-        shifted_input_ids[..., 0] = decoder_start_token_id
-
-        return shifted_input_ids
-
-    def _encode(self, batch) -> Dict[str, torch.Tensor]:
-        batch_encoding = self.tokenizer.prepare_seq2seq_batch(
-            [x["src_texts"] for x in batch],
-            src_lang=self.data_args.src_lang,
-            tgt_texts=[x["tgt_texts"] for x in batch],
-            tgt_lang=self.data_args.tgt_lang,
-            max_length=self.data_args.max_source_length,
-            max_target_length=self.data_args.max_target_length,
-            padding="max_length" if self.tpu_num_cores is not None else "longest",  # TPU hack
-            return_tensors="pt",
-            add_prefix_space=self.add_prefix_space,
-        )
-        return batch_encoding.data
-
-
@dataclass
 class Seq2SeqTrainingArguments(TrainingArguments):
    """
@@ -125,6 +52,20 @@ class Seq2SeqTrainingArguments(TrainingArguments):
    predict_with_generate: bool = field(
        default=False, metadata={"help": "Whether to use generate to calculate generative metrics (ROUGE, BLEU)."}
    )
+    adafactor: bool = field(default=False, metadata={"help": "whether to use adafactor"})
+    encoder_layerdrop: Optional[float] = field(
+        default=None, metadata={"help": "Encoder layer dropout probability. Goes into model.config."}
+    )
+    decoder_layerdrop: Optional[float] = field(
+        default=None, metadata={"help": "Decoder layer dropout probability. Goes into model.config."}
+    )
+    dropout: Optional[float] = field(default=None, metadata={"help": "Dropout probability. Goes into model.config."})
+    attention_dropout: Optional[float] = field(
+        default=None, metadata={"help": "Attention dropout probability. Goes into model.config."}
+    )
+    lr_scheduler: Optional[str] = field(
+        default="linear", metadata={"help": f"Which lr scheduler to use. Selected in {arg_to_scheduler_choices}"}
+    )


@dataclass
@@ -251,6 +192,13 @@ def main():
        model_args.config_name if model_args.config_name else model_args.model_name_or_path,
        cache_dir=model_args.cache_dir,
    )
+
+    extra_model_params = ("encoder_layerdrop", "decoder_layerdrop", "dropout", "attention_dropout")
+    for p in extra_model_params:
+        if getattr(training_args, p, None):
+            assert hasattr(config, p), f"({config.__class__.__name__}) doesn't have a `{p}` attribute"
+            setattr(config, p, getattr(training_args, p))
+
    tokenizer = AutoTokenizer.from_pretrained(
        model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
        cache_dir=model_args.cache_dir,
@@ -266,57 +214,15 @@ def main():
    use_task_specific_params(model, data_args.task)

    # set num_beams for evaluation
-    if data_args.eval_beams is not None:
-        model.config.num_beams = data_args.eval_beams
-    assert model.config.num_beams >= 1, f"got eval_beams={model.config.num_beams}. Need an integer >= 1"
-
-    # set max length for generation
-    model.config.max_generate_length = data_args.val_max_target_length
+    if data_args.eval_beams is None:
+        data_args.eval_beams = model.config.num_beams

    # set decoder_start_token_id for MBart
    if model.config.decoder_start_token_id is None and isinstance(tokenizer, MBartTokenizer):
-        decoder_start_token_id = tokenizer.lang_code_to_id[data_args.tgt_lang]
-        model.config.decoder_start_token_id = decoder_start_token_id
-
-    def build_compute_metrics_fn(task_name: str) -> Callable[[EvalPrediction], Dict]:
-        def non_pad_len(tokens: np.ndarray) -> int:
-            return np.count_nonzero(tokens != tokenizer.pad_token_id)
-
-        def decode_pred(pred: EvalPrediction) -> Tuple[List[str], List[str]]:
-            pred_str = tokenizer.batch_decode(pred.predictions, skip_special_tokens=True)
-            label_str = tokenizer.batch_decode(pred.label_ids, skip_special_tokens=True)
-            pred_str = lmap(str.strip, pred_str)
-            label_str = lmap(str.strip, label_str)
-            return pred_str, label_str
-
-        def summarization_metrics(pred: EvalPrediction) -> Dict:
-            pred_str, label_str = decode_pred(pred)
-            rouge: Dict = calculate_rouge(pred_str, label_str)
-            summ_len = np.mean(lmap(non_pad_len, pred.predictions))
-            rouge.update({"gen_len": summ_len})
-            return rouge
-
-        def translation_metrics(pred: EvalPrediction) -> Dict:
-            pred_str, label_str = decode_pred(pred)
-            bleu: Dict = calculate_bleu(pred_str, label_str)
-            gen_len = np.mean(lmap(non_pad_len, pred.predictions))
-            bleu.update({"gen_len": gen_len})
-            return bleu
-
-        compute_metrics_fn = summarization_metrics if "summarization" in task_name else translation_metrics
-        return compute_metrics_fn
-
-    def freeze_embeds(model: torch.nn.Module):
-        """Freeze token embeddings and positional embeddings for bart, just token embeddings for t5."""
-        try:
-            freeze_params(model.model.shared)
-            for d in [model.model.encoder, model.model.decoder]:
-                freeze_params(d.embed_positions)
-                freeze_params(d.embed_tokens)
-        except AttributeError:
-            freeze_params(model.shared)
-            for d in [model.encoder, model.decoder]:
-                freeze_params(d.embed_tokens)
+        assert (
+            data_args.tgt_lang is not None and data_args.src_lang is not None
+        ), "mBart requires --tgt_lang and --src_lang"
+        model.config.decoder_start_token_id = tokenizer.lang_code_to_id[data_args.tgt_lang]

    if model_args.freeze_embeds:
        freeze_embeds(model)
@@ -350,7 +256,7 @@ def main():
            max_source_length=data_args.max_source_length,
            prefix=model.config.prefix or "",
        )
-        if training_args.do_eval
+        if training_args.do_eval or training_args.evaluation_strategy != EvaluationStrategy.NO
        else None
    )
    test_dataset = (
@@ -368,13 +274,18 @@ def main():
    )

    # Initialize our Trainer
+    compute_metrics_fn = (
+        build_compute_metrics_fn(data_args.task, tokenizer) if training_args.predict_with_generate else None
+    )
    trainer = Seq2SeqTrainer(
        model=model,
+        config=config,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=Seq2SeqDataCollator(tokenizer, data_args, training_args.tpu_num_cores),
-        compute_metrics=build_compute_metrics_fn(data_args.task) if training_args.predict_with_generate else None,
+        compute_metrics=compute_metrics_fn,
+        data_args=data_args,
    )

    # Training
@@ -386,6 +297,7 @@ def main():
        # For convenience, we also re-save the tokenizer to the same directory,
        # so that you can share your model easily on huggingface.co/models =)
        if trainer.is_world_process_zero():
+            trainer.state.save_to_json(os.path.join(training_args.output_dir, "trainer_state.json"))
            tokenizer.save_pretrained(training_args.output_dir)

    # Evaluation
@@ -395,41 +307,36 @@ def main():

        result = trainer.evaluate()

-        output_eval_file = os.path.join(training_args.output_dir, "eval_results.json")
        if trainer.is_world_process_zero():
            logger.info("***** Eval results *****")
            for key, value in result.items():
                logger.info("  %s = %s", key, value)
-
-            with open(output_eval_file, "w") as f:
-                json.dump(result, f)
-
+            save_json(result, os.path.join(training_args.output_dir, "eval_results.json"))
            eval_results.update(result)

    if training_args.do_predict:
        logging.info("*** Test ***")

        test_output = trainer.predict(test_dataset=test_dataset)
-        test_metrics = test_output.metrics
-        test_metrics = {k.replace("eval", "test"): v for k, v in test_metrics.items()}
-
-        output_test_file = os.path.join(training_args.output_dir, "test_results.json")
+        test_metrics = {k.replace("eval", "test"): v for k, v in test_output.metrics.items()}

        if trainer.is_world_process_zero():
            logger.info("***** Test results *****")
            for key, value in test_metrics.items():
                logger.info("  %s = %s", key, value)

-            with open(output_test_file, "w") as f:
-                json.dump(test_metrics, f)
+            save_json(test_metrics, os.path.join(training_args.output_dir, "test_results.json"))
+            eval_results.update(test_metrics)

            if training_args.predict_with_generate:
-                test_preds = tokenizer.batch_decode(test_output.predictions, skip_special_tokens=True)
+                test_preds = tokenizer.batch_decode(
+                    test_output.predictions, skip_special_tokens=True, clean_up_tokenization_spaces=True
+                )
                test_preds = lmap(str.strip, test_preds)
-                output_test_pred_file = os.path.join(training_args.output_dir, "test_generations.txt")
-                with open(output_test_pred_file, "w") as f:
-                    f.write("\n".join(test_preds))
+                write_txt_file(test_preds, os.path.join(training_args.output_dir, "test_generations.txt"))

+    if trainer.is_world_process_zero():
+        save_json(eval_results, "all_results.json")
    return eval_results


--- a/examples/seq2seq/make_student.py
+++ b/examples/seq2seq/make_student.py
@@ -13,7 +13,7 @@ logger = logging.get_logger(__name__)


 def copy_layers(src_layers: nn.ModuleList, dest_layers: nn.ModuleList, layers_to_copy: List[int]) -> None:
-    layers_to_copy = nn.ModuleList([l for i, l in enumerate(src_layers) if i in layers_to_copy])
+    layers_to_copy = nn.ModuleList([src_layers[i] for i in layers_to_copy])
    assert len(dest_layers) == len(layers_to_copy), f"{len(dest_layers)} != {len(layers_to_copy)}"
    dest_layers.load_state_dict(layers_to_copy.state_dict())

@@ -32,7 +32,7 @@ LAYERS_TO_COPY = {
    },
    16: {  # maps  num layers in student -> which teacher layers to copy
        1: [0],
-        2: [0, 8],
+        2: [0, 15],
        3: [0, 8, 15],
        4: [0, 5, 10, 15],
        6: [0, 3, 6, 9, 12, 15],
@@ -81,6 +81,8 @@ def create_student_by_copying_alternating_layers(
    e: Union[int, None] = None,
    d: Union[int, None] = None,
    copy_first_teacher_layers=False,
+    e_layers_to_copy=None,
+    d_layers_to_copy=None,
    **extra_config_kwargs
 ) -> Tuple[PreTrainedModel, List[int], List[int]]:
    """Make a student by copying alternating layers from a teacher, save it to save_path.
@@ -142,8 +144,10 @@ def create_student_by_copying_alternating_layers(
        return student, e_layers_to_copy, d_layers_to_copy

    # Decide which layers of the teacher to copy. Not exactly alternating -- we try to keep first and last layer.
-    e_layers_to_copy: List[int] = pick_layers_to_copy(e, teacher_e)
-    d_layers_to_copy: List[int] = pick_layers_to_copy(d, teacher_d)
+    if e_layers_to_copy is None:
+        e_layers_to_copy: List[int] = pick_layers_to_copy(e, teacher_e)
+    if d_layers_to_copy is None:
+        d_layers_to_copy: List[int] = pick_layers_to_copy(d, teacher_d)

    try:
        copy_layers(teacher.model.encoder.layers, student.model.encoder.layers, e_layers_to_copy)
--- a/examples/seq2seq/precomputed_pseudo_labels.md
+++ b/examples/seq2seq/precomputed_pseudo_labels.md
@@ -0,0 +1,43 @@
+### Saved Pseudo-Labels
+These are the generations of various large models on various large **training** sets. All in all they took about 200 GPU hours to produce.
+
+### Available Pseudo-labels
+| Dataset | Model                       | Link                                                                                   | Rouge Scores       | Notes                                                                                                       
+|---------|-----------------------------|----------------------------------------------------------------------------------------|--------------------|-------------------------------------------------------------------------------------------------------------
+| XSUM    | `facebook/bart-large-xsum`    | [download](https://cdn-datasets.huggingface.co/pseudo/xsum/bart_xsum_pl.tgz)          | 49.8/28.0/42.5     |                                                                                                             
+| XSUM    | `google/pegasus-xsum`         | [download](https://cdn-datasets.huggingface.co/pseudo/xsum/pegasus_xsum.tgz)          | 53.3/32.7/46.5     |                                                                                                             
+| XSUM    | `facebook/bart-large-xsum`    | [download](https://cdn-datasets.huggingface.co/pseudo/xsum/xsum_pl2_bart.tgz)         |                    | Bart pseudolabels filtered to those with Rouge2 > 10.0 against the ground truth.
+| CNN/DM  | `sshleifer/pegasus-cnn-ft-v2` | [download](https://cdn-datasets.huggingface.co/pseudo/cnn_dm/pegasus_cnn_cnn_pls.tgz) | 47.316/26.65/44.56 | Do not worry about the fact that train.source is one line shorter.
+| CNN/DM  | `facebook/bart-large-cnn`     | [download](https://cdn-datasets.huggingface.co/pseudo/cnn_dm/cnn_bart_pl.tgz)         |                    | 5K (2%) are missing; there should be 282173.
+| CNN/DM  | `google/pegasus-xsum`         | [download](https://cdn-datasets.huggingface.co/pseudo/cnn_dm/pegasus_xsum_on_cnn.tgz) | 21.5/6.76/25       | Extra labels for XSUM distillation. Used `max_source_length=512` (all other settings from the pegasus-xsum configuration).
+| EN-RO   | `Helsinki-NLP/opus-mt-en-ro`  | [download](https://cdn-datasets.huggingface.co/pseudo/wmt_en_ro/opus_mt_en_ro.tgz) |       |  
+| EN-RO   | `facebook/mbart-large-en-ro`  | [download](https://cdn-datasets.huggingface.co/pseudo/wmt_en_ro/mbart_large_en_ro.tgz) |       |  
+
+
+(EN-RO = WMT 2016 English-Romanian.)
+
+Example Download Command:
+```bash
+curl -S https://cdn-datasets.huggingface.co/pseudo/xsum/bart_xsum_pl.tgz | tar -xvz -C .
+```
+### Generating New Pseudolabels
+Here is the command I used to generate the pseudolabels in the second row of the table, after downloading XSUM from [here](https://cdn-datasets.huggingface.co/summarization/xsum.tar.gz). 
+
+```bash                                                                         
+python -m torch.distributed.launch --nproc_per_node=8 run_distributed_eval.py \
+    --model_name google/pegasus-xsum \
+    --save_dir pegasus_xsum \
+    --data_dir xsum \
+    --bs 8 --sync_timeout 60000 \
+    --max_source_length 512 \
+    --type_path train
+```
+
+These commands take a while to run. For example, `pegasus_cnn_cnn_pls.tgz` took 8 hours on 8 GPUs.
+Pegasus does not work in fp16; BART, mBART and Marian do.
+Even if you have only 1 GPU, `run_distributed_eval.py` is 10-20% faster than `run_eval.py` because it uses `SortishSampler` to minimize padding computation.
+
+### Contributions
+Feel free to contribute your own pseudolabels via PR. Add a row to the table above with a Google Drive link (or any other link that can be downloaded from the command line).
+
+
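Reviewer note on the `SortishSampler` remark in the README above: the speedup comes from batching examples of similar length, so less compute is spent on pad tokens. The sketch below is a simplified illustration, not the sampler's actual code. `padding_tokens`, `sortish_order`, the chunk multiplier of 50 and the synthetic lengths are all assumptions made for this example.

```python
import random


def padding_tokens(lengths, batch_size):
    """Count pad tokens needed when each batch is padded to its longest example."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start : start + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total


def sortish_order(lengths, batch_size, chunk_mult=50, seed=0):
    """Hypothetical simplification: shuffle, then sort by length inside large chunks."""
    rng = random.Random(seed)
    idxs = list(range(len(lengths)))
    rng.shuffle(idxs)
    chunk = batch_size * chunk_mult
    ordered = []
    for start in range(0, len(idxs), chunk):
        # Longest-first inside each chunk keeps neighbours similar in length.
        ordered.extend(sorted(idxs[start : start + chunk], key=lambda i: lengths[i], reverse=True))
    return ordered


# Synthetic source lengths (assumption), already in random order.
length_rng = random.Random(0)
lengths = [length_rng.randint(5, 512) for _ in range(2000)]

random_pad = padding_tokens(lengths, batch_size=8)
order = sortish_order(lengths, batch_size=8)
sortish_pad = padding_tokens([lengths[i] for i in order], batch_size=8)
print(f"pad tokens, random order: {random_pad}; sortish order: {sortish_pad}")
```

Sorting only inside shuffled chunks keeps some randomness across the dataset while still grouping similar lengths within each batch.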
--- a/examples/seq2seq/rouge_cli.py
+++ b/examples/seq2seq/rouge_cli.py
@@ -9,7 +9,7 @@ def calculate_rouge_path(pred_path, tgt_path, save_path=None, **kwargs):
    tgt_lns = [x.strip() for x in open(tgt_path).readlines()][: len(pred_lns)]
    metrics = calculate_rouge(pred_lns, tgt_lns, **kwargs)
    if save_path is not None:
-        save_json(metrics, save_path)
+        save_json(metrics, save_path, indent=None)
    return metrics  # these print nicely


--- a/examples/seq2seq/run_distributed_eval.py
+++ b/examples/seq2seq/run_distributed_eval.py
@@ -42,8 +42,7 @@ def eval_data_dir(
    task="summarization",
    local_rank=None,
    num_return_sequences=1,
-    src_lang=None,
-    tgt_lang=None,
+    dataset_kwargs: Dict = None,
    prefix="",
    **generate_kwargs,
 ) -> Dict:
@@ -78,9 +77,8 @@ def eval_data_dir(
        max_target_length=1024,
        type_path=type_path,
        n_obs=n_obs,
-        src_lang=src_lang,
-        tgt_lang=tgt_lang,
        prefix=prefix,
+        **dataset_kwargs,
    )
    # I set shuffle=True for a more accurate progress bar.
    # If all the longest samples are first, the prog bar estimate is too high at the beginning.
@@ -158,6 +156,11 @@ def run_generate():
    if intermediate_files:
        raise ValueError(f"Found files at {json_save_dir} please move or remove them.")
        # In theory, a node could finish and save before another node hits this. If this happens, we can address later.
+    dataset_kwargs = {}
+    if args.src_lang is not None:
+        dataset_kwargs["src_lang"] = args.src_lang
+    if args.tgt_lang is not None:
+        dataset_kwargs["tgt_lang"] = args.tgt_lang

    Path(args.save_dir).mkdir(exist_ok=True)
    results, num_replicas = eval_data_dir(
@@ -173,8 +176,7 @@ def run_generate():
        max_source_length=args.max_source_length,
        num_return_sequences=args.num_return_sequences,
        prefix=args.prefix,
-        src_lang=args.src_lang,
-        tgt_lang=args.tgt_lang,
+        dataset_kwargs=dataset_kwargs,
        **generate_kwargs,
    )
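Reviewer note on the `dataset_kwargs` change above: `run_generate` always passes a dict, but the new signature defaults to `dataset_kwargs: Dict = None`, and `**None` raises a `TypeError` if another caller relies on that default. A minimal sketch of the same pattern with a guard follows. `build_dataset_kwargs` and `make_dataset` are hypothetical stand-ins, not functions from this PR.

```python
from types import SimpleNamespace
from typing import Dict, Optional


def build_dataset_kwargs(args) -> Dict[str, str]:
    """Collect only the language options that were actually set."""
    dataset_kwargs = {}
    for name in ("src_lang", "tgt_lang"):
        value = getattr(args, name, None)
        if value is not None:
            dataset_kwargs[name] = value
    return dataset_kwargs


def make_dataset(data_dir: str, dataset_kwargs: Optional[Dict] = None, **fixed) -> Dict:
    """Hypothetical stand-in for the dataset constructor; returns the merged options."""
    # `**None` raises TypeError, so fall back to an empty dict.
    dataset_kwargs = dataset_kwargs or {}
    return {"data_dir": data_dir, **fixed, **dataset_kwargs}


marian_args = SimpleNamespace(src_lang=None, tgt_lang=None)
mbart_args = SimpleNamespace(src_lang="en_XX", tgt_lang="ro_RO")

marian_opts = make_dataset("wmt_en_ro", build_dataset_kwargs(marian_args), prefix="")
mbart_opts = make_dataset("wmt_en_ro", build_dataset_kwargs(mbart_args), prefix="")
print(marian_opts)
print(mbart_opts)
```

Only options that were set are forwarded, so a dataset class that does not accept `src_lang`/`tgt_lang` never sees them.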

--- a/examples/seq2seq/run_eval.py
+++ b/examples/seq2seq/run_eval.py
@@ -152,8 +152,7 @@ def run_generate(verbose=True):
        print(scores)

    if args.score_path is not None:
-        path = args.score_path
-        json.dump(scores, open(path, "w"))
+        json.dump(scores, open(args.score_path, "w"))

    return scores

--- a/examples/seq2seq/sentence_splitter.py
+++ b/examples/seq2seq/sentence_splitter.py
@@ -1,5 +1,7 @@
 import re

+from filelock import FileLock
+

 try:
    import nltk
@@ -9,13 +11,12 @@ except (ImportError, ModuleNotFoundError):
    NLTK_AVAILABLE = False

 if NLTK_AVAILABLE:
-    try:
+    with FileLock(".lock") as lock:
        nltk.download("punkt", quiet=True)
-    except FileExistsError:  # multiprocessing race condition
-        pass


 def add_newline_to_end_of_each_sentence(x: str) -> str:
+    """This was added to get rougeLsum scores matching published rougeL scores for BART and PEGASUS."""
    re.sub("<n>", "", x)  # remove pegasus newline char
-    assert NLTK_AVAILABLE, "nltk must be installed to separate newlines betwee sentences. (pip install nltk)"
+    assert NLTK_AVAILABLE, "nltk must be installed to separate newlines between sentences. (pip install nltk)"
    return "\n".join(nltk.sent_tokenize(x))
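Reviewer note on `add_newline_to_end_of_each_sentence` above: the context line `re.sub("<n>", "", x)` discards its return value, and since Python strings are immutable the `<n>` marker is not removed. The sketch below shows the effect of keeping the result. `naive_sent_split` is a regex stand-in for `nltk.sent_tokenize`, assumed here only so the example runs without nltk.

```python
import re


def naive_sent_split(text):
    """Stand-in for nltk.sent_tokenize (assumption): split after ., ! or ? plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def add_newline_to_end_of_each_sentence(x: str) -> str:
    x = re.sub("<n>", "", x)  # keep the result: re.sub does not modify x in place
    return "\n".join(naive_sent_split(x))


result = add_newline_to_end_of_each_sentence("First sentence.<n> Second one! Third?")
print(result)
```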
--- a/examples/seq2seq/seq2seq_trainer.py
+++ b/examples/seq2seq/seq2seq_trainer.py
@@ -6,8 +6,19 @@ from torch import nn
 from torch.utils.data import DistributedSampler, RandomSampler

 from transformers import Trainer
+from transformers.configuration_fsmt import FSMTConfig
 from transformers.file_utils import is_torch_tpu_available
-from transformers.trainer import get_tpu_sampler
+from transformers.optimization import (
+    Adafactor,
+    AdamW,
+    get_constant_schedule,
+    get_constant_schedule_with_warmup,
+    get_cosine_schedule_with_warmup,
+    get_cosine_with_hard_restarts_schedule_with_warmup,
+    get_linear_schedule_with_warmup,
+    get_polynomial_decay_schedule_with_warmup,
+)
+from transformers.trainer_pt_utils import get_tpu_sampler


 try:
@@ -18,8 +29,74 @@ except ImportError:

 logger = logging.getLogger(__name__)

+arg_to_scheduler = {
+    "linear": get_linear_schedule_with_warmup,
+    "cosine": get_cosine_schedule_with_warmup,
+    "cosine_w_restarts": get_cosine_with_hard_restarts_schedule_with_warmup,
+    "polynomial": get_polynomial_decay_schedule_with_warmup,
+    "constant": get_constant_schedule,
+    "constant_w_warmup": get_constant_schedule_with_warmup,
+}
+arg_to_scheduler_choices = sorted(arg_to_scheduler.keys())
+

 class Seq2SeqTrainer(Trainer):
+    def __init__(self, config, data_args, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        self.config = config
+        self.data_args = data_args
+        self.max_gen_length = data_args.val_max_target_length
+        self.vocab_size = self.config.tgt_vocab_size if isinstance(self.config, FSMTConfig) else self.config.vocab_size
+
+    def create_optimizer_and_scheduler(self, num_training_steps: int):
+        """
+        Setup the optimizer and the learning rate scheduler.
+
+        We provide a reasonable default that works well. If you want to use something else, you can pass a tuple in the
+        Trainer's init through :obj:`optimizers`, or subclass and override this method in a subclass.
+        """
+        if self.optimizer is None:
+            no_decay = ["bias", "LayerNorm.weight"]
+            optimizer_grouped_parameters = [
+                {
+                    "params": [p for n, p in self.model.named_parameters() if not any(nd in n for nd in no_decay)],
+                    "weight_decay": self.args.weight_decay,
+                },
+                {
+                    "params": [p for n, p in self.model.named_parameters() if any(nd in n for nd in no_decay)],
+                    "weight_decay": 0.0,
+                },
+            ]
+            if self.args.adafactor:
+                self.optimizer = Adafactor(
+                    optimizer_grouped_parameters,
+                    lr=self.args.learning_rate,
+                    scale_parameter=False,
+                    relative_step=False,
+                )
+
+            else:
+                self.optimizer = AdamW(
+                    optimizer_grouped_parameters, lr=self.args.learning_rate, eps=self.args.adam_epsilon
+                )
+
+        if self.lr_scheduler is None:
+            self.lr_scheduler = self._get_lr_scheduler(num_training_steps)
+        else:  # ignoring --lr_scheduler
+            logger.warn("scheduler is passed to `Seq2SeqTrainer`, `--lr_scheduler` arg is ignored.")
+
+    def _get_lr_scheduler(self, num_training_steps):
+        schedule_func = arg_to_scheduler[self.args.lr_scheduler]
+        if self.args.lr_scheduler == "constant":
+            scheduler = schedule_func(self.optimizer)
+        elif self.args.lr_scheduler == "constant_w_warmup":
+            scheduler = schedule_func(self.optimizer, num_warmup_steps=self.args.warmup_steps)
+        else:
+            scheduler = schedule_func(
+                self.optimizer, num_warmup_steps=self.args.warmup_steps, num_training_steps=num_training_steps
+            )
+        return scheduler
+
    def _get_train_sampler(self) -> Optional[torch.utils.data.sampler.Sampler]:
        if isinstance(self.train_dataset, torch.utils.data.IterableDataset):
            return None
@@ -41,18 +118,18 @@ class Seq2SeqTrainer(Trainer):
        labels = inputs.pop("labels")
        outputs = model(**inputs, use_cache=False)
        logits = outputs[0]
-        return self._compute_loss(logits, labels, ignore_index=model.config.pad_token_id)
+        return self._compute_loss(logits, labels)

-    def _compute_loss(self, logits, labels, ignore_index):
+    def _compute_loss(self, logits, labels):
        if self.args.label_smoothing == 0:
            # Same behavior as modeling_bart.py
-            loss_fct = torch.nn.CrossEntropyLoss(ignore_index=ignore_index)
-            assert logits.shape[-1] == self.model.config.vocab_size
+            loss_fct = torch.nn.CrossEntropyLoss(ignore_index=self.config.pad_token_id)
+            assert logits.shape[-1] == self.vocab_size
            loss = loss_fct(logits.view(-1, logits.shape[-1]), labels.view(-1))
        else:
            lprobs = torch.nn.functional.log_softmax(logits, dim=-1)
            loss, nll_loss = label_smoothed_nll_loss(
-                lprobs, labels, self.args.label_smoothing, ignore_index=ignore_index
+                lprobs, labels, self.args.label_smoothing, ignore_index=self.config.pad_token_id
            )
        return loss

@@ -81,45 +158,34 @@ class Seq2SeqTrainer(Trainer):
        """
        inputs = self._prepare_inputs(inputs)

-        max_length = (
-            model.config.max_generate_length
-            if hasattr(model.config, "max_generate_length")
-            else model.config.max_position_embeddings
-        )
-
        with torch.no_grad():
            if self.args.predict_with_generate and not self.args.prediction_loss_only:
                generated_tokens = model.generate(
                    inputs["input_ids"],
                    attention_mask=inputs["attention_mask"],
                    use_cache=True,
-                    num_beams=model.config.num_beams,
-                    max_length=max_length,
+                    num_beams=self.data_args.eval_beams,
+                    max_length=self.max_gen_length,
                )
                # in case the batch is shorter than max length, the output should be padded
-                generated_tokens = self._pad_tensors_to_max_len(
-                    generated_tokens, max_length, model.config.pad_token_id
-                )
+                generated_tokens = self._pad_tensors_to_max_len(generated_tokens, self.max_gen_length)

            labels_out = inputs.get("labels")
-            outputs = model(**inputs)
-            logits = outputs[1]
-            loss = self._compute_loss(logits, labels_out, model.config.pad_token_id)
-            loss = loss.mean().item()
+            # Call forward again to get loss # TODO: avoidable?
+            outputs = model(**inputs, use_cache=False)
+            loss = self._compute_loss(outputs[1], labels_out)
+            loss = loss.mean().detach()
            if self.args.prediction_loss_only:
-                logits = None
-            else:
-                logits = generated_tokens if self.args.predict_with_generate else logits
+                return (loss, None, None)

-        if self.args.prediction_loss_only:
-            return (loss, None, None)
+            logits = generated_tokens if self.args.predict_with_generate else outputs[1]

        labels_out = labels_out.detach()
-        labels = self._pad_tensors_to_max_len(labels_out, max_length, model.config.pad_token_id)
+        labels = self._pad_tensors_to_max_len(labels_out, self.max_gen_length)
        return (loss, logits.detach(), labels)

-    def _pad_tensors_to_max_len(self, tensor, max_length, pad_token_id):
-        padded_tensor = pad_token_id * torch.ones(
+    def _pad_tensors_to_max_len(self, tensor, max_length):
+        padded_tensor = self.config.pad_token_id * torch.ones(
            (tensor.shape[0], max_length), dtype=tensor.dtype, device=tensor.device
        )
        padded_tensor[:, : tensor.shape[-1]] = tensor
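Reviewer note on `_get_lr_scheduler` above: the name-to-factory dict needs the `if`/`elif` branches because the factories take different arguments. The torch-free sketch below shows the same dispatch. The `make_*` factories and `get_lr_multiplier` are hypothetical stand-ins that return a learning-rate multiplier per step rather than a real scheduler object.

```python
def make_constant():
    """Multiplier is always 1.0; takes no warmup or total-step arguments."""
    return lambda step: 1.0


def make_constant_with_warmup(num_warmup_steps):
    """Ramp linearly from 0 to 1 during warmup, then stay at 1.0."""
    return lambda step: min(1.0, step / max(1, num_warmup_steps))


def make_linear(num_warmup_steps, num_training_steps):
    """Linear warmup followed by linear decay to 0 at the last step."""

    def multiplier(step):
        if step < num_warmup_steps:
            return step / max(1, num_warmup_steps)
        remaining = num_training_steps - step
        return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

    return multiplier


ARG_TO_SCHEDULE = {
    "constant": make_constant,
    "constant_w_warmup": make_constant_with_warmup,
    "linear": make_linear,
}


def get_lr_multiplier(name, warmup_steps, num_training_steps):
    """Dispatch on the schedule name, passing only the arguments each factory accepts."""
    factory = ARG_TO_SCHEDULE[name]
    if name == "constant":
        return factory()
    if name == "constant_w_warmup":
        return factory(num_warmup_steps=warmup_steps)
    return factory(num_warmup_steps=warmup_steps, num_training_steps=num_training_steps)


linear = get_lr_multiplier("linear", warmup_steps=10, num_training_steps=100)
warm = get_lr_multiplier("constant_w_warmup", warmup_steps=10, num_training_steps=100)
const = get_lr_multiplier("constant", warmup_steps=10, num_training_steps=100)
print([linear(s) for s in (0, 5, 10, 55, 100)])
```

An unknown name raises `KeyError` at the dict lookup, which is why the PR also exposes `arg_to_scheduler_choices` for argument validation.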
--- a/examples/seq2seq/test_bash_script.py
+++ b/examples/seq2seq/test_bash_script.py
@@ -3,7 +3,6 @@
 import argparse
 import os
 import sys
-import tempfile
 from pathlib import Path
 from unittest.mock import patch

@@ -16,172 +15,172 @@ from distillation import BartSummarizationDistiller, distill_main
 from finetune import SummarizationModule, main
 from test_seq2seq_examples import CUDA_AVAILABLE, MBART_TINY
 from transformers import BartForConditionalGeneration, MarianMTModel
-from transformers.testing_utils import slow
+from transformers.testing_utils import TestCasePlus, slow
 from utils import load_json


 MODEL_NAME = MBART_TINY
-# TODO(SS): MODEL_NAME = "sshleifer/student_mbart_en_ro_1_1"
 MARIAN_MODEL = "sshleifer/student_marian_en_ro_6_1"


-@slow
-@pytest.mark.skipif(not CUDA_AVAILABLE, reason="too slow to run on CPU")
-def test_model_download():
-    """This warms up the cache so that we can time the next test without including download time, which varies between machines."""
-    BartForConditionalGeneration.from_pretrained(MODEL_NAME)
-    MarianMTModel.from_pretrained(MARIAN_MODEL)
+class TestAll(TestCasePlus):
+    @slow
+    @pytest.mark.skipif(not CUDA_AVAILABLE, reason="too slow to run on CPU")
+    def test_model_download(self):
+        """This warms up the cache so that we can time the next test without including download time, which varies between machines."""
+        BartForConditionalGeneration.from_pretrained(MODEL_NAME)
+        MarianMTModel.from_pretrained(MARIAN_MODEL)

+    @timeout_decorator.timeout(120)
+    @slow
+    @pytest.mark.skipif(not CUDA_AVAILABLE, reason="too slow to run on CPU")
+    def test_train_mbart_cc25_enro_script(self):
+        data_dir = "examples/seq2seq/test_data/wmt_en_ro"
+        env_vars_to_replace = {
+            "--fp16_opt_level=O1": "",
+            "$MAX_LEN": 128,
+            "$BS": 4,
+            "$GAS": 1,
+            "$ENRO_DIR": data_dir,
+            "facebook/mbart-large-cc25": MODEL_NAME,
+            # Download is 120MB in previous test.
+            "val_check_interval=0.25": "val_check_interval=1.0",
+        }

-@timeout_decorator.timeout(120)
-@slow
-@pytest.mark.skipif(not CUDA_AVAILABLE, reason="too slow to run on CPU")
-def test_train_mbart_cc25_enro_script():
-    data_dir = "examples/seq2seq/test_data/wmt_en_ro"
-    env_vars_to_replace = {
-        "--fp16_opt_level=O1": "",
-        "$MAX_LEN": 128,
-        "$BS": 4,
-        "$GAS": 1,
-        "$ENRO_DIR": data_dir,
-        "facebook/mbart-large-cc25": MODEL_NAME,
-        # Download is 120MB in previous test.
-        "val_check_interval=0.25": "val_check_interval=1.0",
-    }
+        # Clean up bash script
+        bash_script = Path("examples/seq2seq/train_mbart_cc25_enro.sh").open().read().split("finetune.py")[1].strip()
+        bash_script = bash_script.replace("\\\n", "").strip().replace('"$@"', "")
+        for k, v in env_vars_to_replace.items():
+            bash_script = bash_script.replace(k, str(v))
+        output_dir = self.get_auto_remove_tmp_dir()

-    # Clean up bash script
-    bash_script = Path("examples/seq2seq/train_mbart_cc25_enro.sh").open().read().split("finetune.py")[1].strip()
-    bash_script = bash_script.replace("\\\n", "").strip().replace('"$@"', "")
-    for k, v in env_vars_to_replace.items():
-        bash_script = bash_script.replace(k, str(v))
-    output_dir = tempfile.mkdtemp(prefix="output_mbart")
+        bash_script = bash_script.replace("--fp16 ", "")
+        testargs = (
+            ["finetune.py"]
+            + bash_script.split()
+            + [
+                f"--output_dir={output_dir}",
+                "--gpus=1",
+                "--learning_rate=3e-1",
+                "--warmup_steps=0",
+                "--val_check_interval=1.0",
+                "--tokenizer_name=facebook/mbart-large-en-ro",
+            ]
+        )
+        with patch.object(sys, "argv", testargs):
+            parser = argparse.ArgumentParser()
+            parser = pl.Trainer.add_argparse_args(parser)
+            parser = SummarizationModule.add_model_specific_args(parser, os.getcwd())
+            args = parser.parse_args()
+            args.do_predict = False
+            # assert args.gpus == gpus THIS BREAKS for multigpu
+            model = main(args)

-    bash_script = bash_script.replace("--fp16 ", "")
-    testargs = (
-        ["finetune.py"]
-        + bash_script.split()
-        + [
-            f"--output_dir={output_dir}",
-            "--gpus=1",
-            "--learning_rate=3e-1",
-            "--warmup_steps=0",
-            "--val_check_interval=1.0",
-            "--tokenizer_name=facebook/mbart-large-en-ro",
-        ]
-    )
-    with patch.object(sys, "argv", testargs):
-        parser = argparse.ArgumentParser()
-        parser = pl.Trainer.add_argparse_args(parser)
-        parser = SummarizationModule.add_model_specific_args(parser, os.getcwd())
-        args = parser.parse_args()
-        args.do_predict = False
-        # assert args.gpus == gpus THIS BREAKS for multigpu
-        model = main(args)
+        # Check metrics
+        metrics = load_json(model.metrics_save_path)
+        first_step_stats = metrics["val"][0]
+        last_step_stats = metrics["val"][-1]
+        assert (
+            len(metrics["val"]) == (args.max_epochs / args.val_check_interval) + 1
+        )  # +1 accounts for val_sanity_check

-    # Check metrics
-    metrics = load_json(model.metrics_save_path)
-    first_step_stats = metrics["val"][0]
-    last_step_stats = metrics["val"][-1]
-    assert len(metrics["val"]) == (args.max_epochs / args.val_check_interval) + 1  # +1 accounts for val_sanity_check
+        assert last_step_stats["val_avg_gen_time"] >= 0.01

-    assert last_step_stats["val_avg_gen_time"] >= 0.01
+        assert first_step_stats["val_avg_bleu"] < last_step_stats["val_avg_bleu"]  # model learned nothing
+        assert 1.0 >= last_step_stats["val_avg_gen_time"]  # model hanging on generate. Maybe bad config was saved.
+        assert isinstance(last_step_stats[f"val_avg_{model.val_metric}"], float)

-    assert first_step_stats["val_avg_bleu"] < last_step_stats["val_avg_bleu"]  # model learned nothing
-    assert 1.0 >= last_step_stats["val_avg_gen_time"]  # model hanging on generate. Maybe bad config was saved.
-    assert isinstance(last_step_stats[f"val_avg_{model.val_metric}"], float)
+        # check lightning ckpt can be loaded and has a reasonable statedict
+        contents = os.listdir(output_dir)
+        ckpt_path = [x for x in contents if x.endswith(".ckpt")][0]
+        full_path = os.path.join(args.output_dir, ckpt_path)
+        ckpt = torch.load(full_path, map_location="cpu")
+        expected_key = "model.model.decoder.layers.0.encoder_attn_layer_norm.weight"
+        assert expected_key in ckpt["state_dict"]
+        assert ckpt["state_dict"]["model.model.decoder.layers.0.encoder_attn_layer_norm.weight"].dtype == torch.float32

-    # check lightning ckpt can be loaded and has a reasonable statedict
-    contents = os.listdir(output_dir)
-    ckpt_path = [x for x in contents if x.endswith(".ckpt")][0]
-    full_path = os.path.join(args.output_dir, ckpt_path)
-    ckpt = torch.load(full_path, map_location="cpu")
-    expected_key = "model.model.decoder.layers.0.encoder_attn_layer_norm.weight"
-    assert expected_key in ckpt["state_dict"]
-    assert ckpt["state_dict"]["model.model.decoder.layers.0.encoder_attn_layer_norm.weight"].dtype == torch.float32
+        # TODO: turn on args.do_predict when PL bug fixed.
+        if args.do_predict:
+            contents = {os.path.basename(p) for p in contents}
+            assert "test_generations.txt" in contents
+            assert "test_results.txt" in contents
+            # assert len(metrics["val"]) ==  desired_n_evals
+            assert len(metrics["test"]) == 1

-    # TODO(SS): turn on args.do_predict when PL bug fixed.
-    if args.do_predict:
-        contents = {os.path.basename(p) for p in contents}
-        assert "test_generations.txt" in contents
-        assert "test_results.txt" in contents
-        # assert len(metrics["val"]) ==  desired_n_evals
-        assert len(metrics["test"]) == 1
+    @timeout_decorator.timeout(600)
+    @slow
+    @pytest.mark.skipif(not CUDA_AVAILABLE, reason="too slow to run on CPU")
+    def test_opus_mt_distill_script(self):
+        data_dir = "examples/seq2seq/test_data/wmt_en_ro"
+        env_vars_to_replace = {
+            "--fp16_opt_level=O1": "",
+            "$MAX_LEN": 128,
+            "$BS": 16,
+            "$GAS": 1,
+            "$ENRO_DIR": data_dir,
+            "$m": "sshleifer/student_marian_en_ro_6_1",
+            "val_check_interval=0.25": "val_check_interval=1.0",
+        }

+        # Clean up bash script
+        bash_script = (
+            Path("examples/seq2seq/distil_marian_no_teacher.sh").open().read().split("distillation.py")[1].strip()
+        )
+        bash_script = bash_script.replace("\\\n", "").strip().replace('"$@"', "")
+        bash_script = bash_script.replace("--fp16 ", " ")

-@timeout_decorator.timeout(600)
-@slow
-@pytest.mark.skipif(not CUDA_AVAILABLE, reason="too slow to run on CPU")
-def test_opus_mt_distill_script():
-    data_dir = "examples/seq2seq/test_data/wmt_en_ro"
-    env_vars_to_replace = {
-        "--fp16_opt_level=O1": "",
-        "$MAX_LEN": 128,
-        "$BS": 16,
-        "$GAS": 1,
-        "$ENRO_DIR": data_dir,
-        "$m": "sshleifer/student_marian_en_ro_6_1",
-        "val_check_interval=0.25": "val_check_interval=1.0",
-    }
+        for k, v in env_vars_to_replace.items():
+            bash_script = bash_script.replace(k, str(v))
+        output_dir = self.get_auto_remove_tmp_dir()
+        bash_script = bash_script.replace("--fp16", "")
+        epochs = 6
+        testargs = (
+            ["distillation.py"]
+            + bash_script.split()
+            + [
+                f"--output_dir={output_dir}",
+                "--gpus=1",
+                "--learning_rate=1e-3",
+                f"--num_train_epochs={epochs}",
+                "--warmup_steps=10",
+                "--val_check_interval=1.0",
+            ]
+        )
+        with patch.object(sys, "argv", testargs):
+            parser = argparse.ArgumentParser()
+            parser = pl.Trainer.add_argparse_args(parser)
+            parser = BartSummarizationDistiller.add_model_specific_args(parser, os.getcwd())
+            args = parser.parse_args()
+            args.do_predict = False
+            # assert args.gpus == gpus THIS BREAKS for multigpu

-    # Clean up bash script
-    bash_script = (
-        Path("examples/seq2seq/distil_marian_no_teacher.sh").open().read().split("distillation.py")[1].strip()
-    )
-    bash_script = bash_script.replace("\\\n", "").strip().replace('"$@"', "")
-    bash_script = bash_script.replace("--fp16 ", " ")
+            model = distill_main(args)

-    for k, v in env_vars_to_replace.items():
-        bash_script = bash_script.replace(k, str(v))
-    output_dir = tempfile.mkdtemp(prefix="marian_output")
-    bash_script = bash_script.replace("--fp16", "")
-    epochs = 6
-    testargs = (
-        ["distillation.py"]
-        + bash_script.split()
-        + [
-            f"--output_dir={output_dir}",
-            "--gpus=1",
-            "--learning_rate=1e-3",
-            f"--num_train_epochs={epochs}",
-            "--warmup_steps=10",
-            "--val_check_interval=1.0",
-        ]
-    )
-    with patch.object(sys, "argv", testargs):
-        parser = argparse.ArgumentParser()
-        parser = pl.Trainer.add_argparse_args(parser)
-        parser = BartSummarizationDistiller.add_model_specific_args(parser, os.getcwd())
-        args = parser.parse_args()
-        args.do_predict = False
-        # assert args.gpus == gpus THIS BREAKS for multigpu
+        # Check metrics
+        metrics = load_json(model.metrics_save_path)
+        first_step_stats = metrics["val"][0]
+        last_step_stats = metrics["val"][-1]
+        assert len(metrics["val"]) >= (args.max_epochs / args.val_check_interval)  # +1 accounts for val_sanity_check

-        model = distill_main(args)
+        assert last_step_stats["val_avg_gen_time"] >= 0.01

-    # Check metrics
-    metrics = load_json(model.metrics_save_path)
-    first_step_stats = metrics["val"][0]
-    last_step_stats = metrics["val"][-1]
-    assert len(metrics["val"]) >= (args.max_epochs / args.val_check_interval)  # +1 accounts for val_sanity_check
+        assert first_step_stats["val_avg_bleu"] < last_step_stats["val_avg_bleu"]  # model learned nothing
+        assert 1.0 >= last_step_stats["val_avg_gen_time"]  # model hanging on generate. Maybe bad config was saved.
+        assert isinstance(last_step_stats[f"val_avg_{model.val_metric}"], float)

-    assert last_step_stats["val_avg_gen_time"] >= 0.01
+        # check lightning ckpt can be loaded and has a reasonable statedict
+        contents = os.listdir(output_dir)
+        ckpt_path = [x for x in contents if x.endswith(".ckpt")][0]
+        full_path = os.path.join(args.output_dir, ckpt_path)
+        ckpt = torch.load(full_path, map_location="cpu")
+        expected_key = "model.model.decoder.layers.0.encoder_attn_layer_norm.weight"
+        assert expected_key in ckpt["state_dict"]
+        assert ckpt["state_dict"]["model.model.decoder.layers.0.encoder_attn_layer_norm.weight"].dtype == torch.float32

-    assert first_step_stats["val_avg_bleu"] < last_step_stats["val_avg_bleu"]  # model learned nothing
-    assert 1.0 >= last_step_stats["val_avg_gen_time"]  # model hanging on generate. Maybe bad config was saved.
-    assert isinstance(last_step_stats[f"val_avg_{model.val_metric}"], float)
-
-    # check lightning ckpt can be loaded and has a reasonable statedict
-    contents = os.listdir(output_dir)
-    ckpt_path = [x for x in contents if x.endswith(".ckpt")][0]
-    full_path = os.path.join(args.output_dir, ckpt_path)
-    ckpt = torch.load(full_path, map_location="cpu")
-    expected_key = "model.model.decoder.layers.0.encoder_attn_layer_norm.weight"
-    assert expected_key in ckpt["state_dict"]
-    assert ckpt["state_dict"]["model.model.decoder.layers.0.encoder_attn_layer_norm.weight"].dtype == torch.float32
-
-    # TODO(SS): turn on args.do_predict when PL bug fixed.
-    if args.do_predict:
-        contents = {os.path.basename(p) for p in contents}
-        assert "test_generations.txt" in contents
-        assert "test_results.txt" in contents
-        # assert len(metrics["val"]) ==  desired_n_evals
-        assert len(metrics["test"]) == 1
+        # TODO: turn on args.do_predict when PL bug fixed.
+        if args.do_predict:
+            contents = {os.path.basename(p) for p in contents}
+            assert "test_generations.txt" in contents
+            assert "test_results.txt" in contents
+            # assert len(metrics["val"]) ==  desired_n_evals
+            assert len(metrics["test"]) == 1
--- a/examples/seq2seq/test_data/wmt_en_ro/train.len
+++ b/examples/seq2seq/test_data/wmt_en_ro/train.len
--- a/examples/seq2seq/test_data/wmt_en_ro/val.len
+++ b/examples/seq2seq/test_data/wmt_en_ro/val.len
--- a/examples/seq2seq/test_datasets.py
+++ b/examples/seq2seq/test_datasets.py
@@ -1,5 +1,4 @@
 import os
-import tempfile
 from pathlib import Path

 import numpy as np
@@ -7,11 +6,12 @@ import pytest
 from torch.utils.data import DataLoader

 from pack_dataset import pack_data_dir
+from parameterized import parameterized
 from save_len_file import save_len_file
 from test_seq2seq_examples import ARTICLES, BART_TINY, MARIAN_TINY, MBART_TINY, SUMMARIES, T5_TINY, make_test_data_dir
 from transformers import AutoTokenizer
 from transformers.modeling_bart import shift_tokens_right
-from transformers.testing_utils import slow
+from transformers.testing_utils import TestCasePlus, slow
 from utils import FAIRSEQ_AVAILABLE, DistributedSortishSampler, LegacySeq2SeqDataset, Seq2SeqDataset


@@ -19,169 +19,198 @@ BERT_BASE_CASED = "bert-base-cased"
 PEGASUS_XSUM = "google/pegasus-xsum"


-@slow
-@pytest.mark.parametrize(
-    "tok_name",
-    [
-        MBART_TINY,
-        MARIAN_TINY,
-        T5_TINY,
-        BART_TINY,
-        PEGASUS_XSUM,
-    ],
-)
-def test_seq2seq_dataset_truncation(tok_name):
-    tokenizer = AutoTokenizer.from_pretrained(tok_name)
-    tmp_dir = make_test_data_dir()
-    max_len_source = max(len(tokenizer.encode(a)) for a in ARTICLES)
-    max_len_target = max(len(tokenizer.encode(a)) for a in SUMMARIES)
-    max_src_len = 4
-    max_tgt_len = 8
-    assert max_len_target > max_src_len  # Will be truncated
-    assert max_len_source > max_src_len  # Will be truncated
-    src_lang, tgt_lang = "ro_RO", "de_DE"  # ignored for all but mbart, but never causes error.
-    train_dataset = Seq2SeqDataset(
-        tokenizer,
-        data_dir=tmp_dir,
-        type_path="train",
-        max_source_length=max_src_len,
-        max_target_length=max_tgt_len,  # ignored
-        src_lang=src_lang,
-        tgt_lang=tgt_lang,
+class TestAll(TestCasePlus):
+    @parameterized.expand(
+        [
+            MBART_TINY,
+            MARIAN_TINY,
+            T5_TINY,
+            BART_TINY,
+            PEGASUS_XSUM,
+        ],
    )
-    dataloader = DataLoader(train_dataset, batch_size=2, collate_fn=train_dataset.collate_fn)
-    for batch in dataloader:
-        assert isinstance(batch, dict)
-        assert batch["attention_mask"].shape == batch["input_ids"].shape
-        # show that articles were trimmed.
-        assert batch["input_ids"].shape[1] == max_src_len
-        # show that targets are the same len
-        assert batch["labels"].shape[1] == max_tgt_len
-        if tok_name != MBART_TINY:
-            continue
-        # check language codes in correct place
-        batch["decoder_input_ids"] = shift_tokens_right(batch["labels"], tokenizer.pad_token_id)
-        assert batch["decoder_input_ids"][0, 0].item() == tokenizer.lang_code_to_id[tgt_lang]
-        assert batch["decoder_input_ids"][0, -1].item() == tokenizer.eos_token_id
-        assert batch["input_ids"][0, -2].item() == tokenizer.eos_token_id
-        assert batch["input_ids"][0, -1].item() == tokenizer.lang_code_to_id[src_lang]
+    @slow
+    def test_seq2seq_dataset_truncation(self, tok_name):
+        tokenizer = AutoTokenizer.from_pretrained(tok_name)
+        tmp_dir = make_test_data_dir(tmp_dir=self.get_auto_remove_tmp_dir())
+        max_len_source = max(len(tokenizer.encode(a)) for a in ARTICLES)
+        max_len_target = max(len(tokenizer.encode(a)) for a in SUMMARIES)
+        max_src_len = 4
+        max_tgt_len = 8
+        assert max_len_target > max_src_len  # Will be truncated
+        assert max_len_source > max_src_len  # Will be truncated
+        src_lang, tgt_lang = "ro_RO", "de_DE"  # ignored for all but mbart, but never causes error.
+        train_dataset = Seq2SeqDataset(
+            tokenizer,
+            data_dir=tmp_dir,
+            type_path="train",
+            max_source_length=max_src_len,
+            max_target_length=max_tgt_len,  # ignored
+            src_lang=src_lang,
+            tgt_lang=tgt_lang,
+        )
+        dataloader = DataLoader(train_dataset, batch_size=2, collate_fn=train_dataset.collate_fn)
+        for batch in dataloader:
+            assert isinstance(batch, dict)
+            assert batch["attention_mask"].shape == batch["input_ids"].shape
+            # show that articles were trimmed.
+            assert batch["input_ids"].shape[1] == max_src_len
+            # show that targets are the same len
+            assert batch["labels"].shape[1] == max_tgt_len
+            if tok_name != MBART_TINY:
+                continue
+            # check language codes in correct place
+            batch["decoder_input_ids"] = shift_tokens_right(batch["labels"], tokenizer.pad_token_id)
+            assert batch["decoder_input_ids"][0, 0].item() == tokenizer.lang_code_to_id[tgt_lang]
+            assert batch["decoder_input_ids"][0, -1].item() == tokenizer.eos_token_id
+            assert batch["input_ids"][0, -2].item() == tokenizer.eos_token_id
+            assert batch["input_ids"][0, -1].item() == tokenizer.lang_code_to_id[src_lang]

-        break  # No need to test every batch
+            break  # No need to test every batch

+    @parameterized.expand([BART_TINY, BERT_BASE_CASED])
+    def test_legacy_dataset_truncation(self, tok):
+        tokenizer = AutoTokenizer.from_pretrained(tok)
+        tmp_dir = make_test_data_dir(tmp_dir=self.get_auto_remove_tmp_dir())
+        max_len_source = max(len(tokenizer.encode(a)) for a in ARTICLES)
+        max_len_target = max(len(tokenizer.encode(a)) for a in SUMMARIES)
+        trunc_target = 4
+        train_dataset = LegacySeq2SeqDataset(
+            tokenizer,
+            data_dir=tmp_dir,
+            type_path="train",
+            max_source_length=20,
+            max_target_length=trunc_target,
+        )
+        dataloader = DataLoader(train_dataset, batch_size=2, collate_fn=train_dataset.collate_fn)
+        for batch in dataloader:
+            assert batch["attention_mask"].shape == batch["input_ids"].shape
+            # show that articles were trimmed.
+            assert batch["input_ids"].shape[1] == max_len_source
+            assert 20 >= batch["input_ids"].shape[1]  # trimmed significantly
+            # show that targets were truncated
+            assert batch["labels"].shape[1] == trunc_target  # Truncated
+            assert max_len_target > trunc_target  # Truncated
+            break  # No need to test every batch

-@pytest.mark.parametrize("tok", [BART_TINY, BERT_BASE_CASED])
-def test_legacy_dataset_truncation(tok):
-    tokenizer = AutoTokenizer.from_pretrained(tok)
-    tmp_dir = make_test_data_dir()
-    max_len_source = max(len(tokenizer.encode(a)) for a in ARTICLES)
-    max_len_target = max(len(tokenizer.encode(a)) for a in SUMMARIES)
-    trunc_target = 4
-    train_dataset = LegacySeq2SeqDataset(
-        tokenizer,
-        data_dir=tmp_dir,
-        type_path="train",
-        max_source_length=20,
-        max_target_length=trunc_target,
-    )
-    dataloader = DataLoader(train_dataset, batch_size=2, collate_fn=train_dataset.collate_fn)
-    for batch in dataloader:
-        assert batch["attention_mask"].shape == batch["input_ids"].shape
-        # show that articles were trimmed.
-        assert batch["input_ids"].shape[1] == max_len_source
-        assert 20 >= batch["input_ids"].shape[1]  # trimmed significantly
-        # show that targets were truncated
-        assert batch["labels"].shape[1] == trunc_target  # Truncated
-        assert max_len_target > trunc_target  # Truncated
-        break  # No need to test every batch
+    def test_pack_dataset(self):
+        tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-cc25")

+        tmp_dir = Path(make_test_data_dir(tmp_dir=self.get_auto_remove_tmp_dir()))
+        orig_examples = tmp_dir.joinpath("train.source").open().readlines()
+        save_dir = Path(make_test_data_dir(tmp_dir=self.get_auto_remove_tmp_dir()))
+        pack_data_dir(tokenizer, tmp_dir, 128, save_dir)
+        orig_paths = {x.name for x in tmp_dir.iterdir()}
+        new_paths = {x.name for x in save_dir.iterdir()}
+        packed_examples = save_dir.joinpath("train.source").open().readlines()
+        # orig: [' Sam ate lunch today.\n', 'Sams lunch ingredients.']
+        # desired_packed: [' Sam ate lunch today.\n Sams lunch ingredients.']
+        assert len(packed_examples) < len(orig_examples)
+        assert len(packed_examples) == 1
+        assert len(packed_examples[0]) == sum(len(x) for x in orig_examples)
+        assert orig_paths == new_paths

-def test_pack_dataset():
-    tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-cc25")
+    @pytest.mark.skipif(not FAIRSEQ_AVAILABLE, reason="This test requires fairseq")
+    def test_dynamic_batch_size(self):
+        if not FAIRSEQ_AVAILABLE:
+            return
+        ds, max_tokens, tokenizer = self._get_dataset(max_len=64)
+        required_batch_size_multiple = 64
+        batch_sampler = ds.make_dynamic_sampler(max_tokens, required_batch_size_multiple=required_batch_size_multiple)
+        batch_sizes = [len(x) for x in batch_sampler]
+        assert len(set(batch_sizes)) > 1  # it's not dynamic batch size if every batch is the same length
+        assert sum(batch_sizes) == len(ds)  # no dropped or added examples
+        data_loader = DataLoader(ds, batch_sampler=batch_sampler, collate_fn=ds.collate_fn, num_workers=2)
+        failures = []
+        num_src_per_batch = []
+        for batch in data_loader:
+            src_shape = batch["input_ids"].shape
+            bs = src_shape[0]
+            assert bs % required_batch_size_multiple == 0 or bs < required_batch_size_multiple
+            num_src_tokens = np.prod(batch["input_ids"].shape)
+            num_src_per_batch.append(num_src_tokens)
+            if num_src_tokens > (max_tokens * 1.1):
+                failures.append(num_src_tokens)
+        assert num_src_per_batch[0] == max(num_src_per_batch)
+        if failures:
+            raise AssertionError(f"too many tokens in {len(failures)} batches")

-    tmp_dir = Path(make_test_data_dir())
-    orig_examples = tmp_dir.joinpath("train.source").open().readlines()
-    save_dir = Path(tempfile.mkdtemp(prefix="packed_"))
-    pack_data_dir(tokenizer, tmp_dir, 128, save_dir)
-    orig_paths = {x.name for x in tmp_dir.iterdir()}
-    new_paths = {x.name for x in save_dir.iterdir()}
-    packed_examples = save_dir.joinpath("train.source").open().readlines()
-    # orig: [' Sam ate lunch today.\n', 'Sams lunch ingredients.']
-    # desired_packed: [' Sam ate lunch today.\n Sams lunch ingredients.']
-    assert len(packed_examples) < len(orig_examples)
-    assert len(packed_examples) == 1
-    assert len(packed_examples[0]) == sum(len(x) for x in orig_examples)
-    assert orig_paths == new_paths
+    def test_sortish_sampler_reduces_padding(self):
+        ds, _, tokenizer = self._get_dataset(max_len=512)
+        bs = 2
+        sortish_sampler = ds.make_sortish_sampler(bs, shuffle=False)

+        naive_dl = DataLoader(ds, batch_size=bs, collate_fn=ds.collate_fn, num_workers=2)
+        sortish_dl = DataLoader(ds, batch_size=bs, collate_fn=ds.collate_fn, num_workers=2, sampler=sortish_sampler)

-@pytest.mark.skipif(not FAIRSEQ_AVAILABLE, reason="This test requires fairseq")
-def test_dynamic_batch_size():
-    if not FAIRSEQ_AVAILABLE:
-        return
-    ds, max_tokens, tokenizer = _get_dataset(max_len=64)
-    required_batch_size_multiple = 64
-    batch_sampler = ds.make_dynamic_sampler(max_tokens, required_batch_size_multiple=required_batch_size_multiple)
-    batch_sizes = [len(x) for x in batch_sampler]
-    assert len(set(batch_sizes)) > 1  # it's not dynamic batch size if every batch is the same length
-    assert sum(batch_sizes) == len(ds)  # no dropped or added examples
-    data_loader = DataLoader(ds, batch_sampler=batch_sampler, collate_fn=ds.collate_fn, num_workers=2)
-    failures = []
-    num_src_per_batch = []
-    for batch in data_loader:
-        src_shape = batch["input_ids"].shape
-        bs = src_shape[0]
-        assert bs % required_batch_size_multiple == 0 or bs < required_batch_size_multiple
-        num_src_tokens = np.product(batch["input_ids"].shape)
-        num_src_per_batch.append(num_src_tokens)
-        if num_src_tokens > (max_tokens * 1.1):
-            failures.append(num_src_tokens)
-    assert num_src_per_batch[0] == max(num_src_per_batch)
-    if failures:
-        raise AssertionError(f"too many tokens in {len(failures)} batches")
+        pad = tokenizer.pad_token_id

+        def count_pad_tokens(data_loader, k="input_ids"):
+            return [batch[k].eq(pad).sum().item() for batch in data_loader]

-def test_sortish_sampler_reduces_padding():
-    ds, _, tokenizer = _get_dataset(max_len=512)
-    bs = 2
-    sortish_sampler = ds.make_sortish_sampler(bs, shuffle=False)
+        assert sum(count_pad_tokens(sortish_dl, k="labels")) < sum(count_pad_tokens(naive_dl, k="labels"))
+        assert sum(count_pad_tokens(sortish_dl)) < sum(count_pad_tokens(naive_dl))
+        assert len(sortish_dl) == len(naive_dl)

-    naive_dl = DataLoader(ds, batch_size=bs, collate_fn=ds.collate_fn, num_workers=2)
-    sortish_dl = DataLoader(ds, batch_size=bs, collate_fn=ds.collate_fn, num_workers=2, sampler=sortish_sampler)
-
-    pad = tokenizer.pad_token_id
-
-    def count_pad_tokens(data_loader, k="input_ids"):
-        return [batch[k].eq(pad).sum().item() for batch in data_loader]
-
-    assert sum(count_pad_tokens(sortish_dl, k="labels")) < sum(count_pad_tokens(naive_dl, k="labels"))
-    assert sum(count_pad_tokens(sortish_dl)) < sum(count_pad_tokens(naive_dl))
-    assert len(sortish_dl) == len(naive_dl)
-
-
-def _get_dataset(n_obs=1000, max_len=128):
-    if os.getenv("USE_REAL_DATA", False):
-        data_dir = "examples/seq2seq/wmt_en_ro"
-        max_tokens = max_len * 2 * 64
-        if not Path(data_dir).joinpath("train.len").exists():
+    def _get_dataset(self, n_obs=1000, max_len=128):
+        if os.getenv("USE_REAL_DATA", False):
+            data_dir = "examples/seq2seq/wmt_en_ro"
+            max_tokens = max_len * 2 * 64
+            if not Path(data_dir).joinpath("train.len").exists():
+                save_len_file(MARIAN_TINY, data_dir)
+        else:
+            data_dir = "examples/seq2seq/test_data/wmt_en_ro"
+            max_tokens = max_len * 4
            save_len_file(MARIAN_TINY, data_dir)
-    else:
-        data_dir = "examples/seq2seq/test_data/wmt_en_ro"
-        max_tokens = max_len * 4
-        save_len_file(MARIAN_TINY, data_dir)

-    tokenizer = AutoTokenizer.from_pretrained(MARIAN_TINY)
-    ds = Seq2SeqDataset(
-        tokenizer,
-        data_dir=data_dir,
-        type_path="train",
-        max_source_length=max_len,
-        max_target_length=max_len,
-        n_obs=n_obs,
+        tokenizer = AutoTokenizer.from_pretrained(MARIAN_TINY)
+        ds = Seq2SeqDataset(
+            tokenizer,
+            data_dir=data_dir,
+            type_path="train",
+            max_source_length=max_len,
+            max_target_length=max_len,
+            n_obs=n_obs,
+        )
+        return ds, max_tokens, tokenizer
+
+    def test_distributed_sortish_sampler_splits_indices_between_procs(self):
+        ds, max_tokens, tokenizer = self._get_dataset()
+        ids1 = set(DistributedSortishSampler(ds, 256, num_replicas=2, rank=0, add_extra_examples=False))
+        ids2 = set(DistributedSortishSampler(ds, 256, num_replicas=2, rank=1, add_extra_examples=False))
+        assert ids1.intersection(ids2) == set()
+
+    @parameterized.expand(
+        [
+            MBART_TINY,
+            MARIAN_TINY,
+            T5_TINY,
+            BART_TINY,
+            PEGASUS_XSUM,
+        ],
    )
-    return ds, max_tokens, tokenizer
-
-
-def test_distributed_sortish_sampler_splits_indices_between_procs():
-    ds, max_tokens, tokenizer = _get_dataset()
-    ids1 = set(DistributedSortishSampler(ds, 256, num_replicas=2, rank=0, add_extra_examples=False))
-    ids2 = set(DistributedSortishSampler(ds, 256, num_replicas=2, rank=1, add_extra_examples=False))
-    assert ids1.intersection(ids2) == set()
+    def test_dataset_kwargs(self, tok_name):
+        tokenizer = AutoTokenizer.from_pretrained(tok_name)
+        if tok_name == MBART_TINY:
+            train_dataset = Seq2SeqDataset(
+                tokenizer,
+                data_dir=make_test_data_dir(tmp_dir=self.get_auto_remove_tmp_dir()),
+                type_path="train",
+                max_source_length=4,
+                max_target_length=8,
+                src_lang="EN",
+                tgt_lang="FR",
+            )
+            kwargs = train_dataset.dataset_kwargs
+            assert "src_lang" in kwargs and "tgt_lang" in kwargs
+        else:
+            train_dataset = Seq2SeqDataset(
+                tokenizer,
+                data_dir=make_test_data_dir(tmp_dir=self.get_auto_remove_tmp_dir()),
+                type_path="train",
+                max_source_length=4,
+                max_target_length=8,
+            )
+            kwargs = train_dataset.dataset_kwargs
+            assert "add_prefix_space" not in kwargs if tok_name != BART_TINY else "add_prefix_space" in kwargs
+            assert len(kwargs) == 1 if tok_name == BART_TINY else len(kwargs) == 0
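
Note on `test_sortish_sampler_reduces_padding` above: the property it asserts can be illustrated without the library. The sketch below is a simplified stand-in, not the `make_sortish_sampler` implementation. It assumes a full sort by length in place of the approximately sorted ("sortish") order, uses made-up sequence lengths, and `batches` and `count_pad_slots` are hypothetical helper names.

```python
from typing import List, Sequence


def batches(indices: Sequence[int], batch_size: int) -> List[List[int]]:
    """Split indices into consecutive batches of at most batch_size."""
    return [list(indices[i : i + batch_size]) for i in range(0, len(indices), batch_size)]


def count_pad_slots(lengths: Sequence[int], batched: List[List[int]]) -> int:
    """Each batch is padded to its longest example; count the padded positions."""
    total = 0
    for batch in batched:
        longest = max(lengths[i] for i in batch)
        total += sum(longest - lengths[i] for i in batch)
    return total


# Toy sequence lengths (made-up data, for illustration only).
lengths = [3, 50, 4, 48, 5, 47]
bs = 2

# Naive order: batches follow the dataset order, mixing short and long examples.
naive = batches(range(len(lengths)), bs)

# Assumption: a full sort by length stands in for the "sortish" ordering.
by_length = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
sorted_batches = batches(by_length, bs)

naive_pad = count_pad_slots(lengths, naive)
sorted_pad = count_pad_slots(lengths, sorted_batches)
```

Grouping examples of similar length leaves the number of batches unchanged while shrinking the padded area, which is what the three assertions in the test check on `labels`, `input_ids`, and `len(...)`.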
--- a/examples/seq2seq/test_finetune_trainer.py
+++ b/examples/seq2seq/test_finetune_trainer.py
@@ -1,96 +1,84 @@
 import os
 import sys
-import tempfile
 from unittest.mock import patch

-from transformers import BartForConditionalGeneration, MarianMTModel
-from transformers.testing_utils import slow
+from transformers.testing_utils import TestCasePlus, slow
+from transformers.trainer_callback import TrainerState
+from transformers.trainer_utils import set_seed

 from .finetune_trainer import main
 from .test_seq2seq_examples import MBART_TINY
-from .utils import load_json


-MODEL_NAME = MBART_TINY
-# TODO(SS): MODEL_NAME = "sshleifer/student_mbart_en_ro_1_1"
+set_seed(42)
 MARIAN_MODEL = "sshleifer/student_marian_en_ro_6_1"


-@slow
-def test_model_download():
-    """This warms up the cache so that we can time the next test without including download time, which varies between machines."""
-    BartForConditionalGeneration.from_pretrained(MODEL_NAME)
-    MarianMTModel.from_pretrained(MARIAN_MODEL)
+class TestFinetuneTrainer(TestCasePlus):
+    def test_finetune_trainer(self):
+        output_dir = self.run_trainer(1, "12", MBART_TINY, 1)
+        logs = TrainerState.load_from_json(os.path.join(output_dir, "trainer_state.json")).log_history
+        eval_metrics = [log for log in logs if "eval_loss" in log.keys()]
+        first_step_stats = eval_metrics[0]
+        assert "eval_bleu" in first_step_stats

+    @slow
+    def test_finetune_trainer_slow(self):
+        # There is a missing call to init_process_group somewhere
+        output_dir = self.run_trainer(eval_steps=2, max_len="128", model_name=MARIAN_MODEL, num_train_epochs=3)

-@slow
-def test_finetune_trainer():
-    data_dir = "examples/seq2seq/test_data/wmt_en_ro"
-    output_dir = tempfile.mkdtemp(prefix="marian_output")
-    max_len = "128"
-    num_train_epochs = 4
-    eval_steps = 2
-    argv = [
-        "--model_name_or_path",
-        MARIAN_MODEL,
-        "--data_dir",
-        data_dir,
-        "--output_dir",
-        output_dir,
-        "--overwrite_output_dir",
-        "--n_train",
-        "8",
-        "--n_val",
-        "8",
-        "--max_source_length",
-        max_len,
-        "--max_target_length",
-        max_len,
-        "--val_max_target_length",
-        max_len,
-        "--do_train",
-        "--do_eval",
-        "--do_predict",
-        "--num_train_epochs",
-        str(num_train_epochs),
-        "--per_device_train_batch_size",
-        "4",
-        "--per_device_eval_batch_size",
-        "4",
-        "--learning_rate",
-        "3e-4",
-        "--warmup_steps",
-        "8",
-        "--evaluate_during_training",
-        "--predict_with_generate",
-        "--logging_steps",
-        0,
-        "--save_steps",
-        str(eval_steps),
-        "--eval_steps",
-        str(eval_steps),
-        "--sortish_sampler",
-        "--label_smoothing",
-        "0.1",
-        "--task",
-        "translation",
-    ]
+        # Check metrics
+        logs = TrainerState.load_from_json(os.path.join(output_dir, "trainer_state.json")).log_history
+        eval_metrics = [log for log in logs if "eval_loss" in log.keys()]
+        first_step_stats = eval_metrics[0]
+        last_step_stats = eval_metrics[-1]

-    testargs = ["finetune_trainer.py"] + argv
-    with patch.object(sys, "argv", testargs):
-        main()
+        assert first_step_stats["eval_bleu"] < last_step_stats["eval_bleu"]  # fails if the model learned nothing
+        assert isinstance(last_step_stats["eval_bleu"], float)

-    # Check metrics
-    logs = load_json(os.path.join(output_dir, "log_history.json"))
-    eval_metrics = [log for log in logs if "eval_loss" in log.keys()]
-    first_step_stats = eval_metrics[0]
-    last_step_stats = eval_metrics[-1]
+        # test if do_predict saves generations and metrics
+        contents = os.listdir(output_dir)
+        contents = {os.path.basename(p) for p in contents}
+        assert "test_generations.txt" in contents
+        assert "test_results.json" in contents

-    assert first_step_stats["eval_bleu"] < last_step_stats["eval_bleu"]  # model learned nothing
-    assert isinstance(last_step_stats["eval_bleu"], float)
+    def run_trainer(self, eval_steps: int, max_len: str, model_name: str, num_train_epochs: int):
+        data_dir = "examples/seq2seq/test_data/wmt_en_ro"
+        output_dir = self.get_auto_remove_tmp_dir()
+        argv = f"""
+            --model_name_or_path {model_name}
+            --data_dir {data_dir}
+            --output_dir {output_dir}
+            --overwrite_output_dir
+            --n_train 8
+            --n_val 8
+            --max_source_length {max_len}
+            --max_target_length {max_len}
+            --val_max_target_length {max_len}
+            --do_train
+            --do_eval
+            --do_predict
+            --num_train_epochs {str(num_train_epochs)}
+            --per_device_train_batch_size 4
+            --per_device_eval_batch_size 4
+            --learning_rate 3e-4
+            --warmup_steps 8
+            --evaluate_during_training
+            --predict_with_generate
+            --logging_steps 0
+            --save_steps {str(eval_steps)}
+            --eval_steps {str(eval_steps)}
+            --sortish_sampler
+            --label_smoothing 0.1
+            --adafactor
+            --task translation
+            --tgt_lang ro_RO
+            --src_lang en_XX
+        """.split()
+        # --eval_beams  2

-    # test if do_predict saves generations and metrics
-    contents = os.listdir(output_dir)
-    contents = {os.path.basename(p) for p in contents}
-    assert "test_generations.txt" in contents
-    assert "test_results.json" in contents
+        testargs = ["finetune_trainer.py"] + argv
+        with patch.object(sys, "argv", testargs):
+            main()
+
+        return output_dir
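
Note on `run_trainer` above: the refactor replaces a hand-written list of strings with a multi-line f-string that is split on whitespace and patched into `sys.argv`. A minimal sketch of that pattern follows, using only the standard library. The `main` function, its three flags, and the model name are hypothetical stand-ins, not the real `finetune_trainer.main` or its argument set.

```python
import argparse
import sys
from unittest.mock import patch


def main() -> argparse.Namespace:
    """Hypothetical stand-in for a script entry point that reads sys.argv."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_name_or_path", required=True)
    parser.add_argument("--num_train_epochs", type=int, default=1)
    parser.add_argument("--do_train", action="store_true")
    return parser.parse_args()


model_name = "some/tiny-model"  # placeholder, not a real checkpoint
num_train_epochs = 3

# str.split() with no arguments splits on any run of whitespace, so the
# newlines and indentation inside the f-string are discarded.
argv = f"""
    --model_name_or_path {model_name}
    --num_train_epochs {num_train_epochs}
    --do_train
""".split()

testargs = ["script.py"] + argv  # element 0 plays the role of the program name
with patch.object(sys, "argv", testargs):
    # parse_args() reads sys.argv[1:] while the patch is active.
    args = main()
```

One caveat of this approach: a value that itself contains whitespace cannot go through `.split()`. It has to be appended as a separate list element, which is how `test_run_eval_search` passes its `--search` value.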
--- a/examples/seq2seq/test_seq2seq_examples.py
+++ b/examples/seq2seq/test_seq2seq_examples.py
@@ -3,7 +3,6 @@ import logging
 import os
 import sys
 import tempfile
-import unittest
 from pathlib import Path
 from unittest.mock import patch

@@ -13,13 +12,14 @@ import torch

 import lightning_base
 from convert_pl_checkpoint_to_hf import convert_pl_to_hf
-from distillation import distill_main, evaluate_checkpoint
+from distillation import distill_main
 from finetune import SummarizationModule, main
+from parameterized import parameterized
 from run_eval import generate_summaries_or_translations, run_generate
 from run_eval_search import run_search
 from transformers import AutoConfig, AutoModelForSeq2SeqLM
 from transformers.hf_api import HfApi
-from transformers.testing_utils import CaptureStderr, CaptureStdout, require_multigpu, require_torch_and_cuda, slow
+from transformers.testing_utils import CaptureStderr, CaptureStdout, TestCasePlus, require_torch_gpu, slow
 from utils import ROUGE_KEYS, label_smoothed_nll_loss, lmap, load_json


@@ -86,9 +86,9 @@ CHEAP_ARGS = {
    "n_val": -1,
    "n_test": -1,
    "student_encoder_layers": 1,
-    "alpha_encoder_loss": 0.0,
    "freeze_encoder": False,
    "auto_scale_batch_size": False,
+    "overwrite_output_dir": False,
 }


@@ -111,24 +111,23 @@ logger.addHandler(stream_handler)
 logging.disable(logging.CRITICAL)  # remove noisy download output from tracebacks


-def make_test_data_dir(**kwargs):
-    tmp_dir = Path(tempfile.mkdtemp(**kwargs))
+def make_test_data_dir(tmp_dir):
    for split in ["train", "val", "test"]:
-        _dump_articles((tmp_dir / f"{split}.source"), ARTICLES)
-        _dump_articles((tmp_dir / f"{split}.target"), SUMMARIES)
+        _dump_articles(os.path.join(tmp_dir, f"{split}.source"), ARTICLES)
+        _dump_articles(os.path.join(tmp_dir, f"{split}.target"), SUMMARIES)
    return tmp_dir


-class TestSummarizationDistiller(unittest.TestCase):
+class TestSummarizationDistiller(TestCasePlus):
    @classmethod
    def setUpClass(cls):
        logging.disable(logging.CRITICAL)  # remove noisy download output from tracebacks
        return cls

    @slow
-    @require_torch_and_cuda
+    @require_torch_gpu
    def test_hub_configs(self):
-        """I put require_torch_and_cuda cause I only want this to run with self-scheduled."""
+        """I put require_torch_gpu because I only want this to run with self-scheduled."""

        model_list = HfApi().model_list()
        org = "sshleifer"
@@ -144,17 +143,6 @@ class TestSummarizationDistiller(unittest.TestCase):
                failures.append(m)
        assert not failures, f"The following models could not be loaded through AutoConfig: {failures}"

-    @require_multigpu
-    @unittest.skip("Broken at the moment")
-    def test_multigpu(self):
-        updates = dict(
-            no_teacher=True,
-            freeze_encoder=True,
-            gpus=2,
-            sortish_sampler=True,
-        )
-        self._test_distiller_cli(updates, check_contents=False)
-
    def test_distill_no_teacher(self):
        updates = dict(student_encoder_layers=2, student_decoder_layers=1, no_teacher=True)
        self._test_distiller_cli(updates)
@@ -174,13 +162,12 @@ class TestSummarizationDistiller(unittest.TestCase):
        self.assertEqual(1, len(ckpts))
        transformer_ckpts = list(Path(model.output_dir).glob("**/*.bin"))
        self.assertEqual(len(transformer_ckpts), 2)
-        examples = lmap(str.strip, model.hparams.data_dir.joinpath("test.source").open().readlines())
-        out_path = tempfile.mktemp()
+        examples = lmap(str.strip, Path(model.hparams.data_dir).joinpath("test.source").open().readlines())
+        out_path = tempfile.mktemp()  # XXX: not being cleaned up
        generate_summaries_or_translations(examples, out_path, str(model.output_dir / "best_tfmr"))
        self.assertTrue(Path(out_path).exists())

-        evaluate_checkpoint(ckpts[0], dest_dir=Path(tempfile.mkdtemp()))
-        out_path_new = tempfile.mkdtemp()
+        out_path_new = self.get_auto_remove_tmp_dir()
        convert_pl_to_hf(ckpts[0], transformer_ckpts[0].parent, out_path_new)
        assert os.path.exists(os.path.join(out_path_new, "pytorch_model.bin"))

@@ -228,9 +215,6 @@ class TestSummarizationDistiller(unittest.TestCase):
        assert len(all_files) > 2
        self.assertEqual(len(transformer_ckpts), 2)

-        evaluate_checkpoint(ckpts[0], dest_dir=Path(tempfile.mkdtemp()))
-
-    @unittest.skip("T5 distillation is broken at the moment")
    def test_distill_t5(self):
        updates = dict(
            student_encoder_layers=1,
@@ -255,12 +239,11 @@ class TestSummarizationDistiller(unittest.TestCase):
            model_name_or_path="sshleifer/tinier_bart",
            teacher=CHEAP_ARGS["model_name_or_path"],
            val_check_interval=0.5,
-            alpha_encoder_loss=0.4,
        )
        default_updates.update(updates)
        args_d: dict = CHEAP_ARGS.copy()
-        tmp_dir = make_test_data_dir()
-        output_dir = tempfile.mkdtemp(prefix="output_")
+        tmp_dir = make_test_data_dir(tmp_dir=self.get_auto_remove_tmp_dir())
+        output_dir = self.get_auto_remove_tmp_dir()

        args_d.update(data_dir=tmp_dir, output_dir=output_dir, **default_updates)
        model = distill_main(argparse.Namespace(**args_d))
@@ -285,252 +268,253 @@ class TestSummarizationDistiller(unittest.TestCase):
        return model


-def run_eval_tester(model):
-    input_file_name = Path(tempfile.mkdtemp()) / "utest_input.source"
-    output_file_name = input_file_name.parent / "utest_output.txt"
-    assert not output_file_name.exists()
-    articles = [" New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County."]
-    _dump_articles(input_file_name, articles)
-    score_path = str(Path(tempfile.mkdtemp()) / "scores.json")
-    task = "translation_en_to_de" if model == T5_TINY else "summarization"
-    testargs = f"""
-        run_eval_search.py
-        {model}
-        {input_file_name}
-        {output_file_name}
-        --score_path {score_path}
-        --task {task}
-        --num_beams 2
-        --length_penalty 2.0
-        """.split()
+class TestTheRest(TestCasePlus):
+    def run_eval_tester(self, model):
+        input_file_name = Path(self.get_auto_remove_tmp_dir()) / "utest_input.source"
+        output_file_name = input_file_name.parent / "utest_output.txt"
+        assert not output_file_name.exists()
+        articles = [" New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County."]
+        _dump_articles(input_file_name, articles)

-    with patch.object(sys, "argv", testargs):
-        run_generate()
-        assert Path(output_file_name).exists()
-        os.remove(Path(output_file_name))
+        score_path = str(Path(self.get_auto_remove_tmp_dir()) / "scores.json")
+        task = "translation_en_to_de" if model == T5_TINY else "summarization"
+        testargs = f"""
+            run_eval_search.py
+            {model}
+            {input_file_name}
+            {output_file_name}
+            --score_path {score_path}
+            --task {task}
+            --num_beams 2
+            --length_penalty 2.0
+            """.split()

+        with patch.object(sys, "argv", testargs):
+            run_generate()
+            assert Path(output_file_name).exists()
+            # os.remove(Path(output_file_name))

-# test one model to quickly (no-@slow) catch simple problems and do an
-# extensive testing of functionality with multiple models as @slow separately
-def test_run_eval():
-    run_eval_tester(T5_TINY)
+    # test one model quickly (not @slow) to catch simple problems, and test
+    # the functionality extensively with multiple models in separate @slow tests
+    def test_run_eval(self):
+        self.run_eval_tester(T5_TINY)

+    # any extra models should go into the list here - can be slow
+    @parameterized.expand([BART_TINY, MBART_TINY])
+    @slow
+    def test_run_eval_slow(self, model):
+        self.run_eval_tester(model)

-# any extra models should go into the list here - can be slow
-@slow
-@pytest.mark.parametrize("model", [BART_TINY, MBART_TINY])
-def test_run_eval_slow(model):
-    run_eval_tester(model)
+    # testing with 2 models to validate: 1. translation (t5) 2. summarization (mbart)
+    @parameterized.expand([T5_TINY, MBART_TINY])
+    @slow
+    def test_run_eval_search(self, model):
+        input_file_name = Path(self.get_auto_remove_tmp_dir()) / "utest_input.source"
+        output_file_name = input_file_name.parent / "utest_output.txt"
+        assert not output_file_name.exists()

+        text = {
+            "en": ["Machine learning is great, isn't it?", "I like to eat bananas", "Tomorrow is another great day!"],
+            "de": [
+                "Maschinelles Lernen ist großartig, oder?",
+                "Ich esse gerne Bananen",
+                "Morgen ist wieder ein toller Tag!",
+            ],
+        }

-# testing with 2 models to validate: 1. translation (t5) 2. summarization (mbart)
-@slow
-@pytest.mark.parametrize("model", [T5_TINY, MBART_TINY])
-def test_run_eval_search(model):
-    input_file_name = Path(tempfile.mkdtemp()) / "utest_input.source"
-    output_file_name = input_file_name.parent / "utest_output.txt"
-    assert not output_file_name.exists()
+        tmp_dir = Path(self.get_auto_remove_tmp_dir())
+        score_path = str(tmp_dir / "scores.json")
+        reference_path = str(tmp_dir / "val.target")
+        _dump_articles(input_file_name, text["en"])
+        _dump_articles(reference_path, text["de"])
+        task = "translation_en_to_de" if model == T5_TINY else "summarization"
+        testargs = f"""
+            run_eval_search.py
+            {model}
+            {str(input_file_name)}
+            {str(output_file_name)}
+            --score_path {score_path}
+            --reference_path {reference_path}
+            --task {task}
+            """.split()
+        testargs.extend(["--search", "num_beams=1:2 length_penalty=0.9:1.0"])

-    text = {
-        "en": ["Machine learning is great, isn't it?", "I like to eat bananas", "Tomorrow is another great day!"],
-        "de": [
-            "Maschinelles Lernen ist großartig, oder?",
-            "Ich esse gerne Bananen",
-            "Morgen ist wieder ein toller Tag!",
-        ],
-    }
+        with patch.object(sys, "argv", testargs):
+            with CaptureStdout() as cs:
+                run_search()
+            expected_strings = [" num_beams | length_penalty", model, "Best score args"]
+            un_expected_strings = ["Info"]
+            if "translation" in task:
+                expected_strings.append("bleu")
+            else:
+                expected_strings.extend(ROUGE_KEYS)
+            for w in expected_strings:
+                assert w in cs.out
+            for w in un_expected_strings:
+                assert w not in cs.out
+            assert Path(output_file_name).exists()
+            os.remove(Path(output_file_name))

-    tmp_dir = Path(tempfile.mkdtemp())
-    score_path = str(tmp_dir / "scores.json")
-    reference_path = str(tmp_dir / "val.target")
-    _dump_articles(input_file_name, text["en"])
-    _dump_articles(reference_path, text["de"])
-    task = "translation_en_to_de" if model == T5_TINY else "summarization"
-    testargs = f"""
-        run_eval_search.py
-        {model}
-        {str(input_file_name)}
-        {str(output_file_name)}
-        --score_path {score_path}
-        --reference_path {reference_path}
-        --task {task}
-        """.split()
-    testargs.extend(["--search", "num_beams=1:2 length_penalty=0.9:1.0"])
+    @parameterized.expand(
+        [T5_TINY, BART_TINY, MBART_TINY, MARIAN_TINY, FSMT_TINY],
+    )
+    def test_finetune(self, model):
+        args_d: dict = CHEAP_ARGS.copy()
+        task = "translation" if model in [MBART_TINY, MARIAN_TINY, FSMT_TINY] else "summarization"
+        args_d["label_smoothing"] = 0.1 if task == "translation" else 0

-    with patch.object(sys, "argv", testargs):
-        with CaptureStdout() as cs:
-            run_search()
-        expected_strings = [" num_beams | length_penalty", model, "Best score args"]
-        un_expected_strings = ["Info"]
-        if "translation" in task:
-            expected_strings.append("bleu")
+        tmp_dir = make_test_data_dir(tmp_dir=self.get_auto_remove_tmp_dir())
+        output_dir = self.get_auto_remove_tmp_dir()
+        args_d.update(
+            data_dir=tmp_dir,
+            model_name_or_path=model,
+            tokenizer_name=None,
+            train_batch_size=2,
+            eval_batch_size=2,
+            output_dir=output_dir,
+            do_predict=True,
+            task=task,
+            src_lang="en_XX",
+            tgt_lang="ro_RO",
+            freeze_encoder=True,
+            freeze_embeds=True,
+        )
+        assert "n_train" in args_d
+        args = argparse.Namespace(**args_d)
+        module = main(args)
+
+        input_embeds = module.model.get_input_embeddings()
+        assert not input_embeds.weight.requires_grad
+        if model == T5_TINY:
+            lm_head = module.model.lm_head
+            assert not lm_head.weight.requires_grad
+            assert (lm_head.weight == input_embeds.weight).all().item()
+        elif model == FSMT_TINY:
+            fsmt = module.model.model
+            embed_pos = fsmt.decoder.embed_positions
+            assert not embed_pos.weight.requires_grad
+            assert not fsmt.decoder.embed_tokens.weight.requires_grad
+            # check that embeds are not the same
+            assert fsmt.decoder.embed_tokens != fsmt.encoder.embed_tokens
        else:
-            expected_strings.extend(ROUGE_KEYS)
-        for w in expected_strings:
-            assert w in cs.out
-        for w in un_expected_strings:
-            assert w not in cs.out
-        assert Path(output_file_name).exists()
-        os.remove(Path(output_file_name))
+            bart = module.model.model
+            embed_pos = bart.decoder.embed_positions
+            assert not embed_pos.weight.requires_grad
+            assert not bart.shared.weight.requires_grad
+            # check that embeds are the same
+            assert bart.decoder.embed_tokens == bart.encoder.embed_tokens
+            assert bart.decoder.embed_tokens == bart.shared

+        example_batch = load_json(module.output_dir / "text_batch.json")
+        assert isinstance(example_batch, dict)
+        assert len(example_batch) >= 4

-@pytest.mark.parametrize(
-    "model",
-    [T5_TINY, BART_TINY, MBART_TINY, MARIAN_TINY, FSMT_TINY],
-)
-def test_finetune(model):
-    args_d: dict = CHEAP_ARGS.copy()
-    task = "translation" if model in [MBART_TINY, MARIAN_TINY, FSMT_TINY] else "summarization"
-    args_d["label_smoothing"] = 0.1 if task == "translation" else 0
+    def test_finetune_extra_model_args(self):
+        args_d: dict = CHEAP_ARGS.copy()

-    tmp_dir = make_test_data_dir()
-    output_dir = tempfile.mkdtemp(prefix="output_")
-    args_d.update(
-        data_dir=tmp_dir,
-        model_name_or_path=model,
-        tokenizer_name=None,
-        train_batch_size=2,
-        eval_batch_size=2,
-        output_dir=output_dir,
-        do_predict=True,
-        task=task,
-        src_lang="en_XX",
-        tgt_lang="ro_RO",
-        freeze_encoder=True,
-        freeze_embeds=True,
-    )
-    assert "n_train" in args_d
-    args = argparse.Namespace(**args_d)
-    module = main(args)
+        task = "summarization"
+        tmp_dir = make_test_data_dir(tmp_dir=self.get_auto_remove_tmp_dir())

-    input_embeds = module.model.get_input_embeddings()
-    assert not input_embeds.weight.requires_grad
-    if model == T5_TINY:
-        lm_head = module.model.lm_head
-        assert not lm_head.weight.requires_grad
-        assert (lm_head.weight == input_embeds.weight).all().item()
-    elif model == FSMT_TINY:
-        fsmt = module.model.model
-        embed_pos = fsmt.decoder.embed_positions
-        assert not embed_pos.weight.requires_grad
-        assert not fsmt.decoder.embed_tokens.weight.requires_grad
-        # check that embeds are not the same
-        assert fsmt.decoder.embed_tokens != fsmt.encoder.embed_tokens
-    else:
-        bart = module.model.model
-        embed_pos = bart.decoder.embed_positions
-        assert not embed_pos.weight.requires_grad
-        assert not bart.shared.weight.requires_grad
-        # check that embeds are the same
-        assert bart.decoder.embed_tokens == bart.encoder.embed_tokens
-        assert bart.decoder.embed_tokens == bart.shared
+        args_d.update(
+            data_dir=tmp_dir,
+            tokenizer_name=None,
+            train_batch_size=2,
+            eval_batch_size=2,
+            do_predict=False,
+            task=task,
+            src_lang="en_XX",
+            tgt_lang="ro_RO",
+            freeze_encoder=True,
+            freeze_embeds=True,
+        )

-
-def test_finetune_extra_model_args():
-    args_d: dict = CHEAP_ARGS.copy()
-
-    task = "summarization"
-    tmp_dir = make_test_data_dir()
-
-    args_d.update(
-        data_dir=tmp_dir,
-        tokenizer_name=None,
-        train_batch_size=2,
-        eval_batch_size=2,
-        do_predict=False,
-        task=task,
-        src_lang="en_XX",
-        tgt_lang="ro_RO",
-        freeze_encoder=True,
-        freeze_embeds=True,
-    )
-
-    # test models whose config includes the extra_model_args
-    model = BART_TINY
-    output_dir = tempfile.mkdtemp(prefix="output_1_")
-    args_d1 = args_d.copy()
-    args_d1.update(
-        model_name_or_path=model,
-        output_dir=output_dir,
-    )
-    extra_model_params = ("encoder_layerdrop", "decoder_layerdrop", "dropout", "attention_dropout")
-    for p in extra_model_params:
-        args_d1[p] = 0.5
-    args = argparse.Namespace(**args_d1)
-    model = main(args)
-    for p in extra_model_params:
-        assert getattr(model.config, p) == 0.5, f"failed to override the model config for param {p}"
-
-    # test models whose config doesn't include the extra_model_args
-    model = T5_TINY
-    output_dir = tempfile.mkdtemp(prefix="output_2_")
-    args_d2 = args_d.copy()
-    args_d2.update(
-        model_name_or_path=model,
-        output_dir=output_dir,
-    )
-    unsupported_param = "encoder_layerdrop"
-    args_d2[unsupported_param] = 0.5
-    args = argparse.Namespace(**args_d2)
-    with pytest.raises(Exception) as excinfo:
+        # test models whose config includes the extra_model_args
+        model = BART_TINY
+        output_dir = self.get_auto_remove_tmp_dir()
+        args_d1 = args_d.copy()
+        args_d1.update(
+            model_name_or_path=model,
+            output_dir=output_dir,
+        )
+        extra_model_params = ("encoder_layerdrop", "decoder_layerdrop", "dropout", "attention_dropout")
+        for p in extra_model_params:
+            args_d1[p] = 0.5
+        args = argparse.Namespace(**args_d1)
        model = main(args)
-    assert str(excinfo.value) == f"model config doesn't have a `{unsupported_param}` attribute"
+        for p in extra_model_params:
+            assert getattr(model.config, p) == 0.5, f"failed to override the model config for param {p}"

+        # test models whose config doesn't include the extra_model_args
+        model = T5_TINY
+        output_dir = self.get_auto_remove_tmp_dir()
+        args_d2 = args_d.copy()
+        args_d2.update(
+            model_name_or_path=model,
+            output_dir=output_dir,
+        )
+        unsupported_param = "encoder_layerdrop"
+        args_d2[unsupported_param] = 0.5
+        args = argparse.Namespace(**args_d2)
+        with pytest.raises(Exception) as excinfo:
+            model = main(args)
+        assert str(excinfo.value) == f"model config doesn't have a `{unsupported_param}` attribute"

-def test_finetune_lr_schedulers():
-    args_d: dict = CHEAP_ARGS.copy()
+    def test_finetune_lr_schedulers(self):
+        args_d: dict = CHEAP_ARGS.copy()

-    task = "summarization"
-    tmp_dir = make_test_data_dir()
+        task = "summarization"
+        tmp_dir = make_test_data_dir(tmp_dir=self.get_auto_remove_tmp_dir())

-    model = BART_TINY
-    output_dir = tempfile.mkdtemp(prefix="output_1_")
+        model = BART_TINY
+        output_dir = self.get_auto_remove_tmp_dir()

-    args_d.update(
-        data_dir=tmp_dir,
-        model_name_or_path=model,
-        output_dir=output_dir,
-        tokenizer_name=None,
-        train_batch_size=2,
-        eval_batch_size=2,
-        do_predict=False,
-        task=task,
-        src_lang="en_XX",
-        tgt_lang="ro_RO",
-        freeze_encoder=True,
-        freeze_embeds=True,
-    )
+        args_d.update(
+            data_dir=tmp_dir,
+            model_name_or_path=model,
+            output_dir=output_dir,
+            tokenizer_name=None,
+            train_batch_size=2,
+            eval_batch_size=2,
+            do_predict=False,
+            task=task,
+            src_lang="en_XX",
+            tgt_lang="ro_RO",
+            freeze_encoder=True,
+            freeze_embeds=True,
+        )

-    # emulate finetune.py
-    parser = argparse.ArgumentParser()
-    parser = pl.Trainer.add_argparse_args(parser)
-    parser = SummarizationModule.add_model_specific_args(parser, os.getcwd())
-    args = {"--help": True}
+        # emulate finetune.py
+        parser = argparse.ArgumentParser()
+        parser = pl.Trainer.add_argparse_args(parser)
+        parser = SummarizationModule.add_model_specific_args(parser, os.getcwd())
+        args = {"--help": True}

-    # --help test
-    with pytest.raises(SystemExit) as excinfo:
-        with CaptureStdout() as cs:
-            args = parser.parse_args(args)
-        assert False, "--help is expected to sys.exit"
-    assert excinfo.type == SystemExit
-    expected = lightning_base.arg_to_scheduler_metavar
-    assert expected in cs.out, "--help is expected to list the supported schedulers"
+        # --help test
+        with pytest.raises(SystemExit) as excinfo:
+            with CaptureStdout() as cs:
+                args = parser.parse_args(args)
+            assert False, "--help is expected to sys.exit"
+        assert excinfo.type == SystemExit
+        expected = lightning_base.arg_to_scheduler_metavar
+        assert expected in cs.out, "--help is expected to list the supported schedulers"

-    # --lr_scheduler=non_existing_scheduler test
-    unsupported_param = "non_existing_scheduler"
-    args = {f"--lr_scheduler={unsupported_param}"}
-    with pytest.raises(SystemExit) as excinfo:
-        with CaptureStderr() as cs:
-            args = parser.parse_args(args)
-        assert False, "invalid argument is expected to sys.exit"
-    assert excinfo.type == SystemExit
-    expected = f"invalid choice: '{unsupported_param}'"
-    assert expected in cs.err, f"should have bailed on invalid choice of scheduler {unsupported_param}"
+        # --lr_scheduler=non_existing_scheduler test
+        unsupported_param = "non_existing_scheduler"
+        args = {f"--lr_scheduler={unsupported_param}"}
+        with pytest.raises(SystemExit) as excinfo:
+            with CaptureStderr() as cs:
+                args = parser.parse_args(args)
+            assert False, "invalid argument is expected to sys.exit"
+        assert excinfo.type == SystemExit
+        expected = f"invalid choice: '{unsupported_param}'"
+        assert expected in cs.err, f"should have bailed on invalid choice of scheduler {unsupported_param}"

-    # --lr_scheduler=existing_scheduler test
-    supported_param = "cosine"
-    args_d1 = args_d.copy()
-    args_d1["lr_scheduler"] = supported_param
-    args = argparse.Namespace(**args_d1)
-    model = main(args)
-    assert getattr(model.hparams, "lr_scheduler") == supported_param, f"lr_scheduler={supported_param} shouldn't fail"
+        # --lr_scheduler=existing_scheduler test
+        supported_param = "cosine"
+        args_d1 = args_d.copy()
+        args_d1["lr_scheduler"] = supported_param
+        args = argparse.Namespace(**args_d1)
+        model = main(args)
+        assert (
+            getattr(model.hparams, "lr_scheduler") == supported_param
+        ), f"lr_scheduler={supported_param} shouldn't fail"
--- a/examples/seq2seq/test_tatoeba_conversion.py
+++ b/examples/seq2seq/test_tatoeba_conversion.py
@@ -0,0 +1,22 @@
+import tempfile
+import unittest
+
+from transformers.convert_marian_tatoeba_to_pytorch import TatoebaConverter
+from transformers.file_utils import cached_property
+from transformers.testing_utils import slow
+
+
+class TatoebaConversionTester(unittest.TestCase):
+    @cached_property
+    def resolver(self):
+        tmp_dir = tempfile.mkdtemp()
+        return TatoebaConverter(save_dir=tmp_dir)
+
+    @slow
+    def test_resolver(self):
+        self.resolver.convert_models(["heb-eng"])
+
+    @slow
+    def test_model_card(self):
+        content, mmeta = self.resolver.write_model_card("opus-mt-he-en", dry_run=True)
+        assert mmeta["long_pair"] == "heb-eng"
--- a/examples/seq2seq/utils.py
+++ b/examples/seq2seq/utils.py
@@ -7,7 +7,7 @@ import pickle
 import socket
 from logging import getLogger
 from pathlib import Path
-from typing import Callable, Dict, Iterable, List, Union
+from typing import Callable, Dict, Iterable, List, Tuple, Union

 import git
 import numpy as np
@@ -19,8 +19,9 @@ from torch import nn
 from torch.utils.data import Dataset, Sampler

 from sentence_splitter import add_newline_to_end_of_each_sentence
-from transformers import BartTokenizer
+from transformers import BartTokenizer, EvalPrediction, PreTrainedTokenizer, T5Tokenizer
 from transformers.file_utils import cached_property
+from transformers.modeling_bart import shift_tokens_right


 try:
@@ -52,19 +53,6 @@ def label_smoothed_nll_loss(lprobs, target, epsilon, ignore_index=-100):
    return loss, nll_loss


-def encode_line(tokenizer, line, max_length, pad_to_max_length=True, return_tensors="pt"):
-    """Only used by LegacyDataset"""
-    extra_kw = {"add_prefix_space": True} if isinstance(tokenizer, BartTokenizer) else {}
-    return tokenizer(
-        [line],
-        max_length=max_length,
-        padding="max_length" if pad_to_max_length else None,
-        truncation=True,
-        return_tensors=return_tensors,
-        **extra_kw,
-    )
-
-
 def lmap(f: Callable, x: Iterable) -> List:
    """list(map(f, x))"""
    return list(map(f, x))
@@ -75,6 +63,35 @@ def calculate_bleu(output_lns, refs_lns, **kwargs) -> dict:
    return {"bleu": round(corpus_bleu(output_lns, [refs_lns], **kwargs).score, 4)}


+def build_compute_metrics_fn(task_name: str, tokenizer: PreTrainedTokenizer) -> Callable[[EvalPrediction], Dict]:
+    def non_pad_len(tokens: np.ndarray) -> int:
+        return np.count_nonzero(tokens != tokenizer.pad_token_id)
+
+    def decode_pred(pred: EvalPrediction) -> Tuple[List[str], List[str]]:
+        pred_str = tokenizer.batch_decode(pred.predictions, skip_special_tokens=True)
+        label_str = tokenizer.batch_decode(pred.label_ids, skip_special_tokens=True)
+        pred_str = lmap(str.strip, pred_str)
+        label_str = lmap(str.strip, label_str)
+        return pred_str, label_str
+
+    def summarization_metrics(pred: EvalPrediction) -> Dict:
+        pred_str, label_str = decode_pred(pred)
+        rouge: Dict = calculate_rouge(pred_str, label_str)
+        summ_len = np.round(np.mean(lmap(non_pad_len, pred.predictions)), 1)
+        rouge.update({"gen_len": summ_len})
+        return rouge
+
+    def translation_metrics(pred: EvalPrediction) -> Dict:
+        pred_str, label_str = decode_pred(pred)
+        bleu: Dict = calculate_bleu(pred_str, label_str)
+        gen_len = np.round(np.mean(lmap(non_pad_len, pred.predictions)), 1)
+        bleu.update({"gen_len": gen_len})
+        return bleu
+
+    compute_metrics_fn = summarization_metrics if "summarization" in task_name else translation_metrics
+    return compute_metrics_fn
+
+
 def trim_batch(
    input_ids,
    pad_token_id,
@@ -97,9 +114,8 @@ class AbstractSeq2SeqDataset(Dataset):
        max_target_length,
        type_path="train",
        n_obs=None,
-        src_lang=None,
-        tgt_lang=None,
        prefix="",
+        **dataset_kwargs
    ):
        super().__init__()
        self.src_file = Path(data_dir).joinpath(type_path + ".source")
@@ -120,9 +136,8 @@ class AbstractSeq2SeqDataset(Dataset):
        if n_obs is not None:
            self.src_lens = self.src_lens[:n_obs]
        self.pad_token_id = self.tokenizer.pad_token_id
-        self.src_lang = src_lang
-        self.tgt_lang = tgt_lang
-        self.add_prefix_space = isinstance(self.tokenizer, BartTokenizer)
+        self.dataset_kwargs = dataset_kwargs
+        dataset_kwargs.update({"add_prefix_space": True} if isinstance(self.tokenizer, BartTokenizer) else {})

    def __len__(self):
        return len(self.src_lens)
@@ -182,8 +197,8 @@ class LegacySeq2SeqDataset(AbstractSeq2SeqDataset):
        tgt_line = linecache.getline(str(self.tgt_file), index).rstrip("\n")
        assert source_line, f"empty source line for index {index}"
        assert tgt_line, f"empty tgt line for index {index}"
-        source_inputs = encode_line(self.tokenizer, source_line, self.max_source_length)
-        target_inputs = encode_line(self.tokenizer, tgt_line, self.max_target_length)
+        source_inputs = self.encode_line(self.tokenizer, source_line, self.max_source_length)
+        target_inputs = self.encode_line(self.tokenizer, tgt_line, self.max_target_length)

        source_ids = source_inputs["input_ids"].squeeze()
        target_ids = target_inputs["input_ids"].squeeze()
@@ -194,6 +209,17 @@ class LegacySeq2SeqDataset(AbstractSeq2SeqDataset):
            "labels": target_ids,
        }

+    def encode_line(self, tokenizer, line, max_length, pad_to_max_length=True, return_tensors="pt"):
+        """Only used by LegacyDataset"""
+        return tokenizer(
+            [line],
+            max_length=max_length,
+            padding="max_length" if pad_to_max_length else None,
+            truncation=True,
+            return_tensors=return_tensors,
+            **self.dataset_kwargs,
+        )
+
    def collate_fn(self, batch) -> Dict[str, torch.Tensor]:
        input_ids = torch.stack([x["input_ids"] for x in batch])
        masks = torch.stack([x["attention_mask"] for x in batch])
@@ -224,18 +250,80 @@ class Seq2SeqDataset(AbstractSeq2SeqDataset):
        """Call prepare_seq2seq_batch."""
        batch_encoding: Dict[str, torch.Tensor] = self.tokenizer.prepare_seq2seq_batch(
            [x["src_texts"] for x in batch],
-            src_lang=self.src_lang,
            tgt_texts=[x["tgt_texts"] for x in batch],
-            tgt_lang=self.tgt_lang,
            max_length=self.max_source_length,
            max_target_length=self.max_target_length,
            return_tensors="pt",
-            add_prefix_space=self.add_prefix_space,
+            **self.dataset_kwargs,
        ).data
        batch_encoding["ids"] = torch.tensor([x["id"] for x in batch])
        return batch_encoding


+class Seq2SeqDataCollator:
+    def __init__(self, tokenizer, data_args, tpu_num_cores=None):
+        self.tokenizer = tokenizer
+        self.pad_token_id = tokenizer.pad_token_id
+        assert (
+            self.pad_token_id is not None
+        ), f"pad_token_id is not defined for ({self.tokenizer.__class__.__name__}), it must be defined."
+        self.data_args = data_args
+        self.tpu_num_cores = tpu_num_cores
+        self.dataset_kwargs = {"add_prefix_space": isinstance(tokenizer, BartTokenizer)}
+        if data_args.src_lang is not None:
+            self.dataset_kwargs["src_lang"] = data_args.src_lang
+        if data_args.tgt_lang is not None:
+            self.dataset_kwargs["tgt_lang"] = data_args.tgt_lang
+
+    def __call__(self, batch) -> Dict[str, torch.Tensor]:
+        if hasattr(self.tokenizer, "prepare_seq2seq_batch"):
+            batch = self._encode(batch)
+            input_ids, attention_mask, labels = (
+                batch["input_ids"],
+                batch["attention_mask"],
+                batch["labels"],
+            )
+        else:
+            input_ids = torch.stack([x["input_ids"] for x in batch])
+            attention_mask = torch.stack([x["attention_mask"] for x in batch])
+            labels = torch.stack([x["labels"] for x in batch])
+
+            labels = trim_batch(labels, self.pad_token_id)
+            input_ids, attention_mask = trim_batch(input_ids, self.pad_token_id, attention_mask=attention_mask)
+
+        if isinstance(self.tokenizer, T5Tokenizer):
+            decoder_input_ids = self._shift_right_t5(labels)
+        else:
+            decoder_input_ids = shift_tokens_right(labels, self.pad_token_id)
+
+        batch = {
+            "input_ids": input_ids,
+            "attention_mask": attention_mask,
+            "decoder_input_ids": decoder_input_ids,
+            "labels": labels,
+        }
+        return batch
+
+    def _shift_right_t5(self, input_ids):
+        # shift inputs to the right
+        shifted_input_ids = input_ids.new_zeros(input_ids.shape)
+        shifted_input_ids[..., 1:] = input_ids[..., :-1].clone()
+        shifted_input_ids[..., 0] = self.pad_token_id
+        return shifted_input_ids
+
+    def _encode(self, batch) -> Dict[str, torch.Tensor]:
+        batch_encoding = self.tokenizer.prepare_seq2seq_batch(
+            [x["src_texts"] for x in batch],
+            tgt_texts=[x["tgt_texts"] for x in batch],
+            max_length=self.data_args.max_source_length,
+            max_target_length=self.data_args.max_target_length,
+            padding="max_length" if self.tpu_num_cores is not None else "longest",  # TPU hack
+            return_tensors="pt",
+            **self.dataset_kwargs,
+        )
+        return batch_encoding.data
+
+
 class SortishSampler(Sampler):
    "Go through the text data by order of src length with a bit of randomness. From fastai repo."

@@ -369,14 +457,22 @@ def load_json(path):


 def get_git_info():
-    repo = git.Repo(search_parent_directories=True)
-    repo_infos = {
-        "repo_id": str(repo),
-        "repo_sha": str(repo.head.object.hexsha),
-        "repo_branch": str(repo.active_branch),
-        "hostname": str(socket.gethostname()),
-    }
-    return repo_infos
+    try:
+        repo = git.Repo(search_parent_directories=True)
+        repo_infos = {
+            "repo_id": str(repo),
+            "repo_sha": str(repo.head.object.hexsha),
+            "repo_branch": str(repo.active_branch),
+            "hostname": str(socket.gethostname()),
+        }
+        return repo_infos
+    except TypeError:
+        return {
+            "repo_id": None,
+            "repo_sha": None,
+            "repo_branch": None,
+            "hostname": None,
+        }


 ROUGE_KEYS = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
@@ -447,6 +543,25 @@ def freeze_params(model: nn.Module):
        par.requires_grad = False


+def freeze_embeds(model):
+    """Freeze token embeddings and positional embeddings for bart, just token embeddings for t5."""
+    model_type = model.config.model_type
+
+    if model_type == "t5":
+        freeze_params(model.shared)
+        for d in [model.encoder, model.decoder]:
+            freeze_params(d.embed_tokens)
+    elif model_type == "fsmt":
+        for d in [model.model.encoder, model.model.decoder]:
+            freeze_params(d.embed_positions)
+            freeze_params(d.embed_tokens)
+    else:
+        freeze_params(model.model.shared)
+        for d in [model.model.encoder, model.model.decoder]:
+            freeze_params(d.embed_positions)
+            freeze_params(d.embed_tokens)
+
+
 def grad_status(model: nn.Module) -> Iterable:
    return (par.requires_grad for par in model.parameters())

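Note on the `_shift_right_t5` helper added to `Seq2SeqDataCollator` in the hunk above: it builds decoder inputs by shifting the labels one position to the right and writing the pad token into the first position. A minimal pure-Python sketch of that idea follows, using nested lists in place of tensors. The function name `shift_right` and the list representation are assumptions made for illustration and are not part of this commit.

```python
from typing import List


def shift_right(labels: List[List[int]], pad_token_id: int) -> List[List[int]]:
    """Shift each row one step to the right.

    The last token of every row is dropped and `pad_token_id` is placed
    at position 0, so each row keeps its original length.
    """
    # Empty rows stay empty, so the output shape always matches the input.
    return [[pad_token_id] + row[:-1] if row else [] for row in labels]


# Hypothetical label ids; 1 stands in for an end-of-sequence id, 0 for padding.
labels = [[8, 9, 1], [4, 1, 0]]
decoder_input_ids = shift_right(labels, pad_token_id=0)
print(decoder_input_ids)
```

The tensor version in the diff does the same thing with slice assignment on a zero-initialised copy, which avoids Python-level loops over the batch.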
--- a/examples/test_examples.py
+++ b/examples/test_examples.py
@@ -116,8 +116,8 @@ class ExamplesTests(TestCasePlus):
            testargs.append("--fp16")

        with patch.object(sys, "argv", testargs):
-            result = run_pl_glue.main()
-            # for now just testing that the script can run to a completion
+            result = run_pl_glue.main()[0]
+            # for now just testing that the script can run to completion
            self.assertGreater(result["acc"], 0.25)
            #
            # TODO: this fails on CI - doesn't get acc/f1>=0.75:
--- a/examples/test_xla_examples.py
+++ b/examples/test_xla_examples.py
@@ -59,7 +59,7 @@ class TorchXLAExamplesTests(unittest.TestCase):
            --model_name_or_path=bert-base-cased
            --per_device_train_batch_size=64
            --per_device_eval_batch_size=64
-            --evaluate_during_training
+            --evaluation_strategy steps
            --overwrite_cache
            """.split()
        with patch.object(sys, "argv", testargs):
@@ -80,4 +80,15 @@ class TorchXLAExamplesTests(unittest.TestCase):
                self.assertGreaterEqual(value, 0.70)

            # Assert that the script takes less than 300 seconds to make sure it doesn't hang.
-            self.assertLess(end - start, 300)
+            self.assertLess(end - start, 500)
+
+    def test_trainer_tpu(self):
+        import xla_spawn
+
+        testargs = """
+            transformers/tests/test_trainer_tpu.py
+            --num_cores=8
+            transformers/tests/test_trainer_tpu.py
+            """.split()
+        with patch.object(sys, "argv", testargs):
+            xla_spawn.main()
--- a/examples/text-classification/README.md
+++ b/examples/text-classification/README.md
@@ -43,7 +43,7 @@ python run_tf_text_classification.py \
  --do_eval \
  --do_predict \
  --logging_steps 10 \
-  --evaluate_during_training \
+  --evaluation_strategy steps \
  --save_steps 10 \
  --overwrite_output_dir \
  --max_seq_length 128
--- a/examples/text-classification/run_tf_text_classification.py
+++ b/examples/text-classification/run_tf_text_classification.py
@@ -60,7 +60,7 @@ def get_tfds(
        for k in files.keys():
            transformed_ds[k] = ds[k].map(
                lambda example: tokenizer.batch_encode_plus(
-                    (example[features_name[0]], features_name[1]),
+                    (example[features_name[0]], example[features_name[1]]),
                    truncation=True,
                    max_length=max_seq_length,
                    padding="max_length",
@@ -96,6 +96,9 @@ def get_tfds(
        else None
    )

+    if train_ds is not None:
+        train_ds = train_ds.apply(tf.data.experimental.assert_cardinality(len(ds[datasets.Split.TRAIN])))
+
    val_ds = (
        tf.data.Dataset.from_generator(
            gen_val,
@@ -106,6 +109,9 @@ def get_tfds(
        else None
    )

+    if val_ds is not None:
+        val_ds = val_ds.apply(tf.data.experimental.assert_cardinality(len(ds[datasets.Split.VALIDATION])))
+
    test_ds = (
        tf.data.Dataset.from_generator(
            gen_test,
@@ -116,6 +122,9 @@ def get_tfds(
        else None
    )

+    if test_ds is not None:
+        test_ds = test_ds.apply(tf.data.experimental.assert_cardinality(len(ds[datasets.Split.TEST])))
+
    return train_ds, val_ds, test_ds, label2id


--- a/model_cards/Rostlab/prot_t5_xl_bfd/README.md
+++ b/model_cards/Rostlab/prot_t5_xl_bfd/README.md
@@ -0,0 +1,125 @@
+---
+language: protein
+tags:
+- protein language model
+datasets:
+- BFD
+---
+
+# ProtT5-XL-BFD model
+
+Pretrained model on protein sequences using a masked language modeling (MLM) objective. It was introduced in
+[this paper](https://doi.org/10.1101/2020.07.12.199554) and first released in
+[this repository](https://github.com/agemagician/ProtTrans). The model was trained on uppercase amino acids, so it only works with capital-letter amino acid codes.
+
+
+## Model description
+
+ProtT5-XL-BFD is based on the `t5-3b` model and was pretrained on a large corpus of protein sequences in a self-supervised fashion.
+This means it was pretrained on the raw protein sequences only, with no humans labelling them in any way (which is why it can use lots of
+publicly available data) with an automatic process to generate inputs and labels from those protein sequences.
+
+One important difference between this T5 model and the original T5 version is the denoising objective.
+The original T5-3B model was pretrained using a span denoising objective, while this model was pretrained with a BART-like MLM denoising objective.
+The masking probability is consistent with the original T5 training by randomly masking 15% of the amino acids in the input.
+
+It has been shown that the features extracted from this self-supervised model (LM-embeddings) captured important biophysical properties governing protein shape.
+This implied learning some of the grammar of the language of life realized in protein sequences.
+
+## Intended uses & limitations
+
+The model can be used for protein feature extraction or fine-tuned on downstream tasks.
+We have noticed that in some tasks one can gain more accuracy by fine-tuning the model rather than using it as a feature extractor.
+We have also noticed that for feature extraction, it is better to use the features extracted from the encoder rather than from the decoder.
+
+### How to use
+
+Here is how to use this model to extract the features of a given protein sequence in PyTorch:
+
+```python
+from transformers import T5Tokenizer, T5Model
+import re
+import torch
+
+tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_bfd', do_lower_case=False)
+
+model = T5Model.from_pretrained("Rostlab/prot_t5_xl_bfd")
+
+sequences_Example = ["A E T C Z A O","S K T Z P"]
+
+sequences_Example = [re.sub(r"[UZOB]", "X", sequence) for sequence in sequences_Example]
+
+ids = tokenizer.batch_encode_plus(sequences_Example, add_special_tokens=True, padding=True)
+
+input_ids = torch.tensor(ids['input_ids'])
+attention_mask = torch.tensor(ids['attention_mask'])
+
+with torch.no_grad():
+    embedding = model(input_ids=input_ids,attention_mask=attention_mask,decoder_input_ids=None)
+
+# For feature extraction we recommend using the encoder embedding
+encoder_embedding = embedding[2].cpu().numpy()
+decoder_embedding = embedding[0].cpu().numpy()
+```
+
+## Training data
+
+The ProtT5-XL-BFD model was pretrained on [BFD](https://bfd.mmseqs.com/), a dataset consisting of 2.1 billion protein sequences.
+
+## Training procedure
+
+### Preprocessing
+
+The protein sequences are uppercased and tokenized using a single space and a vocabulary size of 21. The rare amino acids "U,Z,O,B" were mapped to "X".
+The inputs of the model are then of the form:
+
+```
+Protein Sequence [EOS]
+```
+
+The preprocessing step was performed on the fly, by cutting and padding the protein sequences up to 512 tokens.
+
+The details of the masking procedure for each sequence are as follows:
+- 15% of the amino acids are masked.
+- In 90% of the cases, the masked amino acids are replaced by a `[MASK]` token.
+- In 10% of the cases, the masked amino acids are replaced by a random amino acid (different from the one they replace).
+
+### Pretraining
+
+The model was trained on a single TPU Pod V3-1024 for 1.2 million steps in total, using sequence length 512 (batch size 4k).
+It has a total of approximately 3B parameters and was trained using the encoder-decoder architecture.
+The optimizer used is AdaFactor with an inverse square root learning rate schedule for pretraining.
+
+
+## Evaluation results
+
+When used for feature extraction, the model achieves the following results:
+
+Test results:
+
+| Task/Dataset | secondary structure (3-states) | secondary structure (8-states)  |  Localization | Membrane  |
+|:-----:|:-----:|:-----:|:-----:|:-----:|
+|   CASP12  | 77 | 66 |    |    |
+|   TS115   | 85 | 74 |    |    | 
+|   CB513   | 84 | 71 |    |    |
+|  DeepLoc  |    |    | 77 | 91 |
+
+### BibTeX entry and citation info
+
+```bibtex
+@article {Elnaggar2020.07.12.199554,
+	author = {Elnaggar, Ahmed and Heinzinger, Michael and Dallago, Christian and Rehawi, Ghalia and Wang, Yu and Jones, Llion and Gibbs, Tom and Feher, Tamas and Angerer, Christoph and Steinegger, Martin and Bhowmik, Debsindhu and Rost, Burkhard},
+	title = {ProtTrans: Towards Cracking the Language of Life{\textquoteright}s Code Through Self-Supervised Deep Learning and High Performance Computing},
+	elocation-id = {2020.07.12.199554},
+	year = {2020},
+	doi = {10.1101/2020.07.12.199554},
+	publisher = {Cold Spring Harbor Laboratory},
+	abstract = {Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive language models (Transformer-XL, XLNet) and two auto-encoder models (Bert, Albert) on data from UniRef and BFD containing up to 393 billion amino acids (words) from 2.1 billion protein sequences (22- and 112 times the entire English Wikipedia). The LMs were trained on the Summit supercomputer at Oak Ridge National Laboratory (ORNL), using 936 nodes (total 5616 GPUs) and one TPU Pod (V3-512 or V3-1024). We validated the advantage of up-scaling LMs to larger models supported by bigger data by predicting secondary structure (3-states: Q3=76-84, 8 states: Q8=65-73), sub-cellular localization for 10 cellular compartments (Q10=74) and whether a protein is membrane-bound or water-soluble (Q2=89). Dimensionality reduction revealed that the LM-embeddings from unlabeled data (only protein sequences) captured important biophysical properties governing protein shape. This implied learning some of the grammar of the language of life realized in protein sequences. The successful up-scaling of protein LMs through HPC to larger data sets slightly reduced the gap between models trained on evolutionary information and LMs. Availability ProtTrans: https://github.com/agemagician/ProtTrans. Competing Interest Statement: The authors have declared no competing interest.},
+	URL = {https://www.biorxiv.org/content/early/2020/07/21/2020.07.12.199554},
+	eprint = {https://www.biorxiv.org/content/early/2020/07/21/2020.07.12.199554.full.pdf},
+	journal = {bioRxiv}
+}
+```
+
+> Created by [Ahmed Elnaggar/@Elnaggar_AI](https://twitter.com/Elnaggar_AI) | [LinkedIn](https://www.linkedin.com/in/prof-ahmed-elnaggar/)
--- a/model_cards/TypicaAI/magbert-ner/README.md
+++ b/model_cards/TypicaAI/magbert-ner/README.md
@@ -0,0 +1,55 @@
+---
+language: fr
+widget:
+- text: "Je m'appelle Hicham et je vis a Fès"
+---
+
+# MagBERT-NER: a state-of-the-art NER model for Moroccan French language (Maghreb)
+
+## Introduction
+
+MagBERT-NER is a state-of-the-art NER model for the Moroccan French language (Maghreb). It was fine-tuned for the NER task from CamemBERT, a French language model based on the RoBERTa architecture.
+
+For further information or requests, please visit the [Typica.AI website](https://typicasoft.io/).
+
+## How to use MagBERT-NER with HuggingFace
+
+##### Load MagBERT-NER and its sub-word tokenizer:
+```python
+from transformers import AutoTokenizer, AutoModelForTokenClassification
+
+tokenizer = AutoTokenizer.from_pretrained("TypicaAI/magbert-ner")
+model = AutoModelForTokenClassification.from_pretrained("TypicaAI/magbert-ner")
+```
+
+##### Process a text sample (from Wikipedia, about the current Prime Minister of Morocco) using the NER pipeline:
+```python
+from transformers import pipeline
+
+nlp = pipeline('ner', model=model, tokenizer=tokenizer, grouped_entities=True)
+nlp("Saad Dine El Otmani, né le 16 janvier 1956 à Inezgane, est un homme d'État marocain, chef du gouvernement du Maroc depuis le 5 avril 2017")
+
+
+#[{'entity_group': 'I-PERSON',
+#  'score': 0.8941445276141167,
+#  'word': 'Saad Dine El Otmani'},
+# {'entity_group': 'B-DATE',
+#  'score': 0.5967703461647034,
+#  'word': '16 janvier 1956'},
+# {'entity_group': 'B-GPE', 'score': 0.7160899192094803, 'word': 'Inezgane'},
+# {'entity_group': 'B-NORP', 'score': 0.7971733212471008, 'word': 'marocain'},
+# {'entity_group': 'B-GPE', 'score': 0.8921478390693665, 'word': 'Maroc'},
+# {'entity_group': 'B-DATE',
+#  'score': 0.5760444005330404,
+#  'word': '5 avril 2017'}]
+
+```
+
+
+
+
+## Authors 
+
+MagBERT-NER was trained and evaluated by Hicham Assoudi, Ph.D.
+
+
--- a/model_cards/abhilash1910/french-roberta/README.md
+++ b/model_cards/abhilash1910/french-roberta/README.md
@@ -0,0 +1,131 @@
+# RoBERTa Model Trained for Masked Language Modeling on a French Corpus :robot:
+
+
+This is a masked language model trained with [RoBERTa](https://huggingface.co/transformers/model_doc/roberta.html) on a small French news corpus (Leipzig Corpora).
+The model is built using Hugging Face transformers.
+The model can be found at: [French-Roberta](https://huggingface.co/abhilash1910/french-roberta)
+
+
+## Specifications
+
+
+The training corpus is taken from the Leipzig Corpora (French News); the model is trained on a small subset of the corpus (300K).
+
+
+## Model Specification
+
+
+The model chosen for training is [Roberta](https://arxiv.org/abs/1907.11692) with the following specifications:
+ 1. vocab_size=32000
+ 2. max_position_embeddings=514
+ 3. num_attention_heads=12
+ 4. num_hidden_layers=6
+ 5. type_vocab_size=1
+
+
+The model is configured using `RobertaConfig` from the transformers package. Total number of parameters trained: 68124416.
+The model is trained for 100 epochs with a GPU batch size of 64.
+More details on building custom models can be found on the [HuggingFace Blog](https://huggingface.co/blog/how-to-train).
+
+
+
+## Usage Specifications
+
+
+To use this model, first import `AutoTokenizer` and `AutoModelWithLMHead` from transformers.
+Then specify the pre-trained model, which in this case is 'abhilash1910/french-roberta', for both the tokenizer and the model.
+
+
+```python
+from transformers import AutoTokenizer, AutoModelWithLMHead
+
+tokenizer = AutoTokenizer.from_pretrained("abhilash1910/french-roberta")
+
+model = AutoModelWithLMHead.from_pretrained("abhilash1910/french-roberta")
+```
+
+
+This downloads the model; fetching all the model files may take some time.
+To test the model, import `pipeline` from transformers and create a fill-mask pipeline for inference as follows:
+
+
+```python
+from transformers import pipeline
+model_mask = pipeline('fill-mask', model='abhilash1910/french-roberta')
+model_mask("Le tweet <mask>.")
+```
+
+
+Some examples with generic French sentences:
+
+Example 1:
+
+
+```python
+model_mask("À ce jour, <mask> projet a entraîné")
+```
+
+
+Output:
+
+
+```python
+[{'sequence': '<s>À ce jour, belles projet a entraîné</s>',
+  'score': 0.18685665726661682,
+  'token': 6504,
+  'token_str': 'Ġbelles'},
+ {'sequence': '<s>À ce jour,- projet a entraîné</s>',
+  'score': 0.0005200508167035878,
+  'token': 17,
+  'token_str': '-'},
+ {'sequence': '<s>À ce jour, de projet a entraîné</s>',
+  'score': 0.00045729897101409733,
+  'token': 268,
+  'token_str': 'Ġde'},
+ {'sequence': '<s>À ce jour, du projet a entraîné</s>',
+  'score': 0.0004307595663703978,
+  'token': 326,
+  'token_str': 'Ġdu'},
+ {'sequence': '<s>À ce jour," projet a entraîné</s>',
+  'score': 0.0004219160182401538,
+  'token': 6,
+  'token_str': '"'}]
+  ```
+ 
+Example 2:
+
+```python
+model_mask("C'est un <mask>")
+```
+
+Output:
+
+```python
+[{'sequence': "<s>C'est un belles</s>",
+  'score': 0.16440927982330322,
+  'token': 6504,
+  'token_str': 'Ġbelles'},
+ {'sequence': "<s>C'est un de</s>",
+  'score': 0.0005495127406902611,
+  'token': 268,
+  'token_str': 'Ġde'},
+ {'sequence': "<s>C'est un du</s>",
+  'score': 0.00044988933950662613,
+  'token': 326,
+  'token_str': 'Ġdu'},
+ {'sequence': "<s>C'est un-</s>",
+  'score': 0.00044542422983795404,
+  'token': 17,
+  'token_str': '-'},
+ {'sequence': "<s>C'est un\t</s>",
+  'score': 0.00037563967634923756,
+  'token': 202,
+  'token_str': 'ĉ'}]
+  ```
+  
+
+## Resources
+
+For all resources, please refer to the [HuggingFace](https://huggingface.co/) site and its [repositories](https://github.com/huggingface).
+
+
--- a/model_cards/adalbertojunior/PTT5-SMALL-SUM/README.md
+++ b/model_cards/adalbertojunior/PTT5-SMALL-SUM/README.md
@@ -0,0 +1,37 @@
+---
+language: pt
+---
+
+# PTT5-SMALL-SUM
+
+## Model description
+
+This model was trained to summarize texts in Portuguese.
+
+
+It is based on `unicamp-dl/ptt5-small-portuguese-vocab`.
+
+#### How to use
+
+```python
+from transformers import T5Tokenizer, T5ForConditionalGeneration
+
+tokenizer = T5Tokenizer.from_pretrained('adalbertojunior/PTT5-SMALL-SUM')
+
+t5 = T5ForConditionalGeneration.from_pretrained('adalbertojunior/PTT5-SMALL-SUM')
+
+text = "Esse é um exemplo de sumarização."
+
+input_ids = tokenizer.encode(text, return_tensors="pt", add_special_tokens=True)
+
+generated_ids = t5.generate(
+    input_ids=input_ids,
+    num_beams=1,
+    max_length=40,
+    # repetition_penalty=2.5
+).squeeze()
+
+predicted_span = tokenizer.decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
+print(predicted_span)
+
+```
--- a/model_cards/ahotrod/albert_xxlargev1_squad2_512/README.md
+++ b/model_cards/ahotrod/albert_xxlargev1_squad2_512/README.md
@@ -1,71 +1,60 @@
 ## Albert xxlarge version 1 language model fine-tuned on SQuAD2.0

-### with the following results:
+### (updated 30 Sept 2020) with the following results:

 ```
-exact: 85.65653162637918
-f1: 89.260458954177
+exact: 86.11134506864315
+f1: 89.35371214945009
 total': 11873
-HasAns_exact': 82.6417004048583
-HasAns_f1': 89.8598902096736
+HasAns_exact': 83.56950067476383
+HasAns_f1': 90.06353312254078
 HasAns_total': 5928
-NoAns_exact': 88.66274179983179
-NoAns_f1': 88.66274179983179
+NoAns_exact': 88.64592094196804
+NoAns_f1': 88.64592094196804
 NoAns_total': 5945
-best_exact': 85.65653162637918
+best_exact': 86.11134506864315
 best_exact_thresh': 0.0
-best_f1': 89.2604589541768
+best_f1': 89.35371214944985
 best_f1_thresh': 0.0
 ```

 ### from script:

 ```
-python -m torch.distributed.launch --nproc_per_node=2 ${RUN_SQUAD_DIR}/run_squad.py \
--model_type albert \
--model_name_or_path albert-xxlarge-v1 \
--do_train \
--train_file ${SQUAD_DIR}/train-v2.0.json \
--predict_file ${SQUAD_DIR}/dev-v2.0.json \
--version_2_with_negative \
--num_train_epochs 3 \
--max_steps 8144 \
--warmup_steps 814 \
--do_lower_case \
--learning_rate 3e-5 \
--max_seq_length 512 \
--doc_stride 128 \
--save_steps 2000 \
--per_gpu_train_batch_size 1 \
--gradient_accumulation_steps 24 \
--output_dir ${MODEL_PATH}
-
-CUDA_VISIBLE_DEVICES=0 python ${RUN_SQUAD_DIR}/run_squad.py \
--model_type albert \
--model_name_or_path ${MODEL_PATH} \
--do_eval \
--train_file ${SQUAD_DIR}/train-v2.0.json \
--predict_file ${SQUAD_DIR}/dev-v2.0.json \
--version_2_with_negative \
--do_lower_case \
--max_seq_length 512 \
--per_gpu_eval_batch_size 48 \
--output_dir ${MODEL_PATH}
+python ${EXAMPLES}/run_squad.py \
+  --model_type albert \
+  --model_name_or_path albert-xxlarge-v1 \
+  --do_train \
+  --do_eval \
+  --train_file ${SQUAD}/train-v2.0.json \
+  --predict_file ${SQUAD}/dev-v2.0.json \
+  --version_2_with_negative \
+  --do_lower_case \
+  --num_train_epochs 3 \
+  --max_steps 8144 \
+  --warmup_steps 814 \
+  --learning_rate 3e-5 \
+  --max_seq_length 512 \
+  --doc_stride 128 \
+  --per_gpu_train_batch_size 6 \
+  --gradient_accumulation_steps 8 \
+  --per_gpu_eval_batch_size 48 \
+  --fp16 \
+  --fp16_opt_level O1 \
+  --threads 12 \
+  --logging_steps 50 \
+  --save_steps 3000 \
+  --overwrite_output_dir \
+  --output_dir ${MODEL_PATH}
 ```
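+
+As a quick sanity check on the flags above (a sketch, assuming a single GPU as listed in the software & system block; $B_{\text{eff}}$ is notation introduced here), the effective batch size per optimizer step and the warmup fraction work out to:
+
+$$
+B_{\text{eff}} = \underbrace{6}_{\text{per\_gpu\_train\_batch\_size}} \times \underbrace{8}_{\text{gradient\_accumulation\_steps}} = 48, \qquad \frac{\text{warmup\_steps}}{\text{max\_steps}} = \frac{814}{8144} \approx 0.10
+$$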

-### using the following system & software:
+### using the following software & system:

 ```
-OS/Platform: Linux-4.15.0-76-generic-x86_64-with-debian-buster-sid
-GPU/CPU: 2 x NVIDIA 1080Ti / Intel i7-8700
-Transformers: 2.3.0
-PyTorch: 1.4.0
-TensorFlow: 2.1.0
-Python: 3.7.6
+Transformers: 3.1.0
+PyTorch: 1.6.0
+TensorFlow: 2.3.1
+Python: 3.8.1
+OS: Linux-5.4.0-48-generic-x86_64-with-glibc2.10
+CPU/GPU: Intel i9-9900K / NVIDIA Titan RTX 24GB
 ```
-
-### Access this albert_xxlargev1_sqd2_512 fine-tuned model with:
-
-```python
-tokenizer = AutoTokenizer.from_pretrained("ahotrod/albert_xxlargev1_squad2_512")
-model = AutoModelForQuestionAnswering.from_pretrained("ahotrod/albert_xxlargev1_squad2_512")
--- a/model_cards/akhooli/personachat-arabic/README.md
+++ b/model_cards/akhooli/personachat-arabic/README.md
@@ -0,0 +1,12 @@
+---
+tags:
+- conversational
+language:
+- ar
+license: mit
+---
+## personachat-arabic (conversational AI)
+This is personachat-arabic, trained on a subset of the persona-chat validation dataset machine translated from English to Arabic
+and fine-tuned from [akhooli/gpt2-small-arabic](https://huggingface.co/akhooli/gpt2-small-arabic), which is a limited text generation model.  
+Usage: see the last section of this [example notebook](https://colab.research.google.com/drive/1I6RFOWMaTpPBX7saJYjnSTddW0TD6H1t?usp=sharing)  
+Note: the model was fine-tuned on a limited, machine-translated training set (do not use in production).
--- a/model_cards/akhooli/xlm-r-large-arabic-toxic/README.md
+++ b/model_cards/akhooli/xlm-r-large-arabic-toxic/README.md
@@ -0,0 +1,12 @@
+---
+
+language:
+- ar
+- en
+
+license: mit
+---
+### xlm-r-large-arabic-toxic (toxic/hate speech classifier) 
+Toxic (hate speech) classification (Label_0: non-toxic, Label_1: toxic) of Arabic comments, obtained by fine-tuning XLM-Roberta-Large.  
+Zero-shot classification of other languages (also works on mixed-language text, e.g. Arabic & English).  
+Usage and further info: see the last section of this [Colab notebook](https://lnkd.in/d3bCFyZ)
--- a/model_cards/allegro/herbert-base-cased/README.md
+++ b/model_cards/allegro/herbert-base-cased/README.md
@@ -0,0 +1,51 @@
+---
+language: pl
+tags:
+- herbert
+license: cc-by-sa-4.0
+---
+
+# HerBERT 
+**[HerBERT](https://en.wikipedia.org/wiki/Zbigniew_Herbert)** is a BERT-based language model trained on Polish corpora
+using MLM and SSO objectives with dynamic masking of whole words.
+Model training and experiments were conducted with [transformers](https://github.com/huggingface/transformers) version 2.9.
+
+## Tokenizer
+The training dataset was tokenized into subwords using ``CharBPETokenizer``, a character-level byte-pair encoding with
+a vocabulary size of 50k tokens. The tokenizer itself was trained with the [tokenizers](https://github.com/huggingface/tokenizers) library.
+We kindly encourage you to use the **Fast** version of the tokenizer, namely ``HerbertTokenizerFast``.
+
+## HerBERT usage
+
+
+Example code:
+```python
+from transformers import AutoTokenizer, AutoModel
+
+tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
+model = AutoModel.from_pretrained("allegro/herbert-base-cased")
+
+output = model(
+    **tokenizer.batch_encode_plus(
+        [
+            (
+                "A potem szedł środkiem drogi w kurzawie, bo zamiatał nogami, ślepy dziad prowadzony przez tłustego kundla na sznurku.",
+                "A potem leciał od lasu chłopak z butelką, ale ten ujrzawszy księdza przy drodze okrążył go z dala i biegł na przełaj pól do karczmy."
+            )
+        ],
+        padding='longest',
+        add_special_tokens=True,
+        return_tensors='pt'
+    )
+)
+```
+
+
+## License
+CC BY-SA 4.0
+
+
+## Authors
+Model was trained by **Allegro Machine Learning Research** team.
+
+You can contact us at: <a href="mailto:klejbenchmark@allegro.pl">klejbenchmark@allegro.pl</a>
--- a/model_cards/allegro/herbert-large-cased/README.md
+++ b/model_cards/allegro/herbert-large-cased/README.md
@@ -0,0 +1,50 @@
+---
+language: pl
+tags:
+- herbert
+license: cc-by-sa-4.0
+---
+# HerBERT 
+**[HerBERT](https://en.wikipedia.org/wiki/Zbigniew_Herbert)** is a BERT-based language model trained on Polish corpora
+using MLM and SSO objectives with dynamic masking of whole words.
+Model training and experiments were conducted with [transformers](https://github.com/huggingface/transformers) version 2.9.
+
+## Tokenizer
+The training dataset was tokenized into subwords using ``CharBPETokenizer``, a character-level byte-pair encoding with
+a vocabulary size of 50k tokens. The tokenizer itself was trained with the [tokenizers](https://github.com/huggingface/tokenizers) library.
+We kindly encourage you to use the **Fast** version of the tokenizer, namely ``HerbertTokenizerFast``.
+
+## HerBERT usage
+
+
+Example code:
+```python
+from transformers import AutoTokenizer, AutoModel
+
+tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-large-cased")
+model = AutoModel.from_pretrained("allegro/herbert-large-cased")
+
+output = model(
+    **tokenizer.batch_encode_plus(
+        [
+            (
+                "A potem szedł środkiem drogi w kurzawie, bo zamiatał nogami, ślepy dziad prowadzony przez tłustego kundla na sznurku.",
+                "A potem leciał od lasu chłopak z butelką, ale ten ujrzawszy księdza przy drodze okrążył go z dala i biegł na przełaj pól do karczmy."
+            )
+        ],
+        padding='longest',
+        add_special_tokens=True,
+        return_tensors='pt'
+    )
+)
+```
+
+
+## License
+CC BY-SA 4.0
+
+
+## Authors
+The model was trained by the **Allegro Machine Learning Research** team.
+
+You can contact us at: <a href="mailto:klejbenchmark@allegro.pl">klejbenchmark@allegro.pl</a>
--- a/model_cards/allenyummy/chinese-bert-wwm-ehr-ner-sl/README.md
+++ b/model_cards/allenyummy/chinese-bert-wwm-ehr-ner-sl/README.md
@@ -0,0 +1,15 @@
+---
+language: zh-tw
+---
+
+# Model name
+Chinese-bert-wwm-electrical-health-record-ner-sequence-labeling
+
+
+#### How to use
+
+```python
+from transformers import AutoTokenizer, AutoModelForTokenClassification
+tokenizer = AutoTokenizer.from_pretrained("allenyummy/chinese-bert-wwm-ehr-ner-sl")
+model = AutoModelForTokenClassification.from_pretrained("allenyummy/chinese-bert-wwm-ehr-ner-sl")
+```
--- a/model_cards/amine/bert-base-5lang-cased/README.md
+++ b/model_cards/amine/bert-base-5lang-cased/README.md
@@ -0,0 +1,64 @@
+---
+language: 
+- en
+- fr
+- es
+- de
+- zh
+
+tags:
+- pytorch
+- bert
+- multilingual
+- en
+- fr
+- es
+- de
+- zh
+
+datasets: wikipedia
+
+license: apache-2.0
+
+inference: false
+---
+
+# bert-base-5lang-cased
+This is a smaller version of [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) that handles only 5 languages (en, fr, es, de and zh) instead of 104.
+The model is therefore 30% smaller than the original one (124M parameters instead of 178M) but gives exactly the same representations for the languages listed above.
+Starting from `bert-base-5lang-cased` will facilitate the deployment of your model on public cloud platforms while keeping similar results.
+For instance, Google Cloud Platform requires the model size on disk to be below 500 MB for serverless deployments (Cloud Functions / Cloud ML), which is not the case for the original `bert-base-multilingual-cased`.
+
+For more information about the models' size, memory footprint and loading time, please refer to the table below:
+
+|            Model             | Num parameters |   Size   |  Memory  | Loading time |
+| ---------------------------- | -------------- | -------- | -------- | ------------ |
+| bert-base-multilingual-cased |   178 million  |  714 MB  | 1400 MB  |    4.2 sec   |
+| bert-base-5lang-cased        |   124 million  |  495 MB  |  950 MB  |    3.6 sec   |
+
+These measurements have been computed on a [Google Cloud n1-standard-1 machine (1 vCPU, 3.75 GB)](https://cloud.google.com/compute/docs/machine-types\#n1_machine_type).
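+
+The relative savings implied by the table can be worked out directly (a sketch using only the figures reported above, rounded to one decimal place):
+
+$$
+\begin{aligned}
+\text{parameters:} \quad & \frac{178 - 124}{178} \approx 30.3\% \\
+\text{size on disk:} \quad & \frac{714 - 495}{714} \approx 30.7\% \\
+\text{memory:} \quad & \frac{1400 - 950}{1400} \approx 32.1\% \\
+\text{loading time:} \quad & \frac{4.2 - 3.6}{4.2} \approx 14.3\%
+\end{aligned}
+$$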
+
+## How to use
+
+```python
+from transformers import AutoTokenizer, AutoModel
+
+tokenizer = AutoTokenizer.from_pretrained("amine/bert-base-5lang-cased")
+model = AutoModel.from_pretrained("amine/bert-base-5lang-cased")
+
+```
+
+### How to cite
+
+```bibtex
+@inproceedings{smallermbert,
+  title={Load What You Need: Smaller Versions of Multilingual BERT},
+  author={Abdaoui, Amine and Pradel, Camille and Sigel, Grégoire},
+  booktitle={SustaiNLP / EMNLP},
+  year={2020}
+}
+```
+
+## Contact 
+
+Please contact amine@geotrend.fr with any questions, feedback or requests.
--- a/model_cards/bayartsogt/bert-base-mongolian-cased/README.md
+++ b/model_cards/bayartsogt/bert-base-mongolian-cased/README.md
@@ -0,0 +1,60 @@
+---
+language: "mn"
+tags:
+- mongolian
+- cased
+---
+
+# BERT-BASE-MONGOLIAN-CASED
+[Link to Official Mongolian-BERT repo](https://github.com/tugstugi/mongolian-bert)
+
+## Model description
+This repository contains pre-trained Mongolian [BERT](https://arxiv.org/abs/1810.04805) models trained by [tugstugi](https://github.com/tugstugi), [enod](https://github.com/enod) and [sharavsambuu](https://github.com/sharavsambuu).
+Special thanks to [nabar](https://github.com/nabar) who provided 5x TPUs.
+
+This repository is based on the following open source projects: [google-research/bert](https://github.com/google-research/bert/),
+[huggingface/pytorch-pretrained-BERT](https://github.com/huggingface/pytorch-pretrained-BERT) and [yoheikikuta/bert-japanese](https://github.com/yoheikikuta/bert-japanese).
+
+#### How to use
+
+```python
+from transformers import pipeline, AlbertTokenizer, BertForMaskedLM
+
+tokenizer = AlbertTokenizer.from_pretrained('bayartsogt/bert-base-mongolian-cased')
+model = BertForMaskedLM.from_pretrained('bayartsogt/bert-base-mongolian-cased')
+
+## declare task ##
+pipe = pipeline(task="fill-mask", model=model, tokenizer=tokenizer)
+
+## example ##
+input_  = 'Миний [MASK] хоол идэх нь тун чухал.'
+
+output_ = pipe(input_)
+for prediction in output_:
+    print(prediction)
+    
+## Output ##
+# {'sequence': '[CLS] Миний хувьд хоол идэх нь тун чухал.[SEP]', 'score': 0.8734784722328186, 'token': 95, 'token_str': '▁хувьд'}
+# {'sequence': '[CLS] Миний бодлоор хоол идэх нь тун чухал.[SEP]', 'score': 0.09788835793733597, 'token': 6320, 'token_str': '▁бодлоор'}
+# {'sequence': '[CLS] Миний хүү хоол идэх нь тун чухал.[SEP]', 'score': 0.0027510314248502254, 'token': 590, 'token_str': '▁хүү'}
+# {'sequence': '[CLS] Миний бие хоол идэх нь тун чухал.[SEP]', 'score': 0.0014857524074614048, 'token': 267, 'token_str': '▁бие'}
+# {'sequence': '[CLS] Миний охин хоол идэх нь тун чухал.[SEP]', 'score': 0.0013575413031503558, 'token': 1116, 'token_str': '▁охин'}
+
+```
+
+
+## Training data
+Mongolian Wikipedia and the 700 million word Mongolian news data set  [[Pretraining Procedure](https://github.com/tugstugi/mongolian-bert#pre-training)]
+
+### BibTeX entry and citation info
+
+```bibtex
+@misc{mongolian-bert,
+  author = {Tuguldur, Erdene-Ochir and Gunchinish, Sharavsambuu and Bataa, Enkhbold},
+  title = {BERT Pretrained Models on Mongolian Datasets},
+  year = {2019},
+  publisher = {GitHub},
+  journal = {GitHub repository},
+  howpublished = {\url{https://github.com/tugstugi/mongolian-bert/}}
+}
+```
--- a/model_cards/bayartsogt/bert-base-mongolian-uncased/README.md
+++ b/model_cards/bayartsogt/bert-base-mongolian-uncased/README.md
@@ -0,0 +1,54 @@
+---
+language: "mn"
+tags:
+- bert
+- mongolian
+- uncased
+---
+
+# BERT-BASE-MONGOLIAN-UNCASED
+[Link to Official Mongolian-BERT repo](https://github.com/tugstugi/mongolian-bert)
+
+## Model description
+This repository contains pre-trained Mongolian [BERT](https://arxiv.org/abs/1810.04805) models trained by [tugstugi](https://github.com/tugstugi), [enod](https://github.com/enod) and [sharavsambuu](https://github.com/sharavsambuu).
+Special thanks to [nabar](https://github.com/nabar) who provided 5x TPUs.
+
+This repository is based on the following open source projects: [google-research/bert](https://github.com/google-research/bert/),
+[huggingface/pytorch-pretrained-BERT](https://github.com/huggingface/pytorch-pretrained-BERT) and [yoheikikuta/bert-japanese](https://github.com/yoheikikuta/bert-japanese).
+
+#### How to use
+
+```python
+from transformers import pipeline, AlbertTokenizer, BertForMaskedLM
+
+tokenizer = AlbertTokenizer.from_pretrained('bayartsogt/bert-base-mongolian-uncased')
+model = BertForMaskedLM.from_pretrained('bayartsogt/bert-base-mongolian-uncased')
+
+## declare task ##
+pipe = pipeline(task="fill-mask", model=model, tokenizer=tokenizer)
+
+## example ##
+input_  = 'Миний [MASK] хоол идэх нь тун чухал.'
+
+output_ = pipe(input_)
+for prediction in output_:
+    print(prediction)
+
+```
+
+
+## Training data
+Mongolian Wikipedia and the 700 million word Mongolian news data set  [[Pretraining Procedure](https://github.com/tugstugi/mongolian-bert#pre-training)]
+
+### BibTeX entry and citation info
+
+```bibtex
+@misc{mongolian-bert,
+  author = {Tuguldur, Erdene-Ochir and Gunchinish, Sharavsambuu and Bataa, Enkhbold},
+  title = {BERT Pretrained Models on Mongolian Datasets},
+  year = {2019},
+  publisher = {GitHub},
+  journal = {GitHub repository},
+  howpublished = {\url{https://github.com/tugstugi/mongolian-bert/}}
+}
+```
--- a/model_cards/cedpsam/chatbot_fr/README.md
+++ b/model_cards/cedpsam/chatbot_fr/README.md
@@ -13,6 +13,9 @@ trained with this notebook
 https://colab.research.google.com/drive/1pfCV3bngAmISNZVfDvBMyEhQKuYw37Rl#scrollTo=AyImj9qZYLRi&uniqifier=3

 config from microsoft/DialoGPT-medium
+dataset generated from the 2018 OpenSubtitles corpus from OPUS following these guidelines
+https://github.com/PolyAI-LDN/conversational-datasets/tree/master/opensubtitles with this notebook
+https://colab.research.google.com/drive/1uyh3vJ9nEjqOHI68VD73qxt4olJzODxi#scrollTo=deaacv4XfLMk
 ### How to use

 Now we are ready to try out how the model works as a chatting partner!