Fix CI after killing archive maps (#4724)

* 🐛 Fix model ids for BART and Flaubert
Release: v2.11.0
2020-06-02 10:21:09 -04:00 · 2020-06-02 09:49:09 -04:00 · 2020-06-02 09:39:33 -04:00 · 2020-06-02 11:03:46 +02:00 · 2020-06-02 11:02:27 +02:00 · 2020-06-02 04:29:28 -04:00
320 changed files with 17093 additions and 4348 deletions
--- a/.github/workflows/github-torch-hub.yml
+++ b/.github/workflows/github-torch-hub.yml
@@ -21,7 +21,7 @@ jobs:
    - name: Install dependencies
      run: |
        pip install torch
-        pip install numpy tokenizers filelock requests tqdm regex sentencepiece sacremoses
+        pip install numpy tokenizers filelock requests tqdm regex sentencepiece sacremoses packaging

    - name: Torch hub list
      run: |
--- a/.github/workflows/self-push.yml
+++ b/.github/workflows/self-push.yml
@@ -1,9 +1,13 @@
 name: Self-hosted runner (push)

 on: 
-  # push:
-  #   branches:
-  #     - master
+  push:
+    branches:
+      - master
+    paths: 
+      - "src/**"
+      - "tests/**"
+      - ".github/**"
  # pull_request:
  repository_dispatch:

@@ -31,8 +35,8 @@ jobs:
    - name: Install dependencies
      run: |
        source .env/bin/activate
-        pip install .[sklearn,tf,torch,testing]
-        pip uninstall -y tensorflow
+        pip install torch
+        pip install .[sklearn,testing]

    - name: Are GPUs recognized by our DL frameworks
      run: |
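The `paths:` filter added in the first hunk above limits the push trigger to changes under `src/`, `tests/` and `.github/`. A minimal Python sketch of that decision follows, assuming the workflow runs when at least one changed file matches at least one pattern. The function name `push_triggers_workflow` is hypothetical, and `fnmatchcase` is only an approximation of GitHub's own glob matcher.

```python
from fnmatch import fnmatchcase

# Path filters copied from the `paths:` key in the hunk above.
PATH_FILTERS = ["src/**", "tests/**", ".github/**"]


def push_triggers_workflow(changed_files, filters=PATH_FILTERS):
    """Return True if any changed file matches any path filter.

    Assumption: fnmatchcase stands in for GitHub's matcher. Its `*` also
    matches `/`, which is close enough to `**` for these three patterns.
    """
    return any(
        fnmatchcase(path, pattern)
        for path in changed_files
        for pattern in filters
    )


# A docs-only push would be skipped; a push touching src/ would run.
print(push_triggers_workflow(["README.md", "docs/source/index.rst"]))
print(push_triggers_workflow(["src/transformers/modeling_bert.py"]))
```

Under this reading, a push that only edits `README.md` or `docs/` no longer occupies the self-hosted GPU runner.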
--- a/.github/workflows/self-scheduled.yml
+++ b/.github/workflows/self-scheduled.yml
@@ -31,13 +31,12 @@ jobs:
    - name: Install dependencies
      run: |
        source .env/bin/activate
-        pip install .[sklearn,tf,torch,testing]
+        pip install .[sklearn,torch,testing]

    - name: Are GPUs recognized by our DL frameworks
      run: |
        source .env/bin/activate
        python -c "import torch; print(torch.cuda.is_available())"
-        python -c "import tensorflow as tf; print(tf.test.is_built_with_cuda(), tf.config.list_physical_devices('GPU'))"

    - name: Run all tests on GPU
      env:
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -44,9 +44,16 @@ Did not find it? :( So we can act quickly on it, please follow these steps:
 To get the OS and software versions automatically, you can run the following command:

 ```bash
-python transformers-cli env
+transformers-cli env
 ```

+or from the root of the repository the following command:
+
+```bash
+python src/transformers/commands/transformers_cli.py env
+```
+
+
 ### Do you want to implement a new model?

 Awesome! Please provide the following information:
@@ -198,11 +205,12 @@ Follow these steps to start contributing:
   are useful to avoid duplicated work, and to differentiate it from PRs ready
   to be merged;
 4. Make sure existing tests pass;
-5. Add high-coverage tests. No quality test, no merge. 
+5. Add high-coverage tests. No quality testing = no merge. 
 - If you are adding a new model, make sure that you use `ModelTester.all_model_classes = (MyModel, MyModelWithLMHead,...)`, which triggers the common tests.
 - If you are adding new `@slow` tests, make sure they pass using `RUN_SLOW=1 python -m pytest tests/test_my_new_model.py`. 
+ - If you are adding a new tokenizer, write tests, and make sure `RUN_SLOW=1 python -m pytest tests/test_tokenization_{your_model_name}.py` passes.
 CircleCI does not run them. 
-6. All public methods must have informative docstrings;
+6. All public methods must have informative docstrings that work nicely with sphinx. See `modeling_ctrl.py` for an example.

 ### Tests

--- a/README.md
+++ b/README.md
@@ -63,7 +63,7 @@ Choose the right framework for every part of a model's lifetime

 ## Installation

-This repo is tested on Python 3.6+, PyTorch 1.0.0+ and TensorFlow 2.0.
+This repo is tested on Python 3.6+, PyTorch 1.0.0+ (PyTorch 1.3.1+ for examples) and TensorFlow 2.0.

 You should install 🤗 Transformers in a [virtual environment](https://docs.python.org/3/library/venv.html). If you're unfamiliar with Python virtual environments, check out the [user guide](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/).

@@ -164,8 +164,10 @@ At some point in the future, you'll be able to seamlessly move from pre-training
 17. **[ELECTRA](https://huggingface.co/transformers/model_doc/electra.html)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
 18. **[DialoGPT](https://huggingface.co/transformers/model_doc/dialogpt.html)** (from Microsoft Research) released with the paper [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://arxiv.org/abs/1911.00536) by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
 19. **[Reformer](https://huggingface.co/transformers/model_doc/reformer.html)** (from Google Research) released with the paper [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
-20. **[Other community models](https://huggingface.co/models)**, contributed by the [community](https://huggingface.co/users).
-21. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedback before starting your PR.
+20. **[MarianMT](https://huggingface.co/transformers/model_doc/marian.html)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team.
+21. **[Longformer](https://huggingface.co/transformers/model_doc/longformer.html)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan.
+22. **[Other community models](https://huggingface.co/models)**, contributed by the [community](https://huggingface.co/users).
+23. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedback before starting your PR.

 These implementations have been tested on several datasets (see the example scripts) and should match the performance of the original implementations (e.g. ~93 F1 on SQuAD for BERT Whole-Word-Masking, ~88 F1 on RocStories for OpenAI GPT, ~18.3 perplexity on WikiText 103 for Transformer-XL, ~0.916 Pearson R coefficient on STS-B for XLNet). You can find more details on performance in the Examples section of the [documentation](https://huggingface.co/transformers/examples.html).

@@ -338,8 +340,8 @@ python ./examples/text-classification/run_glue.py \
    --do_eval \
    --data_dir $GLUE_DIR/$TASK_NAME \
    --max_seq_length 128 \
-    --per_gpu_eval_batch_size=8   \
-    --per_gpu_train_batch_size=8   \
+    --per_device_eval_batch_size=8   \
+    --per_device_train_batch_size=8   \
    --learning_rate 2e-5 \
    --num_train_epochs 3.0 \
    --output_dir /tmp/$TASK_NAME/
@@ -365,8 +367,8 @@ python ./examples/text-classification/run_glue.py \
    --data_dir=${GLUE_DIR}/STS-B  \
    --output_dir=./proc_data/sts-b-110   \
    --max_seq_length=128   \
-    --per_gpu_eval_batch_size=8   \
-    --per_gpu_train_batch_size=8   \
+    --per_device_eval_batch_size=8   \
+    --per_device_train_batch_size=8   \
    --gradient_accumulation_steps=1 \
    --max_steps=1200  \
    --model_name=xlnet-large-cased   \
@@ -389,8 +391,8 @@ python -m torch.distributed.launch --nproc_per_node 8 ./examples/text-classifica
    --do_eval   \
    --data_dir $GLUE_DIR/MRPC/   \
    --max_seq_length 128   \
-    --per_gpu_eval_batch_size=8   \
-    --per_gpu_train_batch_size=8   \
+    --per_device_eval_batch_size=8   \
+    --per_device_train_batch_size=8   \
    --learning_rate 2e-5   \
    --num_train_epochs 3.0  \
    --output_dir /tmp/mrpc_output/ \
@@ -414,7 +416,7 @@ Training with these hyper-parameters gave us the following results:
 This example code fine-tunes BERT on the SQuAD dataset using distributed training on 8 V100 GPUs and Bert Whole Word Masking uncased model to reach a F1 > 93 on SQuAD:

 ```bash
-python -m torch.distributed.launch --nproc_per_node=8 ./examples/run_squad.py \
+python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answering/run_squad.py \
    --model_type bert \
    --model_name_or_path bert-large-uncased-whole-word-masking \
    --do_train \
@@ -426,8 +428,8 @@ python -m torch.distributed.launch --nproc_per_node=8 ./examples/run_squad.py \
    --max_seq_length 384 \
    --doc_stride 128 \
    --output_dir ../models/wwm_uncased_finetuned_squad/ \
-    --per_gpu_eval_batch_size=3   \
-    --per_gpu_train_batch_size=3   \
+    --per_device_eval_batch_size=3   \
+    --per_device_train_batch_size=3   \
 ```

 Training with these hyper-parameters gave us the following results:
@@ -447,7 +449,7 @@ The generation script includes the [tricks](https://github.com/rusiaaman/XLNet-g
 Here is how to run the script with the small version of OpenAI GPT-2 model:

 ```shell
-python ./examples/run_generation.py \
+python ./examples/text-generation/run_generation.py \
    --model_type=gpt2 \
    --length=20 \
    --model_name_or_path=gpt2 \
@@ -455,7 +457,7 @@ python ./examples/run_generation.py \

 and from the Salesforce CTRL model:
 ```shell
-python ./examples/run_generation.py \
+python ./examples/text-generation/run_generation.py \
    --model_type=ctrl \
    --length=20 \
    --model_name_or_path=ctrl \
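The README hunks above rename `--per_gpu_*_batch_size` to `--per_device_*_batch_size`. The value is a per-device size, so in data-parallel training the total batch per optimizer step also depends on the number of devices and on gradient accumulation. A small sketch of that arithmetic follows. The helper name `effective_batch_size` is hypothetical, and the SQuAD figure assumes no gradient accumulation, since the diff does not show that part of the command.

```python
def effective_batch_size(per_device_batch_size, num_devices, gradient_accumulation_steps=1):
    """Total number of examples contributing to one optimizer step."""
    return per_device_batch_size * num_devices * gradient_accumulation_steps


# SQuAD command above: --nproc_per_node=8 with --per_device_train_batch_size=3
squad_total = effective_batch_size(per_device_batch_size=3, num_devices=8)

# MRPC distributed command above: --nproc_per_node 8 with --per_device_train_batch_size=8
mrpc_total = effective_batch_size(per_device_batch_size=8, num_devices=8)

print(squad_total, mrpc_total)
```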
--- a/docs/README.md
+++ b/docs/README.md
@@ -67,3 +67,131 @@ It should build the static app that will be available under `/docs/_build/html`

 Accepted files are reStructuredText (.rst) and Markdown (.md). Create a file with its extension and put it
 in the source directory. You can then link it to the toc-tree by putting the filename without the extension.
+
+## Writing Documentation - Specification
+
+The `huggingface/transformers` documentation follows the
+[Google documentation](https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html) style. It is
+mostly written in ReStructuredText 
+([Sphinx simple documentation](https://www.sphinx-doc.org/en/master/usage/restructuredtext/index.html), 
+[Sourceforge complete documentation](https://docutils.sourceforge.io/docs/ref/rst/restructuredtext.html))
+
+### Adding a new section
+
+A section is a page held in the `Notes` toc-tree on the documentation. Adding a new section is done in two steps:
+
+- Add a new file under `./source`. This file can either be ReStructuredText (.rst) or Markdown (.md).
+- Link that file in `./source/index.rst` on the correct toc-tree.
+
+### Adding a new model
+
+When adding a new model:
+ 
+- Create a file `xxx.rst` under `./source/model_doc`. 
+- Link that file in `./source/index.rst` on the `model_doc` toc-tree.
+- Write a short overview of the model:
+    - Overview with paper & authors
+    - Paper abstract
+    - Tips and tricks and how to use it best
+- Add the classes that should be linked in the model. This generally includes the configuration, the tokenizer, and
+  every model of that class (the base model, alongside models with additional heads), both in PyTorch and TensorFlow.
+  The order is generally: 
+    - Configuration, 
+    - Tokenizer
+    - PyTorch base model
+    - PyTorch head models
+    - TensorFlow base model
+    - TensorFlow head models
+
+These classes should be added using the RST syntax. Usually as follows:
+```
+XXXConfig
+~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.XXXConfig
+    :members:
+```
+
+This will include every public method of the configuration. If for some reason you wish for a method not to be displayed
+in the documentation, you can do so by specifying which methods should be in the docs:
+
+```
+XXXTokenizer
+~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.XXXTokenizer
+    :members: build_inputs_with_special_tokens, get_special_tokens_mask,
+        create_token_type_ids_from_sequences, save_vocabulary
+
+```
+
+### Writing source documentation
+
+Values that should be put in `code` should either be surrounded by double backticks: \`\`like so\`\` or be written as an object
+using the :obj: syntax: :obj:\`like so\`.
+
+When mentioning a class, it is recommended to use the :class: syntax as the mentioned class will be automatically
+linked by Sphinx: :class:\`transformers.XXXClass\`
+
+When mentioning a function, it is recommended to use the :func: syntax as the mentioned method will be automatically
+linked by Sphinx: :func:\`transformers.XXXClass.method\`
+
+Links should be done as so (note the double underscore at the end): \`text for the link <./local-link-or-global-link#loc>\`__
+
+#### Defining arguments in a method
+
+Arguments should be defined with the `Args:` prefix, followed by a line return and an indentation. 
+The argument should be followed by its type, with its shape if it is a tensor, and a line return.
+Another indentation is necessary before writing the description of the argument.
+
+Here's an example showcasing everything so far:
+
+```
+    Args:
+        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
+            Indices of input sequence tokens in the vocabulary.
+
+            Indices can be obtained using :class:`transformers.AlbertTokenizer`.
+            See :func:`transformers.PreTrainedTokenizer.encode` and
+            :func:`transformers.PreTrainedTokenizer.encode_plus` for details.
+
+            `What are input IDs? <../glossary.html#input-ids>`__
+```
+
+#### Writing a multi-line code block 
+
+Multi-line code blocks can be useful for displaying examples. They are done like so:
+
+```
+Example::
+
+    # first line of code
+    # second line
+    # etc
+```
+
+The `Example` string at the beginning can be replaced by anything as long as there are two colons following it.
+
+#### Writing a return block
+
+Return values should be defined with the `Returns:` prefix, followed by a line return and an indentation.
+The first line should be the type of the return, followed by a line return. No need to indent further for the elements
+building the return.
+
+Here's an example for tuple return, comprising several objects:
+
+```
+    Returns:
+        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:
+        loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
+            Total loss as the sum of the masked language modeling loss and the next sequence prediction (classification) loss.
+        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)
+            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
+```
+
+Here's an example for a single value return:
+
+```
+    Returns:
+        A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
+```
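The conventions added to `docs/README.md` above (`Args:` with `:obj:` types, a `Returns:` block, and an `Example::` literal block) can be combined in a single docstring. The sketch below is illustrative only: `special_tokens_mask` and its parameters are hypothetical and are not part of the library.

```python
def special_tokens_mask(token_ids, special_ids):
    """
    Flags which positions of a sequence hold special tokens.

    Args:
        token_ids (:obj:`List[int]`):
            Token ids of the sequence.
        special_ids (:obj:`Set[int]`):
            Ids that should be treated as special tokens.

    Returns:
        :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.

    Example::

        mask = special_tokens_mask([101, 7592, 102], {101, 102})
    """
    # Hypothetical helper: membership in `special_ids` decides the flag.
    return [1 if token_id in special_ids else 0 for token_id in token_ids]


print(special_tokens_mask([101, 7592, 102], {101, 102}))
```

The argument type sits on the same line as the argument name and its description is indented one level further, as the specification asks.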
--- a/docs/source/bertology.rst
+++ b/docs/source/bertology.rst
@@ -15,4 +15,4 @@ In order to help this new field develop, we have included a few additional featu
 * accessing all the attention weights for each head of BERT/GPT/GPT-2,
 * retrieving heads output values and gradients to be able to compute head importance score and prune head as explained in https://arxiv.org/abs/1905.10650.

-To help you understand and use these features, we have added a specific example script: `bertology.py <https://github.com/huggingface/transformers/blob/master/examples/run_bertology.py>`_ which extracts information from and prunes a model pre-trained on GLUE.
+To help you understand and use these features, we have added a specific example script: `bertology.py <https://github.com/huggingface/transformers/blob/master/examples/bertology/run_bertology.py>`_ which extracts information from and prunes a model pre-trained on GLUE.
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -26,7 +26,7 @@ author = u'huggingface'
 # The short X.Y version
 version = u''
 # The full version, including alpha/beta/rc tags
-release = u'2.9.0'
+release = u'2.11.0'


 # -- General configuration ---------------------------------------------------
--- a/docs/source/converting_tensorflow_models.rst
+++ b/docs/source/converting_tensorflow_models.rst
@@ -12,7 +12,7 @@ A command-line interface is provided to convert original Bert/GPT/GPT-2/Transfor
 BERT
 ^^^^

-You can convert any TensorFlow checkpoint for BERT (in particular `the pre-trained models released by Google <https://github.com/google-research/bert#pre-trained-models>`_\ ) in a PyTorch save file by using the `convert_tf_checkpoint_to_pytorch.py <https://github.com/huggingface/transformers/blob/master/transformers/convert_tf_checkpoint_to_pytorch.py>`_ script.
+You can convert any TensorFlow checkpoint for BERT (in particular `the pre-trained models released by Google <https://github.com/google-research/bert#pre-trained-models>`_\ ) in a PyTorch save file by using the `convert_bert_original_tf_checkpoint_to_pytorch.py <https://github.com/huggingface/transformers/blob/master/src/transformers/convert_bert_original_tf_checkpoint_to_pytorch.py>`_ script.

 This CLI takes as input a TensorFlow checkpoint (three files starting with ``bert_model.ckpt``\ ) and the associated configuration file (\ ``bert_config.json``\ ), and creates a PyTorch model for this configuration, loads the weights from the TensorFlow checkpoint in the PyTorch model and saves the resulting model in a standard PyTorch save file that can be imported using ``torch.load()`` (see examples in `run_bert_extract_features.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_extract_features.py>`_\ , `run_bert_classifier.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_classifier.py>`_ and `run_bert_squad.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_squad.py>`_\ ).

@@ -33,6 +33,26 @@ Here is an example of the conversion process for a pre-trained ``BERT-Base Uncas

 You can download Google's pre-trained models for the conversion `here <https://github.com/google-research/bert#pre-trained-models>`__.

+ALBERT
+^^^^^^
+
+Convert TensorFlow model checkpoints of ALBERT to PyTorch using the `convert_albert_original_tf_checkpoint_to_pytorch.py <https://github.com/huggingface/transformers/blob/master/src/transformers/convert_albert_original_tf_checkpoint_to_pytorch.py>`_ script.
+
+The CLI takes as input a TensorFlow checkpoint (three files starting with ``model.ckpt-best``\ ) and the accompanying configuration file (\ ``albert_config.json``\ ), then creates and saves a PyTorch model. To run this conversion you will need to have TensorFlow and PyTorch installed.
+
+Here is an example of the conversion process for the pre-trained ``ALBERT Base`` model:
+
+.. code-block:: shell
+
+   export ALBERT_BASE_DIR=/path/to/albert/albert_base
+
+   transformers-cli convert --model_type albert \
+     --tf_checkpoint $ALBERT_BASE_DIR/model.ckpt-best \
+     --config $ALBERT_BASE_DIR/albert_config.json \
+     --pytorch_dump_output $ALBERT_BASE_DIR/pytorch_model.bin
+
+You can download Google's pre-trained models for the conversion `here <https://github.com/google-research/albert#pre-trained-models>`__.
+
 OpenAI GPT
 ^^^^^^^^^^

--- a/docs/source/examples.md
+++ b/docs/source/examples.md
@@ -1,649 +0,0 @@
-# Examples
-
-In this section a few examples are put together. All of these examples work for several models, making use of the very
-similar API between the different models.
-
-**Important**
-To run the latest versions of the examples, you have to install from source and install some specific requirements for the examples.
-Execute the following steps in a new virtual environment:
-
-```bash
-git clone https://github.com/huggingface/transformers
-cd transformers
-pip install .
-pip install -r ./examples/requirements.txt
-```
-
-| Section                    | Description                                                                                                                                                |
-|----------------------------|------------------------------------------------------------------------------------------------------------------------------------------
-| [TensorFlow 2.0 models on GLUE](#TensorFlow-2.0-Bert-models-on-GLUE) | Examples running BERT TensorFlow 2.0 model on the GLUE tasks. |
-| [Running on TPUs](#running-on-tpus) | Examples on running fine-tuning tasks on Google TPUs to accelerate workloads. |
-| [Language Model training](#language-model-training) | Fine-tuning (or training from scratch) the library models for language modeling on a text dataset. Causal language modeling for GPT/GPT-2, masked language modeling for BERT/RoBERTa. |
-| [Language Generation](#language-generation) | Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL and XLNet. |
-| [GLUE](#glue) | Examples running BERT/XLM/XLNet/RoBERTa on the 9 GLUE tasks. Examples feature distributed training as well as half-precision. |
-| [SQuAD](#squad) | Using BERT/RoBERTa/XLNet/XLM for question answering, examples with distributed training. |
-| [Multiple Choice](#multiple-choice) | Examples running BERT/XLNet/RoBERTa on the SWAG/RACE/ARC tasks. |
-| [Named Entity Recognition](https://github.com/huggingface/transformers/tree/master/examples/ner) | Using BERT for Named Entity Recognition (NER) on the CoNLL 2003 dataset, examples with distributed training. |
-| [XNLI](#xnli) | Examples running BERT/XLM on the XNLI benchmark. |
-| [Adversarial evaluation of model performances](#adversarial-evaluation-of-model-performances) | Testing a model with adversarial evaluation of natural language inference on the Heuristic Analysis for NLI Systems (HANS) dataset (McCoy et al., 2019.) |
-
-## TensorFlow 2.0 Bert models on GLUE
-
-Based on the script [`run_tf_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/run_tf_glue.py).
-
-Fine-tuning the library TensorFlow 2.0 Bert model for sequence classification on the  MRPC task of the GLUE benchmark: [General Language Understanding Evaluation](https://gluebenchmark.com/).
-
-This script has an option for mixed precision (Automatic Mixed Precision / AMP) to run models on Tensor Cores (NVIDIA Volta/Turing GPUs) and future hardware and an option for XLA, which uses the XLA compiler to reduce model runtime.
-Options are toggled using `USE_XLA` or `USE_AMP` variables in the script.
-These options and the below benchmark are provided by @tlkh.
-
-Quick benchmarks from the script (no other modifications):
-
-| GPU    | Mode | Time (2nd epoch) | Val Acc (3 runs) |
-| --------- | -------- | ----------------------- | ----------------------|
-| Titan V | FP32 | 41s | 0.8438/0.8281/0.8333 |
-| Titan V | AMP | 26s | 0.8281/0.8568/0.8411 |
-| V100    | FP32 | 35s | 0.8646/0.8359/0.8464 |
-| V100    | AMP | 22s | 0.8646/0.8385/0.8411 |
-| 1080 Ti | FP32 | 55s | - |
-
-Mixed precision (AMP) reduces the training time considerably for the same hardware and hyper-parameters (same batch size was used).
-
-## Running on TPUs
-
-You can accelerate your workloads on Google's TPUs. For information on how to setup your TPU environment refer to this
-[README](https://github.com/pytorch/xla/blob/master/README.md).
-
-The following are some examples of running the `*_tpu.py` finetuning scripts on TPUs. All steps for data preparation are
-identical to your normal GPU + Huggingface setup.
-
-### GLUE
-
-Before running any one of these GLUE tasks you should download the
-[GLUE data](https://gluebenchmark.com/tasks) by running
-[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
-and unpack it to some directory `$GLUE_DIR`.
-
-For running your GLUE task on MNLI dataset you can run something like the following:
-
-```
-export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"
-export GLUE_DIR=/path/to/glue
-export TASK_NAME=MNLI
-
-python run_glue_tpu.py \
-  --model_type bert \
-  --model_name_or_path bert-base-cased \
-  --task_name $TASK_NAME \
-  --do_train \
-  --do_eval \
-  --data_dir $GLUE_DIR/$TASK_NAME \
-  --max_seq_length 128 \
-  --train_batch_size 32 \
-  --learning_rate 3e-5 \
-  --num_train_epochs 3.0 \
-  --output_dir /tmp/$TASK_NAME \
-  --overwrite_output_dir \
-  --logging_steps 50 \
-  --save_steps 200 \
-  --num_cores=8 \
-  --only_log_master
-```
-
-
-## Language model training
-
-Based on the script [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py).
-
-Fine-tuning (or training from scratch) the library models for language modeling on a text dataset for GPT, GPT-2, BERT and RoBERTa (DistilBERT
-to be added soon). GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa
-are fine-tuned using a masked language modeling (MLM) loss.
-
-Before running the following example, you should get a file that contains text on which the language model will be
-trained or fine-tuned. A good example of such text is the [WikiText-2 dataset](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/).
-
-We will refer to two different files: `$TRAIN_FILE`, which contains text for training, and `$TEST_FILE`, which contains
-text that will be used for evaluation.
-
-### GPT-2/GPT and causal language modeling
-
-The following example fine-tunes GPT-2 on WikiText-2. We're using the raw WikiText-2 (no tokens were replaced before
-the tokenization). The loss here is that of causal language modeling.
-
-```bash
-export TRAIN_FILE=/path/to/dataset/wiki.train.raw
-export TEST_FILE=/path/to/dataset/wiki.test.raw
-
-python run_language_modeling.py \
-    --output_dir=output \
-    --model_type=gpt2 \
-    --model_name_or_path=gpt2 \
-    --do_train \
-    --train_data_file=$TRAIN_FILE \
-    --do_eval \
-    --eval_data_file=$TEST_FILE
-```
-
-This takes about half an hour to train on a single K80 GPU and about one minute for the evaluation to run. It reaches
-a score of ~20 perplexity once fine-tuned on the dataset.
-
-### RoBERTa/BERT and masked language modeling
-
-The following example fine-tunes RoBERTa on WikiText-2. Here too, we're using the raw WikiText-2. The loss is different
-as BERT/RoBERTa have a bidirectional mechanism; we're therefore using the same loss that was used during their
-pre-training: masked language modeling.
-
-In accordance with the RoBERTa paper, we use dynamic masking rather than static masking. The model may, therefore, converge
-slightly slower (over-fitting takes more epochs).
-
-We use the `--mlm` flag so that the script may change its loss function.
-
-```bash
-export TRAIN_FILE=/path/to/dataset/wiki.train.raw
-export TEST_FILE=/path/to/dataset/wiki.test.raw
-
-python run_language_modeling.py \
-    --output_dir=output \
-    --model_type=roberta \
-    --model_name_or_path=roberta-base \
-    --do_train \
-    --train_data_file=$TRAIN_FILE \
-    --do_eval \
-    --eval_data_file=$TEST_FILE \
-    --mlm
-```
-
-## Language generation
-
-Based on the script [`run_generation.py`](https://github.com/huggingface/transformers/blob/master/examples/run_generation.py).
-
-Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL, XLNet, CTRL.
-A similar script is used for our official demo [Write With Transformer](https://transformer.huggingface.co), where you
-can try out the different models available in the library.
-
-Example usage:
-
-```bash
-python run_generation.py \
-    --model_type=gpt2 \
-    --model_name_or_path=gpt2
-```
-
-## GLUE
-
-Based on the script [`run_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_glue.py).
-
-Fine-tuning the library models for sequence classification on the GLUE benchmark: [General Language Understanding
-Evaluation](https://gluebenchmark.com/). This script can fine-tune the following models: BERT, XLM, XLNet and RoBERTa.
-
-GLUE is made up of a total of 9 different tasks. We get the following results on the dev set of the benchmark with an
-uncased BERT base model (the checkpoint `bert-base-uncased`). All experiments ran on single V100 GPUs with total train
-batch sizes between 16 and 64. Some of these tasks have a small dataset and training can lead to high variance in the results
-between different runs. We report the median on 5 runs (with different seeds) for each of the metrics.
-
-| Task  | Metric                       | Result      |
-|-------|------------------------------|-------------|
-| CoLA  | Matthew's corr               | 49.23       |
-| SST-2 | Accuracy                     | 91.97       |
-| MRPC  | F1/Accuracy                  | 89.47/85.29 |
-| STS-B | Pearson/Spearman corr.       | 83.95/83.70 |
-| QQP   | Accuracy/F1                  | 88.40/84.31 |
-| MNLI  | Matched acc./Mismatched acc. | 80.61/81.08 |
-| QNLI  | Accuracy                     | 87.46       |
-| RTE   | Accuracy                     | 61.73       |
-| WNLI  | Accuracy                     | 45.07       |
-
-Some of these results are significantly different from the ones reported on the test set
-of the GLUE benchmark on the website. For QQP and WNLI, please refer to [FAQ #12](https://gluebenchmark.com/faq) on the website.
-
-Before running any one of these GLUE tasks you should download the
-[GLUE data](https://gluebenchmark.com/tasks) by running
-[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
-and unpack it to some directory `$GLUE_DIR`.
-
-```bash
-export GLUE_DIR=/path/to/glue
-export TASK_NAME=MRPC
-
-python run_glue.py \
-  --model_type bert \
-  --model_name_or_path bert-base-cased \
-  --task_name $TASK_NAME \
-  --do_train \
-  --do_eval \
-  --data_dir $GLUE_DIR/$TASK_NAME \
-  --max_seq_length 128 \
-  --per_gpu_train_batch_size 32 \
-  --learning_rate 2e-5 \
-  --num_train_epochs 3.0 \
-  --output_dir /tmp/$TASK_NAME/
-```
-
-where task name can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI.
-
-The dev set results will be present within the text file `eval_results.txt` in the specified output_dir.
-In case of MNLI, since there are two separate dev sets (matched and mismatched), there will be a separate
-output folder called `/tmp/MNLI-MM/` in addition to `/tmp/MNLI/`.
-
-The code has not been tested with half-precision training with apex on any GLUE task apart from MRPC, MNLI,
-CoLA, SST-2. The following section provides details on how to run half-precision training with MRPC. With that being
-said, there shouldn’t be any issues in running half-precision training with the remaining GLUE tasks as well,
-since the data processor for each task inherits from the base class DataProcessor.
-
-### MRPC
-
-#### Fine-tuning example
-
-The following example fine-tunes BERT on the Microsoft Research Paraphrase Corpus (MRPC) and runs in less
-than 10 minutes on a single K80 and in 27 seconds (!) on a single Tesla V100 16GB with apex installed.
-
-Before running any one of these GLUE tasks you should download the
-[GLUE data](https://gluebenchmark.com/tasks) by running
-[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
-and unpack it to some directory `$GLUE_DIR`.
-
-```bash
-export GLUE_DIR=/path/to/glue
-
-python run_glue.py \
-  --model_name_or_path bert-base-cased \
-  --task_name MRPC \
-  --do_train \
-  --do_eval \
-  --data_dir $GLUE_DIR/MRPC/ \
-  --max_seq_length 128 \
-  --per_gpu_train_batch_size 32 \
-  --learning_rate 2e-5 \
-  --num_train_epochs 3.0 \
-  --output_dir /tmp/mrpc_output/
-```
-
-Our test ran on a few seeds with [the original implementation hyper-
-parameters](https://github.com/google-research/bert#sentence-and-sentence-pair-classification-tasks) gave evaluation
-results between 84% and 88%.
-
-#### Using Apex and mixed-precision
-
-Using Apex and 16 bit precision, the fine-tuning on MRPC only takes 27 seconds. First install
-[apex](https://github.com/NVIDIA/apex), then run the following example:
-
-```bash
-export GLUE_DIR=/path/to/glue
-
-python run_glue.py \
-  --model_name_or_path bert-base-cased \
-  --task_name MRPC \
-  --do_train \
-  --do_eval \
-  --data_dir $GLUE_DIR/MRPC/ \
-  --max_seq_length 128 \
-  --per_gpu_train_batch_size 32 \
-  --learning_rate 2e-5 \
-  --num_train_epochs 3.0 \
-  --output_dir /tmp/mrpc_output/ \
-  --fp16
-```
-
-#### Distributed training
-
-Here is an example using distributed training on 8 V100 GPUs. The model used is the BERT whole-word-masking and it
-reaches F1 > 92 on MRPC.
-
-```bash
-export GLUE_DIR=/path/to/glue
-
-python -m torch.distributed.launch \
-    --nproc_per_node 8 run_glue.py \
-    --model_name_or_path bert-base-cased \
-    --task_name MRPC \
-    --do_train \
-    --do_eval \
-    --data_dir $GLUE_DIR/MRPC/ \
-    --max_seq_length 128 \
-    --per_gpu_train_batch_size 8 \
-    --learning_rate 2e-5 \
-    --num_train_epochs 3.0 \
-    --output_dir /tmp/mrpc_output/
-```
-
-Training with these hyper-parameters gave us the following results:
-
-```bash
-acc = 0.8823529411764706
-acc_and_f1 = 0.901702786377709
-eval_loss = 0.3418912578906332
-f1 = 0.9210526315789473
-global_step = 174
-loss = 0.07231863956341798
-```
-
-### MNLI
-
-The following example uses the BERT-large, uncased, whole-word-masking model and fine-tunes it on the MNLI task.
-
-```bash
-export GLUE_DIR=/path/to/glue
-
-python -m torch.distributed.launch \
-    --nproc_per_node 8 run_glue.py \
-    --model_name_or_path bert-base-cased \
-    --task_name mnli \
-    --do_train \
-    --do_eval \
-    --data_dir $GLUE_DIR/MNLI/ \
-    --max_seq_length 128 \
-    --per_gpu_train_batch_size 8 \
-    --learning_rate 2e-5 \
-    --num_train_epochs 3.0 \
-    --output_dir output_dir \
-```
-
-The results  are the following:
-
-```bash
-***** Eval results *****
-  acc = 0.8679706601466992
-  eval_loss = 0.4911287787382479
-  global_step = 18408
-  loss = 0.04755385363816904
-
-***** Eval results *****
-  acc = 0.8747965825874695
-  eval_loss = 0.45516540421714036
-  global_step = 18408
-  loss = 0.04755385363816904
-```
-
-## Multiple Choice
-
-Based on the script [`run_multiple_choice.py`]().
-
-#### Fine-tuning on SWAG
-Download [swag](https://github.com/rowanz/swagaf/tree/master/data) data
-
-```bash
-#training on 4 tesla V100(16GB) GPUS
-export SWAG_DIR=/path/to/swag_data_dir
-python ./examples/run_multiple_choice.py \
--task_name swag \
--model_name_or_path roberta-base \
--do_train \
--do_eval \
--data_dir $SWAG_DIR \
--learning_rate 5e-5 \
--num_train_epochs 3 \
--max_seq_length 80 \
--output_dir models_bert/swag_base \
--per_gpu_eval_batch_size=16 \
--per_gpu_train_batch_size=16 \
--gradient_accumulation_steps 2 \
--overwrite_output
-```
-Training with the defined hyper-parameters yields the following results:
-```
-***** Eval results *****
-eval_acc = 0.8338998300509847
-eval_loss = 0.44457291918821606
-```
-
-## SQuAD
-
-Based on the script [`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/run_squad.py).
-
-#### Fine-tuning BERT on SQuAD1.0
-
-This example code fine-tunes BERT on the SQuAD1.0 dataset. It runs in 24 min (with BERT-base) or 68 min (with BERT-large)
-on a single tesla V100 16GB. The data for SQuAD can be downloaded with the following links and should be saved in a
-$SQUAD_DIR directory.
-
-* [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)
-* [dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)
-* [evaluate-v1.1.py](https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py)
-
-And for SQuAD2.0, you need to download:
-
- [train-v2.0.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json)
- [dev-v2.0.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json)
- [evaluate-v2.0.py](https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/)
-
-```bash
-export SQUAD_DIR=/path/to/SQUAD
-
-python run_squad.py \
-  --model_type bert \
-  --model_name_or_path bert-base-uncased \
-  --do_train \
-  --do_eval \
-  --train_file $SQUAD_DIR/train-v1.1.json \
-  --predict_file $SQUAD_DIR/dev-v1.1.json \
-  --per_gpu_train_batch_size 12 \
-  --learning_rate 3e-5 \
-  --num_train_epochs 2.0 \
-  --max_seq_length 384 \
-  --doc_stride 128 \
-  --output_dir /tmp/debug_squad/
-```
-
-Training with the previously defined hyper-parameters yields the following results:
-
-```bash
-f1 = 88.52
-exact_match = 81.22
-```
-
-#### Distributed training
-
-
-Here is an example using distributed training on 8 V100 GPUs and Bert Whole Word Masking uncased model to reach a F1 > 93 on SQuAD1.1:
-
-```bash
-python -m torch.distributed.launch --nproc_per_node=8 ./examples/run_squad.py \
-    --model_type bert \
-    --model_name_or_path bert-large-uncased-whole-word-masking \
-    --do_train \
-    --do_eval \
-    --train_file $SQUAD_DIR/train-v1.1.json \
-    --predict_file $SQUAD_DIR/dev-v1.1.json \
-    --learning_rate 3e-5 \
-    --num_train_epochs 2 \
-    --max_seq_length 384 \
-    --doc_stride 128 \
-    --output_dir ./examples/models/wwm_uncased_finetuned_squad/ \
-    --per_gpu_eval_batch_size=3   \
-    --per_gpu_train_batch_size=3   \
-```
-
-Training with the previously defined hyper-parameters yields the following results:
-
-```bash
-f1 = 93.15
-exact_match = 86.91
-```
-
-This fine-tuned model is available as a checkpoint under the reference
-`bert-large-uncased-whole-word-masking-finetuned-squad`.
-
-#### Fine-tuning XLNet on SQuAD
-
-This example code fine-tunes XLNet on both SQuAD1.0 and SQuAD2.0 dataset. See above to download the data for SQuAD .
-
-##### Command for SQuAD1.0:
-
-```bash
-export SQUAD_DIR=/path/to/SQUAD
-
-python run_squad.py \
-    --model_type xlnet \
-    --model_name_or_path xlnet-large-cased \
-    --do_train \
-    --do_eval \
-    --train_file $SQUAD_DIR/train-v1.1.json \
-    --predict_file $SQUAD_DIR/dev-v1.1.json \
-    --learning_rate 3e-5 \
-    --num_train_epochs 2 \
-    --max_seq_length 384 \
-    --doc_stride 128 \
-    --output_dir ./wwm_cased_finetuned_squad/ \
-    --per_gpu_eval_batch_size=4  \
-    --per_gpu_train_batch_size=4   \
-    --save_steps 5000
-```
-
-##### Command for SQuAD2.0:
-
-```bash
-export SQUAD_DIR=/path/to/SQUAD
-
-python run_squad.py \
-    --model_type xlnet \
-    --model_name_or_path xlnet-large-cased \
-    --do_train \
-    --do_eval \
-    --version_2_with_negative \
-    --train_file $SQUAD_DIR/train-v2.0.json \
-    --predict_file $SQUAD_DIR/dev-v2.0.json \
-    --learning_rate 3e-5 \
-    --num_train_epochs 4 \
-    --max_seq_length 384 \
-    --doc_stride 128 \
-    --output_dir ./wwm_cased_finetuned_squad/ \
-    --per_gpu_eval_batch_size=2  \
-    --per_gpu_train_batch_size=2   \
-    --save_steps 5000
-```
-
-Larger batch size may improve the performance while costing more memory.
-
-##### Results for SQuAD1.0 with the previously defined hyper-parameters:
-
-```python
-{
-"exact": 85.45884578997162,
-"f1": 92.5974600601065,
-"total": 10570,
-"HasAns_exact": 85.45884578997162,
-"HasAns_f1": 92.59746006010651,
-"HasAns_total": 10570
-}
-```
-
-##### Results for SQuAD2.0 with the previously defined hyper-parameters:
-
-```python
-{
-"exact": 80.4177545691906,
-"f1": 84.07154997729623,
-"total": 11873,
-"HasAns_exact": 76.73751686909581,
-"HasAns_f1": 84.05558584352873,
-"HasAns_total": 5928,
-"NoAns_exact": 84.0874684608915,
-"NoAns_f1": 84.0874684608915,
-"NoAns_total": 5945
-}
-```
-
-
-
-
-## XNLI
-
-Based on the script [`run_xnli.py`](https://github.com/huggingface/transformers/blob/master/examples/run_xnli.py).
-
-[XNLI](https://www.nyu.edu/projects/bowman/xnli/) is crowd-sourced dataset based on [MultiNLI](http://www.nyu.edu/projects/bowman/multinli/). It is an evaluation benchmark for cross-lingual text representations. Pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-resource language such as English and low-resource languages such as Swahili).
-
-#### Fine-tuning on XNLI
-
-This example code fine-tunes mBERT (multi-lingual BERT) on the XNLI dataset. It runs in 106 mins
-on a single tesla V100 16GB. The data for XNLI can be downloaded with the following links and should be both saved (and un-zipped) in a
-`$XNLI_DIR` directory.
-
-* [XNLI 1.0](https://www.nyu.edu/projects/bowman/xnli/XNLI-1.0.zip)
-* [XNLI-MT 1.0](https://www.nyu.edu/projects/bowman/xnli/XNLI-MT-1.0.zip)
-
-```bash
-export XNLI_DIR=/path/to/XNLI
-
-python run_xnli.py \
-  --model_type bert \
-  --model_name_or_path bert-base-multilingual-cased \
-  --language de \
-  --train_language en \
-  --do_train \
-  --do_eval \
-  --data_dir $XNLI_DIR \
-  --per_gpu_train_batch_size 32 \
-  --learning_rate 5e-5 \
-  --num_train_epochs 2.0 \
-  --max_seq_length 128 \
-  --output_dir /tmp/debug_xnli/ \
-  --save_steps -1
-```
-
-Training with the previously defined hyper-parameters yields the following results on the **test** set:
-
-```bash
-acc = 0.7093812375249501
-```
-
-## MM-IMDb
-
-Based on the script [`run_mmimdb.py`](https://github.com/huggingface/transformers/blob/master/examples/contrib/mm-imdb/run_mmimdb.py).
-
-[MM-IMDb](http://lisi1.unal.edu.co/mmimdb/) is a Multimodal dataset with around 26,000 movies including images, plots and other metadata.
-
-### Training on MM-IMDb
-
-```
-python run_mmimdb.py \
-    --data_dir /path/to/mmimdb/dataset/ \
-    --model_type bert \
-    --model_name_or_path bert-base-uncased \
-    --output_dir /path/to/save/dir/ \
-    --do_train \
-    --do_eval \
-    --max_seq_len 512 \
-    --gradient_accumulation_steps 20 \
-    --num_image_embeds 3 \
-    --num_train_epochs 100 \
-    --patience 5
-```
-
-## Adversarial evaluation of model performances
-
-Here is an example on evaluating a model using adversarial evaluation of natural language inference with the Heuristic Analysis for NLI Systems (HANS) dataset [McCoy et al., 2019](https://arxiv.org/abs/1902.01007). The example was gracefully provided by [Nafise Sadat Moosavi](https://github.com/ns-moosavi).
-
-The HANS dataset can be downloaded from [this location](https://github.com/tommccoy1/hans).
-
-This is an example of using test_hans.py:
-
-```bash
-export HANS_DIR=path-to-hans
-export MODEL_TYPE=type-of-the-model-e.g.-bert-roberta-xlnet-etc
-export MODEL_PATH=path-to-the-model-directory-that-is-trained-on-NLI-e.g.-by-using-run_glue.py
-
-python examples/hans/test_hans.py \
-        --task_name hans \
-        --model_type $MODEL_TYPE \
-        --do_eval \
-        --data_dir $HANS_DIR \
-        --model_name_or_path $MODEL_PATH \
-        --max_seq_length 128 \
-        --output_dir $MODEL_PATH \
-```
-
-This will create the hans_predictions.txt file in MODEL_PATH, which can then be evaluated using hans/evaluate_heur_output.py from the HANS dataset.
-
-The results of the BERT-base model that is trained on MNLI using batch size 8 and the random seed 42 on the HANS dataset is as follows:
-
-```bash
-Heuristic entailed results:
-lexical_overlap: 0.9702
-subsequence: 0.9942
-constituent: 0.9962
-
-Heuristic non-entailed results:
-lexical_overlap: 0.199
-subsequence: 0.0396
-constituent: 0.118
-```
--- a/docs/source/examples.md
+++ b/docs/source/examples.md
@@ -0,0 +1 @@
+../../examples/README.md
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -108,3 +108,5 @@ The library currently contains PyTorch and Tensorflow implementations, pre-train
    model_doc/electra
    model_doc/dialogpt
    model_doc/reformer
+    model_doc/marian
+    model_doc/longformer
--- a/docs/source/main_classes/processors.rst
+++ b/docs/source/main_classes/processors.rst
@@ -74,7 +74,7 @@ This library hosts the processor to load the XNLI data:
 Please note that since the gold labels are available on the test set, evaluation is performed on the test set.

 An example using these processors is given in the
-`run_xnli.py <https://github.com/huggingface/pytorch-transformers/blob/master/examples/run_xnli.py>`__ script.
+`run_xnli.py <https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_xnli.py>`__ script.


 SQuAD
@@ -150,4 +150,4 @@ Example::


 Another example using these processors is given in the
-`run_squad.py <https://github.com/huggingface/transformers/blob/master/examples/run_squad.py>`__ script.
+`run_squad.py <https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py>`__ script.
--- a/docs/source/migration.md
+++ b/docs/source/migration.md
@@ -1,5 +1,18 @@
-# Migrating from pytorch-pretrained-bert
+# Migrating from previous packages

+## Migrating from pytorch-transformers to transformers
+
+Here is a quick summary of what you should take care of when migrating from `pytorch-transformers` to `transformers`.
+
+### Positional order of some models' keyword inputs (`attention_mask`, `token_type_ids`...) changed
+
+To be able to use TorchScript (see #1010, #1204 and #1195), the specific order of some models' **keyword inputs** (`attention_mask`, `token_type_ids`...) has been changed.
+
+If you used to call the models with keyword names for keyword arguments, e.g. `model(inputs_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)`, this should not cause any change.
+
+If you used to call the models with positional inputs for keyword arguments, e.g. `model(inputs_ids, attention_mask, token_type_ids)`, you may have to double check the exact order of input arguments.
+
+## Migrating from pytorch-pretrained-bert

 Here is a quick summary of what you should take care of when migrating from `pytorch-pretrained-bert` to `transformers`

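Review note on the `migration.md` hunk above: the sketch below illustrates why keyword calls survive the reordering while positional calls may not. `forward_old` and `forward_new` are hypothetical stand-ins, not the real model signatures, and the parameter orders are assumptions chosen only for illustration.

```python
# Hypothetical stand-ins for a model's forward method before and after
# the keyword-argument reordering; the orders shown are illustrative only.
def forward_old(input_ids, token_type_ids=None, attention_mask=None):
    return {"token_type_ids": token_type_ids, "attention_mask": attention_mask}


def forward_new(input_ids, attention_mask=None, token_type_ids=None):
    return {"token_type_ids": token_type_ids, "attention_mask": attention_mask}


input_ids, mask, types = [101, 2023, 102], [1, 1, 1], [0, 0, 0]

# Keyword call: each value reaches the intended parameter in both versions.
kw_old = forward_old(input_ids, attention_mask=mask, token_type_ids=types)
kw_new = forward_new(input_ids, attention_mask=mask, token_type_ids=types)

# Positional call written for the old order: the two values are swapped
# silently once the parameter order changes.
pos_old = forward_old(input_ids, types, mask)
pos_new = forward_new(input_ids, types, mask)
```

With these stand-ins, `kw_old` and `kw_new` are equal, while `pos_new` receives the token type ids as its attention mask without raising any error, which is why positional callers should check the new argument order.
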
--- a/docs/source/model_doc/albert.rst
+++ b/docs/source/model_doc/albert.rst
@@ -6,7 +6,7 @@ Overview

 The ALBERT model was proposed in `ALBERT: A Lite BERT for Self-supervised Learning of Language Representations <https://arxiv.org/abs/1909.11942>`_
 by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut. It presents
-two parameter-reduction techniques to lower memory consumption and increase the trainig speed of BERT:
+two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT:

 - Splitting the embedding matrix into two smaller matrices
 - Using repeating layers split among groups
@@ -94,3 +94,17 @@ TFAlbertForSequenceClassification

 .. autoclass:: transformers.TFAlbertForSequenceClassification
    :members:
+
+
+TFAlbertForMultipleChoice
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.TFAlbertForMultipleChoice
+    :members:
+
+
+TFAlbertForQuestionAnswering
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.TFAlbertForQuestionAnswering
+    :members:
--- a/docs/source/model_doc/bart.rst
+++ b/docs/source/model_doc/bart.rst
@@ -1,6 +1,6 @@
 Bart
 ----------------------------------------------------
-**DISCLAIMER:** This model is still a work in progress, if you see something strange,
+**DISCLAIMER:** If you see something strange,
 file a `Github Issue <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ and assign
@sshleifer

@@ -22,7 +22,7 @@ Implementation Notes
 - The forward pass of ``BartModel`` will create decoder inputs (using the helper function ``transformers.modeling_bart._prepare_bart_decoder_inputs``)  if they are not passed. This is different than some other modeling APIs.
 - Model predictions are intended to be identical to the original implementation. This only works, however, if the string you pass to ``fairseq.encode`` starts with a space.
 - ``BartForConditionalGeneration.generate`` should be used for conditional generation tasks like summarization, see the example in that docstrings
- Models that load the ``"bart-large-cnn"`` weights will not have a ``mask_token_id``, or be able to perform mask filling tasks.
+- Models that load the ``"facebook/bart-large-cnn"`` weights will not have a ``mask_token_id``, or be able to perform mask filling tasks.



--- a/docs/source/model_doc/longformer.rst
+++ b/docs/source/model_doc/longformer.rst
@@ -0,0 +1,91 @@
+Longformer
+----------------------------------------------------
+**DISCLAIMER:** This model is still a work in progress. If you see something strange,
+file a `Github Issue <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`_
+
+Overview
+~~~~~~~~
+The Longformer model was presented in `Longformer: The Long-Document Transformer <https://arxiv.org/pdf/2004.05150.pdf>`_ by Iz Beltagy, Matthew E. Peters, Arman Cohan.
+Here is the abstract:
+
+*Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. Longformer's attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA.*
+
+The authors' code can be found `here <https://github.com/allenai/longformer>`_ .
+
+Longformer Self Attention
+~~~~~~~~~~~~~~~~~~~~~~~~~
+Longformer self attention employs self attention on both a "local" context and a "global" context.
+Most tokens only attend "locally" to each other, meaning that each token attends to its :math:`\frac{1}{2} w` previous tokens and :math:`\frac{1}{2} w` succeeding tokens, with :math:`w` being the window length as defined in `config.attention_window`. Note that `config.attention_window` can be of type ``list`` to define a different :math:`w` for each layer.
+A selected few tokens attend "globally" to all other tokens, as is conventionally done for all tokens in *e.g.* `BertSelfAttention`.
+
+Note that "locally" and "globally" attending tokens are projected by different query, key and value matrices.
+Also note that every "locally" attending token not only attends to tokens within its window :math:`w`, but also to all "globally" attending tokens so that global attention is *symmetric*.
+
+The user can define which tokens attend "locally" and which tokens attend "globally" by setting the tensor `global_attention_mask` appropriately at run-time. `Longformer` employs the following logic for `global_attention_mask`: `0` - the token attends "locally", `1` - the token attends "globally". For more information, please also refer to the :func:`~transformers.LongformerModel.forward` method.
+
+Using Longformer self attention, the memory and time complexity of the query-key matmul operation, which usually represents the memory and time bottleneck, can be reduced from :math:`\mathcal{O}(n_s \times n_s)` to :math:`\mathcal{O}(n_s \times w)`, with :math:`n_s` being the sequence length and :math:`w` being the average window size. It is assumed that the number of "globally" attending tokens is insignificant as compared to the number of "locally" attending tokens.
+
+For more information, please refer to the official `paper <https://arxiv.org/pdf/2004.05150.pdf>`_ .
+
+
+Training
+~~~~~~~~~~~~~~~~~~~~
+``LongformerForMaskedLM`` is trained in exactly the same way as ``RobertaForMaskedLM`` and
+should be used as follows:
+
+::
+
+  input_ids = tokenizer.encode('This is a sentence from [MASK] training data', return_tensors='pt')
+  mlm_labels = tokenizer.encode('This is a sentence from the training data', return_tensors='pt')
+
+  loss = model(input_ids, labels=input_ids, masked_lm_labels=mlm_labels)[0]
+
+
+LongformerConfig
+~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.LongformerConfig
+    :members:
+
+
+LongformerTokenizer
+~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.LongformerTokenizer
+    :members: 
+
+
+LongformerModel
+~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.LongformerModel
+    :members:
+
+
+LongformerForMaskedLM
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.LongformerForMaskedLM
+    :members:
+
+
+LongformerForQuestionAnswering
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.LongformerForQuestionAnswering
+    :members:
+
+
+LongformerForMultipleChoice
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.LongformerForMultipleChoice
+    :members:
+
+
+LongformerForTokenClassification
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.LongformerForTokenClassification
+    :members:
+
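Review note on the `longformer.rst` addition above: the following pure-Python sketch mirrors the described attention pattern so the complexity claim can be checked on a toy sequence. It is not the library's implementation. `attention_pattern` is a hypothetical helper, and it assumes that each token's window includes the token itself.

```python
def attention_pattern(seq_len, window, global_attention_mask):
    """Return, for each query position, the set of key positions it attends to.

    Illustrative only: a "local" token sees window // 2 neighbours on each
    side (plus itself) and every "global" token; a "global" token sees the
    whole sequence.
    """
    half = window // 2
    global_positions = {i for i, g in enumerate(global_attention_mask) if g == 1}
    pattern = []
    for i in range(seq_len):
        if i in global_positions:
            keys = set(range(seq_len))
        else:
            lo, hi = max(0, i - half), min(seq_len, i + half + 1)
            keys = set(range(lo, hi)) | global_positions
        pattern.append(keys)
    return pattern


seq_len, window = 16, 4
mask = [1] + [0] * (seq_len - 1)  # only the first token attends globally
pattern = attention_pattern(seq_len, window, mask)

# Number of query-key pairs actually computed versus full self-attention.
local_pairs = sum(len(keys) for keys in pattern)
full_pairs = seq_len * seq_len
```

In this toy setting every token attends to position 0 and position 0 attends to every token, which is the symmetry the text describes, and `local_pairs` grows roughly with `seq_len * window` rather than `seq_len ** 2`.
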
--- a/docs/source/model_doc/marian.rst
+++ b/docs/source/model_doc/marian.rst
@@ -0,0 +1,105 @@
+MarianMT
+----------------------------------------------------
+**DISCLAIMER:** If you see something strange,
+file a `Github Issue <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ and assign
+@sshleifer. Translations should be similar, but not identical to, output in the test set linked to in each model card.
+
+Implementation Notes
+~~~~~~~~~~~~~~~~~~~~
+- Each model is about 298 MB on disk; there are 1,000+ models.
+- The list of supported language pairs can be found `here <https://huggingface.co/Helsinki-NLP>`__.
+- The 1,000+ models were originally trained by `Jörg Tiedemann <https://researchportal.helsinki.fi/en/persons/j%C3%B6rg-tiedemann>`__ using the `Marian <https://marian-nmt.github.io/>`_ C++ library, which supports fast training and translation.
+- All models are transformer encoder-decoders with 6 layers in each component. Each model's performance is documented in a model card.
+- The 80 OPUS models that require BPE preprocessing are not supported.
+- The modeling code is the same as ``BartForConditionalGeneration`` with a few minor modifications:
+    - static (sinusoid) positional embeddings (``MarianConfig.static_position_embeddings=True``)
+    - a new final_logits_bias (``MarianConfig.add_bias_logits=True``)
+    - no layernorm_embedding (``MarianConfig.normalize_embedding=False``)
+    - the model starts generating with pad_token_id (which has 0 token_embedding) as the prefix. (Bart uses <s/>)
+- Code to bulk convert models can be found in ``convert_marian_to_pytorch.py``
+
+Naming
+~~~~~~
+- All model names use the following format: ``Helsinki-NLP/opus-mt-{src}-{tgt}``
+- The language codes used to name models are inconsistent. Two-letter codes can usually be found `here <https://developers.google.com/admin-sdk/directory/v1/languages>`_; three-letter codes require googling "language code {code}".
+- Codes formatted like ``es_AR`` are usually ``code_{region}``. That one denotes Spanish documents from Argentina.
+
+
+Multilingual Models
+~~~~~~~~~~~~~~~~~~~~
+
+All model names use the following format: ``Helsinki-NLP/opus-mt-{src}-{tgt}``:
+    - If ``src`` is in all caps, the model supports multiple input languages. You can figure out which ones by looking at the model card or the Group Members `mapping <https://gist.github.com/sshleifer/6d20e7761931b08e73c3219027b97b8a>`_ .
+    - If ``tgt`` is in all caps, the model can output multiple languages, and you should specify a language code by prepending the desired output language to the ``src_text``.
+    - You can see a tokenizer's supported language codes in ``tokenizer.supported_language_codes``.
+
+Example of translating English to many Romance languages, using language codes:
+
+.. code-block:: python
+
+    from transformers import MarianMTModel, MarianTokenizer
+    src_text = [
+        '>>fr<< this is a sentence in english that we want to translate to french',
+        '>>pt<< This should go to portuguese',
+        '>>es<< And this to Spanish'
+    ]
+
+    model_name = 'Helsinki-NLP/opus-mt-en-ROMANCE'
+    tokenizer = MarianTokenizer.from_pretrained(model_name)
+    print(tokenizer.supported_language_codes)
+    model = MarianMTModel.from_pretrained(model_name)
+    translated = model.generate(**tokenizer.prepare_translation_batch(src_text))
+    tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
+    # ["c'est une phrase en anglais que nous voulons traduire en français",
+    # 'Isto deve ir para o português.',
+    # 'Y esto al español']
+
+Sometimes, models were trained on collections of languages that do not resolve to a group. In this case, ``_`` is used as a separator for src or tgt, as in ``'Helsinki-NLP/opus-mt-en_el_es_fi-en_el_es_fi'``. These still require language codes.
+There are many supported regional language codes, like ``>>es_ES<<`` (Spain) and ``>>es_AR<<`` (Argentina), but they do not seem to produce different translations than plain ``>>es<<``.
+
+For example:
+    - ``Helsinki-NLP/opus-mt-NORTH_EU-NORTH_EU``: translates from all NORTH_EU languages (see `mapping <https://gist.github.com/sshleifer/6d20e7761931b08e73c3219027b97b8a>`_) to all NORTH_EU languages. Use a special language code like ``>>de<<`` to specify the output language.
+    - ``Helsinki-NLP/opus-mt-ROMANCE-en``: translates from many Romance languages to English. No codes are needed since there is only one target language.
+
+
+
+.. code-block:: python
+
+    GROUP_MEMBERS = {
+     'ZH': ['cmn', 'cn', 'yue', 'ze_zh', 'zh_cn', 'zh_CN', 'zh_HK', 'zh_tw', 'zh_TW', 'zh_yue', 'zhs', 'zht', 'zh'],
+     'ROMANCE': ['fr', 'fr_BE', 'fr_CA', 'fr_FR', 'wa', 'frp', 'oc', 'ca', 'rm', 'lld', 'fur', 'lij', 'lmo', 'es', 'es_AR', 'es_CL', 'es_CO', 'es_CR', 'es_DO', 'es_EC', 'es_ES', 'es_GT', 'es_HN', 'es_MX', 'es_NI', 'es_PA', 'es_PE', 'es_PR', 'es_SV', 'es_UY', 'es_VE', 'pt', 'pt_br', 'pt_BR', 'pt_PT', 'gl', 'lad', 'an', 'mwl', 'it', 'it_IT', 'co', 'nap', 'scn', 'vec', 'sc', 'ro', 'la'],
+     'NORTH_EU': ['de', 'nl', 'fy', 'af', 'da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
+     'SCANDINAVIA': ['da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
+     'SAMI': ['se', 'sma', 'smj', 'smn', 'sms'],
+     'NORWAY': ['nb_NO', 'nb', 'nn_NO', 'nn', 'nog', 'no_nb', 'no'],
+     'CELTIC': ['ga', 'cy', 'br', 'gd', 'kw', 'gv']
+    }
+
+Code to see available pretrained models:
+
+.. code-block:: python
+
+    from transformers.hf_api import HfApi
+    model_list = HfApi().model_list()
+    org = "Helsinki-NLP"
+    model_ids = [x.modelId for x in model_list if x.modelId.startswith(org)]
+    suffix = [x.split('/')[1] for x in model_ids]
+    multi_models = [f'{org}/{s}' for s in suffix if s != s.lower()]
+
+MarianMTModel
+~~~~~~~~~~~~~
+
+PyTorch version of marian-nmt's transformer.h (C++). Designed for the OPUS-NMT translation checkpoints.
+Model API is identical to ``BartForConditionalGeneration``.
+Available models are listed at `Model List <https://huggingface.co/models?search=Helsinki-NLP>`__.
+This class inherits all functionality from ``BartForConditionalGeneration``; see that page for method signatures.
+
+.. autoclass:: transformers.MarianMTModel
+    :members:
+
+
+MarianTokenizer
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.MarianTokenizer
+    :members: prepare_translation_batch
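Review note on the `marian.rst` addition above: the naming rules can be sketched as two small helpers. Both `is_multilingual` and `add_language_code` are hypothetical names and not part of the library. The upper-case test mirrors the `s != s.lower()` filter in the model-listing snippet, so it is a heuristic: underscore-joined collections such as `en_el_es_fi` would not be detected.

```python
def is_multilingual(model_name):
    """Heuristic: an upper-case character after the organisation prefix
    marks a language group such as ROMANCE or NORTH_EU."""
    suffix = model_name.split("/")[1]
    return suffix != suffix.lower()


def add_language_code(code, text):
    """Prepend the target-language token expected by multi-target models."""
    return f">>{code}<< {text}"


names = ["Helsinki-NLP/opus-mt-en-ROMANCE", "Helsinki-NLP/opus-mt-en-de"]
flags = [is_multilingual(name) for name in names]
prefixed = add_language_code("fr", "this is a sentence in english")
```

Only source texts for models whose target side is a group need the `>>code<<` prefix; for a single-target model the text can be passed unchanged.
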
--- a/docs/source/model_doc/reformer.rst
+++ b/docs/source/model_doc/reformer.rst
@@ -5,7 +5,7 @@ file a `Github Issue <https://github.com/huggingface/transformers/issues/new?ass

 Overview
 ~~~~~
-The Reformer model was presented in `Reformer: The Efficient Transformer <https://https://arxiv.org/abs/2001.04451.pdf>`_ by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
+The Reformer model was presented in `Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451.pdf>`_ by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
 Here the abstract: 

 *Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from O(L^2) to O(Llog(L)), where L is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of N times, where N is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.*
@@ -62,7 +62,7 @@ For more information, see the `original Paper <https://arxiv.org/abs/2001.04451>

 Note that ``config.num_buckets`` can also be factorized into a ``list``:math:`(n_{\text{buckets}}^1, n_{\text{buckets}}^2)`. This way instead of assigning the query key embedding vectors to one of :math:`(1,\ldots, n_{\text{buckets}})` they are assigned to one of :math:`(1-1,\ldots, n_{\text{buckets}}^1-1, \ldots, 1-n_{\text{buckets}}^2, \ldots, n_{\text{buckets}}^1-n_{\text{buckets}}^2)`. This is crucial for very long sequences to save memory.

-It is recommended to leave ``config.num_buckets=None``, so that depending on the sequence length, a good value for ``num_buckets`` are calculated on the fly.
+When training a model from scratch, it is recommended to leave ``config.num_buckets=None``, so that depending on the sequence length a good value for ``num_buckets`` is calculated on the fly. This value will then automatically be saved in the config and should be reused for inference.

 Using LSH self attention, the memory and time complexity of the query-key matmul operation can be reduced from :math:`\mathcal{O}(n_s \times n_s)` to :math:`\mathcal{O}(n_s \times \log(n_s))`, which usually represents the memory and time bottleneck in a transformer model, with :math:`n_s` being the sequence length.

--- a/docs/source/model_doc/roberta.rst
+++ b/docs/source/model_doc/roberta.rst
@@ -74,6 +74,13 @@ RobertaForSequenceClassification
    :members:


+RobertaForMultipleChoice
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.RobertaForMultipleChoice
+    :members:
+
+
 RobertaForTokenClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

--- a/docs/source/model_doc/xlnet.rst
+++ b/docs/source/model_doc/xlnet.rst
@@ -29,7 +29,7 @@ Tips:
  XLNet is pretrained using only a sub-set of the output tokens as target which are selected
  with the `target_mapping` input.
 - To use XLNet for sequential decoding (i.e. not in fully bi-directional setting), use the `perm_mask` and
-  `target_mapping` inputs to control the attention span and outputs (see examples in `examples/run_generation.py`)
+  `target_mapping` inputs to control the attention span and outputs (see examples in `examples/text-generation/run_generation.py`)
 - XLNet is one of the few models that has no sequence length limit.

 The original code can be found `here <https://github.com/zihangdai/xlnet/>`_.
--- a/docs/source/multilingual.rst
+++ b/docs/source/multilingual.rst
@@ -80,7 +80,7 @@ You can then feed it all as input to your model:
    outputs = model(input_ids, langs=langs)


-The example `run_generation.py <https://github.com/huggingface/transformers/blob/master/examples/run_generation.py>`__
+The example `run_generation.py <https://github.com/huggingface/transformers/blob/master/examples/text-generation/run_generation.py>`__
 can generate text using the CLM checkpoints from XLM, using the language embeddings.

 XLM without Language Embeddings
--- a/docs/source/pretrained_models.rst
+++ b/docs/source/pretrained_models.rst
@@ -63,33 +63,33 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
 |                   |                                                            | | Trained on uncased German text by DBMDZ                                                                                             |
 |                   |                                                            | (see `details on dbmdz repository <https://github.com/dbmdz/german-bert>`__).                                                         |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                   | ``bert-base-japanese``                                     | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
+|                   | ``cl-tohoku/bert-base-japanese``                           | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
 |                   |                                                            | | Trained on Japanese text. Text is tokenized with MeCab and WordPiece.                                                               |
 |                   |                                                            | | `MeCab <https://taku910.github.io/mecab/>`__ is required for tokenization.                                                          |
 |                   |                                                            | (see `details on cl-tohoku repository <https://github.com/cl-tohoku/bert-japanese>`__).                                               |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                   | ``bert-base-japanese-whole-word-masking``                  | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
+|                   | ``cl-tohoku/bert-base-japanese-whole-word-masking``        | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
 |                   |                                                            | | Trained on Japanese text using Whole-Word-Masking. Text is tokenized with MeCab and WordPiece.                                      |
 |                   |                                                            | | `MeCab <https://taku910.github.io/mecab/>`__ is required for tokenization.                                                          |
 |                   |                                                            | (see `details on cl-tohoku repository <https://github.com/cl-tohoku/bert-japanese>`__).                                               |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                   | ``bert-base-japanese-char``                                | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
+|                   | ``cl-tohoku/bert-base-japanese-char``                      | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
 |                   |                                                            | | Trained on Japanese text. Text is tokenized into characters.                                                                        |
 |                   |                                                            | (see `details on cl-tohoku repository <https://github.com/cl-tohoku/bert-japanese>`__).                                               |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                   | ``bert-base-japanese-char-whole-word-masking``             | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
+|                   | ``cl-tohoku/bert-base-japanese-char-whole-word-masking``   | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
 |                   |                                                            | | Trained on Japanese text using Whole-Word-Masking. Text is tokenized into characters.                                               |
 |                   |                                                            | (see `details on cl-tohoku repository <https://github.com/cl-tohoku/bert-japanese>`__).                                               |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                   | ``bert-base-finnish-cased-v1``                             | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
+|                   | ``TurkuNLP/bert-base-finnish-cased-v1``                    | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
 |                   |                                                            | | Trained on cased Finnish text.                                                                                                      |
 |                   |                                                            | (see `details on turkunlp.org <http://turkunlp.org/FinBERT/>`__).                                                                     |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                   | ``bert-base-finnish-uncased-v1``                           | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
+|                   | ``TurkuNLP/bert-base-finnish-uncased-v1``                  | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
 |                   |                                                            | | Trained on uncased Finnish text.                                                                                                    |
 |                   |                                                            | (see `details on turkunlp.org <http://turkunlp.org/FinBERT/>`__).                                                                     |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                   | ``bert-base-dutch-cased``                                  | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
+|                   | ``wietsedv/bert-base-dutch-cased``                         | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
 |                   |                                                            | | Trained on cased Dutch text.                                                                                                        |
 |                   |                                                            | (see `details on wietsedv repository <https://github.com/wietsedv/bertje/>`__).                                                       |
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
@@ -259,32 +259,32 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
 |                   | ``xlm-roberta-large``                                      | | ~355M parameters with 24-layers, 1027-hidden-state, 4096 feed-forward hidden-state, 16-heads,                                       |
 |                   |                                                            | | Trained on 2.5 TB of newly created clean CommonCrawl data in 100 languages                                                          |
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-| FlauBERT          | ``flaubert-small-cased``                                   | | 6-layer, 512-hidden, 8-heads, 54M parameters                                                                                        |
+| FlauBERT          | ``flaubert/flaubert_small_cased``                          | | 6-layer, 512-hidden, 8-heads, 54M parameters                                                                                        |
 |                   |                                                            | | FlauBERT small architecture                                                                                                         |
 |                   |                                                            | (see `details <https://github.com/getalp/Flaubert>`__)                                                                                |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                   | ``flaubert-base-uncased``                                  | | 12-layer, 768-hidden, 12-heads, 137M parameters                                                                                     |
+|                   | ``flaubert/flaubert_base_uncased``                         | | 12-layer, 768-hidden, 12-heads, 137M parameters                                                                                     |
 |                   |                                                            | | FlauBERT base architecture with uncased vocabulary                                                                                  |
 |                   |                                                            | (see `details <https://github.com/getalp/Flaubert>`__)                                                                                |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                   | ``flaubert-base-cased``                                    | | 12-layer, 768-hidden, 12-heads, 138M parameters                                                                                     |
+|                   | ``flaubert/flaubert_base_cased``                           | | 12-layer, 768-hidden, 12-heads, 138M parameters                                                                                     |
 |                   |                                                            | | FlauBERT base architecture with cased vocabulary                                                                                    |
 |                   |                                                            | (see `details <https://github.com/getalp/Flaubert>`__)                                                                                |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                   | ``flaubert-large-cased``                                   | | 24-layer, 1024-hidden, 16-heads, 373M parameters                                                                                    |
+|                   | ``flaubert/flaubert_large_cased``                          | | 24-layer, 1024-hidden, 16-heads, 373M parameters                                                                                    |
 |                   |                                                            | | FlauBERT large architecture                                                                                                         |
 |                   |                                                            | (see `details <https://github.com/getalp/Flaubert>`__)                                                                                |
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-| Bart              | ``bart-large``                                             | | 12-layer, 1024-hidden, 16-heads, 406M parameters                                                                                    |
+| Bart              | ``facebook/bart-large``                                    | | 24-layer, 1024-hidden, 16-heads, 406M parameters                                                                                    |
 |                   |                                                            | (see `details <https://github.com/pytorch/fairseq/tree/master/examples/bart>`_)                                                       |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                   | ``bart-large-mnli``                                        | | Adds a 2 layer classification head with 1 million parameters                                                                        |
+|                   | ``facebook/bart-large-mnli``                               | | Adds a 2 layer classification head with 1 million parameters                                                                        |
 |                   |                                                            | | bart-large base architecture with a classification head, finetuned on MNLI                                                          |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                   | ``bart-large-cnn``                                         | | 12-layer, 1024-hidden, 16-heads, 406M parameters       (same as base)                                                               |
+|                   | ``facebook/bart-large-cnn``                                | | 12-layer, 1024-hidden, 16-heads, 406M parameters       (same as base)                                                               |
 |                   |                                                            | | bart-large base architecture finetuned on cnn summarization task                                                                    |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                   | ``mbart-large-en-ro``                                      | | 12-layer, 1024-hidden, 16-heads, 880M parameters                                                                                    |
+|                   | ``facebook/mbart-large-en-ro``                             | | 12-layer, 1024-hidden, 16-heads, 880M parameters                                                                                    |
 |                   |                                                            | | bart-large architecture pretrained on cc25 multilingual data , finetuned on WMT english romanian translation.                       |
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
 | DialoGPT          | ``DialoGPT-small``                                         | | 12-layer, 768-hidden, 12-heads, 124M parameters                                                                                     |
@@ -296,6 +296,18 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
 |                   | ``DialoGPT-large``                                         | | 36-layer, 1280-hidden, 20-heads, 774M parameters                                                                                    |
 |                   |                                                            | | Trained on English text: 147M conversation-like exchanges extracted from Reddit.                                                    |
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-| Reformer          | ``reformer-crime-and-punishment``                          | | 6-layer, 256-hidden, 2-heads, 3M parameters                                                                                         |
-|                   |                                                            | | Trained on English text: Crime and Punishment novel by Fyodor Dostoyevsky                                                           |
+| Reformer          | ``reformer-enwik8``                                        | | 12-layer, 1024-hidden, 8-heads, 149M parameters                                                                                     |
+|                   |                                                            | | Trained on English Wikipedia data - enwik8.                                                                                         |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``reformer-crime-and-punishment``                          | | 6-layer, 256-hidden, 2-heads, 3M parameters                                                                                         |
+|                   |                                                            | | Trained on English text: Crime and Punishment novel by Fyodor Dostoyevsky.                                                          |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+| MarianMT          | ``Helsinki-NLP/opus-mt-{src}-{tgt}``                       | | 12-layer, 512-hidden, 8-heads, ~74M parameter Machine translation models. Parameter counts vary depending on vocab size.            |
+|                   |                                                            | | (see `model list <https://huggingface.co/Helsinki-NLP>`_)                                                                           |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+| Longformer        | ``allenai/longformer-base-4096``                           | | 12-layer, 768-hidden, 12-heads, ~149M parameters                                                                                    |
+|                   |                                                            | | Starting from RoBERTa-base checkpoint, trained on documents of max length 4,096                                                     |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``allenai/longformer-large-4096``                          | | 24-layer, 1024-hidden, 16-heads, ~435M parameters                                                                                   |
+|                   |                                                            | | Starting from RoBERTa-large checkpoint, trained on documents of max length 4,096                                                    |
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
--- a/docs/source/quickstart.md
+++ b/docs/source/quickstart.md
@@ -8,7 +8,7 @@ The library was designed with two strong goals in mind:

 - be as easy and fast to use as possible:

-  - we strongly limited the number of user-facing abstractions to learn, in fact there are almost no abstractions, just three standard classes required to use each model: configuration, models and tokenizer,
+  - we strongly limited the number of user-facing abstractions to learn, in fact, there are almost no abstractions, just three standard classes required to use each model: configuration, models and tokenizer,
  - all of these classes can be initialized in a simple and unified way from pretrained instances by using a common `from_pretrained()` instantiation method which will take care of downloading (if needed), caching and loading the related class from a pretrained instance supplied in the library or your own saved instance.
  - as a consequence, this library is NOT a modular toolbox of building blocks for neural nets. If you want to extend/build-upon the library, just use regular Python/PyTorch modules and inherit from the base classes of the library to reuse functionalities like model loading/saving.

@@ -31,27 +31,27 @@ A few other goals:

 ## Main concepts

-The library is build around three type of classes for each models:
+The library is built around three types of classes for each model:

-- **model classes** which are PyTorch models (`torch.nn.Modules`) of the 8 models architectures currently provided in the library, e.g. `BertModel`
-- **configuration classes** which store all the parameters required to build a model, e.g. `BertConfig`. You don't always need to instantiate these your-self, in particular if you are using a pretrained model without any modification, creating the model will automatically take care of instantiating the configuration (which is part of the model)
-- **tokenizer classes** which store the vocabulary for each model and provide methods for encoding/decoding strings in list of token embeddings indices to be fed to a model, e.g. `BertTokenizer`
+- **model classes**, e.g., `BertModel`, which are 20+ PyTorch models (`torch.nn.Module`) that work with the pretrained weights provided in the library. In TF2, these are `tf.keras.Model`.
+- **configuration classes** which store all the parameters required to build a model, e.g., `BertConfig`. You don't always need to instantiate these yourself. In particular, if you are using a pretrained model without any modification, creating the model will automatically take care of instantiating the configuration (which is part of the model)
+- **tokenizer classes** which store the vocabulary for each model and provide methods for encoding/decoding strings into a list of token embedding indices to be fed to a model, e.g., `BertTokenizer`

 All these classes can be instantiated from pretrained instances and saved locally using two methods:

 - `from_pretrained()` let you instantiate a model/configuration/tokenizer from a pretrained version either provided by the library itself (currently 27 models are provided as listed [here](https://huggingface.co/transformers/pretrained_models.html)) or stored locally (or on a server) by the user,
 - `save_pretrained()` let you save a model/configuration/tokenizer locally so that it can be reloaded using `from_pretrained()`.

-We'll finish this quickstart tour by going through a few simple quick-start examples to see how we can instantiate and use these classes. The rest of the documentation is organized in two parts:
+We'll finish this quickstart tour by going through a few simple quick-start examples to see how we can instantiate and use these classes. The rest of the documentation is organized into two parts:

 - the **MAIN CLASSES** section details the common functionalities/method/attributes of the three main type of classes (configuration, model, tokenizer) plus some optimization related classes provided as utilities for training,
-- the **PACKAGE REFERENCE** section details all the variants of each class for each model architectures and in particular the input/output that you should expect when calling each of them.
+- the **PACKAGE REFERENCE** section details all the variants of each class for each model architecture and, in particular, the input/output that you should expect when calling each of them.

 ## Quick tour: Usage

 Here are two examples showcasing a few `Bert` and `GPT2` classes and pre-trained models.

-See full API reference for examples for each model class.
+See the full API reference for examples of each model class.

 ### BERT example

@@ -191,7 +191,7 @@ Examples for each model class of each model architecture (Bert, GPT, GPT-2, Tran

 #### Using the past

-GPT-2 as well as some other models (GPT, XLNet, Transfo-XL, CTRL) make use of a `past` or `mems` attribute which can be used to prevent re-computing the key/value pairs when using sequential decoding. It is useful when generating sequences as a big part of the attention mechanism benefits from previous computations.
+GPT-2, as well as some other models (GPT, XLNet, Transfo-XL, CTRL), make use of a `past` or `mems` attribute which can be used to prevent re-computing the key/value pairs when using sequential decoding. It is useful when generating sequences as a big part of the attention mechanism benefits from previous computations.

 Here is a fully-working example using the `past` with `GPT2LMHeadModel` and argmax decoding (which should only be used as an example, as argmax decoding introduces a lot of repetition):

--- a/docs/source/usage.rst
+++ b/docs/source/usage.rst
@@ -45,7 +45,7 @@ Sequence classification is the task of classifying sequences according to a give
 of sequence classification is the GLUE dataset, which is entirely based on that task. If you would like to fine-tune
 a model on a GLUE sequence classification task, you may leverage the
 `run_glue.py <https://github.com/huggingface/transformers/tree/master/examples/text-classification/run_glue.py>`_ or
-`run_tf_glue.py <https://github.com/huggingface/transformers/tree/master/examples/run_tf_glue.py>`_ scripts.
+`run_tf_glue.py <https://github.com/huggingface/transformers/tree/master/examples/text-classification/run_tf_glue.py>`_ scripts.

 Here is an example using the pipelines do to sentiment analysis: identifying if a sequence is positive or negative.
 It leverages a fine-tuned model on sst2, which is a GLUE task.
@@ -404,48 +404,150 @@ Causal language modeling is the task of predicting the token following a sequenc
 model only attends to the left context (tokens on the left of the mask). Such a training is particularly interesting
 for generation tasks.

-There is currently no pipeline to do causal language modeling/generation.
+Usually, the next token is predicted by sampling from the logits of the last hidden state the model produces from the input sequence.

-Here is an example using the tokenizer and model. leveraging the :func:`~transformers.PreTrainedModel.generate` method
-to generate the tokens following the initial sequence in PyTorch, and creating a simple loop in TensorFlow.
+Here is an example using the tokenizer and model and leveraging the :func:`~transformers.top_k_top_p_filtering` function to sample the next token following an input sequence of tokens.
+
+::
+
+    ## PYTORCH CODE
+    from transformers import AutoModelWithLMHead, AutoTokenizer, top_k_top_p_filtering
+    import torch
+    from torch.nn import functional as F
+
+
+    tokenizer = AutoTokenizer.from_pretrained("gpt2")
+    model = AutoModelWithLMHead.from_pretrained("gpt2")
+
+    sequence = "Hugging Face is based in DUMBO, New York City, and "
+
+    input_ids = tokenizer.encode(sequence, return_tensors="pt")
+
+    # get logits of last hidden state
+    next_token_logits = model(input_ids)[0][:, -1, :]
+
+    # filter
+    filtered_next_token_logits = top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)
+
+    # sample
+    probs = F.softmax(filtered_next_token_logits, dim=-1)
+    next_token = torch.multinomial(probs, num_samples=1)
+
+    generated = torch.cat([input_ids, next_token], dim=-1)
+
+    resulting_string = tokenizer.decode(generated.tolist()[0])
+    print(resulting_string)
+    ## TENSORFLOW CODE
+    from transformers import TFAutoModelWithLMHead, AutoTokenizer, tf_top_k_top_p_filtering
+    import tensorflow as tf
+
+    tokenizer = AutoTokenizer.from_pretrained("gpt2")
+    model = TFAutoModelWithLMHead.from_pretrained("gpt2")
+
+    sequence = "Hugging Face is based in DUMBO, New York City, and "
+
+    input_ids = tokenizer.encode(sequence, return_tensors="tf")
+
+    # get logits of last hidden state
+    next_token_logits = model(input_ids)[0][:, -1, :]
+
+    # filter
+    filtered_next_token_logits = tf_top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)
+
+    # sample
+    next_token = tf.random.categorical(filtered_next_token_logits, dtype=tf.int32, num_samples=1)
+
+    generated = tf.concat([input_ids, next_token], axis=1)
+
+    resulting_string = tokenizer.decode(generated.numpy().tolist()[0])
+    print(resulting_string)
+
+
+This outputs a (hopefully) coherent next token following the original sequence, which in our case is the word *has*:
+
+::
+
+    Hugging Face is based in DUMBO, New York City, and has
+
+In the next section, we show how this functionality is leveraged in :func:`~transformers.PreTrainedModel.generate` to generate multiple tokens up to a user-defined length.
+
+Text Generation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In text generation (*a.k.a* *open-ended text generation*) the goal is to create a coherent portion of text that is a continuation of the given context. As an example, the following shows how *GPT-2* can be used in pipelines to generate text. By default, all models apply *Top-K* sampling when used in pipelines, as configured in their respective configurations (see `gpt-2 config <https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config.json>`_ for example).
+
+::
+
+    from transformers import pipeline
+
+    text_generator = pipeline("text-generation")
+    print(text_generator("As far as I am concerned, I will", max_length=50))
+
+
+Here the model generates a random text with a total maximal length of *50* tokens from context *"As far as I am concerned, I will"*.
+The default arguments of ``PreTrainedModel.generate()`` can be overridden directly in the pipeline, as shown above for the argument ``max_length``.
+
+Here is an example of text generation using XLNet and its tokenizer.

 ::

    ## PYTORCH CODE
    from transformers import AutoModelWithLMHead, AutoTokenizer

-    tokenizer = AutoTokenizer.from_pretrained("gpt2")
-    model = AutoModelWithLMHead.from_pretrained("gpt2")
+    model = AutoModelWithLMHead.from_pretrained("xlnet-base-cased")
+    tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")

-    sequence = f"Hugging Face is based in DUMBO, New York City, and is"
+    # Padding text helps XLNet with short prompts - proposed by Aman Rusia in https://github.com/rusiaaman/XLNet-gen#methodology
+    PADDING_TEXT = """In 1991, the remains of Russian Tsar Nicholas II and his family
+    (except for Alexei and Maria) are discovered.
+    The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the
+    remainder of the story. 1883 Western Siberia,
+    a young Grigori Rasputin is asked by his father and a group of men to perform magic.
+    Rasputin has a vision and denounces one of the men as a horse thief. Although his
+    father initially slaps him for making such an accusation, Rasputin watches as the
+    man is chased outside and beaten. Twenty years later, Rasputin sees a vision of
+    the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,
+    with people, even a bishop, begging for his blessing. <eod> </s> <eos>""" 

-    input = tokenizer.encode(sequence, return_tensors="pt")
-    generated = model.generate(input, max_length=50, do_sample=True)
+    prompt = "Today the weather is really nice and I am planning on "
+    inputs = tokenizer.encode(PADDING_TEXT + prompt, add_special_tokens=False, return_tensors="pt")
+    
+    prompt_length = len(tokenizer.decode(inputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True))
+    outputs = model.generate(inputs, max_length=250, do_sample=True, top_p=0.95, top_k=60)
+    generated = prompt + tokenizer.decode(outputs[0])[prompt_length:]

-    resulting_string = tokenizer.decode(generated.tolist()[0])
-    print(resulting_string)
+    print(generated)
    ## TENSORFLOW CODE
    from transformers import TFAutoModelWithLMHead, AutoTokenizer
-    import tensorflow as tf

-    tokenizer = AutoTokenizer.from_pretrained("gpt2")
-    model = TFAutoModelWithLMHead.from_pretrained("gpt2")
+    model = TFAutoModelWithLMHead.from_pretrained("xlnet-base-cased")
+    tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")

-    sequence = f"Hugging Face is based in DUMBO, New York City, and is"
-    input = tokenizer.encode(sequence, return_tensors="tf")
-    generated = model.generate(input, max_length=50, do_sample=True)
+    # Padding text helps XLNet with short prompts - proposed by Aman Rusia in https://github.com/rusiaaman/XLNet-gen#methodology
+    PADDING_TEXT = """In 1991, the remains of Russian Tsar Nicholas II and his family
+    (except for Alexei and Maria) are discovered.
+    The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the
+    remainder of the story. 1883 Western Siberia,
+    a young Grigori Rasputin is asked by his father and a group of men to perform magic.
+    Rasputin has a vision and denounces one of the men as a horse thief. Although his
+    father initially slaps him for making such an accusation, Rasputin watches as the
+    man is chased outside and beaten. Twenty years later, Rasputin sees a vision of
+    the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,
+    with people, even a bishop, begging for his blessing. <eod> </s> <eos>""" 

-    resulting_string = tokenizer.decode(generated.tolist()[0])
-    print(resulting_string)
+    prompt = "Today the weather is really nice and I am planning on "
+    inputs = tokenizer.encode(PADDING_TEXT + prompt, add_special_tokens=False, return_tensors="tf")

+    prompt_length = len(tokenizer.decode(inputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True))
+    outputs = model.generate(inputs, max_length=250, do_sample=True, top_p=0.95, top_k=60)
+    generated = prompt + tokenizer.decode(outputs[0])[prompt_length:]

-This outputs a (hopefully) coherent string from the original sequence, as the
-:func:`~transformers.PreTrainedModel.generate` samples from a top_p/tok_k distribution:
+    print(generated)

-::
+Text generation is currently possible with *GPT-2*, *OpenAI-GPT*, *CTRL*, *XLNet*, *Transfo-XL* and *Reformer* in PyTorch and for most models in TensorFlow as well. As can be seen in the example above, *XLNet* and *Transfo-XL* often need to be padded to work well.
+GPT-2 is usually a good choice for *open-ended text generation* because it was trained on millions of webpages with a causal language modeling objective.

-    Hugging Face is based in DUMBO, New York City, and is a live-action TV series based on the novel by John
-    Carpenter, and its producers, David Kustlin and Steve Pichar. The film is directed by!
+For more information on how to apply different decoding strategies for text generation, please also refer to our generation blog post `here <https://huggingface.co/blog/how-to-generate>`_.


 Named Entity Recognition
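The filter-then-sample step used in the next-token example of the usage docs above can be sketched without any deep learning framework. This is a minimal sketch, assuming a single unbatched list of logits; `top_k_top_p_filter` and `sample_index` are hypothetical helper names for illustration and are not the library's `top_k_top_p_filtering` implementation, which works on batched tensors.

```python
import math
import random


def top_k_top_p_filter(logits, top_k=0, top_p=1.0):
    """Return a copy of `logits` where filtered-out entries are set to -inf."""
    filtered = list(logits)

    if top_k > 0:
        k = min(top_k, len(filtered))
        # The k-th largest logit is the cut-off; ties at the cut-off are all kept.
        kth_largest = sorted(filtered, reverse=True)[k - 1]
        filtered = [x if x >= kth_largest else -math.inf for x in filtered]

    if top_p < 1.0:
        # Softmax over the remaining logits (math.exp(-inf) evaluates to 0.0).
        highest = max(filtered)
        exps = [math.exp(x - highest) for x in filtered]
        total = sum(exps)
        probs = [e / total for e in exps]

        # Keep the smallest set of most likely tokens whose mass reaches top_p.
        order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        keep, cumulative = set(), 0.0
        for index in order:
            keep.add(index)
            cumulative += probs[index]
            if cumulative >= top_p:
                break
        filtered = [x if i in keep else -math.inf for i, x in enumerate(filtered)]

    return filtered


def sample_index(logits, rng=random):
    """Draw one index with probability proportional to softmax(logits)."""
    highest = max(logits)
    weights = [math.exp(x - highest) for x in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


example_logits = [2.0, 1.0, 0.5, -1.0]
top2 = top_k_top_p_filter(example_logits, top_k=2)
nucleus = top_k_top_p_filter(example_logits, top_p=0.8)
greedy_like = sample_index(top_k_top_p_filter(example_logits, top_k=1))
```

With `top_k=1` only one candidate survives, so sampling reduces to greedy decoding; larger `top_k` or `top_p` values leave more candidates for the random draw.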
--- a/examples/README.md
+++ b/examples/README.md
@@ -1,10 +1,41 @@
-# Examples
+## Examples

-In this section a few examples are put together. All of these examples work for several models, making use of the very
-similar API between the different models.
+Version 2.9 of `transformers` introduces a new [`Trainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) class for PyTorch, and its equivalent [`TFTrainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer_tf.py) for TF 2.
+Running the examples requires PyTorch 1.3.1+ or TensorFlow 2.0+.
+
+Here is the list of all our examples:
+- **grouped by task** (all official examples work for multiple models)
+- with information on whether they are **built on top of `Trainer`/`TFTrainer`** (if not, they still work, they might just lack some features),
+- whether they also include examples for **`pytorch-lightning`**, which is a great fully-featured, general-purpose training library for PyTorch,
+- links to **Colab notebooks** to walk through the scripts and run them easily,
+- links to **Cloud deployments** to be able to deploy large-scale trainings in the Cloud with little to no setup.
+
+This is still a work-in-progress – in particular documentation is still sparse – so please **contribute improvements/pull requests.**
+
+
+# The Big Table of Tasks
+
+| Task | Example datasets | Trainer support | TFTrainer support | pytorch-lightning | Colab
+|---|---|:---:|:---:|:---:|:---:|
+| [**`language-modeling`**](https://github.com/huggingface/transformers/tree/master/examples/language-modeling)       | Raw text        | ✅ | -  | -  | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)
+| [**`text-classification`**](https://github.com/huggingface/transformers/tree/master/examples/text-classification)   | GLUE, XNLI      | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/trainer/01_text_classification.ipynb)
+| [**`token-classification`**](https://github.com/huggingface/transformers/tree/master/examples/token-classification) | CoNLL NER       | ✅ | ✅ | ✅ | -
+| [**`multiple-choice`**](https://github.com/huggingface/transformers/tree/master/examples/multiple-choice)           | SWAG, RACE, ARC | ✅ | ✅ | -  | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ViktorAlm/notebooks/blob/master/MPC_GPU_Demo_for_TF_and_PT.ipynb)
+| [**`question-answering`**](https://github.com/huggingface/transformers/tree/master/examples/question-answering)     | SQuAD           | -  | ✅ | -  | -
+| [**`text-generation`**](https://github.com/huggingface/transformers/tree/master/examples/text-generation)     | -           | -  | - | -  | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)
+| [**`distillation`**](https://github.com/huggingface/transformers/tree/master/examples/distillation)       | All               | -  | -  | -  | -
+| [**`summarization`**](https://github.com/huggingface/transformers/tree/master/examples/summarization)     | CNN/Daily Mail    | -  | -  | -  | -
+| [**`translation`**](https://github.com/huggingface/transformers/tree/master/examples/translation)         | WMT               | -  | -  | -  | -
+| [**`bertology`**](https://github.com/huggingface/transformers/tree/master/examples/bertology)             | -                 | -  | -  | -  | -
+| [**`adversarial`**](https://github.com/huggingface/transformers/tree/master/examples/adversarial)         | HANS              | -  | -  | -  | -
+
+
+<br>
+
+## Important note

 **Important**
-To run the latest versions of the examples, you have to install from source and install some specific requirements for the examples.
+To make sure you can successfully run the latest versions of the example scripts, you have to install the library from source and install some example-specific requirements.
 Execute the following steps in a new virtual environment:

 ```bash
@@ -14,16 +45,36 @@ pip install .
 pip install -r ./examples/requirements.txt
 ```

-| Section                    | Description                                                                                                                                                |
-|----------------------------|-----------------------------------------------------
-| [TensorFlow 2.0 models on GLUE](#TensorFlow-2.0-Bert-models-on-GLUE) | Examples running BERT TensorFlow 2.0 model on the GLUE tasks. |
-| [Running on TPUs](#running-on-tpus) | Examples on running fine-tuning tasks on Google TPUs to accelerate workloads. |
-| [Language Model training](#language-model-training) | Fine-tuning (or training from scratch) the library models for language modeling on a text dataset. Causal language modeling for GPT/GPT-2, masked language modeling for BERT/RoBERTa. |
-| [Language Generation](#language-generation) | Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL and XLNet. |
-| [GLUE](#glue) | Examples running BERT/XLM/XLNet/RoBERTa on the 9 GLUE tasks. Examples feature distributed training as well as half-precision. |
-| [SQuAD](#squad) | Using BERT/RoBERTa/XLNet/XLM for question answering, examples with distributed training. |
-| [Multiple Choice](#multiple-choice) | Examples running BERT/XLNet/RoBERTa on the SWAG/RACE/ARC tasks. |
-| [Named Entity Recognition](https://github.com/huggingface/transformers/tree/master/examples/ner) | Using BERT for Named Entity Recognition (NER) on the CoNLL 2003 dataset, examples with distributed training. |
-| [XNLI](#xnli) | Examples running BERT/XLM on the XNLI benchmark. |
-| [Adversarial evaluation of model performances](#adversarial-evaluation-of-model-performances) | Testing a model with adversarial evaluation of natural language inference on the Heuristic Analysis for NLI Systems (HANS) dataset (McCoy et al., 2019.) |
+## One-click Deploy to Cloud (wip)

+#### Azure
+
+[![Deploy to Azure](https://aka.ms/deploytoazurebutton)](https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2FAzure%2Fazure-quickstart-templates%2Fmaster%2F101-storage-account-create%2Fazuredeploy.json)
+
+## Running on TPUs
+
+When using TensorFlow, TPUs are supported out of the box as a `tf.distribute.Strategy`.
+
+When using PyTorch, we support TPUs thanks to `pytorch/xla`. For more context and information on how to set up your TPU environment, refer to Google's documentation and to the
+very detailed [pytorch/xla README](https://github.com/pytorch/xla/blob/master/README.md).
+
+In this repo, we provide a very simple launcher script named [xla_spawn.py](https://github.com/huggingface/transformers/tree/master/examples/xla_spawn.py) that lets you run our example scripts on multiple TPU cores without any boilerplate.
+Just pass a `--num_cores` flag to this script, then your regular training script with its arguments (this is similar to the `torch.distributed.launch` helper for torch.distributed).
+
+For example for `run_glue`:
+
+```bash
+python examples/xla_spawn.py --num_cores 8 \
+	examples/text-classification/run_glue.py \
+	--model_name_or_path bert-base-cased \
+	--task_name mnli \
+	--data_dir ./data/glue_data/MNLI \
+	--output_dir ./models/tpu \
+	--overwrite_output_dir \
+	--do_train \
+	--do_eval \
+	--num_train_epochs 1 \
+	--save_steps 20000
+```
+
+Feedback, additional use cases, and benchmarks involving TPUs are welcome; please share them with the community.
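The calling convention described in the README's "Running on TPUs" section (a `--num_cores` flag, then the training script, then that script's own arguments) can be sketched with `argparse`. This is a minimal sketch of the argument split only, assuming the same pattern as `torch.distributed.launch`; `parse_launcher_args` is a hypothetical helper and does not reproduce the contents of `xla_spawn.py` or spawn any TPU process.

```python
from argparse import REMAINDER, ArgumentParser


def parse_launcher_args(argv):
    """Split launcher flags from the training script and its arguments."""
    parser = ArgumentParser(description="Hypothetical TPU launcher argument split")
    parser.add_argument("--num_cores", type=int, default=1, help="Number of TPU cores to use")
    parser.add_argument("training_script", type=str, help="Path to the training script to launch")
    # REMAINDER collects everything after the script path, including option-like
    # tokens, so they can be forwarded to the training script untouched.
    parser.add_argument("training_script_args", nargs=REMAINDER)
    return parser.parse_args(argv)


launcher_args = parse_launcher_args(
    [
        "--num_cores", "8",
        "examples/text-classification/run_glue.py",
        "--task_name", "mnli",
        "--do_train",
    ]
)
```

Everything before the script path belongs to the launcher and everything after it is passed through unchanged, which is why the launcher flags must come first on the command line.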
--- a/examples/adversarial/test_hans.py
+++ b/examples/adversarial/test_hans.py
@@ -65,13 +65,6 @@ except ImportError:

 logger = logging.getLogger(__name__)

-ALL_MODELS = sum(
-    (
-        tuple(conf.pretrained_config_archive_map.keys())
-        for conf in (BertConfig, XLNetConfig, XLMConfig, RobertaConfig, DistilBertConfig)
-    ),
-    (),
-)

 MODEL_CLASSES = {
    "bert": (BertConfig, BertForSequenceClassification, BertTokenizer),
@@ -389,7 +382,7 @@ def main():
        default=None,
        type=str,
        required=True,
-        help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS),
+        help="Path to pretrained model or model identifier from huggingface.co/models",
    )
    parser.add_argument(
        "--task_name",
--- a/examples/benchmarking/plot_csv_file.py
+++ b/examples/benchmarking/plot_csv_file.py
@@ -0,0 +1,113 @@
+import csv
+from collections import defaultdict
+from dataclasses import dataclass, field
+from typing import Optional
+
+import numpy as np
+
+import matplotlib.pyplot as plt
+from transformers import HfArgumentParser
+
+
+@dataclass
+class PlotArguments:
+    """
+    Arguments pertaining to which csv file is plotted and how the plot is laid out.
+    """
+
+    csv_file: str = field(metadata={"help": "The csv file to plot."},)
+    plot_along_batch: bool = field(
+        default=False,
+        metadata={"help": "Whether to plot along batch size or sequence length. Defaults to sequence length."},
+    )
+    is_time: bool = field(
+        default=False,
+        metadata={"help": "Whether the csv file has time results or memory results. Defaults to memory results."},
+    )
+    is_train: bool = field(
+        default=False,
+        metadata={
+            "help": "Whether the csv file has training results or inference results. Defaults to inference results."
+        },
+    )
+    figure_png_file: Optional[str] = field(
+        default=None, metadata={"help": "Filename under which the plot will be saved. If unused no plot is saved."},
+    )
+
+
+class Plot:
+    def __init__(self, args):
+        self.args = args
+        self.result_dict = defaultdict(lambda: dict(bsz=[], seq_len=[], result={}))
+
+        with open(self.args.csv_file, newline="") as csv_file:
+            reader = csv.DictReader(csv_file)
+            for row in reader:
+                model_name = row["model"]
+                self.result_dict[model_name]["bsz"].append(int(row["batch_size"]))
+                self.result_dict[model_name]["seq_len"].append(int(row["sequence_length"]))
+                self.result_dict[model_name]["result"][(int(row["batch_size"]), int(row["sequence_length"]))] = row[
+                    "result"
+                ]
+
+    def plot(self):
+        fig, ax = plt.subplots()
+        title_str = "Time usage" if self.args.is_time else "Memory usage"
+        title_str = title_str + " for training" if self.args.is_train else title_str + " for inference"
+
+        for model_name in self.result_dict.keys():
+            batch_sizes = sorted(list(set(self.result_dict[model_name]["bsz"])))
+            sequence_lengths = sorted(list(set(self.result_dict[model_name]["seq_len"])))
+            results = self.result_dict[model_name]["result"]
+
+            (x_axis_array, inner_loop_array) = (
+                (batch_sizes, sequence_lengths) if self.args.plot_along_batch else (sequence_lengths, batch_sizes)
+            )
+
+            plt.xlim(min(x_axis_array), max(x_axis_array))
+
+            for inner_loop_value in inner_loop_array:
+                if self.args.plot_along_batch:
+                    y_axis_array = np.asarray([results[(x, inner_loop_value)] for x in x_axis_array], dtype=int)
+                else:
+                    y_axis_array = np.asarray([results[(inner_loop_value, x)] for x in x_axis_array], dtype=np.float32)
+
+                ax.set_xscale("log", basex=2)
+                ax.set_yscale("log", basey=10)
+
+                (x_axis_label, inner_loop_label) = (
+                    ("batch_size", "sequence_length in #tokens")
+                    if self.args.plot_along_batch
+                    else ("sequence_length in #tokens", "batch_size")
+                )
+
+                x_axis_array = np.asarray(x_axis_array, int)
+                plt.scatter(x_axis_array, y_axis_array, label=f"{model_name} - {inner_loop_label}: {inner_loop_value}")
+                plt.plot(x_axis_array, y_axis_array, "--")
+
+            title_str += f" {model_name} vs."
+
+        title_str = title_str[:-4]
+        y_axis_label = "Time in s" if self.args.is_time else "Memory in MB"
+
+        # plot
+        plt.title(title_str)
+        plt.xlabel(x_axis_label)
+        plt.ylabel(y_axis_label)
+        plt.legend()
+
+        if self.args.figure_png_file is not None:
+            plt.savefig(self.args.figure_png_file)
+        else:
+            plt.show()
+
+
+def main():
+    parser = HfArgumentParser(PlotArguments)
+    plot_args = parser.parse_args_into_dataclasses()[0]
+    plot = Plot(args=plot_args)
+    plot.plot()
+
+
+if __name__ == "__main__":
+    main()
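The `Plot` class in `plot_csv_file.py` reads the columns `model`, `batch_size`, `sequence_length` and `result` from the csv file. The sketch below shows that layout and the nested dictionary built from it. It is a hedged illustration: `SAMPLE_CSV` contains made-up placeholder numbers, `load_results` is a hypothetical helper, and converting `result` to `float` while loading is an assumption of this sketch (the script stores the raw string and converts it when plotting).

```python
import csv
import io
from collections import defaultdict

# Made-up placeholder rows, only meant to show the expected column layout.
SAMPLE_CSV = """model,batch_size,sequence_length,result
bert-base-cased,8,128,1330
bert-base-cased,8,512,1770
gpt2,8,128,1410
"""


def load_results(csv_file):
    """Group benchmark rows by model, keyed by (batch_size, sequence_length)."""
    results = defaultdict(lambda: {"bsz": [], "seq_len": [], "result": {}})
    for row in csv.DictReader(csv_file):
        batch_size = int(row["batch_size"])
        sequence_length = int(row["sequence_length"])
        entry = results[row["model"]]
        entry["bsz"].append(batch_size)
        entry["seq_len"].append(sequence_length)
        # Assumption of this sketch: parse the measurement as a float up front.
        entry["result"][(batch_size, sequence_length)] = float(row["result"])
    return results


results = load_results(io.StringIO(SAMPLE_CSV))
bert_sequence_lengths = sorted(set(results["bert-base-cased"]["seq_len"]))
```

Keying the measurements by the `(batch_size, sequence_length)` pair is what lets the plot pick either dimension as the x-axis and loop over the other one.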
--- a/examples/benchmarking/run_benchmark.py
+++ b/examples/benchmarking/run_benchmark.py
@@ -0,0 +1,29 @@
+# coding=utf-8
+# Copyright 2018 The HuggingFace Inc. team.
+# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" Benchmarking the library on inference and training """
+
+from transformers import HfArgumentParser, PyTorchBenchmark, PyTorchBenchmarkArguments
+
+
+def main():
+    parser = HfArgumentParser(PyTorchBenchmarkArguments)
+    benchmark_args = parser.parse_args_into_dataclasses()[0]
+    benchmark = PyTorchBenchmark(args=benchmark_args)
+    benchmark.run()
+
+
+if __name__ == "__main__":
+    main()
--- a/examples/benchmarks.py
+++ b/examples/benchmarks.py
@@ -1,710 +0,0 @@
-# coding=utf-8
-# Copyright 2018 The HuggingFace Inc. team.
-# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-""" Benchmarking the library on inference and training """
-
-# If checking the tensors placement
-# tf.debugging.set_log_device_placement(True)
-
-import argparse
-import csv
-import logging
-import timeit
-from time import time
-from typing import Callable, List
-
-from transformers import (
-    AutoConfig,
-    AutoTokenizer,
-    MemorySummary,
-    is_tf_available,
-    is_torch_available,
-    start_memory_tracing,
-    stop_memory_tracing,
-)
-
-
-if is_tf_available():
-    import tensorflow as tf
-    from transformers import TFAutoModel
-
-if is_torch_available():
-    import torch
-    from transformers import AutoModel
-
-
-input_text = """Bent over their instruments, three hundred Fertilizers were plunged, as
-the Director of Hatcheries and Conditioning entered the room, in the
-scarcely breathing silence, the absent-minded, soliloquizing hum or
-
-whistle, of absorbed concentration. A troop of newly arrived students,
-very young, pink and callow, followed nervously, rather abjectly, at the
-Director's heels. Each of them carried a notebook, in which, whenever
-the great man spoke, he desperately scribbled. Straight from the
-horse's mouth. It was a rare privilege. The D. H. C. for Central London
-always made a point of personally conducting his new students round
-the various departments.
-
-"Just to give you a general idea," he would explain to them. For of
-course some sort of general idea they must have, if they were to do
-their work intelligently-though as little of one, if they were to be good
-and happy members of society, as possible. For particulars, as every
-one knows, make for virtue and happiness; generalities are intellectu-
-ally necessary evils. Not philosophers but fret-sawyers and stamp col-
-lectors compose the backbone of society.
-
-"To-morrow," he would add, smiling at them with a slightly menacing
-geniality, "you'll be settling down to serious work. You won't have time
-for generalities. Meanwhile ..."
-
-Meanwhile, it was a privilege. Straight from the horse's mouth into the
-notebook. The boys scribbled like mad.
-
-Tall and rather thin but upright, the Director advanced into the room.
-He had a long chin and big rather prominent teeth, just covered, when
-he was not talking, by his full, floridly curved lips. Old, young? Thirty?
-Fifty? Fifty-five? It was hard to say. And anyhow the question didn't
-arise; in this year of stability, A. F. 632, it didn't occur to you to ask it.
-
-"I shall begin at the beginning," said the D.H.C. and the more zealous
-students recorded his intention in their notebooks: Begin at the begin-
-ning. "These," he waved his hand, "are the incubators." And opening
-an insulated door he showed them racks upon racks of numbered test-
-tubes. "The week's supply of ova. Kept," he explained, "at blood heat;
-whereas the male gametes," and here he opened another door, "they
-have to be kept at thirty-five instead of thirty-seven. Full blood heat
-sterilizes." Rams wrapped in theremogene beget no lambs.
-
-Still leaning against the incubators he gave them, while the pencils
-scurried illegibly across the pages, a brief description of the modern
-
-
-
-fertilizing process; spoke first, of course, of its surgical introduc-
-tion-"the operation undergone voluntarily for the good of Society, not
-to mention the fact that it carries a bonus amounting to six months'
-salary"; continued with some account of the technique for preserving
-the excised ovary alive and actively developing; passed on to a consid-
-eration of optimum temperature, salinity, viscosity; referred to the liq-
-uor in which the detached and ripened eggs were kept; and, leading
-his charges to the work tables, actually showed them how this liquor
-was drawn off from the test-tubes; how it was let out drop by drop
-onto the specially warmed slides of the microscopes; how the eggs
-which it contained were inspected for abnormalities, counted and
-transferred to a porous receptacle; how (and he now took them to
-watch the operation) this receptacle was immersed in a warm bouillon
-containing free-swimming spermatozoa-at a minimum concentration
-of one hundred thousand per cubic centimetre, he insisted; and how,
-after ten minutes, the container was lifted out of the liquor and its
-contents re-examined; how, if any of the eggs remained unfertilized, it
-was again immersed, and, if necessary, yet again; how the fertilized
-ova went back to the incubators; where the Alphas and Betas re-
-mained until definitely bottled; while the Gammas, Deltas and Epsilons
-were brought out again, after only thirty-six hours, to undergo Bo-
-kanovsky's Process.
-
-"Bokanovsky's Process," repeated the Director, and the students un-
-derlined the words in their little notebooks.
-
-One egg, one embryo, one adult-normality. But a bokanovskified egg
-will bud, will proliferate, will divide. From eight to ninety-six buds, and
-every bud will grow into a perfectly formed embryo, and every embryo
-into a full-sized adult. Making ninety-six human beings grow where
-only one grew before. Progress.
-
-"Essentially," the D.H.C. concluded, "bokanovskification consists of a
-series of arrests of development. We check the normal growth and,
-paradoxically enough, the egg responds by budding."
-
-Responds by budding. The pencils were busy.
-
-He pointed. On a very slowly moving band a rack-full of test-tubes was
-entering a large metal box, another, rack-full was emerging. Machinery
-faintly purred. It took eight minutes for the tubes to go through, he
-
-
-
-told them. Eight minutes of hard X-rays being about as much as an
-egg can stand. A few died; of the rest, the least susceptible divided
-into two; most put out four buds; some eight; all were returned to the
-incubators, where the buds began to develop; then, after two days,
-were suddenly chilled, chilled and checked. Two, four, eight, the buds
-in their turn budded; and having budded were dosed almost to death
-with alcohol; consequently burgeoned again and having budded-bud
-out of bud out of bud-were thereafter-further arrest being generally
-fatal-left to develop in peace. By which time the original egg was in a
-fair way to becoming anything from eight to ninety-six embryos- a
-prodigious improvement, you will agree, on nature. Identical twins-but
-not in piddling twos and threes as in the old viviparous days, when an
-egg would sometimes accidentally divide; actually by dozens, by
-scores at a time.
-
-"Scores," the Director repeated and flung out his arms, as though he
-were distributing largesse. "Scores."
-
-But one of the students was fool enough to ask where the advantage
-lay.
-
-"My good boy!" The Director wheeled sharply round on him. "Can't you
-see? Can't you see?" He raised a hand; his expression was solemn.
-"Bokanovsky's Process is one of the major instruments of social stabil-
-ity!"
-
-Major instruments of social stability.
-
-Standard men and women; in uniform batches. The whole of a small
-factory staffed with the products of a single bokanovskified egg.
-
-"Ninety-six identical twins working ninety-six identical machines!" The
-voice was almost tremulous with enthusiasm. "You really know where
-you are. For the first time in history." He quoted the planetary motto.
-"Community, Identity, Stability." Grand words. "If we could bo-
-kanovskify indefinitely the whole problem would be solved."
-
-Solved by standard Gammas, unvarying Deltas, uniform Epsilons. Mil-
-lions of identical twins. The principle of mass production at last applied
-to biology.
-
-
-
-"But, alas," the Director shook his head, "we can't bokanovskify indefi-
-nitely."
-
-Ninety-six seemed to be the limit; seventy-two a good average. From
-the same ovary and with gametes of the same male to manufacture as
-many batches of identical twins as possible-that was the best (sadly a
-second best) that they could do. And even that was difficult.
-
-"For in nature it takes thirty years for two hundred eggs to reach ma-
-turity. But our business is to stabilize the population at this moment,
-here and now. Dribbling out twins over a quarter of a century-what
-would be the use of that?"
-
-Obviously, no use at all. But Podsnap's Technique had immensely ac-
-celerated the process of ripening. They could make sure of at least a
-hundred and fifty mature eggs within two years. Fertilize and bo-
-kanovskify-in other words, multiply by seventy-two-and you get an
-average of nearly eleven thousand brothers and sisters in a hundred
-and fifty batches of identical twins, all within two years of the same
-age.
-
-"And in exceptional cases we can make one ovary yield us over fifteen
-thousand adult individuals."
-
-Beckoning to a fair-haired, ruddy young man who happened to be
-passing at the moment. "Mr. Foster," he called. The ruddy young man
-approached. "Can you tell us the record for a single ovary, Mr. Foster?"
-
-"Sixteen thousand and twelve in this Centre," Mr. Foster replied with-
-out hesitation. He spoke very quickly, had a vivacious blue eye, and
-took an evident pleasure in quoting figures. "Sixteen thousand and
-twelve; in one hundred and eighty-nine batches of identicals. But of
-course they've done much better," he rattled on, "in some of the tropi-
-cal Centres. Singapore has often produced over sixteen thousand five
-hundred; and Mombasa has actually touched the seventeen thousand
-mark. But then they have unfair advantages. You should see the way a
-negro ovary responds to pituitary! It's quite astonishing, when you're
-used to working with European material. Still," he added, with a laugh
-(but the light of combat was in his eyes and the lift of his chin was
-challenging), "still, we mean to beat them if we can. I'm working on a
-wonderful Delta-Minus ovary at this moment. Only just eighteen
-
-
-
-months old. Over twelve thousand seven hundred children already, ei-
-ther decanted or in embryo. And still going strong. We'll beat them
-yet."
-
-"That's the spirit I like!" cried the Director, and clapped Mr. Foster on
-the shoulder. "Come along with us, and give these boys the benefit of
-your expert knowledge."
-
-Mr. Foster smiled modestly. "With pleasure." They went.
-In the Bottling Room all was harmonious bustle and ordered activity.
-Flaps of fresh sow's peritoneum ready cut to the proper size came
-shooting up in little lifts from the Organ Store in the sub-basement.
-Whizz and then, click! the lift-hatches hew open; the bottle-liner had
-only to reach out a hand, take the flap, insert, smooth-down, and be-
-fore the lined bottle had had time to travel out of reach along the end-
-less band, whizz, click! another flap of peritoneum had shot up from
-the depths, ready to be slipped into yet another bottle, the next of that
-slow interminable procession on the band.
-
-Next to the Liners stood the Matriculators. The procession advanced;
-one by one the eggs were transferred from their test-tubes to the
-larger containers; deftly the peritoneal lining was slit, the morula
-dropped into place, the saline solution poured in ... and already the
-bottle had passed, and it was the turn of the labellers. Heredity, date
-of fertilization, membership of Bokanovsky Group-details were trans-
-ferred from test-tube to bottle. No longer anonymous, but named,
-identified, the procession marched slowly on; on through an opening in
-the wall, slowly on into the Social Predestination Room.
-"Eighty-eight cubic metres of card-index," said Mr. Foster with relish,
-as they entered."""
-
-
-def create_setup_and_compute(
-    model_names: List[str],
-    batch_sizes: List[int],
-    slice_sizes: List[int],
-    gpu: bool = True,
-    tensorflow: bool = False,
-    average_over: int = 3,
-    no_speed: bool = False,
-    no_memory: bool = False,
-    verbose: bool = False,
-    torchscript: bool = False,
-    xla: bool = False,
-    amp: bool = False,
-    fp16: bool = False,
-    save_to_csv: bool = False,
-    csv_time_filename: str = f"time_{round(time())}.csv",
-    csv_memory_filename: str = f"memory_{round(time())}.csv",
-    print_fn: Callable[[str], None] = print,
-):
-    if xla:
-        tf.config.optimizer.set_jit(True)
-    if amp:
-        tf.config.optimizer.set_experimental_options({"auto_mixed_precision": True})
-
-    if tensorflow:
-        dictionary = {model_name: {} for model_name in model_names}
-        results = _compute_tensorflow(
-            model_names,
-            batch_sizes,
-            slice_sizes,
-            dictionary,
-            average_over,
-            amp,
-            no_speed,
-            no_memory,
-            verbose,
-            print_fn,
-        )
-    else:
-        device = "cuda" if (gpu and torch.cuda.is_available()) else "cpu"
-        dictionary = {model_name: {} for model_name in model_names}
-        results = _compute_pytorch(
-            model_names,
-            batch_sizes,
-            slice_sizes,
-            dictionary,
-            average_over,
-            device,
-            torchscript,
-            fp16,
-            no_speed,
-            no_memory,
-            verbose,
-            print_fn,
-        )
-
-    print_fn("=========== RESULTS ===========")
-    for model_name in model_names:
-        print_fn("\t" + f"======= MODEL CHECKPOINT: {model_name} =======")
-        for batch_size in results[model_name]["bs"]:
-            print_fn("\t\t" + f"===== BATCH SIZE: {batch_size} =====")
-            for slice_size in results[model_name]["ss"]:
-                time = results[model_name]["time"][batch_size][slice_size]
-                memory = results[model_name]["memory"][batch_size][slice_size]
-                if isinstance(time, str):
-                    print_fn(f"\t\t{model_name}/{batch_size}/{slice_size}: " f"{time} " f"{memory}")
-                else:
-                    print_fn(
-                        f"\t\t{model_name}/{batch_size}/{slice_size}: "
-                        f"{(round(1000 * time) / 1000)}"
-                        f"s "
-                        f"{memory}"
-                    )
-
-    if save_to_csv:
-        with open(csv_time_filename, mode="w") as csv_time_file, open(
-            csv_memory_filename, mode="w"
-        ) as csv_memory_file:
-
-            assert len(model_names) > 0, "At least 1 model should be defined, but got {}".format(model_names)
-
-            fieldnames = ["model", "batch_size", "sequence_length"]
-            time_writer = csv.DictWriter(csv_time_file, fieldnames=fieldnames + ["time_in_s"])
-            time_writer.writeheader()
-            memory_writer = csv.DictWriter(csv_memory_file, fieldnames=fieldnames + ["memory"])
-            memory_writer.writeheader()
-
-            for model_name in model_names:
-                time_dict = results[model_name]["time"]
-                memory_dict = results[model_name]["memory"]
-                for bs in time_dict:
-                    for ss in time_dict[bs]:
-                        time_writer.writerow(
-                            {
-                                "model": model_name,
-                                "batch_size": bs,
-                                "sequence_length": ss,
-                                "time_in_s": "{:.4f}".format(time_dict[bs][ss]),
-                            }
-                        )
-
-                for bs in memory_dict:
-                    for ss in time_dict[bs]:
-                        memory_writer.writerow(
-                            {
-                                "model": model_name,
-                                "batch_size": bs,
-                                "sequence_length": ss,
-                                "memory": memory_dict[bs][ss],
-                            }
-                        )
-
-
-def print_summary_statistics(summary: MemorySummary, print_fn: Callable[[str], None]):
-    print_fn(
-        "\nLines by line memory consumption:\n"
-        + "\n".join(
-            f"{state.frame.filename}:{state.frame.line_number}: mem {state.cpu_gpu}: {state.frame.line_text}"
-            for state in summary.sequential
-        )
-    )
-    print_fn(
-        "\nLines with top memory consumption:\n"
-        + "\n".join(
-            f"=> {state.frame.filename}:{state.frame.line_number}: mem {state.cpu_gpu}: {state.frame.line_text}"
-            for state in summary.cumulative[:6]
-        )
-    )
-    print_fn(
-        "\nLines with lowest memory consumption:\n"
-        + "\n".join(
-            f"=> {state.frame.filename}:{state.frame.line_number}: mem {state.cpu_gpu}: {state.frame.line_text}"
-            for state in summary.cumulative[-6:]
-        )
-    )
-    print_fn(f"\nTotal memory increase: {summary.total}")
-
-
-def get_print_function(save_print_log, log_filename):
-    if save_print_log:
-        logging.basicConfig(
-            level=logging.DEBUG,
-            filename=log_filename,
-            filemode="a+",
-            format="%(asctime)-15s %(levelname)-8s %(message)s",
-        )
-
-        def print_with_print_log(*args):
-            logging.info(*args)
-            print(*args)
-
-        return print_with_print_log
-    else:
-        return print
-
-
-def _compute_pytorch(
-    model_names,
-    batch_sizes,
-    slice_sizes,
-    dictionary,
-    average_over,
-    device,
-    torchscript,
-    fp16,
-    no_speed,
-    no_memory,
-    verbose,
-    print_fn,
-):
-    for c, model_name in enumerate(model_names):
-        print_fn(f"{c + 1} / {len(model_names)}")
-        config = AutoConfig.from_pretrained(model_name, torchscript=torchscript)
-        model = AutoModel.from_pretrained(model_name, config=config)
-        tokenizer = AutoTokenizer.from_pretrained(model_name)
-
-        tokenized_sequence = tokenizer.encode(input_text, add_special_tokens=False)
-
-        max_input_size = tokenizer.max_model_input_sizes[model_name]
-
-        dictionary[model_name] = {"bs": batch_sizes, "ss": slice_sizes, "time": {}, "memory": {}}
-        dictionary[model_name]["time"] = {i: {} for i in batch_sizes}
-        dictionary[model_name]["memory"] = {i: {} for i in batch_sizes}
-
-        print_fn("Using model {}".format(model))
-        print_fn("Number of all parameters {}".format(model.num_parameters()))
-
-        for batch_size in batch_sizes:
-            if fp16:
-                model.half()
-            model.to(device)
-            model.eval()
-
-            for slice_size in slice_sizes:
-                if max_input_size is not None and slice_size > max_input_size:
-                    dictionary[model_name]["time"][batch_size][slice_size] = "N/A"
-                else:
-                    sequence = torch.tensor(tokenized_sequence[:slice_size], device=device).repeat(batch_size, 1)
-                    try:
-                        if torchscript:
-                            print_fn("Tracing model with sequence size {}".format(sequence.shape))
-                            inference = torch.jit.trace(model, sequence)
-                            inference(sequence)
-                        else:
-                            inference = model
-                            inference(sequence)
-
-                        if not no_memory:
-                            # model.add_memory_hooks()  # Forward method tracing (only for PyTorch models)
-
-                            # Line by line memory tracing (all code in the module `transformers`) works for all models/arbitrary code
-                            trace = start_memory_tracing("transformers")
-                            inference(sequence)
-                            summary = stop_memory_tracing(trace)
-
-                            if verbose:
-                                print_summary_statistics(summary, print_fn)
-
-                            dictionary[model_name]["memory"][batch_size][slice_size] = str(summary.total)
-                        else:
-                            dictionary[model_name]["memory"][batch_size][slice_size] = "N/A"
-
-                        if not no_speed:
-                            print_fn("Going through model with sequence of shape".format(sequence.shape))
-                            runtimes = timeit.repeat(lambda: inference(sequence), repeat=average_over, number=3)
-                            average_time = sum(runtimes) / float(len(runtimes)) / 3.0
-                            dictionary[model_name]["time"][batch_size][slice_size] = average_time
-                        else:
-                            dictionary[model_name]["time"][batch_size][slice_size] = "N/A"
-
-                    except RuntimeError as e:
-                        print_fn("Doesn't fit on GPU. {}".format(e))
-                        torch.cuda.empty_cache()
-                        dictionary[model_name]["time"][batch_size][slice_size] = "N/A"
-                        dictionary[model_name]["memory"][batch_size][slice_size] = "N/A"
-    return dictionary
-
-
-def _compute_tensorflow(
-    model_names, batch_sizes, slice_sizes, dictionary, average_over, amp, no_speed, no_memory, verbose, print_fn
-):
-    for c, model_name in enumerate(model_names):
-        print_fn(f"{c + 1} / {len(model_names)}")
-        config = AutoConfig.from_pretrained(model_name)
-        model = TFAutoModel.from_pretrained(model_name, config=config)
-        tokenizer = AutoTokenizer.from_pretrained(model_name)
-
-        tokenized_sequence = tokenizer.encode(input_text, add_special_tokens=False)
-
-        max_input_size = tokenizer.max_model_input_sizes[model_name]
-
-        dictionary[model_name] = {"bs": batch_sizes, "ss": slice_sizes, "time": {}, "memory": {}}
-        dictionary[model_name]["time"] = {i: {} for i in batch_sizes}
-        dictionary[model_name]["memory"] = {i: {} for i in batch_sizes}
-
-        print_fn("Using model {}".format(model))
-        print_fn("Number of all parameters {}".format(model.num_parameters()))
-
-        @tf.function
-        def inference(inputs):
-            return model(inputs)
-
-        for batch_size in batch_sizes:
-            for slice_size in slice_sizes:
-                if max_input_size is not None and slice_size > max_input_size:
-                    dictionary[model_name]["time"][batch_size][slice_size] = "N/A"
-                else:
-                    sequence = tf.stack(
-                        [tf.squeeze(tf.constant(tokenized_sequence[:slice_size])[None, :])] * batch_size
-                    )
-
-                    try:
-                        print_fn("Going through model with sequence of shape {}".format(sequence.shape))
-                        # To make sure that the model is traced + that the tensors are on the appropriate device
-                        inference(sequence)
-
-                        if not no_memory:
-                            # Line by line memory tracing (all code in the module `transformers`) works for all models/arbitrary code
-                            trace = start_memory_tracing("transformers")
-                            inference(sequence)
-                            summary = stop_memory_tracing(trace)
-
-                            if verbose:
-                                print_summary_statistics(summary, print_fn)
-
-                            dictionary[model_name]["memory"][batch_size][slice_size] = str(summary.total)
-                        else:
-                            dictionary[model_name]["memory"][batch_size][slice_size] = "N/A"
-
-                        if not no_speed:
-                            runtimes = timeit.repeat(lambda: inference(sequence), repeat=average_over, number=3)
-                            average_time = sum(runtimes) / float(len(runtimes)) / 3.0
-                            dictionary[model_name]["time"][batch_size][slice_size] = average_time
-                        else:
-                            dictionary[model_name]["time"][batch_size][slice_size] = "N/A"
-
-                    except tf.errors.ResourceExhaustedError as e:
-                        print_fn("Doesn't fit on GPU. {}".format(e))
-                        dictionary[model_name]["time"][batch_size][slice_size] = "N/A"
-                        dictionary[model_name]["memory"][batch_size][slice_size] = "N/A"
-    return dictionary
-
-
-def main():
-    parser = argparse.ArgumentParser()
-
-    parser.add_argument(
-        "--models",
-        required=False,
-        type=str,
-        default="all",
-        help="Model checkpoints to be provided "
-        "to the AutoModel classes. Leave "
-        "blank to benchmark the base version "
-        "of all available model "
-        "architectures.",
-    )
-    parser.add_argument("--verbose", required=False, action="store_true", help="Verbose memory tracing")
-    parser.add_argument("--no_speed", required=False, action="store_true", help="Don't perform speed measurments")
-    parser.add_argument("--no_memory", required=False, action="store_true", help="Don't perform memory measurments")
-    parser.add_argument(
-        "--torch", required=False, action="store_true", help="Benchmark the Pytorch version of the " "models"
-    )
-    parser.add_argument(
-        "--torch_cuda", required=False, action="store_true", help="Pytorch only: run on available " "cuda devices"
-    )
-    parser.add_argument(
-        "--torchscript",
-        required=False,
-        action="store_true",
-        help="Pytorch only: trace the models " "using torchscript",
-    )
-    parser.add_argument(
-        "--tensorflow",
-        required=False,
-        action="store_true",
-        help="Benchmark the TensorFlow version "
-        "of the models. Will run on GPU if "
-        "the correct dependencies are "
-        "installed",
-    )
-    parser.add_argument("--xla", required=False, action="store_true", help="TensorFlow only: use XLA acceleration.")
-    parser.add_argument(
-        "--amp",
-        required=False,
-        action="store_true",
-        help="TensorFlow only: use automatic mixed precision acceleration.",
-    )
-    parser.add_argument(
-        "--fp16", required=False, action="store_true", help="PyTorch only: use FP16 to accelerate inference."
-    )
-    parser.add_argument(
-        "--keras_predict",
-        required=False,
-        action="store_true",
-        help="Whether to use model.predict " "instead of model() to do a " "forward pass.",
-    )
-    parser.add_argument("--save_to_csv", required=False, action="store_true", help="Save to a CSV file.")
-    parser.add_argument(
-        "--log_print", required=False, action="store_true", help="Save all print statements in log file."
-    )
-    parser.add_argument(
-        "--csv_time_filename",
-        required=False,
-        default=f"time_{round(time())}.csv",
-        help="CSV filename used if saving time results to csv.",
-    )
-    parser.add_argument(
-        "--csv_memory_filename",
-        required=False,
-        default=f"memory_{round(time())}.csv",
-        help="CSV filename used if saving memory results to csv.",
-    )
-    parser.add_argument(
-        "--log_filename",
-        required=False,
-        default=f"log_{round(time())}.txt",
-        help="Log filename used if print statements are saved in log.",
-    )
-    parser.add_argument(
-        "--average_over", required=False, default=30, type=int, help="Times an experiment will be run."
-    )
-    parser.add_argument("--batch_sizes", nargs="+", type=int, default=[1, 2, 4, 8])
-    parser.add_argument("--slice_sizes", nargs="+", type=int, default=[8, 64, 128, 256, 512, 1024])
-
-    args = parser.parse_args()
-    if args.models == "all":
-        args.models = [
-            "gpt2",
-            "bert-base-cased",
-            "xlnet-base-cased",
-            "xlm-mlm-en-2048",
-            "transfo-xl-wt103",
-            "openai-gpt",
-            "distilbert-base-uncased",
-            "distilgpt2",
-            "roberta-base",
-            "ctrl",
-            "t5-base",
-            "bart-large",
-        ]
-    else:
-        args.models = args.models.split()
-
-    print_fn = get_print_function(args.log_print, args.log_filename)
-    print_fn("Running with arguments: {}".format(args))
-
-    if args.torch:
-        if is_torch_available():
-            create_setup_and_compute(
-                model_names=args.models,
-                batch_sizes=args.batch_sizes,
-                slice_sizes=args.slice_sizes,
-                tensorflow=False,
-                gpu=args.torch_cuda,
-                torchscript=args.torchscript,
-                fp16=args.fp16,
-                save_to_csv=args.save_to_csv,
-                csv_time_filename=args.csv_time_filename,
-                csv_memory_filename=args.csv_memory_filename,
-                average_over=args.average_over,
-                no_speed=args.no_speed,
-                no_memory=args.no_memory,
-                verbose=args.verbose,
-                print_fn=print_fn,
-            )
-        else:
-            raise ImportError("Trying to run a PyTorch benchmark but PyTorch was not found in the environment.")
-
-    if args.tensorflow:
-        if is_tf_available():
-            create_setup_and_compute(
-                model_names=args.models,
-                batch_sizes=args.batch_sizes,
-                slice_sizes=args.slice_sizes,
-                tensorflow=True,
-                xla=args.xla,
-                amp=args.amp,
-                save_to_csv=args.save_to_csv,
-                csv_time_filename=args.csv_time_filename,
-                csv_memory_filename=args.csv_memory_filename,
-                average_over=args.average_over,
-                no_speed=args.no_speed,
-                no_memory=args.no_memory,
-                verbose=args.verbose,
-                print_fn=print_fn,
-            )
-        else:
-            raise ImportError("Trying to run a TensorFlow benchmark but TensorFlow was not found in the environment.")
-
-
-if __name__ == "__main__":
-    main()
--- a/examples/bertology/run_bertology.py
+++ b/examples/bertology/run_bertology.py
@@ -64,7 +64,7 @@ def print_2d_tensor(tensor):


 def compute_heads_importance(
-    args, model, eval_dataloader, compute_entropy=True, compute_importance=True, head_mask=None
+    args, model, eval_dataloader, compute_entropy=True, compute_importance=True, head_mask=None, actually_pruned=False
 ):
    """ This method shows how to compute:
        - head attention entropy
@@ -77,7 +77,12 @@ def compute_heads_importance(

    if head_mask is None:
        head_mask = torch.ones(n_layers, n_heads).to(args.device)
+
    head_mask.requires_grad_(requires_grad=True)
+    # If actually pruned attention multi-head, set head mask to None to avoid shape mismatch
+    if actually_pruned:
+        head_mask = None
+
    preds = None
    labels = None
    tot_tokens = 0.0
@@ -172,6 +177,7 @@ def mask_heads(args, model, eval_dataloader):
        new_head_mask = new_head_mask.view(-1)
        new_head_mask[current_heads_to_mask] = 0.0
        new_head_mask = new_head_mask.view_as(head_mask)
+        new_head_mask = new_head_mask.clone().detach()
        print_2d_tensor(new_head_mask)

        # Compute metric and head importance again
@@ -181,7 +187,7 @@ def mask_heads(args, model, eval_dataloader):
        preds = np.argmax(preds, axis=1) if args.output_mode == "classification" else np.squeeze(preds)
        current_score = glue_compute_metrics(args.task_name, preds, labels)[args.metric_name]
        logger.info(
-            "Masking: current score: %f, remaning heads %d (%.1f percents)",
+            "Masking: current score: %f, remaining heads %d (%.1f percents)",
            current_score,
            new_head_mask.sum(),
            new_head_mask.sum() / new_head_mask.numel() * 100,
@@ -209,14 +215,23 @@ def prune_heads(args, model, eval_dataloader, head_mask):
    original_time = datetime.now() - before_time

    original_num_params = sum(p.numel() for p in model.parameters())
-    heads_to_prune = dict((layer, (1 - head_mask[layer].long()).nonzero().tolist()) for layer in range(len(head_mask)))
+    heads_to_prune = dict(
+        (layer, (1 - head_mask[layer].long()).nonzero().squeeze().tolist()) for layer in range(len(head_mask))
+    )
+
    assert sum(len(h) for h in heads_to_prune.values()) == (1 - head_mask.long()).sum().item()
    model.prune_heads(heads_to_prune)
    pruned_num_params = sum(p.numel() for p in model.parameters())

    before_time = datetime.now()
    _, _, preds, labels = compute_heads_importance(
-        args, model, eval_dataloader, compute_entropy=False, compute_importance=False, head_mask=None
+        args,
+        model,
+        eval_dataloader,
+        compute_entropy=False,
+        compute_importance=False,
+        head_mask=None,
+        actually_pruned=True,
    )
    preds = np.argmax(preds, axis=1) if args.output_mode == "classification" else np.squeeze(preds)
    score_pruning = glue_compute_metrics(args.task_name, preds, labels)[args.metric_name]
@@ -404,7 +419,7 @@ def main():
    logger.info("Training/evaluation parameters %s", args)

    # Prepare dataset for the GLUE task
-    eval_dataset = GlueDataset(args, tokenizer=tokenizer, evaluate=True, local_rank=args.local_rank)
+    eval_dataset = GlueDataset(args, tokenizer=tokenizer, mode="dev")
    if args.data_subset > 0:
        eval_dataset = Subset(eval_dataset, list(range(min(args.data_subset, len(eval_dataset)))))
    eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset)
--- a/examples/contrib/mm-imdb/run_mmimdb.py
+++ b/examples/contrib/mm-imdb/run_mmimdb.py
@@ -34,26 +34,11 @@ from tqdm import tqdm, trange
 from transformers import (
    WEIGHTS_NAME,
    AdamW,
-    AlbertConfig,
-    AlbertModel,
-    AlbertTokenizer,
-    BertConfig,
-    BertModel,
-    BertTokenizer,
-    DistilBertConfig,
-    DistilBertModel,
-    DistilBertTokenizer,
+    AutoConfig,
+    AutoModel,
+    AutoTokenizer,
    MMBTConfig,
    MMBTForClassification,
-    RobertaConfig,
-    RobertaModel,
-    RobertaTokenizer,
-    XLMConfig,
-    XLMModel,
-    XLMTokenizer,
-    XLNetConfig,
-    XLNetModel,
-    XLNetTokenizer,
    get_linear_schedule_with_warmup,
 )
 from utils_mmimdb import ImageEncoder, JsonlDataset, collate_fn, get_image_transforms, get_mmimdb_labels
@@ -67,23 +52,6 @@ except ImportError:

 logger = logging.getLogger(__name__)

-ALL_MODELS = sum(
-    (
-        tuple(conf.pretrained_config_archive_map.keys())
-        for conf in (BertConfig, XLNetConfig, XLMConfig, RobertaConfig, DistilBertConfig)
-    ),
-    (),
-)
-
-MODEL_CLASSES = {
-    "bert": (BertConfig, BertModel, BertTokenizer),
-    "xlnet": (XLNetConfig, XLNetModel, XLNetTokenizer),
-    "xlm": (XLMConfig, XLMModel, XLMTokenizer),
-    "roberta": (RobertaConfig, RobertaModel, RobertaTokenizer),
-    "distilbert": (DistilBertConfig, DistilBertModel, DistilBertTokenizer),
-    "albert": (AlbertConfig, AlbertModel, AlbertTokenizer),
-}
-

 def set_seed(args):
    random.seed(args.seed)
@@ -351,19 +319,12 @@ def main():
        required=True,
        help="The input data dir. Should contain the .jsonl files for MMIMDB.",
    )
-    parser.add_argument(
-        "--model_type",
-        default=None,
-        type=str,
-        required=True,
-        help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()),
-    )
    parser.add_argument(
        "--model_name_or_path",
        default=None,
        type=str,
        required=True,
-        help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS),
+        help="Path to pretrained model or model identifier from huggingface.co/models",
    )
    parser.add_argument(
        "--output_dir",
@@ -385,7 +346,7 @@ def main():
    )
    parser.add_argument(
        "--cache_dir",
-        default="",
+        default=None,
        type=str,
        help="Where do you want to store the pre-trained models downloaded from s3",
    )
@@ -526,18 +487,14 @@ def main():
    # Setup model
    labels = get_mmimdb_labels()
    num_labels = len(labels)
-    args.model_type = args.model_type.lower()
-    config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
-    transformer_config = config_class.from_pretrained(
-        args.config_name if args.config_name else args.model_name_or_path
-    )
-    tokenizer = tokenizer_class.from_pretrained(
+    transformer_config = AutoConfig.from_pretrained(args.config_name if args.config_name else args.model_name_or_path)
+    tokenizer = AutoTokenizer.from_pretrained(
        args.tokenizer_name if args.tokenizer_name else args.model_name_or_path,
        do_lower_case=args.do_lower_case,
-        cache_dir=args.cache_dir if args.cache_dir else None,
+        cache_dir=args.cache_dir,
    )
-    transformer = model_class.from_pretrained(
-        args.model_name_or_path, config=transformer_config, cache_dir=args.cache_dir if args.cache_dir else None
+    transformer = AutoModel.from_pretrained(
+        args.model_name_or_path, config=transformer_config, cache_dir=args.cache_dir
    )
    img_encoder = ImageEncoder(args)
    config = MMBTConfig(transformer_config, num_labels=num_labels)
@@ -583,13 +540,12 @@ def main():
        # Load a trained model and vocabulary that you have fine-tuned
        model = MMBTForClassification(config, transformer, img_encoder)
        model.load_state_dict(torch.load(os.path.join(args.output_dir, WEIGHTS_NAME)))
-        tokenizer = tokenizer_class.from_pretrained(args.output_dir)
+        tokenizer = AutoTokenizer.from_pretrained(args.output_dir)
        model.to(args.device)

    # Evaluation
    results = {}
    if args.do_eval and args.local_rank in [-1, 0]:
-        tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
        checkpoints = [args.output_dir]
        if args.eval_all_checkpoints:
            checkpoints = list(
--- a/examples/contrib/run_swag.py
+++ b/examples/contrib/run_swag.py
@@ -31,14 +31,8 @@ from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, Tenso
 from torch.utils.data.distributed import DistributedSampler
 from tqdm import tqdm, trange

-from transformers import (
-    WEIGHTS_NAME,
-    AdamW,
-    BertConfig,
-    BertForMultipleChoice,
-    BertTokenizer,
-    get_linear_schedule_with_warmup,
-)
+from transformers import WEIGHTS_NAME, AdamW, AutoConfig, AutoTokenizer, get_linear_schedule_with_warmup
+from transformers.modeling_auto import AutoModelForMultipleChoice


 try:
@@ -49,12 +43,6 @@ except ImportError:

 logger = logging.getLogger(__name__)

-ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in [BertConfig]), ())
-
-MODEL_CLASSES = {
-    "bert": (BertConfig, BertForMultipleChoice, BertTokenizer),
-}
-

 class SwagExample(object):
    """A single training/test example for the SWAG dataset."""
@@ -492,19 +480,12 @@ def main():
        required=True,
        help="SWAG csv for predictions. E.g., val.csv or test.csv",
    )
-    parser.add_argument(
-        "--model_type",
-        default=None,
-        type=str,
-        required=True,
-        help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()),
-    )
    parser.add_argument(
        "--model_name_or_path",
        default=None,
        type=str,
        required=True,
-        help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS),
+        help="Path to pretrained model or model identifier from huggingface.co/models",
    )
    parser.add_argument(
        "--output_dir",
@@ -536,9 +517,6 @@ def main():
    parser.add_argument(
        "--evaluate_during_training", action="store_true", help="Rul evaluation during training at each logging step."
    )
-    parser.add_argument(
-        "--do_lower_case", action="store_true", help="Set this flag if you are using an uncased model."
-    )

    parser.add_argument("--per_gpu_train_batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.")
    parser.add_argument(
@@ -652,13 +630,9 @@ def main():
    if args.local_rank not in [-1, 0]:
        torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab

-    args.model_type = args.model_type.lower()
-    config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
-    config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path)
-    tokenizer = tokenizer_class.from_pretrained(
-        args.tokenizer_name if args.tokenizer_name else args.model_name_or_path, do_lower_case=args.do_lower_case
-    )
-    model = model_class.from_pretrained(
+    config = AutoConfig.from_pretrained(args.config_name if args.config_name else args.model_name_or_path)
+    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path,)
+    model = AutoModelForMultipleChoice.from_pretrained(
        args.model_name_or_path, from_tf=bool(".ckpt" in args.model_name_or_path), config=config
    )

@@ -694,8 +668,8 @@ def main():
        torch.save(args, os.path.join(args.output_dir, "training_args.bin"))

        # Load a trained model and vocabulary that you have fine-tuned
-        model = model_class.from_pretrained(args.output_dir)
-        tokenizer = tokenizer_class.from_pretrained(args.output_dir)
+        model = AutoModelForMultipleChoice.from_pretrained(args.output_dir)
+        tokenizer = AutoTokenizer.from_pretrained(args.output_dir)
        model.to(args.device)

    # Evaluation - we can ask to evaluate all the checkpoints (sub-directories) in a directory
@@ -718,8 +692,8 @@ def main():
        for checkpoint in checkpoints:
            # Reload the model
            global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
-            model = model_class.from_pretrained(checkpoint)
-            tokenizer = tokenizer_class.from_pretrained(checkpoint)
+            model = AutoModelForMultipleChoice.from_pretrained(checkpoint)
+            tokenizer = AutoTokenizer.from_pretrained(checkpoint)
            model.to(args.device)

            # Evaluate
--- a/examples/contrib/run_transfo_xl.py
+++ b/examples/contrib/run_transfo_xl.py
@@ -80,7 +80,7 @@ def main():

    # Load a pre-trained model
    model = TransfoXLLMHeadModel.from_pretrained(args.model_name)
-    model = model.to(device)
+    model.to(device)

    logger.info(
        "Evaluating with bsz {} tgt_len {} ext_len {} mem_len {} clamp_len {}".format(
--- a/examples/distillation/distiller.py
+++ b/examples/distillation/distiller.py
@@ -80,7 +80,7 @@ class Distiller:

        self.mlm = params.mlm
        if self.mlm:
-            logger.info(f"Using MLM loss for LM step.")
+            logger.info("Using MLM loss for LM step.")
            self.mlm_mask_prop = params.mlm_mask_prop
            assert 0.0 <= self.mlm_mask_prop <= 1.0
            assert params.word_mask + params.word_keep + params.word_rand == 1.0
@@ -91,7 +91,7 @@ class Distiller:
                self.pred_probs = self.pred_probs.half()
                self.token_probs = self.token_probs.half()
        else:
-            logger.info(f"Using CLM loss for LM step.")
+            logger.info("Using CLM loss for LM step.")

        self.epoch = 0
        self.n_iter = 0
@@ -365,8 +365,8 @@ class Distiller:
            self.end_epoch()

        if self.is_master:
-            logger.info(f"Save very last checkpoint as `pytorch_model.bin`.")
-            self.save_checkpoint(checkpoint_name=f"pytorch_model.bin")
+            logger.info("Save very last checkpoint as `pytorch_model.bin`.")
+            self.save_checkpoint(checkpoint_name="pytorch_model.bin")
            logger.info("Training is finished")

    def step(self, input_ids: torch.tensor, attention_mask: torch.tensor, lm_labels: torch.tensor):
--- a/examples/distillation/run_squad_w_distillation.py
+++ b/examples/distillation/run_squad_w_distillation.py
@@ -13,7 +13,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-""" This is the exact same script as `examples/run_squad.py` (as of 2020, January 8th) with an additional and optional step of distillation."""
+""" This is the exact same script as `examples/question-answering/run_squad.py` (as of 2020, January 8th) with an additional and optional step of distillation."""

 import argparse
 import glob
@@ -67,9 +67,6 @@ except ImportError:

 logger = logging.getLogger(__name__)

-ALL_MODELS = sum(
-    (tuple(conf.pretrained_config_archive_map.keys()) for conf in (BertConfig, XLNetConfig, XLMConfig)), ()
-)

 MODEL_CLASSES = {
    "bert": (BertConfig, BertForQuestionAnswering, BertTokenizer),
@@ -505,7 +502,7 @@ def main():
        default=None,
        type=str,
        required=True,
-        help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS),
+        help="Path to pretrained model or model identifier from huggingface.co/models",
    )
    parser.add_argument(
        "--output_dir",
--- a/examples/distillation/scripts/binarized_data.py
+++ b/examples/distillation/scripts/binarized_data.py
@@ -60,7 +60,7 @@ def main():
    with open(args.file_path, "r", encoding="utf8") as fp:
        data = fp.readlines()

-    logger.info(f"Start encoding")
+    logger.info("Start encoding")
    logger.info(f"{len(data)} examples to process.")

    rslt = []
--- a/examples/distillation/scripts/extract.py
+++ b/examples/distillation/scripts/extract.py
@@ -93,7 +93,7 @@ if __name__ == "__main__":
    elif args.model_type == "gpt2":
        for w in ["weight", "bias"]:
            compressed_sd[f"{prefix}.ln_f.{w}"] = state_dict[f"{prefix}.ln_f.{w}"]
-        compressed_sd[f"lm_head.weight"] = state_dict[f"lm_head.weight"]
+        compressed_sd["lm_head.weight"] = state_dict["lm_head.weight"]

    print(f"N layers selected for distillation: {std_idx}")
    print(f"Number of params transfered for distillation: {len(compressed_sd.keys())}")
--- a/examples/distillation/scripts/extract_distilbert.py
+++ b/examples/distillation/scripts/extract_distilbert.py
@@ -37,7 +37,7 @@ if __name__ == "__main__":
        model = BertForMaskedLM.from_pretrained(args.model_name)
        prefix = "bert"
    else:
-        raise ValueError(f'args.model_type should be "bert".')
+        raise ValueError('args.model_type should be "bert".')

    state_dict = model.state_dict()
    compressed_sd = {}
@@ -78,8 +78,8 @@ if __name__ == "__main__":
            ]
        std_idx += 1

-    compressed_sd[f"vocab_projector.weight"] = state_dict[f"cls.predictions.decoder.weight"]
-    compressed_sd[f"vocab_projector.bias"] = state_dict[f"cls.predictions.bias"]
+    compressed_sd["vocab_projector.weight"] = state_dict["cls.predictions.decoder.weight"]
+    compressed_sd["vocab_projector.bias"] = state_dict["cls.predictions.bias"]
    if args.vocab_transform:
        for w in ["weight", "bias"]:
            compressed_sd[f"vocab_transform.{w}"] = state_dict[f"cls.predictions.transform.dense.{w}"]
--- a/examples/distillation/train.py
+++ b/examples/distillation/train.py
@@ -273,7 +273,7 @@ def main():
        token_probs = None

    train_lm_seq_dataset = LmSeqsDataset(params=args, data=data)
-    logger.info(f"Data loader created.")
+    logger.info("Data loader created.")

    # STUDENT #
    logger.info(f"Loading student config from {args.student_config}")
@@ -288,7 +288,7 @@ def main():

    if args.n_gpu > 0:
        student.to(f"cuda:{args.local_rank}")
-    logger.info(f"Student loaded.")
+    logger.info("Student loaded.")

    # TEACHER #
    teacher = teacher_model_class.from_pretrained(args.teacher_name, output_hidden_states=True)
--- a/examples/language-modeling/README.md
+++ b/examples/language-modeling/README.md
@@ -1,10 +1,9 @@

 ## Language model training

-Based on the script [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py).
+Based on the script [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py).

-Fine-tuning (or training from scratch) the library models for language modeling on a text dataset for GPT, GPT-2, BERT and RoBERTa (DistilBERT
-to be added soon). GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa
+Fine-tuning (or training from scratch) the library models for language modeling on a text dataset for GPT, GPT-2, BERT, DistilBERT and RoBERTa. GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT, DistilBERT and RoBERTa
 are fine-tuned using a masked language modeling (MLM) loss.

 Before running the following example, you should get a file that contains text on which the language model will be
@@ -35,7 +34,7 @@ python run_language_modeling.py \
 This takes about half an hour to train on a single K80 GPU and about one minute for the evaluation to run. It reaches
 a score of ~20 perplexity once fine-tuned on the dataset.

-### RoBERTa/BERT and masked language modeling
+### RoBERTa/BERT/DistilBERT and masked language modeling

 The following example fine-tunes RoBERTa on WikiText-2. Here too, we're using the raw WikiText-2. The loss is different
 as BERT/RoBERTa have a bidirectional mechanism; we're therefore using the same loss that was used during their
--- a/examples/language-modeling/run_language_modeling.py
+++ b/examples/language-modeling/run_language_modeling.py
@@ -115,15 +115,13 @@ class DataTrainingArguments:
    )


-def get_dataset(args: DataTrainingArguments, tokenizer: PreTrainedTokenizer, evaluate=False, local_rank=-1):
+def get_dataset(args: DataTrainingArguments, tokenizer: PreTrainedTokenizer, evaluate=False):
    file_path = args.eval_data_file if evaluate else args.train_data_file
    if args.line_by_line:
-        return LineByLineTextDataset(
-            tokenizer=tokenizer, file_path=file_path, block_size=args.block_size, local_rank=local_rank
-        )
+        return LineByLineTextDataset(tokenizer=tokenizer, file_path=file_path, block_size=args.block_size)
    else:
        return TextDataset(
-            tokenizer=tokenizer, file_path=file_path, block_size=args.block_size, local_rank=local_rank,
+            tokenizer=tokenizer, file_path=file_path, block_size=args.block_size, overwrite_cache=args.overwrite_cache
        )


@@ -220,16 +218,9 @@ def main():
        data_args.block_size = min(data_args.block_size, tokenizer.max_len)

    # Get datasets
-    train_dataset = (
-        get_dataset(data_args, tokenizer=tokenizer, local_rank=training_args.local_rank)
-        if training_args.do_train
-        else None
-    )
-    eval_dataset = (
-        get_dataset(data_args, tokenizer=tokenizer, local_rank=training_args.local_rank, evaluate=True)
-        if training_args.do_eval
-        else None
-    )
+
+    train_dataset = get_dataset(data_args, tokenizer=tokenizer) if training_args.do_train else None
+    eval_dataset = get_dataset(data_args, tokenizer=tokenizer, evaluate=True) if training_args.do_eval else None
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=data_args.mlm, mlm_probability=data_args.mlm_probability
    )
@@ -260,25 +251,31 @@ def main():

    # Evaluation
    results = {}
-    if training_args.do_eval and training_args.local_rank in [-1, 0]:
+    if training_args.do_eval:
        logger.info("*** Evaluate ***")

        eval_output = trainer.evaluate()

-        perplexity = math.exp(eval_output["loss"])
+        perplexity = math.exp(eval_output["eval_loss"])
        result = {"perplexity": perplexity}

        output_eval_file = os.path.join(training_args.output_dir, "eval_results_lm.txt")
-        with open(output_eval_file, "w") as writer:
-            logger.info("***** Eval results *****")
-            for key in sorted(result.keys()):
-                logger.info("  %s = %s", key, str(result[key]))
-                writer.write("%s = %s\n" % (key, str(result[key])))
+        if trainer.is_world_master():
+            with open(output_eval_file, "w") as writer:
+                logger.info("***** Eval results *****")
+                for key in sorted(result.keys()):
+                    logger.info("  %s = %s", key, str(result[key]))
+                    writer.write("%s = %s\n" % (key, str(result[key])))

        results.update(result)

    return results


+def _mp_fn(index):
+    # For xla_spawn (TPUs)
+    main()
+
+
 if __name__ == "__main__":
    main()
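Note on the evaluation hunk above: the script now reads the loss from the `eval_loss` key rather than `loss`, and derives perplexity as the exponential of that mean cross-entropy loss. A minimal sketch of that arithmetic follows. The helper name `perplexity_from_eval` and the sample dictionary are hypothetical and stand in for whatever `trainer.evaluate()` actually returns.

```python
import math


def perplexity_from_eval(eval_output, loss_key="eval_loss"):
    """Convert a mean cross-entropy loss (in nats) into perplexity."""
    # Hypothetical helper: mirrors `math.exp(eval_output["eval_loss"])` from the diff.
    return math.exp(eval_output[loss_key])


# Made-up evaluation output; a loss of ln(20) corresponds to a perplexity of 20,
# the order of magnitude the README quotes for GPT-2 on WikiText-2.
eval_output = {"eval_loss": math.log(20)}
result = {"perplexity": perplexity_from_eval(eval_output)}

for key in sorted(result.keys()):
    print("%s = %s" % (key, str(result[key])))
```

Looking up the old `loss` key in a dictionary like this one would raise a `KeyError`, which is presumably why the key name was updated alongside the trainer change.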
--- a/examples/movement-pruning/README.md
+++ b/examples/movement-pruning/README.md
@@ -0,0 +1,183 @@
+# Movement Pruning: Adaptive Sparsity by Fine-Tuning
+
+*Magnitude pruning is a widely used strategy for reducing model size in pure supervised learning; however, it is less effective in the transfer learning regime that has become standard for state-of-the-art natural language processing applications. We propose the use of *movement pruning*, a simple, deterministic first-order weight pruning method that is more adaptive to pretrained model fine-tuning. Experiments show that when pruning large pretrained language models, movement pruning shows significant improvements in high-sparsity regimes. When combined with distillation, the approach achieves minimal accuracy loss with down to only 3% of the model parameters:*
+
+| Fine-pruning+Distillation<br>(Teacher=BERT-base fine-tuned) | BERT base<br>fine-tuned | Remaining<br>Weights (%) | Magnitude Pruning      | L0 Regularization      | Movement Pruning       | Soft Movement Pruning          |
+| :---:                                                       | :---:                   | :---:                    | :---:                  | :---:                  | :---:                  | :---:                          |
+| SQuAD - Dev<br>EM/F1                                        | 80.4/88.1               | 10%<br>3%                | 70.2/80.1<br>45.5/59.6 | 72.4/81.9<br>64.3/75.8 | 75.6/84.3<br>67.5/78.0 | **76.6/84.9**<br>**72.7/82.3** |
+| MNLI - Dev<br>acc/MM acc                                    | 84.5/84.9               | 10%<br>3%                | 78.3/79.3<br>69.4/70.6 | 78.7/79.7<br>76.0/76.2 | 80.1/80.4<br>76.5/77.4 | **81.2/81.8**<br>**79.5/80.1** |
+| QQP - Dev<br>acc/F1                                         | 91.4/88.4               | 10%<br>3%                | 79.8/65.0<br>72.4/57.8 | 88.1/82.8<br>87.0/81.9 | 89.7/86.2<br>86.1/81.5 | **90.2/86.8**<br>**89.1/85.5** |
+
+This page contains information on how to fine-prune pre-trained models such as `BERT` to obtain extremely sparse models with movement pruning. In contrast to magnitude pruning which selects weights that are far from 0, movement pruning retains weights that are moving away from 0.
+
+For more information, we invite you to check out [our paper](https://arxiv.org/abs/2005.07683).
+You can also have a look at this fun *Explain Like I'm Five* introductory [slide deck](https://www.slideshare.net/VictorSanh/movement-pruning-explain-like-im-five-234205241).
+
+<div align="center">
+<img src="https://www.seekpng.com/png/detail/166-1669328_how-to-make-emmental-cheese-at-home-icooker.png" width="400">
+</div>
+
+## Extreme sparsity and efficient storage
+
+One promise of extreme pruning is to obtain extremely small models that can be easily sent to (and stored on) edge devices. By setting weights to 0., we reduce the amount of information we need to store, and thus decrease the memory size. We are able to obtain extremely sparse fine-pruned models with movement pruning: ~95% of the dense performance with ~5% of total remaining weights in the BERT encoder.
+
+In [this notebook](https://github.com/huggingface/transformers/blob/master/examples/movement-pruning/Saving_PruneBERT.ipynb), we showcase how we can leverage standard tools that exist out-of-the-box to efficiently store an extremely sparse question answering model (only 6% of total remaining weights in the encoder). We are able to reduce the memory size of the encoder **from 340MB (the original dense BERT) to 11MB**, without any additional training of the model (every operation is performed *post fine-pruning*). It is sufficiently small to store on a ['91 floppy disk](https://en.wikipedia.org/wiki/Floptical) 📎!
+
+While movement pruning does not directly optimize for memory footprint (but rather the number of non-null weights), we hypothesize that further memory compression ratios can be achieved with specific quantization-aware training (see for instance [Q8BERT](https://arxiv.org/abs/1910.06188), [And the Bit Goes Down](https://arxiv.org/abs/1907.05686) or [Quant-Noise](https://arxiv.org/abs/2004.07320)).
+
+## Fine-pruned models
+
+As examples, we release two English PruneBERT checkpoints (models fine-pruned from a pre-trained `BERT` checkpoint), one on SQuAD and the other on MNLI.
+
+- **`prunebert-base-uncased-6-finepruned-w-distil-squad`**<br/>
+Pre-trained `BERT-base-uncased` fine-pruned with soft movement pruning on SQuAD v1.1. We use an additional distillation signal from `BERT-base-uncased` finetuned on SQuAD. The encoder counts 6% of total non-null weights and reaches 83.8 F1 score. The model can be accessed with: `pruned_bert = BertForQuestionAnswering.from_pretrained("huggingface/prunebert-base-uncased-6-finepruned-w-distil-squad")`
+- **`prunebert-base-uncased-6-finepruned-w-distil-mnli`**<br/>
+Pre-trained `BERT-base-uncased` fine-pruned with soft movement pruning on MNLI. We use an additional distillation signal from `BERT-base-uncased` finetuned on MNLI. The encoder counts 6% of total non-null weights and reaches 80.7 (matched) accuracy. The model can be accessed with: `pruned_bert = BertForSequenceClassification.from_pretrained("huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli")`
+
+## How to fine-prune?
+
+### Setup
+
+The code relies on the 🤗 Transformers library. In addition to the dependencies listed in the [`examples`](https://github.com/huggingface/transformers/tree/master/examples) folder, you should install a few additional dependencies listed in the `requirements.txt` file: `pip install -r requirements.txt`.
+
+Note that we built our experiments on top of a stabilized version of the library (commit https://github.com/huggingface/transformers/commit/352d5472b0c1dec0f420d606d16747d851b4bda8): we do not guarantee that everything is still compatible with the latest version of the master branch.
+
+### Fine-pruning with movement pruning
+
+Below, we detail how to reproduce the results reported in the paper. We use SQuAD as a running example. Commands (and scripts) can be easily adapted for other tasks.
+
+The following command fine-prunes a pre-trained `BERT-base` on SQuAD using movement pruning towards 15% of remaining weights (85% sparsity). Note that we freeze all the embeddings modules (from their pre-trained value) and only prune the Fully Connected layers in the encoder (12 layers of Transformer Block).
+
+```bash
+SERIALIZATION_DIR=<OUTPUT_DIR>
+SQUAD_DATA=<SQUAD_DATA>
+
+python examples/movement-pruning/masked_run_squad.py \
+    --output_dir $SERIALIZATION_DIR \
+    --data_dir $SQUAD_DATA \
+    --train_file train-v1.1.json \
+    --predict_file dev-v1.1.json \
+    --do_train --do_eval --do_lower_case \
+    --model_type masked_bert \
+    --model_name_or_path bert-base-uncased \
+    --per_gpu_train_batch_size 16 \
+    --warmup_steps 5400 \
+    --num_train_epochs 10 \
+    --learning_rate 3e-5 --mask_scores_learning_rate 1e-2 \
+    --initial_threshold 1 --final_threshold 0.15 \
+    --initial_warmup 1 --final_warmup 2 \
+    --pruning_method topK --mask_init constant --mask_scale 0.
+```
+
+### Fine-pruning with other methods
+
+We can also explore other fine-pruning methods by changing the `pruning_method` parameter:
+
+Soft movement pruning
+```bash
+python examples/movement-pruning/masked_run_squad.py \
+    --output_dir $SERIALIZATION_DIR \
+    --data_dir $SQUAD_DATA \
+    --train_file train-v1.1.json \
+    --predict_file dev-v1.1.json \
+    --do_train --do_eval --do_lower_case \
+    --model_type masked_bert \
+    --model_name_or_path bert-base-uncased \
+    --per_gpu_train_batch_size 16 \
+    --warmup_steps 5400 \
+    --num_train_epochs 10 \
+    --learning_rate 3e-5 --mask_scores_learning_rate 1e-2 \
+    --initial_threshold 0 --final_threshold 0.1 \
+    --initial_warmup 1 --final_warmup 2 \
+    --pruning_method sigmoied_threshold --mask_init constant --mask_scale 0. \
+    --regularization l1 --final_lambda 400.
+```
+
+L0 regularization
+```bash
+python examples/movement-pruning/masked_run_squad.py \
+    --output_dir $SERIALIZATION_DIR \
+    --data_dir $SQUAD_DATA \
+    --train_file train-v1.1.json \
+    --predict_file dev-v1.1.json \
+    --do_train --do_eval --do_lower_case \
+    --model_type masked_bert \
+    --model_name_or_path bert-base-uncased \
+    --per_gpu_train_batch_size 16 \
+    --warmup_steps 5400 \
+    --num_train_epochs 10 \
+    --learning_rate 3e-5 --mask_scores_learning_rate 1e-1 \
+    --initial_threshold 1. --final_threshold 1. \
+    --initial_warmup 1 --final_warmup 1 \
+    --pruning_method l0 --mask_init constant --mask_scale 2.197 \
+    --regularization l0 --final_lambda 125.
+```
+
+Iterative Magnitude Pruning
+```bash
+python examples/movement-pruning/masked_run_squad.py \
+    --output_dir ./dbg \
+    --data_dir examples/distillation/data/squad_data \
+    --train_file train-v1.1.json \
+    --predict_file dev-v1.1.json \
+    --do_train --do_eval --do_lower_case \
+    --model_type masked_bert \
+    --model_name_or_path bert-base-uncased \
+    --per_gpu_train_batch_size 16 \
+    --warmup_steps 5400 \
+    --num_train_epochs 10 \
+    --learning_rate 3e-5 \
+    --initial_threshold 1 --final_threshold 0.15 \
+    --initial_warmup 1 --final_warmup 2 \
+    --pruning_method magnitude
+```
+
+### After fine-pruning
+
+**Counting parameters**
+
+Regularization-based pruning methods (soft movement pruning and L0 regularization) rely on the penalty to induce sparsity. The multiplicative coefficient controls the sparsity level.
+To obtain the effective sparsity level in the encoder, we simply count the number of activated (non-null) weights:
+
+```bash
+python examples/movement-pruning/count_parameters.py \
+    --pruning_method sigmoied_threshold \
+    --threshold 0.1 \
+    --serialization_dir $SERIALIZATION_DIR
+```
+
+**Pruning once and for all**
+
+Once the model has been fine-pruned, the pruned weights can be set to 0. once and for all (reducing the amount of information to store). In our running experiments, we can convert a `MaskedBertForQuestionAnswering` (a BERT model augmented to enable on-the-fly pruning capabilities) to a standard `BertForQuestionAnswering`:
+
+```bash
+python examples/movement-pruning/bertarize.py \
+    --pruning_method sigmoied_threshold \
+    --threshold 0.1 \
+    --model_name_or_path $SERIALIZATION_DIR
+```
+
+## Hyper-parameters
+
+For reproducibility purposes, we share the detailed results presented in the paper. These [tables](https://docs.google.com/spreadsheets/d/17JgRq_OFFTniUrz6BZWW_87DjFkKXpI1kYDSsseT_7g/edit?usp=sharing) exhaustively describe the individual hyper-parameters used for each data point.
+
+## Inference speed
+
+Early experiments show that even though models fine-pruned with (soft) movement pruning are extremely sparse, they do not benefit from significant improvement in terms of inference speed when using the standard PyTorch inference.
+We are currently benchmarking and exploring inference setups specifically for sparse architectures.
+In particular, hardware manufacturers are announcing devices that will speed up inference for sparse networks considerably.
+
+## Citation
+
+If you find this resource useful, please consider citing the following paper:
+
+```
+@article{sanh2020movement,
+    title={Movement Pruning: Adaptive Sparsity by Fine-Tuning},
+    author={Victor Sanh and Thomas Wolf and Alexander M. Rush},
+    year={2020},
+    eprint={2005.07683},
+    archivePrefix={arXiv},
+    primaryClass={cs.CL}
+}
+```
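Note on the README added above: it contrasts magnitude pruning, which keeps the weights furthest from 0, with movement pruning, which keeps the weights moving away from 0, and its commands use `--pruning_method topK` with a final threshold expressed as a fraction of remaining weights. The sketch below illustrates only the top-k selection step under stated assumptions. The function `topk_mask`, the toy weights, and the movement scores are all hypothetical illustrations; in the actual method the movement scores are learned during fine-tuning, and this is not the code in `masked_run_squad.py`.

```python
def topk_mask(scores, keep_fraction):
    """Return a 0/1 mask that keeps the `keep_fraction` highest-scoring entries."""
    n_keep = int(round(keep_fraction * len(scores)))
    # Indices ordered by descending score (ties keep their original order).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:n_keep])
    return [1 if i in keep else 0 for i in range(len(scores))]


# Toy weight vector (illustrative values only).
weights = [0.9, -0.05, 0.4, -0.7, 0.02, 0.3]

# Magnitude pruning: the score of a weight is its absolute value.
magnitude_scores = [abs(w) for w in weights]

# Movement pruning: scores are learned alongside the weights. These numbers are
# made up to show that a small weight can outrank a large one.
movement_scores = [-0.2, 0.8, 0.1, 0.5, 0.6, -0.4]

magnitude_mask = topk_mask(magnitude_scores, keep_fraction=0.5)
movement_mask = topk_mask(movement_scores, keep_fraction=0.5)

print("magnitude mask:", magnitude_mask)
print("movement mask: ", movement_mask)
```

With the same budget of remaining weights, the two criteria can select different entries: here the small weights at positions 1 and 4 survive under the made-up movement scores but not under magnitude scoring.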
--- a/examples/movement-pruning/Saving_PruneBERT.ipynb
+++ b/examples/movement-pruning/Saving_PruneBERT.ipynb
@@ -0,0 +1,612 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Saving PruneBERT\n",
+    "\n",
+    "\n",
+    "This notebook aims at showcasing how we can leverage standard tools to save (and load) an extremely sparse model fine-pruned with [movement pruning](https://arxiv.org/abs/2005.07683) (or any other unstructured pruning method).\n",
+    "\n",
+    "In this example, we used BERT (base-uncased), but the procedure described here is not specific to BERT and can be applied to a large variety of models.\n",
+    "\n",
+    "We first obtain an extremely sparse model by fine-pruning with movement pruning on SQuAD v1.1. We then use the following combination of standard tools:\n",
+    "- We reduce the precision of the model with Int8 dynamic quantization using [PyTorch implementation](https://pytorch.org/tutorials/intermediate/dynamic_quantization_bert_tutorial.html). We only quantized the Fully Connected Layers.\n",
+    "- Sparse quantized matrices are converted into the [Compressed Sparse Row format](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html).\n",
+    "- We use HDF5 with `gzip` compression to store the weights.\n",
+    "\n",
+    "We experiment with a question answering model with only 6% of total remaining weights in the encoder (previously obtained with movement pruning). **We are able to reduce the memory size of the encoder from 340MB (original dense BERT) to 11MB**, which fits on a ['91 floppy disk](https://en.wikipedia.org/wiki/Floptical)!\n",
+    "\n",
+    "<img src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/0/00/Floptical_disk_21MB.jpg/440px-Floptical_disk_21MB.jpg\" width=\"200\">"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Includes\n",
+    "\n",
+    "import h5py\n",
+    "import os\n",
+    "import json\n",
+    "from collections import OrderedDict\n",
+    "\n",
+    "from scipy import sparse\n",
+    "import numpy as np\n",
+    "\n",
+    "import torch\n",
+    "from torch import nn\n",
+    "\n",
+    "from transformers import *\n",
+    "\n",
+    "os.chdir('../../')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Saving"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Dynamic quantization induces little or no loss of performance while significantly reducing the memory footprint."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Load fine-pruned model and quantize the model\n",
+    "\n",
+    "model_path = \"serialization_dir/bert-base-uncased/92/squad/l1\"\n",
+    "model_name = \"bertarized_l1_with_distil_0._0.1_1_2_l1_1100._3e-5_1e-2_sigmoied_threshold_constant_0._10_epochs\"\n",
+    "\n",
+    "model = BertForQuestionAnswering.from_pretrained(os.path.join(model_path, model_name))\n",
+    "model.to('cpu')\n",
+    "\n",
+    "quantized_model = torch.quantization.quantize_dynamic(\n",
+    "                    model=model,\n",
+    "                    qconfig_spec = {\n",
+    "                        torch.nn.Linear : torch.quantization.default_dynamic_qconfig,\n",
+    "                    },\n",
+    "                    dtype=torch.qint8,\n",
+    "                )\n",
+    "# print(quantized_model)\n",
+    "\n",
+    "qtz_st = quantized_model.state_dict()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Saving the original (encoder + classifier) in the standard torch.save format\n",
+    "\n",
+    "dense_st = {name: param for name, param in model.state_dict().items() \n",
+    "                            if \"embedding\" not in name and \"pooler\" not in name}\n",
+    "torch.save(dense_st, 'dbg/dense_squad.pt',)\n",
+    "dense_mb_size = os.path.getsize(\"dbg/dense_squad.pt\")\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Decompose quantization for bert.encoder.layer.0.attention.self.query._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.0.attention.self.key._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.0.attention.self.value._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.0.attention.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.0.intermediate.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.0.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.1.attention.self.query._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.1.attention.self.key._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.1.attention.self.value._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.1.attention.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.1.intermediate.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.1.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.2.attention.self.query._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.2.attention.self.key._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.2.attention.self.value._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.2.attention.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.2.intermediate.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.2.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.3.attention.self.query._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.3.attention.self.key._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.3.attention.self.value._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.3.attention.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.3.intermediate.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.3.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.4.attention.self.query._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.4.attention.self.key._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.4.attention.self.value._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.4.attention.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.4.intermediate.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.4.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.5.attention.self.query._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.5.attention.self.key._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.5.attention.self.value._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.5.attention.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.5.intermediate.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.5.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.6.attention.self.query._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.6.attention.self.key._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.6.attention.self.value._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.6.attention.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.6.intermediate.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.6.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.7.attention.self.query._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.7.attention.self.key._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.7.attention.self.value._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.7.attention.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.7.intermediate.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.7.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.8.attention.self.query._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.8.attention.self.key._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.8.attention.self.value._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.8.attention.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.8.intermediate.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.8.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.9.attention.self.query._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.9.attention.self.key._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.9.attention.self.value._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.9.attention.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.9.intermediate.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.9.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.10.attention.self.query._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.10.attention.self.key._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.10.attention.self.value._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.10.attention.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.10.intermediate.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.10.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.11.attention.self.query._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.11.attention.self.key._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.11.attention.self.value._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.11.attention.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.11.intermediate.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.11.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.pooler.dense._packed_params.weight\n",
+      "Decompose quantization for qa_outputs._packed_params.weight\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Elementary representation: we decompose the quantized tensors into (scale, zero_point, int_repr).\n",
+    "# See https://pytorch.org/docs/stable/quantization.html\n",
+    "\n",
+    "# We further leverage the fact that int_repr is a sparse matrix to optimize the storage: we decompose int_repr into\n",
+    "# its CSR representation (data, indptr, indices).\n",
+    "\n",
+    "elementary_qtz_st = {}\n",
+    "for name, param in qtz_st.items():\n",
+    "    if param.is_quantized:\n",
+    "        print(\"Decompose quantization for\", name)\n",
+    "        # We need to extract the scale, the zero_point and the int_repr for the quantized tensor and modules\n",
+    "        scale = param.q_scale()                                # torch.tensor(1,) - float32\n",
+    "        zero_point = param.q_zero_point()                      # torch.tensor(1,) - int32\n",
+    "        elementary_qtz_st[f\"{name}.scale\"] = scale\n",
+    "        elementary_qtz_st[f\"{name}.zero_point\"] = zero_point\n",
+    "\n",
+    "        # We assume the int_repr is sparse and compute its CSR representation\n",
+    "        # Only the FCs in the encoder are actually sparse\n",
+    "        int_repr = param.int_repr()                         # torch.tensor(nb_rows, nb_columns) - int8\n",
+    "        int_repr_cs = sparse.csr_matrix(int_repr)           # scipy.sparse.csr.csr_matrix\n",
+    "\n",
+    "        elementary_qtz_st[f\"{name}.int_repr.data\"] = int_repr_cs.data                  # np.array int8\n",
+    "        elementary_qtz_st[f\"{name}.int_repr.indptr\"] = int_repr_cs.indptr              # np.array int32\n",
+    "        assert max(int_repr_cs.indices) < 65535 # If not, we shall fall back to int32\n",
+    "        elementary_qtz_st[f\"{name}.int_repr.indices\"] = np.uint16(int_repr_cs.indices) # np.array uint16\n",
+    "        elementary_qtz_st[f\"{name}.int_repr.shape\"] = int_repr_cs.shape                # tuple(int, int)\n",
+    "    else:\n",
+    "        elementary_qtz_st[name] = param\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Encoder Size (MB) - Sparse & Quantized - `torch.save`: 21.29\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Saving the pruned (encoder + classifier) in the standard torch.save format\n",
+    "\n",
+    "dense_optimized_st = {name: param for name, param in elementary_qtz_st.items() \n",
+    "                                    if \"embedding\" not in name and \"pooler\" not in name}\n",
+    "torch.save(dense_optimized_st, 'dbg/dense_squad_optimized.pt',)\n",
+    "print(\"Encoder Size (MB) - Sparse & Quantized - `torch.save`:\",\n",
+    "      round(os.path.getsize(\"dbg/dense_squad_optimized.pt\")/1e6, 2))\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Skip bert.embeddings.word_embeddings.weight\n",
+      "Skip bert.embeddings.position_embeddings.weight\n",
+      "Skip bert.embeddings.token_type_embeddings.weight\n",
+      "Skip bert.embeddings.LayerNorm.weight\n",
+      "Skip bert.embeddings.LayerNorm.bias\n",
+      "Skip bert.pooler.dense.scale\n",
+      "Skip bert.pooler.dense.zero_point\n",
+      "Skip bert.pooler.dense._packed_params.weight.scale\n",
+      "Skip bert.pooler.dense._packed_params.weight.zero_point\n",
+      "Skip bert.pooler.dense._packed_params.weight.int_repr.data\n",
+      "Skip bert.pooler.dense._packed_params.weight.int_repr.indptr\n",
+      "Skip bert.pooler.dense._packed_params.weight.int_repr.indices\n",
+      "Skip bert.pooler.dense._packed_params.weight.int_repr.shape\n",
+      "Skip bert.pooler.dense._packed_params.bias\n",
+      "\n",
+      "Encoder Size (MB) - Dense:              340.25\n",
+      "Encoder Size (MB) - Sparse & Quantized: 11.27\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Save the decomposed state_dict with an HDF5 file\n",
+    "# Saving only the encoder + QA Head\n",
+    "\n",
+    "with h5py.File('dbg/squad_sparse.h5','w') as hf:\n",
+    "    for name, param in elementary_qtz_st.items():\n",
+    "        if \"embedding\" in name:\n",
+    "            print(f\"Skip {name}\")\n",
+    "            continue\n",
+    "\n",
+    "        if \"pooler\" in name:\n",
+    "            print(f\"Skip {name}\")\n",
+    "            continue\n",
+    "\n",
+    "        if type(param) == torch.Tensor:\n",
+    "            if param.numel() == 1:\n",
+    "                # module scale\n",
+    "                # module zero_point\n",
+    "                hf.attrs[name] = param\n",
+    "                continue\n",
+    "\n",
+    "            if param.requires_grad:\n",
+    "                # LayerNorm\n",
+    "                param = param.detach().numpy()\n",
+    "            hf.create_dataset(name, data=param, compression=\"gzip\", compression_opts=9)\n",
+    "\n",
+    "        elif type(param) == float or type(param) == int or type(param) == tuple:\n",
+    "            # float - tensor _packed_params.weight.scale\n",
+    "            # int   - tensor _packed_params.weight.zero_point\n",
+    "            # tuple - tensor _packed_params.weight.shape\n",
+    "            hf.attrs[name] = param\n",
+    "\n",
+    "        else:\n",
+    "            hf.create_dataset(name, data=param, compression=\"gzip\", compression_opts=9)\n",
+    "\n",
+    "\n",
+    "with open('dbg/metadata.json', 'w') as f:\n",
+    "    f.write(json.dumps(qtz_st._metadata))  \n",
+    "\n",
+    "size = os.path.getsize(\"dbg/squad_sparse.h5\") + os.path.getsize(\"dbg/metadata.json\")\n",
+    "print(\"\")\n",
+    "print(\"Encoder Size (MB) - Dense:             \", round(dense_mb_size/1e6, 2))\n",
+    "print(\"Encoder Size (MB) - Sparse & Quantized:\", round(size/1e6, 2))\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n",
+      "Size (MB): 99.39\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Save the decomposed state_dict to HDF5 storage\n",
+    "# Save everything in the architecture (embedding + encoder + QA Head)\n",
+    "\n",
+    "with h5py.File('dbg/squad_sparse_with_embs.h5','w') as hf:\n",
+    "    for name, param in elementary_qtz_st.items():\n",
+    "#         if \"embedding\" in name:\n",
+    "#             print(f\"Skip {name}\")\n",
+    "#             continue\n",
+    "\n",
+    "#         if \"pooler\" in name:\n",
+    "#             print(f\"Skip {name}\")\n",
+    "#             continue\n",
+    "\n",
+    "        if type(param) == torch.Tensor:\n",
+    "            if param.numel() == 1:\n",
+    "                # module scale\n",
+    "                # module zero_point\n",
+    "                hf.attrs[name] = param\n",
+    "                continue\n",
+    "\n",
+    "            if param.requires_grad:\n",
+    "                # LayerNorm\n",
+    "                param = param.detach().numpy()\n",
+    "            hf.create_dataset(name, data=param, compression=\"gzip\", compression_opts=9)\n",
+    "\n",
+    "        elif type(param) == float or type(param) == int or type(param) == tuple:\n",
+    "            # float - tensor _packed_params.weight.scale\n",
+    "            # int   - tensor _packed_params.weight.zero_point\n",
+    "            # tuple - tensor _packed_params.weight.shape\n",
+    "            hf.attrs[name] = param\n",
+    "\n",
+    "        else:\n",
+    "            hf.create_dataset(name, data=param, compression=\"gzip\", compression_opts=9)\n",
+    "\n",
+    "\n",
+    "with open('dbg/metadata.json', 'w') as f:\n",
+    "    f.write(json.dumps(qtz_st._metadata))   \n",
+    "\n",
+    "size = os.path.getsize(\"dbg/squad_sparse_with_embs.h5\") + os.path.getsize(\"dbg/metadata.json\")\n",
+    "print('\\nSize (MB):', round(size/1e6, 2))\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Loading"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Reconstruct the elementary state dict\n",
+    "\n",
+    "reconstructed_elementary_qtz_st = {}\n",
+    "\n",
+    "hf = h5py.File('dbg/squad_sparse_with_embs.h5','r')\n",
+    "\n",
+    "for attr_name, attr_param in hf.attrs.items():\n",
+    "    if 'shape' in attr_name:\n",
+    "        attr_param = tuple(attr_param)\n",
+    "    elif \".scale\" in attr_name:\n",
+    "        if \"_packed_params\" in attr_name:\n",
+    "            attr_param = float(attr_param)\n",
+    "        else:\n",
+    "            attr_param = torch.tensor(attr_param)\n",
+    "    elif \".zero_point\" in attr_name:\n",
+    "        if \"_packed_params\" in attr_name:\n",
+    "            attr_param = int(attr_param)\n",
+    "        else:\n",
+    "            attr_param = torch.tensor(attr_param)\n",
+    "    reconstructed_elementary_qtz_st[attr_name] = attr_param\n",
+    "    # print(f\"Unpack {attr_name}\")\n",
+    "    \n",
+    "# Get the tensors/arrays\n",
+    "for data_name, data_param in hf.items():\n",
+    "    if \"LayerNorm\" in data_name or \"_packed_params.bias\" in data_name:\n",
+    "        reconstructed_elementary_qtz_st[data_name] = torch.from_numpy(np.array(data_param))\n",
+    "    elif \"embedding\" in data_name:\n",
+    "        reconstructed_elementary_qtz_st[data_name] = torch.from_numpy(np.array(data_param))\n",
+    "    else: # _packed_params.weight.int_repr.data, _packed_params.weight.int_repr.indices and _packed_params.weight.int_repr.indptr\n",
+    "        data_param = np.array(data_param)\n",
+    "        if \"indices\" in data_name:\n",
+    "            data_param = np.array(data_param, dtype=np.int32)\n",
+    "        reconstructed_elementary_qtz_st[data_name] = data_param\n",
+    "    # print(f\"Unpack {data_name}\")\n",
+    "    \n",
+    "\n",
+    "hf.close()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Sanity checks\n",
+    "\n",
+    "for name, param in reconstructed_elementary_qtz_st.items():\n",
+    "    assert name in elementary_qtz_st\n",
+    "for name, param in elementary_qtz_st.items():\n",
+    "    assert name in reconstructed_elementary_qtz_st, name\n",
+    "\n",
+    "for name, param in reconstructed_elementary_qtz_st.items():\n",
+    "    assert type(param) == type(elementary_qtz_st[name]), name\n",
+    "    if type(param) == torch.Tensor:\n",
+    "        assert torch.all(torch.eq(param, elementary_qtz_st[name])), name\n",
+    "    elif type(param) == np.ndarray:\n",
+    "        assert (param == elementary_qtz_st[name]).all(), name\n",
+    "    else:\n",
+    "        assert param == elementary_qtz_st[name], name"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Re-assemble the sparse int_repr from the CSR format\n",
+    "\n",
+    "reconstructed_qtz_st = {}\n",
+    "\n",
+    "for name, param in reconstructed_elementary_qtz_st.items():\n",
+    "    if \"weight.int_repr.indptr\" in name:\n",
+    "        prefix_ = name[:-16]\n",
+    "        data    = reconstructed_elementary_qtz_st[f\"{prefix_}.int_repr.data\"]\n",
+    "        indptr  = reconstructed_elementary_qtz_st[f\"{prefix_}.int_repr.indptr\"]\n",
+    "        indices = reconstructed_elementary_qtz_st[f\"{prefix_}.int_repr.indices\"]\n",
+    "        shape   = reconstructed_elementary_qtz_st[f\"{prefix_}.int_repr.shape\"]\n",
+    "\n",
+    "        int_repr = sparse.csr_matrix(arg1=(data, indices, indptr),\n",
+    "                                     shape=shape)\n",
+    "        int_repr = torch.tensor(int_repr.todense())\n",
+    "\n",
+    "        scale = reconstructed_elementary_qtz_st[f\"{prefix_}.scale\"]\n",
+    "        zero_point = reconstructed_elementary_qtz_st[f\"{prefix_}.zero_point\"]\n",
+    "        weight = torch._make_per_tensor_quantized_tensor(int_repr,\n",
+    "                                                         scale,\n",
+    "                                                         zero_point)\n",
+    "\n",
+    "        reconstructed_qtz_st[f\"{prefix_}\"] = weight\n",
+    "    elif \"int_repr.data\" in name or \"int_repr.shape\" in name or \"int_repr.indices\" in name or \\\n",
+    "         \"weight.scale\" in name or \"weight.zero_point\" in name:\n",
+    "        continue\n",
+    "    else:\n",
+    "        reconstructed_qtz_st[name] = param\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Sanity checks\n",
+    "\n",
+    "for name, param in reconstructed_qtz_st.items():\n",
+    "    assert name in qtz_st\n",
+    "for name, param in qtz_st.items():\n",
+    "    assert name in reconstructed_qtz_st, name\n",
+    "\n",
+    "for name, param in reconstructed_qtz_st.items():\n",
+    "    assert type(param) == type(qtz_st[name]), name\n",
+    "    if type(param) == torch.Tensor:\n",
+    "        assert torch.all(torch.eq(param, qtz_st[name])), name\n",
+    "    elif type(param) == np.ndarray:\n",
+    "        assert (param == qtz_st[name]).all(), name\n",
+    "    else:\n",
+    "        assert param == qtz_st[name], name"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Sanity checks"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "<All keys matched successfully>"
+      ]
+     },
+     "execution_count": 12,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# Load the re-constructed state dict into a model\n",
+    "\n",
+    "dummy_model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')\n",
+    "dummy_model.to('cpu')\n",
+    "\n",
+    "reconstructed_qtz_model = torch.quantization.quantize_dynamic(\n",
+    "                            model=dummy_model,\n",
+    "                            qconfig_spec = None,\n",
+    "                            dtype=torch.qint8,\n",
+    "                          )\n",
+    "\n",
+    "reconstructed_qtz_st = OrderedDict(reconstructed_qtz_st)\n",
+    "with open('dbg/metadata.json', 'r') as read_file:\n",
+    "    metadata = json.loads(read_file.read())\n",
+    "reconstructed_qtz_st._metadata = metadata\n",
+    "\n",
+    "reconstructed_qtz_model.load_state_dict(reconstructed_qtz_st)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Sanity check passed\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Sanity checks on the inference\n",
+    "\n",
+    "N = 32\n",
+    "\n",
+    "for _ in range(25):\n",
+    "    inputs = torch.randint(low=0, high=30000, size=(N, 128))\n",
+    "    mask = torch.ones(size=(N, 128))\n",
+    "\n",
+    "    y_reconstructed = reconstructed_qtz_model(input_ids=inputs, attention_mask=mask)[0]\n",
+    "    y               = quantized_model(input_ids=inputs, attention_mask=mask)[0]\n",
+    "    \n",
+    "    assert torch.all(torch.eq(y, y_reconstructed))\n",
+    "print(\"Sanity check passed\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.8"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
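The notebook above stores each quantized weight as `(scale, zero_point)` plus the CSR triplet `(data, indptr, indices)` of its `int_repr`, then rebuilds the dense matrix at load time. As a minimal sketch of that round trip, the pure-Python version below mirrors the layout without `scipy` or `torch`. The helper names `to_csr` and `from_csr` are hypothetical and are not part of the notebook, which relies on `scipy.sparse.csr_matrix` instead.

```python
from typing import List, Tuple

Matrix = List[List[int]]


def to_csr(dense: Matrix) -> Tuple[List[int], List[int], List[int], Tuple[int, int]]:
    """Decompose a dense row-major matrix into (data, indices, indptr, shape)."""
    data: List[int] = []      # non-zero values, row by row
    indices: List[int] = []   # column index of each non-zero value
    indptr: List[int] = [0]   # indptr[r]:indptr[r + 1] slices row r in data/indices
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)
                indices.append(col)
        indptr.append(len(data))
    n_cols = len(dense[0]) if dense else 0
    return data, indices, indptr, (len(dense), n_cols)


def from_csr(data: List[int], indices: List[int], indptr: List[int],
             shape: Tuple[int, int]) -> Matrix:
    """Rebuild the dense matrix from its CSR decomposition."""
    n_rows, n_cols = shape
    dense = [[0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for k in range(indptr[r], indptr[r + 1]):
            dense[r][indices[k]] = data[k]
    return dense


# Toy int8-like representation of a pruned weight matrix (mostly zeros).
int_repr = [[0, 0, 7, 0],
            [0, 0, 0, 0],
            [-3, 0, 0, 5]]
data, indices, indptr, shape = to_csr(int_repr)
restored = from_csr(data, indices, indptr, shape)
```

Only the non-zero entries and their column positions are kept, which is why the storage saving grows with the sparsity of the encoder weights.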
--- a/examples/movement-pruning/bertarize.py
+++ b/examples/movement-pruning/bertarize.py
@@ -0,0 +1,132 @@
+# Copyright 2020-present, the HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Once a model has been fine-pruned, the weights that are masked during the forward pass can be pruned once and for all.
+For instance, once a :class:`~emmental.MaskedBertForSequenceClassification` model is trained, it can be saved (and then loaded)
+as a standard :class:`~transformers.BertForSequenceClassification`.
+"""
+
+import argparse
+import os
+import shutil
+
+import torch
+
+from emmental.modules import MagnitudeBinarizer, ThresholdBinarizer, TopKBinarizer
+
+
+def main(args):
+    pruning_method = args.pruning_method
+    threshold = args.threshold
+
+    model_name_or_path = args.model_name_or_path.rstrip("/")
+    target_model_path = args.target_model_path
+
+    print(f"Load fine-pruned model from {model_name_or_path}")
+    model = torch.load(os.path.join(model_name_or_path, "pytorch_model.bin"))
+    pruned_model = {}
+
+    for name, tensor in model.items():
+        if "embeddings" in name or "LayerNorm" in name or "pooler" in name:
+            pruned_model[name] = tensor
+            print(f"Copied layer {name}")
+        elif "classifier" in name or "qa_output" in name:
+            pruned_model[name] = tensor
+            print(f"Copied layer {name}")
+        elif "bias" in name:
+            pruned_model[name] = tensor
+            print(f"Copied layer {name}")
+        else:
+            if pruning_method == "magnitude":
+                mask = MagnitudeBinarizer.apply(inputs=tensor, threshold=threshold)
+                pruned_model[name] = tensor * mask
+                print(f"Pruned layer {name}")
+            elif pruning_method == "topK":
+                if "mask_scores" in name:
+                    continue
+                prefix_ = name[:-6]
+                scores = model[f"{prefix_}mask_scores"]
+                mask = TopKBinarizer.apply(scores, threshold)
+                pruned_model[name] = tensor * mask
+                print(f"Pruned layer {name}")
+            elif pruning_method == "sigmoied_threshold":
+                if "mask_scores" in name:
+                    continue
+                prefix_ = name[:-6]
+                scores = model[f"{prefix_}mask_scores"]
+                mask = ThresholdBinarizer.apply(scores, threshold, True)
+                pruned_model[name] = tensor * mask
+                print(f"Pruned layer {name}")
+            elif pruning_method == "l0":
+                if "mask_scores" in name:
+                    continue
+                prefix_ = name[:-6]
+                scores = model[f"{prefix_}mask_scores"]
+                l, r = -0.1, 1.1
+                s = torch.sigmoid(scores)
+                s_bar = s * (r - l) + l
+                mask = s_bar.clamp(min=0.0, max=1.0)
+                pruned_model[name] = tensor * mask
+                print(f"Pruned layer {name}")
+            else:
+                raise ValueError("Unknown pruning method")
+
+    if target_model_path is None:
+        target_model_path = os.path.join(
+            os.path.dirname(model_name_or_path), f"bertarized_{os.path.basename(model_name_or_path)}"
+        )
+
+    if not os.path.isdir(target_model_path):
+        shutil.copytree(model_name_or_path, target_model_path)
+        print(f"\nCreated folder {target_model_path}")
+
+    torch.save(pruned_model, os.path.join(target_model_path, "pytorch_model.bin"))
+    print("\nPruned model saved! See you later!")
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+
+    parser.add_argument(
+        "--pruning_method",
+        choices=["l0", "magnitude", "topK", "sigmoied_threshold"],
+        type=str,
+        required=True,
+        help="Pruning Method (l0 = L0 regularization, magnitude = Magnitude pruning, topK = Movement pruning, sigmoied_threshold = Soft movement pruning)",
+    )
+    parser.add_argument(
+        "--threshold",
+        type=float,
+        required=False,
+        help="For `magnitude` and `topK`, it is the level of remaining weights (in %%) in the fine-pruned model. "
+        "For `sigmoied_threshold`, it is the threshold \\tau against which the (sigmoied) scores are compared. "
+        "Not needed for `l0`.",
+    )
+    parser.add_argument(
+        "--model_name_or_path",
+        type=str,
+        required=True,
+        help="Folder containing the model that was previously fine-pruned",
+    )
+    parser.add_argument(
+        "--target_model_path",
+        default=None,
+        type=str,
+        required=False,
+        help="Folder in which to save the pruned model (defaults to a `bertarized_`-prefixed folder next to `model_name_or_path`)",
+    )
+
+    args = parser.parse_args()
+
+    main(args)
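For the `l0` branch of `bertarize.py`, the mask is a sigmoid of the scores, stretched to the interval `(l, r) = (-0.1, 1.1)` and clamped to `[0, 1]`, then multiplied into the weights. The sketch below reproduces that arithmetic on plain Python lists, assuming flat lists in place of tensors. The function names `l0_mask` and `apply_mask` are hypothetical and exist only for illustration.

```python
import math
from typing import List


def l0_mask(scores: List[float], l: float = -0.1, r: float = 1.1) -> List[float]:
    """Stretch sigmoid(scores) to (l, r), then clamp each value to [0, 1]."""
    mask = []
    for score in scores:
        s = 1.0 / (1.0 + math.exp(-score))      # sigmoid
        s_bar = s * (r - l) + l                 # stretch to (l, r)
        mask.append(min(1.0, max(0.0, s_bar)))  # clamp to [0, 1]
    return mask


def apply_mask(weights: List[float], mask: List[float]) -> List[float]:
    """Element-wise product, as in `tensor * mask`."""
    return [w * m for w, m in zip(weights, mask)]


scores = [-10.0, 0.0, 10.0]
weights = [2.0, 4.0, -6.0]
mask = l0_mask(scores)
pruned = apply_mask(weights, mask)
```

Strongly negative scores clamp to exactly 0 (the weight is pruned), strongly positive scores clamp to exactly 1 (the weight is kept unchanged), and scores in between rescale the weight.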
--- a/examples/movement-pruning/counts_parameters.py
+++ b/examples/movement-pruning/counts_parameters.py
@@ -0,0 +1,92 @@
+# Copyright 2020-present, the HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Count remaining (non-zero) weights in the encoder (i.e. the transformer layers).
+Sparsity and remaining weights levels are equivalent: sparsity % = 100 - remaining weights %.
+"""
+import argparse
+import os
+
+import torch
+
+from emmental.modules import ThresholdBinarizer, TopKBinarizer
+
+
+def main(args):
+    serialization_dir = args.serialization_dir
+    pruning_method = args.pruning_method
+    threshold = args.threshold
+
+    st = torch.load(os.path.join(serialization_dir, "pytorch_model.bin"), map_location="cpu")
+
+    remaining_count = 0  # Number of remaining (not pruned) params in the encoder
+    encoder_count = 0  # Number of params in the encoder
+
+    print("name".ljust(60, " "), "Remaining Weights %", "Remaining Weights")
+    for name, param in st.items():
+        if "encoder" not in name:
+            continue
+
+        if "mask_scores" in name:
+            if pruning_method == "topK":
+                mask_ones = TopKBinarizer.apply(param, threshold).sum().item()
+            elif pruning_method == "sigmoied_threshold":
+                mask_ones = ThresholdBinarizer.apply(param, threshold, True).sum().item()
+            elif pruning_method == "l0":
+                l, r = -0.1, 1.1
+                s = torch.sigmoid(param)
+                s_bar = s * (r - l) + l
+                mask = s_bar.clamp(min=0.0, max=1.0)
+                mask_ones = (mask > 0.0).sum().item()
+            else:
+                raise ValueError("Unknown pruning method")
+            remaining_count += mask_ones
+            print(name.ljust(60, " "), str(round(100 * mask_ones / param.numel(), 3)).ljust(20, " "), str(mask_ones))
+        else:
+            encoder_count += param.numel()
+            if "bias" in name or "LayerNorm" in name:
+                remaining_count += param.numel()
+
+    print("")
+    print("Remaining Weights (global) %: ", 100 * remaining_count / encoder_count)
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+
+    parser.add_argument(
+        "--pruning_method",
+        choices=["l0", "topK", "sigmoied_threshold"],
+        type=str,
+        required=True,
+        help="Pruning Method (l0 = L0 regularization, topK = Movement pruning, sigmoied_threshold = Soft movement pruning)",
+    )
+    parser.add_argument(
+        "--threshold",
+        type=float,
+        required=False,
+        help="For `topK`, it is the level of remaining weights (in %%) in the fine-pruned model. "
+        "For `sigmoied_threshold`, it is the threshold \\tau against which the (sigmoied) scores are compared. "
+        "Not needed for `l0`.",
+    )
+    parser.add_argument(
+        "--serialization_dir",
+        type=str,
+        required=True,
+        help="Folder containing the model that was previously fine-pruned",
+    )
+
+    args = parser.parse_args()
+
+    main(args)
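The relation stated in the module docstring, sparsity % = 100 - remaining weights %, can be illustrated for the `sigmoied_threshold` case with a small sketch. It assumes a flat list of scores for a single matrix and deliberately omits two details of the real script: the floor in `ThresholdBinarizer` that keeps at least 0.5% of the weights, and the biases and LayerNorm parameters that `counts_parameters.py` always counts as remaining. The name `remaining_percentage` is hypothetical.

```python
import math
from typing import List


def remaining_percentage(scores: List[float], threshold: float) -> float:
    """Percentage of weights whose sigmoid(score) is strictly above the threshold."""
    kept = sum(1 for s in scores if 1.0 / (1.0 + math.exp(-s)) > threshold)
    return 100.0 * kept / len(scores)


scores = [-2.0, -0.5, 0.5, 3.0]
remaining = remaining_percentage(scores, threshold=0.5)
sparsity = 100.0 - remaining
```

With a threshold of 0.5, a weight is kept exactly when its score is positive, so half of this toy matrix remains.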
--- a/examples/movement-pruning/emmental/__init__.py
+++ b/examples/movement-pruning/emmental/__init__.py
@@ -0,0 +1,10 @@
+# flake8: noqa
+from .configuration_bert_masked import MaskedBertConfig
+from .modeling_bert_masked import (
+    MaskedBertForMultipleChoice,
+    MaskedBertForQuestionAnswering,
+    MaskedBertForSequenceClassification,
+    MaskedBertForTokenClassification,
+    MaskedBertModel,
+)
+from .modules import *
--- a/examples/movement-pruning/emmental/configuration_bert_masked.py
+++ b/examples/movement-pruning/emmental/configuration_bert_masked.py
@@ -0,0 +1,71 @@
+# coding=utf-8
+# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
+# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" Masked BERT model configuration. It replicates the class `~transformers.BertConfig`
+and adapts it to the specificities of MaskedBert (`pruning_method`, `mask_init` and `mask_scale`)."""
+
+
+import logging
+
+from transformers.configuration_utils import PretrainedConfig
+
+
+logger = logging.getLogger(__name__)
+
+
+class MaskedBertConfig(PretrainedConfig):
+    """
+    A class replicating the `~transformers.BertConfig` with additional parameters for pruning/masking configuration.
+    """
+
+    model_type = "masked_bert"
+
+    def __init__(
+        self,
+        vocab_size=30522,
+        hidden_size=768,
+        num_hidden_layers=12,
+        num_attention_heads=12,
+        intermediate_size=3072,
+        hidden_act="gelu",
+        hidden_dropout_prob=0.1,
+        attention_probs_dropout_prob=0.1,
+        max_position_embeddings=512,
+        type_vocab_size=2,
+        initializer_range=0.02,
+        layer_norm_eps=1e-12,
+        pad_token_id=0,
+        pruning_method="topK",
+        mask_init="constant",
+        mask_scale=0.0,
+        **kwargs
+    ):
+        super().__init__(pad_token_id=pad_token_id, **kwargs)
+
+        self.vocab_size = vocab_size
+        self.hidden_size = hidden_size
+        self.num_hidden_layers = num_hidden_layers
+        self.num_attention_heads = num_attention_heads
+        self.hidden_act = hidden_act
+        self.intermediate_size = intermediate_size
+        self.hidden_dropout_prob = hidden_dropout_prob
+        self.attention_probs_dropout_prob = attention_probs_dropout_prob
+        self.max_position_embeddings = max_position_embeddings
+        self.type_vocab_size = type_vocab_size
+        self.initializer_range = initializer_range
+        self.layer_norm_eps = layer_norm_eps
+        self.pruning_method = pruning_method
+        self.mask_init = mask_init
+        self.mask_scale = mask_scale
--- a/examples/movement-pruning/emmental/modeling_bert_masked.py
+++ b/examples/movement-pruning/emmental/modeling_bert_masked.py
--- a/examples/movement-pruning/emmental/modules/init.py
+++ b/examples/movement-pruning/emmental/modules/init.py
@@ -0,0 +1,3 @@
+# flake8: noqa
+from .binarizer import MagnitudeBinarizer, ThresholdBinarizer, TopKBinarizer
+from .masked_nn import MaskedLinear
--- a/examples/movement-pruning/emmental/modules/binarizer.py
+++ b/examples/movement-pruning/emmental/modules/binarizer.py
@@ -0,0 +1,144 @@
+# coding=utf-8
+# Copyright 2020-present, AllenAI Authors, University of Illinois Urbana-Champaign,
+# Intel Nervana Systems and the HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Binarizers take a (real-valued) matrix as input and produce a binary (values in {0,1}) mask of the same shape.
+"""
+
+import torch
+from torch import autograd
+
+
+class ThresholdBinarizer(autograd.Function):
+    r"""
+    Threshold binarizer.
+    Computes a binary mask M from a real-valued matrix S such that `M_{i,j} = 1` if and only if `S_{i,j} > \tau`
+    where `\tau` is a real-valued threshold.
+
+    Implementation is inspired from:
+        https://github.com/arunmallya/piggyback
+        Piggyback: Adapting a Single Network to Multiple Tasks by Learning to Mask Weights
+        Arun Mallya, Dillon Davis, Svetlana Lazebnik
+    """
+
+    @staticmethod
+    def forward(ctx, inputs: torch.tensor, threshold: float, sigmoid: bool):
+        """
+        Args:
+            inputs (`torch.FloatTensor`)
+                The input matrix from which the binarizer computes the binary mask.
+            threshold (`float`)
+                The threshold value (in R).
+            sigmoid (`bool`)
+                If set to ``True``, we apply the sigmoid function to the `inputs` matrix before comparing to `threshold`.
+                In this case, `threshold` should be a value between 0 and 1.
+        Returns:
+            mask (`torch.FloatTensor`)
+                Binary matrix of the same size as `inputs` acting as a mask (1 - the associated weight is
+                retained, 0 - the associated weight is pruned).
+        """
+        nb_elems = inputs.numel()
+        nb_min = int(0.005 * nb_elems) + 1
+        if sigmoid:
+            mask = (torch.sigmoid(inputs) > threshold).type(inputs.type())
+        else:
+            mask = (inputs > threshold).type(inputs.type())
+        if mask.sum() < nb_min:
+            # We limit the pruning so that at least 0.5% (half a percent) of the weights remain
+            k_threshold = inputs.flatten().kthvalue(max(nb_elems - nb_min, 1)).values
+            mask = (inputs > k_threshold).type(inputs.type())
+        return mask
+
+    @staticmethod
+    def backward(ctx, gradOutput):
+        return gradOutput, None, None
+
+
+class TopKBinarizer(autograd.Function):
+    """
+    Top-k Binarizer.
+    Computes a binary mask M from a real value matrix S such that `M_{i,j} = 1` if and only if `S_{i,j}`
+    is among the k% highest values of S.
+
+    Implementation is inspired by:
+        https://github.com/allenai/hidden-networks
+        What's hidden in a randomly weighted neural network?
+        Vivek Ramanujan*, Mitchell Wortsman*, Aniruddha Kembhavi, Ali Farhadi, Mohammad Rastegari
+    """
+
+    @staticmethod
+    def forward(ctx, inputs: torch.Tensor, threshold: float):
+        """
+        Args:
+            inputs (`torch.FloatTensor`)
+                The input matrix from which the binarizer computes the binary mask.
+            threshold (`float`)
+                The percentage of weights to keep (the rest is pruned).
+                `threshold` is a float between 0 and 1.
+        Returns:
+            mask (`torch.FloatTensor`)
+                Binary matrix of the same size as `inputs` acting as a mask (1 - the associated weight is
+                retained, 0 - the associated weight is pruned).
+        """
+        # Get the subnetwork by sorting the inputs and using the top threshold %
+        mask = inputs.clone()
+        _, idx = inputs.flatten().sort(descending=True)
+        j = int(threshold * inputs.numel())
+
+        # flat_out and mask access the same memory.
+        flat_out = mask.flatten()
+        flat_out[idx[j:]] = 0
+        flat_out[idx[:j]] = 1
+        return mask
+
+    @staticmethod
+    def backward(ctx, gradOutput):
+        return gradOutput, None
+
+
+class MagnitudeBinarizer(object):
+    """
+    Magnitude Binarizer.
+    Computes a binary mask M from a real value matrix S such that `M_{i,j} = 1` if and only if `S_{i,j}`
+    is among the k% highest values of |S| (absolute value).
+
+    Implementation is inspired by https://github.com/NervanaSystems/distiller/blob/2291fdcc2ea642a98d4e20629acb5a9e2e04b4e6/distiller/pruning/automated_gradual_pruner.py#L24
+    """
+
+    @staticmethod
+    def apply(inputs: torch.Tensor, threshold: float):
+        """
+        Args:
+            inputs (`torch.FloatTensor`)
+                The input matrix from which the binarizer computes the binary mask.
+                This input matrix is typically the weight matrix.
+            threshold (`float`)
+                The percentage of weights to keep (the rest is pruned).
+                `threshold` is a float between 0 and 1.
+        Returns:
+            mask (`torch.FloatTensor`)
+                Binary matrix of the same size as `inputs` acting as a mask (1 - the associated weight is
+                retained, 0 - the associated weight is pruned).
+        """
+        # Get the subnetwork by sorting the inputs and using the top threshold %
+        mask = inputs.clone()
+        _, idx = inputs.abs().flatten().sort(descending=True)
+        j = int(threshold * inputs.numel())
+
+        # flat_out and mask access the same memory.
+        flat_out = mask.flatten()
+        flat_out[idx[j:]] = 0
+        flat_out[idx[:j]] = 1
+        return mask
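The top-k and magnitude binarizers above share one selection rule: sort the scores, keep the first `int(threshold * n)` entries, and zero the rest. Below is a minimal sketch of that rule on plain Python lists, without `torch` or autograd. The names `topk_mask` and `magnitude_mask` are hypothetical and not part of this patch, and tie-breaking here follows Python's stable sort, which may differ from `torch.sort`.

```python
def topk_mask(scores, keep_fraction):
    """Return a 0/1 mask that keeps the `keep_fraction` highest scores."""
    n = len(scores)
    # Number of entries to keep, truncated toward zero as in the binarizers above.
    j = int(keep_fraction * n)
    # Indices ordered by descending score; equal scores keep their original order.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    mask = [0.0] * n
    for i in order[:j]:
        mask[i] = 1.0
    return mask


def magnitude_mask(weights, keep_fraction):
    """Same rule, but ranking by absolute value (magnitude pruning)."""
    return topk_mask([abs(w) for w in weights], keep_fraction)


if __name__ == "__main__":
    scores = [0.1, -2.0, 0.7, 0.3]
    print(topk_mask(scores, 0.5))
    print(magnitude_mask(scores, 0.5))
```

Note that because of the truncation in `int(...)`, a small `keep_fraction` on a small matrix can keep no entries at all.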
--- a/examples/movement-pruning/emmental/modules/masked_nn.py
+++ b/examples/movement-pruning/emmental/modules/masked_nn.py
@@ -0,0 +1,107 @@
+# coding=utf-8
+# Copyright 2020-present, the HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Masked Linear module: A fully connected layer that computes an adaptive binary mask on the fly.
+The mask (binary or not) is computed at each forward pass and multiplied against
+the weight matrix to prune a portion of the weights.
+The pruned weight matrix is then multiplied against the inputs (and if necessary, the bias is added).
+"""
+
+import math
+
+import torch
+from torch import nn
+from torch.nn import functional as F
+from torch.nn import init
+
+from .binarizer import MagnitudeBinarizer, ThresholdBinarizer, TopKBinarizer
+
+
+class MaskedLinear(nn.Linear):
+    """
+    Fully connected layer with an adaptive mask computed on the fly.
+    If needed, a score matrix is created to store the importance of each associated weight.
+    """
+
+    def __init__(
+        self,
+        in_features: int,
+        out_features: int,
+        bias: bool = True,
+        mask_init: str = "constant",
+        mask_scale: float = 0.0,
+        pruning_method: str = "topK",
+    ):
+        """
+        Args:
+            in_features (`int`)
+                Size of each input sample
+            out_features (`int`)
+                Size of each output sample
+            bias (`bool`)
+                If set to ``False``, the layer will not learn an additive bias.
+                Default: ``True``
+            mask_init (`str`)
+                The initialization method for the score matrix if a score matrix is needed.
+                Choices: ["constant", "uniform", "kaiming"]
+                Default: ``constant``
+            mask_scale (`float`)
+                The initialization parameter for the chosen initialization method `mask_init`.
+                Default: ``0.``
+            pruning_method (`str`)
+                Method to compute the mask.
+                Choices: ["topK", "threshold", "sigmoied_threshold", "magnitude", "l0"]
+                Default: ``topK``
+        """
+        super(MaskedLinear, self).__init__(in_features=in_features, out_features=out_features, bias=bias)
+        assert pruning_method in ["topK", "threshold", "sigmoied_threshold", "magnitude", "l0"]
+        self.pruning_method = pruning_method
+
+        if self.pruning_method in ["topK", "threshold", "sigmoied_threshold", "l0"]:
+            self.mask_scale = mask_scale
+            self.mask_init = mask_init
+            self.mask_scores = nn.Parameter(torch.Tensor(self.weight.size()))
+            self.init_mask()
+
+    def init_mask(self):
+        if self.mask_init == "constant":
+            init.constant_(self.mask_scores, val=self.mask_scale)
+        elif self.mask_init == "uniform":
+            init.uniform_(self.mask_scores, a=-self.mask_scale, b=self.mask_scale)
+        elif self.mask_init == "kaiming":
+            init.kaiming_uniform_(self.mask_scores, a=math.sqrt(5))
+
+    def forward(self, input: torch.Tensor, threshold: float):
+        # Get the mask
+        if self.pruning_method == "topK":
+            mask = TopKBinarizer.apply(self.mask_scores, threshold)
+        elif self.pruning_method in ["threshold", "sigmoied_threshold"]:
+            sig = "sigmoied" in self.pruning_method
+            mask = ThresholdBinarizer.apply(self.mask_scores, threshold, sig)
+        elif self.pruning_method == "magnitude":
+            mask = MagnitudeBinarizer.apply(self.weight, threshold)
+        elif self.pruning_method == "l0":
+            l, r, b = -0.1, 1.1, 2 / 3
+            if self.training:
+                u = torch.zeros_like(self.mask_scores).uniform_().clamp(0.0001, 0.9999)
+                s = torch.sigmoid((u.log() - (1 - u).log() + self.mask_scores) / b)
+            else:
+                s = torch.sigmoid(self.mask_scores)
+            s_bar = s * (r - l) + l
+            mask = s_bar.clamp(min=0.0, max=1.0)
+        # Mask weights with computed mask
+        weight_thresholded = mask * self.weight
+        # Compute output (linear layer) with masked weights
+        return F.linear(input, weight_thresholded, self.bias)
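To make the evaluation-time `l0` branch of `MaskedLinear.forward` concrete, here is a minimal pure-Python sketch: the score goes through a sigmoid, is stretched to the interval `(l, r) = (-0.1, 1.1)`, and is clamped to `[0, 1]`; the resulting mask then multiplies the weights element-wise before the linear map. The helper names `hard_concrete_gate` and `masked_linear` are hypothetical, and the training-time branch with uniform noise is deliberately left out.

```python
import math


def hard_concrete_gate(score, l=-0.1, r=1.1):
    """Evaluation-time gate: stretch sigmoid(score) to (l, r), then clamp to [0, 1]."""
    s = 1.0 / (1.0 + math.exp(-score))
    s_bar = s * (r - l) + l
    return min(1.0, max(0.0, s_bar))


def masked_linear(x, weight, mask, bias=None):
    """Compute (mask * weight) @ x + bias for a single input vector x.

    `weight` and `mask` are lists of rows with shape [out_features][in_features].
    """
    out = [
        sum(m * w * xi for m, w, xi in zip(row_m, row_w, x))
        for row_w, row_m in zip(weight, mask)
    ]
    if bias is not None:
        out = [o + b for o, b in zip(out, bias)]
    return out


if __name__ == "__main__":
    scores = [[10.0, -10.0], [-10.0, 10.0]]
    mask = [[hard_concrete_gate(s) for s in row] for row in scores]
    weight = [[1.0, 1.0], [3.0, 4.0]]
    print(mask)
    print(masked_linear([1.0, 2.0], weight, mask, bias=[0.5, 0.5]))
```

Because of the stretch, strongly negative scores yield a gate of exactly 0 and strongly positive scores a gate of exactly 1, which is what lets this branch prune weights outright rather than merely shrink them.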
--- a/examples/movement-pruning/masked_run_glue.py
+++ b/examples/movement-pruning/masked_run_glue.py
@@ -0,0 +1,924 @@
+# coding=utf-8
+# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
+# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" Fine-pruning Masked BERT on sequence classification on GLUE."""
+
+import argparse
+import glob
+import json
+import logging
+import os
+import random
+
+import numpy as np
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset
+from torch.utils.data.distributed import DistributedSampler
+from tqdm import tqdm, trange
+
+from emmental import MaskedBertConfig, MaskedBertForSequenceClassification
+from transformers import (
+    WEIGHTS_NAME,
+    AdamW,
+    BertConfig,
+    BertForSequenceClassification,
+    BertTokenizer,
+    get_linear_schedule_with_warmup,
+)
+from transformers import glue_compute_metrics as compute_metrics
+from transformers import glue_convert_examples_to_features as convert_examples_to_features
+from transformers import glue_output_modes as output_modes
+from transformers import glue_processors as processors
+
+
+try:
+    from torch.utils.tensorboard import SummaryWriter
+except ImportError:
+    from tensorboardX import SummaryWriter
+
+
+logger = logging.getLogger(__name__)
+
+MODEL_CLASSES = {
+    "bert": (BertConfig, BertForSequenceClassification, BertTokenizer),
+    "masked_bert": (MaskedBertConfig, MaskedBertForSequenceClassification, BertTokenizer),
+}
+
+
+def set_seed(args):
+    random.seed(args.seed)
+    np.random.seed(args.seed)
+    torch.manual_seed(args.seed)
+    if args.n_gpu > 0:
+        torch.cuda.manual_seed_all(args.seed)
+
+
+def schedule_threshold(
+    step: int,
+    total_step: int,
+    warmup_steps: int,
+    initial_threshold: float,
+    final_threshold: float,
+    initial_warmup: int,
+    final_warmup: int,
+    final_lambda: float,
+):
+    if step <= initial_warmup * warmup_steps:
+        threshold = initial_threshold
+    elif step > (total_step - final_warmup * warmup_steps):
+        threshold = final_threshold
+    else:
+        spars_warmup_steps = initial_warmup * warmup_steps
+        spars_schedu_steps = (final_warmup + initial_warmup) * warmup_steps
+        mul_coeff = 1 - (step - spars_warmup_steps) / (total_step - spars_schedu_steps)
+        threshold = final_threshold + (initial_threshold - final_threshold) * (mul_coeff ** 3)
+    regu_lambda = final_lambda * threshold / final_threshold
+    return threshold, regu_lambda
+
+
+def regularization(model: nn.Module, mode: str):
+    regu, counter = 0, 0
+    for name, param in model.named_parameters():
+        if "mask_scores" in name:
+            if mode == "l1":
+                regu += torch.norm(torch.sigmoid(param), p=1) / param.numel()
+            elif mode == "l0":
+                regu += torch.sigmoid(param - 2 / 3 * np.log(0.1 / 1.1)).sum() / param.numel()
+            else:
+                raise ValueError("Don't know this mode.")
+            counter += 1
+    return regu / counter
+
+
+def train(args, train_dataset, model, tokenizer, teacher=None):
+    """ Train the model """
+    if args.local_rank in [-1, 0]:
+        tb_writer = SummaryWriter(log_dir=args.output_dir)
+
+    args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
+    train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
+    train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)
+
+    if args.max_steps > 0:
+        t_total = args.max_steps
+        args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
+    else:
+        t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
+
+    # Prepare optimizer and schedule (linear warmup and decay)
+    no_decay = ["bias", "LayerNorm.weight"]
+    optimizer_grouped_parameters = [
+        {
+            "params": [p for n, p in model.named_parameters() if "mask_score" in n and p.requires_grad],
+            "lr": args.mask_scores_learning_rate,
+        },
+        {
+            "params": [
+                p
+                for n, p in model.named_parameters()
+                if "mask_score" not in n and p.requires_grad and not any(nd in n for nd in no_decay)
+            ],
+            "lr": args.learning_rate,
+            "weight_decay": args.weight_decay,
+        },
+        {
+            "params": [
+                p
+                for n, p in model.named_parameters()
+                if "mask_score" not in n and p.requires_grad and any(nd in n for nd in no_decay)
+            ],
+            "lr": args.learning_rate,
+            "weight_decay": 0.0,
+        },
+    ]
+
+    optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
+    scheduler = get_linear_schedule_with_warmup(
+        optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total
+    )
+
+    # Check if saved optimizer or scheduler states exist
+    if os.path.isfile(os.path.join(args.model_name_or_path, "optimizer.pt")) and os.path.isfile(
+        os.path.join(args.model_name_or_path, "scheduler.pt")
+    ):
+        # Load in optimizer and scheduler states
+        optimizer.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "optimizer.pt")))
+        scheduler.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "scheduler.pt")))
+
+    if args.fp16:
+        try:
+            from apex import amp
+        except ImportError:
+            raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
+        model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
+
+    # multi-gpu training (should be after apex fp16 initialization)
+    if args.n_gpu > 1:
+        model = torch.nn.DataParallel(model)
+
+    # Distributed training (should be after apex fp16 initialization)
+    if args.local_rank != -1:
+        model = torch.nn.parallel.DistributedDataParallel(
+            model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True,
+        )
+
+    # Train!
+    logger.info("***** Running training *****")
+    logger.info("  Num examples = %d", len(train_dataset))
+    logger.info("  Num Epochs = %d", args.num_train_epochs)
+    logger.info("  Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
+    logger.info(
+        "  Total train batch size (w. parallel, distributed & accumulation) = %d",
+        args.train_batch_size
+        * args.gradient_accumulation_steps
+        * (torch.distributed.get_world_size() if args.local_rank != -1 else 1),
+    )
+    logger.info("  Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
+    logger.info("  Total optimization steps = %d", t_total)
+    # Distillation
+    if teacher is not None:
+        logger.info("  Training with distillation")
+
+    global_step = 0
+    # Global TopK
+    if args.global_topk:
+        threshold_mem = None
+    epochs_trained = 0
+    steps_trained_in_current_epoch = 0
+    # Check if continuing training from a checkpoint
+    if os.path.exists(args.model_name_or_path):
+        # set global_step to global_step of last saved checkpoint from model path
+        try:
+            global_step = int(args.model_name_or_path.split("-")[-1].split("/")[0])
+        except ValueError:
+            global_step = 0
+        epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps)
+        steps_trained_in_current_epoch = global_step % (len(train_dataloader) // args.gradient_accumulation_steps)
+
+        logger.info("  Continuing training from checkpoint, will skip to saved global_step")
+        logger.info("  Continuing training from epoch %d", epochs_trained)
+        logger.info("  Continuing training from global step %d", global_step)
+        logger.info("  Will skip the first %d steps in the first epoch", steps_trained_in_current_epoch)
+
+    tr_loss, logging_loss = 0.0, 0.0
+    model.zero_grad()
+    train_iterator = trange(
+        epochs_trained, int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0],
+    )
+    set_seed(args)  # Added here for reproducibility
+    for _ in train_iterator:
+        epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
+        for step, batch in enumerate(epoch_iterator):
+
+            # Skip past any already trained steps if resuming training
+            if steps_trained_in_current_epoch > 0:
+                steps_trained_in_current_epoch -= 1
+                continue
+
+            model.train()
+            batch = tuple(t.to(args.device) for t in batch)
+            threshold, regu_lambda = schedule_threshold(
+                step=global_step,
+                total_step=t_total,
+                warmup_steps=args.warmup_steps,
+                final_threshold=args.final_threshold,
+                initial_threshold=args.initial_threshold,
+                final_warmup=args.final_warmup,
+                initial_warmup=args.initial_warmup,
+                final_lambda=args.final_lambda,
+            )
+            # Global TopK
+            if args.global_topk:
+                if threshold == 1.0:
+                    threshold = -1e2  # Or any arbitrarily low value
+                else:
+                    if (threshold_mem is None) or (global_step % args.global_topk_frequency_compute == 0):
+                        # Sort all the values to get the global topK
+                        concat = torch.cat(
+                            [param.view(-1) for name, param in model.named_parameters() if "mask_scores" in name]
+                        )
+                        n = concat.numel()
+                        kth = max(n - (int(n * threshold) + 1), 1)
+                        threshold_mem = concat.kthvalue(kth).values.item()
+                        threshold = threshold_mem
+                    else:
+                        threshold = threshold_mem
+            inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": batch[3]}
+            if args.model_type != "distilbert":
+                inputs["token_type_ids"] = (
+                    batch[2] if args.model_type in ["bert", "masked_bert", "xlnet", "albert"] else None
+                )  # XLM, DistilBERT, RoBERTa, and XLM-RoBERTa don't use segment_ids
+
+            if "masked" in args.model_type:
+                inputs["threshold"] = threshold
+
+            outputs = model(**inputs)
+            loss, logits_stu = outputs  # model outputs are always tuple in transformers (see doc)
+
+            # Distillation loss
+            if teacher is not None:
+                if "token_type_ids" not in inputs:
+                    inputs["token_type_ids"] = None if args.teacher_type == "xlm" else batch[2]
+                with torch.no_grad():
+                    (logits_tea,) = teacher(
+                        input_ids=inputs["input_ids"],
+                        token_type_ids=inputs["token_type_ids"],
+                        attention_mask=inputs["attention_mask"],
+                    )
+
+                loss_logits = F.kl_div(
+                    input=F.log_softmax(logits_stu / args.temperature, dim=-1),
+                    target=F.softmax(logits_tea / args.temperature, dim=-1),
+                    reduction="batchmean",
+                ) * (args.temperature ** 2)
+
+                loss = args.alpha_distil * loss_logits + args.alpha_ce * loss
+
+            # Regularization
+            if args.regularization is not None:
+                regu_ = regularization(model=model, mode=args.regularization)
+                loss = loss + regu_lambda * regu_
+
+            if args.n_gpu > 1:
+                loss = loss.mean()  # mean() to average on multi-gpu parallel training
+            if args.gradient_accumulation_steps > 1:
+                loss = loss / args.gradient_accumulation_steps
+
+            if args.fp16:
+                with amp.scale_loss(loss, optimizer) as scaled_loss:
+                    scaled_loss.backward()
+            else:
+                loss.backward()
+
+            tr_loss += loss.item()
+            if (step + 1) % args.gradient_accumulation_steps == 0 or (
+                # last step in epoch but step is always smaller than gradient_accumulation_steps
+                len(epoch_iterator) <= args.gradient_accumulation_steps
+                and (step + 1) == len(epoch_iterator)
+            ):
+                if args.fp16:
+                    torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
+                else:
+                    torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
+
+                if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
+                    tb_writer.add_scalar("threshold", threshold, global_step)
+                    for name, param in model.named_parameters():
+                        if not param.requires_grad:
+                            continue
+                        tb_writer.add_scalar("parameter_mean/" + name, param.data.mean(), global_step)
+                        tb_writer.add_scalar("parameter_std/" + name, param.data.std(), global_step)
+                        tb_writer.add_scalar("parameter_min/" + name, param.data.min(), global_step)
+                        tb_writer.add_scalar("parameter_max/" + name, param.data.max(), global_step)
+                        tb_writer.add_scalar("grad_mean/" + name, param.grad.data.mean(), global_step)
+                        tb_writer.add_scalar("grad_std/" + name, param.grad.data.std(), global_step)
+                        if args.regularization is not None and "mask_scores" in name:
+                            if args.regularization == "l1":
+                                perc = (torch.sigmoid(param) > threshold).sum().item() / param.numel()
+                            elif args.regularization == "l0":
+                                perc = (torch.sigmoid(param - 2 / 3 * np.log(0.1 / 1.1))).sum().item() / param.numel()
+                            tb_writer.add_scalar("retained_weights_perc/" + name, perc, global_step)
+
+                optimizer.step()
+                scheduler.step()  # Update learning rate schedule
+                model.zero_grad()
+                global_step += 1
+
+                if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
+                    logs = {}
+                    if (
+                        args.local_rank == -1 and args.evaluate_during_training
+                    ):  # Only evaluate when single GPU otherwise metrics may not average well
+                        results = evaluate(args, model, tokenizer)
+                        for key, value in results.items():
+                            eval_key = "eval_{}".format(key)
+                            logs[eval_key] = value
+
+                    loss_scalar = (tr_loss - logging_loss) / args.logging_steps
+                    learning_rate_scalar = scheduler.get_lr()
+                    logs["learning_rate"] = learning_rate_scalar[0]
+                    if len(learning_rate_scalar) > 1:
+                        for idx, lr in enumerate(learning_rate_scalar[1:]):
+                            logs[f"learning_rate/{idx+1}"] = lr
+                    logs["loss"] = loss_scalar
+                    if teacher is not None:
+                        logs["loss/distil"] = loss_logits.item()
+                    if args.regularization is not None:
+                        logs["loss/regularization"] = regu_.item()
+                    if (teacher is not None) or (args.regularization is not None):
+                        if (teacher is not None) and (args.regularization is not None):
+                            logs["loss/instant_ce"] = (
+                                loss.item()
+                                - regu_lambda * logs["loss/regularization"]
+                                - args.alpha_distil * logs["loss/distil"]
+                            ) / args.alpha_ce
+                        elif teacher is not None:
+                            logs["loss/instant_ce"] = (
+                                loss.item() - args.alpha_distil * logs["loss/distil"]
+                            ) / args.alpha_ce
+                        else:
+                            logs["loss/instant_ce"] = loss.item() - regu_lambda * logs["loss/regularization"]
+                    logging_loss = tr_loss
+
+                    for key, value in logs.items():
+                        tb_writer.add_scalar(key, value, global_step)
+                    print(json.dumps({**logs, **{"step": global_step}}))
+
+                if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
+                    # Save model checkpoint
+                    output_dir = os.path.join(args.output_dir, "checkpoint-{}".format(global_step))
+                    if not os.path.exists(output_dir):
+                        os.makedirs(output_dir)
+                    model_to_save = (
+                        model.module if hasattr(model, "module") else model
+                    )  # Take care of distributed/parallel training
+                    model_to_save.save_pretrained(output_dir)
+                    tokenizer.save_pretrained(output_dir)
+
+                    torch.save(args, os.path.join(output_dir, "training_args.bin"))
+                    logger.info("Saving model checkpoint to %s", output_dir)
+
+                    torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
+                    torch.save(scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
+                    logger.info("Saving optimizer and scheduler states to %s", output_dir)
+
+            if args.max_steps > 0 and global_step > args.max_steps:
+                epoch_iterator.close()
+                break
+        if args.max_steps > 0 and global_step > args.max_steps:
+            train_iterator.close()
+            break
+
+    if args.local_rank in [-1, 0]:
+        tb_writer.close()
+
+    return global_step, tr_loss / global_step
+
+
+def evaluate(args, model, tokenizer, prefix=""):
+    # Loop to handle MNLI double evaluation (matched, mis-matched)
+    eval_task_names = ("mnli", "mnli-mm") if args.task_name == "mnli" else (args.task_name,)
+    eval_outputs_dirs = (args.output_dir, args.output_dir + "/MM") if args.task_name == "mnli" else (args.output_dir,)
+
+    results = {}
+    for eval_task, eval_output_dir in zip(eval_task_names, eval_outputs_dirs):
+        eval_dataset = load_and_cache_examples(args, eval_task, tokenizer, evaluate=True)
+
+        if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]:
+            os.makedirs(eval_output_dir)
+
+        args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
+        # Note that DistributedSampler samples randomly
+        eval_sampler = SequentialSampler(eval_dataset)
+        eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)
+
+        # multi-gpu eval
+        if args.n_gpu > 1 and not isinstance(model, torch.nn.DataParallel):
+            model = torch.nn.DataParallel(model)
+
+        # Eval!
+        logger.info("***** Running evaluation {} *****".format(prefix))
+        logger.info("  Num examples = %d", len(eval_dataset))
+        logger.info("  Batch size = %d", args.eval_batch_size)
+        eval_loss = 0.0
+        nb_eval_steps = 0
+        preds = None
+        out_label_ids = None
+
+        # Global TopK
+        if args.global_topk:
+            threshold_mem = None
+
+        for batch in tqdm(eval_dataloader, desc="Evaluating"):
+            model.eval()
+            batch = tuple(t.to(args.device) for t in batch)
+
+            with torch.no_grad():
+                inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": batch[3]}
+                if args.model_type != "distilbert":
+                    inputs["token_type_ids"] = (
+                        batch[2] if args.model_type in ["bert", "masked_bert", "xlnet", "albert"] else None
+                    )  # XLM, DistilBERT, RoBERTa, and XLM-RoBERTa don't use segment_ids
+                if "masked" in args.model_type:
+                    inputs["threshold"] = args.final_threshold
+                    if args.global_topk:
+                        if threshold_mem is None:
+                            concat = torch.cat(
+                                [param.view(-1) for name, param in model.named_parameters() if "mask_scores" in name]
+                            )
+                            n = concat.numel()
+                            kth = max(n - (int(n * args.final_threshold) + 1), 1)
+                            threshold_mem = concat.kthvalue(kth).values.item()
+                        inputs["threshold"] = threshold_mem
+                outputs = model(**inputs)
+                tmp_eval_loss, logits = outputs[:2]
+
+                eval_loss += tmp_eval_loss.mean().item()
+            nb_eval_steps += 1
+            if preds is None:
+                preds = logits.detach().cpu().numpy()
+                out_label_ids = inputs["labels"].detach().cpu().numpy()
+            else:
+                preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
+                out_label_ids = np.append(out_label_ids, inputs["labels"].detach().cpu().numpy(), axis=0)
+
+        eval_loss = eval_loss / nb_eval_steps
+        if args.output_mode == "classification":
+            from scipy.special import softmax
+
+            probs = softmax(preds, axis=-1)
+            entropy = np.exp((-probs * np.log(probs)).sum(axis=-1).mean())
+            preds = np.argmax(preds, axis=1)
+        elif args.output_mode == "regression":
+            preds = np.squeeze(preds)
+        result = compute_metrics(eval_task, preds, out_label_ids)
+        results.update(result)
+        if entropy is not None:
+            result["eval_avg_entropy"] = entropy
+
+        output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
+        with open(output_eval_file, "w") as writer:
+            logger.info("***** Eval results {} *****".format(prefix))
+            for key in sorted(result.keys()):
+                logger.info("  %s = %s", key, str(result[key]))
+                writer.write("%s = %s\n" % (key, str(result[key])))
+
+    return results
+
+
+def load_and_cache_examples(args, task, tokenizer, evaluate=False):
+    if args.local_rank not in [-1, 0] and not evaluate:
+        torch.distributed.barrier()  # Make sure only the first process in distributed training processes the dataset, and the others will use the cache
+
+    processor = processors[task]()
+    output_mode = output_modes[task]
+    # Load data features from cache or dataset file
+    cached_features_file = os.path.join(
+        args.data_dir,
+        "cached_{}_{}_{}_{}".format(
+            "dev" if evaluate else "train",
+            list(filter(None, args.model_name_or_path.split("/"))).pop(),
+            str(args.max_seq_length),
+            str(task),
+        ),
+    )
+    if os.path.exists(cached_features_file) and not args.overwrite_cache:
+        logger.info("Loading features from cached file %s", cached_features_file)
+        features = torch.load(cached_features_file)
+    else:
+        logger.info("Creating features from dataset file at %s", args.data_dir)
+        label_list = processor.get_labels()
+        if task in ["mnli", "mnli-mm"] and args.model_type in ["roberta", "xlmroberta"]:
+            # HACK(label indices are swapped in RoBERTa pretrained model)
+            label_list[1], label_list[2] = label_list[2], label_list[1]
+        examples = (
+            processor.get_dev_examples(args.data_dir) if evaluate else processor.get_train_examples(args.data_dir)
+        )
+        features = convert_examples_to_features(
+            examples, tokenizer, max_length=args.max_seq_length, label_list=label_list, output_mode=output_mode,
+        )
+        if args.local_rank in [-1, 0]:
+            logger.info("Saving features into cached file %s", cached_features_file)
+            torch.save(features, cached_features_file)
+
+    if args.local_rank == 0 and not evaluate:
+        torch.distributed.barrier()  # Make sure only the first process in distributed training processes the dataset, and the others will use the cache
+
+    # Convert to Tensors and build dataset
+    all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
+    all_attention_mask = torch.tensor([f.attention_mask for f in features], dtype=torch.long)
+    all_token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long)
+    if output_mode == "classification":
+        all_labels = torch.tensor([f.label for f in features], dtype=torch.long)
+    elif output_mode == "regression":
+        all_labels = torch.tensor([f.label for f in features], dtype=torch.float)
+
+    dataset = TensorDataset(all_input_ids, all_attention_mask, all_token_type_ids, all_labels)
+    return dataset
+
+
+def main():
+    parser = argparse.ArgumentParser()
+
+    # Required parameters
+    parser.add_argument(
+        "--data_dir",
+        default=None,
+        type=str,
+        required=True,
+        help="The input data dir. Should contain the .tsv files (or other data files) for the task.",
+    )
+    parser.add_argument(
+        "--model_type",
+        default=None,
+        type=str,
+        required=True,
+        help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()),
+    )
+    parser.add_argument(
+        "--model_name_or_path",
+        default=None,
+        type=str,
+        required=True,
+        help="Path to pretrained model or model identifier from huggingface.co/models",
+    )
+    parser.add_argument(
+        "--task_name",
+        default=None,
+        type=str,
+        required=True,
+        help="The name of the task to train selected in the list: " + ", ".join(processors.keys()),
+    )
+    parser.add_argument(
+        "--output_dir",
+        default=None,
+        type=str,
+        required=True,
+        help="The output directory where the model predictions and checkpoints will be written.",
+    )
+    # Other parameters
+    parser.add_argument(
+        "--config_name", default="", type=str, help="Pretrained config name or path if not the same as model_name",
+    )
+    parser.add_argument(
+        "--tokenizer_name",
+        default="",
+        type=str,
+        help="Pretrained tokenizer name or path if not the same as model_name",
+    )
+    parser.add_argument(
+        "--cache_dir",
+        default="",
+        type=str,
+        help="Where do you want to store the pre-trained models downloaded from s3",
+    )
+    parser.add_argument(
+        "--max_seq_length",
+        default=128,
+        type=int,
+        help="The maximum total input sequence length after tokenization. Sequences longer "
+        "than this will be truncated, sequences shorter will be padded.",
+    )
+    parser.add_argument("--do_train", action="store_true", help="Whether to run training.")
+    parser.add_argument("--do_eval", action="store_true", help="Whether to run eval on the dev set.")
+    parser.add_argument(
+        "--evaluate_during_training", action="store_true", help="Run evaluation during training at each logging step.",
+    )
+    parser.add_argument(
+        "--do_lower_case", action="store_true", help="Set this flag if you are using an uncased model.",
+    )
+
+    parser.add_argument(
+        "--per_gpu_train_batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.",
+    )
+    parser.add_argument(
+        "--per_gpu_eval_batch_size", default=8, type=int, help="Batch size per GPU/CPU for evaluation.",
+    )
+    parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.")
+
+    # Pruning parameters
+    parser.add_argument(
+        "--mask_scores_learning_rate",
+        default=1e-2,
+        type=float,
+        help="The Adam initial learning rate of the mask scores.",
+    )
+    parser.add_argument(
+        "--initial_threshold", default=1.0, type=float, help="Initial value of the threshold (for scheduling)."
+    )
+    parser.add_argument(
+        "--final_threshold", default=0.7, type=float, help="Final value of the threshold (for scheduling)."
+    )
+    parser.add_argument(
+        "--initial_warmup",
+        default=1,
+        type=int,
+        help="Run `initial_warmup` * `warmup_steps` steps of threshold warmup during which threshold stays "
+        "at its `initial_threshold` value (sparsity schedule).",
+    )
+    parser.add_argument(
+        "--final_warmup",
+        default=2,
+        type=int,
+        help="Run `final_warmup` * `warmup_steps` steps of threshold cool-down during which threshold stays "
+        "at its `final_threshold` value (sparsity schedule).",
+    )
+
+    parser.add_argument(
+        "--pruning_method",
+        default="topK",
+        type=str,
+        help="Pruning Method (l0 = L0 regularization, magnitude = Magnitude pruning, topK = Movement pruning, sigmoied_threshold = Soft movement pruning).",
+    )
+    parser.add_argument(
+        "--mask_init",
+        default="constant",
+        type=str,
+        help="Initialization method for the mask scores. Choices: constant, uniform, kaiming.",
+    )
+    parser.add_argument(
+        "--mask_scale", default=0.0, type=float, help="Initialization parameter for the chosen initialization method."
+    )
+
+    parser.add_argument("--regularization", default=None, help="Add L0 or L1 regularization to the mask scores.")
+    parser.add_argument(
+        "--final_lambda",
+        default=0.0,
+        type=float,
+        help="Regularization intensity (used in conjunction with `regularization`).",
+    )
+
+    parser.add_argument("--global_topk", action="store_true", help="Global TopK on the Scores.")
+    parser.add_argument(
+        "--global_topk_frequency_compute",
+        default=25,
+        type=int,
+        help="Frequency at which we compute the TopK global threshold.",
+    )
+
+    # Distillation parameters (optional)
+    parser.add_argument(
+        "--teacher_type",
+        default=None,
+        type=str,
+        help="Teacher type. Teacher tokenizer and student (model) tokenizer must output the same tokenization. Only for distillation.",
+    )
+    parser.add_argument(
+        "--teacher_name_or_path",
+        default=None,
+        type=str,
+        help="Path to the already fine-tuned teacher model. Only for distillation.",
+    )
+    parser.add_argument(
+        "--alpha_ce", default=0.5, type=float, help="Cross entropy loss linear weight. Only for distillation."
+    )
+    parser.add_argument(
+        "--alpha_distil", default=0.5, type=float, help="Distillation loss linear weight. Only for distillation."
+    )
+    parser.add_argument(
+        "--temperature", default=2.0, type=float, help="Distillation temperature. Only for distillation."
+    )
+
+    parser.add_argument(
+        "--gradient_accumulation_steps",
+        type=int,
+        default=1,
+        help="Number of update steps to accumulate before performing a backward/update pass.",
+    )
+    parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.")
+    parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.")
+    parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.")
+    parser.add_argument(
+        "--num_train_epochs", default=3.0, type=float, help="Total number of training epochs to perform.",
+    )
+    parser.add_argument(
+        "--max_steps",
+        default=-1,
+        type=int,
+        help="If > 0: set total number of training steps to perform. Override num_train_epochs.",
+    )
+    parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.")
+
+    parser.add_argument("--logging_steps", type=int, default=50, help="Log every X update steps.")
+    parser.add_argument("--save_steps", type=int, default=50, help="Save checkpoint every X update steps.")
+    parser.add_argument(
+        "--eval_all_checkpoints",
+        action="store_true",
+        help="Evaluate all checkpoints starting with the same prefix as model_name and ending with step number",
+    )
+    parser.add_argument("--no_cuda", action="store_true", help="Avoid using CUDA when available")
+    parser.add_argument(
+        "--overwrite_output_dir", action="store_true", help="Overwrite the content of the output directory",
+    )
+    parser.add_argument(
+        "--overwrite_cache", action="store_true", help="Overwrite the cached training and evaluation sets",
+    )
+    parser.add_argument("--seed", type=int, default=42, help="random seed for initialization")
+
+    parser.add_argument(
+        "--fp16",
+        action="store_true",
+        help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit",
+    )
+    parser.add_argument(
+        "--fp16_opt_level",
+        type=str,
+        default="O1",
+        help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']. "
+        "See details at https://nvidia.github.io/apex/amp.html",
+    )
+    parser.add_argument("--local_rank", type=int, default=-1, help="For distributed training: local_rank")
+
+    args = parser.parse_args()
+
+    # Regularization
+    if args.regularization == "null":
+        args.regularization = None
+
+    if (
+        os.path.exists(args.output_dir)
+        and os.listdir(args.output_dir)
+        and args.do_train
+        and not args.overwrite_output_dir
+    ):
+        raise ValueError(
+            f"Output directory ({args.output_dir}) already exists and is not empty. Use --overwrite_output_dir to overcome."
+        )
+
+    # Setup CUDA, GPU & distributed training
+    if args.local_rank == -1 or args.no_cuda:
+        device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
+        args.n_gpu = 0 if args.no_cuda else torch.cuda.device_count()
+    else:  # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
+        torch.cuda.set_device(args.local_rank)
+        device = torch.device("cuda", args.local_rank)
+        torch.distributed.init_process_group(backend="nccl")
+        args.n_gpu = 1
+    args.device = device
+
+    # Setup logging
+    logging.basicConfig(
+        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
+        datefmt="%m/%d/%Y %H:%M:%S",
+        level=logging.INFO if args.local_rank in [-1, 0] else logging.WARN,
+    )
+    logger.warning(
+        "Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
+        args.local_rank,
+        device,
+        args.n_gpu,
+        bool(args.local_rank != -1),
+        args.fp16,
+    )
+
+    # Set seed
+    set_seed(args)
+
+    # Prepare GLUE task
+    args.task_name = args.task_name.lower()
+    if args.task_name not in processors:
+        raise ValueError("Task not found: %s" % (args.task_name))
+    processor = processors[args.task_name]()
+    args.output_mode = output_modes[args.task_name]
+    label_list = processor.get_labels()
+    num_labels = len(label_list)
+
+    # Load pretrained model and tokenizer
+    if args.local_rank not in [-1, 0]:
+        torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab
+
+    args.model_type = args.model_type.lower()
+    config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
+    config = config_class.from_pretrained(
+        args.config_name if args.config_name else args.model_name_or_path,
+        num_labels=num_labels,
+        finetuning_task=args.task_name,
+        cache_dir=args.cache_dir if args.cache_dir else None,
+        pruning_method=args.pruning_method,
+        mask_init=args.mask_init,
+        mask_scale=args.mask_scale,
+    )
+    tokenizer = tokenizer_class.from_pretrained(
+        args.tokenizer_name if args.tokenizer_name else args.model_name_or_path,
+        cache_dir=args.cache_dir if args.cache_dir else None,
+        do_lower_case=args.do_lower_case,
+    )
+    model = model_class.from_pretrained(
+        args.model_name_or_path,
+        from_tf=bool(".ckpt" in args.model_name_or_path),
+        config=config,
+        cache_dir=args.cache_dir if args.cache_dir else None,
+    )
+
+    if args.teacher_type is not None:
+        assert args.teacher_name_or_path is not None
+        assert args.alpha_distil > 0.0
+        assert args.alpha_distil + args.alpha_ce > 0.0
+        teacher_config_class, teacher_model_class, _ = MODEL_CLASSES[args.teacher_type]
+        teacher_config = teacher_config_class.from_pretrained(args.teacher_name_or_path)
+        teacher = teacher_model_class.from_pretrained(
+            args.teacher_name_or_path,
+            from_tf=False,
+            config=teacher_config,
+            cache_dir=args.cache_dir if args.cache_dir else None,
+        )
+        teacher.to(args.device)
+    else:
+        teacher = None
+
+    if args.local_rank == 0:
+        torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab
+
+    model.to(args.device)
+
+    logger.info("Training/evaluation parameters %s", args)
+
+    # Training
+    if args.do_train:
+        train_dataset = load_and_cache_examples(args, args.task_name, tokenizer, evaluate=False)
+        global_step, tr_loss = train(args, train_dataset, model, tokenizer, teacher=teacher)
+        logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
+
+    # Saving best-practices: if you use default names for the model, you can reload it using from_pretrained()
+    if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
+        # Create output directory if needed
+        if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]:
+            os.makedirs(args.output_dir)
+
+        logger.info("Saving model checkpoint to %s", args.output_dir)
+        # Save a trained model, configuration and tokenizer using `save_pretrained()`.
+        # They can then be reloaded using `from_pretrained()`
+        model_to_save = (
+            model.module if hasattr(model, "module") else model
+        )  # Take care of distributed/parallel training
+        model_to_save.save_pretrained(args.output_dir)
+        tokenizer.save_pretrained(args.output_dir)
+
+        # Good practice: save your training arguments together with the trained model
+        torch.save(args, os.path.join(args.output_dir, "training_args.bin"))
+
+        # Load a trained model and vocabulary that you have fine-tuned
+        model = model_class.from_pretrained(args.output_dir)
+        tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
+        model.to(args.device)
+
+    # Evaluation
+    results = {}
+    if args.do_eval and args.local_rank in [-1, 0]:
+        tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
+        checkpoints = [args.output_dir]
+        if args.eval_all_checkpoints:
+            checkpoints = list(
+                os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True))
+            )
+            logging.getLogger("transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging
+        logger.info("Evaluate the following checkpoints: %s", checkpoints)
+        for checkpoint in checkpoints:
+            global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
+            prefix = checkpoint.split("/")[-1] if checkpoint.find("checkpoint") != -1 else ""
+
+            model = model_class.from_pretrained(checkpoint)
+            model.to(args.device)
+            result = evaluate(args, model, tokenizer, prefix=prefix)
+            result = dict((k + "_{}".format(global_step), v) for k, v in result.items())
+            results.update(result)
+
+    return results
+
+
+if __name__ == "__main__":
+    main()
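
Note on the global top-k branch of `evaluate` in `masked_run_glue.py`: the script concatenates every `mask_scores` parameter and takes the `kth` smallest value as the threshold. Below is a minimal pure-Python sketch of that arithmetic only. `global_topk_threshold` is a hypothetical helper name, `sorted(...)[kth - 1]` stands in for `torch.kthvalue` (which returns the k-th smallest element), and the strict `>` comparison used to count surviving scores is an assumption, since the masking layer itself is not part of this hunk.

```python
def global_topk_threshold(scores, final_threshold):
    """Return the kth-smallest score, using the same kth formula as the script."""
    n = len(scores)
    # Same expression as in evaluate(): clamp to 1 so kth is a valid 1-based rank.
    kth = max(n - (int(n * final_threshold) + 1), 1)
    # torch.kthvalue is 1-based over ascending order; emulate it with a sort.
    return sorted(scores)[kth - 1]


# Toy stand-in for the concatenated mask scores (assumed values).
scores = [float(i) for i in range(10)]
threshold = global_topk_threshold(scores, 0.5)

# Assumption: scores strictly above the threshold are the ones kept.
kept = [s for s in scores if s > threshold]
print(threshold, len(kept))
```

With these toy inputs the threshold is the 4th-smallest score, so six of the ten scores lie strictly above it, one more than `int(n * final_threshold)`.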
--- a/examples/movement-pruning/masked_run_squad.py
+++ b/examples/movement-pruning/masked_run_squad.py
--- a/examples/movement-pruning/requirements.txt
+++ b/examples/movement-pruning/requirements.txt
@@ -0,0 +1,6 @@
+torch>=1.4.0
+-e git+https://github.com/huggingface/transformers.git@352d5472b0c1dec0f420d606d16747d851b4bda8#egg=transformers
+knockknock>=0.1.8.1
+h5py>=2.10.0
+numpy>=1.18.2
+scipy>=1.4.1
--- a/examples/multiple-choice/README.md
+++ b/examples/multiple-choice/README.md
@@ -8,7 +8,7 @@ Download [swag](https://github.com/rowanz/swagaf/tree/master/data) data
 ```bash
 #training on 4 tesla V100(16GB) GPUS
 export SWAG_DIR=/path/to/swag_data_dir
-python ./examples/run_multiple_choice.py \
+python ./examples/multiple-choice/run_multiple_choice.py \
 --task_name swag \
 --model_name_or_path roberta-base \
 --do_train \
@@ -19,7 +19,7 @@ python ./examples/run_multiple_choice.py \
 --max_seq_length 80 \
 --output_dir models_bert/swag_base \
 --per_gpu_eval_batch_size=16 \
--per_gpu_train_batch_size=16 \
+--per_device_train_batch_size=16 \
 --gradient_accumulation_steps 2 \
 --overwrite_output
 ```
@@ -29,3 +29,28 @@ Training with the defined hyper-parameters yields the following results:
 eval_acc = 0.8338998300509847
 eval_loss = 0.44457291918821606
 ```
+
+
+## TensorFlow
+
+```bash
+export SWAG_DIR=/path/to/swag_data_dir
+python ./examples/multiple-choice/run_tf_multiple_choice.py \
+--task_name swag \
+--model_name_or_path bert-base-cased \
+--do_train \
+--do_eval \
+--data_dir $SWAG_DIR \
+--learning_rate 5e-5 \
+--num_train_epochs 3 \
+--max_seq_length 80 \
+--output_dir models_bert/swag_base \
+--per_gpu_eval_batch_size=16 \
+--per_device_train_batch_size=16 \
+--logging_dir logs \
+--gradient_accumulation_steps 2 \
+--overwrite_output
+```
+
+## Run it in Colab
+[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ViktorAlm/notebooks/blob/master/MPC_GPU_Demo_for_TF_and_PT.ipynb)
--- a/examples/multiple-choice/run_multiple_choice.py
+++ b/examples/multiple-choice/run_multiple_choice.py
@@ -159,7 +159,6 @@ def main():
            max_seq_length=data_args.max_seq_length,
            overwrite_cache=data_args.overwrite_cache,
            mode=Split.train,
-            local_rank=training_args.local_rank,
        )
        if training_args.do_train
        else None
@@ -172,7 +171,6 @@ def main():
            max_seq_length=data_args.max_seq_length,
            overwrite_cache=data_args.overwrite_cache,
            mode=Split.dev,
-            local_rank=training_args.local_rank,
        )
        if training_args.do_eval
        else None
@@ -204,22 +202,28 @@ def main():

    # Evaluation
    results = {}
-    if training_args.do_eval and training_args.local_rank in [-1, 0]:
+    if training_args.do_eval:
        logger.info("*** Evaluate ***")

        result = trainer.evaluate()

        output_eval_file = os.path.join(training_args.output_dir, "eval_results.txt")
-        with open(output_eval_file, "w") as writer:
-            logger.info("***** Eval results *****")
-            for key, value in result.items():
-                logger.info("  %s = %s", key, value)
-                writer.write("%s = %s\n" % (key, value))
+        if trainer.is_world_master():
+            with open(output_eval_file, "w") as writer:
+                logger.info("***** Eval results *****")
+                for key, value in result.items():
+                    logger.info("  %s = %s", key, value)
+                    writer.write("%s = %s\n" % (key, value))

-            results.update(result)
+                results.update(result)

    return results


+def _mp_fn(index):
+    # For xla_spawn (TPUs)
+    main()
+
+
 if __name__ == "__main__":
    main()
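
Note on the `run_multiple_choice.py` change: evaluation now runs on every process, and only the file write is guarded by `trainer.is_world_master()`. The pattern can be sketched without the `Trainer` as follows. This is an illustration under stated assumptions: `write_eval_results` is a hypothetical helper, and the boolean `is_world_master` argument stands in for the value the trainer would report.

```python
import os
import tempfile


def write_eval_results(result, output_dir, is_world_master):
    """Write `key = value` lines to eval_results.txt, but only on the master process."""
    if not is_world_master:
        # Non-master processes still computed `result`; they just skip the write.
        return None
    path = os.path.join(output_dir, "eval_results.txt")
    with open(path, "w") as writer:
        for key, value in result.items():
            writer.write("%s = %s\n" % (key, value))
    return path


with tempfile.TemporaryDirectory() as tmp:
    # A worker process writes nothing.
    assert write_eval_results({"eval_acc": 0.83}, tmp, is_world_master=False) is None
    # The master process writes the file.
    path = write_eval_results({"eval_acc": 0.83}, tmp, is_world_master=True)
    with open(path) as reader:
        print(reader.read())
```

Guarding only the write keeps the collective evaluation step symmetric across processes while avoiding concurrent writes to the same file.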
--- a/examples/multiple-choice/run_tf_multiple_choice.py
+++ b/examples/multiple-choice/run_tf_multiple_choice.py
@@ -0,0 +1,211 @@
+# coding=utf-8
+# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
+# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" Finetuning the library models for multiple choice (Bert, Roberta, XLNet)."""
+
+
+import logging
+import os
+from dataclasses import dataclass, field
+from typing import Dict, Optional
+
+import numpy as np
+
+from transformers import (
+    AutoConfig,
+    AutoTokenizer,
+    EvalPrediction,
+    HfArgumentParser,
+    TFAutoModelForMultipleChoice,
+    TFTrainer,
+    TFTrainingArguments,
+    set_seed,
+)
+from utils_multiple_choice import Split, TFMultipleChoiceDataset, processors
+
+
+logger = logging.getLogger(__name__)
+
+
+def simple_accuracy(preds, labels):
+    return (preds == labels).mean()
+
+
+@dataclass
+class ModelArguments:
+    """
+    Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
+    """
+
+    model_name_or_path: str = field(
+        metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
+    )
+    config_name: Optional[str] = field(
+        default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
+    )
+    tokenizer_name: Optional[str] = field(
+        default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
+    )
+    cache_dir: Optional[str] = field(
+        default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
+    )
+
+
+@dataclass
+class DataTrainingArguments:
+    """
+    Arguments pertaining to what data we are going to input our model for training and eval.
+    """
+
+    task_name: str = field(metadata={"help": "The name of the task to train on: " + ", ".join(processors.keys())})
+    data_dir: str = field(metadata={"help": "Should contain the data files for the task."})
+    max_seq_length: int = field(
+        default=128,
+        metadata={
+            "help": "The maximum total input sequence length after tokenization. Sequences longer "
+            "than this will be truncated, sequences shorter will be padded."
+        },
+    )
+    overwrite_cache: bool = field(
+        default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
+    )
+
+
+def main():
+    # See all possible arguments in src/transformers/training_args.py
+    # or by passing the --help flag to this script.
+    # We now keep distinct sets of args, for a cleaner separation of concerns.
+
+    parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TFTrainingArguments))
+    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
+
+    if (
+        os.path.exists(training_args.output_dir)
+        and os.listdir(training_args.output_dir)
+        and training_args.do_train
+        and not training_args.overwrite_output_dir
+    ):
+        raise ValueError(
+            f"Output directory ({training_args.output_dir}) already exists and is not empty. Use --overwrite_output_dir to overcome."
+        )
+
+    # Setup logging
+    logging.basicConfig(
+        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
+        datefmt="%m/%d/%Y %H:%M:%S",
+        level=logging.INFO,
+    )
+    logger.warning(
+        "device: %s, n_gpu: %s, 16-bits training: %s", training_args.device, training_args.n_gpu, training_args.fp16,
+    )
+    logger.info("Training/evaluation parameters %s", training_args)
+
+    # Set seed
+    set_seed(training_args.seed)
+
+    try:
+        processor = processors[data_args.task_name]()
+        label_list = processor.get_labels()
+        num_labels = len(label_list)
+    except KeyError:
+        raise ValueError("Task not found: %s" % (data_args.task_name))
+
+    # Load pretrained model and tokenizer
+    #
+    # Distributed training:
+    # The .from_pretrained methods guarantee that only one local process can concurrently
+    # download model & vocab.
+    config = AutoConfig.from_pretrained(
+        model_args.config_name if model_args.config_name else model_args.model_name_or_path,
+        num_labels=num_labels,
+        finetuning_task=data_args.task_name,
+        cache_dir=model_args.cache_dir,
+    )
+    tokenizer = AutoTokenizer.from_pretrained(
+        model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
+        cache_dir=model_args.cache_dir,
+    )
+    with training_args.strategy.scope():
+        model = TFAutoModelForMultipleChoice.from_pretrained(
+            model_args.model_name_or_path,
+            from_pt=bool(".bin" in model_args.model_name_or_path),
+            config=config,
+            cache_dir=model_args.cache_dir,
+        )
+    # Get datasets
+    train_dataset = (
+        TFMultipleChoiceDataset(
+            data_dir=data_args.data_dir,
+            tokenizer=tokenizer,
+            task=data_args.task_name,
+            max_seq_length=data_args.max_seq_length,
+            overwrite_cache=data_args.overwrite_cache,
+            mode=Split.train,
+        )
+        if training_args.do_train
+        else None
+    )
+    eval_dataset = (
+        TFMultipleChoiceDataset(
+            data_dir=data_args.data_dir,
+            tokenizer=tokenizer,
+            task=data_args.task_name,
+            max_seq_length=data_args.max_seq_length,
+            overwrite_cache=data_args.overwrite_cache,
+            mode=Split.dev,
+        )
+        if training_args.do_eval
+        else None
+    )
+
+    def compute_metrics(p: EvalPrediction) -> Dict:
+        preds = np.argmax(p.predictions, axis=1)
+        return {"acc": simple_accuracy(preds, p.label_ids)}
+
+    # Initialize our Trainer
+    trainer = TFTrainer(
+        model=model,
+        args=training_args,
+        train_dataset=train_dataset.get_dataset() if train_dataset else None,
+        eval_dataset=eval_dataset.get_dataset() if eval_dataset else None,
+        compute_metrics=compute_metrics,
+    )
+
+    # Training
+    if training_args.do_train:
+        trainer.train()
+        trainer.save_model()
+        tokenizer.save_pretrained(training_args.output_dir)
+    # Evaluation
+    results = {}
+    if training_args.do_eval:
+        logger.info("*** Evaluate ***")
+
+        result = trainer.evaluate()
+
+        output_eval_file = os.path.join(training_args.output_dir, "eval_results.txt")
+        with open(output_eval_file, "w") as writer:
+            logger.info("***** Eval results *****")
+            for key, value in result.items():
+                logger.info("  %s = %s", key, value)
+                writer.write("%s = %s\n" % (key, value))
+
+            results.update(result)
+
+    return results
+
+
+if __name__ == "__main__":
+    main()
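
Note on `compute_metrics` in `run_tf_multiple_choice.py`: the metric takes the argmax over the choice dimension and compares it with the label ids. A dependency-free sketch of the same computation is shown below. It assumes plain nested lists instead of NumPy arrays, and the `argmax` helper and the toy logits are illustrative, not taken from the script.

```python
def argmax(row):
    """Index of the largest value in a list (first one wins on ties)."""
    return max(range(len(row)), key=lambda i: row[i])


def simple_accuracy(preds, labels):
    """Fraction of positions where the prediction equals the label."""
    correct = sum(1 for p, l in zip(preds, labels) if p == l)
    return correct / len(labels)


# Three examples with four answer choices each (made-up logits).
logits = [
    [0.1, 2.0, -1.0, 0.3],
    [1.5, 0.2, 0.1, 0.0],
    [0.0, 0.1, 0.2, 3.0],
]
labels = [1, 2, 3]

preds = [argmax(row) for row in logits]
print({"acc": simple_accuracy(preds, labels)})
```

Here the first and third examples are predicted correctly, so the accuracy is two thirds.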
--- a/examples/multiple-choice/utils_multiple_choice.py
+++ b/examples/multiple-choice/utils_multiple_choice.py
@@ -25,11 +25,10 @@ from dataclasses import dataclass
 from enum import Enum
 from typing import List, Optional

-import torch
 import tqdm
-from torch.utils.data.dataset import Dataset
+from filelock import FileLock

-from transformers import PreTrainedTokenizer, torch_distributed_zero_first
+from transformers import PreTrainedTokenizer, is_tf_available, is_torch_available


 logger = logging.getLogger(__name__)
@@ -76,66 +75,159 @@ class Split(Enum):
    test = "test"


-class MultipleChoiceDataset(Dataset):
-    """
-    This will be superseded by a framework-agnostic approach
-    soon.
-    """
+if is_torch_available():
+    import torch
+    from torch.utils.data.dataset import Dataset

-    features: List[InputFeatures]
+    class MultipleChoiceDataset(Dataset):
+        """
+        This will be superseded by a framework-agnostic approach
+        soon.
+        """

-    def __init__(
-        self,
-        data_dir: str,
-        tokenizer: PreTrainedTokenizer,
-        task: str,
-        max_seq_length: Optional[int] = None,
-        overwrite_cache=False,
-        mode: Split = Split.train,
-        local_rank=-1,
-    ):
-        processor = processors[task]()
+        features: List[InputFeatures]
+
+        def __init__(
+            self,
+            data_dir: str,
+            tokenizer: PreTrainedTokenizer,
+            task: str,
+            max_seq_length: Optional[int] = None,
+            overwrite_cache=False,
+            mode: Split = Split.train,
+        ):
+            processor = processors[task]()
+
+            cached_features_file = os.path.join(
+                data_dir,
+                "cached_{}_{}_{}_{}".format(mode.value, tokenizer.__class__.__name__, str(max_seq_length), task,),
+            )

-        cached_features_file = os.path.join(
-            data_dir,
-            "cached_{}_{}_{}_{}".format(mode.value, tokenizer.__class__.__name__, str(max_seq_length), task,),
-        )
-        with torch_distributed_zero_first(local_rank):
            # Make sure only the first process in distributed training processes the dataset,
            # and the others will use the cache.
+            lock_path = cached_features_file + ".lock"
+            with FileLock(lock_path):

-            if os.path.exists(cached_features_file) and not overwrite_cache:
-                logger.info(f"Loading features from cached file {cached_features_file}")
-                self.features = torch.load(cached_features_file)
-            else:
-                logger.info(f"Creating features from dataset file at {data_dir}")
-                label_list = processor.get_labels()
-                if mode == Split.dev:
-                    examples = processor.get_dev_examples(data_dir)
-                elif mode == Split.test:
-                    examples = processor.get_test_examples(data_dir)
+                if os.path.exists(cached_features_file) and not overwrite_cache:
+                    logger.info(f"Loading features from cached file {cached_features_file}")
+                    self.features = torch.load(cached_features_file)
                else:
-                    examples = processor.get_train_examples(data_dir)
-                logger.info("Training examples: %s", len(examples))
-                # TODO clean up all this to leverage built-in features of tokenizers
-                self.features = convert_examples_to_features(
-                    examples,
-                    label_list,
-                    max_seq_length,
-                    tokenizer,
-                    pad_on_left=bool(tokenizer.padding_side == "left"),
-                    pad_token=tokenizer.pad_token_id,
-                    pad_token_segment_id=tokenizer.pad_token_type_id,
-                )
-                if local_rank in [-1, 0]:
+                    logger.info(f"Creating features from dataset file at {data_dir}")
+                    label_list = processor.get_labels()
+                    if mode == Split.dev:
+                        examples = processor.get_dev_examples(data_dir)
+                    elif mode == Split.test:
+                        examples = processor.get_test_examples(data_dir)
+                    else:
+                        examples = processor.get_train_examples(data_dir)
+                    logger.info("Training examples: %s", len(examples))
+                    # TODO clean up all this to leverage built-in features of tokenizers
+                    self.features = convert_examples_to_features(
+                        examples,
+                        label_list,
+                        max_seq_length,
+                        tokenizer,
+                        pad_on_left=bool(tokenizer.padding_side == "left"),
+                        pad_token=tokenizer.pad_token_id,
+                        pad_token_segment_id=tokenizer.pad_token_type_id,
+                    )
                    logger.info("Saving features into cached file %s", cached_features_file)
                    torch.save(self.features, cached_features_file)

-    def __len__(self):
-        return len(self.features)
+        def __len__(self):
+            return len(self.features)

-    def __getitem__(self, i) -> InputFeatures:
-        return self.features[i]
+        def __getitem__(self, i) -> InputFeatures:
+            return self.features[i]
+
+
+if is_tf_available():
+    import tensorflow as tf
+
+    class TFMultipleChoiceDataset:
+        """
+        This will be superseded by a framework-agnostic approach
+        soon.
+        """
+
+        features: List[InputFeatures]
+
+        def __init__(
+            self,
+            data_dir: str,
+            tokenizer: PreTrainedTokenizer,
+            task: str,
+            max_seq_length: Optional[int] = 128,
+            overwrite_cache=False,
+            mode: Split = Split.train,
+        ):
+            processor = processors[task]()
+
+            logger.info(f"Creating features from dataset file at {data_dir}")
+            label_list = processor.get_labels()
+            if mode == Split.dev:
+                examples = processor.get_dev_examples(data_dir)
+            elif mode == Split.test:
+                examples = processor.get_test_examples(data_dir)
+            else:
+                examples = processor.get_train_examples(data_dir)
+            logger.info("Training examples: %s", len(examples))
+            # TODO clean up all this to leverage built-in features of tokenizers
+            self.features = convert_examples_to_features(
+                examples,
+                label_list,
+                max_seq_length,
+                tokenizer,
+                pad_on_left=bool(tokenizer.padding_side == "left"),
+                pad_token=tokenizer.pad_token_id,
+                pad_token_segment_id=tokenizer.pad_token_type_id,
+            )
+
+            def gen():
+                for (ex_index, ex) in tqdm.tqdm(enumerate(self.features), desc="convert examples to features"):
+                    if ex_index % 10000 == 0:
+                        logger.info("Writing example %d of %d" % (ex_index, len(examples)))
+
+                    yield (
+                        {
+                            "example_id": 0,
+                            "input_ids": ex.input_ids,
+                            "attention_mask": ex.attention_mask,
+                            "token_type_ids": ex.token_type_ids,
+                        },
+                        ex.label,
+                    )
+
+            self.dataset = tf.data.Dataset.from_generator(
+                gen,
+                (
+                    {
+                        "example_id": tf.int32,
+                        "input_ids": tf.int32,
+                        "attention_mask": tf.int32,
+                        "token_type_ids": tf.int32,
+                    },
+                    tf.int64,
+                ),
+                (
+                    {
+                        "example_id": tf.TensorShape([]),
+                        "input_ids": tf.TensorShape([None, None]),
+                        "attention_mask": tf.TensorShape([None, None]),
+                        "token_type_ids": tf.TensorShape([None, None]),
+                    },
+                    tf.TensorShape([]),
+                ),
+            )
+
+        def get_dataset(self):
+            return self.dataset
+
+        def __len__(self):
+            return len(self.features)
+
+        def __getitem__(self, i) -> InputFeatures:
+            return self.features[i]


 class DataProcessor:
@@ -225,6 +317,52 @@ class RaceProcessor(DataProcessor):
        return examples


+class SynonymProcessor(DataProcessor):
+    """Processor for the Synonym data set."""
+
+    def get_train_examples(self, data_dir):
+        """See base class."""
+        logger.info("LOOKING AT {} train".format(data_dir))
+        return self._create_examples(self._read_csv(os.path.join(data_dir, "mctrain.csv")), "train")
+
+    def get_dev_examples(self, data_dir):
+        """See base class."""
+        logger.info("LOOKING AT {} dev".format(data_dir))
+        return self._create_examples(self._read_csv(os.path.join(data_dir, "mchp.csv")), "dev")
+
+    def get_test_examples(self, data_dir):
+        """See base class."""
+        logger.info("LOOKING AT {} test".format(data_dir))
+
+        return self._create_examples(self._read_csv(os.path.join(data_dir, "mctest.csv")), "test")
+
+    def get_labels(self):
+        """See base class."""
+        return ["0", "1", "2", "3", "4"]
+
+    def _read_csv(self, input_file):
+        with open(input_file, "r", encoding="utf-8") as f:
+            return list(csv.reader(f))
+
+    def _create_examples(self, lines: List[List[str]], type: str):
+        """Creates examples for the training and dev sets."""
+
+        examples = [
+            InputExample(
+                example_id=line[0],
+                question="",  # no separate question field is
+                # used here: each ending is
+                # paired with the same context.
+                contexts=[line[1], line[1], line[1], line[1], line[1]],
+                endings=[line[2], line[3], line[4], line[5], line[6]],
+                label=line[7],
+            )
+            for line in lines  # note: every row is used; no header row is skipped
+        ]
+
+        return examples
+
+
 class SwagProcessor(DataProcessor):
    """Processor for the SWAG data set."""

@@ -397,7 +535,12 @@ def convert_examples_to_features(
                text_b = example.question + " " + ending

            inputs = tokenizer.encode_plus(
-                text_a, text_b, add_special_tokens=True, max_length=max_length, pad_to_max_length=True,
+                text_a,
+                text_b,
+                add_special_tokens=True,
+                max_length=max_length,
+                pad_to_max_length=True,
+                return_overflowing_tokens=True,
            )
            if "num_truncated_tokens" in inputs and inputs["num_truncated_tokens"] > 0:
                logger.info(
@@ -435,7 +578,5 @@ def convert_examples_to_features(
    return features


-processors = {"race": RaceProcessor, "swag": SwagProcessor, "arc": ArcProcessor}
-
-
-MULTIPLE_CHOICE_TASKS_NUM_LABELS = {"race", 4, "swag", 4, "arc", 4}
+processors = {"race": RaceProcessor, "swag": SwagProcessor, "arc": ArcProcessor, "syn": SynonymProcessor}
+MULTIPLE_CHOICE_TASKS_NUM_LABELS = {"race": 4, "swag": 4, "arc": 4, "syn": 5}
--- a/examples/question-answering/README.md
+++ b/examples/question-answering/README.md
@@ -2,7 +2,7 @@

 ## SQuAD

-Based on the script [`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/run_squad.py).
+Based on the script [`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py).

 #### Fine-tuning BERT on SQuAD1.0

@@ -28,6 +28,7 @@ python run_squad.py \
  --model_name_or_path bert-base-uncased \
  --do_train \
  --do_eval \
+  --do_lower_case \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --per_gpu_train_batch_size 12 \
@@ -51,11 +52,12 @@ exact_match = 81.22
 Here is an example using distributed training on 8 V100 GPUs and Bert Whole Word Masking uncased model to reach a F1 > 93 on SQuAD1.1:

 ```bash
-python -m torch.distributed.launch --nproc_per_node=8 ./examples/run_squad.py \
+python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answering/run_squad.py \
    --model_type bert \
    --model_name_or_path bert-large-uncased-whole-word-masking \
    --do_train \
    --do_eval \
+    --do_lower_case \
    --train_file $SQUAD_DIR/train-v1.1.json \
    --predict_file $SQUAD_DIR/dev-v1.1.json \
    --learning_rate 3e-5 \
@@ -157,3 +159,23 @@ Larger batch size may improve the performance while costing more memory.
 }
 ```

+## SQuAD with the TensorFlow Trainer
+
+```bash
+python run_tf_squad.py \
+    --model_name_or_path bert-base-uncased \
+    --output_dir model \
+    --max_seq_length 384 \
+    --num_train_epochs 2 \
+    --per_gpu_train_batch_size 8 \
+    --per_gpu_eval_batch_size 16 \
+    --do_train \
+    --logging_dir logs \
+    --mode question-answering \
+    --logging_steps 10 \
+    --learning_rate 3e-5 \
+    --doc_stride 128 \
+    --optimizer_name adamw
+```
+
+For the moment, only training is available with the TensorFlow Trainer; evaluation is not yet supported.
--- a/examples/question-answering/run_squad.py
+++ b/examples/question-answering/run_squad.py
@@ -58,8 +58,6 @@ logger = logging.getLogger(__name__)
 MODEL_CONFIG_CLASSES = list(MODEL_FOR_QUESTION_ANSWERING_MAPPING.keys())
 MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)

-ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in MODEL_CONFIG_CLASSES), (),)
-

 def set_seed(args):
    random.seed(args.seed)
@@ -491,7 +489,7 @@ def main():
        default=None,
        type=str,
        required=True,
-        help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS),
+        help="Path to pretrained model or model identifier from huggingface.co/models",
    )
    parser.add_argument(
        "--output_dir",
--- a/examples/question-answering/run_tf_squad.py
+++ b/examples/question-answering/run_tf_squad.py
@@ -0,0 +1,237 @@
+# coding=utf-8
+# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
+# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" Fine-tuning the library models for question-answering."""
+
+
+import logging
+import os
+from dataclasses import dataclass, field
+from typing import Optional
+
+from transformers import (
+    AutoConfig,
+    AutoTokenizer,
+    HfArgumentParser,
+    TFAutoModelForQuestionAnswering,
+    TFTrainer,
+    TFTrainingArguments,
+    squad_convert_examples_to_features,
+)
+from transformers.data.processors.squad import SquadV1Processor, SquadV2Processor
+
+
+logger = logging.getLogger(__name__)
+
+
+@dataclass
+class ModelArguments:
+    """
+    Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
+    """
+
+    model_name_or_path: str = field(
+        metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
+    )
+    config_name: Optional[str] = field(
+        default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
+    )
+    tokenizer_name: Optional[str] = field(
+        default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
+    )
+    use_fast: bool = field(default=False, metadata={"help": "Set this flag to use fast tokenization."})
+    # If you want to tweak more attributes on your tokenizer, you should do it in a distinct script,
+    # or just modify its tokenizer_config.json.
+    cache_dir: Optional[str] = field(
+        default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
+    )
+
+
+@dataclass
+class DataTrainingArguments:
+    """
+    Arguments pertaining to what data we are going to input our model for training and eval.
+    """
+
+    data_dir: Optional[str] = field(
+        default=None, metadata={"help": "The input data dir. Should contain the .json files for the SQuAD task."}
+    )
+    max_seq_length: int = field(
+        default=128,
+        metadata={
+            "help": "The maximum total input sequence length after tokenization. Sequences longer "
+            "than this will be truncated, sequences shorter will be padded."
+        },
+    )
+    doc_stride: int = field(
+        default=128,
+        metadata={"help": "When splitting up a long document into chunks, how much stride to take between chunks."},
+    )
+    max_query_length: int = field(
+        default=64,
+        metadata={
+            "help": "The maximum number of tokens for the question. Questions longer than this will "
+            "be truncated to this length."
+        },
+    )
+    max_answer_length: int = field(
+        default=30,
+        metadata={
+            "help": "The maximum length of an answer that can be generated. This is needed because the start "
+            "and end predictions are not conditioned on one another."
+        },
+    )
+    overwrite_cache: bool = field(
+        default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
+    )
+    version_2_with_negative: bool = field(
+        default=False, metadata={"help": "If true, the SQuAD examples contain some that do not have an answer."}
+    )
+    null_score_diff_threshold: float = field(
+        default=0.0, metadata={"help": "If null_score - best_non_null is greater than the threshold predict null."}
+    )
+    n_best_size: int = field(
+        default=20, metadata={"help": "The total number of n-best predictions to generate."}
+    )
+    lang_id: int = field(
+        default=0,
+        metadata={
+            "help": "language id of input for language-specific xlm models (see tokenization_xlm.PRETRAINED_INIT_CONFIGURATION)"
+        },
+    )
+
+
+def main():
+    # See all possible arguments in src/transformers/training_args.py
+    # or by passing the --help flag to this script.
+    # We now keep distinct sets of args, for a cleaner separation of concerns.
+    parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TFTrainingArguments))
+    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
+
+    if (
+        os.path.exists(training_args.output_dir)
+        and os.listdir(training_args.output_dir)
+        and training_args.do_train
+        and not training_args.overwrite_output_dir
+    ):
+        raise ValueError(
+            f"Output directory ({training_args.output_dir}) already exists and is not empty. Use --overwrite_output_dir to overcome."
+        )
+
+    # Setup logging
+    logging.basicConfig(
+        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
+        datefmt="%m/%d/%Y %H:%M:%S",
+        level=logging.INFO,
+    )
+    logger.info(
+        "n_gpu: %s, distributed training: %s, 16-bits training: %s",
+        training_args.n_gpu,
+        bool(training_args.n_gpu > 1),
+        training_args.fp16,
+    )
+    logger.info("Training/evaluation parameters %s", training_args)
+
+    # Prepare Question-Answering task
+    # Load pretrained model and tokenizer
+    #
+    # Distributed training:
+    # The .from_pretrained methods guarantee that only one local process can concurrently
+    # download model & vocab.
+
+    config = AutoConfig.from_pretrained(
+        model_args.config_name if model_args.config_name else model_args.model_name_or_path,
+        cache_dir=model_args.cache_dir,
+    )
+    tokenizer = AutoTokenizer.from_pretrained(
+        model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
+        cache_dir=model_args.cache_dir,
+        use_fast=model_args.use_fast,
+    )
+
+    with training_args.strategy.scope():
+        model = TFAutoModelForQuestionAnswering.from_pretrained(
+            model_args.model_name_or_path,
+            from_pt=bool(".bin" in model_args.model_name_or_path),
+            config=config,
+            cache_dir=model_args.cache_dir,
+        )
+
+    # Get datasets
+    if not data_args.data_dir:
+        if data_args.version_2_with_negative:
+            logger.warning("tensorflow_datasets does not handle version 2 of SQuAD. Falling back to version 1.")
+
+        try:
+            import tensorflow_datasets as tfds
+        except ImportError:
+            raise ImportError("If no data_dir is specified, tensorflow_datasets needs to be installed.")
+
+        tfds_examples = tfds.load("squad")
+        train_examples = (
+            SquadV1Processor().get_examples_from_dataset(tfds_examples, evaluate=False)
+            if training_args.do_train
+            else None
+        )
+        eval_examples = (
+            SquadV1Processor().get_examples_from_dataset(tfds_examples, evaluate=True)
+            if training_args.do_eval
+            else None
+        )
+    else:
+        processor = SquadV2Processor() if data_args.version_2_with_negative else SquadV1Processor()
+        train_examples = processor.get_train_examples(data_args.data_dir) if training_args.do_train else None
+        eval_examples = processor.get_dev_examples(data_args.data_dir) if training_args.do_eval else None
+
+    train_dataset = (
+        squad_convert_examples_to_features(
+            examples=train_examples,
+            tokenizer=tokenizer,
+            max_seq_length=data_args.max_seq_length,
+            doc_stride=data_args.doc_stride,
+            max_query_length=data_args.max_query_length,
+            is_training=True,
+            return_dataset="tf",
+        )
+        if training_args.do_train
+        else None
+    )
+
+    eval_dataset = (
+        squad_convert_examples_to_features(
+            examples=eval_examples,
+            tokenizer=tokenizer,
+            max_seq_length=data_args.max_seq_length,
+            doc_stride=data_args.doc_stride,
+            max_query_length=data_args.max_query_length,
+            is_training=False,
+            return_dataset="tf",
+        )
+        if training_args.do_eval
+        else None
+    )
+
+    # Initialize our Trainer
+    trainer = TFTrainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset,)
+
+    # Training
+    if training_args.do_train:
+        trainer.train()
+        trainer.save_model()
+        tokenizer.save_pretrained(training_args.output_dir)
+
+
+if __name__ == "__main__":
+    main()
--- a/examples/requirements.txt
+++ b/examples/requirements.txt
@@ -6,3 +6,4 @@ sacrebleu
 rouge-score
 tensorflow_datasets
 pytorch-lightning==0.7.3  # April 10, 2020 release
+matplotlib
--- a/examples/summarization/bart/evaluate_cnn.py
+++ b/examples/summarization/bart/evaluate_cnn.py
@@ -21,7 +21,7 @@ def generate_summaries(
 ):
    fout = Path(out_file).open("w")
    model = BartForConditionalGeneration.from_pretrained(model_name).to(device)
-    tokenizer = BartTokenizer.from_pretrained("bart-large")
+    tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")

    max_length = 140
    min_length = 55
@@ -54,7 +54,7 @@ def run_generate():
        "output_path", type=str, help="where to save summaries",
    )
    parser.add_argument(
-        "model_name", type=str, default="bart-large-cnn", help="like bart-large-cnn",
+        "model_name", type=str, default="facebook/bart-large-cnn", help="like facebook/bart-large-cnn",
    )
    parser.add_argument(
        "--device", type=str, required=False, default=DEFAULT_DEVICE, help="cuda, cuda:1, cpu etc.",
--- a/examples/summarization/bart/test_bart_examples.py
+++ b/examples/summarization/bart/test_bart_examples.py
@@ -129,7 +129,7 @@ class TestBartExamples(unittest.TestCase):
        summaries = ["A very interesting story about what I ate for lunch.", "Avocado, celery, turkey, coffee"]
        _dump_articles((tmp_dir / "train.source"), articles)
        _dump_articles((tmp_dir / "train.target"), summaries)
-        tokenizer = BartTokenizer.from_pretrained("bart-large")
+        tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
        max_len_source = max(len(tokenizer.encode(a)) for a in articles)
        max_len_target = max(len(tokenizer.encode(a)) for a in summaries)
        trunc_target = 4
--- a/examples/summarization/bertabs/configuration_bertabs.py
+++ b/examples/summarization/bertabs/configuration_bertabs.py
@@ -61,7 +61,6 @@ class BertAbsConfig(PretrainedConfig):
            the decoder.
    """

-    pretrained_config_archive_map = BERTABS_FINETUNED_CONFIG_MAP
    model_type = "bertabs"

    def __init__(
--- a/examples/summarization/bertabs/modeling_bertabs.py
+++ b/examples/summarization/bertabs/modeling_bertabs.py
@@ -33,14 +33,13 @@ from transformers import BertConfig, BertModel, PreTrainedModel

 MAX_SIZE = 5000

-BERTABS_FINETUNED_MODEL_MAP = {
-    "bertabs-finetuned-cnndm": "https://cdn.huggingface.co/remi/bertabs-finetuned-cnndm-extractive-abstractive-summarization/pytorch_model.bin",
-}
+BERTABS_FINETUNED_MODEL_ARCHIVE_LIST = [
+    "remi/bertabs-finetuned-cnndm-extractive-abstractive-summarization",
+]


 class BertAbsPreTrainedModel(PreTrainedModel):
    config_class = BertAbsConfig
-    pretrained_model_archive_map = BERTABS_FINETUNED_MODEL_MAP
    load_tf_weights = False
    base_model_prefix = "bert"

--- a/examples/test_examples.py
+++ b/examples/test_examples.py
@@ -61,8 +61,8 @@ class ExamplesTests(unittest.TestCase):
            --do_train
            --do_eval
            --output_dir ./tests/fixtures/tests_samples/temp_dir
-            --per_gpu_train_batch_size=2
-            --per_gpu_eval_batch_size=1
+            --per_device_train_batch_size=2
+            --per_device_eval_batch_size=1
            --learning_rate=1e-4
            --max_steps=10
            --warmup_steps=2
@@ -72,7 +72,7 @@ class ExamplesTests(unittest.TestCase):
            """.split()
        with patch.object(sys, "argv", testargs):
            result = run_glue.main()
-            del result["loss"]
+            del result["eval_loss"]
            for value in result.values():
                self.assertGreaterEqual(value, 0.75)

--- a/examples/text-classification/README.md
+++ b/examples/text-classification/README.md
@@ -2,7 +2,7 @@

 # Run TensorFlow 2.0 version

-Based on the script [`run_tf_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/run_tf_glue.py).
+Based on the script [`run_tf_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_tf_glue.py).

 Fine-tuning the library TensorFlow 2.0 Bert model for sequence classification on the  MRPC task of the GLUE benchmark: [General Language Understanding Evaluation](https://gluebenchmark.com/).

@@ -68,7 +68,7 @@ python run_glue.py \
  --do_eval \
  --data_dir $GLUE_DIR/$TASK_NAME \
  --max_seq_length 128 \
-  --per_gpu_train_batch_size 32 \
+  --per_device_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir /tmp/$TASK_NAME/
@@ -85,10 +85,12 @@ CoLA, SST-2. The following section provides details on how to run half-precision
 said, there shouldn’t be any issues in running half-precision training with the remaining GLUE tasks as well,
 since the data processor for each task inherits from the base class DataProcessor.

-## Running on TPUs
+## Running on TPUs in PyTorch

-You can accelerate your workloads on Google's TPUs. For information on how to setup your TPU environment refer to this
-[README](https://github.com/pytorch/xla/blob/master/README.md).
+**Update**: read the more up-to-date [Running on TPUs](../README.md#running-on-tpus) in the main README.md instead.
+
+Even when running PyTorch, you can accelerate your workloads on Google's TPUs, using `pytorch/xla`. For information on how to set up your TPU environment, refer to the
+[pytorch/xla README](https://github.com/pytorch/xla/blob/master/README.md).

 The following are some examples of running the `*_tpu.py` finetuning scripts on TPUs. All steps for data preparation are
 identical to your normal GPU + Huggingface setup.
@@ -101,7 +103,6 @@ export GLUE_DIR=/path/to/glue
 export TASK_NAME=MNLI

 python run_glue_tpu.py \
-  --model_type bert \
  --model_name_or_path bert-base-cased \
  --task_name $TASK_NAME \
  --do_train \
@@ -115,8 +116,7 @@ python run_glue_tpu.py \
  --overwrite_output_dir \
  --logging_steps 50 \
  --save_steps 200 \
-  --num_cores=8 \
-  --only_log_master
+  --num_cores=8
 ```

 ### MRPC
@@ -141,7 +141,7 @@ python run_glue.py \
  --do_eval \
  --data_dir $GLUE_DIR/MRPC/ \
  --max_seq_length 128 \
-  --per_gpu_train_batch_size 32 \
+  --per_device_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir /tmp/mrpc_output/
@@ -166,7 +166,7 @@ python run_glue.py \
  --do_eval \
  --data_dir $GLUE_DIR/MRPC/ \
  --max_seq_length 128 \
-  --per_gpu_train_batch_size 32 \
+  --per_device_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir /tmp/mrpc_output/ \
@@ -189,7 +189,7 @@ python -m torch.distributed.launch \
    --do_eval \
    --data_dir $GLUE_DIR/MRPC/ \
    --max_seq_length 128 \
-    --per_gpu_train_batch_size 8 \
+    --per_device_train_batch_size 8 \
    --learning_rate 2e-5 \
    --num_train_epochs 3.0 \
    --output_dir /tmp/mrpc_output/
@@ -221,7 +221,7 @@ python -m torch.distributed.launch \
    --do_eval \
    --data_dir $GLUE_DIR/MNLI/ \
    --max_seq_length 128 \
-    --per_gpu_train_batch_size 8 \
+    --per_device_train_batch_size 8 \
    --learning_rate 2e-5 \
    --num_train_epochs 3.0 \
    --output_dir output_dir \
@@ -256,9 +256,9 @@ TEST RESULTS {'val_loss': tensor(0.0707), 'precision': 0.852427800698191, 'recal

 # XNLI

-Based on the script [`run_xnli.py`](https://github.com/huggingface/transformers/blob/master/examples/run_xnli.py).
+Based on the script [`run_xnli.py`](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_xnli.py).

-[XNLI](https://www.nyu.edu/projects/bowman/xnli/) is crowd-sourced dataset based on [MultiNLI](http://www.nyu.edu/projects/bowman/multinli/). It is an evaluation benchmark for cross-lingual text representations. Pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-resource language such as English and low-resource languages such as Swahili).
+[XNLI](https://www.nyu.edu/projects/bowman/xnli/) is a crowd-sourced dataset based on [MultiNLI](http://www.nyu.edu/projects/bowman/multinli/). It is an evaluation benchmark for cross-lingual text representations. Pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-resource language such as English and low-resource languages such as Swahili).

 #### Fine-tuning on XNLI

@@ -273,14 +273,13 @@ on a single tesla V100 16GB. The data for XNLI can be downloaded with the follow
 export XNLI_DIR=/path/to/XNLI

 python run_xnli.py \
-  --model_type bert \
  --model_name_or_path bert-base-multilingual-cased \
  --language de \
  --train_language en \
  --do_train \
  --do_eval \
  --data_dir $XNLI_DIR \
-  --per_gpu_train_batch_size 32 \
+  --per_device_train_batch_size 32 \
  --learning_rate 5e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 128 \
--- a/examples/text-classification/run_glue.py
+++ b/examples/text-classification/run_glue.py
@@ -134,16 +134,9 @@ def main():
    )

    # Get datasets
-    train_dataset = (
-        GlueDataset(data_args, tokenizer=tokenizer, local_rank=training_args.local_rank)
-        if training_args.do_train
-        else None
-    )
-    eval_dataset = (
-        GlueDataset(data_args, tokenizer=tokenizer, local_rank=training_args.local_rank, evaluate=True)
-        if training_args.do_eval
-        else None
-    )
+    train_dataset = GlueDataset(data_args, tokenizer=tokenizer) if training_args.do_train else None
+    eval_dataset = GlueDataset(data_args, tokenizer=tokenizer, mode="dev") if training_args.do_eval else None
+    test_dataset = GlueDataset(data_args, tokenizer=tokenizer, mode="test") if training_args.do_predict else None

    def compute_metrics(p: EvalPrediction) -> Dict:
        if output_mode == "classification":
@@ -173,33 +166,57 @@ def main():
            tokenizer.save_pretrained(training_args.output_dir)

    # Evaluation
-    results = {}
-    if training_args.do_eval and training_args.local_rank in [-1, 0]:
+    eval_results = {}
+    if training_args.do_eval:
        logger.info("*** Evaluate ***")

        # Loop to handle MNLI double evaluation (matched, mis-matched)
        eval_datasets = [eval_dataset]
        if data_args.task_name == "mnli":
            mnli_mm_data_args = dataclasses.replace(data_args, task_name="mnli-mm")
-            eval_datasets.append(
-                GlueDataset(mnli_mm_data_args, tokenizer=tokenizer, local_rank=training_args.local_rank, evaluate=True)
-            )
+            eval_datasets.append(GlueDataset(mnli_mm_data_args, tokenizer=tokenizer, mode="dev"))

        for eval_dataset in eval_datasets:
-            result = trainer.evaluate(eval_dataset=eval_dataset)
+            eval_result = trainer.evaluate(eval_dataset=eval_dataset)

            output_eval_file = os.path.join(
                training_args.output_dir, f"eval_results_{eval_dataset.args.task_name}.txt"
            )
-            with open(output_eval_file, "w") as writer:
-                logger.info("***** Eval results {} *****".format(eval_dataset.args.task_name))
-                for key, value in result.items():
-                    logger.info("  %s = %s", key, value)
-                    writer.write("%s = %s\n" % (key, value))
+            if trainer.is_world_master():
+                with open(output_eval_file, "w") as writer:
+                    logger.info("***** Eval results {} *****".format(eval_dataset.args.task_name))
+                    for key, value in eval_result.items():
+                        logger.info("  %s = %s", key, value)
+                        writer.write("%s = %s\n" % (key, value))

-            results.update(result)
+            eval_results.update(eval_result)

-    return results
+    if training_args.do_predict:
+        logger.info("*** Test ***")
+        test_datasets = [test_dataset]
+        if data_args.task_name == "mnli":
+            mnli_mm_data_args = dataclasses.replace(data_args, task_name="mnli-mm")
+            test_datasets.append(GlueDataset(mnli_mm_data_args, tokenizer=tokenizer, mode="test"))
+
+        for test_dataset in test_datasets:
+            predictions = trainer.predict(test_dataset=test_dataset).predictions
+            if output_mode == "classification":
+                predictions = np.argmax(predictions, axis=1)
+
+            output_test_file = os.path.join(
+                training_args.output_dir, f"test_results_{test_dataset.args.task_name}.txt"
+            )
+            if trainer.is_world_master():
+                with open(output_test_file, "w") as writer:
+                    logger.info("***** Test results {} *****".format(test_dataset.args.task_name))
+                    writer.write("index\tprediction\n")
+                    for index, item in enumerate(predictions):
+                        if output_mode == "regression":
+                            writer.write("%d\t%3.3f\n" % (index, item))
+                        else:
+                            item = test_dataset.get_labels()[item]
+                            writer.write("%d\t%s\n" % (index, item))
+    return eval_results


 def _mp_fn(index):
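Review note: the new `do_predict` branch in `run_glue.py` reduces to three steps: take the argmax over the logits, look the index up in the label list, and write one tab-separated row per example. The sketch below illustrates this under stated assumptions. It uses plain Python lists instead of NumPy arrays, and `format_predictions` and the sample labels are hypothetical names, not part of the script.

```python
from typing import List, Sequence


def format_predictions(
    predictions: Sequence[Sequence[float]],
    labels: List[str],
    output_mode: str = "classification",
) -> List[str]:
    """Build the lines of a test-results file: a header plus one row per example."""
    lines = ["index\tprediction\n"]
    for index, scores in enumerate(predictions):
        if output_mode == "regression":
            # Assumption: a regression head emits a single score per example.
            lines.append("%d\t%3.3f\n" % (index, scores[0]))
        else:
            # Classification: index of the highest logit (argmax), mapped to its label.
            best = max(range(len(scores)), key=lambda i: scores[i])
            lines.append("%d\t%s\n" % (index, labels[best]))
    return lines


rows = format_predictions(
    [[0.1, 2.3, -1.0], [1.5, 0.2, 0.3]],
    ["contradiction", "entailment", "neutral"],
)
```

In the script itself only the process for which `trainer.is_world_master()` is true writes these rows to disk, so distributed runs do not clobber the same file.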
--- a/examples/text-classification/run_xnli.py
+++ b/examples/text-classification/run_xnli.py
@@ -13,7 +13,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-""" Finetuning multi-lingual models on XNLI (Bert, DistilBERT, XLM).
+""" Finetuning multi-lingual models on XNLI (e.g. Bert, DistilBERT, XLM).
    Adapted from `examples/text-classification/run_glue.py`"""


@@ -32,15 +32,9 @@ from tqdm import tqdm, trange
 from transformers import (
    WEIGHTS_NAME,
    AdamW,
-    BertConfig,
-    BertForSequenceClassification,
-    BertTokenizer,
-    DistilBertConfig,
-    DistilBertForSequenceClassification,
-    DistilBertTokenizer,
-    XLMConfig,
-    XLMForSequenceClassification,
-    XLMTokenizer,
+    AutoConfig,
+    AutoModelForSequenceClassification,
+    AutoTokenizer,
    get_linear_schedule_with_warmup,
 )
 from transformers import glue_convert_examples_to_features as convert_examples_to_features
@@ -57,16 +51,6 @@ except ImportError:

 logger = logging.getLogger(__name__)

-ALL_MODELS = sum(
-    (tuple(conf.pretrained_config_archive_map.keys()) for conf in (BertConfig, DistilBertConfig, XLMConfig)), ()
-)
-
-MODEL_CLASSES = {
-    "bert": (BertConfig, BertForSequenceClassification, BertTokenizer),
-    "xlm": (XLMConfig, XLMForSequenceClassification, XLMTokenizer),
-    "distilbert": (DistilBertConfig, DistilBertForSequenceClassification, DistilBertTokenizer),
-}
-

 def set_seed(args):
    random.seed(args.seed)
@@ -377,19 +361,12 @@ def main():
        required=True,
        help="The input data dir. Should contain the .tsv files (or other data files) for the task.",
    )
-    parser.add_argument(
-        "--model_type",
-        default=None,
-        type=str,
-        required=True,
-        help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()),
-    )
    parser.add_argument(
        "--model_name_or_path",
        default=None,
        type=str,
        required=True,
-        help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS),
+        help="Path to pretrained model or model identifier from huggingface.co/models",
    )
    parser.add_argument(
        "--language",
@@ -421,7 +398,7 @@ def main():
    )
    parser.add_argument(
        "--cache_dir",
-        default="",
+        default=None,
        type=str,
        help="Where do you want to store the pre-trained models downloaded from s3",
    )
@@ -562,24 +539,23 @@ def main():
    if args.local_rank not in [-1, 0]:
        torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab

-    args.model_type = args.model_type.lower()
-    config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
-    config = config_class.from_pretrained(
+    config = AutoConfig.from_pretrained(
        args.config_name if args.config_name else args.model_name_or_path,
        num_labels=num_labels,
        finetuning_task=args.task_name,
-        cache_dir=args.cache_dir if args.cache_dir else None,
+        cache_dir=args.cache_dir,
    )
-    tokenizer = tokenizer_class.from_pretrained(
+    args.model_type = config.model_type
+    tokenizer = AutoTokenizer.from_pretrained(
        args.tokenizer_name if args.tokenizer_name else args.model_name_or_path,
        do_lower_case=args.do_lower_case,
-        cache_dir=args.cache_dir if args.cache_dir else None,
+        cache_dir=args.cache_dir,
    )
-    model = model_class.from_pretrained(
+    model = AutoModelForSequenceClassification.from_pretrained(
        args.model_name_or_path,
        from_tf=bool(".ckpt" in args.model_name_or_path),
        config=config,
-        cache_dir=args.cache_dir if args.cache_dir else None,
+        cache_dir=args.cache_dir,
    )

    if args.local_rank == 0:
@@ -614,14 +590,13 @@ def main():
        torch.save(args, os.path.join(args.output_dir, "training_args.bin"))

        # Load a trained model and vocabulary that you have fine-tuned
-        model = model_class.from_pretrained(args.output_dir)
-        tokenizer = tokenizer_class.from_pretrained(args.output_dir)
+        model = AutoModelForSequenceClassification.from_pretrained(args.output_dir)
+        tokenizer = AutoTokenizer.from_pretrained(args.output_dir)
        model.to(args.device)

    # Evaluation
    results = {}
    if args.do_eval and args.local_rank in [-1, 0]:
-        tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
        checkpoints = [args.output_dir]
        if args.eval_all_checkpoints:
            checkpoints = list(
@@ -633,7 +608,7 @@ def main():
            global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
            prefix = checkpoint.split("/")[-1] if checkpoint.find("checkpoint") != -1 else ""

-            model = model_class.from_pretrained(checkpoint)
+            model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
            model.to(args.device)
            result = evaluate(args, model, tokenizer, prefix=prefix)
            result = dict((k + "_{}".format(global_step), v) for k, v in result.items())
--- a/examples/text-generation/README.md
+++ b/examples/text-generation/README.md
@@ -1,6 +1,6 @@
 ## Language generation

-Based on the script [`run_generation.py`](https://github.com/huggingface/transformers/blob/master/examples/run_generation.py).
+Based on the script [`run_generation.py`](https://github.com/huggingface/transformers/blob/master/examples/text-generation/run_generation.py).

 Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL, XLNet, CTRL.
 A similar script is used for our official demo [Write With Transformer](https://transformer.huggingface.co), where you
--- a/examples/text-generation/pplm/run_pplm.py
+++ b/examples/text-generation/pplm/run_pplm.py
@@ -17,10 +17,10 @@

 """
 Example command with bag of words:
-python examples/run_pplm.py -B space --cond_text "The president" --length 100 --gamma 1.5 --num_iterations 3 --num_samples 10 --stepsize 0.01 --window_length 5 --kl_scale 0.01 --gm_scale 0.95
+python run_pplm.py -B space --cond_text "The president" --length 100 --gamma 1.5 --num_iterations 3 --num_samples 10 --stepsize 0.01 --window_length 5 --kl_scale 0.01 --gm_scale 0.95

 Example command with discriminator:
-python examples/run_pplm.py -D sentiment --class_label 3 --cond_text "The lake" --length 10 --gamma 1.0 --num_iterations 30 --num_samples 10 --stepsize 0.01 --kl_scale 0.01 --gm_scale 0.95
+python run_pplm.py -D sentiment --class_label 3 --cond_text "The lake" --length 10 --gamma 1.0 --num_iterations 30 --num_samples 10 --stepsize 0.01 --kl_scale 0.01 --gm_scale 0.95
 """

 import argparse
--- a/examples/token-classification/README.md
+++ b/examples/token-classification/README.md
@@ -1,7 +1,7 @@
 ## Named Entity Recognition

-Based on the scripts [`run_ner.py`](https://github.com/huggingface/transformers/blob/master/examples/ner/run_ner.py) for Pytorch and
-[`run_tf_ner.py`](https://github.com/huggingface/transformers/blob/master/examples/ner/run_tf_ner.py) for Tensorflow 2.
+Based on the scripts [`run_ner.py`](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner.py) for Pytorch and
+[`run_tf_ner.py`](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_tf_ner.py) for Tensorflow 2.
 This example fine-tunes Bert Multilingual on GermEval 2014 (German NER).
 Details and results for the fine-tuning provided by @stefan-it.

@@ -69,7 +69,7 @@ python3 run_ner.py --data_dir ./ \
 --output_dir $OUTPUT_DIR \
 --max_seq_length  $MAX_LENGTH \
 --num_train_epochs $NUM_EPOCHS \
---per_gpu_train_batch_size $BATCH_SIZE \
+--per_device_train_batch_size $BATCH_SIZE \
 --save_steps $SAVE_STEPS \
 --seed $SEED \
 --do_train \
@@ -91,7 +91,7 @@ Instead of passing all parameters via commandline arguments, the `run_ner.py` sc
    "output_dir": "germeval-model",
    "max_seq_length": 128,
    "num_train_epochs": 3,
-    "per_gpu_train_batch_size": 32,
+    "per_device_train_batch_size": 32,
    "save_steps": 750,
    "seed": 1,
    "do_train": true,
--- a/examples/token-classification/run_ner.py
+++ b/examples/token-classification/run_ner.py
@@ -13,7 +13,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-""" Fine-tuning the library models for named entity recognition on CoNLL-2003 (Bert or Roberta). """
+""" Fine-tuning the library models for named entity recognition on CoNLL-2003. """


 import logging
@@ -171,7 +171,6 @@ def main():
            max_seq_length=data_args.max_seq_length,
            overwrite_cache=data_args.overwrite_cache,
            mode=Split.train,
-            local_rank=training_args.local_rank,
        )
        if training_args.do_train
        else None
@@ -185,7 +184,6 @@ def main():
            max_seq_length=data_args.max_seq_length,
            overwrite_cache=data_args.overwrite_cache,
            mode=Split.dev,
-            local_rank=training_args.local_rank,
        )
        if training_args.do_eval
        else None
@@ -237,22 +235,23 @@ def main():

    # Evaluation
    results = {}
-    if training_args.do_eval and training_args.local_rank in [-1, 0]:
+    if training_args.do_eval:
        logger.info("*** Evaluate ***")

        result = trainer.evaluate()

        output_eval_file = os.path.join(training_args.output_dir, "eval_results.txt")
-        with open(output_eval_file, "w") as writer:
-            logger.info("***** Eval results *****")
-            for key, value in result.items():
-                logger.info("  %s = %s", key, value)
-                writer.write("%s = %s\n" % (key, value))
+        if trainer.is_world_master():
+            with open(output_eval_file, "w") as writer:
+                logger.info("***** Eval results *****")
+                for key, value in result.items():
+                    logger.info("  %s = %s", key, value)
+                    writer.write("%s = %s\n" % (key, value))

            results.update(result)

    # Predict
-    if training_args.do_predict and training_args.local_rank in [-1, 0]:
+    if training_args.do_predict:
        test_dataset = NerDataset(
            data_dir=data_args.data_dir,
            tokenizer=tokenizer,
@@ -261,36 +260,44 @@ def main():
            max_seq_length=data_args.max_seq_length,
            overwrite_cache=data_args.overwrite_cache,
            mode=Split.test,
-            local_rank=training_args.local_rank,
        )

        predictions, label_ids, metrics = trainer.predict(test_dataset)
        preds_list, _ = align_predictions(predictions, label_ids)

        output_test_results_file = os.path.join(training_args.output_dir, "test_results.txt")
-        with open(output_test_results_file, "w") as writer:
-            for key, value in metrics.items():
-                logger.info("  %s = %s", key, value)
-                writer.write("%s = %s\n" % (key, value))
+        if trainer.is_world_master():
+            with open(output_test_results_file, "w") as writer:
+                for key, value in metrics.items():
+                    logger.info("  %s = %s", key, value)
+                    writer.write("%s = %s\n" % (key, value))

        # Save predictions
        output_test_predictions_file = os.path.join(training_args.output_dir, "test_predictions.txt")
-        with open(output_test_predictions_file, "w") as writer:
-            with open(os.path.join(data_args.data_dir, "test.txt"), "r") as f:
-                example_id = 0
-                for line in f:
-                    if line.startswith("-DOCSTART-") or line == "" or line == "\n":
-                        writer.write(line)
-                        if not preds_list[example_id]:
-                            example_id += 1
-                    elif preds_list[example_id]:
-                        output_line = line.split()[0] + " " + preds_list[example_id].pop(0) + "\n"
-                        writer.write(output_line)
-                    else:
-                        logger.warning("Maximum sequence length exceeded: No prediction for '%s'.", line.split()[0])
+        if trainer.is_world_master():
+            with open(output_test_predictions_file, "w") as writer:
+                with open(os.path.join(data_args.data_dir, "test.txt"), "r") as f:
+                    example_id = 0
+                    for line in f:
+                        if line.startswith("-DOCSTART-") or line == "" or line == "\n":
+                            writer.write(line)
+                            if not preds_list[example_id]:
+                                example_id += 1
+                        elif preds_list[example_id]:
+                            output_line = line.split()[0] + " " + preds_list[example_id].pop(0) + "\n"
+                            writer.write(output_line)
+                        else:
+                            logger.warning(
+                                "Maximum sequence length exceeded: No prediction for '%s'.", line.split()[0]
+                            )

    return results


+def _mp_fn(index):
+    # For xla_spawn (TPUs)
+    main()
+
+
 if __name__ == "__main__":
    main()
--- a/examples/token-classification/test_ner_examples.py
+++ b/examples/token-classification/test_ner_examples.py
@@ -6,7 +6,7 @@ from unittest.mock import patch
 import run_ner


-logging.basicConfig(level=logging.DEBUG)
+logging.basicConfig(level=logging.INFO)

 logger = logging.getLogger()

@@ -30,4 +30,4 @@ class ExamplesTests(unittest.TestCase):
            """.split()
        with patch.object(sys, "argv", ["run.py"] + testargs):
            result = run_ner.main()
-            self.assertLess(result["loss"], 1.5)
+            self.assertLess(result["eval_loss"], 1.5)
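Review note: the test above drives `run_ner.main()` by temporarily replacing `sys.argv`, then asserts on the returned metrics, which are now keyed as `eval_loss`. A minimal sketch of the same pattern follows. The `main` function here is a hypothetical stand-in that only parses flags and returns a fixed metric; it is not the real script.

```python
import argparse
import sys
from unittest.mock import patch


def main() -> dict:
    # Hypothetical stand-in for run_ner.main(): parse CLI flags, return a metrics dict.
    parser = argparse.ArgumentParser()
    parser.add_argument("--max_seq_length", type=int, default=128)
    parser.add_argument("--do_eval", action="store_true")
    args = parser.parse_args()
    return {"eval_loss": 0.5, "max_seq_length": args.max_seq_length, "do_eval": args.do_eval}


testargs = "--max_seq_length 64 --do_eval".split()
original_argv = list(sys.argv)

# sys.argv is swapped only inside the with-block and restored on exit.
with patch.object(sys, "argv", ["run.py"] + testargs):
    result = main()
```

Because `patch.object` restores the attribute on exit, later tests in the same process see the original command line.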
--- a/examples/token-classification/utils_ner.py
+++ b/examples/token-classification/utils_ner.py
@@ -22,6 +22,8 @@ from dataclasses import dataclass
 from enum import Enum
 from typing import List, Optional, Union

+from filelock import FileLock
+
 from transformers import PreTrainedTokenizer, is_tf_available, is_torch_available


@@ -68,7 +70,6 @@ if is_torch_available():
    import torch
    from torch import nn
    from torch.utils.data.dataset import Dataset
-    from transformers import torch_distributed_zero_first

    class NerDataset(Dataset):
        """
@@ -90,16 +91,16 @@ if is_torch_available():
            max_seq_length: Optional[int] = None,
            overwrite_cache=False,
            mode: Split = Split.train,
-            local_rank=-1,
        ):
            # Load data features from cache or dataset file
            cached_features_file = os.path.join(
                data_dir, "cached_{}_{}_{}".format(mode.value, tokenizer.__class__.__name__, str(max_seq_length)),
            )

-            with torch_distributed_zero_first(local_rank):
-                # Make sure only the first process in distributed training processes the dataset,
-                # and the others will use the cache.
+            # Make sure only the first process in distributed training processes the dataset,
+            # and the others will use the cache.
+            lock_path = cached_features_file + ".lock"
+            with FileLock(lock_path):

                if os.path.exists(cached_features_file) and not overwrite_cache:
                    logger.info(f"Loading features from cached file {cached_features_file}")
@@ -125,9 +126,8 @@ if is_torch_available():
                        pad_token_segment_id=tokenizer.pad_token_type_id,
                        pad_token_label_id=self.pad_token_label_id,
                    )
-                    if local_rank in [-1, 0]:
-                        logger.info(f"Saving features into cached file {cached_features_file}")
-                        torch.save(self.features, cached_features_file)
+                    logger.info(f"Saving features into cached file {cached_features_file}")
+                    torch.save(self.features, cached_features_file)

        def __len__(self):
            return len(self.features)
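Review note: replacing `torch_distributed_zero_first(local_rank)` with a file lock means whichever process acquires the lock first builds and saves the features, and every later process finds the cache on disk and loads it. That is why the `local_rank` guard around `torch.save` could go. The sketch below shows this cache-or-build pattern. To stay stdlib-only it uses `fcntl.flock` (POSIX only) as a stand-in for `filelock.FileLock` and `pickle` as a stand-in for `torch.save`; `load_or_build_features` and `build` are hypothetical names.

```python
import fcntl
import os
import pickle
import tempfile


def load_or_build_features(cache_path, build_fn, overwrite_cache=False):
    """Return cached features, building and saving them under an exclusive lock if needed."""
    lock_path = cache_path + ".lock"
    with open(lock_path, "w") as lock_file:
        # Blocks until no other process holds the lock (advisory lock, POSIX only).
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if os.path.exists(cache_path) and not overwrite_cache:
                with open(cache_path, "rb") as f:
                    return pickle.load(f)
            features = build_fn()
            with open(cache_path, "wb") as f:
                pickle.dump(features, f)
            return features
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


calls = []


def build():
    # Records each invocation so the caching behaviour can be observed.
    calls.append(1)
    return [[101, 2023, 102]]


cache_file = os.path.join(tempfile.mkdtemp(), "cached_train")
first = load_or_build_features(cache_file, build)
second = load_or_build_features(cache_file, build)
```

The second call returns the pickled features without invoking `build` again, which mirrors what a non-zero rank would see after rank 0 has written the cache.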
--- a/examples/xla_spawn.py
+++ b/examples/xla_spawn.py
@@ -12,17 +12,13 @@ Inspired by https://github.com/pytorch/pytorch/blob/master/torch/distributed/lau


 import importlib
-import os
 import sys
 from argparse import REMAINDER, ArgumentParser
+from pathlib import Path

 import torch_xla.distributed.xla_multiprocessing as xmp


-def trim_suffix(s: str, suffix: str):
-    return s if not s.endswith(suffix) or len(suffix) == 0 else s[: -len(suffix)]
-
-
 def parse_args():
    """
    Helper function parsing the command line options
@@ -44,7 +40,7 @@ def parse_args():
        "training_script",
        type=str,
        help=(
-            "The full module name to the single TPU training "
+            "The full path to the single TPU training "
            "program/script to be launched in parallel, "
            "followed by all the arguments for the "
            "training script"
@@ -61,7 +57,9 @@ def main():
    args = parse_args()

    # Import training_script as a module.
-    mod_name = trim_suffix(os.path.basename(args.training_script), ".py")
+    script_fpath = Path(args.training_script)
+    sys.path.append(str(script_fpath.parent.resolve()))
+    mod_name = script_fpath.stem
    mod = importlib.import_module(mod_name)

    # Patch sys.argv
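Review note: the old `trim_suffix(os.path.basename(...))` approach produced a module name but only imported successfully when the script's directory was already on `sys.path`. The new code adds the script's parent directory first. A self-contained sketch of that logic follows; `import_script` and the temporary `toy_train_script.py` are hypothetical and exist only for the demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_script(training_script: str):
    """Import a script given by file path as a module, mirroring the change above."""
    script_fpath = Path(training_script)
    # Make the script's directory importable, then import by stem (file name without ".py").
    sys.path.append(str(script_fpath.parent.resolve()))
    return importlib.import_module(script_fpath.stem)


# Hypothetical script written to a temp dir purely for this demonstration.
script_dir = Path(tempfile.mkdtemp())
(script_dir / "toy_train_script.py").write_text("def _mp_fn(index):\n    return index * 2\n")

mod = import_script(str(script_dir / "toy_train_script.py"))
```

The launcher can then hand `mod._mp_fn` to the XLA spawner, which is why each example script in this diff gains an `_mp_fn(index)` entry point.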
--- a/model_cards/HooshvareLab/bert-base-parsbert-armanner-uncased/README.md
+++ b/model_cards/HooshvareLab/bert-base-parsbert-armanner-uncased/README.md
@@ -0,0 +1,124 @@
+## ParsBERT: Transformer-based Model for Persian Language Understanding
+
+ParsBERT is a monolingual language model based on Google’s BERT architecture with the same configurations as BERT-Base. 
+
+Paper presenting ParsBERT: [arXiv:2005.12515](https://arxiv.org/abs/2005.12515)
+
+All the models (downstream tasks) are uncased and trained with whole word masking. (Coming soon, stay tuned.)
+
+
+## Persian NER [ARMAN, PEYMA, ARMAN+PEYMA]
+
+This task aims to extract named entities from the text, such as names, and label them with appropriate `NER` classes such as locations, organizations, etc. The datasets used for this task contain sentences marked in the `IOB` format. In this format, tokens that are not part of an entity are tagged as `"O"`, the `"B"` tag corresponds to the first word of an entity, and the `"I"` tag corresponds to the remaining words of the same entity. Both `"B"` and `"I"` tags are followed by a hyphen (or underscore) and then the entity category. The NER task is therefore a multi-class token classification problem that labels each token of a raw input text. There are two primary datasets used in Persian NER, `ARMAN` and `PEYMA`. In ParsBERT, we prepared NER models for both datasets as well as for a combination of the two.
+
+
+
+### PEYMA
+
+The PEYMA dataset includes 7,145 sentences with a total of 302,530 tokens, of which 41,148 are tagged with seven different classes.
+
+1. Organization
+2. Money
+3. Location
+4. Date
+5. Time
+6. Person
+7. Percent
+
+
+|     Label    |   #   |
+|:------------:|:-----:|
+| Organization | 16964 |
+|     Money    |  2037 |
+|   Location   |  8782 |
+|     Date     |  4259 |
+|     Time     |  732  |
+|    Person    |  7675 |
+|    Percent   |  699  |
+
+
+
+**Download**
+You can download the dataset from [here](http://nsurl.org/tasks/task-7-named-entity-recognition-ner-for-farsi/)
+
+---
+
+### ARMAN
+
+The ARMAN dataset holds 7,682 sentences with 250,015 tokens, tagged over six different classes.
+
+1. Organization
+2. Location
+3. Facility
+4. Event
+5. Product
+6. Person
+
+
+|     Label    |   #   |
+|:------------:|:-----:|
+| Organization | 30108 |
+|   Location   | 12924 |
+|   Facility   |  4458 |
+|     Event    |  7557 |
+|    Product   |  4389 |
+|    Person    | 15645 |
+
+
+
+**Download**
+You can download the dataset from [here](https://github.com/HaniehP/PersianNER)
+
+
+
+## Results
+
+The following table summarizes the F1 score obtained by ParsBERT as compared to other models and architectures.
+
+| Dataset         | ParsBERT | MorphoBERT |  Beheshti-NER  |  LSTM-CRF  |  Rule-Based CRF  |  BiLSTM-CRF  |
+|:---------------:|:--------:|:----------:|:--------------:|:----------:|:----------------:|:------------:|
+|  ARMAN + PEYMA  |   95.13* |      -     |        -       |      -     |         -        |       -      |
+|  PEYMA          |   98.79* |      -     |      90.59     |      -     |       84.00      |       -      |
+|  ARMAN          |   93.10* |    89.9    |      84.03     |    86.55   |         -        |     77.45    |
+
+
+## How to use :hugs:
+| Notebook     |      Description      |   |
+|:----------|:-------------|------:|
+| [How to use Pipelines](https://github.com/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb)  | Simple and efficient way to use State-of-the-Art models on downstream tasks through transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb) |
+
+
+## Cite 
+
+Please cite the following paper in your publication if you are using [ParsBERT](https://arxiv.org/abs/2005.12515) in your research:
+
+```markdown
+@article{ParsBERT,
+    title={ParsBERT: Transformer-based Model for Persian Language Understanding},
+    author={Mehrdad Farahani and Mohammad Gharachorloo and Marzieh Farahani and Mohammad Manthouri},
+    journal={ArXiv},
+    year={2020},
+    volume={abs/2005.12515}
+}
+```
+
+
+## Acknowledgments
+
+We hereby express our gratitude to the [Tensorflow Research Cloud (TFRC) program](https://tensorflow.org/tfrc) for providing us with the necessary computation resources. We also thank the [Hooshvare](https://hooshvare.com) Research Group for facilitating dataset gathering and scraping online text resources.
+
+
+## Contributors
+
+- Mehrdad Farahani: [Linkedin](https://www.linkedin.com/in/m3hrdadfi/), [Twitter](https://twitter.com/m3hrdadfi), [Github](https://github.com/m3hrdadfi)
+- Mohammad Gharachorloo:  [Linkedin](https://www.linkedin.com/in/mohammad-gharachorloo/), [Twitter](https://twitter.com/MGharachorloo), [Github](https://github.com/baarsaam)
+- Marzieh Farahani:  [Linkedin](https://www.linkedin.com/in/marziehphi/), [Twitter](https://twitter.com/marziehphi), [Github](https://github.com/marziehphi)
+- Mohammad Manthouri:  [Linkedin](https://www.linkedin.com/in/mohammad-manthouri-aka-mansouri-07030766/), [Twitter](https://twitter.com/mmanthouri), [Github](https://github.com/mmanthouri)
+- Hooshvare Team:  [Official Website](https://hooshvare.com/), [Linkedin](https://www.linkedin.com/company/hooshvare), [Twitter](https://twitter.com/hooshvare), [Github](https://github.com/hooshvare), [Instagram](https://www.instagram.com/hooshvare/)
+
+ And a special thanks to Sara Tabrizi for her fantastic poster design. Follow her on: [Linkedin](https://www.linkedin.com/in/sara-tabrizi-64548b79/), [Behance](https://www.behance.net/saratabrizi), [Instagram](https://www.instagram.com/sara_b_tabrizi/)
+
+## Releases
+
+### Release v0.1 (May 29, 2019)
+This is the first version of our ParsBERT NER!
--- a/model_cards/HooshvareLab/bert-base-parsbert-ner-uncased/README.md
+++ b/model_cards/HooshvareLab/bert-base-parsbert-ner-uncased/README.md
@@ -0,0 +1,124 @@
+## ParsBERT: Transformer-based Model for Persian Language Understanding
+
+ParsBERT is a monolingual language model based on Google’s BERT architecture with the same configurations as BERT-Base. 
+
+Paper presenting ParsBERT: [arXiv:2005.12515](https://arxiv.org/abs/2005.12515)
+
+All the models (downstream tasks) are uncased and trained with whole word masking. (Coming soon, stay tuned.)
+
+
+## Persian NER [ARMAN, PEYMA, ARMAN+PEYMA]
+
+This task aims to extract named entities from the text, such as names, and label them with appropriate `NER` classes such as locations, organizations, etc. The datasets used for this task contain sentences marked in the `IOB` format. In this format, tokens that are not part of an entity are tagged as `"O"`, the `"B"` tag corresponds to the first word of an entity, and the `"I"` tag corresponds to the remaining words of the same entity. Both `"B"` and `"I"` tags are followed by a hyphen (or underscore) and then the entity category. The NER task is therefore a multi-class token classification problem that labels each token of a raw input text. There are two primary datasets used in Persian NER, `ARMAN` and `PEYMA`. In ParsBERT, we prepared NER models for both datasets as well as for a combination of the two.
+
+
+
+### PEYMA
+
+The PEYMA dataset includes 7,145 sentences with a total of 302,530 tokens, of which 41,148 are tagged with seven different classes.
+
+1. Organization
+2. Money
+3. Location
+4. Date
+5. Time
+6. Person
+7. Percent
+
+
+|     Label    |   #   |
+|:------------:|:-----:|
+| Organization | 16964 |
+|     Money    |  2037 |
+|   Location   |  8782 |
+|     Date     |  4259 |
+|     Time     |  732  |
+|    Person    |  7675 |
+|    Percent   |  699  |
+
+
+
+**Download**
+You can download the dataset from [here](http://nsurl.org/tasks/task-7-named-entity-recognition-ner-for-farsi/)
+
+---
+
+### ARMAN
+
+The ARMAN dataset holds 7,682 sentences with 250,015 tokens, tagged over six different classes.
+
+1. Organization
+2. Location
+3. Facility
+4. Event
+5. Product
+6. Person
+
+
+|     Label    |   #   |
+|:------------:|:-----:|
+| Organization | 30108 |
+|   Location   | 12924 |
+|   Facility   |  4458 |
+|     Event    |  7557 |
+|    Product   |  4389 |
+|    Person    | 15645 |
+
+
+
+**Download**
+You can download the dataset from [here](https://github.com/HaniehP/PersianNER)
+
+
+
+## Results
+
+The following table summarizes the F1 score obtained by ParsBERT as compared to other models and architectures.
+
+| Dataset         | ParsBERT | MorphoBERT |  Beheshti-NER  |  LSTM-CRF  |  Rule-Based CRF  |  BiLSTM-CRF  |
+|:---------------:|:--------:|:----------:|:--------------:|:----------:|:----------------:|:------------:|
+|  ARMAN + PEYMA  |   95.13* |      -     |        -       |      -     |         -        |       -      |
+|  PEYMA          |   98.79* |      -     |      90.59     |      -     |       84.00      |       -      |
+|  ARMAN          |   93.10* |    89.9    |      84.03     |    86.55   |         -        |     77.45    |
+
+
+## How to use :hugs:
+| Notebook     |      Description      |   |
+|:----------|:-------------|------:|
+| [How to use Pipelines](https://github.com/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb)  | Simple and efficient way to use State-of-the-Art models on downstream tasks through transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb) |
+
+
+## Cite 
+
+Please cite the following paper in your publication if you are using [ParsBERT](https://arxiv.org/abs/2005.12515) in your research:
+
+```markdown
+@article{ParsBERT,
+    title={ParsBERT: Transformer-based Model for Persian Language Understanding},
+    author={Mehrdad Farahani and Mohammad Gharachorloo and Marzieh Farahani and Mohammad Manthouri},
+    journal={ArXiv},
+    year={2020},
+    volume={abs/2005.12515}
+}
+```
+
+
+## Acknowledgments
+
+We hereby express our gratitude to the [Tensorflow Research Cloud (TFRC) program](https://tensorflow.org/tfrc) for providing us with the necessary computation resources. We also thank the [Hooshvare](https://hooshvare.com) Research Group for facilitating dataset gathering and scraping online text resources.
+
+
+## Contributors
+
+- Mehrdad Farahani: [Linkedin](https://www.linkedin.com/in/m3hrdadfi/), [Twitter](https://twitter.com/m3hrdadfi), [Github](https://github.com/m3hrdadfi)
+- Mohammad Gharachorloo:  [Linkedin](https://www.linkedin.com/in/mohammad-gharachorloo/), [Twitter](https://twitter.com/MGharachorloo), [Github](https://github.com/baarsaam)
+- Marzieh Farahani:  [Linkedin](https://www.linkedin.com/in/marziehphi/), [Twitter](https://twitter.com/marziehphi), [Github](https://github.com/marziehphi)
+- Mohammad Manthouri:  [Linkedin](https://www.linkedin.com/in/mohammad-manthouri-aka-mansouri-07030766/), [Twitter](https://twitter.com/mmanthouri), [Github](https://github.com/mmanthouri)
+- Hooshvare Team:  [Official Website](https://hooshvare.com/), [Linkedin](https://www.linkedin.com/company/hooshvare), [Twitter](https://twitter.com/hooshvare), [Github](https://github.com/hooshvare), [Instagram](https://www.instagram.com/hooshvare/)
+
+ And a special thanks to Sara Tabrizi for her fantastic poster design. Follow her on: [Linkedin](https://www.linkedin.com/in/sara-tabrizi-64548b79/), [Behance](https://www.behance.net/saratabrizi), [Instagram](https://www.instagram.com/sara_b_tabrizi/)
+
+## Releases
+
+### Release v0.1 (May 29, 2019)
+This is the first version of our ParsBERT NER!
--- a/model_cards/HooshvareLab/bert-base-parsbert-peymaner-uncased/README.md
+++ b/model_cards/HooshvareLab/bert-base-parsbert-peymaner-uncased/README.md
@@ -0,0 +1,124 @@
+## ParsBERT: Transformer-based Model for Persian Language Understanding
+
+ParsBERT is a monolingual language model based on Google’s BERT architecture with the same configurations as BERT-Base. 
+
+Paper presenting ParsBERT: [arXiv:2005.12515](https://arxiv.org/abs/2005.12515)
+
+All the models (downstream tasks) are uncased and trained with whole word masking. (Coming soon, stay tuned.)
+
+
+## Persian NER [ARMAN, PEYMA, ARMAN+PEYMA]
+
+This task aims to extract named entities from the text, such as names, and to label them with the appropriate `NER` classes, such as locations, organizations, etc. The datasets used for this task contain sentences that are marked in the `IOB` format. In this format, tokens that are not part of an entity are tagged as `"O"`, the `"B"` tag corresponds to the first word of an entity, and the `"I"` tag corresponds to the remaining words of the same entity. Both `"B"` and `"I"` tags are followed by a hyphen (or underscore), followed by the entity category. Therefore, the NER task is a multi-class token classification problem that labels each token of a raw text fed to the model. There are two primary datasets used in Persian NER, `ARMAN` and `PEYMA`. For ParsBERT, we prepared NER models for both datasets as well as for a combination of the two.
+
+
+
+### PEYMA
+
+The PEYMA dataset includes 7,145 sentences with a total of 302,530 tokens, of which 41,148 tokens are tagged with seven different classes.
+
+1. Organization
+2. Money
+3. Location
+4. Date
+5. Time
+6. Person
+7. Percent
+
+
+|     Label    |   #   |
+|:------------:|:-----:|
+| Organization | 16964 |
+|     Money    |  2037 |
+|   Location   |  8782 |
+|     Date     |  4259 |
+|     Time     |  732  |
+|    Person    |  7675 |
+|    Percent   |  699  |
+
+
+
+**Download**
+You can download the dataset from [here](http://nsurl.org/tasks/task-7-named-entity-recognition-ner-for-farsi/)
+
+---
+
+### ARMAN
+
+The ARMAN dataset holds 7,682 sentences with 250,015 tokens, tagged over six different classes.
+
+1. Organization
+2. Location
+3. Facility
+4. Event
+5. Product
+6. Person
+
+
+|     Label    |   #   |
+|:------------:|:-----:|
+| Organization | 30108 |
+|   Location   | 12924 |
+|   Facility   |  4458 |
+|     Event    |  7557 |
+|    Product   |  4389 |
+|    Person    | 15645 |
+
+
+
+**Download**
+You can download the dataset from [here](https://github.com/HaniehP/PersianNER)
+
+
+
+## Results
+
+The following table summarizes the F1 score obtained by ParsBERT as compared to other models and architectures.
+
+| Dataset         | ParsBERT | MorphoBERT |  Beheshti-NER  |  LSTM-CRF  |  Rule-Based CRF  |  BiLSTM-CRF  |
+|:---------------:|:--------:|:----------:|:--------------:|:----------:|:----------------:|:------------:|
+|  ARMAN + PEYMA  |   95.13* |      -     |        -       |      -     |         -        |       -      |
+|  PEYMA          |   98.79* |      -     |      90.59     |      -     |       84.00      |       -      |
+|  ARMAN          |   93.10* |    89.9    |      84.03     |    86.55   |         -        |     77.45    |
+
+
+## How to use :hugs:
+| Notebook     |      Description      |   |
+|:----------|:-------------|------:|
+| [How to use Pipelines](https://github.com/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb)  | Simple and efficient way to use State-of-the-Art models on downstream tasks through transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb) |
+
+
+## Cite 
+
+Please cite the following paper in your publication if you are using [ParsBERT](https://arxiv.org/abs/2005.12515) in your research:
+
+```markdown
+@article{ParsBERT,
+    title={ParsBERT: Transformer-based Model for Persian Language Understanding},
+    author={Mehrdad Farahani and Mohammad Gharachorloo and Marzieh Farahani and Mohammad Manthouri},
+    journal={ArXiv},
+    year={2020},
+    volume={abs/2005.12515}
+}
+```
+
+
+## Acknowledgments
+
+We hereby express our gratitude to the [TensorFlow Research Cloud (TFRC) program](https://tensorflow.org/tfrc) for providing us with the necessary computation resources. We also thank [Hooshvare](https://hooshvare.com) Research Group for facilitating dataset gathering and scraping online text resources.
+
+
+## Contributors
+
+- Mehrdad Farahani: [Linkedin](https://www.linkedin.com/in/m3hrdadfi/), [Twitter](https://twitter.com/m3hrdadfi), [Github](https://github.com/m3hrdadfi)
+- Mohammad Gharachorloo:  [Linkedin](https://www.linkedin.com/in/mohammad-gharachorloo/), [Twitter](https://twitter.com/MGharachorloo), [Github](https://github.com/baarsaam)
+- Marzieh Farahani:  [Linkedin](https://www.linkedin.com/in/marziehphi/), [Twitter](https://twitter.com/marziehphi), [Github](https://github.com/marziehphi)
+- Mohammad Manthouri:  [Linkedin](https://www.linkedin.com/in/mohammad-manthouri-aka-mansouri-07030766/), [Twitter](https://twitter.com/mmanthouri), [Github](https://github.com/mmanthouri)
+- Hooshvare Team:  [Official Website](https://hooshvare.com/), [Linkedin](https://www.linkedin.com/company/hooshvare), [Twitter](https://twitter.com/hooshvare), [Github](https://github.com/hooshvare), [Instagram](https://www.instagram.com/hooshvare/)
+
+And special thanks to Sara Tabrizi for her fantastic poster design. Follow her on: [Linkedin](https://www.linkedin.com/in/sara-tabrizi-64548b79/), [Behance](https://www.behance.net/saratabrizi), [Instagram](https://www.instagram.com/sara_b_tabrizi/)
+
+## Releases
+
+### Release v0.1 (May 29, 2019)
+This is the first version of our ParsBERT NER!
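Note on the IOB format described in the PEYMA NER card above: a minimal sketch of how IOB tags can be grouped back into entities. The function name `iob_to_spans`, the example sentence, and the short category labels (`PER`, `LOC`) are assumptions made for illustration; the actual label strings used by the released models may differ.

```python
from typing import List, Optional, Tuple


def iob_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Group IOB-tagged tokens into (entity_text, entity_category) pairs."""
    spans: List[Tuple[str, Optional[str]]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None
    for token, tag in zip(tokens, tags):
        # "B-PER" or "I_PER": first character is the prefix, the category
        # follows the one-character separator. "O" has no category.
        prefix, category = tag[0], tag[2:]
        if prefix == "B" or (prefix == "I" and category != current_type):
            # A new entity starts here: flush the previous one, if any.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], category
        elif prefix == "I":
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O": the token is outside any entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Hypothetical example sentence and tags.
tokens = ["Ali", "Daei", "was", "born", "in", "Ardabil", "."]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC", "O"]
print(iob_to_spans(tokens, tags))
```

An `I` tag that does not continue an open entity of the same category is treated here as the start of a new entity; that is one common convention, not necessarily the one used in the ParsBERT evaluation.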
--- a/model_cards/HooshvareLab/bert-base-parsbert-uncased/README.md
+++ b/model_cards/HooshvareLab/bert-base-parsbert-uncased/README.md
@@ -0,0 +1,124 @@
+## ParsBERT: Transformer-based Model for Persian Language Understanding
+
+ParsBERT is a monolingual language model based on Google’s BERT architecture with the same configurations as BERT-Base. 
+
+Paper presenting ParsBERT: [arXiv:2005.12515](https://arxiv.org/abs/2005.12515)
+
+All the models (downstream tasks) are uncased and trained with whole word masking. (Coming soon, stay tuned.)
+
+
+---
+
+## Introduction
+
+This model is pre-trained on a large Persian corpus with various writing styles from numerous subjects (e.g., scientific, novels, news) with more than 2M documents. A large subset of this corpus was crawled manually.
+
+As part of the ParsBERT methodology, an extensive pre-processing combining POS tagging and WordPiece segmentation was carried out to bring the corpus into a proper format. This process produces more than 40M true sentences. 
+
+
+## Evaluation
+
+ParsBERT is evaluated on three NLP downstream tasks: Sentiment Analysis (SA), Text Classification, and Named Entity Recognition (NER). For this purpose, and because existing resources were insufficient, two large datasets for SA and two for text classification were manually composed; they are available for public use and benchmarking. ParsBERT outperformed all other language models, including multilingual BERT and other hybrid deep learning models for all tasks, improving the state-of-the-art performance in Persian language modeling.
+
+## Results
+
+The following table summarizes the F1 score obtained by ParsBERT as compared to other models and architectures.
+
+
+
+### Sentiment Analysis (SA) task
+
+|           Dataset          |  ParsBERT | mBERT | DeepSentiPers |
+|:--------------------------:|:---------:|:-----:|:-------------:|
+|   Digikala User Comments   |   81.74*  | 80.74 |       -       |
+|   SnappFood User Comments  |   88.12*  | 87.87 |       -       |
+|   SentiPers (Multi Class)  |   71.11*  |   -   |     69.33     |
+|  SentiPers (Binary Class)  |   92.13*  |   -   |     91.98     |
+
+
+
+### Text Classification (TC) task
+
+|      Dataset      | ParsBERT | mBERT |
+|:-----------------:|:--------:|:-----:|
+| Digikala Magazine |   93.59* | 90.72 |
+|    Persian News   |   97.19* | 95.79 |
+
+
+### Named Entity Recognition (NER) task
+
+| Dataset | ParsBERT |  mBERT   | MorphoBERT |  Beheshti-NER  |  LSTM-CRF  |  Rule-Based CRF  |  BiLSTM-CRF  |
+|:-------:|:--------:|:--------:|:----------:|:--------------:|:----------:|:----------------:|:------------:|
+|  PEYMA  |   93.10* |   86.64  |      -     |      90.59     |      -     |       84.00      |       -      |
+|  ARMAN  |   98.79* |   95.89  |    89.9    |      84.03     |    86.55   |         -        |     77.45    |
+
+
+**If you tested ParsBERT on a public dataset and you want to add your results to the table above, open a pull request or contact us. Also make sure to have your code available online so we can add it as a reference**
+
+## How to use
+
+### TensorFlow 2.0
+
+```python
+from transformers import AutoConfig, AutoTokenizer, TFAutoModel
+
+config = AutoConfig.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
+tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
+model = TFAutoModel.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
+
+text = "ما در هوشواره معتقدیم با انتقال صحیح دانش و آگاهی، همه افراد می‌توانند از ابزارهای هوشمند استفاده کنند. شعار ما هوش مصنوعی برای همه است."
+print(tokenizer.tokenize(text))
+
+# ['ما', 'در', 'هوش', '##واره', 'معتقدیم', 'با', 'انتقال', 'صحیح', 'دانش', 'و', 'اگاهی', '،', 'همه', 'افراد', 'میتوانند', 'از', 'ابزارهای', 'هوشمند', 'استفاده', 'کنند', '.', 'شعار', 'ما', 'هوش', 'مصنوعی', 'برای', 'همه', 'است', '.']
+
+```
+
+### PyTorch
+
+```python
+from transformers import AutoConfig, AutoTokenizer, AutoModel
+
+config = AutoConfig.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
+tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
+model = AutoModel.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
+```
+
+
+## NLP Tasks Tutorial 
+
+Coming soon, stay tuned.
+
+
+## Cite 
+
+Please cite the following paper in your publication if you are using [ParsBERT](https://arxiv.org/abs/2005.12515) in your research:
+
+```markdown
+@article{ParsBERT,
+    title={ParsBERT: Transformer-based Model for Persian Language Understanding},
+    author={Mehrdad Farahani and Mohammad Gharachorloo and Marzieh Farahani and Mohammad Manthouri},
+    journal={ArXiv},
+    year={2020},
+    volume={abs/2005.12515}
+}
+```
+
+
+## Acknowledgments
+
+We hereby express our gratitude to the [TensorFlow Research Cloud (TFRC) program](https://tensorflow.org/tfrc) for providing us with the necessary computation resources. We also thank [Hooshvare](https://hooshvare.com) Research Group for facilitating dataset gathering and scraping online text resources.
+
+
+## Contributors
+
+- Mehrdad Farahani: [Linkedin](https://www.linkedin.com/in/m3hrdadfi/), [Twitter](https://twitter.com/m3hrdadfi), [Github](https://github.com/m3hrdadfi)
+- Mohammad Gharachorloo:  [Linkedin](https://www.linkedin.com/in/mohammad-gharachorloo/), [Twitter](https://twitter.com/MGharachorloo), [Github](https://github.com/baarsaam)
+- Marzieh Farahani:  [Linkedin](https://www.linkedin.com/in/marziehphi/), [Twitter](https://twitter.com/marziehphi), [Github](https://github.com/marziehphi)
+- Mohammad Manthouri:  [Linkedin](https://www.linkedin.com/in/mohammad-manthouri-aka-mansouri-07030766/), [Twitter](https://twitter.com/mmanthouri), [Github](https://github.com/mmanthouri)
+- Hooshvare Team:  [Official Website](https://hooshvare.com/), [Linkedin](https://www.linkedin.com/company/hooshvare), [Twitter](https://twitter.com/hooshvare), [Github](https://github.com/hooshvare), [Instagram](https://www.instagram.com/hooshvare/)
+
+
+## Releases
+
+### Release v0.1 (May 27, 2019)
+This is the first version of our ParsBERT based on BERT<sub>BASE</sub>
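Note on the tokenizer output shown in the card above: the `##` prefix marks a WordPiece continuation, i.e. a piece that attaches to the preceding piece. A small sketch of how such pieces could be joined back into whole words follows; `merge_wordpieces` is a hypothetical helper written for this note, not a function of the `transformers` library, and the input is a shortened copy of the token list printed in the card.

```python
from typing import List


def merge_wordpieces(pieces: List[str]) -> List[str]:
    """Join WordPiece continuation pieces (prefixed with '##') onto the preceding piece."""
    words: List[str] = []
    for piece in pieces:
        if piece.startswith("##") and words:
            # Continuation piece: strip the marker and glue it to the last word.
            words[-1] += piece[2:]
        else:
            # A piece without the marker starts a new word.
            words.append(piece)
    return words


# First five pieces from the tokenizer output shown in the card.
pieces = ["ما", "در", "هوش", "##واره", "معتقدیم"]
print(merge_wordpieces(pieces))
```

This only undoes the subword split; it does not restore characters changed by the tokenizer's normalization.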
--- a/model_cards/LorenzoDeMattei/GePpeTto/README.md
+++ b/model_cards/LorenzoDeMattei/GePpeTto/README.md
@@ -59,56 +59,64 @@ tokenizer = GPT2Tokenizer.from_pretrained(
 ## Example using GPT2LMHeadModel

 ```python
-from transformers import GPT2Tokenizer, GPT2LMHeadModel
+from transformers import AutoTokenizer, AutoModelWithLMHead, pipeline, GPT2Tokenizer

-tokenizer = GPT2Tokenizer.from_pretrained('LorenzoDeMattei/GePpeTto')
-model = GPT2LMHeadModel.from_pretrained(
-    'LorenzoDeMattei/GePpeTto', pad_token_id = tokenizer.eos_token_id
+tokenizer = AutoTokenizer.from_pretrained("LorenzoDeMattei/GePpeTto")
+model = AutoModelWithLMHead.from_pretrained("LorenzoDeMattei/GePpeTto")
+
+text_generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
+prompts = [
+    "Wikipedia Geppetto",
+    "Maestro Ciliegia regala il pezzo di legno al suo amico Geppetto, il quale lo prende per fabbricarsi un burattino maraviglioso"]
+
+
+samples_outputs = text_generator(
+    prompts,
+    do_sample=True,
+    max_length=50,
+    top_k=50,
+    top_p=0.95,
+    num_return_sequences=3
 )

-input_ids = tokenizer.encode(
-    'Wikipedia Geppetto', return_tensors = 'pt'
-)
-sample_outputs = model.generate(
-    input_ids,
-    do_sample = True,
-    max_length = 50,
-    top_k = 50,
-    top_p = 0.95,
-    num_return_sequences = 3,
-)

-print('Output:\n' + 100 * '-')
-for i, sample_output in enumerate(sample_outputs):
-    print(
-        '{}: {}'.format(
-            i, tokenizer.decode(sample_output, skip_special_tokens = True)
-        )
-    )
+for i, sample_outputs in enumerate(samples_outputs):
+    print(100 * '-')
+    print("Prompt:", prompts[i])
+    for sample_output in sample_outputs:
+        print("Sample:", sample_output['generated_text'])
+        print()
+
 ```

 Output is,

-```text
-Output:
+```
 ----------------------------------------------------------------------------------------------------
-0: Wikipedia Geppetto
+Prompt: Wikipedia Geppetto
+Sample: Wikipedia Geppetto rosso (film 1920)

-Geppetto è una città degli Stati Uniti d'America, situata nello Stato dell'Iowa, nella Contea di Greene.
+Geppetto rosso ("The Smokes in the Black") è un film muto del 1920 diretto da Henry H. Leonard.

-Wikipedia The Sax
+Il film fu prodotto dalla Selig Poly

-The Sax è il primo album discografico
-2: Wikipedia Geppetto/Passione
+Sample: Wikipedia Geppetto

-Geppetto è il primo album in studio dei Saturday Night Live, pubblicato dalla Iron Maiden nel 1974.
+Geppetto ("Geppetto" in piemontese) è un comune italiano di 978 abitanti della provincia di Cuneo in Piemonte.

-L'album è un lavoro di debutto che lo porta a definire
-3: Wikipedia Geppetto
+L'abitato, che si trova nel versante valtellinese, si sviluppa nella

-Geppetto ("Fenëvëv" in calabrese) è un comune italiano di abitanti della regione Calabria.
+Sample: Wikipedia Geppetto di Natale (romanzo)

-Zona di particolare pregio storico-artistico, paesaggistico, storico-artistico,
+Geppetto di Natale è un romanzo di Mario Caiano, pubblicato nel 2012.
+
+----------------------------------------------------------------------------------------------------
+Prompt: Maestro Ciliegia regala il pezzo di legno al suo amico Geppetto, il quale lo prende per fabbricarsi un burattino maraviglioso
+Sample: Maestro Ciliegia regala il pezzo di legno al suo amico Geppetto, il quale lo prende per fabbricarsi un burattino maraviglioso. Il burattino riesce a scappare. Dopo aver trovato un prezioso sacchetto si reca
+
+Sample: Maestro Ciliegia regala il pezzo di legno al suo amico Geppetto, il quale lo prende per fabbricarsi un burattino maraviglioso, e l'unico che lo possiede, ma, di fronte a tutte queste prove
+
+Sample: Maestro Ciliegia regala il pezzo di legno al suo amico Geppetto, il quale lo prende per fabbricarsi un burattino maraviglioso: - A voi gli occhi, le guance! A voi il mio pezzo!
 ```

 ## Citation
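Note on the sampling parameters in the updated GePpeTto example (`top_k=50`, `top_p=0.95`): the sketch below illustrates the general idea of top-k followed by nucleus (top-p) filtering on a toy distribution. It is a simplified illustration that works on a dictionary of probabilities rather than on logits, and it is not the `transformers` implementation; the function name and the toy vocabulary are assumptions.

```python
from typing import Dict, List, Tuple


def top_k_top_p_filter(probs: Dict[str, float], top_k: int, top_p: float) -> Dict[str, float]:
    """Keep the top_k most likely tokens, then the smallest prefix whose mass reaches top_p."""
    # Step 1: top-k. Keep only the k most probable tokens and renormalize.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    total_k = sum(p for _, p in ranked)
    ranked = [(token, p / total_k) for token, p in ranked]

    # Step 2: top-p. Walk down the ranking until the cumulative mass reaches top_p.
    kept: List[Tuple[str, float]] = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Step 3: renormalize the survivors so they form a distribution again.
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Toy next-token distribution (hypothetical values).
probs = {"il": 0.5, "un": 0.3, "la": 0.15, "zebra": 0.05}
filtered = top_k_top_p_filter(probs, top_k=3, top_p=0.8)
print(filtered)
```

Sampling then draws the next token from the filtered distribution, which is why unlikely continuations are cut off while the output still varies between the three returned sequences.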
--- a/model_cards/Tereveni-AI/gpt2-124M-uk-fiction/README.md
+++ b/model_cards/Tereveni-AI/gpt2-124M-uk-fiction/README.md
@@ -0,0 +1,39 @@
+---
+language: ukrainian
+---
+
+Note: **default code snippet above won't work** because we are using `AlbertTokenizer` with `GPT2LMHeadModel`, see [issue](https://github.com/huggingface/transformers/issues/4285).
+
+## GPT2 124M Trained on Ukrainian Fiction
+
+### Training details
+
+The model was trained on a corpus of 4040 fiction books, 2.77 GiB in total.
+Evaluation on [brown-uk](https://github.com/brown-uk/corpus) gives a perplexity of 50.16.
+
+### Example usage:
+```python
+from transformers import AlbertTokenizer, GPT2LMHeadModel
+
+tokenizer = AlbertTokenizer.from_pretrained("Tereveni-AI/gpt2-124M-uk-fiction")
+model = GPT2LMHeadModel.from_pretrained("Tereveni-AI/gpt2-124M-uk-fiction")
+
+input_ids = tokenizer.encode("Но зла Юнона, суча дочка,", add_special_tokens=False, return_tensors='pt')
+
+outputs = model.generate(
+    input_ids,
+    do_sample=True,
+    num_return_sequences=3,
+    max_length=50
+)
+
+for i, out in enumerate(outputs):
+    print("{}: {}".format(i, tokenizer.decode(out)))
+```
+
+Prints something like this:
+```text
+0: Но зла Юнона, суча дочка, яка затьмарила всі її таємниці: І хто з'їсть її душу, той помре». І, не дочекавшись гніву богів, посунула в пітьму, щоб не бачити перед собою. Але, за
+1: Но зла Юнона, суча дочка, і довела мене до божевілля. Але він не знав нічого. Після того як я його побачив, мені стало зле. Я втратив рівновагу. Але в мене не було часу на роздуми. Я вже втратив надію
+2: Но зла Юнона, суча дочка, не нарікала нам! — раптом вигукнула Юнона. — Це ти, старий йолопе! — мовила вона, не перестаючи сміятись. — Хіба ти не знаєш, що мені подобається ходити з тобою?
+```
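Note on the perplexity figure quoted in the card above: perplexity is the exponential of the average negative log-likelihood per token. The sketch below shows that formula on made-up numbers. The card does not state how tokens were counted or how the evaluation was run, so this is only an illustration of the metric, not a reproduction of the reported 50.16.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Hypothetical per-token probabilities, not taken from the model.
log_probs = [math.log(p) for p in (0.1, 0.02, 0.05)]
print(round(perplexity(log_probs), 2))
```

A model that assigned probability 1/50 to every token would have a perplexity of 50, which gives a rough intuition for the reported value.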
--- a/model_cards/ViktorAlm/electra-base-norwegian-uncased-discriminator/README.md
+++ b/model_cards/ViktorAlm/electra-base-norwegian-uncased-discriminator/README.md
@@ -0,0 +1,25 @@
+---
+language: norwegian
+thumbnail: https://i.imgur.com/QqSEC5I.png
+---
+
+# Norwegian Electra
+![Image of norwegian electra](https://i.imgur.com/QqSEC5I.png)
+
+Trained on OSCAR + Wikipedia + OpenSubtitles + some other data I had, with the awesome power of TPUs (v3-8).
+
+Use with caution. I have no downstream tasks in Norwegian to test on so I have no idea of its performance yet.
+# Model
+## Electra: Pre-training Text Encoders as Discriminators Rather Than Generators
+Kevin Clark and Minh-Thang Luong and Quoc V. Le and Christopher D. Manning
+- https://openreview.net/pdf?id=r1xMH1BtvB
+- https://github.com/google-research/electra
+# Acknowledgments
+### TensorFlow Research Cloud
+Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC). Thanks for providing access to the TFRC ❤️
+- https://www.tensorflow.org/tfrc
+#### OSCAR corpus
+- https://oscar-corpus.com/
+#### OPUS
+- http://opus.nlpl.eu/
+- http://www.opensubtitles.org/
--- a/model_cards/activebus/BERT-DK_laptop/README.md
+++ b/model_cards/activebus/BERT-DK_laptop/README.md
@@ -0,0 +1,43 @@
+# ReviewBERT
+
+BERT (post-)trained on a review corpus to understand sentiment, options and various e-commerce aspects.  
+
+`BERT-DK_laptop` is trained on a 100MB laptop corpus under `Electronics/Computers & Accessories/Laptops`. 
+
+
+## Model Description
+
+The original model is from `BERT-base-uncased` trained from Wikipedia+BookCorpus.  
+Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).  
+
+`BERT-DK_laptop` is trained on a 100MB laptop corpus under `Electronics/Computers & Accessories/Laptops`. 
+
+## Instructions
+Loading the post-trained weights is as simple as, e.g., 
+
+```python
+import torch
+from transformers import AutoModel, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("activebus/BERT-DK_laptop")
+model = AutoModel.from_pretrained("activebus/BERT-DK_laptop")
+
+```
+
+
+## Evaluation Results
+
+Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf) 
+
+
+## Citation
+If you find this work useful, please cite it as follows.
+```
+@inproceedings{xu_bert2019,
+    title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
+    author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
+    booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
+    month = "jun",
+    year = "2019",
+}
+```
--- a/model_cards/activebus/BERT-DK_rest/README.md
+++ b/model_cards/activebus/BERT-DK_rest/README.md
@@ -0,0 +1,41 @@
+# ReviewBERT
+
+BERT (post-)trained on a review corpus to understand sentiment, options and various e-commerce aspects.
+
+`BERT-DK_rest` is trained on 1G of restaurant reviews (19 types of restaurants) from Yelp.  
+
+## Model Description
+
+The original model is from `BERT-base-uncased` trained from Wikipedia+BookCorpus.  
+Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).  
+
+
+## Instructions
+Loading the post-trained weights is as simple as, e.g., 
+
+```python
+import torch
+from transformers import AutoModel, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("activebus/BERT-DK_rest")
+model = AutoModel.from_pretrained("activebus/BERT-DK_rest")
+
+```
+
+
+## Evaluation Results
+
+Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf) 
+
+
+## Citation
+If you find this work useful, please cite it as follows.
+```
+@inproceedings{xu_bert2019,
+    title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
+    author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
+    booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
+    month = "jun",
+    year = "2019",
+}
+```
--- a/model_cards/activebus/BERT-PT_laptop/README.md
+++ b/model_cards/activebus/BERT-PT_laptop/README.md
@@ -0,0 +1,41 @@
+# ReviewBERT
+
+BERT (post-)trained on a review corpus to understand sentiment, options and various e-commerce aspects.  
+
+`BERT-DK_laptop` is trained on a 100MB laptop corpus under `Electronics/Computers & Accessories/Laptops`. 
+`BERT-PT_*` additionally uses SQuAD 1.1.  
+
+## Model Description
+
+The original model is from `BERT-base-uncased` trained from Wikipedia+BookCorpus.  
+Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).  
+
+
+## Instructions
+Loading the post-trained weights is as simple as, e.g., 
+
+```python
+import torch
+from transformers import AutoModel, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("activebus/BERT-PT_laptop")
+model = AutoModel.from_pretrained("activebus/BERT-PT_laptop")
+
+```
+
+## Evaluation Results
+
+Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf) 
+
+
+## Citation
+If you find this work useful, please cite it as follows.
+```
+@inproceedings{xu_bert2019,
+    title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
+    author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
+    booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
+    month = "jun",
+    year = "2019",
+}
+```
--- a/model_cards/activebus/BERT-PT_rest/README.md
+++ b/model_cards/activebus/BERT-PT_rest/README.md
@@ -0,0 +1,42 @@
+# ReviewBERT
+
+BERT (post-)trained on a review corpus to understand sentiment, options and various e-commerce aspects.  
+
+`BERT-DK_rest` is trained on 1G of restaurant reviews (19 types of restaurants) from Yelp.
+`BERT-PT_*` additionally uses SQuAD 1.1.  
+
+## Model Description
+
+The original model is from `BERT-base-uncased` trained from Wikipedia+BookCorpus.  
+Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).  
+
+
+## Instructions
+Loading the post-trained weights is as simple as, e.g., 
+
+```python
+import torch
+from transformers import AutoModel, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("activebus/BERT-PT_rest")
+model = AutoModel.from_pretrained("activebus/BERT-PT_rest")
+
+```
+
+
+## Evaluation Results
+
+Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf) 
+
+
+## Citation
+If you find this work useful, please cite it as follows.
+```
+@inproceedings{xu_bert2019,
+    title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
+    author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
+    booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
+    month = "jun",
+    year = "2019",
+}
+```
--- a/model_cards/activebus/BERT-XD_Review/README.md
+++ b/model_cards/activebus/BERT-XD_Review/README.md
@@ -0,0 +1,44 @@
+# ReviewBERT
+
+BERT (post-)trained on a review corpus to understand sentiment, options and various e-commerce aspects.  
+Please visit https://github.com/howardhsu/BERT-for-RRC-ABSA for details.  
+
+`BERT-XD_Review` is a cross-domain (beyond just `laptop` and `restaurant`) language model, where each example is from a single product / restaurant with the same rating, post-trained (fine-tuned) on a combination of 5-core Amazon reviews and all Yelp data, expected to be 22 G in total. It is trained for 4 epochs on `bert-base-uncased`.
+The preprocessing code is [here](https://github.com/howardhsu/BERT-for-RRC-ABSA/transformers).
+
+## Model Description
+
+The original model is from `BERT-base-uncased`.  
+Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).  
+
+
+## Instructions
+Loading the post-trained weights is as simple as, e.g., 
+
+```python
+import torch
+from transformers import AutoModel, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("activebus/BERT-XD_Review")
+model = AutoModel.from_pretrained("activebus/BERT-XD_Review")
+
+```
+
+
+## Evaluation Results
+
+Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf) 
+`BERT_Review` is expected to have similar performance on domain-specific tasks (such as aspect extraction) as `BERT-DK`, but much better on general tasks such as aspect sentiment classification (different domains mostly share similar sentiment words).
+
+
+## Citation
+If you find this work useful, please cite it as follows.
+```
+@inproceedings{xu_bert2019,
+    title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
+    author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
+    booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
+    month = "jun",
+    year = "2019",
+}
+```
--- a/model_cards/activebus/BERT_Review/README.md
+++ b/model_cards/activebus/BERT_Review/README.md
@@ -0,0 +1,44 @@
+# ReviewBERT
+
+BERT (post-)trained on a review corpus to understand sentiment, options and various e-commerce aspects.  
+
+`BERT_Review` is a cross-domain (beyond just `laptop` and `restaurant`) language model with one example from randomly mixed domains, post-trained (fine-tuned) on a combination of 5-core Amazon reviews and all Yelp data, expected to be 22 G in total. It is trained for 4 epochs on `bert-base-uncased`.
+The preprocessing code is [here](https://github.com/howardhsu/BERT-for-RRC-ABSA/transformers).
+
+
+## Model Description
+
+The original model is from `BERT-base-uncased` trained from Wikipedia+BookCorpus.  
+Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).  
+
+
+## Instructions
+Loading the post-trained weights is as simple as, e.g., 
+
+```python
+import torch
+from transformers import AutoModel, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("activebus/BERT_Review")
+model = AutoModel.from_pretrained("activebus/BERT_Review")
+
+```
+
+
+## Evaluation Results
+
+Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf) 
+`BERT_Review` is expected to have similar performance on domain-specific tasks (such as aspect extraction) as `BERT-DK`, but much better on general tasks such as aspect sentiment classification (different domains mostly share similar sentiment words).
+
+
+## Citation
+If you find this work useful, please cite it as follows.
+```
+@inproceedings{xu_bert2019,
+    title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
+    author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
+    booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
+    month = "jun",
+    year = "2019",
+}
+```
--- a/model_cards/allegro/herbert-klej-cased-tokenizer-v1/README.md
+++ b/model_cards/allegro/herbert-klej-cased-tokenizer-v1/README.md
@@ -0,0 +1,44 @@
+---
+language: polish
+---
+
+# HerBERT tokenizer
+
+**[HerBERT](https://en.wikipedia.org/wiki/Zbigniew_Herbert)** tokenizer is a character-level byte-pair encoding tokenizer with a
+vocabulary size of 50k tokens. The tokenizer was trained on [Wolne Lektury](https://wolnelektury.pl/) and a publicly available subset of
+the [National Corpus of Polish](http://nkjp.pl/index.php?page=14&lang=0) with the [fastBPE](https://github.com/glample/fastBPE) library.
+The tokenizer uses the `XLMTokenizer` implementation from [transformers](https://github.com/huggingface/transformers).
+
+## Tokenizer usage
+The HerBERT tokenizer should be used together with the [HerBERT model](https://huggingface.co/allegro/herbert-klej-cased-v1):
+```python
+from transformers import XLMTokenizer, RobertaModel
+
+tokenizer = XLMTokenizer.from_pretrained("allegro/herbert-klej-cased-tokenizer-v1")
+model = RobertaModel.from_pretrained("allegro/herbert-klej-cased-v1")
+
+encoded_input = tokenizer.encode("Kto ma lepszą sztukę, ma lepszy rząd – to jasne.", return_tensors='pt')
+outputs = model(encoded_input)
+```
+
+## License
+CC BY-SA 4.0
+
+## Citation
+If you use this tokenizer, please cite the following paper:
+```
+@misc{rybak2020klej,
+    title={KLEJ: Comprehensive Benchmark for Polish Language Understanding},
+    author={Piotr Rybak and Robert Mroczkowski and Janusz Tracz and Ireneusz Gawlik},
+    year={2020},
+    eprint={2005.00630},
+    archivePrefix={arXiv},
+    primaryClass={cs.CL}
+}
+```
+The paper has been accepted at ACL 2020; as soon as the proceedings appear, we will update the BibTeX.
+
+## Authors
+The tokenizer was created by the **Allegro Machine Learning Research** team.
+
+You can contact us at: <a href="mailto:klejbenchmark@allegro.pl">klejbenchmark@allegro.pl</a>
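Note on the byte-pair encoding mentioned in the tokenizer card above: the sketch below shows one BPE training step on a toy vocabulary, counting adjacent symbol pairs and merging the most frequent one. It illustrates the general algorithm only; it is not the fastBPE implementation, and the function names and word counts are assumptions made for this note.

```python
from collections import Counter
from typing import Dict, List, Tuple

Word = Tuple[str, ...]


def most_frequent_pair(vocab: Dict[Word, int]) -> Tuple[str, str]:
    """Return the adjacent symbol pair with the highest total frequency."""
    pairs: Counter = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get)


def merge_pair(vocab: Dict[Word, int], pair: Tuple[str, str]) -> Dict[Word, int]:
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged: Dict[Word, int] = {}
    for symbols, freq in vocab.items():
        out: List[str] = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


# Toy word counts (hypothetical); each word starts as a sequence of characters.
vocab = {tuple("kot"): 5, tuple("kos"): 3, tuple("rok"): 2}
pair = most_frequent_pair(vocab)
vocab = merge_pair(vocab, pair)
print(pair, vocab)
```

Repeating this step until the symbol inventory reaches the target size (50k tokens in the card) yields the merge table that the tokenizer applies at inference time.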
--- a/model_cards/allegro/herbert-klej-cased-v1/README.md
+++ b/model_cards/allegro/herbert-klej-cased-v1/README.md
@@ -0,0 +1,85 @@
+---
+language: polish
+---
+
+# HerBERT 
+**[HerBERT](https://en.wikipedia.org/wiki/Zbigniew_Herbert)** is a BERT-based language model trained on Polish corpora
+using only the MLM objective with dynamic masking of whole words. For more details, please refer to: 
+[KLEJ: Comprehensive Benchmark for Polish Language Understanding](https://arxiv.org/abs/2005.00630).
+
+## Dataset
+The **HerBERT** training dataset is a combination of several publicly available corpora for the Polish language:
+
+| Corpus | Tokens | Texts |
+| :------ | ------: | ------: |
+| [OSCAR](https://traces1.inria.fr/oscar/)| 6710M  | 145M |
+| [Open Subtitles](http://opus.nlpl.eu/OpenSubtitles-v2018.php) | 1084M  | 1.1M |
+| [Wikipedia](https://dumps.wikimedia.org/) | 260M  | 1.5M |
+| [Wolne Lektury](https://wolnelektury.pl/) | 41M  | 5.5k |
+| [Allegro Articles](https://allegro.pl/artykuly) | 18M  | 33k |
+
+## Tokenizer
+The training dataset was tokenized into subwords using the [HerBERT Tokenizer](https://huggingface.co/allegro/herbert-klej-cased-tokenizer-v1), a character-level byte-pair encoding with
+a vocabulary size of 50k tokens. The tokenizer itself was trained on [Wolne Lektury](https://wolnelektury.pl/) and a publicly available subset of the
+[National Corpus of Polish](http://nkjp.pl/index.php?page=14&lang=0) with the [fastBPE](https://github.com/glample/fastBPE) library.
+
+The tokenizer uses the `XLMTokenizer` implementation; for that reason, one should load it as `allegro/herbert-klej-cased-tokenizer-v1`.
+
+## HerBERT models summary
+| Model | WWM | Cased | Tokenizer | Vocab Size  | Batch Size | Train Steps |
+| :------ | ------: | ------: | ------: | ------: | ------: | ------: |
+| herbert-klej-cased-v1 | YES | YES | BPE | 50K | 570 | 180k | 
+
+## Model evaluation
+HerBERT was evaluated on the [KLEJ](https://klejbenchmark.com/) benchmark, a publicly available set of nine evaluation tasks for Polish language understanding.
+It had the best average performance and obtained the best results on three of the tasks.
+
+| Model | Average | NKJP-NER | CDSC-E | CDSC-R | CBD | PolEmo2.0-IN | PolEmo2.0-OUT | DYK | PSC | AR |
+| :------ | ------: | ------: | ------: | ------: | ------: | ------: | ------: | ------: | ------: | ------: |
+| herbert-klej-cased-v1 | **80.5** | 92.7 | 92.5 | 91.9 | **50.3** | **89.2** | **76.3** | 52.1 | 95.3 | 84.5 |
+
+The full leaderboard is available [online](https://klejbenchmark.com/leaderboard).
+
+
+## HerBERT usage
+Model training and experiments were conducted with [transformers](https://github.com/huggingface/transformers) in version 2.0.
+
+Example code:
+```python
+from transformers import XLMTokenizer, RobertaModel
+
+tokenizer = XLMTokenizer.from_pretrained("allegro/herbert-klej-cased-tokenizer-v1")
+model = RobertaModel.from_pretrained("allegro/herbert-klej-cased-v1")
+
+encoded_input = tokenizer.encode("Kto ma lepszą sztukę, ma lepszy rząd – to jasne.", return_tensors='pt')
+outputs = model(encoded_input)
+```
+
+HerBERT can also be loaded using `AutoTokenizer` and `AutoModel`:
+```python
+from transformers import AutoModel, AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-klej-cased-tokenizer-v1")
+model = AutoModel.from_pretrained("allegro/herbert-klej-cased-v1")
+```
+
+## License
+CC BY-SA 4.0
+
+## Citation
+If you use this model, please cite the following paper:
+```
+@misc{rybak2020klej,
+    title={KLEJ: Comprehensive Benchmark for Polish Language Understanding},
+    author={Piotr Rybak and Robert Mroczkowski and Janusz Tracz and Ireneusz Gawlik},
+    year={2020},
+    eprint={2005.00630},
+    archivePrefix={arXiv},
+    primaryClass={cs.CL}
+}
+```
+The paper has been accepted at ACL 2020; we will update the BibTeX as soon as the proceedings appear.
+
+## Authors
+The model was trained by the **Allegro Machine Learning Research** team.
+
+You can contact us at: <a href="mailto:klejbenchmark@allegro.pl">klejbenchmark@allegro.pl</a>
--- a/model_cards/allenai/longformer-base-4096-extra.pos.embd.only/README.md
+++ b/model_cards/allenai/longformer-base-4096-extra.pos.embd.only/README.md
@@ -0,0 +1,20 @@
+
+# longformer-base-4096-extra.pos.embd.only
+
+This model is similar to `longformer-base-4096`, but it was pretrained with all RoBERTa weights frozen, training only the additional position embeddings, so the original RoBERTa weights are preserved.
+
+
+### Citing
+
+If you use `Longformer` in your research, please cite [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150).
+```
+@article{Beltagy2020Longformer,
+  title={Longformer: The Long-Document Transformer},
+  author={Iz Beltagy and Matthew E. Peters and Arman Cohan},
+  journal={arXiv:2004.05150},
+  year={2020},
+}
+```
+
+`Longformer` is an open-source project developed by [the Allen Institute for Artificial Intelligence (AI2)](http://www.allenai.org).
+AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.
--- a/model_cards/allenai/longformer-base-4096/README.md
+++ b/model_cards/allenai/longformer-base-4096/README.md
@@ -0,0 +1,24 @@
+
+# longformer-base-4096
+[Longformer](https://arxiv.org/abs/2004.05150) is a transformer model for long documents. 
+
+`longformer-base-4096` is a BERT-like model started from the RoBERTa checkpoint and pretrained for MLM on long documents. It supports sequences of length up to 4,096. 
+ 
+Longformer uses a combination of sliding window (local) attention and global attention. Global attention is user-configured based on the task to allow the model to learn task-specific representations.
+Please refer to the examples in `modeling_longformer.py` and the paper for more details on how to set global attention.
+
+
+### Citing
+
+If you use `Longformer` in your research, please cite [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150).
+```
+@article{Beltagy2020Longformer,
+  title={Longformer: The Long-Document Transformer},
+  author={Iz Beltagy and Matthew E. Peters and Arman Cohan},
+  journal={arXiv:2004.05150},
+  year={2020},
+}
+```
+
+`Longformer` is an open-source project developed by [the Allen Institute for Artificial Intelligence (AI2)](http://www.allenai.org).
+AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.
--- a/model_cards/asafaya/bert-base-arabic/README.md
+++ b/model_cards/asafaya/bert-base-arabic/README.md
@@ -6,6 +6,17 @@ language: arabic

 Pretrained BERT base language model for Arabic

+_If you use this model in your work, please cite this paper (to appear in 2020):_
+
+```
+@inproceedings{
+  title={KUISAIL at SemEval-2020 Task 12: BERT-CNN for Offensive Speech Identification in Social Media},
+  author={Safaya, Ali and Abdullatif, Moutasem and Yuret, Deniz},
+  booktitle={Proceedings of the International Workshop on Semantic Evaluation (SemEval)},
+  year={2020}
+}
+```
+
 ## Pretraining Corpus

 `arabic-bert-base` model was pretrained on ~8.2 Billion words:
--- a/model_cards/aubmindlab/bert-base-arabert/README.md
+++ b/model_cards/aubmindlab/bert-base-arabert/README.md
@@ -3,8 +3,9 @@ language: arabic
 ---

 # AraBERT : Pre-training BERT for Arabic Language Understanding
+<img src="https://github.com/aub-mind/arabert/blob/master/arabert_logo.png" width="100" align="left"/>  

-**AraBERT** is an Arabic pretrained lanaguage model based on [Google's BERT architechture](https://github.com/google-research/bert). AraBERT uses the same BERT-Base config.
+**AraBERT** is an Arabic pretrained language model based on [Google's BERT architecture](https://github.com/google-research/bert). AraBERT uses the same BERT-Base config. More details are available in the [AraBERT paper](https://arxiv.org/abs/2003.00104v2) and in the [AraBERT Meetup](https://github.com/WissamAntoun/pydata_khobar_meetup).

 There are two version off the model AraBERTv0.1 and AraBERTv1, with the difference being that AraBERTv1 uses pre-segmented text where prefixes and suffixes were splitted using the [Farasa Segmenter](http://alt.qcri.org/farasa/segmenter.html).

@@ -12,28 +13,34 @@ The model was trained on ~70M sentences or ~23GB of Arabic text with ~3B words.

 We evalaute both AraBERT models on different downstream tasks and compare it to [mBERT]((https://github.com/google-research/bert/blob/master/multilingual.md)), and other state of the art models (*To the extent of our knowledge*). The Tasks were Sentiment Analysis on 6 different datasets ([HARD](https://github.com/elnagara/HARD-Arabic-Dataset), [ASTD-Balanced](https://www.aclweb.org/anthology/D15-1299), [ArsenTD-Lev](https://staff.aub.edu.lb/~we07/Publications/ArSentD-LEV_Sentiment_Corpus.pdf), [LABR](https://github.com/mohamedadaly/LABR), [ArSaS](http://lrec-conf.org/workshops/lrec2018/W30/pdf/22_W30.pdf)), Named Entity Recognition with the [ANERcorp](http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp), and Arabic Question Answering on [Arabic-SQuAD and ARCD](https://github.com/husseinmozannar/SOQAL)

+**Update 2 (21/5/2020) :**
+Added support for the [farasapy](https://github.com/MagedSaeed/farasapy) segmenter in ``preprocess_arabert.py``; it is ~6x faster than ``py4j.java_gateway``. Consider setting ``use_farasapy=True`` when calling ``preprocess`` and passing it an instance of ``FarasaSegmenter(interactive=True)`` for faster segmentation.
+
+**Update 1 (21/4/2020) :** 
+Fixed an issue with ARCD fine-tuning; the fix drastically improved performance. Initially we did not account for the change in ```answer_start``` during preprocessing.
 ## Results (Acc.)
 Task | prev. SOTA | mBERT | AraBERTv0.1 | AraBERTv1
 ---|:---:|:---:|:---:|:---:
-HARD |95.7 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)|95.7|96.2|96.1
-ASTD |86.5 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)| 80.1|92.2|92.6
-ArsenTD-Lev|52.4 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)|51|58.9|59.4
-AJGT|93 [Dahou et.al.](https://dl.acm.org/doi/fullHtml/10.1145/3314941)| 83.6|94.1|93.8
-LABR|87.5 [Dahou et.al.](https://dl.acm.org/doi/fullHtml/10.1145/3314941)|83|85.9|86.7
-ANERcorp|81.7 (BiLSTM-CRF)|78.4|84.2|81.9
-ARCD|mBERT|EM:34.2 F1: 61.3|EM:30.1 F1:61.2|EM:30.6 F1: 62.7
+HARD |95.7 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)|95.7|**96.2**|96.1
+ASTD |86.5 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)| 80.1|92.2|**92.6**
+ArsenTD-Lev|52.4 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)|51|58.9|**59.4**
+AJGT|93 [Dahou et.al.](https://dl.acm.org/doi/fullHtml/10.1145/3314941)| 83.6|93.1|**93.8**
+LABR|**87.5** [Dahou et.al.](https://dl.acm.org/doi/fullHtml/10.1145/3314941)|83|85.9|86.7
+ANERcorp|81.7 (BiLSTM-CRF)|78.4|**84.2**|81.9
+ARCD|mBERT|EM:34.2 F1: 61.3|EM:51.14 F1:82.13|**EM:54.84 F1: 82.15**

-*We would be extremly thankful if everyone can contibute to the Results table by adding more scores on different datasets*
+*If you tested AraBERT on a public dataset and want to add your results to the table above, open a pull request or contact us. Please also make your code available online so we can add it as a reference.*

 ## How to use

-You can easily use AraBERT since it is almost fully compatible with existing codebases (You can use this repo instead of the official BERT one, the only difference is in the ```tokenization.py``` file where we modify the _is_punctuation function to make it compatible with the "+" symbol and the "[" and "]" characters)
-
-To use HuggingFace's Transformer repository you only need to provide a lost of token that forces the model to not split them, also make sure that the text is pre-segmented:
+You can easily use AraBERT since it is almost fully compatible with existing codebases (use this repo instead of the official BERT one; the only difference is in the ```tokenization.py``` file, where we modify the `_is_punctuation` function to make it compatible with the "+" symbol and the "[" and "]" characters).

+To use HuggingFace's Transformers library, you only need to provide a list of tokens that the tokenizer must never split; also make sure that the text is pre-segmented:
+**Not all libraries built on top of transformers support the `never_split` argument**
 ```python
-from transformers import AutoTokenizer
-from preprocess_arabert import never_split_tokens
+from transformers import AutoTokenizer, AutoModel
+from arabert.preprocess_arabert import never_split_tokens, preprocess
+from farasa.segmenter import FarasaSegmenter

 arabert_tokenizer = AutoTokenizer.from_pretrained(
    "aubmindlab/bert-base-arabert",
@@ -42,27 +49,75 @@ arabert_tokenizer = AutoTokenizer.from_pretrained(
    never_split=never_split_tokens)
 arabert_model = AutoModel.from_pretrained("aubmindlab/bert-base-arabert")

-arabert_tokenizer.tokenize("و+ لن نبالغ إذا قل +نا إن هاتف أو كمبيوتر ال+ مكتب في زمن +نا هذا ضروري")
+#Preprocess the text to make it compatible with AraBERT using farasapy
+farasa_segmenter = FarasaSegmenter(interactive=True)
+
+#or you can use a py4j JavaGateway to the Farasa Segmenter .jar, but it's slower
+#(see update 2)
+#from py4j.java_gateway import JavaGateway
+#gateway = JavaGateway.launch_gateway(classpath='./PATH_TO_FARASA/FarasaSegmenterJar.jar')
+#farasa = gateway.jvm.com.qcri.farasa.segmenter.Farasa()
+
+text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
+text_preprocessed = preprocess( text,
+                                do_farasa_tokenization = True,
+                                farasa = farasa_segmenter,
+                                use_farasapy = True)
+
+>>>text_preprocessed: "و+ لن نبالغ إذا قل +نا إن هاتف أو كمبيوتر ال+ مكتب في زمن +نا هذا ضروري"
+
+arabert_tokenizer.tokenize(text_preprocessed)

 >>> ['و+', 'لن', 'نبال', '##غ', 'إذا', 'قل', '+نا', 'إن', 'هاتف', 'أو', 'كمبيوتر', 'ال+', 'مكتب', 'في', 'زمن', '+نا', 'هذا', 'ضروري']
 ```

 **AraBERTv0.1 is compatible with all existing libraries, since it needs no pre-segmentation.**
 ```python
-from transformers import AutoTokenizer
-from preprocess_arabert import never_split_tokens
+from transformers import AutoTokenizer, AutoModel

 arabert_tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv01",do_lower_case=False)
 arabert_model = AutoModel.from_pretrained("aubmindlab/bert-base-arabertv01")

-arabert_tokenizer.tokenize("ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري")
+text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
+arabert_tokenizer.tokenize(text)

 >>> ['ولن', 'ن', '##بالغ', 'إذا', 'قلنا', 'إن', 'هاتف', 'أو', 'كمبيوتر', 'المكتب', 'في', 'زمن', '##ن', '##ا', 'هذا', 'ضروري']
 ```


-The ```araBERT_(initial_Demo_TF)_.ipynb``` Notebook is a small demo using the AJGT dataset using TensorFlow (GPU and TPU compatible).
+The ```araBERT_(Updated_Demo_TF).ipynb``` notebook is a small TensorFlow demo on the AJGT dataset (GPU and TPU compatible).

+**Coming Soon :** Fine-tuning demo using HuggingFace's Trainer API
+
+**AraBERT on ARCD**
+During the preprocessing step, the ```answer_start``` character position needs to be recalculated. You can use the file ```arcd_preprocessing.py``` as shown below to clean and preprocess the ARCD dataset before running ```run_squad.py```. A more detailed Colab notebook is available in the [SOQAL repo](https://github.com/husseinmozannar/SOQAL).
+```bash
+python arcd_preprocessing.py \
+    --input_file="/PATH_TO/arcd-test.json" \
+    --output_file="arcd-test-pre.json" \
+    --do_farasa_tokenization=True \
+    --use_farasapy=True
+```
+```bash
+python SOQAL/bert/run_squad.py \
+  --vocab_file="/PATH_TO_PRETRAINED_TF_CKPT/vocab.txt" \
+  --bert_config_file="/PATH_TO_PRETRAINED_TF_CKPT/config.json" \
+  --init_checkpoint="/PATH_TO_PRETRAINED_TF_CKPT/" \
+  --do_train=True \
+  --train_file=turk_combined_all_pre.json \
+  --do_predict=True \
+  --predict_file=arcd-test-pre.json \
+  --train_batch_size=32 \
+  --predict_batch_size=24 \
+  --learning_rate=3e-5 \
+  --num_train_epochs=4 \
+  --max_seq_length=384 \
+  --doc_stride=128 \
+  --do_lower_case=False \
+  --output_dir="/PATH_TO/OUTPUT_PATH/" \
+  --use_tpu=True \
+  --tpu_name=$TPU_ADDRESS
+```
 ## Model Weights and Vocab Download
 Models | AraBERTv0.1 | AraBERTv1
 ---|:---:|:---:
@@ -73,21 +128,17 @@ PyTorch| [Drive_Link](https://drive.google.com/open?id=1-_3te42mQCPD8SxwZ3l-VBL7

 ## If you used this model please cite us as:
 ```
-@misc{antoun2020arabert,
-    title={AraBERT: Transformer-based Model for Arabic Language Understanding},
-    author={Wissam Antoun and Fady Baly and Hazem Hajj},
-    year={2020},
-    eprint={2003.00104},
-    archivePrefix={arXiv},
-    primaryClass={cs.CL}
+@inproceedings{antoun2020arabert,
+  title={AraBERT: Transformer-based Model for Arabic Language Understanding},
+  author={Antoun, Wissam and Baly, Fady and Hajj, Hazem},
+  booktitle={LREC 2020 Workshop Language Resources and Evaluation Conference 11--16 May 2020},
+  pages={9}
 }
 ```
 ## Acknowledgments 
-Thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs, couldn't have done it without this program, and to the [AUB MIND Lab](https://sites.aub.edu.lb/mindlab/) Members for the continous support. Also thanks to [Yakshof](https://www.yakshof.com/#/) and Assafir for data and storage access.
+Thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs (we couldn't have done it without this program) and to the [AUB MIND Lab](https://sites.aub.edu.lb/mindlab/) members for their continuous support. Thanks also to [Yakshof](https://www.yakshof.com/#/) and Assafir for data and storage access, and to [Habib Rahal](https://www.behance.net/rahalhabib) for putting a face to AraBERT.

 ## Contacts
 **Wissam Antoun**: [Linkedin](https://www.linkedin.com/in/giulio-ravasio-3a81a9110/) | [Twitter](https://twitter.com/wissam_antoun) | [Github](https://github.com/WissamAntoun) | <wfa07@mail.aub.edu> | <wissam.antoun@gmail.com>

-**Fady Baly**: [Linkedin](https://www.linkedin.com/in/fadybaly/) | [Twitter](https://twitter.com/BalyFady) | [Github](https://github.com/fadybaly) | <fgb06@mail.aub.edu> | <baly.fady@gmail.com>
-
-***We are looking for sponsors to train BERT-Large and other Transformer models, the sponsor only needs to cover to data storage and compute cost of the generating the pretraining data***
+**Fady Baly**: [Linkedin](https://www.linkedin.com/in/fadybaly/) | [Twitter](https://twitter.com/fadybaly) | [Github](https://github.com/fadybaly) | <fgb06@mail.aub.edu> | <baly.fady@gmail.com>
--- a/model_cards/aubmindlab/bert-base-arabertv01/README.md
+++ b/model_cards/aubmindlab/bert-base-arabertv01/README.md
@@ -3,8 +3,9 @@ language: arabic
 ---

 # AraBERT : Pre-training BERT for Arabic Language Understanding
+<img src="https://github.com/aub-mind/arabert/blob/master/arabert_logo.png" width="100" align="left"/>  

-**AraBERT** is an Arabic pretrained lanaguage model based on [Google's BERT architechture](https://github.com/google-research/bert). AraBERT uses the same BERT-Base config.
+**AraBERT** is an Arabic pretrained language model based on [Google's BERT architecture](https://github.com/google-research/bert). AraBERT uses the same BERT-Base config. More details are available in the [AraBERT paper](https://arxiv.org/abs/2003.00104v2) and in the [AraBERT Meetup](https://github.com/WissamAntoun/pydata_khobar_meetup).

 There are two version off the model AraBERTv0.1 and AraBERTv1, with the difference being that AraBERTv1 uses pre-segmented text where prefixes and suffixes were splitted using the [Farasa Segmenter](http://alt.qcri.org/farasa/segmenter.html).

@@ -12,28 +13,34 @@ The model was trained on ~70M sentences or ~23GB of Arabic text with ~3B words.

 We evalaute both AraBERT models on different downstream tasks and compare it to [mBERT]((https://github.com/google-research/bert/blob/master/multilingual.md)), and other state of the art models (*To the extent of our knowledge*). The Tasks were Sentiment Analysis on 6 different datasets ([HARD](https://github.com/elnagara/HARD-Arabic-Dataset), [ASTD-Balanced](https://www.aclweb.org/anthology/D15-1299), [ArsenTD-Lev](https://staff.aub.edu.lb/~we07/Publications/ArSentD-LEV_Sentiment_Corpus.pdf), [LABR](https://github.com/mohamedadaly/LABR), [ArSaS](http://lrec-conf.org/workshops/lrec2018/W30/pdf/22_W30.pdf)), Named Entity Recognition with the [ANERcorp](http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp), and Arabic Question Answering on [Arabic-SQuAD and ARCD](https://github.com/husseinmozannar/SOQAL)

+**Update 2 (21/5/2020) :**
+Added support for the [farasapy](https://github.com/MagedSaeed/farasapy) segmenter in ``preprocess_arabert.py``; it is ~6x faster than ``py4j.java_gateway``. Consider setting ``use_farasapy=True`` when calling ``preprocess`` and passing it an instance of ``FarasaSegmenter(interactive=True)`` for faster segmentation.
+
+**Update 1 (21/4/2020) :** 
+Fixed an issue with ARCD fine-tuning; the fix drastically improved performance. Initially we did not account for the change in ```answer_start``` during preprocessing.
 ## Results (Acc.)
 Task | prev. SOTA | mBERT | AraBERTv0.1 | AraBERTv1
 ---|:---:|:---:|:---:|:---:
-HARD |95.7 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)|95.7|96.2|96.1
-ASTD |86.5 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)| 80.1|92.2|92.6
-ArsenTD-Lev|52.4 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)|51|58.9|59.4
-AJGT|93 [Dahou et.al.](https://dl.acm.org/doi/fullHtml/10.1145/3314941)| 83.6|94.1|93.8
-LABR|87.5 [Dahou et.al.](https://dl.acm.org/doi/fullHtml/10.1145/3314941)|83|85.9|86.7
-ANERcorp|81.7 (BiLSTM-CRF)|78.4|84.2|81.9
-ARCD|mBERT|EM:34.2 F1: 61.3|EM:30.1 F1:61.2|EM:30.6 F1: 62.7
+HARD |95.7 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)|95.7|**96.2**|96.1
+ASTD |86.5 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)| 80.1|92.2|**92.6**
+ArsenTD-Lev|52.4 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)|51|58.9|**59.4**
+AJGT|93 [Dahou et.al.](https://dl.acm.org/doi/fullHtml/10.1145/3314941)| 83.6|93.1|**93.8**
+LABR|**87.5** [Dahou et.al.](https://dl.acm.org/doi/fullHtml/10.1145/3314941)|83|85.9|86.7
+ANERcorp|81.7 (BiLSTM-CRF)|78.4|**84.2**|81.9
+ARCD|mBERT|EM:34.2 F1: 61.3|EM:51.14 F1:82.13|**EM:54.84 F1: 82.15**

-*We would be extremly thankful if everyone can contibute to the Results table by adding more scores on different datasets*
+*If you tested AraBERT on a public dataset and want to add your results to the table above, open a pull request or contact us. Please also make your code available online so we can add it as a reference.*

 ## How to use

-You can easily use AraBERT since it is almost fully compatible with existing codebases (You can use this repo instead of the official BERT one, the only difference is in the ```tokenization.py``` file where we modify the _is_punctuation function to make it compatible with the "+" symbol and the "[" and "]" characters)
-
-To use HuggingFace's Transformer repository you only need to provide a lost of token that forces the model to not split them, also make sure that the text is pre-segmented:
+You can easily use AraBERT since it is almost fully compatible with existing codebases (use this repo instead of the official BERT one; the only difference is in the ```tokenization.py``` file, where we modify the `_is_punctuation` function to make it compatible with the "+" symbol and the "[" and "]" characters).

+To use HuggingFace's Transformers library, you only need to provide a list of tokens that the tokenizer must never split; also make sure that the text is pre-segmented:
+**Not all libraries built on top of transformers support the `never_split` argument**
 ```python
-from transformers import AutoTokenizer
-from preprocess_arabert import never_split_tokens
+from transformers import AutoTokenizer, AutoModel
+from arabert.preprocess_arabert import never_split_tokens, preprocess
+from farasa.segmenter import FarasaSegmenter

 arabert_tokenizer = AutoTokenizer.from_pretrained(
    "aubmindlab/bert-base-arabert",
@@ -42,27 +49,75 @@ arabert_tokenizer = AutoTokenizer.from_pretrained(
    never_split=never_split_tokens)
 arabert_model = AutoModel.from_pretrained("aubmindlab/bert-base-arabert")

-arabert_tokenizer.tokenize("و+ لن نبالغ إذا قل +نا إن هاتف أو كمبيوتر ال+ مكتب في زمن +نا هذا ضروري")
+#Preprocess the text to make it compatible with AraBERT using farasapy
+farasa_segmenter = FarasaSegmenter(interactive=True)
+
+#or you can use a py4j JavaGateway to the Farasa Segmenter .jar, but it's slower
+#(see update 2)
+#from py4j.java_gateway import JavaGateway
+#gateway = JavaGateway.launch_gateway(classpath='./PATH_TO_FARASA/FarasaSegmenterJar.jar')
+#farasa = gateway.jvm.com.qcri.farasa.segmenter.Farasa()
+
+text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
+text_preprocessed = preprocess( text,
+                                do_farasa_tokenization = True,
+                                farasa = farasa_segmenter,
+                                use_farasapy = True)
+
+>>>text_preprocessed: "و+ لن نبالغ إذا قل +نا إن هاتف أو كمبيوتر ال+ مكتب في زمن +نا هذا ضروري"
+
+arabert_tokenizer.tokenize(text_preprocessed)

 >>> ['و+', 'لن', 'نبال', '##غ', 'إذا', 'قل', '+نا', 'إن', 'هاتف', 'أو', 'كمبيوتر', 'ال+', 'مكتب', 'في', 'زمن', '+نا', 'هذا', 'ضروري']
 ```

 **AraBERTv0.1 is compatible with all existing libraries, since it needs no pre-segmentation.**
 ```python
-from transformers import AutoTokenizer
-from preprocess_arabert import never_split_tokens
+from transformers import AutoTokenizer, AutoModel

 arabert_tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv01",do_lower_case=False)
 arabert_model = AutoModel.from_pretrained("aubmindlab/bert-base-arabertv01")

-arabert_tokenizer.tokenize("ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري")
+text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
+arabert_tokenizer.tokenize(text)

 >>> ['ولن', 'ن', '##بالغ', 'إذا', 'قلنا', 'إن', 'هاتف', 'أو', 'كمبيوتر', 'المكتب', 'في', 'زمن', '##ن', '##ا', 'هذا', 'ضروري']
 ```


-The ```araBERT_(initial_Demo_TF)_.ipynb``` Notebook is a small demo using the AJGT dataset using TensorFlow (GPU and TPU compatible).
+The ```araBERT_(Updated_Demo_TF).ipynb``` notebook is a small TensorFlow demo on the AJGT dataset (GPU and TPU compatible).

+**Coming Soon :** Fine-tuning demo using HuggingFace's Trainer API
+
+**AraBERT on ARCD**
+During the preprocessing step, the ```answer_start``` character position needs to be recalculated. You can use the file ```arcd_preprocessing.py``` as shown below to clean and preprocess the ARCD dataset before running ```run_squad.py```. A more detailed Colab notebook is available in the [SOQAL repo](https://github.com/husseinmozannar/SOQAL).
+```bash
+python arcd_preprocessing.py \
+    --input_file="/PATH_TO/arcd-test.json" \
+    --output_file="arcd-test-pre.json" \
+    --do_farasa_tokenization=True \
+    --use_farasapy=True
+```
+```bash
+python SOQAL/bert/run_squad.py \
+  --vocab_file="/PATH_TO_PRETRAINED_TF_CKPT/vocab.txt" \
+  --bert_config_file="/PATH_TO_PRETRAINED_TF_CKPT/config.json" \
+  --init_checkpoint="/PATH_TO_PRETRAINED_TF_CKPT/" \
+  --do_train=True \
+  --train_file=turk_combined_all_pre.json \
+  --do_predict=True \
+  --predict_file=arcd-test-pre.json \
+  --train_batch_size=32 \
+  --predict_batch_size=24 \
+  --learning_rate=3e-5 \
+  --num_train_epochs=4 \
+  --max_seq_length=384 \
+  --doc_stride=128 \
+  --do_lower_case=False \
+  --output_dir="/PATH_TO/OUTPUT_PATH/" \
+  --use_tpu=True \
+  --tpu_name=$TPU_ADDRESS
+```
 ## Model Weights and Vocab Download
 Models | AraBERTv0.1 | AraBERTv1
 ---|:---:|:---:
@@ -73,21 +128,17 @@ PyTorch| [Drive_Link](https://drive.google.com/open?id=1-_3te42mQCPD8SxwZ3l-VBL7

 ## If you used this model please cite us as:
 ```
-@misc{antoun2020arabert,
-    title={AraBERT: Transformer-based Model for Arabic Language Understanding},
-    author={Wissam Antoun and Fady Baly and Hazem Hajj},
-    year={2020},
-    eprint={2003.00104},
-    archivePrefix={arXiv},
-    primaryClass={cs.CL}
+@inproceedings{antoun2020arabert,
+  title={AraBERT: Transformer-based Model for Arabic Language Understanding},
+  author={Antoun, Wissam and Baly, Fady and Hajj, Hazem},
+  booktitle={LREC 2020 Workshop Language Resources and Evaluation Conference 11--16 May 2020},
+  pages={9}
 }
 ```
 ## Acknowledgments 
-Thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs, couldn't have done it without this program, and to the [AUB MIND Lab](https://sites.aub.edu.lb/mindlab/) Members for the continous support. Also thanks to [Yakshof](https://www.yakshof.com/#/) and Assafir for data and storage access.
+Thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs (we couldn't have done it without this program) and to the [AUB MIND Lab](https://sites.aub.edu.lb/mindlab/) members for their continuous support. Thanks also to [Yakshof](https://www.yakshof.com/#/) and Assafir for data and storage access, and to [Habib Rahal](https://www.behance.net/rahalhabib) for putting a face to AraBERT.

 ## Contacts
 **Wissam Antoun**: [Linkedin](https://www.linkedin.com/in/giulio-ravasio-3a81a9110/) | [Twitter](https://twitter.com/wissam_antoun) | [Github](https://github.com/WissamAntoun) | <wfa07@mail.aub.edu> | <wissam.antoun@gmail.com>

-**Fady Baly**: [Linkedin](https://www.linkedin.com/in/fadybaly/) | [Twitter](https://twitter.com/BalyFady) | [Github](https://github.com/fadybaly) | <fgb06@mail.aub.edu> | <baly.fady@gmail.com>
-
-***We are looking for sponsors to train BERT-Large and other Transformer models, the sponsor only needs to cover to data storage and compute cost of the generating the pretraining data***
+**Fady Baly**: [Linkedin](https://www.linkedin.com/in/fadybaly/) | [Twitter](https://twitter.com/fadybaly) | [Github](https://github.com/fadybaly) | <fgb06@mail.aub.edu> | <baly.fady@gmail.com>
--- a/model_cards/bayartsogt/albert-mongolian/README.md
+++ b/model_cards/bayartsogt/albert-mongolian/README.md
@@ -0,0 +1,55 @@
+# ALBERT-Mongolian
+[pretraining repo link](https://github.com/bayartsogt-ya/albert-mongolian)
+## Model description
+Here we provide a pretrained ALBERT model and a trained SentencePiece model for Mongolian text. The training data consists of the Mongolian Wikipedia corpus from Wikipedia Downloads and a Mongolian news corpus.
+
+## Evaluation Result:
+```
+loss = 1.7478163
+masked_lm_accuracy = 0.6838185
+masked_lm_loss = 1.6687671
+sentence_order_accuracy = 0.998125
+sentence_order_loss = 0.007942731
+```
+
+## Fine-tuning Result on Eduge Dataset:
+```
+                precision    recall  f1-score   support
+
+  байгал орчин       0.83      0.76      0.80       483
+     боловсрол       0.79      0.75      0.77       420
+         спорт       0.98      0.96      0.97      1391
+     технологи       0.85      0.83      0.84       543
+       улс төр       0.88      0.87      0.87      1336
+    урлаг соёл       0.89      0.94      0.91       726
+         хууль       0.87      0.83      0.85       840
+   эдийн засаг       0.80      0.84      0.82      1265
+    эрүүл мэнд       0.84      0.90      0.87       562
+
+      accuracy                           0.87      7566
+     macro avg       0.86      0.85      0.86      7566
+  weighted avg       0.87      0.87      0.87      7566
+```
+
+## Reference
+1. [ALBERT - official repo](https://github.com/google-research/albert)
+2. [WikiExtractor](https://github.com/attardi/wikiextractor)
+3. [Mongolian BERT](https://github.com/tugstugi/mongolian-bert)
+4. [ALBERT - Japanese](https://github.com/alinear-corp/albert-japanese)
+5. [Mongolian Text Classification](https://github.com/sharavsambuu/mongolian-text-classification)
+6. [You's paper](https://arxiv.org/abs/1904.00962)
+
+## Citation
+```
+@misc{albert-mongolian,
+  author = {Bayartsogt Yadamsuren},
+  title = {ALBERT Pretrained Model on Mongolian Datasets},
+  year = {2020},
+  publisher = {GitHub},
+  journal = {GitHub repository},
+  howpublished = {\url{https://github.com/bayartsogt-ya/albert-mongolian/}}
+}
+```
+
+## For More Information
+Please contact bayartsogtyadamsuren@icloud.com.
--- a/model_cards/bert-base-german-cased-README.md
+++ b/model_cards/bert-base-german-cased-README.md
@@ -18,13 +18,16 @@ tags:
 **Eval data:** Conll03 (NER), GermEval14 (NER), GermEval18 (Classification), GNAD (Classification)  
 **Infrastructure**: 1x TPU v2  
 **Published**: Jun 14th, 2019
+
+**Update April 3rd, 2020**: we updated the vocabulary file on deepset's s3 to conform with the default tokenization of punctuation tokens. 
+For details see the related [FARM issue](https://github.com/deepset-ai/FARM/issues/60). If you want to use the old vocab we have also uploaded a ["deepset/bert-base-german-cased-oldvocab"](https://huggingface.co/deepset/bert-base-german-cased-oldvocab) model.
 
 ## Details
 - We trained using Google's Tensorflow code on a single cloud TPU v2 with standard settings.
 - We trained 810k steps with a batch size of 1024 for sequence length 128 and 30k steps with sequence length 512. Training took about 9 days.
 - As training data we used the latest German Wikipedia dump (6GB of raw txt files), the OpenLegalData dump (2.4 GB) and news articles (3.6 GB).
 - We cleaned the data dumps with tailored scripts and segmented sentences with spacy v2.1. To create tensorflow records we used the recommended sentencepiece library for creating the word piece vocabulary and tensorflow scripts to convert the text to data usable by BERT.
- Update April 3rd, 2020: updated the vocab file on deepset s3 to adjust tokenization of punctuation.
+

 See https://deepset.ai/german-bert for more details
