Fix CI after killing archive maps (#4724 )

* 🐛 Fix model ids for BART and Flaubert
Release: v2.11.0
2020-06-02 10:21:09 -04:00 · 2020-06-02 09:49:09 -04:00 · 2020-06-02 09:39:33 -04:00 · 2020-06-02 11:03:46 +02:00 · 2020-06-02 11:02:27 +02:00 · 2020-06-02 04:29:28 -04:00
260 changed files with 13487 additions and 4002 deletions
--- a/.github/workflows/github-torch-hub.yml
+++ b/.github/workflows/github-torch-hub.yml
@@ -21,7 +21,7 @@ jobs:
    - name: Install dependencies
      run: |
        pip install torch
-        pip install numpy tokenizers filelock requests tqdm regex sentencepiece sacremoses
+        pip install numpy tokenizers filelock requests tqdm regex sentencepiece sacremoses packaging

    - name: Torch hub list
      run: |
--- a/.github/workflows/self-push.yml
+++ b/.github/workflows/self-push.yml
@@ -35,7 +35,7 @@ jobs:
    - name: Install dependencies
      run: |
        source .env/bin/activate
-        pip install torch==1.4.0
+        pip install torch
        pip install .[sklearn,testing]

    - name: Are GPUs recognized by our DL frameworks
--- a/.github/workflows/self-scheduled.yml
+++ b/.github/workflows/self-scheduled.yml
@@ -31,13 +31,12 @@ jobs:
    - name: Install dependencies
      run: |
        source .env/bin/activate
-        pip install .[sklearn,tf,torch,testing]
+        pip install .[sklearn,torch,testing]

    - name: Are GPUs recognized by our DL frameworks
      run: |
        source .env/bin/activate
        python -c "import torch; print(torch.cuda.is_available())"
-        python -c "import tensorflow as tf; print(tf.test.is_built_with_cuda(), tf.config.list_physical_devices('GPU'))"

    - name: Run all tests on GPU
      env:
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -44,9 +44,16 @@ Did not find it? :( So we can act quickly on it, please follow these steps:
 To get the OS and software versions automatically, you can run the following command:

 ```bash
-python transformers-cli env
+transformers-cli env
 ```

+or from the root of the repository the following command:
+
+```bash
+python src/transformers/commands/transformers_cli.py env
+```
+
+
 ### Do you want to implement a new model?

 Awesome! Please provide the following information:
@@ -198,11 +205,12 @@ Follow these steps to start contributing:
   are useful to avoid duplicated work, and to differentiate it from PRs ready
   to be merged;
 4. Make sure existing tests pass;
-5. Add high-coverage tests. No quality test, no merge. 
+5. Add high-coverage tests. No quality testing = no merge. 
 - If you are adding a new model, make sure that you use `ModelTester.all_model_classes = (MyModel, MyModelWithLMHead,...)`, which triggers the common tests.
 - If you are adding new `@slow` tests, make sure they pass using `RUN_SLOW=1 python -m pytest tests/test_my_new_model.py`. 
+ - If you are adding a new tokenizer, write tests, and make sure `RUN_SLOW=1 python -m pytest tests/test_tokenization_{your_model_name}.py` passes.
 CircleCI does not run them. 
-6. All public methods must have informative docstrings;
+6. All public methods must have informative docstrings that work nicely with sphinx. See `modeling_ctrl.py` for an example.

 ### Tests

--- a/README.md
+++ b/README.md
@@ -63,7 +63,7 @@ Choose the right framework for every part of a model's lifetime

 ## Installation

-This repo is tested on Python 3.6+, PyTorch 1.0.0+ and TensorFlow 2.0.
+This repo is tested on Python 3.6+, PyTorch 1.0.0+ (PyTorch 1.3.1+ for examples) and TensorFlow 2.0.

 You should install 🤗 Transformers in a [virtual environment](https://docs.python.org/3/library/venv.html). If you're unfamiliar with Python virtual environments, check out the [user guide](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/).

@@ -165,8 +165,9 @@ At some point in the future, you'll be able to seamlessly move from pre-training
 18. **[DialoGPT](https://huggingface.co/transformers/model_doc/dialogpt.html)** (from Microsoft Research) released with the paper [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://arxiv.org/abs/1911.00536) by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
 19. **[Reformer](https://huggingface.co/transformers/model_doc/reformer.html)** (from Google Research) released with the paper [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
 20. **[MarianMT](https://huggingface.co/transformers/model_doc/marian.html)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team.
-21. **[Other community models](https://huggingface.co/models)**, contributed by the [community](https://huggingface.co/users).
-22. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedbacks before starting your PR.
+21. **[Longformer](https://huggingface.co/transformers/model_doc/longformer.html)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan.
+22. **[Other community models](https://huggingface.co/models)**, contributed by the [community](https://huggingface.co/users).
+23. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedbacks before starting your PR.

 These implementations have been tested on several datasets (see the example scripts) and should match the performances of the original implementations (e.g. ~93 F1 on SQuAD for BERT Whole-Word-Masking, ~88 F1 on RocStories for OpenAI GPT, ~18.3 perplexity on WikiText 103 for Transformer-XL, ~0.916 Peason R coefficient on STS-B for XLNet). You can find more details on the performances in the Examples section of the [documentation](https://huggingface.co/transformers/examples.html).

@@ -339,8 +340,8 @@ python ./examples/text-classification/run_glue.py \
    --do_eval \
    --data_dir $GLUE_DIR/$TASK_NAME \
    --max_seq_length 128 \
-    --per_gpu_eval_batch_size=8   \
-    --per_gpu_train_batch_size=8   \
+    --per_device_eval_batch_size=8   \
+    --per_device_train_batch_size=8   \
    --learning_rate 2e-5 \
    --num_train_epochs 3.0 \
    --output_dir /tmp/$TASK_NAME/
@@ -366,8 +367,8 @@ python ./examples/text-classification/run_glue.py \
    --data_dir=${GLUE_DIR}/STS-B  \
    --output_dir=./proc_data/sts-b-110   \
    --max_seq_length=128   \
-    --per_gpu_eval_batch_size=8   \
-    --per_gpu_train_batch_size=8   \
+    --per_device_eval_batch_size=8   \
+    --per_device_train_batch_size=8   \
    --gradient_accumulation_steps=1 \
    --max_steps=1200  \
    --model_name=xlnet-large-cased   \
@@ -390,8 +391,8 @@ python -m torch.distributed.launch --nproc_per_node 8 ./examples/text-classifica
    --do_eval   \
    --data_dir $GLUE_DIR/MRPC/   \
    --max_seq_length 128   \
-    --per_gpu_eval_batch_size=8   \
-    --per_gpu_train_batch_size=8   \
+    --per_device_eval_batch_size=8   \
+    --per_device_train_batch_size=8   \
    --learning_rate 2e-5   \
    --num_train_epochs 3.0  \
    --output_dir /tmp/mrpc_output/ \
@@ -427,8 +428,8 @@ python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answer
    --max_seq_length 384 \
    --doc_stride 128 \
    --output_dir ../models/wwm_uncased_finetuned_squad/ \
-    --per_gpu_eval_batch_size=3   \
-    --per_gpu_train_batch_size=3   \
+    --per_device_eval_batch_size=3   \
+    --per_device_train_batch_size=3   \
 ```

 Training with these hyper-parameters gave us the following results:
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -26,7 +26,7 @@ author = u'huggingface'
 # The short X.Y version
 version = u''
 # The full version, including alpha/beta/rc tags
-release = u'2.9.1'
+release = u'2.11.0'


 # -- General configuration ---------------------------------------------------
--- a/docs/source/examples.md
+++ b/docs/source/examples.md
@@ -1,649 +0,0 @@
-# Examples
-
-In this section a few examples are put together. All of these examples work for several models, making use of the very
-similar API between the different models.
-
-**Important**
-To run the latest versions of the examples, you have to install from source and install some specific requirements for the examples.
-Execute the following steps in a new virtual environment:
-
-```bash
-git clone https://github.com/huggingface/transformers
-cd transformers
-pip install .
-pip install -r ./examples/requirements.txt
-```
-
-| Section                    | Description                                                                                                                                                |
-|----------------------------|------------------------------------------------------------------------------------------------------------------------------------------
-| [TensorFlow 2.0 models on GLUE](#TensorFlow-2.0-Bert-models-on-GLUE) | Examples running BERT TensorFlow 2.0 model on the GLUE tasks. |
-| [Running on TPUs](#running-on-tpus) | Examples on running fine-tuning tasks on Google TPUs to accelerate workloads. |
-| [Language Model training](#language-model-training) | Fine-tuning (or training from scratch) the library models for language modeling on a text dataset. Causal language modeling for GPT/GPT-2, masked language modeling for BERT/RoBERTa. |
-| [Language Generation](#language-generation) | Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL and XLNet. |
-| [GLUE](#glue) | Examples running BERT/XLM/XLNet/RoBERTa on the 9 GLUE tasks. Examples feature distributed training as well as half-precision. |
-| [SQuAD](#squad) | Using BERT/RoBERTa/XLNet/XLM for question answering, examples with distributed training. |
-| [Multiple Choice](#multiple-choice) | Examples running BERT/XLNet/RoBERTa on the SWAG/RACE/ARC tasks. |
-| [Named Entity Recognition](https://github.com/huggingface/transformers/tree/master/examples/token-classification) | Using BERT for Named Entity Recognition (NER) on the CoNLL 2003 dataset, examples with distributed training. |
-| [XNLI](#xnli) | Examples running BERT/XLM on the XNLI benchmark. |
-| [Adversarial evaluation of model performances](#adversarial-evaluation-of-model-performances) | Testing a model with adversarial evaluation of natural language inference on the Heuristic Analysis for NLI Systems (HANS) dataset (McCoy et al., 2019.) |
-
-## TensorFlow 2.0 Bert models on GLUE
-
-Based on the script [`run_tf_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_tf_glue.py).
-
-Fine-tuning the library TensorFlow 2.0 Bert model for sequence classification on the  MRPC task of the GLUE benchmark: [General Language Understanding Evaluation](https://gluebenchmark.com/).
-
-This script has an option for mixed precision (Automatic Mixed Precision / AMP) to run models on Tensor Cores (NVIDIA Volta/Turing GPUs) and future hardware and an option for XLA, which uses the XLA compiler to reduce model runtime.
-Options are toggled using `USE_XLA` or `USE_AMP` variables in the script.
-These options and the below benchmark are provided by @tlkh.
-
-Quick benchmarks from the script (no other modifications):
-
-| GPU    | Mode | Time (2nd epoch) | Val Acc (3 runs) |
-| --------- | -------- | ----------------------- | ----------------------|
-| Titan V | FP32 | 41s | 0.8438/0.8281/0.8333 |
-| Titan V | AMP | 26s | 0.8281/0.8568/0.8411 |
-| V100    | FP32 | 35s | 0.8646/0.8359/0.8464 |
-| V100    | AMP | 22s | 0.8646/0.8385/0.8411 |
-| 1080 Ti | FP32 | 55s | - |
-
-Mixed precision (AMP) reduces the training time considerably for the same hardware and hyper-parameters (same batch size was used).
-
-## Running on TPUs
-
-You can accelerate your workloads on Google's TPUs. For information on how to setup your TPU environment refer to this
-[README](https://github.com/pytorch/xla/blob/master/README.md).
-
-The following are some examples of running the `*_tpu.py` finetuning scripts on TPUs. All steps for data preparation are
-identical to your normal GPU + Huggingface setup.
-
-### GLUE
-
-Before running anyone of these GLUE tasks you should download the
-[GLUE data](https://gluebenchmark.com/tasks) by running
-[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
-and unpack it to some directory `$GLUE_DIR`.
-
-For running your GLUE task on MNLI dataset you can run something like the following:
-
-```
-export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"
-export GLUE_DIR=/path/to/glue
-export TASK_NAME=MNLI
-
-python run_glue_tpu.py \
-  --model_type bert \
-  --model_name_or_path bert-base-cased \
-  --task_name $TASK_NAME \
-  --do_train \
-  --do_eval \
-  --data_dir $GLUE_DIR/$TASK_NAME \
-  --max_seq_length 128 \
-  --train_batch_size 32 \
-  --learning_rate 3e-5 \
-  --num_train_epochs 3.0 \
-  --output_dir /tmp/$TASK_NAME \
-  --overwrite_output_dir \
-  --logging_steps 50 \
-  --save_steps 200 \
-  --num_cores=8 \
-  --only_log_master
-```
-
-
-## Language model training
-
-Based on the script [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py).
-
-Fine-tuning (or training from scratch) the library models for language modeling on a text dataset for GPT, GPT-2, BERT and RoBERTa (DistilBERT
-to be added soon). GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa
-are fine-tuned using a masked language modeling (MLM) loss.
-
-Before running the following example, you should get a file that contains text on which the language model will be
-trained or fine-tuned. A good example of such text is the [WikiText-2 dataset](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/).
-
-We will refer to two different files: `$TRAIN_FILE`, which contains text for training, and `$TEST_FILE`, which contains
-text that will be used for evaluation.
-
-### GPT-2/GPT and causal language modeling
-
-The following example fine-tunes GPT-2 on WikiText-2. We're using the raw WikiText-2 (no tokens were replaced before
-the tokenization). The loss here is that of causal language modeling.
-
-```bash
-export TRAIN_FILE=/path/to/dataset/wiki.train.raw
-export TEST_FILE=/path/to/dataset/wiki.test.raw
-
-python run_language_modeling.py \
-    --output_dir=output \
-    --model_type=gpt2 \
-    --model_name_or_path=gpt2 \
-    --do_train \
-    --train_data_file=$TRAIN_FILE \
-    --do_eval \
-    --eval_data_file=$TEST_FILE
-```
-
-This takes about half an hour to train on a single K80 GPU and about one minute for the evaluation to run. It reaches
-a score of ~20 perplexity once fine-tuned on the dataset.
-
-### RoBERTa/BERT and masked language modeling
-
-The following example fine-tunes RoBERTa on WikiText-2. Here too, we're using the raw WikiText-2. The loss is different
-as BERT/RoBERTa have a bidirectional mechanism; we're therefore using the same loss that was used during their
-pre-training: masked language modeling.
-
-In accordance to the RoBERTa paper, we use dynamic masking rather than static masking. The model may, therefore, converge
-slightly slower (over-fitting takes more epochs).
-
-We use the `--mlm` flag so that the script may change its loss function.
-
-```bash
-export TRAIN_FILE=/path/to/dataset/wiki.train.raw
-export TEST_FILE=/path/to/dataset/wiki.test.raw
-
-python run_language_modeling.py \
-    --output_dir=output \
-    --model_type=roberta \
-    --model_name_or_path=roberta-base \
-    --do_train \
-    --train_data_file=$TRAIN_FILE \
-    --do_eval \
-    --eval_data_file=$TEST_FILE \
-    --mlm
-```
-
-## Language generation
-
-Based on the script [`run_generation.py`](https://github.com/huggingface/transformers/blob/master/examples/text-generation/run_generation.py).
-
-Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL, XLNet, CTRL.
-A similar script is used for our official demo [Write With Transfomer](https://transformer.huggingface.co), where you
-can try out the different models available in the library.
-
-Example usage:
-
-```bash
-python run_generation.py \
-    --model_type=gpt2 \
-    --model_name_or_path=gpt2
-```
-
-## GLUE
-
-Based on the script [`run_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_glue.py).
-
-Fine-tuning the library models for sequence classification on the GLUE benchmark: [General Language Understanding
-Evaluation](https://gluebenchmark.com/). This script can fine-tune the following models: BERT, XLM, XLNet and RoBERTa.
-
-GLUE is made up of a total of 9 different tasks. We get the following results on the dev set of the benchmark with an
-uncased  BERT base model (the checkpoint `bert-base-uncased`). All experiments ran single V100 GPUs with a total train
-batch sizes between 16 and 64. Some of these tasks have a small dataset and training can lead to high variance in the results
-between different runs. We report the median on 5 runs (with different seeds) for each of the metrics.
-
-| Task  | Metric                       | Result      |
-|-------|------------------------------|-------------|
-| CoLA  | Matthew's corr               | 49.23       |
-| SST-2 | Accuracy                     | 91.97       |
-| MRPC  | F1/Accuracy                  | 89.47/85.29 |
-| STS-B | Person/Spearman corr.        | 83.95/83.70 |
-| QQP   | Accuracy/F1                  | 88.40/84.31 |
-| MNLI  | Matched acc./Mismatched acc. | 80.61/81.08 |
-| QNLI  | Accuracy                     | 87.46       |
-| RTE   | Accuracy                     | 61.73       |
-| WNLI  | Accuracy                     | 45.07       |
-
-Some of these results are significantly different from the ones reported on the test set
-of GLUE benchmark on the website. For QQP and WNLI, please refer to [FAQ #12](https://gluebenchmark.com/faq) on the webite.
-
-Before running any one of these GLUE tasks you should download the
-[GLUE data](https://gluebenchmark.com/tasks) by running
-[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
-and unpack it to some directory `$GLUE_DIR`.
-
-```bash
-export GLUE_DIR=/path/to/glue
-export TASK_NAME=MRPC
-
-python run_glue.py \
-  --model_type bert \
-  --model_name_or_path bert-base-cased \
-  --task_name $TASK_NAME \
-  --do_train \
-  --do_eval \
-  --data_dir $GLUE_DIR/$TASK_NAME \
-  --max_seq_length 128 \
-  --per_gpu_train_batch_size 32 \
-  --learning_rate 2e-5 \
-  --num_train_epochs 3.0 \
-  --output_dir /tmp/$TASK_NAME/
-```
-
-where task name can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI.
-
-The dev set results will be present within the text file `eval_results.txt` in the specified output_dir.
-In case of MNLI, since there are two separate dev sets (matched and mismatched), there will be a separate
-output folder called `/tmp/MNLI-MM/` in addition to `/tmp/MNLI/`.
-
-The code has not been tested with half-precision training with apex on any GLUE task apart from MRPC, MNLI,
-CoLA, SST-2. The following section provides details on how to run half-precision training with MRPC. With that being
-said, there shouldn’t be any issues in running half-precision training with the remaining GLUE tasks as well,
-since the data processor for each task inherits from the base class DataProcessor.
-
-### MRPC
-
-#### Fine-tuning example
-
-The following examples fine-tune BERT on the Microsoft Research Paraphrase Corpus (MRPC) corpus and runs in less
-than 10 minutes on a single K-80 and in 27 seconds (!) on single tesla V100 16GB with apex installed.
-
-Before running any one of these GLUE tasks you should download the
-[GLUE data](https://gluebenchmark.com/tasks) by running
-[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
-and unpack it to some directory `$GLUE_DIR`.
-
-```bash
-export GLUE_DIR=/path/to/glue
-
-python run_glue.py \
-  --model_name_or_path bert-base-cased \
-  --task_name MRPC \
-  --do_train \
-  --do_eval \
-  --data_dir $GLUE_DIR/MRPC/ \
-  --max_seq_length 128 \
-  --per_gpu_train_batch_size 32 \
-  --learning_rate 2e-5 \
-  --num_train_epochs 3.0 \
-  --output_dir /tmp/mrpc_output/
-```
-
-Our test ran on a few seeds with [the original implementation hyper-
-parameters](https://github.com/google-research/bert#sentence-and-sentence-pair-classification-tasks) gave evaluation
-results between 84% and 88%.
-
-#### Using Apex and mixed-precision
-
-Using Apex and 16 bit precision, the fine-tuning on MRPC only takes 27 seconds. First install
-[apex](https://github.com/NVIDIA/apex), then run the following example:
-
-```bash
-export GLUE_DIR=/path/to/glue
-
-python run_glue.py \
-  --model_name_or_path bert-base-cased \
-  --task_name MRPC \
-  --do_train \
-  --do_eval \
-  --data_dir $GLUE_DIR/MRPC/ \
-  --max_seq_length 128 \
-  --per_gpu_train_batch_size 32 \
-  --learning_rate 2e-5 \
-  --num_train_epochs 3.0 \
-  --output_dir /tmp/mrpc_output/ \
-  --fp16
-```
-
-#### Distributed training
-
-Here is an example using distributed training on 8 V100 GPUs. The model used is the BERT whole-word-masking and it
-reaches F1 > 92 on MRPC.
-
-```bash
-export GLUE_DIR=/path/to/glue
-
-python -m torch.distributed.launch \
-    --nproc_per_node 8 run_glue.py \
-    --model_name_or_path bert-base-cased \
-    --task_name MRPC \
-    --do_train \
-    --do_eval \
-    --data_dir $GLUE_DIR/MRPC/ \
-    --max_seq_length 128 \
-    --per_gpu_train_batch_size 8 \
-    --learning_rate 2e-5 \
-    --num_train_epochs 3.0 \
-    --output_dir /tmp/mrpc_output/
-```
-
-Training with these hyper-parameters gave us the following results:
-
-```bash
-acc = 0.8823529411764706
-acc_and_f1 = 0.901702786377709
-eval_loss = 0.3418912578906332
-f1 = 0.9210526315789473
-global_step = 174
-loss = 0.07231863956341798
-```
-
-### MNLI
-
-The following example uses the BERT-large, uncased, whole-word-masking model and fine-tunes it on the MNLI task.
-
-```bash
-export GLUE_DIR=/path/to/glue
-
-python -m torch.distributed.launch \
-    --nproc_per_node 8 run_glue.py \
-    --model_name_or_path bert-base-cased \
-    --task_name mnli \
-    --do_train \
-    --do_eval \
-    --data_dir $GLUE_DIR/MNLI/ \
-    --max_seq_length 128 \
-    --per_gpu_train_batch_size 8 \
-    --learning_rate 2e-5 \
-    --num_train_epochs 3.0 \
-    --output_dir output_dir \
-```
-
-The results  are the following:
-
-```bash
-***** Eval results *****
-  acc = 0.8679706601466992
-  eval_loss = 0.4911287787382479
-  global_step = 18408
-  loss = 0.04755385363816904
-
-***** Eval results *****
-  acc = 0.8747965825874695
-  eval_loss = 0.45516540421714036
-  global_step = 18408
-  loss = 0.04755385363816904
-```
-
-## Multiple Choice
-
-Based on the script [`run_multiple_choice.py`]().
-
-#### Fine-tuning on SWAG
-Download [swag](https://github.com/rowanz/swagaf/tree/master/data) data
-
-```bash
-#training on 4 tesla V100(16GB) GPUS
-export SWAG_DIR=/path/to/swag_data_dir
-python ./examples/multiple-choice/run_multiple_choice.py \
--task_name swag \
--model_name_or_path roberta-base \
--do_train \
--do_eval \
--data_dir $SWAG_DIR \
--learning_rate 5e-5 \
--num_train_epochs 3 \
--max_seq_length 80 \
--output_dir models_bert/swag_base \
--per_gpu_eval_batch_size=16 \
--per_gpu_train_batch_size=16 \
--gradient_accumulation_steps 2 \
--overwrite_output
-```
-Training with the defined hyper-parameters yields the following results:
-```
-***** Eval results *****
-eval_acc = 0.8338998300509847
-eval_loss = 0.44457291918821606
-```
-
-## SQuAD
-
-Based on the script [`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py).
-
-#### Fine-tuning BERT on SQuAD1.0
-
-This example code fine-tunes BERT on the SQuAD1.0 dataset. It runs in 24 min (with BERT-base) or 68 min (with BERT-large)
-on a single tesla V100 16GB. The data for SQuAD can be downloaded with the following links and should be saved in a
-$SQUAD_DIR directory.
-
-* [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)
-* [dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)
-* [evaluate-v1.1.py](https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py)
-
-And for SQuAD2.0, you need to download:
-
- [train-v2.0.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json)
- [dev-v2.0.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json)
- [evaluate-v2.0.py](https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/)
-
-```bash
-export SQUAD_DIR=/path/to/SQUAD
-
-python run_squad.py \
-  --model_type bert \
-  --model_name_or_path bert-base-uncased \
-  --do_train \
-  --do_eval \
-  --train_file $SQUAD_DIR/train-v1.1.json \
-  --predict_file $SQUAD_DIR/dev-v1.1.json \
-  --per_gpu_train_batch_size 12 \
-  --learning_rate 3e-5 \
-  --num_train_epochs 2.0 \
-  --max_seq_length 384 \
-  --doc_stride 128 \
-  --output_dir /tmp/debug_squad/
-```
-
-Training with the previously defined hyper-parameters yields the following results:
-
-```bash
-f1 = 88.52
-exact_match = 81.22
-```
-
-#### Distributed training
-
-
-Here is an example using distributed training on 8 V100 GPUs and Bert Whole Word Masking uncased model to reach a F1 > 93 on SQuAD1.1:
-
-```bash
-python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answering/run_squad.py \
-    --model_type bert \
-    --model_name_or_path bert-large-uncased-whole-word-masking \
-    --do_train \
-    --do_eval \
-    --train_file $SQUAD_DIR/train-v1.1.json \
-    --predict_file $SQUAD_DIR/dev-v1.1.json \
-    --learning_rate 3e-5 \
-    --num_train_epochs 2 \
-    --max_seq_length 384 \
-    --doc_stride 128 \
-    --output_dir ./examples/models/wwm_uncased_finetuned_squad/ \
-    --per_gpu_eval_batch_size=3   \
-    --per_gpu_train_batch_size=3   \
-```
-
-Training with the previously defined hyper-parameters yields the following results:
-
-```bash
-f1 = 93.15
-exact_match = 86.91
-```
-
-This fine-tuned model is available as a checkpoint under the reference
-`bert-large-uncased-whole-word-masking-finetuned-squad`.
-
-#### Fine-tuning XLNet on SQuAD
-
-This example code fine-tunes XLNet on both SQuAD1.0 and SQuAD2.0 dataset. See above to download the data for SQuAD .
-
-##### Command for SQuAD1.0:
-
-```bash
-export SQUAD_DIR=/path/to/SQUAD
-
-python run_squad.py \
-    --model_type xlnet \
-    --model_name_or_path xlnet-large-cased \
-    --do_train \
-    --do_eval \
-    --train_file $SQUAD_DIR/train-v1.1.json \
-    --predict_file $SQUAD_DIR/dev-v1.1.json \
-    --learning_rate 3e-5 \
-    --num_train_epochs 2 \
-    --max_seq_length 384 \
-    --doc_stride 128 \
-    --output_dir ./wwm_cased_finetuned_squad/ \
-    --per_gpu_eval_batch_size=4  \
-    --per_gpu_train_batch_size=4   \
-    --save_steps 5000
-```
-
-##### Command for SQuAD2.0:
-
-```bash
-export SQUAD_DIR=/path/to/SQUAD
-
-python run_squad.py \
-    --model_type xlnet \
-    --model_name_or_path xlnet-large-cased \
-    --do_train \
-    --do_eval \
-    --version_2_with_negative \
-    --train_file $SQUAD_DIR/train-v2.0.json \
-    --predict_file $SQUAD_DIR/dev-v2.0.json \
-    --learning_rate 3e-5 \
-    --num_train_epochs 4 \
-    --max_seq_length 384 \
-    --doc_stride 128 \
-    --output_dir ./wwm_cased_finetuned_squad/ \
-    --per_gpu_eval_batch_size=2  \
-    --per_gpu_train_batch_size=2   \
-    --save_steps 5000
-```
-
-Larger batch size may improve the performance while costing more memory.
-
-##### Results for SQuAD1.0 with the previously defined hyper-parameters:
-
-```python
-{
-"exact": 85.45884578997162,
-"f1": 92.5974600601065,
-"total": 10570,
-"HasAns_exact": 85.45884578997162,
-"HasAns_f1": 92.59746006010651,
-"HasAns_total": 10570
-}
-```
-
-##### Results for SQuAD2.0 with the previously defined hyper-parameters:
-
-```python
-{
-"exact": 80.4177545691906,
-"f1": 84.07154997729623,
-"total": 11873,
-"HasAns_exact": 76.73751686909581,
-"HasAns_f1": 84.05558584352873,
-"HasAns_total": 5928,
-"NoAns_exact": 84.0874684608915,
-"NoAns_f1": 84.0874684608915,
-"NoAns_total": 5945
-}
-```
-
-
-
-
-## XNLI
-
-Based on the script [`run_xnli.py`](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_xnli.py).
-
-[XNLI](https://www.nyu.edu/projects/bowman/xnli/) is crowd-sourced dataset based on [MultiNLI](http://www.nyu.edu/projects/bowman/multinli/). It is an evaluation benchmark for cross-lingual text representations. Pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-resource language such as English and low-resource languages such as Swahili).
-
-#### Fine-tuning on XNLI
-
-This example code fine-tunes mBERT (multi-lingual BERT) on the XNLI dataset. It runs in 106 mins
-on a single tesla V100 16GB. The data for XNLI can be downloaded with the following links and should be both saved (and un-zipped) in a
-`$XNLI_DIR` directory.
-
-* [XNLI 1.0](https://www.nyu.edu/projects/bowman/xnli/XNLI-1.0.zip)
-* [XNLI-MT 1.0](https://www.nyu.edu/projects/bowman/xnli/XNLI-MT-1.0.zip)
-
-```bash
-export XNLI_DIR=/path/to/XNLI
-
-python run_xnli.py \
-  --model_type bert \
-  --model_name_or_path bert-base-multilingual-cased \
-  --language de \
-  --train_language en \
-  --do_train \
-  --do_eval \
-  --data_dir $XNLI_DIR \
-  --per_gpu_train_batch_size 32 \
-  --learning_rate 5e-5 \
-  --num_train_epochs 2.0 \
-  --max_seq_length 128 \
-  --output_dir /tmp/debug_xnli/ \
-  --save_steps -1
-```
-
-Training with the previously defined hyper-parameters yields the following results on the **test** set:
-
-```bash
-acc = 0.7093812375249501
-```
-
-## MM-IMDb
-
-Based on the script [`run_mmimdb.py`](https://github.com/huggingface/transformers/blob/master/examples/contrib/mm-imdb/run_mmimdb.py).
-
-[MM-IMDb](http://lisi1.unal.edu.co/mmimdb/) is a Multimodal dataset with around 26,000 movies including images, plots and other metadata.
-
-### Training on MM-IMDb
-
-```
-python run_mmimdb.py \
-    --data_dir /path/to/mmimdb/dataset/ \
-    --model_type bert \
-    --model_name_or_path bert-base-uncased \
-    --output_dir /path/to/save/dir/ \
-    --do_train \
-    --do_eval \
-    --max_seq_len 512 \
-    --gradient_accumulation_steps 20 \
-    --num_image_embeds 3 \
-    --num_train_epochs 100 \
-    --patience 5
-```
-
-## Adversarial evaluation of model performances
-
-Here is an example on evaluating a model using adversarial evaluation of natural language inference with the Heuristic Analysis for NLI Systems (HANS) dataset [McCoy et al., 2019](https://arxiv.org/abs/1902.01007). The example was gracefully provided by [Nafise Sadat Moosavi](https://github.com/ns-moosavi).
-
-The HANS dataset can be downloaded from [this location](https://github.com/tommccoy1/hans).
-
-This is an example of using test_hans.py:
-
-```bash
-export HANS_DIR=path-to-hans
-export MODEL_TYPE=type-of-the-model-e.g.-bert-roberta-xlnet-etc
-export MODEL_PATH=path-to-the-model-directory-that-is-trained-on-NLI-e.g.-by-using-run_glue.py
-
-python examples/hans/test_hans.py \
-        --task_name hans \
-        --model_type $MODEL_TYPE \
-        --do_eval \
-        --data_dir $HANS_DIR \
-        --model_name_or_path $MODEL_PATH \
-        --max_seq_length 128 \
-        --output_dir $MODEL_PATH \
-```
-
-This will create the hans_predictions.txt file in MODEL_PATH, which can then be evaluated using hans/evaluate_heur_output.py from the HANS dataset.
-
-The results of the BERT-base model that is trained on MNLI using batch size 8 and the random seed 42 on the HANS dataset is as follows:
-
-```bash
-Heuristic entailed results:
-lexical_overlap: 0.9702
-subsequence: 0.9942
-constituent: 0.9962
-
-Heuristic non-entailed results:
-lexical_overlap: 0.199
-subsequence: 0.0396
-constituent: 0.118
-```
--- a/docs/source/examples.md
+++ b/docs/source/examples.md
@@ -0,0 +1 @@
+../../examples/README.md
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -109,3 +109,4 @@ The library currently contains PyTorch and Tensorflow implementations, pre-train
    model_doc/dialogpt
    model_doc/reformer
    model_doc/marian
+    model_doc/longformer
--- a/docs/source/model_doc/albert.rst
+++ b/docs/source/model_doc/albert.rst
@@ -6,7 +6,7 @@ Overview

 The ALBERT model was proposed in `ALBERT: A Lite BERT for Self-supervised Learning of Language Representations <https://arxiv.org/abs/1909.11942>`_
 by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut. It presents
-two parameter-reduction techniques to lower memory consumption and increase the trainig speed of BERT:
+two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT:

 - Splitting the embedding matrix into two smaller matrices
 - Using repeating layers split among groups
@@ -94,3 +94,17 @@ TFAlbertForSequenceClassification

 .. autoclass:: transformers.TFAlbertForSequenceClassification
    :members:
+
+
+TFAlbertForMultipleChoice
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.TFAlbertForMultipleChoice
+    :members:
+
+
+TFAlbertForQuestionAnswering
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.TFAlbertForQuestionAnswering
+    :members:
--- a/docs/source/model_doc/bart.rst
+++ b/docs/source/model_doc/bart.rst
@@ -22,7 +22,7 @@ Implementation Notes
 - The forward pass of ``BartModel`` will create decoder inputs (using the helper function ``transformers.modeling_bart._prepare_bart_decoder_inputs``)  if they are not passed. This is different than some other modeling APIs.
 - Model predictions are intended to be identical to the original implementation. This only works, however, if the string you pass to ``fairseq.encode`` starts with a space.
 - ``BartForConditionalGeneration.generate`` should be used for conditional generation tasks like summarization, see the example in that docstrings
- Models that load the ``"bart-large-cnn"`` weights will not have a ``mask_token_id``, or be able to perform mask filling tasks.
+- Models that load the ``"facebook/bart-large-cnn"`` weights will not have a ``mask_token_id``, or be able to perform mask filling tasks.



--- a/docs/source/model_doc/longformer.rst
+++ b/docs/source/model_doc/longformer.rst
@@ -0,0 +1,91 @@
+Longformer
+----------------------------------------------------
+**DISCLAIMER:** This model is still a work in progress, if you see something strange,
+file a `Github Issue <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`_
+
+Overview
+~~~~~
+The Longformer model was presented in `Longformer: The Long-Document Transformer <https://arxiv.org/pdf/2004.05150.pdf>`_ by Iz Beltagy, Matthew E. Peters, Arman Cohan.
+Here the abstract: 
+
+*Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. Longformer's attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA.*
+
+The Authors' code can be found `here <https://github.com/allenai/longformer>`_ .
+
+Longformer Self Attention
+~~~~~~~~~~~~~~~~~~~~
+Longformer self attention employs self attention on both a "local" context and a "global" context.
+Most tokens only attend "locally" to each other meaning that each token attends to its :math:`\frac{1}{2} w` previous tokens and :math:`\frac{1}{2} w` succeding tokens with :math:`w` being the window length as defined in `config.attention_window`. Note that `config.attention_window` can be of type ``list`` to define a different :math:`w` for each layer. 
+A selecetd few tokens attend "globally" to all other tokens, as it is conventionally done for all tokens in *e.g.* `BertSelfAttention`.
+
+Note that "locally" and "globally" attending tokens are projected by different query, key and value matrices.
+Also note that every "locally" attending token not only attends to tokens within its window :math:`w`, but also to all "globally" attending tokens so that global attention is *symmetric*.
+
+The user can define which tokens attend "locally" and which tokens attend "globally" by setting the tensor `global_attention_mask` at run-time appropriately. `Longformer` employs the following logic for `global_attention_mask`: `0` - the token attends "locally", `1` - token attends "globally". For more information please also refer to :func:`~transformers.LongformerModel.forward` method.
+
+Using Longformer self attention, the memory and time complexity of the query-key matmul operation, which usually represents the memory and time bottleneck, can be reduced from :math:`\mathcal{O}(n_s \times n_s)` to :math:`\mathcal{O}(n_s \times w)`, with :math:`n_s` being the sequence length and :math:`w` being the average window size. It is assumed that the number of "globally" attending tokens is insignificant as compared to the number of "locally" attending tokens.
+
+For more information, please refer to the official `paper <https://arxiv.org/pdf/2004.05150.pdf>`_ .
+
+
+Training
+~~~~~~~~~~~~~~~~~~~~
+``LongformerForMaskedLM`` is trained the exact same way, ``RobertaForMaskedLM`` is trained and 
+should be used as follows:
+
+::
+
+  input_ids = tokenizer.encode('This is a sentence from [MASK] training data', return_tensors='pt')
+  mlm_labels = tokenizer.encode('This is a sentence from the training data', return_tensors='pt')
+
+  loss = model(input_ids, labels=input_ids, masked_lm_labels=mlm_labels)[0]
+
+
+LongformerConfig
+~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.LongformerConfig
+    :members:
+
+
+LongformerTokenizer
+~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.LongformerTokenizer
+    :members: 
+
+
+LongformerModel
+~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.LongformerModel
+    :members:
+
+
+LongformerForMaskedLM
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.LongformerForMaskedLM
+    :members:
+
+
+LongformerForQuestionAnswering
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.LongformerForQuestionAnswering
+    :members:
+
+
+LongformerForMultipleChoice
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.LongformerForMultipleChoice
+    :members:
+
+
+LongformerForTokenClassification
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.LongformerForTokenClassification
+    :members:
+
--- a/docs/source/model_doc/reformer.rst
+++ b/docs/source/model_doc/reformer.rst
@@ -5,7 +5,7 @@ file a `Github Issue <https://github.com/huggingface/transformers/issues/new?ass

 Overview
 ~~~~~
-The Reformer model was presented in `Reformer: The Efficient Transformer <https://https://arxiv.org/abs/2001.04451.pdf>`_ by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
+The Reformer model was presented in `Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451.pdf>`_ by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
 Here the abstract: 

 *Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from O(L^2) to O(Llog(L)), where L is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of N times, where N is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.*
@@ -62,7 +62,7 @@ For more information, see the `original Paper <https://arxiv.org/abs/2001.04451>

 Note that ``config.num_buckets`` can also be factorized into a ``list``:math:`(n_{\text{buckets}}^1, n_{\text{buckets}}^2)`. This way instead of assigning the query key embedding vectors to one of :math:`(1,\ldots, n_{\text{buckets}})` they are assigned to one of :math:`(1-1,\ldots, n_{\text{buckets}}^1-1, \ldots, 1-n_{\text{buckets}}^2, \ldots, n_{\text{buckets}}^1-n_{\text{buckets}}^2)`. This is crucial for very long sequences to save memory.

-It is recommended to leave ``config.num_buckets=None``, so that depending on the sequence length, a good value for ``num_buckets`` are calculated on the fly.
+When training a model from scratch, it is recommended to leave ``config.num_buckets=None``, so that depending on the sequence length a good value for ``num_buckets`` is calculated on the fly. This value will then automatically be saved in the config and should be reused for inference.

 Using LSH self attention, the memory and time complexity of the query-key matmul operation can be reduced from :math:`\mathcal{O}(n_s \times n_s)` to :math:`\mathcal{O}(n_s \times \log(n_s))`, which usually represents the memory and time bottleneck in a transformer model, with :math:`n_s` being the sequence length.

--- a/docs/source/model_doc/roberta.rst
+++ b/docs/source/model_doc/roberta.rst
@@ -74,6 +74,13 @@ RobertaForSequenceClassification
    :members:


+RobertaForMultipleChoice
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.RobertaForMultipleChoice
+    :members:
+
+
 RobertaForTokenClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

--- a/docs/source/pretrained_models.rst
+++ b/docs/source/pretrained_models.rst
@@ -63,33 +63,33 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
 |                   |                                                            | | Trained on uncased German text by DBMDZ                                                                                             |
 |                   |                                                            | (see `details on dbmdz repository <https://github.com/dbmdz/german-bert>`__).                                                         |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                   | ``bert-base-japanese``                                     | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
+|                   | ``cl-tohoku/bert-base-japanese``                           | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
 |                   |                                                            | | Trained on Japanese text. Text is tokenized with MeCab and WordPiece.                                                               |
 |                   |                                                            | | `MeCab <https://taku910.github.io/mecab/>`__ is required for tokenization.                                                          |
 |                   |                                                            | (see `details on cl-tohoku repository <https://github.com/cl-tohoku/bert-japanese>`__).                                               |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                   | ``bert-base-japanese-whole-word-masking``                  | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
+|                   | ``cl-tohoku/bert-base-japanese-whole-word-masking``        | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
 |                   |                                                            | | Trained on Japanese text using Whole-Word-Masking. Text is tokenized with MeCab and WordPiece.                                      |
 |                   |                                                            | | `MeCab <https://taku910.github.io/mecab/>`__ is required for tokenization.                                                          |
 |                   |                                                            | (see `details on cl-tohoku repository <https://github.com/cl-tohoku/bert-japanese>`__).                                               |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                   | ``bert-base-japanese-char``                                | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
+|                   | ``cl-tohoku/bert-base-japanese-char``                      | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
 |                   |                                                            | | Trained on Japanese text. Text is tokenized into characters.                                                                        |
 |                   |                                                            | (see `details on cl-tohoku repository <https://github.com/cl-tohoku/bert-japanese>`__).                                               |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                   | ``bert-base-japanese-char-whole-word-masking``             | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
+|                   | ``cl-tohoku/bert-base-japanese-char-whole-word-masking``   | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
 |                   |                                                            | | Trained on Japanese text using Whole-Word-Masking. Text is tokenized into characters.                                               |
 |                   |                                                            | (see `details on cl-tohoku repository <https://github.com/cl-tohoku/bert-japanese>`__).                                               |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                   | ``bert-base-finnish-cased-v1``                             | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
+|                   | ``TurkuNLP/bert-base-finnish-cased-v1``                    | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
 |                   |                                                            | | Trained on cased Finnish text.                                                                                                      |
 |                   |                                                            | (see `details on turkunlp.org <http://turkunlp.org/FinBERT/>`__).                                                                     |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                   | ``bert-base-finnish-uncased-v1``                           | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
+|                   | ``TurkuNLP/bert-base-finnish-uncased-v1``                  | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
 |                   |                                                            | | Trained on uncased Finnish text.                                                                                                    |
 |                   |                                                            | (see `details on turkunlp.org <http://turkunlp.org/FinBERT/>`__).                                                                     |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                   | ``bert-base-dutch-cased``                                  | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
+|                   | ``wietsedv/bert-base-dutch-cased``                         | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
 |                   |                                                            | | Trained on cased Dutch text.                                                                                                        |
 |                   |                                                            | (see `details on wietsedv repository <https://github.com/wietsedv/bertje/>`__).                                                       |
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
@@ -259,32 +259,32 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
 |                   | ``xlm-roberta-large``                                      | | ~355M parameters with 24-layers, 1027-hidden-state, 4096 feed-forward hidden-state, 16-heads,                                       |
 |                   |                                                            | | Trained on 2.5 TB of newly created clean CommonCrawl data in 100 languages                                                          |
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-| FlauBERT          | ``flaubert-small-cased``                                   | | 6-layer, 512-hidden, 8-heads, 54M parameters                                                                                        |
+| FlauBERT          | ``flaubert/flaubert_small_cased``                          | | 6-layer, 512-hidden, 8-heads, 54M parameters                                                                                        |
 |                   |                                                            | | FlauBERT small architecture                                                                                                         |
 |                   |                                                            | (see `details <https://github.com/getalp/Flaubert>`__)                                                                                |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                   | ``flaubert-base-uncased``                                  | | 12-layer, 768-hidden, 12-heads, 137M parameters                                                                                     |
+|                   | ``flaubert/flaubert_base_uncased``                         | | 12-layer, 768-hidden, 12-heads, 137M parameters                                                                                     |
 |                   |                                                            | | FlauBERT base architecture with uncased vocabulary                                                                                  |
 |                   |                                                            | (see `details <https://github.com/getalp/Flaubert>`__)                                                                                |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                   | ``flaubert-base-cased``                                    | | 12-layer, 768-hidden, 12-heads, 138M parameters                                                                                     |
+|                   | ``flaubert/flaubert_base_cased``                           | | 12-layer, 768-hidden, 12-heads, 138M parameters                                                                                     |
 |                   |                                                            | | FlauBERT base architecture with cased vocabulary                                                                                    |
 |                   |                                                            | (see `details <https://github.com/getalp/Flaubert>`__)                                                                                |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                   | ``flaubert-large-cased``                                   | | 24-layer, 1024-hidden, 16-heads, 373M parameters                                                                                    |
+|                   | ``flaubert/flaubert_large_cased``                          | | 24-layer, 1024-hidden, 16-heads, 373M parameters                                                                                    |
 |                   |                                                            | | FlauBERT large architecture                                                                                                         |
 |                   |                                                            | (see `details <https://github.com/getalp/Flaubert>`__)                                                                                |
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-| Bart              | ``bart-large``                                             | | 24-layer, 1024-hidden, 16-heads, 406M parameters                                                                                    |
+| Bart              | ``facebook/bart-large``                                    | | 24-layer, 1024-hidden, 16-heads, 406M parameters                                                                                    |
 |                   |                                                            | (see `details <https://github.com/pytorch/fairseq/tree/master/examples/bart>`_)                                                       |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                   | ``bart-large-mnli``                                        | | Adds a 2 layer classification head with 1 million parameters                                                                        |
+|                   | ``facebook/bart-large-mnli``                               | | Adds a 2 layer classification head with 1 million parameters                                                                        |
 |                   |                                                            | | bart-large base architecture with a classification head, finetuned on MNLI                                                          |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                   | ``bart-large-cnn``                                         | | 12-layer, 1024-hidden, 16-heads, 406M parameters       (same as base)                                                               |
+|                   | ``facebook/bart-large-cnn``                                | | 12-layer, 1024-hidden, 16-heads, 406M parameters       (same as base)                                                               |
 |                   |                                                            | | bart-large base architecture finetuned on cnn summarization task                                                                    |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-|                   | ``mbart-large-en-ro``                                      | | 12-layer, 1024-hidden, 16-heads, 880M parameters                                                                                    |
+|                   | ``facebook/mbart-large-en-ro``                             | | 12-layer, 1024-hidden, 16-heads, 880M parameters                                                                                    |
 |                   |                                                            | | bart-large architecture pretrained on cc25 multilingual data , finetuned on WMT english romanian translation.                       |
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
 | DialoGPT          | ``DialoGPT-small``                                         | | 12-layer, 768-hidden, 12-heads, 124M parameters                                                                                     |
@@ -305,3 +305,9 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
 | MarianMT          | ``Helsinki-NLP/opus-mt-{src}-{tgt}``                       | | 12-layer, 512-hidden, 8-heads, ~74M parameter Machine translation models. Parameter counts vary depending on vocab size.            |
 |                   |                                                            | | (see `model list <https://huggingface.co/Helsinki-NLP>`_)                                                                           |
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+| Longformer        | ``allenai/longformer-base-4096``                           | | 12-layer, 768-hidden, 12-heads, ~149M parameters                                                                                    |
+|                   |                                                            | | Starting from RoBERTa-base checkpoint, trained on documents of max length 4,096                                                     |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``allenai/longformer-large-4096``                          | | 24-layer, 1024-hidden, 16-heads, ~435M parameters                                                                                   |
+|                   |                                                            | | Starting from RoBERTa-large checkpoint, trained on documents of max length 4,096                                                    |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
--- a/examples/README.md
+++ b/examples/README.md
@@ -1,6 +1,7 @@
-# Examples
+## Examples

-Version 2.9 of `transformers` introduces a new `Trainer` class for PyTorch, and its equivalent `TFTrainer` for TF 2.
+Version 2.9 of `transformers` introduces a new [`Trainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) class for PyTorch, and its equivalent [`TFTrainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer_tf.py) for TF 2.
+Running the examples requires PyTorch 1.3.1+ or TensorFlow 2.0+.

 Here is the list of all our examples:
 - **grouped by task** (all official examples work for multiple models)
@@ -12,32 +13,24 @@ Here is the list of all our examples:
 This is still a work-in-progress – in particular documentation is still sparse – so please **contribute improvements/pull requests.**


-## Tasks built on Trainer
+# The Big Table of Tasks

-| Task | Example datasets | Trainer support | TFTrainer support | pytorch-lightning | Colab | One-click Deploy to Azure (wip) |
-|---|---|:---:|:---:|:---:|:---:|:---:|
-| [`language-modeling`](./language-modeling) | Raw text | ✅ | - | - | - | - |
-| [`text-classification`](./text-classification) | GLUE, XNLI | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/trainer/01_text_classification.ipynb) | [![Deploy to Azure](https://aka.ms/deploytoazurebutton)](https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2FAzure%2Fazure-quickstart-templates%2Fmaster%2F101-storage-account-create%2Fazuredeploy.json) |
-| [`token-classification`](./token-classification) | CoNLL NER | ✅ | ✅ | ✅ | - | - |
-| [`multiple-choice`](./multiple-choice) | SWAG, RACE, ARC | ✅ | ✅ | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ViktorAlm/notebooks/blob/master/MPC_GPU_Demo_for_TF_and_PT.ipynb) | - |
-| [`question-answering`](./question-answering) | SQuAD | - | ✅ | - | - | - |
+| Task | Example datasets | Trainer support | TFTrainer support | pytorch-lightning | Colab
+|---|---|:---:|:---:|:---:|:---:|
+| [**`language-modeling`**](https://github.com/huggingface/transformers/tree/master/examples/language-modeling)       | Raw text        | ✅ | -  | -  | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)
+| [**`text-classification`**](https://github.com/huggingface/transformers/tree/master/examples/text-classification)   | GLUE, XNLI      | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/trainer/01_text_classification.ipynb)
+| [**`token-classification`**](https://github.com/huggingface/transformers/tree/master/examples/token-classification) | CoNLL NER       | ✅ | ✅ | ✅ | -
+| [**`multiple-choice`**](https://github.com/huggingface/transformers/tree/master/examples/multiple-choice)           | SWAG, RACE, ARC | ✅ | ✅ | -  | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ViktorAlm/notebooks/blob/master/MPC_GPU_Demo_for_TF_and_PT.ipynb)
+| [**`question-answering`**](https://github.com/huggingface/transformers/tree/master/examples/question-answering)     | SQuAD           | -  | ✅ | -  | -
+| [**`text-generation`**](https://github.com/huggingface/transformers/tree/master/examples/text-generation)     | -           | -  | - | -  | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)
+| [**`distillation`**](https://github.com/huggingface/transformers/tree/master/examples/distillation)       | All               | -  | -  | -  | -
+| [**`summarization`**](https://github.com/huggingface/transformers/tree/master/examples/summarization)     | CNN/Daily Mail    | -  | -  | -  | -
+| [**`translation`**](https://github.com/huggingface/transformers/tree/master/examples/translation)         | WMT               | -  | -  | -  | -
+| [**`bertology`**](https://github.com/huggingface/transformers/tree/master/examples/bertology)             | -                 | -  | -  | -  | -
+| [**`adversarial`**](https://github.com/huggingface/transformers/tree/master/examples/adversarial)         | HANS              | -  | -  | -  | -


-
-## Other examples and how-to's
-
-| Section | Description |
-|---|---|
-| [TensorFlow 2.0 models on GLUE](./text-classification) | Examples running BERT TensorFlow 2.0 model on the GLUE tasks. |
-| [Running on TPUs](#running-on-tpus) | Examples on running fine-tuning tasks on Google TPUs to accelerate workloads. |
-| [Language Model training](./language-modeling) | Fine-tuning (or training from scratch) the library models for language modeling on a text dataset. Causal language modeling for GPT/GPT-2, masked language modeling for BERT/RoBERTa. |
-| [Language Generation](./text-generation) | Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL and XLNet. |
-| [GLUE](./text-classification) | Examples running BERT/XLM/XLNet/RoBERTa on the 9 GLUE tasks. Examples feature distributed training as well as half-precision. |
-| [SQuAD](./question-answering) | Using BERT/RoBERTa/XLNet/XLM for question answering, examples with distributed training. |
-| [Multiple Choice](./multiple-choice) | Examples running BERT/XLNet/RoBERTa on the SWAG/RACE/ARC tasks. |
-| [Named Entity Recognition](./token-classification) | Using BERT for Named Entity Recognition (NER) on the CoNLL 2003 dataset, examples with distributed training. |
-| [XNLI](./text-classification) | Examples running BERT/XLM on the XNLI benchmark. |
-| [Adversarial evaluation of model performances](./adversarial) | Testing a model with adversarial evaluation of natural language inference on the Heuristic Analysis for NLI Systems (HANS) dataset (McCoy et al., 2019.) |
+<br>

 ## Important note

@@ -52,6 +45,12 @@ pip install .
 pip install -r ./examples/requirements.txt
 ```

+## One-click Deploy to Cloud (wip)
+
+#### Azure
+
+[![Deploy to Azure](https://aka.ms/deploytoazurebutton)](https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2FAzure%2Fazure-quickstart-templates%2Fmaster%2F101-storage-account-create%2Fazuredeploy.json)
+
 ## Running on TPUs

 When using Tensorflow, TPUs are supported out of the box as a `tf.distribute.Strategy`.
@@ -59,7 +58,7 @@ When using Tensorflow, TPUs are supported out of the box as a `tf.distribute.Str
 When using PyTorch, we support TPUs thanks to `pytorch/xla`. For more context and information on how to setup your TPU environment refer to Google's documentation and to the
 very detailed [pytorch/xla README](https://github.com/pytorch/xla/blob/master/README.md).

-In this repo, we provide a very simple launcher script named [xla_spawn.py](./xla_spawn.py) that lets you run our example scripts on multiple TPU cores without any boilerplate.
+In this repo, we provide a very simple launcher script named [xla_spawn.py](https://github.com/huggingface/transformers/tree/master/examples/xla_spawn.py) that lets you run our example scripts on multiple TPU cores without any boilerplate.
 Just pass a `--num_cores` flag to this script, then your regular training script with its arguments (this is similar to the `torch.distributed.launch` helper for torch.distributed).

 For example for `run_glue`:
--- a/examples/adversarial/test_hans.py
+++ b/examples/adversarial/test_hans.py
@@ -65,13 +65,6 @@ except ImportError:

 logger = logging.getLogger(__name__)

-ALL_MODELS = sum(
-    (
-        tuple(conf.pretrained_config_archive_map.keys())
-        for conf in (BertConfig, XLNetConfig, XLMConfig, RobertaConfig, DistilBertConfig)
-    ),
-    (),
-)

 MODEL_CLASSES = {
    "bert": (BertConfig, BertForSequenceClassification, BertTokenizer),
@@ -389,7 +382,7 @@ def main():
        default=None,
        type=str,
        required=True,
-        help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS),
+        help="Path to pretrained model or model identifier from huggingface.co/models",
    )
    parser.add_argument(
        "--task_name",
--- a/examples/benchmarking/plot_csv_file.py
+++ b/examples/benchmarking/plot_csv_file.py
@@ -0,0 +1,113 @@
+import csv
+from collections import defaultdict
+from dataclasses import dataclass, field
+from typing import Optional
+
+import numpy as np
+
+import matplotlib.pyplot as plt
+from transformers import HfArgumentParser
+
+
+@dataclass
+class PlotArguments:
+    """
+    Arguments pertaining to which model/config/tokenizer we are going to fine-tune, or train from scratch.
+    """
+
+    csv_file: str = field(metadata={"help": "The csv file to plot."},)
+    plot_along_batch: bool = field(
+        default=False,
+        metadata={"help": "Whether to plot along batch size or sequence lengh. Defaults to sequence length."},
+    )
+    is_time: bool = field(
+        default=False,
+        metadata={"help": "Whether the csv file has time results or memory results. Defaults to memory results."},
+    )
+    is_train: bool = field(
+        default=False,
+        metadata={
+            "help": "Whether the csv file has training results or inference results. Defaults to inference results."
+        },
+    )
+    figure_png_file: Optional[str] = field(
+        default=None, metadata={"help": "Filename under which the plot will be saved. If unused no plot is saved."},
+    )
+
+
+class Plot:
+    def __init__(self, args):
+        self.args = args
+        self.result_dict = defaultdict(lambda: dict(bsz=[], seq_len=[], result={}))
+
+        with open(self.args.csv_file, newline="") as csv_file:
+            reader = csv.DictReader(csv_file)
+            for row in reader:
+                model_name = row["model"]
+                self.result_dict[model_name]["bsz"].append(int(row["batch_size"]))
+                self.result_dict[model_name]["seq_len"].append(int(row["sequence_length"]))
+                self.result_dict[model_name]["result"][(int(row["batch_size"]), int(row["sequence_length"]))] = row[
+                    "result"
+                ]
+
+    def plot(self):
+        fig, ax = plt.subplots()
+        title_str = "Time usage" if self.args.is_time else "Memory usage"
+        title_str = title_str + " for training" if self.args.is_train else title_str + " for inference"
+
+        for model_name in self.result_dict.keys():
+            batch_sizes = sorted(list(set(self.result_dict[model_name]["bsz"])))
+            sequence_lengths = sorted(list(set(self.result_dict[model_name]["seq_len"])))
+            results = self.result_dict[model_name]["result"]
+
+            (x_axis_array, inner_loop_array) = (
+                (batch_sizes, sequence_lengths) if self.args.plot_along_batch else (sequence_lengths, batch_sizes)
+            )
+
+            plt.xlim(min(x_axis_array), max(x_axis_array))
+
+            for inner_loop_value in inner_loop_array:
+                if self.args.plot_along_batch:
+                    y_axis_array = np.asarray([results[(x, inner_loop_value)] for x in x_axis_array], dtype=np.int)
+                else:
+                    y_axis_array = np.asarray([results[(inner_loop_value, x)] for x in x_axis_array], dtype=np.float32)
+
+                ax.set_xscale("log", basex=2)
+                ax.set_yscale("log", basey=10)
+
+                (x_axis_label, inner_loop_label) = (
+                    ("batch_size", "sequence_length in #tokens")
+                    if self.args.plot_along_batch
+                    else ("sequence_length in #tokens", "batch_size")
+                )
+
+                x_axis_array = np.asarray(x_axis_array, np.int)
+                plt.scatter(x_axis_array, y_axis_array, label=f"{model_name} - {inner_loop_label}: {inner_loop_value}")
+                plt.plot(x_axis_array, y_axis_array, "--")
+
+            title_str += f" {model_name} vs."
+
+        title_str = title_str[:-4]
+        y_axis_label = "Time in s" if self.args.is_time else "Memory in MB"
+
+        # plot
+        plt.title(title_str)
+        plt.xlabel(x_axis_label)
+        plt.ylabel(y_axis_label)
+        plt.legend()
+
+        if self.args.figure_png_file is not None:
+            plt.savefig(self.args.figure_png_file)
+        else:
+            plt.show()
+
+
+def main():
+    parser = HfArgumentParser(PlotArguments)
+    plot_args = parser.parse_args_into_dataclasses()[0]
+    plot = Plot(args=plot_args)
+    plot.plot()
+
+
+if __name__ == "__main__":
+    main()
--- a/examples/benchmarking/run_benchmark.py
+++ b/examples/benchmarking/run_benchmark.py
@@ -0,0 +1,29 @@
+# coding=utf-8
+# Copyright 2018 The HuggingFace Inc. team.
+# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" Benchmarking the library on inference and training """
+
+from transformers import HfArgumentParser, PyTorchBenchmark, PyTorchBenchmarkArguments
+
+
+def main():
+    parser = HfArgumentParser(PyTorchBenchmarkArguments)
+    benchmark_args = parser.parse_args_into_dataclasses()[0]
+    benchmark = PyTorchBenchmark(args=benchmark_args)
+    benchmark.run()
+
+
+if __name__ == "__main__":
+    main()
--- a/examples/benchmarks.py
+++ b/examples/benchmarks.py
@@ -1,710 +0,0 @@
-# coding=utf-8
-# Copyright 2018 The HuggingFace Inc. team.
-# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-""" Benchmarking the library on inference and training """
-
-# If checking the tensors placement
-# tf.debugging.set_log_device_placement(True)
-
-import argparse
-import csv
-import logging
-import timeit
-from time import time
-from typing import Callable, List
-
-from transformers import (
-    AutoConfig,
-    AutoTokenizer,
-    MemorySummary,
-    is_tf_available,
-    is_torch_available,
-    start_memory_tracing,
-    stop_memory_tracing,
-)
-
-
-if is_tf_available():
-    import tensorflow as tf
-    from transformers import TFAutoModel
-
-if is_torch_available():
-    import torch
-    from transformers import AutoModel
-
-
-input_text = """Bent over their instruments, three hundred Fertilizers were plunged, as
-the Director of Hatcheries and Conditioning entered the room, in the
-scarcely breathing silence, the absent-minded, soliloquizing hum or
-
-whistle, of absorbed concentration. A troop of newly arrived students,
-very young, pink and callow, followed nervously, rather abjectly, at the
-Director's heels. Each of them carried a notebook, in which, whenever
-the great man spoke, he desperately scribbled. Straight from the
-horse's mouth. It was a rare privilege. The D. H. C. for Central London
-always made a point of personally conducting his new students round
-the various departments.
-
-"Just to give you a general idea," he would explain to them. For of
-course some sort of general idea they must have, if they were to do
-their work intelligently-though as little of one, if they were to be good
-and happy members of society, as possible. For particulars, as every
-one knows, make for virtue and happiness; generalities are intellectu-
-ally necessary evils. Not philosophers but fret-sawyers and stamp col-
-lectors compose the backbone of society.
-
-"To-morrow," he would add, smiling at them with a slightly menacing
-geniality, "you'll be settling down to serious work. You won't have time
-for generalities. Meanwhile ..."
-
-Meanwhile, it was a privilege. Straight from the horse's mouth into the
-notebook. The boys scribbled like mad.
-
-Tall and rather thin but upright, the Director advanced into the room.
-He had a long chin and big rather prominent teeth, just covered, when
-he was not talking, by his full, floridly curved lips. Old, young? Thirty?
-Fifty? Fifty-five? It was hard to say. And anyhow the question didn't
-arise; in this year of stability, A. F. 632, it didn't occur to you to ask it.
-
-"I shall begin at the beginning," said the D.H.C. and the more zealous
-students recorded his intention in their notebooks: Begin at the begin-
-ning. "These," he waved his hand, "are the incubators." And opening
-an insulated door he showed them racks upon racks of numbered test-
-tubes. "The week's supply of ova. Kept," he explained, "at blood heat;
-whereas the male gametes," and here he opened another door, "they
-have to be kept at thirty-five instead of thirty-seven. Full blood heat
-sterilizes." Rams wrapped in theremogene beget no lambs.
-
-Still leaning against the incubators he gave them, while the pencils
-scurried illegibly across the pages, a brief description of the modern
-
-
-
-fertilizing process; spoke first, of course, of its surgical introduc-
-tion-"the operation undergone voluntarily for the good of Society, not
-to mention the fact that it carries a bonus amounting to six months'
-salary"; continued with some account of the technique for preserving
-the excised ovary alive and actively developing; passed on to a consid-
-eration of optimum temperature, salinity, viscosity; referred to the liq-
-uor in which the detached and ripened eggs were kept; and, leading
-his charges to the work tables, actually showed them how this liquor
-was drawn off from the test-tubes; how it was let out drop by drop
-onto the specially warmed slides of the microscopes; how the eggs
-which it contained were inspected for abnormalities, counted and
-transferred to a porous receptacle; how (and he now took them to
-watch the operation) this receptacle was immersed in a warm bouillon
-containing free-swimming spermatozoa-at a minimum concentration
-of one hundred thousand per cubic centimetre, he insisted; and how,
-after ten minutes, the container was lifted out of the liquor and its
-contents re-examined; how, if any of the eggs remained unfertilized, it
-was again immersed, and, if necessary, yet again; how the fertilized
-ova went back to the incubators; where the Alphas and Betas re-
-mained until definitely bottled; while the Gammas, Deltas and Epsilons
-were brought out again, after only thirty-six hours, to undergo Bo-
-kanovsky's Process.
-
-"Bokanovsky's Process," repeated the Director, and the students un-
-derlined the words in their little notebooks.
-
-One egg, one embryo, one adult-normality. But a bokanovskified egg
-will bud, will proliferate, will divide. From eight to ninety-six buds, and
-every bud will grow into a perfectly formed embryo, and every embryo
-into a full-sized adult. Making ninety-six human beings grow where
-only one grew before. Progress.
-
-"Essentially," the D.H.C. concluded, "bokanovskification consists of a
-series of arrests of development. We check the normal growth and,
-paradoxically enough, the egg responds by budding."
-
-Responds by budding. The pencils were busy.
-
-He pointed. On a very slowly moving band a rack-full of test-tubes was
-entering a large metal box, another, rack-full was emerging. Machinery
-faintly purred. It took eight minutes for the tubes to go through, he
-
-
-
-told them. Eight minutes of hard X-rays being about as much as an
-egg can stand. A few died; of the rest, the least susceptible divided
-into two; most put out four buds; some eight; all were returned to the
-incubators, where the buds began to develop; then, after two days,
-were suddenly chilled, chilled and checked. Two, four, eight, the buds
-in their turn budded; and having budded were dosed almost to death
-with alcohol; consequently burgeoned again and having budded-bud
-out of bud out of bud-were thereafter-further arrest being generally
-fatal-left to develop in peace. By which time the original egg was in a
-fair way to becoming anything from eight to ninety-six embryos- a
-prodigious improvement, you will agree, on nature. Identical twins-but
-not in piddling twos and threes as in the old viviparous days, when an
-egg would sometimes accidentally divide; actually by dozens, by
-scores at a time.
-
-"Scores," the Director repeated and flung out his arms, as though he
-were distributing largesse. "Scores."
-
-But one of the students was fool enough to ask where the advantage
-lay.
-
-"My good boy!" The Director wheeled sharply round on him. "Can't you
-see? Can't you see?" He raised a hand; his expression was solemn.
-"Bokanovsky's Process is one of the major instruments of social stabil-
-ity!"
-
-Major instruments of social stability.
-
-Standard men and women; in uniform batches. The whole of a small
-factory staffed with the products of a single bokanovskified egg.
-
-"Ninety-six identical twins working ninety-six identical machines!" The
-voice was almost tremulous with enthusiasm. "You really know where
-you are. For the first time in history." He quoted the planetary motto.
-"Community, Identity, Stability." Grand words. "If we could bo-
-kanovskify indefinitely the whole problem would be solved."
-
-Solved by standard Gammas, unvarying Deltas, uniform Epsilons. Mil-
-lions of identical twins. The principle of mass production at last applied
-to biology.
-
-
-
-"But, alas," the Director shook his head, "we can't bokanovskify indefi-
-nitely."
-
-Ninety-six seemed to be the limit; seventy-two a good average. From
-the same ovary and with gametes of the same male to manufacture as
-many batches of identical twins as possible-that was the best (sadly a
-second best) that they could do. And even that was difficult.
-
-"For in nature it takes thirty years for two hundred eggs to reach ma-
-turity. But our business is to stabilize the population at this moment,
-here and now. Dribbling out twins over a quarter of a century-what
-would be the use of that?"
-
-Obviously, no use at all. But Podsnap's Technique had immensely ac-
-celerated the process of ripening. They could make sure of at least a
-hundred and fifty mature eggs within two years. Fertilize and bo-
-kanovskify-in other words, multiply by seventy-two-and you get an
-average of nearly eleven thousand brothers and sisters in a hundred
-and fifty batches of identical twins, all within two years of the same
-age.
-
-"And in exceptional cases we can make one ovary yield us over fifteen
-thousand adult individuals."
-
-Beckoning to a fair-haired, ruddy young man who happened to be
-passing at the moment. "Mr. Foster," he called. The ruddy young man
-approached. "Can you tell us the record for a single ovary, Mr. Foster?"
-
-"Sixteen thousand and twelve in this Centre," Mr. Foster replied with-
-out hesitation. He spoke very quickly, had a vivacious blue eye, and
-took an evident pleasure in quoting figures. "Sixteen thousand and
-twelve; in one hundred and eighty-nine batches of identicals. But of
-course they've done much better," he rattled on, "in some of the tropi-
-cal Centres. Singapore has often produced over sixteen thousand five
-hundred; and Mombasa has actually touched the seventeen thousand
-mark. But then they have unfair advantages. You should see the way a
-negro ovary responds to pituitary! It's quite astonishing, when you're
-used to working with European material. Still," he added, with a laugh
-(but the light of combat was in his eyes and the lift of his chin was
-challenging), "still, we mean to beat them if we can. I'm working on a
-wonderful Delta-Minus ovary at this moment. Only just eighteen
-
-
-
-months old. Over twelve thousand seven hundred children already, ei-
-ther decanted or in embryo. And still going strong. We'll beat them
-yet."
-
-"That's the spirit I like!" cried the Director, and clapped Mr. Foster on
-the shoulder. "Come along with us, and give these boys the benefit of
-your expert knowledge."
-
-Mr. Foster smiled modestly. "With pleasure." They went.
-In the Bottling Room all was harmonious bustle and ordered activity.
-Flaps of fresh sow's peritoneum ready cut to the proper size came
-shooting up in little lifts from the Organ Store in the sub-basement.
-Whizz and then, click! the lift-hatches hew open; the bottle-liner had
-only to reach out a hand, take the flap, insert, smooth-down, and be-
-fore the lined bottle had had time to travel out of reach along the end-
-less band, whizz, click! another flap of peritoneum had shot up from
-the depths, ready to be slipped into yet another bottle, the next of that
-slow interminable procession on the band.
-
-Next to the Liners stood the Matriculators. The procession advanced;
-one by one the eggs were transferred from their test-tubes to the
-larger containers; deftly the peritoneal lining was slit, the morula
-dropped into place, the saline solution poured in ... and already the
-bottle had passed, and it was the turn of the labellers. Heredity, date
-of fertilization, membership of Bokanovsky Group-details were trans-
-ferred from test-tube to bottle. No longer anonymous, but named,
-identified, the procession marched slowly on; on through an opening in
-the wall, slowly on into the Social Predestination Room.
-"Eighty-eight cubic metres of card-index," said Mr. Foster with relish,
-as they entered."""
-
-
-def create_setup_and_compute(
-    model_names: List[str],
-    batch_sizes: List[int],
-    slice_sizes: List[int],
-    gpu: bool = True,
-    tensorflow: bool = False,
-    average_over: int = 3,
-    no_speed: bool = False,
-    no_memory: bool = False,
-    verbose: bool = False,
-    torchscript: bool = False,
-    xla: bool = False,
-    amp: bool = False,
-    fp16: bool = False,
-    save_to_csv: bool = False,
-    csv_time_filename: str = f"time_{round(time())}.csv",
-    csv_memory_filename: str = f"memory_{round(time())}.csv",
-    print_fn: Callable[[str], None] = print,
-):
-    if xla:
-        tf.config.optimizer.set_jit(True)
-    if amp:
-        tf.config.optimizer.set_experimental_options({"auto_mixed_precision": True})
-
-    if tensorflow:
-        dictionary = {model_name: {} for model_name in model_names}
-        results = _compute_tensorflow(
-            model_names,
-            batch_sizes,
-            slice_sizes,
-            dictionary,
-            average_over,
-            amp,
-            no_speed,
-            no_memory,
-            verbose,
-            print_fn,
-        )
-    else:
-        device = "cuda" if (gpu and torch.cuda.is_available()) else "cpu"
-        dictionary = {model_name: {} for model_name in model_names}
-        results = _compute_pytorch(
-            model_names,
-            batch_sizes,
-            slice_sizes,
-            dictionary,
-            average_over,
-            device,
-            torchscript,
-            fp16,
-            no_speed,
-            no_memory,
-            verbose,
-            print_fn,
-        )
-
-    print_fn("=========== RESULTS ===========")
-    for model_name in model_names:
-        print_fn("\t" + f"======= MODEL CHECKPOINT: {model_name} =======")
-        for batch_size in results[model_name]["bs"]:
-            print_fn("\t\t" + f"===== BATCH SIZE: {batch_size} =====")
-            for slice_size in results[model_name]["ss"]:
-                time = results[model_name]["time"][batch_size][slice_size]
-                memory = results[model_name]["memory"][batch_size][slice_size]
-                if isinstance(time, str):
-                    print_fn(f"\t\t{model_name}/{batch_size}/{slice_size}: " f"{time} " f"{memory}")
-                else:
-                    print_fn(
-                        f"\t\t{model_name}/{batch_size}/{slice_size}: "
-                        f"{(round(1000 * time) / 1000)}"
-                        f"s "
-                        f"{memory}"
-                    )
-
-    if save_to_csv:
-        with open(csv_time_filename, mode="w") as csv_time_file, open(
-            csv_memory_filename, mode="w"
-        ) as csv_memory_file:
-
-            assert len(model_names) > 0, "At least 1 model should be defined, but got {}".format(model_names)
-
-            fieldnames = ["model", "batch_size", "sequence_length"]
-            time_writer = csv.DictWriter(csv_time_file, fieldnames=fieldnames + ["time_in_s"])
-            time_writer.writeheader()
-            memory_writer = csv.DictWriter(csv_memory_file, fieldnames=fieldnames + ["memory"])
-            memory_writer.writeheader()
-
-            for model_name in model_names:
-                time_dict = results[model_name]["time"]
-                memory_dict = results[model_name]["memory"]
-                for bs in time_dict:
-                    for ss in time_dict[bs]:
-                        time_writer.writerow(
-                            {
-                                "model": model_name,
-                                "batch_size": bs,
-                                "sequence_length": ss,
-                                "time_in_s": "{:.4f}".format(time_dict[bs][ss]),
-                            }
-                        )
-
-                for bs in memory_dict:
-                    for ss in time_dict[bs]:
-                        memory_writer.writerow(
-                            {
-                                "model": model_name,
-                                "batch_size": bs,
-                                "sequence_length": ss,
-                                "memory": memory_dict[bs][ss],
-                            }
-                        )
-
-
-def print_summary_statistics(summary: MemorySummary, print_fn: Callable[[str], None]):
-    print_fn(
-        "\nLines by line memory consumption:\n"
-        + "\n".join(
-            f"{state.frame.filename}:{state.frame.line_number}: mem {state.cpu_gpu}: {state.frame.line_text}"
-            for state in summary.sequential
-        )
-    )
-    print_fn(
-        "\nLines with top memory consumption:\n"
-        + "\n".join(
-            f"=> {state.frame.filename}:{state.frame.line_number}: mem {state.cpu_gpu}: {state.frame.line_text}"
-            for state in summary.cumulative[:6]
-        )
-    )
-    print_fn(
-        "\nLines with lowest memory consumption:\n"
-        + "\n".join(
-            f"=> {state.frame.filename}:{state.frame.line_number}: mem {state.cpu_gpu}: {state.frame.line_text}"
-            for state in summary.cumulative[-6:]
-        )
-    )
-    print_fn(f"\nTotal memory increase: {summary.total}")
-
-
-def get_print_function(save_print_log, log_filename):
-    if save_print_log:
-        logging.basicConfig(
-            level=logging.DEBUG,
-            filename=log_filename,
-            filemode="a+",
-            format="%(asctime)-15s %(levelname)-8s %(message)s",
-        )
-
-        def print_with_print_log(*args):
-            logging.info(*args)
-            print(*args)
-
-        return print_with_print_log
-    else:
-        return print
-
-
-def _compute_pytorch(
-    model_names,
-    batch_sizes,
-    slice_sizes,
-    dictionary,
-    average_over,
-    device,
-    torchscript,
-    fp16,
-    no_speed,
-    no_memory,
-    verbose,
-    print_fn,
-):
-    for c, model_name in enumerate(model_names):
-        print_fn(f"{c + 1} / {len(model_names)}")
-        config = AutoConfig.from_pretrained(model_name, torchscript=torchscript)
-        model = AutoModel.from_pretrained(model_name, config=config)
-        tokenizer = AutoTokenizer.from_pretrained(model_name)
-
-        tokenized_sequence = tokenizer.encode(input_text, add_special_tokens=False)
-
-        max_input_size = tokenizer.max_model_input_sizes[model_name]
-
-        dictionary[model_name] = {"bs": batch_sizes, "ss": slice_sizes, "time": {}, "memory": {}}
-        dictionary[model_name]["time"] = {i: {} for i in batch_sizes}
-        dictionary[model_name]["memory"] = {i: {} for i in batch_sizes}
-
-        print_fn("Using model {}".format(model))
-        print_fn("Number of all parameters {}".format(model.num_parameters()))
-
-        for batch_size in batch_sizes:
-            if fp16:
-                model.half()
-            model.to(device)
-            model.eval()
-
-            for slice_size in slice_sizes:
-                if max_input_size is not None and slice_size > max_input_size:
-                    dictionary[model_name]["time"][batch_size][slice_size] = "N/A"
-                else:
-                    sequence = torch.tensor(tokenized_sequence[:slice_size], device=device).repeat(batch_size, 1)
-                    try:
-                        if torchscript:
-                            print_fn("Tracing model with sequence size {}".format(sequence.shape))
-                            inference = torch.jit.trace(model, sequence)
-                            inference(sequence)
-                        else:
-                            inference = model
-                            inference(sequence)
-
-                        if not no_memory:
-                            # model.add_memory_hooks()  # Forward method tracing (only for PyTorch models)
-
-                            # Line by line memory tracing (all code in the module `transformers`) works for all models/arbitrary code
-                            trace = start_memory_tracing("transformers")
-                            inference(sequence)
-                            summary = stop_memory_tracing(trace)
-
-                            if verbose:
-                                print_summary_statistics(summary, print_fn)
-
-                            dictionary[model_name]["memory"][batch_size][slice_size] = str(summary.total)
-                        else:
-                            dictionary[model_name]["memory"][batch_size][slice_size] = "N/A"
-
-                        if not no_speed:
-                            print_fn("Going through model with sequence of shape".format(sequence.shape))
-                            runtimes = timeit.repeat(lambda: inference(sequence), repeat=average_over, number=3)
-                            average_time = sum(runtimes) / float(len(runtimes)) / 3.0
-                            dictionary[model_name]["time"][batch_size][slice_size] = average_time
-                        else:
-                            dictionary[model_name]["time"][batch_size][slice_size] = "N/A"
-
-                    except RuntimeError as e:
-                        print_fn("Doesn't fit on GPU. {}".format(e))
-                        torch.cuda.empty_cache()
-                        dictionary[model_name]["time"][batch_size][slice_size] = "N/A"
-                        dictionary[model_name]["memory"][batch_size][slice_size] = "N/A"
-    return dictionary
-
-
-def _compute_tensorflow(
-    model_names, batch_sizes, slice_sizes, dictionary, average_over, amp, no_speed, no_memory, verbose, print_fn
-):
-    for c, model_name in enumerate(model_names):
-        print_fn(f"{c + 1} / {len(model_names)}")
-        config = AutoConfig.from_pretrained(model_name)
-        model = TFAutoModel.from_pretrained(model_name, config=config)
-        tokenizer = AutoTokenizer.from_pretrained(model_name)
-
-        tokenized_sequence = tokenizer.encode(input_text, add_special_tokens=False)
-
-        max_input_size = tokenizer.max_model_input_sizes[model_name]
-
-        dictionary[model_name] = {"bs": batch_sizes, "ss": slice_sizes, "time": {}, "memory": {}}
-        dictionary[model_name]["time"] = {i: {} for i in batch_sizes}
-        dictionary[model_name]["memory"] = {i: {} for i in batch_sizes}
-
-        print_fn("Using model {}".format(model))
-        print_fn("Number of all parameters {}".format(model.num_parameters()))
-
-        @tf.function
-        def inference(inputs):
-            return model(inputs)
-
-        for batch_size in batch_sizes:
-            for slice_size in slice_sizes:
-                if max_input_size is not None and slice_size > max_input_size:
-                    dictionary[model_name]["time"][batch_size][slice_size] = "N/A"
-                else:
-                    sequence = tf.stack(
-                        [tf.squeeze(tf.constant(tokenized_sequence[:slice_size])[None, :])] * batch_size
-                    )
-
-                    try:
-                        print_fn("Going through model with sequence of shape {}".format(sequence.shape))
-                        # To make sure that the model is traced + that the tensors are on the appropriate device
-                        inference(sequence)
-
-                        if not no_memory:
-                            # Line by line memory tracing (all code in the module `transformers`) works for all models/arbitrary code
-                            trace = start_memory_tracing("transformers")
-                            inference(sequence)
-                            summary = stop_memory_tracing(trace)
-
-                            if verbose:
-                                print_summary_statistics(summary, print_fn)
-
-                            dictionary[model_name]["memory"][batch_size][slice_size] = str(summary.total)
-                        else:
-                            dictionary[model_name]["memory"][batch_size][slice_size] = "N/A"
-
-                        if not no_speed:
-                            runtimes = timeit.repeat(lambda: inference(sequence), repeat=average_over, number=3)
-                            average_time = sum(runtimes) / float(len(runtimes)) / 3.0
-                            dictionary[model_name]["time"][batch_size][slice_size] = average_time
-                        else:
-                            dictionary[model_name]["time"][batch_size][slice_size] = "N/A"
-
-                    except tf.errors.ResourceExhaustedError as e:
-                        print_fn("Doesn't fit on GPU. {}".format(e))
-                        dictionary[model_name]["time"][batch_size][slice_size] = "N/A"
-                        dictionary[model_name]["memory"][batch_size][slice_size] = "N/A"
-    return dictionary
-
-
-def main():
-    parser = argparse.ArgumentParser()
-
-    parser.add_argument(
-        "--models",
-        required=False,
-        type=str,
-        default="all",
-        help="Model checkpoints to be provided "
-        "to the AutoModel classes. Leave "
-        "blank to benchmark the base version "
-        "of all available model "
-        "architectures.",
-    )
-    parser.add_argument("--verbose", required=False, action="store_true", help="Verbose memory tracing")
-    parser.add_argument("--no_speed", required=False, action="store_true", help="Don't perform speed measurments")
-    parser.add_argument("--no_memory", required=False, action="store_true", help="Don't perform memory measurments")
-    parser.add_argument(
-        "--torch", required=False, action="store_true", help="Benchmark the Pytorch version of the " "models"
-    )
-    parser.add_argument(
-        "--torch_cuda", required=False, action="store_true", help="Pytorch only: run on available " "cuda devices"
-    )
-    parser.add_argument(
-        "--torchscript",
-        required=False,
-        action="store_true",
-        help="Pytorch only: trace the models " "using torchscript",
-    )
-    parser.add_argument(
-        "--tensorflow",
-        required=False,
-        action="store_true",
-        help="Benchmark the TensorFlow version "
-        "of the models. Will run on GPU if "
-        "the correct dependencies are "
-        "installed",
-    )
-    parser.add_argument("--xla", required=False, action="store_true", help="TensorFlow only: use XLA acceleration.")
-    parser.add_argument(
-        "--amp",
-        required=False,
-        action="store_true",
-        help="TensorFlow only: use automatic mixed precision acceleration.",
-    )
-    parser.add_argument(
-        "--fp16", required=False, action="store_true", help="PyTorch only: use FP16 to accelerate inference."
-    )
-    parser.add_argument(
-        "--keras_predict",
-        required=False,
-        action="store_true",
-        help="Whether to use model.predict " "instead of model() to do a " "forward pass.",
-    )
-    parser.add_argument("--save_to_csv", required=False, action="store_true", help="Save to a CSV file.")
-    parser.add_argument(
-        "--log_print", required=False, action="store_true", help="Save all print statements in log file."
-    )
-    parser.add_argument(
-        "--csv_time_filename",
-        required=False,
-        default=f"time_{round(time())}.csv",
-        help="CSV filename used if saving time results to csv.",
-    )
-    parser.add_argument(
-        "--csv_memory_filename",
-        required=False,
-        default=f"memory_{round(time())}.csv",
-        help="CSV filename used if saving memory results to csv.",
-    )
-    parser.add_argument(
-        "--log_filename",
-        required=False,
-        default=f"log_{round(time())}.txt",
-        help="Log filename used if print statements are saved in log.",
-    )
-    parser.add_argument(
-        "--average_over", required=False, default=30, type=int, help="Times an experiment will be run."
-    )
-    parser.add_argument("--batch_sizes", nargs="+", type=int, default=[1, 2, 4, 8])
-    parser.add_argument("--slice_sizes", nargs="+", type=int, default=[8, 64, 128, 256, 512, 1024])
-
-    args = parser.parse_args()
-    if args.models == "all":
-        args.models = [
-            "gpt2",
-            "bert-base-cased",
-            "xlnet-base-cased",
-            "xlm-mlm-en-2048",
-            "transfo-xl-wt103",
-            "openai-gpt",
-            "distilbert-base-uncased",
-            "distilgpt2",
-            "roberta-base",
-            "ctrl",
-            "t5-base",
-            "bart-large",
-        ]
-    else:
-        args.models = args.models.split()
-
-    print_fn = get_print_function(args.log_print, args.log_filename)
-    print_fn("Running with arguments: {}".format(args))
-
-    if args.torch:
-        if is_torch_available():
-            create_setup_and_compute(
-                model_names=args.models,
-                batch_sizes=args.batch_sizes,
-                slice_sizes=args.slice_sizes,
-                tensorflow=False,
-                gpu=args.torch_cuda,
-                torchscript=args.torchscript,
-                fp16=args.fp16,
-                save_to_csv=args.save_to_csv,
-                csv_time_filename=args.csv_time_filename,
-                csv_memory_filename=args.csv_memory_filename,
-                average_over=args.average_over,
-                no_speed=args.no_speed,
-                no_memory=args.no_memory,
-                verbose=args.verbose,
-                print_fn=print_fn,
-            )
-        else:
-            raise ImportError("Trying to run a PyTorch benchmark but PyTorch was not found in the environment.")
-
-    if args.tensorflow:
-        if is_tf_available():
-            create_setup_and_compute(
-                model_names=args.models,
-                batch_sizes=args.batch_sizes,
-                slice_sizes=args.slice_sizes,
-                tensorflow=True,
-                xla=args.xla,
-                amp=args.amp,
-                save_to_csv=args.save_to_csv,
-                csv_time_filename=args.csv_time_filename,
-                csv_memory_filename=args.csv_memory_filename,
-                average_over=args.average_over,
-                no_speed=args.no_speed,
-                no_memory=args.no_memory,
-                verbose=args.verbose,
-                print_fn=print_fn,
-            )
-        else:
-            raise ImportError("Trying to run a TensorFlow benchmark but TensorFlow was not found in the environment.")
-
-
-if __name__ == "__main__":
-    main()
--- a/examples/bertology/run_bertology.py
+++ b/examples/bertology/run_bertology.py
@@ -64,7 +64,7 @@ def print_2d_tensor(tensor):


 def compute_heads_importance(
-    args, model, eval_dataloader, compute_entropy=True, compute_importance=True, head_mask=None
+    args, model, eval_dataloader, compute_entropy=True, compute_importance=True, head_mask=None, actually_pruned=False
 ):
    """ This method shows how to compute:
        - head attention entropy
@@ -77,7 +77,12 @@ def compute_heads_importance(

    if head_mask is None:
        head_mask = torch.ones(n_layers, n_heads).to(args.device)
+
    head_mask.requires_grad_(requires_grad=True)
+    # If actually pruned attention multi-head, set head mask to None to avoid shape mismatch
+    if actually_pruned:
+        head_mask = None
+
    preds = None
    labels = None
    tot_tokens = 0.0
@@ -172,6 +177,7 @@ def mask_heads(args, model, eval_dataloader):
        new_head_mask = new_head_mask.view(-1)
        new_head_mask[current_heads_to_mask] = 0.0
        new_head_mask = new_head_mask.view_as(head_mask)
+        new_head_mask = new_head_mask.clone().detach()
        print_2d_tensor(new_head_mask)

        # Compute metric and head importance again
@@ -181,7 +187,7 @@ def mask_heads(args, model, eval_dataloader):
        preds = np.argmax(preds, axis=1) if args.output_mode == "classification" else np.squeeze(preds)
        current_score = glue_compute_metrics(args.task_name, preds, labels)[args.metric_name]
        logger.info(
-            "Masking: current score: %f, remaning heads %d (%.1f percents)",
+            "Masking: current score: %f, remaining heads %d (%.1f percents)",
            current_score,
            new_head_mask.sum(),
            new_head_mask.sum() / new_head_mask.numel() * 100,
@@ -209,14 +215,23 @@ def prune_heads(args, model, eval_dataloader, head_mask):
    original_time = datetime.now() - before_time

    original_num_params = sum(p.numel() for p in model.parameters())
-    heads_to_prune = dict((layer, (1 - head_mask[layer].long()).nonzero().tolist()) for layer in range(len(head_mask)))
+    heads_to_prune = dict(
+        (layer, (1 - head_mask[layer].long()).nonzero().squeeze().tolist()) for layer in range(len(head_mask))
+    )
+
    assert sum(len(h) for h in heads_to_prune.values()) == (1 - head_mask.long()).sum().item()
    model.prune_heads(heads_to_prune)
    pruned_num_params = sum(p.numel() for p in model.parameters())

    before_time = datetime.now()
    _, _, preds, labels = compute_heads_importance(
-        args, model, eval_dataloader, compute_entropy=False, compute_importance=False, head_mask=None
+        args,
+        model,
+        eval_dataloader,
+        compute_entropy=False,
+        compute_importance=False,
+        head_mask=None,
+        actually_pruned=True,
    )
    preds = np.argmax(preds, axis=1) if args.output_mode == "classification" else np.squeeze(preds)
    score_pruning = glue_compute_metrics(args.task_name, preds, labels)[args.metric_name]
@@ -404,7 +419,7 @@ def main():
    logger.info("Training/evaluation parameters %s", args)

    # Prepare dataset for the GLUE task
-    eval_dataset = GlueDataset(args, tokenizer=tokenizer, evaluate=True)
+    eval_dataset = GlueDataset(args, tokenizer=tokenizer, mode="dev")
    if args.data_subset > 0:
        eval_dataset = Subset(eval_dataset, list(range(min(args.data_subset, len(eval_dataset)))))
    eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset)
--- a/examples/contrib/mm-imdb/run_mmimdb.py
+++ b/examples/contrib/mm-imdb/run_mmimdb.py
@@ -34,26 +34,11 @@ from tqdm import tqdm, trange
 from transformers import (
    WEIGHTS_NAME,
    AdamW,
-    AlbertConfig,
-    AlbertModel,
-    AlbertTokenizer,
-    BertConfig,
-    BertModel,
-    BertTokenizer,
-    DistilBertConfig,
-    DistilBertModel,
-    DistilBertTokenizer,
+    AutoConfig,
+    AutoModel,
+    AutoTokenizer,
    MMBTConfig,
    MMBTForClassification,
-    RobertaConfig,
-    RobertaModel,
-    RobertaTokenizer,
-    XLMConfig,
-    XLMModel,
-    XLMTokenizer,
-    XLNetConfig,
-    XLNetModel,
-    XLNetTokenizer,
    get_linear_schedule_with_warmup,
 )
 from utils_mmimdb import ImageEncoder, JsonlDataset, collate_fn, get_image_transforms, get_mmimdb_labels
@@ -67,23 +52,6 @@ except ImportError:

 logger = logging.getLogger(__name__)

-ALL_MODELS = sum(
-    (
-        tuple(conf.pretrained_config_archive_map.keys())
-        for conf in (BertConfig, XLNetConfig, XLMConfig, RobertaConfig, DistilBertConfig)
-    ),
-    (),
-)
-
-MODEL_CLASSES = {
-    "bert": (BertConfig, BertModel, BertTokenizer),
-    "xlnet": (XLNetConfig, XLNetModel, XLNetTokenizer),
-    "xlm": (XLMConfig, XLMModel, XLMTokenizer),
-    "roberta": (RobertaConfig, RobertaModel, RobertaTokenizer),
-    "distilbert": (DistilBertConfig, DistilBertModel, DistilBertTokenizer),
-    "albert": (AlbertConfig, AlbertModel, AlbertTokenizer),
-}
-

 def set_seed(args):
    random.seed(args.seed)
@@ -351,19 +319,12 @@ def main():
        required=True,
        help="The input data dir. Should contain the .jsonl files for MMIMDB.",
    )
-    parser.add_argument(
-        "--model_type",
-        default=None,
-        type=str,
-        required=True,
-        help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()),
-    )
    parser.add_argument(
        "--model_name_or_path",
        default=None,
        type=str,
        required=True,
-        help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS),
+        help="Path to pretrained model or model identifier from huggingface.co/models",
    )
    parser.add_argument(
        "--output_dir",
@@ -385,7 +346,7 @@ def main():
    )
    parser.add_argument(
        "--cache_dir",
-        default="",
+        default=None,
        type=str,
        help="Where do you want to store the pre-trained models downloaded from s3",
    )
@@ -526,18 +487,14 @@ def main():
    # Setup model
    labels = get_mmimdb_labels()
    num_labels = len(labels)
-    args.model_type = args.model_type.lower()
-    config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
-    transformer_config = config_class.from_pretrained(
-        args.config_name if args.config_name else args.model_name_or_path
-    )
-    tokenizer = tokenizer_class.from_pretrained(
+    transformer_config = AutoConfig.from_pretrained(args.config_name if args.config_name else args.model_name_or_path)
+    tokenizer = AutoTokenizer.from_pretrained(
        args.tokenizer_name if args.tokenizer_name else args.model_name_or_path,
        do_lower_case=args.do_lower_case,
-        cache_dir=args.cache_dir if args.cache_dir else None,
+        cache_dir=args.cache_dir,
    )
-    transformer = model_class.from_pretrained(
-        args.model_name_or_path, config=transformer_config, cache_dir=args.cache_dir if args.cache_dir else None
+    transformer = AutoModel.from_pretrained(
+        args.model_name_or_path, config=transformer_config, cache_dir=args.cache_dir
    )
    img_encoder = ImageEncoder(args)
    config = MMBTConfig(transformer_config, num_labels=num_labels)
@@ -583,13 +540,12 @@ def main():
        # Load a trained model and vocabulary that you have fine-tuned
        model = MMBTForClassification(config, transformer, img_encoder)
        model.load_state_dict(torch.load(os.path.join(args.output_dir, WEIGHTS_NAME)))
-        tokenizer = tokenizer_class.from_pretrained(args.output_dir)
+        tokenizer = AutoTokenizer.from_pretrained(args.output_dir)
        model.to(args.device)

    # Evaluation
    results = {}
    if args.do_eval and args.local_rank in [-1, 0]:
-        tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
        checkpoints = [args.output_dir]
        if args.eval_all_checkpoints:
            checkpoints = list(
--- a/examples/contrib/run_swag.py
+++ b/examples/contrib/run_swag.py
@@ -31,14 +31,8 @@ from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, Tenso
 from torch.utils.data.distributed import DistributedSampler
 from tqdm import tqdm, trange

-from transformers import (
-    WEIGHTS_NAME,
-    AdamW,
-    BertConfig,
-    BertForMultipleChoice,
-    BertTokenizer,
-    get_linear_schedule_with_warmup,
-)
+from transformers import WEIGHTS_NAME, AdamW, AutoConfig, AutoTokenizer, get_linear_schedule_with_warmup
+from transformers.modeling_auto import AutoModelForMultipleChoice


 try:
@@ -49,12 +43,6 @@ except ImportError:

 logger = logging.getLogger(__name__)

-ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in [BertConfig]), ())
-
-MODEL_CLASSES = {
-    "bert": (BertConfig, BertForMultipleChoice, BertTokenizer),
-}
-

 class SwagExample(object):
    """A single training/test example for the SWAG dataset."""
@@ -492,19 +480,12 @@ def main():
        required=True,
        help="SWAG csv for predictions. E.g., val.csv or test.csv",
    )
-    parser.add_argument(
-        "--model_type",
-        default=None,
-        type=str,
-        required=True,
-        help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()),
-    )
    parser.add_argument(
        "--model_name_or_path",
        default=None,
        type=str,
        required=True,
-        help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS),
+        help="Path to pretrained model or model identifier from huggingface.co/models",
    )
    parser.add_argument(
        "--output_dir",
@@ -536,9 +517,6 @@ def main():
    parser.add_argument(
        "--evaluate_during_training", action="store_true", help="Rul evaluation during training at each logging step."
    )
-    parser.add_argument(
-        "--do_lower_case", action="store_true", help="Set this flag if you are using an uncased model."
-    )

    parser.add_argument("--per_gpu_train_batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.")
    parser.add_argument(
@@ -652,13 +630,9 @@ def main():
    if args.local_rank not in [-1, 0]:
        torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab

-    args.model_type = args.model_type.lower()
-    config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
-    config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path)
-    tokenizer = tokenizer_class.from_pretrained(
-        args.tokenizer_name if args.tokenizer_name else args.model_name_or_path, do_lower_case=args.do_lower_case
-    )
-    model = model_class.from_pretrained(
+    config = AutoConfig.from_pretrained(args.config_name if args.config_name else args.model_name_or_path)
+    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path,)
+    model = AutoModelForMultipleChoice.from_pretrained(
        args.model_name_or_path, from_tf=bool(".ckpt" in args.model_name_or_path), config=config
    )

@@ -694,8 +668,8 @@ def main():
        torch.save(args, os.path.join(args.output_dir, "training_args.bin"))

        # Load a trained model and vocabulary that you have fine-tuned
-        model = model_class.from_pretrained(args.output_dir)
-        tokenizer = tokenizer_class.from_pretrained(args.output_dir)
+        model = AutoModelForMultipleChoice.from_pretrained(args.output_dir)
+        tokenizer = AutoTokenizer.from_pretrained(args.output_dir)
        model.to(args.device)

    # Evaluation - we can ask to evaluate all the checkpoints (sub-directories) in a directory
@@ -718,8 +692,8 @@ def main():
        for checkpoint in checkpoints:
            # Reload the model
            global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
-            model = model_class.from_pretrained(checkpoint)
-            tokenizer = tokenizer_class.from_pretrained(checkpoint)
+            model = AutoModelForMultipleChoice.from_pretrained(checkpoint)
+            tokenizer = AutoTokenizer.from_pretrained(checkpoint)
            model.to(args.device)

            # Evaluate
--- a/examples/contrib/run_transfo_xl.py
+++ b/examples/contrib/run_transfo_xl.py
@@ -80,7 +80,7 @@ def main():

    # Load a pre-trained model
    model = TransfoXLLMHeadModel.from_pretrained(args.model_name)
-    model = model.to(device)
+    model.to(device)

    logger.info(
        "Evaluating with bsz {} tgt_len {} ext_len {} mem_len {} clamp_len {}".format(
--- a/examples/distillation/distiller.py
+++ b/examples/distillation/distiller.py
@@ -80,7 +80,7 @@ class Distiller:

        self.mlm = params.mlm
        if self.mlm:
-            logger.info(f"Using MLM loss for LM step.")
+            logger.info("Using MLM loss for LM step.")
            self.mlm_mask_prop = params.mlm_mask_prop
            assert 0.0 <= self.mlm_mask_prop <= 1.0
            assert params.word_mask + params.word_keep + params.word_rand == 1.0
@@ -91,7 +91,7 @@ class Distiller:
                self.pred_probs = self.pred_probs.half()
                self.token_probs = self.token_probs.half()
        else:
-            logger.info(f"Using CLM loss for LM step.")
+            logger.info("Using CLM loss for LM step.")

        self.epoch = 0
        self.n_iter = 0
@@ -365,8 +365,8 @@ class Distiller:
            self.end_epoch()

        if self.is_master:
-            logger.info(f"Save very last checkpoint as `pytorch_model.bin`.")
-            self.save_checkpoint(checkpoint_name=f"pytorch_model.bin")
+            logger.info("Save very last checkpoint as `pytorch_model.bin`.")
+            self.save_checkpoint(checkpoint_name="pytorch_model.bin")
            logger.info("Training is finished")

    def step(self, input_ids: torch.tensor, attention_mask: torch.tensor, lm_labels: torch.tensor):
--- a/examples/distillation/run_squad_w_distillation.py
+++ b/examples/distillation/run_squad_w_distillation.py
@@ -67,9 +67,6 @@ except ImportError:

 logger = logging.getLogger(__name__)

-ALL_MODELS = sum(
-    (tuple(conf.pretrained_config_archive_map.keys()) for conf in (BertConfig, XLNetConfig, XLMConfig)), ()
-)

 MODEL_CLASSES = {
    "bert": (BertConfig, BertForQuestionAnswering, BertTokenizer),
@@ -505,7 +502,7 @@ def main():
        default=None,
        type=str,
        required=True,
-        help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS),
+        help="Path to pretrained model or model identifier from huggingface.co/models",
    )
    parser.add_argument(
        "--output_dir",
--- a/examples/distillation/scripts/binarized_data.py
+++ b/examples/distillation/scripts/binarized_data.py
@@ -60,7 +60,7 @@ def main():
    with open(args.file_path, "r", encoding="utf8") as fp:
        data = fp.readlines()

-    logger.info(f"Start encoding")
+    logger.info("Start encoding")
    logger.info(f"{len(data)} examples to process.")

    rslt = []
--- a/examples/distillation/scripts/extract.py
+++ b/examples/distillation/scripts/extract.py
@@ -93,7 +93,7 @@ if __name__ == "__main__":
    elif args.model_type == "gpt2":
        for w in ["weight", "bias"]:
            compressed_sd[f"{prefix}.ln_f.{w}"] = state_dict[f"{prefix}.ln_f.{w}"]
-        compressed_sd[f"lm_head.weight"] = state_dict[f"lm_head.weight"]
+        compressed_sd["lm_head.weight"] = state_dict["lm_head.weight"]

    print(f"N layers selected for distillation: {std_idx}")
    print(f"Number of params transfered for distillation: {len(compressed_sd.keys())}")
--- a/examples/distillation/scripts/extract_distilbert.py
+++ b/examples/distillation/scripts/extract_distilbert.py
@@ -37,7 +37,7 @@ if __name__ == "__main__":
        model = BertForMaskedLM.from_pretrained(args.model_name)
        prefix = "bert"
    else:
-        raise ValueError(f'args.model_type should be "bert".')
+        raise ValueError('args.model_type should be "bert".')

    state_dict = model.state_dict()
    compressed_sd = {}
@@ -78,8 +78,8 @@ if __name__ == "__main__":
            ]
        std_idx += 1

-    compressed_sd[f"vocab_projector.weight"] = state_dict[f"cls.predictions.decoder.weight"]
-    compressed_sd[f"vocab_projector.bias"] = state_dict[f"cls.predictions.bias"]
+    compressed_sd["vocab_projector.weight"] = state_dict["cls.predictions.decoder.weight"]
+    compressed_sd["vocab_projector.bias"] = state_dict["cls.predictions.bias"]
    if args.vocab_transform:
        for w in ["weight", "bias"]:
            compressed_sd[f"vocab_transform.{w}"] = state_dict[f"cls.predictions.transform.dense.{w}"]
--- a/examples/distillation/train.py
+++ b/examples/distillation/train.py
@@ -273,7 +273,7 @@ def main():
        token_probs = None

    train_lm_seq_dataset = LmSeqsDataset(params=args, data=data)
-    logger.info(f"Data loader created.")
+    logger.info("Data loader created.")

    # STUDENT #
    logger.info(f"Loading student config from {args.student_config}")
@@ -288,7 +288,7 @@ def main():

    if args.n_gpu > 0:
        student.to(f"cuda:{args.local_rank}")
-    logger.info(f"Student loaded.")
+    logger.info("Student loaded.")

    # TEACHER #
    teacher = teacher_model_class.from_pretrained(args.teacher_name, output_hidden_states=True)
--- a/examples/language-modeling/README.md
+++ b/examples/language-modeling/README.md
@@ -3,8 +3,7 @@

 Based on the script [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py).

-Fine-tuning (or training from scratch) the library models for language modeling on a text dataset for GPT, GPT-2, BERT and RoBERTa (DistilBERT
-to be added soon). GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa
+Fine-tuning (or training from scratch) the library models for language modeling on a text dataset for GPT, GPT-2, BERT, DistilBERT and RoBERTa. GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT, DistilBERT and RoBERTa
 are fine-tuned using a masked language modeling (MLM) loss.

 Before running the following example, you should get a file that contains text on which the language model will be
@@ -35,7 +34,7 @@ python run_language_modeling.py \
 This takes about half an hour to train on a single K80 GPU and about one minute for the evaluation to run. It reaches
 a score of ~20 perplexity once fine-tuned on the dataset.

-### RoBERTa/BERT and masked language modeling
+### RoBERTa/BERT/DistilBERT and masked language modeling

 The following example fine-tunes RoBERTa on WikiText-2. Here too, we're using the raw WikiText-2. The loss is different
 as BERT/RoBERTa have a bidirectional mechanism; we're therefore using the same loss that was used during their
--- a/examples/language-modeling/run_language_modeling.py
+++ b/examples/language-modeling/run_language_modeling.py
@@ -115,15 +115,13 @@ class DataTrainingArguments:
    )


-def get_dataset(args: DataTrainingArguments, tokenizer: PreTrainedTokenizer, evaluate=False, local_rank=-1):
+def get_dataset(args: DataTrainingArguments, tokenizer: PreTrainedTokenizer, evaluate=False):
    file_path = args.eval_data_file if evaluate else args.train_data_file
    if args.line_by_line:
-        return LineByLineTextDataset(
-            tokenizer=tokenizer, file_path=file_path, block_size=args.block_size, local_rank=local_rank
-        )
+        return LineByLineTextDataset(tokenizer=tokenizer, file_path=file_path, block_size=args.block_size)
    else:
        return TextDataset(
-            tokenizer=tokenizer, file_path=file_path, block_size=args.block_size, local_rank=local_rank,
+            tokenizer=tokenizer, file_path=file_path, block_size=args.block_size, overwrite_cache=args.overwrite_cache
        )


@@ -220,16 +218,9 @@ def main():
        data_args.block_size = min(data_args.block_size, tokenizer.max_len)

    # Get datasets
-    train_dataset = (
-        get_dataset(data_args, tokenizer=tokenizer, local_rank=training_args.local_rank)
-        if training_args.do_train
-        else None
-    )
-    eval_dataset = (
-        get_dataset(data_args, tokenizer=tokenizer, local_rank=training_args.local_rank, evaluate=True)
-        if training_args.do_eval
-        else None
-    )
+
+    train_dataset = get_dataset(data_args, tokenizer=tokenizer) if training_args.do_train else None
+    eval_dataset = get_dataset(data_args, tokenizer=tokenizer, evaluate=True) if training_args.do_eval else None
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=data_args.mlm, mlm_probability=data_args.mlm_probability
    )
@@ -260,7 +251,7 @@ def main():

    # Evaluation
    results = {}
-    if training_args.do_eval and training_args.local_rank in [-1, 0]:
+    if training_args.do_eval:
        logger.info("*** Evaluate ***")

        eval_output = trainer.evaluate()
@@ -269,11 +260,12 @@ def main():
        result = {"perplexity": perplexity}

        output_eval_file = os.path.join(training_args.output_dir, "eval_results_lm.txt")
-        with open(output_eval_file, "w") as writer:
-            logger.info("***** Eval results *****")
-            for key in sorted(result.keys()):
-                logger.info("  %s = %s", key, str(result[key]))
-                writer.write("%s = %s\n" % (key, str(result[key])))
+        if trainer.is_world_master():
+            with open(output_eval_file, "w") as writer:
+                logger.info("***** Eval results *****")
+                for key in sorted(result.keys()):
+                    logger.info("  %s = %s", key, str(result[key]))
+                    writer.write("%s = %s\n" % (key, str(result[key])))

        results.update(result)

--- a/examples/movement-pruning/README.md
+++ b/examples/movement-pruning/README.md
@@ -0,0 +1,183 @@
+# Movement Pruning: Adaptive Sparsity by Fine-Tuning
+
+*Magnitude pruning is a widely used strategy for reducing model size in pure supervised learning; however, it is less effective in the transfer learning regime that has become standard for state-of-the-art natural language processing applications. We propose the use of *movement pruning*, a simple, deterministic first-order weight pruning method that is more adaptive to pretrained model fine-tuning. Experiments show that when pruning large pretrained language models, movement pruning shows significant improvements in high-sparsity regimes. When combined with distillation, the approach achieves minimal accuracy loss with down to only 3% of the model parameters:*
+
+| Fine-pruning+Distillation<br>(Teacher=BERT-base fine-tuned) | BERT base<br>fine-tuned | Remaining<br>Weights (%) | Magnitude Pruning      | L0 Regularization      | Movement Pruning       | Soft Movement Pruning          |
+| :---:                                                       | :---:                   | :---:                    | :---:                  | :---:                  | :---:                  | :---:                          |
+| SQuAD - Dev<br>EM/F1                                        | 80.4/88.1               | 10%<br>3%                | 70.2/80.1<br>45.5/59.6 | 72.4/81.9<br>64.3/75.8 | 75.6/84.3<br>67.5/78.0 | **76.6/84.9**<br>**72.7/82.3** |
+| MNLI - Dev<br>acc/MM acc                                    | 84.5/84.9               | 10%<br>3%                | 78.3/79.3<br>69.4/70.6 | 78.7/79.7<br>76.0/76.2 | 80.1/80.4<br>76.5/77.4 | **81.2/81.8**<br>**79.5/80.1** |
+| QQP - Dev<br>acc/F1                                         | 91.4/88.4               | 10%<br>3%                | 79.8/65.0<br>72.4/57.8 | 88.1/82.8<br>87.0/81.9 | 89.7/86.2<br>86.1/81.5 | **90.2/86.8**<br>**89.1/85.5** |
+
+This page contains information on how to fine-prune pre-trained models such as `BERT` to obtain extremely sparse models with movement pruning. In contrast to magnitude pruning which selects weights that are far from 0, movement pruning retains weights that are moving away from 0.
+
+For more information, we invite you to check out [our paper](https://arxiv.org/abs/2005.07683).
+You can also have a look at this fun *Explain Like I'm Five* introductory [slide deck](https://www.slideshare.net/VictorSanh/movement-pruning-explain-like-im-five-234205241).
+
+<div align="center">
+<img src="https://www.seekpng.com/png/detail/166-1669328_how-to-make-emmental-cheese-at-home-icooker.png" width="400">
+</div>
+
+## Extreme sparsity and efficient storage
+
+One promise of extreme pruning is to obtain extremely small models that can be easily sent (and stored) on edge devices. By setting weights to 0., we reduce the amount of information we need to store, and thus decreasing the memory size. We are able to obtain extremely sparse fine-pruned models with movement pruning: ~95% of the dense performance with ~5% of total remaining weights in the BERT encoder.
+
+In [this notebook](https://github.com/huggingface/transformers/blob/master/examples/movement-pruning/Saving_PruneBERT.ipynb), we showcase how we can leverage standard tools that exist out-of-the-box to efficiently store an extremely sparse question answering model (only 6% of total remaining weights in the encoder). We are able to reduce the memory size of the encoder **from the 340MB (the orignal dense BERT) to 11MB**, without any additional training of the model (every operation is performed *post fine-pruning*). It is sufficiently small to store it on a [91' floppy disk](https://en.wikipedia.org/wiki/Floptical) 📎!
+
+While movement pruning does not directly optimize for memory footprint (but rather the number of non-null weights), we hypothetize that further memory compression ratios can be achieved with specific quantization aware trainings (see for instance [Q8BERT](https://arxiv.org/abs/1910.06188), [And the Bit Goes Down](https://arxiv.org/abs/1907.05686) or [Quant-Noise](https://arxiv.org/abs/2004.07320)).
+
+## Fine-pruned models
+
+As examples, we release two English PruneBERT checkpoints (models fine-pruned from a pre-trained `BERT` checkpoint), one on SQuAD and the other on MNLI.
+
+- **`prunebert-base-uncased-6-finepruned-w-distil-squad`**<br/>
+Pre-trained `BERT-base-uncased` fine-pruned with soft movement pruning on SQuAD v1.1. We use an additional distillation signal from `BERT-base-uncased` finetuned on SQuAD. The encoder counts 6% of total non-null weights and reaches 83.8 F1 score. The model can be accessed with: `pruned_bert = BertForQuestionAnswering.from_pretrained("huggingface/prunebert-base-uncased-6-finepruned-w-distil-squad")`
+- **`prunebert-base-uncased-6-finepruned-w-distil-mnli`**<br/>
+Pre-trained `BERT-base-uncased` fine-pruned with soft movement pruning on MNLI. We use an additional distillation signal from `BERT-base-uncased` finetuned on MNLI. The encoder counts 6% of total non-null weights and reaches 80.7 (matched) accuracy. The model can be accessed with: `pruned_bert = BertForSequenceClassification.from_pretrained("huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli")`
+
+## How to fine-prune?
+
+### Setup
+
+The code relies on the 🤗 Transformers library. In addition to the dependencies listed in the [`examples`](https://github.com/huggingface/transformers/tree/master/examples) folder, you should install a few additional dependencies listed in the `requirements.txt` file: `pip install -r requirements.txt`.
+
+Note that we built our experiments on top of a stabilized version of the library (commit https://github.com/huggingface/transformers/commit/352d5472b0c1dec0f420d606d16747d851b4bda8): we do not guarantee that everything is still compatible with the latest version of the master branch.
+
+### Fine-pruning with movement pruning
+
+Below, we detail how to reproduce the results reported in the paper. We use SQuAD as a running example. Commands (and scripts) can be easily adapted for other tasks.
+
+The following command fine-prunes a pre-trained `BERT-base` on SQuAD using movement pruning towards 15% of remaining weights (85% sparsity). Note that we freeze all the embeddings modules (from their pre-trained value) and only prune the Fully Connected layers in the encoder (12 layers of Transformer Block).
+
+```bash
+SERIALIZATION_DIR=<OUTPUT_DIR>
+SQUAD_DATA=<SQUAD_DATA>
+
+python examples/movement-pruning/masked_run_squad.py \
+    --output_dir $SERIALIZATION_DIR \
+    --data_dir $SQUAD_DATA \
+    --train_file train-v1.1.json \
+    --predict_file dev-v1.1.json \
+    --do_train --do_eval --do_lower_case \
+    --model_type masked_bert \
+    --model_name_or_path bert-base-uncased \
+    --per_gpu_train_batch_size 16 \
+    --warmup_steps 5400 \
+    --num_train_epochs 10 \
+    --learning_rate 3e-5 --mask_scores_learning_rate 1e-2 \
+    --initial_threshold 1 --final_threshold 0.15 \
+    --initial_warmup 1 --final_warmup 2 \
+    --pruning_method topK --mask_init constant --mask_scale 0.
+```
+
+### Fine-pruning with other methods
+
+We can also explore other fine-pruning methods by changing the `pruning_method` parameter:
+
+Soft movement pruning
+```bash
+python examples/movement-pruning/masked_run_squad.py \
+    --output_dir $SERIALIZATION_DIR \
+    --data_dir $SQUAD_DATA \
+    --train_file train-v1.1.json \
+    --predict_file dev-v1.1.json \
+    --do_train --do_eval --do_lower_case \
+    --model_type masked_bert \
+    --model_name_or_path bert-base-uncased \
+    --per_gpu_train_batch_size 16 \
+    --warmup_steps 5400 \
+    --num_train_epochs 10 \
+    --learning_rate 3e-5 --mask_scores_learning_rate 1e-2 \
+    --initial_threshold 0 --final_threshold 0.1 \
+    --initial_warmup 1 --final_warmup 2 \
+    --pruning_method sigmoied_threshold --mask_init constant --mask_scale 0. \
+    --regularization l1 --final_lambda 400.
+```
+
+L0 regularization
+```bash
+python examples/movement-pruning/masked_run_squad.py \
+    --output_dir $SERIALIZATION_DIR \
+    --data_dir $SQUAD_DATA \
+    --train_file train-v1.1.json \
+    --predict_file dev-v1.1.json \
+    --do_train --do_eval --do_lower_case \
+    --model_type masked_bert \
+    --model_name_or_path bert-base-uncased \
+    --per_gpu_train_batch_size 16 \
+    --warmup_steps 5400 \
+    --num_train_epochs 10 \
+    --learning_rate 3e-5 --mask_scores_learning_rate 1e-1 \
+    --initial_threshold 1. --final_threshold 1. \
+    --initial_warmup 1 --final_warmup 1 \
+    --pruning_method l0 --mask_init constant --mask_scale 2.197 \
+    --regularization l0 --final_lambda 125.
+```
+
+Iterative Magnitude Pruning
+```bash
+python examples/movement-pruning/masked_run_squad.py \
+    --output_dir ./dbg \
+    --data_dir examples/distillation/data/squad_data \
+    --train_file train-v1.1.json \
+    --predict_file dev-v1.1.json \
+    --do_train --do_eval --do_lower_case \
+    --model_type masked_bert \
+    --model_name_or_path bert-base-uncased \
+    --per_gpu_train_batch_size 16 \
+    --warmup_steps 5400 \
+    --num_train_epochs 10 \
+    --learning_rate 3e-5 \
+    --initial_threshold 1 --final_threshold 0.15 \
+    --initial_warmup 1 --final_warmup 2 \
+    --pruning_method magnitude
+```
+
+### After fine-pruning
+
+**Counting parameters**
+
+Regularization based pruning methods (soft movement pruning and L0 regularization) rely on the penalty to induce sparsity. The multiplicative coefficient controls the sparsity level.
+To obtain the effective sparsity level in the encoder, we simply count the number of activated (non-null) weights:
+
+```bash
+python examples/movement-pruning/count_parameters.py \
+    --pruning_method sigmoied_threshold \
+    --threshold 0.1 \
+    --serialization_dir $SERIALIZATION_DIR
+```
+
+**Pruning once for all**
+
+Once the model has been fine-pruned, the pruned weights can be set to 0. once for all (reducing the amount of information to store). In our running experiments, we can convert a `MaskedBertForQuestionAnswering` (a BERT model augmented to enable on-the-fly pruning capabilities) to a standard `BertForQuestionAnswering`:
+
+```bash
+python examples/movement-pruning/bertarize.py \
+    --pruning_method sigmoied_threshold \
+    --threshold 0.1 \
+    --model_name_or_path $SERIALIZATION_DIR
+```
+
+## Hyper-parameters
+
+For reproducibility purposes, we share the detailed results presented in the paper. These [tables](https://docs.google.com/spreadsheets/d/17JgRq_OFFTniUrz6BZWW_87DjFkKXpI1kYDSsseT_7g/edit?usp=sharing) exhaustively describe the individual hyper-parameters used for each data point.
+
+## Inference speed
+
+Early experiments show that even though models fine-pruned with (soft) movement pruning are extremely sparse, they do not benefit from significant improvement in terms of inference speed when using the standard PyTorch inference.
+We are currently benchmarking and exploring inference setups specifically for sparse architectures.
+In particular, hardware manufacturers are announcing devices that will speedup inference for sparse networks considerably.
+
+## Citation
+
+If you find this resource useful, please consider citing the following paper:
+
+```
+@article{sanh2020movement,
+    title={Movement Pruning: Adaptive Sparsity by Fine-Tuning},
+    author={Victor Sanh and Thomas Wolf and Alexander M. Rush},
+    year={2020},
+    eprint={2005.07683},
+    archivePrefix={arXiv},
+    primaryClass={cs.CL}
+}
+```
--- a/examples/movement-pruning/Saving_PruneBERT.ipynb
+++ b/examples/movement-pruning/Saving_PruneBERT.ipynb
@@ -0,0 +1,612 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Saving PruneBERT\n",
+    "\n",
+    "\n",
+    "This notebook aims at showcasing how we can leverage standard tools to save (and load) an extremely sparse model fine-pruned with [movement pruning](https://arxiv.org/abs/2005.07683) (or any other unstructured pruning mehtod).\n",
+    "\n",
+    "In this example, we used BERT (base-uncased, but the procedure described here is not specific to BERT and can be applied to a large variety of models.\n",
+    "\n",
+    "We first obtain an extremely sparse model by fine-pruning with movement pruning on SQuAD v1.1. We then used the following combination of standard tools:\n",
+    "- We reduce the precision of the model with Int8 dynamic quantization using [PyTorch implementation](https://pytorch.org/tutorials/intermediate/dynamic_quantization_bert_tutorial.html). We only quantized the Fully Connected Layers.\n",
+    "- Sparse quantized matrices are converted into the [Compressed Sparse Row format](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html).\n",
+    "- We use HDF5 with `gzip` compression to store the weights.\n",
+    "\n",
+    "We experiment with a question answering model with only 6% of total remaining weights in the encoder (previously obtained with movement pruning). **We are able to reduce the memory size of the encoder from 340MB (original dense BERT) to 11MB**, which fits on a [91' floppy disk](https://en.wikipedia.org/wiki/Floptical)!\n",
+    "\n",
+    "<img src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/0/00/Floptical_disk_21MB.jpg/440px-Floptical_disk_21MB.jpg\" width=\"200\">"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Includes\n",
+    "\n",
+    "import h5py\n",
+    "import os\n",
+    "import json\n",
+    "from collections import OrderedDict\n",
+    "\n",
+    "from scipy import sparse\n",
+    "import numpy as np\n",
+    "\n",
+    "import torch\n",
+    "from torch import nn\n",
+    "\n",
+    "from transformers import *\n",
+    "\n",
+    "os.chdir('../../')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Saving"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Dynamic quantization induces little or no loss of performance while significantly reducing the memory footprint."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Load fine-pruned model and quantize the model\n",
+    "\n",
+    "model_path = \"serialization_dir/bert-base-uncased/92/squad/l1\"\n",
+    "model_name = \"bertarized_l1_with_distil_0._0.1_1_2_l1_1100._3e-5_1e-2_sigmoied_threshold_constant_0._10_epochs\"\n",
+    "\n",
+    "model = BertForQuestionAnswering.from_pretrained(os.path.join(model_path, model_name))\n",
+    "model.to('cpu')\n",
+    "\n",
+    "quantized_model = torch.quantization.quantize_dynamic(\n",
+    "                    model=model,\n",
+    "                    qconfig_spec = {\n",
+    "                        torch.nn.Linear : torch.quantization.default_dynamic_qconfig,\n",
+    "                    },\n",
+    "                    dtype=torch.qint8,\n",
+    "                )\n",
+    "# print(quantized_model)\n",
+    "\n",
+    "qtz_st = quantized_model.state_dict()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Saving the original (encoder + classifier) in the standard torch.save format\n",
+    "\n",
+    "dense_st = {name: param for name, param in model.state_dict().items() \n",
+    "                            if \"embedding\" not in name and \"pooler\" not in name}\n",
+    "torch.save(dense_st, 'dbg/dense_squad.pt',)\n",
+    "dense_mb_size = os.path.getsize(\"dbg/dense_squad.pt\")\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Decompose quantization for bert.encoder.layer.0.attention.self.query._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.0.attention.self.key._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.0.attention.self.value._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.0.attention.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.0.intermediate.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.0.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.1.attention.self.query._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.1.attention.self.key._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.1.attention.self.value._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.1.attention.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.1.intermediate.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.1.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.2.attention.self.query._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.2.attention.self.key._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.2.attention.self.value._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.2.attention.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.2.intermediate.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.2.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.3.attention.self.query._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.3.attention.self.key._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.3.attention.self.value._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.3.attention.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.3.intermediate.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.3.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.4.attention.self.query._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.4.attention.self.key._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.4.attention.self.value._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.4.attention.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.4.intermediate.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.4.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.5.attention.self.query._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.5.attention.self.key._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.5.attention.self.value._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.5.attention.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.5.intermediate.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.5.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.6.attention.self.query._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.6.attention.self.key._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.6.attention.self.value._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.6.attention.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.6.intermediate.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.6.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.7.attention.self.query._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.7.attention.self.key._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.7.attention.self.value._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.7.attention.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.7.intermediate.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.7.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.8.attention.self.query._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.8.attention.self.key._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.8.attention.self.value._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.8.attention.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.8.intermediate.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.8.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.9.attention.self.query._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.9.attention.self.key._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.9.attention.self.value._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.9.attention.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.9.intermediate.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.9.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.10.attention.self.query._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.10.attention.self.key._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.10.attention.self.value._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.10.attention.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.10.intermediate.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.10.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.11.attention.self.query._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.11.attention.self.key._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.11.attention.self.value._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.11.attention.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.11.intermediate.dense._packed_params.weight\n",
+      "Decompose quantization for bert.encoder.layer.11.output.dense._packed_params.weight\n",
+      "Decompose quantization for bert.pooler.dense._packed_params.weight\n",
+      "Decompose quantization for qa_outputs._packed_params.weight\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Elementary representation: we decompose the quantized tensors into (scale, zero_point, int_repr).\n",
+    "# See https://pytorch.org/docs/stable/quantization.html\n",
+    "\n",
+    "# We further leverage the fact that int_repr is sparse matrix to optimize the storage: we decompose int_repr into\n",
+    "# its CSR representation (data, indptr, indices).\n",
+    "\n",
+    "elementary_qtz_st = {}\n",
+    "for name, param in qtz_st.items():\n",
+    "    if param.is_quantized:\n",
+    "        print(\"Decompose quantization for\", name)\n",
+    "        # We need to extract the scale, the zero_point and the int_repr for the quantized tensor and modules\n",
+    "        scale = param.q_scale()                                # torch.tensor(1,) - float32\n",
+    "        zero_point = param.q_zero_point()                      # torch.tensor(1,) - int32\n",
+    "        elementary_qtz_st[f\"{name}.scale\"] = scale\n",
+    "        elementary_qtz_st[f\"{name}.zero_point\"] = zero_point\n",
+    "\n",
+    "        # We assume the int_repr is sparse and compute its CSR representation\n",
+    "        # Only the FCs in the encoder are actually sparse\n",
+    "        int_repr = param.int_repr()                         # torch.tensor(nb_rows, nb_columns) - int8\n",
+    "        int_repr_cs = sparse.csr_matrix(int_repr)           # scipy.sparse.csr.csr_matrix\n",
+    "\n",
+    "        elementary_qtz_st[f\"{name}.int_repr.data\"] = int_repr_cs.data                  # np.array int8\n",
+    "        elementary_qtz_st[f\"{name}.int_repr.indptr\"] = int_repr_cs.indptr              # np.array int32\n",
+    "        assert max(int_repr_cs.indices) < 65535 # If not, we shall fall back to int32\n",
+    "        elementary_qtz_st[f\"{name}.int_repr.indices\"] = np.uint16(int_repr_cs.indices) # np.array uint16\n",
+    "        elementary_qtz_st[f\"{name}.int_repr.shape\"] = int_repr_cs.shape                # tuple(int, int)\n",
+    "    else:\n",
+    "        elementary_qtz_st[name] = param\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Encoder Size (MB) - Sparse & Quantized - `torch.save`: 21.29\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Saving the pruned (encoder + classifier) in the standard torch.save format\n",
+    "\n",
+    "dense_optimized_st = {name: param for name, param in elementary_qtz_st.items() \n",
+    "                                    if \"embedding\" not in name and \"pooler\" not in name}\n",
+    "torch.save(dense_optimized_st, 'dbg/dense_squad_optimized.pt',)\n",
+    "print(\"Encoder Size (MB) - Sparse & Quantized - `torch.save`:\",\n",
+    "      round(os.path.getsize(\"dbg/dense_squad_optimized.pt\")/1e6, 2))\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Skip bert.embeddings.word_embeddings.weight\n",
+      "Skip bert.embeddings.position_embeddings.weight\n",
+      "Skip bert.embeddings.token_type_embeddings.weight\n",
+      "Skip bert.embeddings.LayerNorm.weight\n",
+      "Skip bert.embeddings.LayerNorm.bias\n",
+      "Skip bert.pooler.dense.scale\n",
+      "Skip bert.pooler.dense.zero_point\n",
+      "Skip bert.pooler.dense._packed_params.weight.scale\n",
+      "Skip bert.pooler.dense._packed_params.weight.zero_point\n",
+      "Skip bert.pooler.dense._packed_params.weight.int_repr.data\n",
+      "Skip bert.pooler.dense._packed_params.weight.int_repr.indptr\n",
+      "Skip bert.pooler.dense._packed_params.weight.int_repr.indices\n",
+      "Skip bert.pooler.dense._packed_params.weight.int_repr.shape\n",
+      "Skip bert.pooler.dense._packed_params.bias\n",
+      "\n",
+      "Encoder Size (MB) - Dense:              340.25\n",
+      "Encoder Size (MB) - Sparse & Quantized: 11.27\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Save the decomposed state_dict with an HDF5 file\n",
+    "# Saving only the encoder + QA Head\n",
+    "\n",
+    "with h5py.File('dbg/squad_sparse.h5','w') as hf:\n",
+    "    for name, param in elementary_qtz_st.items():\n",
+    "        if \"embedding\" in name:\n",
+    "            print(f\"Skip {name}\")\n",
+    "            continue\n",
+    "\n",
+    "        if \"pooler\" in name:\n",
+    "            print(f\"Skip {name}\")\n",
+    "            continue\n",
+    "\n",
+    "        if type(param) == torch.Tensor:\n",
+    "            if param.numel() == 1:\n",
+    "                # module scale\n",
+    "                # module zero_point\n",
+    "                hf.attrs[name] = param\n",
+    "                continue\n",
+    "\n",
+    "            if param.requires_grad:\n",
+    "                # LayerNorm\n",
+    "                param = param.detach().numpy()\n",
+    "            hf.create_dataset(name, data=param, compression=\"gzip\", compression_opts=9)\n",
+    "\n",
+    "        elif type(param) == float or type(param) == int or type(param) == tuple:\n",
+    "            # float - tensor _packed_params.weight.scale\n",
+    "            # int   - tensor_packed_params.weight.zero_point\n",
+    "            # tuple - tensor _packed_params.weight.shape\n",
+    "            hf.attrs[name] = param\n",
+    "\n",
+    "        else:\n",
+    "            hf.create_dataset(name, data=param, compression=\"gzip\", compression_opts=9)\n",
+    "\n",
+    "\n",
+    "with open('dbg/metadata.json', 'w') as f:\n",
+    "    f.write(json.dumps(qtz_st._metadata))  \n",
+    "\n",
+    "size = os.path.getsize(\"dbg/squad_sparse.h5\") + os.path.getsize(\"dbg/metadata.json\")\n",
+    "print(\"\")\n",
+    "print(\"Encoder Size (MB) - Dense:             \", round(dense_mb_size/1e6, 2))\n",
+    "print(\"Encoder Size (MB) - Sparse & Quantized:\", round(size/1e6, 2))\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n",
+      "Size (MB): 99.39\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Save the decomposed state_dict to HDF5 storage\n",
+    "# Save everything in the architecutre (embedding + encoder + QA Head)\n",
+    "\n",
+    "with h5py.File('dbg/squad_sparse_with_embs.h5','w') as hf:\n",
+    "    for name, param in elementary_qtz_st.items():\n",
+    "#         if \"embedding\" in name:\n",
+    "#             print(f\"Skip {name}\")\n",
+    "#             continue\n",
+    "\n",
+    "#         if \"pooler\" in name:\n",
+    "#             print(f\"Skip {name}\")\n",
+    "#             continue\n",
+    "\n",
+    "        if type(param) == torch.Tensor:\n",
+    "            if param.numel() == 1:\n",
+    "                # module scale\n",
+    "                # module zero_point\n",
+    "                hf.attrs[name] = param\n",
+    "                continue\n",
+    "\n",
+    "            if param.requires_grad:\n",
+    "                # LayerNorm\n",
+    "                param = param.detach().numpy()\n",
+    "            hf.create_dataset(name, data=param, compression=\"gzip\", compression_opts=9)\n",
+    "\n",
+    "        elif type(param) == float or type(param) == int or type(param) == tuple:\n",
+    "            # float - tensor _packed_params.weight.scale\n",
+    "            # int   - tensor _packed_params.weight.zero_point\n",
+    "            # tuple - tensor _packed_params.weight.shape\n",
+    "            hf.attrs[name] = param\n",
+    "\n",
+    "        else:\n",
+    "            hf.create_dataset(name, data=param, compression=\"gzip\", compression_opts=9)\n",
+    "\n",
+    "\n",
+    "with open('dbg/metadata.json', 'w') as f:\n",
+    "    f.write(json.dumps(qtz_st._metadata))   \n",
+    "\n",
+    "size = os.path.getsize(\"dbg/squad_sparse_with_embs.h5\") + os.path.getsize(\"dbg/metadata.json\")\n",
+    "print('\\nSize (MB):', round(size/1e6, 2))\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Loading"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Reconstruct the elementary state dict\n",
+    "\n",
+    "reconstructed_elementary_qtz_st = {}\n",
+    "\n",
+    "hf = h5py.File('dbg/squad_sparse_with_embs.h5','r')\n",
+    "\n",
+    "for attr_name, attr_param in hf.attrs.items():\n",
+    "    if 'shape' in attr_name:\n",
+    "        attr_param = tuple(attr_param)\n",
+    "    elif \".scale\" in attr_name:\n",
+    "        if \"_packed_params\" in attr_name:\n",
+    "            attr_param = float(attr_param)\n",
+    "        else:\n",
+    "            attr_param = torch.tensor(attr_param)\n",
+    "    elif \".zero_point\" in attr_name:\n",
+    "        if \"_packed_params\" in attr_name:\n",
+    "            attr_param = int(attr_param)\n",
+    "        else:\n",
+    "            attr_param = torch.tensor(attr_param)\n",
+    "    reconstructed_elementary_qtz_st[attr_name] = attr_param\n",
+    "    # print(f\"Unpack {attr_name}\")\n",
+    "    \n",
+    "# Get the tensors/arrays\n",
+    "for data_name, data_param in hf.items():\n",
+    "    if \"LayerNorm\" in data_name or \"_packed_params.bias\" in data_name:\n",
+    "        reconstructed_elementary_qtz_st[data_name] = torch.from_numpy(np.array(data_param))\n",
+    "    elif \"embedding\" in data_name:\n",
+    "        reconstructed_elementary_qtz_st[data_name] = torch.from_numpy(np.array(data_param))\n",
+    "    else: # _packed_params.weight.int_repr.data, _packed_params.weight.int_repr.indices and _packed_params.weight.int_repr.indptr\n",
+    "        data_param = np.array(data_param)\n",
+    "        if \"indices\" in data_name:\n",
+    "            data_param = np.array(data_param, dtype=np.int32)\n",
+    "        reconstructed_elementary_qtz_st[data_name] = data_param\n",
+    "    # print(f\"Unpack {data_name}\")\n",
+    "    \n",
+    "\n",
+    "hf.close()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Sanity checks\n",
+    "\n",
+    "for name, param in reconstructed_elementary_qtz_st.items():\n",
+    "    assert name in elementary_qtz_st\n",
+    "for name, param in elementary_qtz_st.items():\n",
+    "    assert name in reconstructed_elementary_qtz_st, name\n",
+    "\n",
+    "for name, param in reconstructed_elementary_qtz_st.items():\n",
+    "    assert type(param) == type(elementary_qtz_st[name]), name\n",
+    "    if type(param) == torch.Tensor:\n",
+    "        assert torch.all(torch.eq(param, elementary_qtz_st[name])), name\n",
+    "    elif type(param) == np.ndarray:\n",
+    "        assert (param == elementary_qtz_st[name]).all(), name\n",
+    "    else:\n",
+    "        assert param == elementary_qtz_st[name], name"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Re-assemble the sparse int_repr from the CSR format\n",
+    "\n",
+    "reconstructed_qtz_st = {}\n",
+    "\n",
+    "for name, param in reconstructed_elementary_qtz_st.items():\n",
+    "    if \"weight.int_repr.indptr\" in name:\n",
+    "        prefix_ = name[:-16]\n",
+    "        data    = reconstructed_elementary_qtz_st[f\"{prefix_}.int_repr.data\"]\n",
+    "        indptr  = reconstructed_elementary_qtz_st[f\"{prefix_}.int_repr.indptr\"]\n",
+    "        indices = reconstructed_elementary_qtz_st[f\"{prefix_}.int_repr.indices\"]\n",
+    "        shape   = reconstructed_elementary_qtz_st[f\"{prefix_}.int_repr.shape\"]\n",
+    "\n",
+    "        int_repr = sparse.csr_matrix(arg1=(data, indices, indptr),\n",
+    "                                     shape=shape)\n",
+    "        int_repr = torch.tensor(int_repr.todense())\n",
+    "\n",
+    "        scale = reconstructed_elementary_qtz_st[f\"{prefix_}.scale\"]\n",
+    "        zero_point = reconstructed_elementary_qtz_st[f\"{prefix_}.zero_point\"]\n",
+    "        weight = torch._make_per_tensor_quantized_tensor(int_repr,\n",
+    "                                                         scale,\n",
+    "                                                         zero_point)\n",
+    "\n",
+    "        reconstructed_qtz_st[f\"{prefix_}\"] = weight\n",
+    "    elif \"int_repr.data\" in name or \"int_repr.shape\" in name or \"int_repr.indices\" in name or \\\n",
+    "         \"weight.scale\" in name or \"weight.zero_point\" in name:\n",
+    "        continue\n",
+    "    else:\n",
+    "        reconstructed_qtz_st[name] = param\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Sanity checks\n",
+    "\n",
+    "for name, param in reconstructed_qtz_st.items():\n",
+    "    assert name in qtz_st\n",
+    "for name, param in qtz_st.items():\n",
+    "    assert name in reconstructed_qtz_st, name\n",
+    "\n",
+    "for name, param in reconstructed_qtz_st.items():\n",
+    "    assert type(param) == type(qtz_st[name]), name\n",
+    "    if type(param) == torch.Tensor:\n",
+    "        assert torch.all(torch.eq(param, qtz_st[name])), name\n",
+    "    elif type(param) == np.ndarray:\n",
+    "        assert (param == qtz_st[name]).all(), name\n",
+    "    else:\n",
+    "        assert param == qtz_st[name], name"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Sanity checks"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "<All keys matched successfully>"
+      ]
+     },
+     "execution_count": 12,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# Load the re-constructed state dict into a model\n",
+    "\n",
+    "dummy_model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')\n",
+    "dummy_model.to('cpu')\n",
+    "\n",
+    "reconstructed_qtz_model = torch.quantization.quantize_dynamic(\n",
+    "                            model=dummy_model,\n",
+    "                            qconfig_spec = None,\n",
+    "                            dtype=torch.qint8,\n",
+    "                          )\n",
+    "\n",
+    "reconstructed_qtz_st = OrderedDict(reconstructed_qtz_st)\n",
+    "with open('dbg/metadata.json', 'r') as read_file:\n",
+    "    metadata = json.loads(read_file.read())\n",
+    "reconstructed_qtz_st._metadata = metadata\n",
+    "\n",
+    "reconstructed_qtz_model.load_state_dict(reconstructed_qtz_st)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Sanity check passed\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Sanity checks on the infernce\n",
+    "\n",
+    "N = 32\n",
+    "\n",
+    "for _ in range(25):\n",
+    "    inputs = torch.randint(low=0, high=30000, size=(N, 128))\n",
+    "    mask = torch.ones(size=(N, 128))\n",
+    "\n",
+    "    y_reconstructed = reconstructed_qtz_model(input_ids=inputs, attention_mask=mask)[0]\n",
+    "    y               = quantized_model(input_ids=inputs, attention_mask=mask)[0]\n",
+    "    \n",
+    "    assert torch.all(torch.eq(y, y_reconstructed))\n",
+    "print(\"Sanity check passed\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.8"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
--- a/examples/movement-pruning/bertarize.py
+++ b/examples/movement-pruning/bertarize.py
@@ -0,0 +1,132 @@
+# Copyright 2020-present, the HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Once a model has been fine-pruned, the weights that are masked during the forward pass can be pruned once for all.
+For instance, once the a model from the :class:`~emmental.MaskedBertForSequenceClassification` is trained, it can be saved (and then loaded)
+as a standard :class:`~transformers.BertForSequenceClassification`.
+"""
+
+import argparse
+import os
+import shutil
+
+import torch
+
+from emmental.modules import MagnitudeBinarizer, ThresholdBinarizer, TopKBinarizer
+
+
+def main(args):
+    pruning_method = args.pruning_method
+    threshold = args.threshold
+
+    model_name_or_path = args.model_name_or_path.rstrip("/")
+    target_model_path = args.target_model_path
+
+    print(f"Load fine-pruned model from {model_name_or_path}")
+    model = torch.load(os.path.join(model_name_or_path, "pytorch_model.bin"))
+    pruned_model = {}
+
+    for name, tensor in model.items():
+        if "embeddings" in name or "LayerNorm" in name or "pooler" in name:
+            pruned_model[name] = tensor
+            print(f"Copied layer {name}")
+        elif "classifier" in name or "qa_output" in name:
+            pruned_model[name] = tensor
+            print(f"Copied layer {name}")
+        elif "bias" in name:
+            pruned_model[name] = tensor
+            print(f"Copied layer {name}")
+        else:
+            if pruning_method == "magnitude":
+                mask = MagnitudeBinarizer.apply(inputs=tensor, threshold=threshold)
+                pruned_model[name] = tensor * mask
+                print(f"Pruned layer {name}")
+            elif pruning_method == "topK":
+                if "mask_scores" in name:
+                    continue
+                prefix_ = name[:-6]
+                scores = model[f"{prefix_}mask_scores"]
+                mask = TopKBinarizer.apply(scores, threshold)
+                pruned_model[name] = tensor * mask
+                print(f"Pruned layer {name}")
+            elif pruning_method == "sigmoied_threshold":
+                if "mask_scores" in name:
+                    continue
+                prefix_ = name[:-6]
+                scores = model[f"{prefix_}mask_scores"]
+                mask = ThresholdBinarizer.apply(scores, threshold, True)
+                pruned_model[name] = tensor * mask
+                print(f"Pruned layer {name}")
+            elif pruning_method == "l0":
+                if "mask_scores" in name:
+                    continue
+                prefix_ = name[:-6]
+                scores = model[f"{prefix_}mask_scores"]
+                l, r = -0.1, 1.1
+                s = torch.sigmoid(scores)
+                s_bar = s * (r - l) + l
+                mask = s_bar.clamp(min=0.0, max=1.0)
+                pruned_model[name] = tensor * mask
+                print(f"Pruned layer {name}")
+            else:
+                raise ValueError("Unknown pruning method")
+
+    if target_model_path is None:
+        target_model_path = os.path.join(
+            os.path.dirname(model_name_or_path), f"bertarized_{os.path.basename(model_name_or_path)}"
+        )
+
+    if not os.path.isdir(target_model_path):
+        shutil.copytree(model_name_or_path, target_model_path)
+        print(f"\nCreated folder {target_model_path}")
+
+    torch.save(pruned_model, os.path.join(target_model_path, "pytorch_model.bin"))
+    print("\nPruned model saved! See you later!")
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+
+    parser.add_argument(
+        "--pruning_method",
+        choices=["l0", "magnitude", "topK", "sigmoied_threshold"],
+        type=str,
+        required=True,
+        help="Pruning Method (l0 = L0 regularization, magnitude = Magnitude pruning, topK = Movement pruning, sigmoied_threshold = Soft movement pruning)",
+    )
+    parser.add_argument(
+        "--threshold",
+        type=float,
+        required=False,
+        help="For `magnitude` and `topK`, it is the level of remaining weights (in %) in the fine-pruned model."
+        "For `sigmoied_threshold`, it is the threshold \tau against which the (sigmoied) scores are compared."
+        "Not needed for `l0`",
+    )
+    parser.add_argument(
+        "--model_name_or_path",
+        type=str,
+        required=True,
+        help="Folder containing the model that was previously fine-pruned",
+    )
+    parser.add_argument(
+        "--target_model_path",
+        default=None,
+        type=str,
+        required=False,
+        help="Folder containing the model that was previously fine-pruned",
+    )
+
+    args = parser.parse_args()
+
+    main(args)
--- a/examples/movement-pruning/counts_parameters.py
+++ b/examples/movement-pruning/counts_parameters.py
@@ -0,0 +1,92 @@
+# Copyright 2020-present, the HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Count remaining (non-zero) weights in the encoder (i.e. the transformer layers).
+Sparsity and remaining weights levels are equivalent: sparsity % = 100 - remaining weights %.
+"""
+import argparse
+import os
+
+import torch
+
+from emmental.modules import ThresholdBinarizer, TopKBinarizer
+
+
+def main(args):
+    serialization_dir = args.serialization_dir
+    pruning_method = args.pruning_method
+    threshold = args.threshold
+
+    st = torch.load(os.path.join(serialization_dir, "pytorch_model.bin"), map_location="cpu")
+
+    remaining_count = 0  # Number of remaining (not pruned) params in the encoder
+    encoder_count = 0  # Number of params in the encoder
+
+    print("name".ljust(60, " "), "Remaining Weights %", "Remaning Weight")
+    for name, param in st.items():
+        if "encoder" not in name:
+            continue
+
+        if "mask_scores" in name:
+            if pruning_method == "topK":
+                mask_ones = TopKBinarizer.apply(param, threshold).sum().item()
+            elif pruning_method == "sigmoied_threshold":
+                mask_ones = ThresholdBinarizer.apply(param, threshold, True).sum().item()
+            elif pruning_method == "l0":
+                l, r = -0.1, 1.1
+                s = torch.sigmoid(param)
+                s_bar = s * (r - l) + l
+                mask = s_bar.clamp(min=0.0, max=1.0)
+                mask_ones = (mask > 0.0).sum().item()
+            else:
+                raise ValueError("Unknown pruning method")
+            remaining_count += mask_ones
+            print(name.ljust(60, " "), str(round(100 * mask_ones / param.numel(), 3)).ljust(20, " "), str(mask_ones))
+        else:
+            encoder_count += param.numel()
+            if "bias" in name or "LayerNorm" in name:
+                remaining_count += param.numel()
+
+    print("")
+    print("Remaining Weights (global) %: ", 100 * remaining_count / encoder_count)
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+
+    parser.add_argument(
+        "--pruning_method",
+        choices=["l0", "topK", "sigmoied_threshold"],
+        type=str,
+        required=True,
+        help="Pruning Method (l0 = L0 regularization, topK = Movement pruning, sigmoied_threshold = Soft movement pruning)",
+    )
+    parser.add_argument(
+        "--threshold",
+        type=float,
+        required=False,
+        help="For `topK`, it is the level of remaining weights (in %) in the fine-pruned model."
+        "For `sigmoied_threshold`, it is the threshold \tau against which the (sigmoied) scores are compared."
+        "Not needed for `l0`",
+    )
+    parser.add_argument(
+        "--serialization_dir",
+        type=str,
+        required=True,
+        help="Folder containing the model that was previously fine-pruned",
+    )
+
+    args = parser.parse_args()
+
+    main(args)
--- a/examples/movement-pruning/emmental/init.py
+++ b/examples/movement-pruning/emmental/init.py
@@ -0,0 +1,10 @@
+# flake8: noqa
+from .configuration_bert_masked import MaskedBertConfig
+from .modeling_bert_masked import (
+    MaskedBertForMultipleChoice,
+    MaskedBertForQuestionAnswering,
+    MaskedBertForSequenceClassification,
+    MaskedBertForTokenClassification,
+    MaskedBertModel,
+)
+from .modules import *
--- a/examples/movement-pruning/emmental/configuration_bert_masked.py
+++ b/examples/movement-pruning/emmental/configuration_bert_masked.py
@@ -0,0 +1,71 @@
+# coding=utf-8
+# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
+# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" Masked BERT model configuration. It replicates the class `~transformers.BertConfig`
+and adapts it to the specificities of MaskedBert (`pruning_method`, `mask_init` and `mask_scale`."""
+
+
+import logging
+
+from transformers.configuration_utils import PretrainedConfig
+
+
+logger = logging.getLogger(__name__)
+
+
+class MaskedBertConfig(PretrainedConfig):
+    """
+    A class replicating the `~transformers.BertConfig` with additional parameters for pruning/masking configuration.
+    """
+
+    model_type = "masked_bert"
+
+    def __init__(
+        self,
+        vocab_size=30522,
+        hidden_size=768,
+        num_hidden_layers=12,
+        num_attention_heads=12,
+        intermediate_size=3072,
+        hidden_act="gelu",
+        hidden_dropout_prob=0.1,
+        attention_probs_dropout_prob=0.1,
+        max_position_embeddings=512,
+        type_vocab_size=2,
+        initializer_range=0.02,
+        layer_norm_eps=1e-12,
+        pad_token_id=0,
+        pruning_method="topK",
+        mask_init="constant",
+        mask_scale=0.0,
+        **kwargs
+    ):
+        super().__init__(pad_token_id=pad_token_id, **kwargs)
+
+        self.vocab_size = vocab_size
+        self.hidden_size = hidden_size
+        self.num_hidden_layers = num_hidden_layers
+        self.num_attention_heads = num_attention_heads
+        self.hidden_act = hidden_act
+        self.intermediate_size = intermediate_size
+        self.hidden_dropout_prob = hidden_dropout_prob
+        self.attention_probs_dropout_prob = attention_probs_dropout_prob
+        self.max_position_embeddings = max_position_embeddings
+        self.type_vocab_size = type_vocab_size
+        self.initializer_range = initializer_range
+        self.layer_norm_eps = layer_norm_eps
+        self.pruning_method = pruning_method
+        self.mask_init = mask_init
+        self.mask_scale = mask_scale
--- a/examples/movement-pruning/emmental/modeling_bert_masked.py
+++ b/examples/movement-pruning/emmental/modeling_bert_masked.py
--- a/examples/movement-pruning/emmental/modules/init.py
+++ b/examples/movement-pruning/emmental/modules/init.py
@@ -0,0 +1,3 @@
+# flake8: noqa
+from .binarizer import MagnitudeBinarizer, ThresholdBinarizer, TopKBinarizer
+from .masked_nn import MaskedLinear
--- a/examples/movement-pruning/emmental/modules/binarizer.py
+++ b/examples/movement-pruning/emmental/modules/binarizer.py
@@ -0,0 +1,144 @@
+# coding=utf-8
+# Copyright 2020-present, AllenAI Authors, University of Illinois Urbana-Champaign,
+# Intel Nervana Systems and the HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Binarizers take a (real value) matrice as input and produce a binary (values in {0,1}) mask of the same shape.
+"""
+
+import torch
+from torch import autograd
+
+
+class ThresholdBinarizer(autograd.Function):
+    """
+    Thresholdd binarizer.
+    Computes a binary mask M from a real value matrix S such that `M_{i,j} = 1` if and only if `S_{i,j} > \tau`
+    where `\tau` is a real value threshold.
+
+    Implementation is inspired from:
+        https://github.com/arunmallya/piggyback
+        Piggyback: Adapting a Single Network to Multiple Tasks by Learning to Mask Weights
+        Arun Mallya, Dillon Davis, Svetlana Lazebnik
+    """
+
+    @staticmethod
+    def forward(ctx, inputs: torch.tensor, threshold: float, sigmoid: bool):
+        """
+        Args:
+            inputs (`torch.FloatTensor`)
+                The input matrix from which the binarizer computes the binary mask.
+            threshold (`float`)
+                The threshold value (in R).
+            sigmoid (`bool`)
+                If set to ``True``, we apply the sigmoid function to the `inputs` matrix before comparing to `threshold`.
+                In this case, `threshold` should be a value between 0 and 1.
+        Returns:
+            mask (`torch.FloatTensor`)
+                Binary matrix of the same size as `inputs` acting as a mask (1 - the associated weight is
+                retained, 0 - the associated weight is pruned).
+        """
+        nb_elems = inputs.numel()
+        nb_min = int(0.005 * nb_elems) + 1
+        if sigmoid:
+            mask = (torch.sigmoid(inputs) > threshold).type(inputs.type())
+        else:
+            mask = (inputs > threshold).type(inputs.type())
+        if mask.sum() < nb_min:
+            # We limit the pruning so that at least 0.5% (half a percent) of the weights are remaining
+            k_threshold = inputs.flatten().kthvalue(max(nb_elems - nb_min, 1)).values
+            mask = (inputs > k_threshold).type(inputs.type())
+        return mask
+
+    @staticmethod
+    def backward(ctx, gradOutput):
+        return gradOutput, None, None
+
+
+class TopKBinarizer(autograd.Function):
+    """
+    Top-k Binarizer.
+    Computes a binary mask M from a real value matrix S such that `M_{i,j} = 1` if and only if `S_{i,j}`
+    is among the k% highest values of S.
+
+    Implementation is inspired from:
+        https://github.com/allenai/hidden-networks
+        What's hidden in a randomly weighted neural network?
+        Vivek Ramanujan*, Mitchell Wortsman*, Aniruddha Kembhavi, Ali Farhadi, Mohammad Rastegari
+    """
+
+    @staticmethod
+    def forward(ctx, inputs: torch.tensor, threshold: float):
+        """
+        Args:
+            inputs (`torch.FloatTensor`)
+                The input matrix from which the binarizer computes the binary mask.
+            threshold (`float`)
+                The percentage of weights to keep (the rest is pruned).
+                `threshold` is a float between 0 and 1.
+        Returns:
+            mask (`torch.FloatTensor`)
+                Binary matrix of the same size as `inputs` acting as a mask (1 - the associated weight is
+                retained, 0 - the associated weight is pruned).
+        """
+        # Get the subnetwork by sorting the inputs and using the top threshold %
+        mask = inputs.clone()
+        _, idx = inputs.flatten().sort(descending=True)
+        j = int(threshold * inputs.numel())
+
+        # flat_out and mask access the same memory.
+        flat_out = mask.flatten()
+        flat_out[idx[j:]] = 0
+        flat_out[idx[:j]] = 1
+        return mask
+
+    @staticmethod
+    def backward(ctx, gradOutput):
+        return gradOutput, None
+
+
+class MagnitudeBinarizer(object):
+    """
+    Magnitude Binarizer.
+    Computes a binary mask M from a real value matrix S such that `M_{i,j} = 1` if and only if `S_{i,j}`
+    is among the k% highest values of |S| (absolute value).
+
+    Implementation is inspired from https://github.com/NervanaSystems/distiller/blob/2291fdcc2ea642a98d4e20629acb5a9e2e04b4e6/distiller/pruning/automated_gradual_pruner.py#L24
+    """
+
+    @staticmethod
+    def apply(inputs: torch.tensor, threshold: float):
+        """
+        Args:
+            inputs (`torch.FloatTensor`)
+                The input matrix from which the binarizer computes the binary mask.
+                This input marix is typically the weight matrix.
+            threshold (`float`)
+                The percentage of weights to keep (the rest is pruned).
+                `threshold` is a float between 0 and 1.
+        Returns:
+            mask (`torch.FloatTensor`)
+                Binary matrix of the same size as `inputs` acting as a mask (1 - the associated weight is
+                retained, 0 - the associated weight is pruned).
+        """
+        # Get the subnetwork by sorting the inputs and using the top threshold %
+        mask = inputs.clone()
+        _, idx = inputs.abs().flatten().sort(descending=True)
+        j = int(threshold * inputs.numel())
+
+        # flat_out and mask access the same memory.
+        flat_out = mask.flatten()
+        flat_out[idx[j:]] = 0
+        flat_out[idx[:j]] = 1
+        return mask
--- a/examples/movement-pruning/emmental/modules/masked_nn.py
+++ b/examples/movement-pruning/emmental/modules/masked_nn.py
@@ -0,0 +1,107 @@
+# coding=utf-8
+# Copyright 2020-present, the HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Masked Linear module: A fully connected layer that computes an adaptive binary mask on the fly.
+The mask (binary or not) is computed at each forward pass and multiplied against
+the weight matrix to prune a portion of the weights.
+The pruned weight matrix is then multiplied against the inputs (and if necessary, the bias is added).
+"""
+
+import math
+
+import torch
+from torch import nn
+from torch.nn import functional as F
+from torch.nn import init
+
+from .binarizer import MagnitudeBinarizer, ThresholdBinarizer, TopKBinarizer
+
+
+class MaskedLinear(nn.Linear):
+    """
+    Fully Connected layer with on the fly adaptive mask.
+    If needed, a score matrix is created to store the importance of each associated weight.
+    """
+
+    def __init__(
+        self,
+        in_features: int,
+        out_features: int,
+        bias: bool = True,
+        mask_init: str = "constant",
+        mask_scale: float = 0.0,
+        pruning_method: str = "topK",
+    ):
+        """
+        Args:
+            in_features (`int`)
+                Size of each input sample
+            out_features (`int`)
+                Size of each output sample
+            bias (`bool`)
+                If set to ``False``, the layer will not learn an additive bias.
+                Default: ``True``
+            mask_init (`str`)
+                The initialization method for the score matrix if a score matrix is needed.
+                Choices: ["constant", "uniform", "kaiming"]
+                Default: ``constant``
+            mask_scale (`float`)
+                The initialization parameter for the chosen initialization method `mask_init`.
+                Default: ``0.``
+            pruning_method (`str`)
+                Method to compute the mask.
+                Choices: ["topK", "threshold", "sigmoied_threshold", "magnitude", "l0"]
+                Default: ``topK``
+        """
+        super(MaskedLinear, self).__init__(in_features=in_features, out_features=out_features, bias=bias)
+        assert pruning_method in ["topK", "threshold", "sigmoied_threshold", "magnitude", "l0"]
+        self.pruning_method = pruning_method
+
+        if self.pruning_method in ["topK", "threshold", "sigmoied_threshold", "l0"]:
+            self.mask_scale = mask_scale
+            self.mask_init = mask_init
+            self.mask_scores = nn.Parameter(torch.Tensor(self.weight.size()))
+            self.init_mask()
+
+    def init_mask(self):
+        if self.mask_init == "constant":
+            init.constant_(self.mask_scores, val=self.mask_scale)
+        elif self.mask_init == "uniform":
+            init.uniform_(self.mask_scores, a=-self.mask_scale, b=self.mask_scale)
+        elif self.mask_init == "kaiming":
+            init.kaiming_uniform_(self.mask_scores, a=math.sqrt(5))
+
+    def forward(self, input: torch.tensor, threshold: float):
+        # Get the mask
+        if self.pruning_method == "topK":
+            mask = TopKBinarizer.apply(self.mask_scores, threshold)
+        elif self.pruning_method in ["threshold", "sigmoied_threshold"]:
+            sig = "sigmoied" in self.pruning_method
+            mask = ThresholdBinarizer.apply(self.mask_scores, threshold, sig)
+        elif self.pruning_method == "magnitude":
+            mask = MagnitudeBinarizer.apply(self.weight, threshold)
+        elif self.pruning_method == "l0":
+            l, r, b = -0.1, 1.1, 2 / 3
+            if self.training:
+                u = torch.zeros_like(self.mask_scores).uniform_().clamp(0.0001, 0.9999)
+                s = torch.sigmoid((u.log() - (1 - u).log() + self.mask_scores) / b)
+            else:
+                s = torch.sigmoid(self.mask_scores)
+            s_bar = s * (r - l) + l
+            mask = s_bar.clamp(min=0.0, max=1.0)
+        # Mask weights with computed mask
+        weight_thresholded = mask * self.weight
+        # Compute output (linear layer) with masked weights
+        return F.linear(input, weight_thresholded, self.bias)
--- a/examples/movement-pruning/masked_run_glue.py
+++ b/examples/movement-pruning/masked_run_glue.py
@@ -0,0 +1,924 @@
+# coding=utf-8
+# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
+# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" Fine-pruning Masked BERT on sequence classification on GLUE."""
+
+import argparse
+import glob
+import json
+import logging
+import os
+import random
+
+import numpy as np
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset
+from torch.utils.data.distributed import DistributedSampler
+from tqdm import tqdm, trange
+
+from emmental import MaskedBertConfig, MaskedBertForSequenceClassification
+from transformers import (
+    WEIGHTS_NAME,
+    AdamW,
+    BertConfig,
+    BertForSequenceClassification,
+    BertTokenizer,
+    get_linear_schedule_with_warmup,
+)
+from transformers import glue_compute_metrics as compute_metrics
+from transformers import glue_convert_examples_to_features as convert_examples_to_features
+from transformers import glue_output_modes as output_modes
+from transformers import glue_processors as processors
+
+
+try:
+    from torch.utils.tensorboard import SummaryWriter
+except ImportError:
+    from tensorboardX import SummaryWriter
+
+
+logger = logging.getLogger(__name__)
+
+MODEL_CLASSES = {
+    "bert": (BertConfig, BertForSequenceClassification, BertTokenizer),
+    "masked_bert": (MaskedBertConfig, MaskedBertForSequenceClassification, BertTokenizer),
+}
+
+
+def set_seed(args):
+    random.seed(args.seed)
+    np.random.seed(args.seed)
+    torch.manual_seed(args.seed)
+    if args.n_gpu > 0:
+        torch.cuda.manual_seed_all(args.seed)
+
+
+def schedule_threshold(
+    step: int,
+    total_step: int,
+    warmup_steps: int,
+    initial_threshold: float,
+    final_threshold: float,
+    initial_warmup: int,
+    final_warmup: int,
+    final_lambda: float,
+):
+    if step <= initial_warmup * warmup_steps:
+        threshold = initial_threshold
+    elif step > (total_step - final_warmup * warmup_steps):
+        threshold = final_threshold
+    else:
+        spars_warmup_steps = initial_warmup * warmup_steps
+        spars_schedu_steps = (final_warmup + initial_warmup) * warmup_steps
+        mul_coeff = 1 - (step - spars_warmup_steps) / (total_step - spars_schedu_steps)
+        threshold = final_threshold + (initial_threshold - final_threshold) * (mul_coeff ** 3)
+    regu_lambda = final_lambda * threshold / final_threshold
+    return threshold, regu_lambda
+
+
+def regularization(model: nn.Module, mode: str):
+    regu, counter = 0, 0
+    for name, param in model.named_parameters():
+        if "mask_scores" in name:
+            if mode == "l1":
+                regu += torch.norm(torch.sigmoid(param), p=1) / param.numel()
+            elif mode == "l0":
+                regu += torch.sigmoid(param - 2 / 3 * np.log(0.1 / 1.1)).sum() / param.numel()
+            else:
+                ValueError("Don't know this mode.")
+            counter += 1
+    return regu / counter
+
+
+def train(args, train_dataset, model, tokenizer, teacher=None):
+    """ Train the model """
+    if args.local_rank in [-1, 0]:
+        tb_writer = SummaryWriter(log_dir=args.output_dir)
+
+    args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
+    train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
+    train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)
+
+    if args.max_steps > 0:
+        t_total = args.max_steps
+        args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
+    else:
+        t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
+
+    # Prepare optimizer and schedule (linear warmup and decay)
+    no_decay = ["bias", "LayerNorm.weight"]
+    optimizer_grouped_parameters = [
+        {
+            "params": [p for n, p in model.named_parameters() if "mask_score" in n and p.requires_grad],
+            "lr": args.mask_scores_learning_rate,
+        },
+        {
+            "params": [
+                p
+                for n, p in model.named_parameters()
+                if "mask_score" not in n and p.requires_grad and not any(nd in n for nd in no_decay)
+            ],
+            "lr": args.learning_rate,
+            "weight_decay": args.weight_decay,
+        },
+        {
+            "params": [
+                p
+                for n, p in model.named_parameters()
+                if "mask_score" not in n and p.requires_grad and any(nd in n for nd in no_decay)
+            ],
+            "lr": args.learning_rate,
+            "weight_decay": 0.0,
+        },
+    ]
+
+    optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
+    scheduler = get_linear_schedule_with_warmup(
+        optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total
+    )
+
+    # Check if saved optimizer or scheduler states exist
+    if os.path.isfile(os.path.join(args.model_name_or_path, "optimizer.pt")) and os.path.isfile(
+        os.path.join(args.model_name_or_path, "scheduler.pt")
+    ):
+        # Load in optimizer and scheduler states
+        optimizer.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "optimizer.pt")))
+        scheduler.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "scheduler.pt")))
+
+    if args.fp16:
+        try:
+            from apex import amp
+        except ImportError:
+            raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
+        model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
+
+    # multi-gpu training (should be after apex fp16 initialization)
+    if args.n_gpu > 1:
+        model = torch.nn.DataParallel(model)
+
+    # Distributed training (should be after apex fp16 initialization)
+    if args.local_rank != -1:
+        model = torch.nn.parallel.DistributedDataParallel(
+            model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True,
+        )
+
+    # Train!
+    logger.info("***** Running training *****")
+    logger.info("  Num examples = %d", len(train_dataset))
+    logger.info("  Num Epochs = %d", args.num_train_epochs)
+    logger.info("  Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
+    logger.info(
+        "  Total train batch size (w. parallel, distributed & accumulation) = %d",
+        args.train_batch_size
+        * args.gradient_accumulation_steps
+        * (torch.distributed.get_world_size() if args.local_rank != -1 else 1),
+    )
+    logger.info("  Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
+    logger.info("  Total optimization steps = %d", t_total)
+    # Distillation
+    if teacher is not None:
+        logger.info("  Training with distillation")
+
+    global_step = 0
+    # Global TopK
+    if args.global_topk:
+        threshold_mem = None
+    epochs_trained = 0
+    steps_trained_in_current_epoch = 0
+    # Check if continuing training from a checkpoint
+    if os.path.exists(args.model_name_or_path):
+        # set global_step to global_step of last saved checkpoint from model path
+        try:
+            global_step = int(args.model_name_or_path.split("-")[-1].split("/")[0])
+        except ValueError:
+            global_step = 0
+        epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps)
+        steps_trained_in_current_epoch = global_step % (len(train_dataloader) // args.gradient_accumulation_steps)
+
+        logger.info("  Continuing training from checkpoint, will skip to saved global_step")
+        logger.info("  Continuing training from epoch %d", epochs_trained)
+        logger.info("  Continuing training from global step %d", global_step)
+        logger.info("  Will skip the first %d steps in the first epoch", steps_trained_in_current_epoch)
+
+    tr_loss, logging_loss = 0.0, 0.0
+    model.zero_grad()
+    train_iterator = trange(
+        epochs_trained, int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0],
+    )
+    set_seed(args)  # Added here for reproductibility
+    for _ in train_iterator:
+        epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
+        for step, batch in enumerate(epoch_iterator):
+
+            # Skip past any already trained steps if resuming training
+            if steps_trained_in_current_epoch > 0:
+                steps_trained_in_current_epoch -= 1
+                continue
+
+            model.train()
+            batch = tuple(t.to(args.device) for t in batch)
+            threshold, regu_lambda = schedule_threshold(
+                step=global_step,
+                total_step=t_total,
+                warmup_steps=args.warmup_steps,
+                final_threshold=args.final_threshold,
+                initial_threshold=args.initial_threshold,
+                final_warmup=args.final_warmup,
+                initial_warmup=args.initial_warmup,
+                final_lambda=args.final_lambda,
+            )
+            # Global TopK
+            if args.global_topk:
+                if threshold == 1.0:
+                    threshold = -1e2  # Or an indefinitely low quantity
+                else:
+                    if (threshold_mem is None) or (global_step % args.global_topk_frequency_compute == 0):
+                        # Sort all the values to get the global topK
+                        concat = torch.cat(
+                            [param.view(-1) for name, param in model.named_parameters() if "mask_scores" in name]
+                        )
+                        n = concat.numel()
+                        kth = max(n - (int(n * threshold) + 1), 1)
+                        threshold_mem = concat.kthvalue(kth).values.item()
+                        threshold = threshold_mem
+                    else:
+                        threshold = threshold_mem
+            inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": batch[3]}
+            if args.model_type != "distilbert":
+                inputs["token_type_ids"] = (
+                    batch[2] if args.model_type in ["bert", "masked_bert", "xlnet", "albert"] else None
+                )  # XLM, DistilBERT, RoBERTa, and XLM-RoBERTa don't use segment_ids
+
+            if "masked" in args.model_type:
+                inputs["threshold"] = threshold
+
+            outputs = model(**inputs)
+            loss, logits_stu = outputs  # model outputs are always tuple in transformers (see doc)
+
+            # Distillation loss
+            if teacher is not None:
+                if "token_type_ids" not in inputs:
+                    inputs["token_type_ids"] = None if args.teacher_type == "xlm" else batch[2]
+                with torch.no_grad():
+                    (logits_tea,) = teacher(
+                        input_ids=inputs["input_ids"],
+                        token_type_ids=inputs["token_type_ids"],
+                        attention_mask=inputs["attention_mask"],
+                    )
+
+                loss_logits = F.kl_div(
+                    input=F.log_softmax(logits_stu / args.temperature, dim=-1),
+                    target=F.softmax(logits_tea / args.temperature, dim=-1),
+                    reduction="batchmean",
+                ) * (args.temperature ** 2)
+
+                loss = args.alpha_distil * loss_logits + args.alpha_ce * loss
+
+            # Regularization
+            if args.regularization is not None:
+                regu_ = regularization(model=model, mode=args.regularization)
+                loss = loss + regu_lambda * regu_
+
+            if args.n_gpu > 1:
+                loss = loss.mean()  # mean() to average on multi-gpu parallel training
+            if args.gradient_accumulation_steps > 1:
+                loss = loss / args.gradient_accumulation_steps
+
+            if args.fp16:
+                with amp.scale_loss(loss, optimizer) as scaled_loss:
+                    scaled_loss.backward()
+            else:
+                loss.backward()
+
+            tr_loss += loss.item()
+            if (step + 1) % args.gradient_accumulation_steps == 0 or (
+                # last step in epoch but step is always smaller than gradient_accumulation_steps
+                len(epoch_iterator) <= args.gradient_accumulation_steps
+                and (step + 1) == len(epoch_iterator)
+            ):
+                if args.fp16:
+                    torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
+                else:
+                    torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
+
+                if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
+                    tb_writer.add_scalar("threshold", threshold, global_step)
+                    for name, param in model.named_parameters():
+                        if not param.requires_grad:
+                            continue
+                        tb_writer.add_scalar("parameter_mean/" + name, param.data.mean(), global_step)
+                        tb_writer.add_scalar("parameter_std/" + name, param.data.std(), global_step)
+                        tb_writer.add_scalar("parameter_min/" + name, param.data.min(), global_step)
+                        tb_writer.add_scalar("parameter_max/" + name, param.data.max(), global_step)
+                        tb_writer.add_scalar("grad_mean/" + name, param.grad.data.mean(), global_step)
+                        tb_writer.add_scalar("grad_std/" + name, param.grad.data.std(), global_step)
+                        if args.regularization is not None and "mask_scores" in name:
+                            if args.regularization == "l1":
+                                perc = (torch.sigmoid(param) > threshold).sum().item() / param.numel()
+                            elif args.regularization == "l0":
+                                perc = (torch.sigmoid(param - 2 / 3 * np.log(0.1 / 1.1))).sum().item() / param.numel()
+                            tb_writer.add_scalar("retained_weights_perc/" + name, perc, global_step)
+
+                optimizer.step()
+                scheduler.step()  # Update learning rate schedule
+                model.zero_grad()
+                global_step += 1
+
+                if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
+                    logs = {}
+                    if (
+                        args.local_rank == -1 and args.evaluate_during_training
+                    ):  # Only evaluate when single GPU otherwise metrics may not average well
+                        results = evaluate(args, model, tokenizer)
+                        for key, value in results.items():
+                            eval_key = "eval_{}".format(key)
+                            logs[eval_key] = value
+
+                    loss_scalar = (tr_loss - logging_loss) / args.logging_steps
+                    learning_rate_scalar = scheduler.get_lr()
+                    logs["learning_rate"] = learning_rate_scalar[0]
+                    if len(learning_rate_scalar) > 1:
+                        for idx, lr in enumerate(learning_rate_scalar[1:]):
+                            logs[f"learning_rate/{idx+1}"] = lr
+                    logs["loss"] = loss_scalar
+                    if teacher is not None:
+                        logs["loss/distil"] = loss_logits.item()
+                    if args.regularization is not None:
+                        logs["loss/regularization"] = regu_.item()
+                    if (teacher is not None) or (args.regularization is not None):
+                        if (teacher is not None) and (args.regularization is not None):
+                            logs["loss/instant_ce"] = (
+                                loss.item()
+                                - regu_lambda * logs["loss/regularization"]
+                                - args.alpha_distil * logs["loss/distil"]
+                            ) / args.alpha_ce
+                        elif teacher is not None:
+                            logs["loss/instant_ce"] = (
+                                loss.item() - args.alpha_distil * logs["loss/distil"]
+                            ) / args.alpha_ce
+                        else:
+                            logs["loss/instant_ce"] = loss.item() - regu_lambda * logs["loss/regularization"]
+                    logging_loss = tr_loss
+
+                    for key, value in logs.items():
+                        tb_writer.add_scalar(key, value, global_step)
+                    print(json.dumps({**logs, **{"step": global_step}}))
+
+                if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
+                    # Save model checkpoint
+                    output_dir = os.path.join(args.output_dir, "checkpoint-{}".format(global_step))
+                    if not os.path.exists(output_dir):
+                        os.makedirs(output_dir)
+                    model_to_save = (
+                        model.module if hasattr(model, "module") else model
+                    )  # Take care of distributed/parallel training
+                    model_to_save.save_pretrained(output_dir)
+                    tokenizer.save_pretrained(output_dir)
+
+                    torch.save(args, os.path.join(output_dir, "training_args.bin"))
+                    logger.info("Saving model checkpoint to %s", output_dir)
+
+                    torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
+                    torch.save(scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
+                    logger.info("Saving optimizer and scheduler states to %s", output_dir)
+
+            if args.max_steps > 0 and global_step > args.max_steps:
+                epoch_iterator.close()
+                break
+        if args.max_steps > 0 and global_step > args.max_steps:
+            train_iterator.close()
+            break
+
+    if args.local_rank in [-1, 0]:
+        tb_writer.close()
+
+    return global_step, tr_loss / global_step
+
+
+def evaluate(args, model, tokenizer, prefix=""):
+    # Loop to handle MNLI double evaluation (matched, mis-matched)
+    eval_task_names = ("mnli", "mnli-mm") if args.task_name == "mnli" else (args.task_name,)
+    eval_outputs_dirs = (args.output_dir, args.output_dir + "/MM") if args.task_name == "mnli" else (args.output_dir,)
+
+    results = {}
+    for eval_task, eval_output_dir in zip(eval_task_names, eval_outputs_dirs):
+        eval_dataset = load_and_cache_examples(args, eval_task, tokenizer, evaluate=True)
+
+        if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]:
+            os.makedirs(eval_output_dir)
+
+        args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
+        # Note that DistributedSampler samples randomly
+        eval_sampler = SequentialSampler(eval_dataset)
+        eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)
+
+        # multi-gpu eval
+        if args.n_gpu > 1 and not isinstance(model, torch.nn.DataParallel):
+            model = torch.nn.DataParallel(model)
+
+        # Eval!
+        logger.info("***** Running evaluation {} *****".format(prefix))
+        logger.info("  Num examples = %d", len(eval_dataset))
+        logger.info("  Batch size = %d", args.eval_batch_size)
+        eval_loss = 0.0
+        nb_eval_steps = 0
+        preds = None
+        out_label_ids = None
+
+        # Global TopK
+        if args.global_topk:
+            threshold_mem = None
+
+        for batch in tqdm(eval_dataloader, desc="Evaluating"):
+            model.eval()
+            batch = tuple(t.to(args.device) for t in batch)
+
+            with torch.no_grad():
+                inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": batch[3]}
+                if args.model_type != "distilbert":
+                    inputs["token_type_ids"] = (
+                        batch[2] if args.model_type in ["bert", "masked_bert", "xlnet", "albert"] else None
+                    )  # XLM, DistilBERT, RoBERTa, and XLM-RoBERTa don't use segment_ids
+                if "masked" in args.model_type:
+                    inputs["threshold"] = args.final_threshold
+                    if args.global_topk:
+                        if threshold_mem is None:
+                            concat = torch.cat(
+                                [param.view(-1) for name, param in model.named_parameters() if "mask_scores" in name]
+                            )
+                            n = concat.numel()
+                            kth = max(n - (int(n * args.final_threshold) + 1), 1)
+                            threshold_mem = concat.kthvalue(kth).values.item()
+                        inputs["threshold"] = threshold_mem
+                outputs = model(**inputs)
+                tmp_eval_loss, logits = outputs[:2]
+
+                eval_loss += tmp_eval_loss.mean().item()
+            nb_eval_steps += 1
+            if preds is None:
+                preds = logits.detach().cpu().numpy()
+                out_label_ids = inputs["labels"].detach().cpu().numpy()
+            else:
+                preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
+                out_label_ids = np.append(out_label_ids, inputs["labels"].detach().cpu().numpy(), axis=0)
+
+        eval_loss = eval_loss / nb_eval_steps
+        if args.output_mode == "classification":
+            from scipy.special import softmax
+
+            probs = softmax(preds, axis=-1)
+            entropy = np.exp((-probs * np.log(probs)).sum(axis=-1).mean())
+            preds = np.argmax(preds, axis=1)
+        elif args.output_mode == "regression":
+            preds = np.squeeze(preds)
+        result = compute_metrics(eval_task, preds, out_label_ids)
+        results.update(result)
+        if entropy is not None:
+            result["eval_avg_entropy"] = entropy
+
+        output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
+        with open(output_eval_file, "w") as writer:
+            logger.info("***** Eval results {} *****".format(prefix))
+            for key in sorted(result.keys()):
+                logger.info("  %s = %s", key, str(result[key]))
+                writer.write("%s = %s\n" % (key, str(result[key])))
+
+    return results
+
+
+def load_and_cache_examples(args, task, tokenizer, evaluate=False):
+    if args.local_rank not in [-1, 0] and not evaluate:
+        torch.distributed.barrier()  # Make sure only the first process in distributed training process the dataset, and the others will use the cache
+
+    processor = processors[task]()
+    output_mode = output_modes[task]
+    # Load data features from cache or dataset file
+    cached_features_file = os.path.join(
+        args.data_dir,
+        "cached_{}_{}_{}_{}".format(
+            "dev" if evaluate else "train",
+            list(filter(None, args.model_name_or_path.split("/"))).pop(),
+            str(args.max_seq_length),
+            str(task),
+        ),
+    )
+    if os.path.exists(cached_features_file) and not args.overwrite_cache:
+        logger.info("Loading features from cached file %s", cached_features_file)
+        features = torch.load(cached_features_file)
+    else:
+        logger.info("Creating features from dataset file at %s", args.data_dir)
+        label_list = processor.get_labels()
+        if task in ["mnli", "mnli-mm"] and args.model_type in ["roberta", "xlmroberta"]:
+            # HACK(label indices are swapped in RoBERTa pretrained model)
+            label_list[1], label_list[2] = label_list[2], label_list[1]
+        examples = (
+            processor.get_dev_examples(args.data_dir) if evaluate else processor.get_train_examples(args.data_dir)
+        )
+        features = convert_examples_to_features(
+            examples, tokenizer, max_length=args.max_seq_length, label_list=label_list, output_mode=output_mode,
+        )
+        if args.local_rank in [-1, 0]:
+            logger.info("Saving features into cached file %s", cached_features_file)
+            torch.save(features, cached_features_file)
+
+    if args.local_rank == 0 and not evaluate:
+        torch.distributed.barrier()  # Make sure only the first process in distributed training process the dataset, and the others will use the cache
+
+    # Convert to Tensors and build dataset
+    all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
+    all_attention_mask = torch.tensor([f.attention_mask for f in features], dtype=torch.long)
+    all_token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long)
+    if output_mode == "classification":
+        all_labels = torch.tensor([f.label for f in features], dtype=torch.long)
+    elif output_mode == "regression":
+        all_labels = torch.tensor([f.label for f in features], dtype=torch.float)
+
+    dataset = TensorDataset(all_input_ids, all_attention_mask, all_token_type_ids, all_labels)
+    return dataset
+
+
+def main():
+    parser = argparse.ArgumentParser()
+
+    # Required parameters
+    parser.add_argument(
+        "--data_dir",
+        default=None,
+        type=str,
+        required=True,
+        help="The input data dir. Should contain the .tsv files (or other data files) for the task.",
+    )
+    parser.add_argument(
+        "--model_type",
+        default=None,
+        type=str,
+        required=True,
+        help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()),
+    )
+    parser.add_argument(
+        "--model_name_or_path",
+        default=None,
+        type=str,
+        required=True,
+        help="Path to pretrained model or model identifier from huggingface.co/models",
+    )
+    parser.add_argument(
+        "--task_name",
+        default=None,
+        type=str,
+        required=True,
+        help="The name of the task to train selected in the list: " + ", ".join(processors.keys()),
+    )
+    parser.add_argument(
+        "--output_dir",
+        default=None,
+        type=str,
+        required=True,
+        help="The output directory where the model predictions and checkpoints will be written.",
+    )
+    # Other parameters
+    parser.add_argument(
+        "--config_name", default="", type=str, help="Pretrained config name or path if not the same as model_name",
+    )
+    parser.add_argument(
+        "--tokenizer_name",
+        default="",
+        type=str,
+        help="Pretrained tokenizer name or path if not the same as model_name",
+    )
+    parser.add_argument(
+        "--cache_dir",
+        default="",
+        type=str,
+        help="Where do you want to store the pre-trained models downloaded from s3",
+    )
+    parser.add_argument(
+        "--max_seq_length",
+        default=128,
+        type=int,
+        help="The maximum total input sequence length after tokenization. Sequences longer "
+        "than this will be truncated, sequences shorter will be padded.",
+    )
+    parser.add_argument("--do_train", action="store_true", help="Whether to run training.")
+    parser.add_argument("--do_eval", action="store_true", help="Whether to run eval on the dev set.")
+    parser.add_argument(
+        "--evaluate_during_training", action="store_true", help="Run evaluation during training at each logging step.",
+    )
+    parser.add_argument(
+        "--do_lower_case", action="store_true", help="Set this flag if you are using an uncased model.",
+    )
+
+    parser.add_argument(
+        "--per_gpu_train_batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.",
+    )
+    parser.add_argument(
+        "--per_gpu_eval_batch_size", default=8, type=int, help="Batch size per GPU/CPU for evaluation.",
+    )
+    parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.")
+
+    # Pruning parameters
+    parser.add_argument(
+        "--mask_scores_learning_rate",
+        default=1e-2,
+        type=float,
+        help="The Adam initial learning rate of the mask scores.",
+    )
+    parser.add_argument(
+        "--initial_threshold", default=1.0, type=float, help="Initial value of the threshold (for scheduling)."
+    )
+    parser.add_argument(
+        "--final_threshold", default=0.7, type=float, help="Final value of the threshold (for scheduling)."
+    )
+    parser.add_argument(
+        "--initial_warmup",
+        default=1,
+        type=int,
+        help="Run `initial_warmup` * `warmup_steps` steps of threshold warmup during which threshold stays"
+        "at its `initial_threshold` value (sparsity schedule).",
+    )
+    parser.add_argument(
+        "--final_warmup",
+        default=2,
+        type=int,
+        help="Run `final_warmup` * `warmup_steps` steps of threshold cool-down during which threshold stays"
+        "at its final_threshold value (sparsity schedule).",
+    )
+
+    parser.add_argument(
+        "--pruning_method",
+        default="topK",
+        type=str,
+        help="Pruning Method (l0 = L0 regularization, magnitude = Magnitude pruning, topK = Movement pruning, sigmoied_threshold = Soft movement pruning).",
+    )
+    parser.add_argument(
+        "--mask_init",
+        default="constant",
+        type=str,
+        help="Initialization method for the mask scores. Choices: constant, uniform, kaiming.",
+    )
+    parser.add_argument(
+        "--mask_scale", default=0.0, type=float, help="Initialization parameter for the chosen initialization method."
+    )
+
+    parser.add_argument("--regularization", default=None, help="Add L0 or L1 regularization to the mask scores.")
+    parser.add_argument(
+        "--final_lambda",
+        default=0.0,
+        type=float,
+        help="Regularization intensity (used in conjunction with `regulariation`.",
+    )
+
+    parser.add_argument("--global_topk", action="store_true", help="Global TopK on the Scores.")
+    parser.add_argument(
+        "--global_topk_frequency_compute",
+        default=25,
+        type=int,
+        help="Frequency at which we compute the TopK global threshold.",
+    )
+
+    # Distillation parameters (optional)
+    parser.add_argument(
+        "--teacher_type",
+        default=None,
+        type=str,
+        help="Teacher type. Teacher tokenizer and student (model) tokenizer must output the same tokenization. Only for distillation.",
+    )
+    parser.add_argument(
+        "--teacher_name_or_path",
+        default=None,
+        type=str,
+        help="Path to the already fine-tuned teacher model. Only for distillation.",
+    )
+    parser.add_argument(
+        "--alpha_ce", default=0.5, type=float, help="Cross entropy loss linear weight. Only for distillation."
+    )
+    parser.add_argument(
+        "--alpha_distil", default=0.5, type=float, help="Distillation loss linear weight. Only for distillation."
+    )
+    parser.add_argument(
+        "--temperature", default=2.0, type=float, help="Distillation temperature. Only for distillation."
+    )
+
+    parser.add_argument(
+        "--gradient_accumulation_steps",
+        type=int,
+        default=1,
+        help="Number of updates steps to accumulate before performing a backward/update pass.",
+    )
+    parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.")
+    parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.")
+    parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.")
+    parser.add_argument(
+        "--num_train_epochs", default=3.0, type=float, help="Total number of training epochs to perform.",
+    )
+    parser.add_argument(
+        "--max_steps",
+        default=-1,
+        type=int,
+        help="If > 0: set total number of training steps to perform. Override num_train_epochs.",
+    )
+    parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.")
+
+    parser.add_argument("--logging_steps", type=int, default=50, help="Log every X updates steps.")
+    parser.add_argument("--save_steps", type=int, default=50, help="Save checkpoint every X updates steps.")
+    parser.add_argument(
+        "--eval_all_checkpoints",
+        action="store_true",
+        help="Evaluate all checkpoints starting with the same prefix as model_name ending and ending with step number",
+    )
+    parser.add_argument("--no_cuda", action="store_true", help="Avoid using CUDA when available")
+    parser.add_argument(
+        "--overwrite_output_dir", action="store_true", help="Overwrite the content of the output directory",
+    )
+    parser.add_argument(
+        "--overwrite_cache", action="store_true", help="Overwrite the cached training and evaluation sets",
+    )
+    parser.add_argument("--seed", type=int, default=42, help="random seed for initialization")
+
+    parser.add_argument(
+        "--fp16",
+        action="store_true",
+        help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit",
+    )
+    parser.add_argument(
+        "--fp16_opt_level",
+        type=str,
+        default="O1",
+        help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
+        "See details at https://nvidia.github.io/apex/amp.html",
+    )
+    parser.add_argument("--local_rank", type=int, default=-1, help="For distributed training: local_rank")
+
+    args = parser.parse_args()
+
+    # Regularization
+    if args.regularization == "null":
+        args.regularization = None
+
+    if (
+        os.path.exists(args.output_dir)
+        and os.listdir(args.output_dir)
+        and args.do_train
+        and not args.overwrite_output_dir
+    ):
+        raise ValueError(
+            f"Output directory ({args.output_dir}) already exists and is not empty. Use --overwrite_output_dir to overcome."
+        )
+
+    # Setup CUDA, GPU & distributed training
+    if args.local_rank == -1 or args.no_cuda:
+        device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
+        args.n_gpu = 0 if args.no_cuda else torch.cuda.device_count()
+    else:  # Initializes the distributed backend which will take care of sychronizing nodes/GPUs
+        torch.cuda.set_device(args.local_rank)
+        device = torch.device("cuda", args.local_rank)
+        torch.distributed.init_process_group(backend="nccl")
+        args.n_gpu = 1
+    args.device = device
+
+    # Setup logging
+    logging.basicConfig(
+        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
+        datefmt="%m/%d/%Y %H:%M:%S",
+        level=logging.INFO if args.local_rank in [-1, 0] else logging.WARN,
+    )
+    logger.warning(
+        "Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
+        args.local_rank,
+        device,
+        args.n_gpu,
+        bool(args.local_rank != -1),
+        args.fp16,
+    )
+
+    # Set seed
+    set_seed(args)
+
+    # Prepare GLUE task
+    args.task_name = args.task_name.lower()
+    if args.task_name not in processors:
+        raise ValueError("Task not found: %s" % (args.task_name))
+    processor = processors[args.task_name]()
+    args.output_mode = output_modes[args.task_name]
+    label_list = processor.get_labels()
+    num_labels = len(label_list)
+
+    # Load pretrained model and tokenizer
+    if args.local_rank not in [-1, 0]:
+        torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab
+
+    args.model_type = args.model_type.lower()
+    config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
+    config = config_class.from_pretrained(
+        args.config_name if args.config_name else args.model_name_or_path,
+        num_labels=num_labels,
+        finetuning_task=args.task_name,
+        cache_dir=args.cache_dir if args.cache_dir else None,
+        pruning_method=args.pruning_method,
+        mask_init=args.mask_init,
+        mask_scale=args.mask_scale,
+    )
+    tokenizer = tokenizer_class.from_pretrained(
+        args.tokenizer_name if args.tokenizer_name else args.model_name_or_path,
+        cache_dir=args.cache_dir if args.cache_dir else None,
+        do_lower_case=args.do_lower_case,
+    )
+    model = model_class.from_pretrained(
+        args.model_name_or_path,
+        from_tf=bool(".ckpt" in args.model_name_or_path),
+        config=config,
+        cache_dir=args.cache_dir if args.cache_dir else None,
+    )
+
+    if args.teacher_type is not None:
+        assert args.teacher_name_or_path is not None
+        assert args.alpha_distil > 0.0
+        assert args.alpha_distil + args.alpha_ce > 0.0
+        teacher_config_class, teacher_model_class, _ = MODEL_CLASSES[args.teacher_type]
+        teacher_config = teacher_config_class.from_pretrained(args.teacher_name_or_path)
+        teacher = teacher_model_class.from_pretrained(
+            args.teacher_name_or_path,
+            from_tf=False,
+            config=teacher_config,
+            cache_dir=args.cache_dir if args.cache_dir else None,
+        )
+        teacher.to(args.device)
+    else:
+        teacher = None
+
+    if args.local_rank == 0:
+        torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab
+
+    model.to(args.device)
+
+    logger.info("Training/evaluation parameters %s", args)
+
+    # Training
+    if args.do_train:
+        train_dataset = load_and_cache_examples(args, args.task_name, tokenizer, evaluate=False)
+        global_step, tr_loss = train(args, train_dataset, model, tokenizer, teacher=teacher)
+        logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
+
+    # Saving best-practices: if you use defaults names for the model, you can reload it using from_pretrained()
+    if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
+        # Create output directory if needed
+        if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]:
+            os.makedirs(args.output_dir)
+
+        logger.info("Saving model checkpoint to %s", args.output_dir)
+        # Save a trained model, configuration and tokenizer using `save_pretrained()`.
+        # They can then be reloaded using `from_pretrained()`
+        model_to_save = (
+            model.module if hasattr(model, "module") else model
+        )  # Take care of distributed/parallel training
+        model_to_save.save_pretrained(args.output_dir)
+        tokenizer.save_pretrained(args.output_dir)
+
+        # Good practice: save your training arguments together with the trained model
+        torch.save(args, os.path.join(args.output_dir, "training_args.bin"))
+
+        # Load a trained model and vocabulary that you have fine-tuned
+        model = model_class.from_pretrained(args.output_dir)
+        tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
+        model.to(args.device)
+
+    # Evaluation
+    results = {}
+    if args.do_eval and args.local_rank in [-1, 0]:
+        tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
+        checkpoints = [args.output_dir]
+        if args.eval_all_checkpoints:
+            checkpoints = list(
+                os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True))
+            )
+            logging.getLogger("transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging
+        logger.info("Evaluate the following checkpoints: %s", checkpoints)
+        for checkpoint in checkpoints:
+            global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
+            prefix = checkpoint.split("/")[-1] if checkpoint.find("checkpoint") != -1 else ""
+
+            model = model_class.from_pretrained(checkpoint)
+            model.to(args.device)
+            result = evaluate(args, model, tokenizer, prefix=prefix)
+            result = dict((k + "_{}".format(global_step), v) for k, v in result.items())
+            results.update(result)
+
+    return results
+
+
+if __name__ == "__main__":
+    main()
--- a/examples/movement-pruning/masked_run_squad.py
+++ b/examples/movement-pruning/masked_run_squad.py
--- a/examples/movement-pruning/requirements.txt
+++ b/examples/movement-pruning/requirements.txt
@@ -0,0 +1,6 @@
+torch>=1.4.0
+-e git+https://github.com/huggingface/transformers.git@352d5472b0c1dec0f420d606d16747d851b4bda8#egg=transformers
+knockknock>=0.1.8.1
+h5py>=2.10.0
+numpy>=1.18.2
+scipy>=1.4.1
--- a/examples/multiple-choice/README.md
+++ b/examples/multiple-choice/README.md
@@ -19,7 +19,7 @@ python ./examples/multiple-choice/run_multiple_choice.py \
 --max_seq_length 80 \
 --output_dir models_bert/swag_base \
 --per_gpu_eval_batch_size=16 \
--per_gpu_train_batch_size=16 \
+--per_device_train_batch_size=16 \
 --gradient_accumulation_steps 2 \
 --overwrite_output
 ```
@@ -46,7 +46,7 @@ python ./examples/multiple-choice/run_tf_multiple_choice.py \
 --max_seq_length 80 \
 --output_dir models_bert/swag_base \
 --per_gpu_eval_batch_size=16 \
--per_gpu_train_batch_size=16 \
+--per_device_train_batch_size=16 \
 --logging-dir logs \
 --gradient_accumulation_steps 2 \
 --overwrite_output
--- a/examples/multiple-choice/run_multiple_choice.py
+++ b/examples/multiple-choice/run_multiple_choice.py
@@ -159,7 +159,6 @@ def main():
            max_seq_length=data_args.max_seq_length,
            overwrite_cache=data_args.overwrite_cache,
            mode=Split.train,
-            local_rank=training_args.local_rank,
        )
        if training_args.do_train
        else None
@@ -172,7 +171,6 @@ def main():
            max_seq_length=data_args.max_seq_length,
            overwrite_cache=data_args.overwrite_cache,
            mode=Split.dev,
-            local_rank=training_args.local_rank,
        )
        if training_args.do_eval
        else None
@@ -204,19 +202,20 @@ def main():

    # Evaluation
    results = {}
-    if training_args.do_eval and training_args.local_rank in [-1, 0]:
+    if training_args.do_eval:
        logger.info("*** Evaluate ***")

        result = trainer.evaluate()

        output_eval_file = os.path.join(training_args.output_dir, "eval_results.txt")
-        with open(output_eval_file, "w") as writer:
-            logger.info("***** Eval results *****")
-            for key, value in result.items():
-                logger.info("  %s = %s", key, value)
-                writer.write("%s = %s\n" % (key, value))
+        if trainer.is_world_master():
+            with open(output_eval_file, "w") as writer:
+                logger.info("***** Eval results *****")
+                for key, value in result.items():
+                    logger.info("  %s = %s", key, value)
+                    writer.write("%s = %s\n" % (key, value))

-            results.update(result)
+                results.update(result)

    return results

--- a/examples/multiple-choice/utils_multiple_choice.py
+++ b/examples/multiple-choice/utils_multiple_choice.py
@@ -26,6 +26,7 @@ from enum import Enum
 from typing import List, Optional

 import tqdm
+from filelock import FileLock

 from transformers import PreTrainedTokenizer, is_tf_available, is_torch_available

@@ -77,7 +78,6 @@ class Split(Enum):
 if is_torch_available():
    import torch
    from torch.utils.data.dataset import Dataset
-    from transformers import torch_distributed_zero_first

    class MultipleChoiceDataset(Dataset):
        """
@@ -95,7 +95,6 @@ if is_torch_available():
            max_seq_length: Optional[int] = None,
            overwrite_cache=False,
            mode: Split = Split.train,
-            local_rank=-1,
        ):
            processor = processors[task]()

@@ -103,9 +102,11 @@ if is_torch_available():
                data_dir,
                "cached_{}_{}_{}_{}".format(mode.value, tokenizer.__class__.__name__, str(max_seq_length), task,),
            )
-            with torch_distributed_zero_first(local_rank):
-                # Make sure only the first process in distributed training processes the dataset,
-                # and the others will use the cache.
+
+            # Make sure only the first process in distributed training processes the dataset,
+            # and the others will use the cache.
+            lock_path = cached_features_file + ".lock"
+            with FileLock(lock_path):

                if os.path.exists(cached_features_file) and not overwrite_cache:
                    logger.info(f"Loading features from cached file {cached_features_file}")
@@ -130,9 +131,8 @@ if is_torch_available():
                        pad_token=tokenizer.pad_token_id,
                        pad_token_segment_id=tokenizer.pad_token_type_id,
                    )
-                    if local_rank in [-1, 0]:
-                        logger.info("Saving features into cached file %s", cached_features_file)
-                        torch.save(self.features, cached_features_file)
+                    logger.info("Saving features into cached file %s", cached_features_file)
+                    torch.save(self.features, cached_features_file)

        def __len__(self):
            return len(self.features)
@@ -535,7 +535,12 @@ def convert_examples_to_features(
                text_b = example.question + " " + ending

            inputs = tokenizer.encode_plus(
-                text_a, text_b, add_special_tokens=True, max_length=max_length, pad_to_max_length=True,
+                text_a,
+                text_b,
+                add_special_tokens=True,
+                max_length=max_length,
+                pad_to_max_length=True,
+                return_overflowing_tokens=True,
            )
            if "num_truncated_tokens" in inputs and inputs["num_truncated_tokens"] > 0:
                logger.info(
--- a/examples/question-answering/README.md
+++ b/examples/question-answering/README.md
@@ -28,6 +28,7 @@ python run_squad.py \
  --model_name_or_path bert-base-uncased \
  --do_train \
  --do_eval \
+  --do_lower_case \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --per_gpu_train_batch_size 12 \
@@ -56,6 +57,7 @@ python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answer
    --model_name_or_path bert-large-uncased-whole-word-masking \
    --do_train \
    --do_eval \
+    --do_lower_case \
    --train_file $SQUAD_DIR/train-v1.1.json \
    --predict_file $SQUAD_DIR/dev-v1.1.json \
    --learning_rate 3e-5 \
--- a/examples/question-answering/run_squad.py
+++ b/examples/question-answering/run_squad.py
@@ -58,8 +58,6 @@ logger = logging.getLogger(__name__)
 MODEL_CONFIG_CLASSES = list(MODEL_FOR_QUESTION_ANSWERING_MAPPING.keys())
 MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)

-ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in MODEL_CONFIG_CLASSES), (),)
-

 def set_seed(args):
    random.seed(args.seed)
@@ -491,7 +489,7 @@ def main():
        default=None,
        type=str,
        required=True,
-        help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS),
+        help="Path to pretrained model or model identifier from huggingface.co/models",
    )
    parser.add_argument(
        "--output_dir",
--- a/examples/requirements.txt
+++ b/examples/requirements.txt
@@ -6,3 +6,4 @@ sacrebleu
 rouge-score
 tensorflow_datasets
 pytorch-lightning==0.7.3  # April 10, 2020 release
+matplotlib
--- a/examples/summarization/bart/evaluate_cnn.py
+++ b/examples/summarization/bart/evaluate_cnn.py
@@ -21,7 +21,7 @@ def generate_summaries(
 ):
    fout = Path(out_file).open("w")
    model = BartForConditionalGeneration.from_pretrained(model_name).to(device)
-    tokenizer = BartTokenizer.from_pretrained("bart-large")
+    tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")

    max_length = 140
    min_length = 55
@@ -54,7 +54,7 @@ def run_generate():
        "output_path", type=str, help="where to save summaries",
    )
    parser.add_argument(
-        "model_name", type=str, default="bart-large-cnn", help="like bart-large-cnn",
+        "model_name", type=str, default="facebook/bart-large-cnn", help="like bart-large-cnn",
    )
    parser.add_argument(
        "--device", type=str, required=False, default=DEFAULT_DEVICE, help="cuda, cuda:1, cpu etc.",
--- a/examples/summarization/bart/test_bart_examples.py
+++ b/examples/summarization/bart/test_bart_examples.py
@@ -129,7 +129,7 @@ class TestBartExamples(unittest.TestCase):
        summaries = ["A very interesting story about what I ate for lunch.", "Avocado, celery, turkey, coffee"]
        _dump_articles((tmp_dir / "train.source"), articles)
        _dump_articles((tmp_dir / "train.target"), summaries)
-        tokenizer = BartTokenizer.from_pretrained("bart-large")
+        tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
        max_len_source = max(len(tokenizer.encode(a)) for a in articles)
        max_len_target = max(len(tokenizer.encode(a)) for a in summaries)
        trunc_target = 4
--- a/examples/summarization/bertabs/configuration_bertabs.py
+++ b/examples/summarization/bertabs/configuration_bertabs.py
@@ -61,7 +61,6 @@ class BertAbsConfig(PretrainedConfig):
            the decoder.
    """

-    pretrained_config_archive_map = BERTABS_FINETUNED_CONFIG_MAP
    model_type = "bertabs"

    def __init__(
--- a/examples/summarization/bertabs/modeling_bertabs.py
+++ b/examples/summarization/bertabs/modeling_bertabs.py
@@ -33,14 +33,13 @@ from transformers import BertConfig, BertModel, PreTrainedModel

 MAX_SIZE = 5000

-BERTABS_FINETUNED_MODEL_MAP = {
-    "bertabs-finetuned-cnndm": "https://cdn.huggingface.co/remi/bertabs-finetuned-cnndm-extractive-abstractive-summarization/pytorch_model.bin",
-}
+BERTABS_FINETUNED_MODEL_ARCHIVE_LIST = [
+    "remi/bertabs-finetuned-cnndm-extractive-abstractive-summarization",
+]


 class BertAbsPreTrainedModel(PreTrainedModel):
    config_class = BertAbsConfig
-    pretrained_model_archive_map = BERTABS_FINETUNED_MODEL_MAP
    load_tf_weights = False
    base_model_prefix = "bert"

--- a/examples/test_examples.py
+++ b/examples/test_examples.py
@@ -61,8 +61,8 @@ class ExamplesTests(unittest.TestCase):
            --do_train
            --do_eval
            --output_dir ./tests/fixtures/tests_samples/temp_dir
-            --per_gpu_train_batch_size=2
-            --per_gpu_eval_batch_size=1
+            --per_device_train_batch_size=2
+            --per_device_eval_batch_size=1
            --learning_rate=1e-4
            --max_steps=10
            --warmup_steps=2
--- a/examples/text-classification/README.md
+++ b/examples/text-classification/README.md
@@ -68,7 +68,7 @@ python run_glue.py \
  --do_eval \
  --data_dir $GLUE_DIR/$TASK_NAME \
  --max_seq_length 128 \
-  --per_gpu_train_batch_size 32 \
+  --per_device_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir /tmp/$TASK_NAME/
@@ -141,7 +141,7 @@ python run_glue.py \
  --do_eval \
  --data_dir $GLUE_DIR/MRPC/ \
  --max_seq_length 128 \
-  --per_gpu_train_batch_size 32 \
+  --per_device_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir /tmp/mrpc_output/
@@ -166,7 +166,7 @@ python run_glue.py \
  --do_eval \
  --data_dir $GLUE_DIR/MRPC/ \
  --max_seq_length 128 \
-  --per_gpu_train_batch_size 32 \
+  --per_device_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir /tmp/mrpc_output/ \
@@ -189,7 +189,7 @@ python -m torch.distributed.launch \
    --do_eval \
    --data_dir $GLUE_DIR/MRPC/ \
    --max_seq_length 128 \
-    --per_gpu_train_batch_size 8 \
+    --per_device_train_batch_size 8 \
    --learning_rate 2e-5 \
    --num_train_epochs 3.0 \
    --output_dir /tmp/mrpc_output/
@@ -221,7 +221,7 @@ python -m torch.distributed.launch \
    --do_eval \
    --data_dir $GLUE_DIR/MNLI/ \
    --max_seq_length 128 \
-    --per_gpu_train_batch_size 8 \
+    --per_device_train_batch_size 8 \
    --learning_rate 2e-5 \
    --num_train_epochs 3.0 \
    --output_dir output_dir \
@@ -258,7 +258,7 @@ TEST RESULTS {'val_loss': tensor(0.0707), 'precision': 0.852427800698191, 'recal

 Based on the script [`run_xnli.py`](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_xnli.py).

-[XNLI](https://www.nyu.edu/projects/bowman/xnli/) is crowd-sourced dataset based on [MultiNLI](http://www.nyu.edu/projects/bowman/multinli/). It is an evaluation benchmark for cross-lingual text representations. Pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-resource language such as English and low-resource languages such as Swahili).
+[XNLI](https://www.nyu.edu/projects/bowman/xnli/) is a crowd-sourced dataset based on [MultiNLI](http://www.nyu.edu/projects/bowman/multinli/). It is an evaluation benchmark for cross-lingual text representations. Pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-resource language such as English and low-resource languages such as Swahili).

 #### Fine-tuning on XNLI

@@ -273,14 +273,13 @@ on a single tesla V100 16GB. The data for XNLI can be downloaded with the follow
 export XNLI_DIR=/path/to/XNLI

 python run_xnli.py \
-  --model_type bert \
  --model_name_or_path bert-base-multilingual-cased \
  --language de \
  --train_language en \
  --do_train \
  --do_eval \
  --data_dir $XNLI_DIR \
-  --per_gpu_train_batch_size 32 \
+  --per_device_train_batch_size 32 \
  --learning_rate 5e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 128 \
--- a/examples/text-classification/run_glue.py
+++ b/examples/text-classification/run_glue.py
@@ -135,7 +135,8 @@ def main():

    # Get datasets
    train_dataset = GlueDataset(data_args, tokenizer=tokenizer) if training_args.do_train else None
-    eval_dataset = GlueDataset(data_args, tokenizer=tokenizer, evaluate=True) if training_args.do_eval else None
+    eval_dataset = GlueDataset(data_args, tokenizer=tokenizer, mode="dev") if training_args.do_eval else None
+    test_dataset = GlueDataset(data_args, tokenizer=tokenizer, mode="test") if training_args.do_predict else None

    def compute_metrics(p: EvalPrediction) -> Dict:
        if output_mode == "classification":
@@ -165,31 +166,57 @@ def main():
            tokenizer.save_pretrained(training_args.output_dir)

    # Evaluation
-    results = {}
-    if training_args.do_eval and training_args.local_rank in [-1, 0]:
+    eval_results = {}
+    if training_args.do_eval:
        logger.info("*** Evaluate ***")

        # Loop to handle MNLI double evaluation (matched, mis-matched)
        eval_datasets = [eval_dataset]
        if data_args.task_name == "mnli":
            mnli_mm_data_args = dataclasses.replace(data_args, task_name="mnli-mm")
-            eval_datasets.append(GlueDataset(mnli_mm_data_args, tokenizer=tokenizer, evaluate=True))
+            eval_datasets.append(GlueDataset(mnli_mm_data_args, tokenizer=tokenizer, mode="dev"))

        for eval_dataset in eval_datasets:
-            result = trainer.evaluate(eval_dataset=eval_dataset)
+            eval_result = trainer.evaluate(eval_dataset=eval_dataset)

            output_eval_file = os.path.join(
                training_args.output_dir, f"eval_results_{eval_dataset.args.task_name}.txt"
            )
-            with open(output_eval_file, "w") as writer:
-                logger.info("***** Eval results {} *****".format(eval_dataset.args.task_name))
-                for key, value in result.items():
-                    logger.info("  %s = %s", key, value)
-                    writer.write("%s = %s\n" % (key, value))
+            if trainer.is_world_master():
+                with open(output_eval_file, "w") as writer:
+                    logger.info("***** Eval results {} *****".format(eval_dataset.args.task_name))
+                    for key, value in eval_result.items():
+                        logger.info("  %s = %s", key, value)
+                        writer.write("%s = %s\n" % (key, value))

-            results.update(result)
+            eval_results.update(eval_result)

-    return results
+    if training_args.do_predict:
+        logging.info("*** Test ***")
+        test_datasets = [test_dataset]
+        if data_args.task_name == "mnli":
+            mnli_mm_data_args = dataclasses.replace(data_args, task_name="mnli-mm")
+            test_datasets.append(GlueDataset(mnli_mm_data_args, tokenizer=tokenizer, mode="test"))
+
+        for test_dataset in test_datasets:
+            predictions = trainer.predict(test_dataset=test_dataset).predictions
+            if output_mode == "classification":
+                predictions = np.argmax(predictions, axis=1)
+
+            output_test_file = os.path.join(
+                training_args.output_dir, f"test_results_{test_dataset.args.task_name}.txt"
+            )
+            if trainer.is_world_master():
+                with open(output_test_file, "w") as writer:
+                    logger.info("***** Test results {} *****".format(test_dataset.args.task_name))
+                    writer.write("index\tprediction\n")
+                    for index, item in enumerate(predictions):
+                        if output_mode == "regression":
+                            writer.write("%d\t%3.3f\n" % (index, item))
+                        else:
+                            item = test_dataset.get_labels()[item]
+                            writer.write("%d\t%s\n" % (index, item))
+    return eval_results


 def _mp_fn(index):
--- a/examples/text-classification/run_xnli.py
+++ b/examples/text-classification/run_xnli.py
@@ -13,7 +13,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-""" Finetuning multi-lingual models on XNLI (Bert, DistilBERT, XLM).
+""" Finetuning multi-lingual models on XNLI (e.g. Bert, DistilBERT, XLM).
    Adapted from `examples/text-classification/run_glue.py`"""


@@ -32,15 +32,9 @@ from tqdm import tqdm, trange
 from transformers import (
    WEIGHTS_NAME,
    AdamW,
-    BertConfig,
-    BertForSequenceClassification,
-    BertTokenizer,
-    DistilBertConfig,
-    DistilBertForSequenceClassification,
-    DistilBertTokenizer,
-    XLMConfig,
-    XLMForSequenceClassification,
-    XLMTokenizer,
+    AutoConfig,
+    AutoModelForSequenceClassification,
+    AutoTokenizer,
    get_linear_schedule_with_warmup,
 )
 from transformers import glue_convert_examples_to_features as convert_examples_to_features
@@ -57,16 +51,6 @@ except ImportError:

 logger = logging.getLogger(__name__)

-ALL_MODELS = sum(
-    (tuple(conf.pretrained_config_archive_map.keys()) for conf in (BertConfig, DistilBertConfig, XLMConfig)), ()
-)
-
-MODEL_CLASSES = {
-    "bert": (BertConfig, BertForSequenceClassification, BertTokenizer),
-    "xlm": (XLMConfig, XLMForSequenceClassification, XLMTokenizer),
-    "distilbert": (DistilBertConfig, DistilBertForSequenceClassification, DistilBertTokenizer),
-}
-

 def set_seed(args):
    random.seed(args.seed)
@@ -377,19 +361,12 @@ def main():
        required=True,
        help="The input data dir. Should contain the .tsv files (or other data files) for the task.",
    )
-    parser.add_argument(
-        "--model_type",
-        default=None,
-        type=str,
-        required=True,
-        help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()),
-    )
    parser.add_argument(
        "--model_name_or_path",
        default=None,
        type=str,
        required=True,
-        help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS),
+        help="Path to pretrained model or model identifier from huggingface.co/models",
    )
    parser.add_argument(
        "--language",
@@ -421,7 +398,7 @@ def main():
    )
    parser.add_argument(
        "--cache_dir",
-        default="",
+        default=None,
        type=str,
        help="Where do you want to store the pre-trained models downloaded from s3",
    )
@@ -562,24 +539,23 @@ def main():
    if args.local_rank not in [-1, 0]:
        torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab

-    args.model_type = args.model_type.lower()
-    config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
-    config = config_class.from_pretrained(
+    config = AutoConfig.from_pretrained(
        args.config_name if args.config_name else args.model_name_or_path,
        num_labels=num_labels,
        finetuning_task=args.task_name,
-        cache_dir=args.cache_dir if args.cache_dir else None,
+        cache_dir=args.cache_dir,
    )
-    tokenizer = tokenizer_class.from_pretrained(
+    args.model_type = config.model_type
+    tokenizer = AutoTokenizer.from_pretrained(
        args.tokenizer_name if args.tokenizer_name else args.model_name_or_path,
        do_lower_case=args.do_lower_case,
-        cache_dir=args.cache_dir if args.cache_dir else None,
+        cache_dir=args.cache_dir,
    )
-    model = model_class.from_pretrained(
+    model = AutoModelForSequenceClassification.from_pretrained(
        args.model_name_or_path,
        from_tf=bool(".ckpt" in args.model_name_or_path),
        config=config,
-        cache_dir=args.cache_dir if args.cache_dir else None,
+        cache_dir=args.cache_dir,
    )

    if args.local_rank == 0:
@@ -614,14 +590,13 @@ def main():
        torch.save(args, os.path.join(args.output_dir, "training_args.bin"))

        # Load a trained model and vocabulary that you have fine-tuned
-        model = model_class.from_pretrained(args.output_dir)
-        tokenizer = tokenizer_class.from_pretrained(args.output_dir)
+        model = AutoModelForSequenceClassification.from_pretrained(args.output_dir)
+        tokenizer = AutoTokenizer.from_pretrained(args.output_dir)
        model.to(args.device)

    # Evaluation
    results = {}
    if args.do_eval and args.local_rank in [-1, 0]:
-        tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
        checkpoints = [args.output_dir]
        if args.eval_all_checkpoints:
            checkpoints = list(
@@ -633,7 +608,7 @@ def main():
            global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
            prefix = checkpoint.split("/")[-1] if checkpoint.find("checkpoint") != -1 else ""

-            model = model_class.from_pretrained(checkpoint)
+            model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
            model.to(args.device)
            result = evaluate(args, model, tokenizer, prefix=prefix)
            result = dict((k + "_{}".format(global_step), v) for k, v in result.items())
--- a/examples/token-classification/README.md
+++ b/examples/token-classification/README.md
@@ -69,7 +69,7 @@ python3 run_ner.py --data_dir ./ \
 --output_dir $OUTPUT_DIR \
 --max_seq_length  $MAX_LENGTH \
 --num_train_epochs $NUM_EPOCHS \
--per_gpu_train_batch_size $BATCH_SIZE \
+--per_device_train_batch_size $BATCH_SIZE \
 --save_steps $SAVE_STEPS \
 --seed $SEED \
 --do_train \
@@ -91,7 +91,7 @@ Instead of passing all parameters via commandline arguments, the `run_ner.py` sc
    "output_dir": "germeval-model",
    "max_seq_length": 128,
    "num_train_epochs": 3,
-    "per_gpu_train_batch_size": 32,
+    "per_device_train_batch_size": 32,
    "save_steps": 750,
    "seed": 1,
    "do_train": true,
--- a/examples/token-classification/run_ner.py
+++ b/examples/token-classification/run_ner.py
@@ -13,7 +13,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-""" Fine-tuning the library models for named entity recognition on CoNLL-2003 (Bert or Roberta). """
+""" Fine-tuning the library models for named entity recognition on CoNLL-2003. """


 import logging
@@ -171,7 +171,6 @@ def main():
            max_seq_length=data_args.max_seq_length,
            overwrite_cache=data_args.overwrite_cache,
            mode=Split.train,
-            local_rank=training_args.local_rank,
        )
        if training_args.do_train
        else None
@@ -185,7 +184,6 @@ def main():
            max_seq_length=data_args.max_seq_length,
            overwrite_cache=data_args.overwrite_cache,
            mode=Split.dev,
-            local_rank=training_args.local_rank,
        )
        if training_args.do_eval
        else None
@@ -237,22 +235,23 @@ def main():

    # Evaluation
    results = {}
-    if training_args.do_eval and training_args.local_rank in [-1, 0]:
+    if training_args.do_eval:
        logger.info("*** Evaluate ***")

        result = trainer.evaluate()

        output_eval_file = os.path.join(training_args.output_dir, "eval_results.txt")
-        with open(output_eval_file, "w") as writer:
-            logger.info("***** Eval results *****")
-            for key, value in result.items():
-                logger.info("  %s = %s", key, value)
-                writer.write("%s = %s\n" % (key, value))
+        if trainer.is_world_master():
+            with open(output_eval_file, "w") as writer:
+                logger.info("***** Eval results *****")
+                for key, value in result.items():
+                    logger.info("  %s = %s", key, value)
+                    writer.write("%s = %s\n" % (key, value))

            results.update(result)

    # Predict
-    if training_args.do_predict and training_args.local_rank in [-1, 0]:
+    if training_args.do_predict:
        test_dataset = NerDataset(
            data_dir=data_args.data_dir,
            tokenizer=tokenizer,
@@ -261,33 +260,36 @@ def main():
            max_seq_length=data_args.max_seq_length,
            overwrite_cache=data_args.overwrite_cache,
            mode=Split.test,
-            local_rank=training_args.local_rank,
        )

        predictions, label_ids, metrics = trainer.predict(test_dataset)
        preds_list, _ = align_predictions(predictions, label_ids)

        output_test_results_file = os.path.join(training_args.output_dir, "test_results.txt")
-        with open(output_test_results_file, "w") as writer:
-            for key, value in metrics.items():
-                logger.info("  %s = %s", key, value)
-                writer.write("%s = %s\n" % (key, value))
+        if trainer.is_world_master():
+            with open(output_test_results_file, "w") as writer:
+                for key, value in metrics.items():
+                    logger.info("  %s = %s", key, value)
+                    writer.write("%s = %s\n" % (key, value))

        # Save predictions
        output_test_predictions_file = os.path.join(training_args.output_dir, "test_predictions.txt")
-        with open(output_test_predictions_file, "w") as writer:
-            with open(os.path.join(data_args.data_dir, "test.txt"), "r") as f:
-                example_id = 0
-                for line in f:
-                    if line.startswith("-DOCSTART-") or line == "" or line == "\n":
-                        writer.write(line)
-                        if not preds_list[example_id]:
-                            example_id += 1
-                    elif preds_list[example_id]:
-                        output_line = line.split()[0] + " " + preds_list[example_id].pop(0) + "\n"
-                        writer.write(output_line)
-                    else:
-                        logger.warning("Maximum sequence length exceeded: No prediction for '%s'.", line.split()[0])
+        if trainer.is_world_master():
+            with open(output_test_predictions_file, "w") as writer:
+                with open(os.path.join(data_args.data_dir, "test.txt"), "r") as f:
+                    example_id = 0
+                    for line in f:
+                        if line.startswith("-DOCSTART-") or line == "" or line == "\n":
+                            writer.write(line)
+                            if not preds_list[example_id]:
+                                example_id += 1
+                        elif preds_list[example_id]:
+                            output_line = line.split()[0] + " " + preds_list[example_id].pop(0) + "\n"
+                            writer.write(output_line)
+                        else:
+                            logger.warning(
+                                "Maximum sequence length exceeded: No prediction for '%s'.", line.split()[0]
+                            )

    return results

--- a/examples/token-classification/utils_ner.py
+++ b/examples/token-classification/utils_ner.py
@@ -22,6 +22,8 @@ from dataclasses import dataclass
 from enum import Enum
 from typing import List, Optional, Union

+from filelock import FileLock
+
 from transformers import PreTrainedTokenizer, is_tf_available, is_torch_available


@@ -68,7 +70,6 @@ if is_torch_available():
    import torch
    from torch import nn
    from torch.utils.data.dataset import Dataset
-    from transformers import torch_distributed_zero_first

    class NerDataset(Dataset):
        """
@@ -90,16 +91,16 @@ if is_torch_available():
            max_seq_length: Optional[int] = None,
            overwrite_cache=False,
            mode: Split = Split.train,
-            local_rank=-1,
        ):
            # Load data features from cache or dataset file
            cached_features_file = os.path.join(
                data_dir, "cached_{}_{}_{}".format(mode.value, tokenizer.__class__.__name__, str(max_seq_length)),
            )

-            with torch_distributed_zero_first(local_rank):
-                # Make sure only the first process in distributed training processes the dataset,
-                # and the others will use the cache.
+            # Make sure only the first process in distributed training processes the dataset,
+            # and the others will use the cache.
+            lock_path = cached_features_file + ".lock"
+            with FileLock(lock_path):

                if os.path.exists(cached_features_file) and not overwrite_cache:
                    logger.info(f"Loading features from cached file {cached_features_file}")
@@ -125,9 +126,8 @@ if is_torch_available():
                        pad_token_segment_id=tokenizer.pad_token_type_id,
                        pad_token_label_id=self.pad_token_label_id,
                    )
-                    if local_rank in [-1, 0]:
-                        logger.info(f"Saving features into cached file {cached_features_file}")
-                        torch.save(self.features, cached_features_file)
+                    logger.info(f"Saving features into cached file {cached_features_file}")
+                    torch.save(self.features, cached_features_file)

        def __len__(self):
            return len(self.features)
--- a/model_cards/HooshvareLab/bert-base-parsbert-armanner-uncased/README.md
+++ b/model_cards/HooshvareLab/bert-base-parsbert-armanner-uncased/README.md
@@ -0,0 +1,124 @@
+## ParsBERT: Transformer-based Model for Persian Language Understanding
+
+ParsBERT is a monolingual language model based on Google’s BERT architecture with the same configurations as BERT-Base. 
+
+Paper presenting ParsBERT: [arXiv:2005.12515](https://arxiv.org/abs/2005.12515)
+
+All the models (downstream tasks) are uncased and trained with whole word masking. (coming soon stay tuned)
+
+
+## Persian NER [ARMAN, PEYMA, ARMAN+PEYMA]
+
+This task aims to extract named entities in the text, such as names and label with appropriate `NER` classes such as locations, organizations, etc. The datasets used for this task contain sentences that are marked with `IOB` format. In this format, tokens that are not part of an entity are tagged as `”O”` the `”B”`tag corresponds to the first word of an object, and the `”I”` tag corresponds to the rest of the terms of the same entity. Both `”B”` and `”I”` tags are followed by a hyphen (or underscore), followed by the entity category. Therefore, the NER task is a multi-class token classification problem that labels the tokens upon being fed a raw text. There are two primary datasets used in Persian NER, `ARMAN`, and `PEYMA`. In ParsBERT, we prepared ner for both datasets as well as a combination of both datasets.
+
+
+
+### PEYMA
+
+PEYMA dataset includes 7,145 sentences with a total of 302,530 tokens from which 41,148 tokens are tagged with seven different classes.
+
+1. Organization
+2. Money
+3. Location
+4. Date
+5. Time
+6. Person
+7. Percent
+
+
+|     Label    |   #   |
+|:------------:|:-----:|
+| Organization | 16964 |
+|     Money    |  2037 |
+|   Location   |  8782 |
+|     Date     |  4259 |
+|     Time     |  732  |
+|    Person    |  7675 |
+|    Percent   |  699  |
+
+
+
+**Download**
+You can download the dataset from [here](http://nsurl.org/tasks/task-7-named-entity-recognition-ner-for-farsi/)
+
+---
+
+### ARMAN
+
+ARMAN dataset holds 7,682 sentences with 250,015 sentences tagged over six different classes.
+
+1. Organization
+2. Location
+3. Facility
+4. Event
+5. Product
+6. Person
+
+
+|     Label    |   #   |
+|:------------:|:-----:|
+| Organization | 30108 |
+|   Location   | 12924 |
+|   Facility   |  4458 |
+|     Event    |  7557 |
+|    Product   |  4389 |
+|    Person    | 15645 |
+
+
+
+**Download**
+You can download the dataset from [here](https://github.com/HaniehP/PersianNER)
+
+
+
+## Results
+
+The following table summarizes the F1 score obtained by ParsBERT as compared to other models and architectures.
+
+| Dataset         | ParsBERT | MorphoBERT |  Beheshti-NER  |  LSTM-CRF  |  Rule-Based CRF  |  BiLSTM-CRF  |
+|:---------------:|:--------:|:----------:|:--------------:|:----------:|:----------------:|:------------:|
+|  ARMAN + PEYMA  |   95.13* |      -     |        -       |      -     |         -        |       -      |
+|  PEYMA          |   98.79* |      -     |      90.59     |      -     |       84.00      |       -      |
+|  ARMAN          |   93.10* |    89.9    |      84.03     |    86.55   |         -        |     77.45    |
+
+
+## How to use :hugs:
+| Notebook     |      Description      |   |
+|:----------|:-------------|------:|
+| [How to use Pipelines](https://github.com/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb)  | Simple and efficient way to use State-of-the-Art models on downstream tasks through transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb) |
+
+
+## Cite 
+
+Please cite the following paper in your publication if you are using [ParsBERT](https://arxiv.org/abs/2005.12515) in your research:
+
+```markdown
+@article{ParsBERT,
+    title={ParsBERT: Transformer-based Model for Persian Language Understanding},
+    author={Mehrdad Farahani, Mohammad Gharachorloo, Marzieh Farahani, Mohammad Manthouri},
+    journal={ArXiv},
+    year={2020},
+    volume={abs/2005.12515}
+}
+```
+
+
+## Acknowledgments
+
+We hereby, express our gratitude to the [Tensorflow Research Cloud (TFRC) program](https://tensorflow.org/tfrc) for providing us with the necessary computation resources. We also thank [Hooshvare](https://hooshvare.com) Research Group for facilitating dataset gathering and scraping online text resources.
+
+
+## Contributors
+
+- Mehrdad Farahani: [Linkedin](https://www.linkedin.com/in/m3hrdadfi/), [Twitter](https://twitter.com/m3hrdadfi), [Github](https://github.com/m3hrdadfi)
+- Mohammad Gharachorloo:  [Linkedin](https://www.linkedin.com/in/mohammad-gharachorloo/), [Twitter](https://twitter.com/MGharachorloo), [Github](https://github.com/baarsaam)
+- Marzieh Farahani:  [Linkedin](https://www.linkedin.com/in/marziehphi/), [Twitter](https://twitter.com/marziehphi), [Github](https://github.com/marziehphi)
+- Mohammad Manthouri:  [Linkedin](https://www.linkedin.com/in/mohammad-manthouri-aka-mansouri-07030766/), [Twitter](https://twitter.com/mmanthouri), [Github](https://github.com/mmanthouri)
+- Hooshvare Team:  [Official Website](https://hooshvare.com/), [Linkedin](https://www.linkedin.com/company/hooshvare), [Twitter](https://twitter.com/hooshvare), [Github](https://github.com/hooshvare), [Instagram](https://www.instagram.com/hooshvare/)
+
+ And a special thanks to Sara Tabrizi for her fantastic poster design. Follow her on: [Linkedin](https://www.linkedin.com/in/sara-tabrizi-64548b79/), [Behance](https://www.behance.net/saratabrizi), [Instagram](https://www.instagram.com/sara_b_tabrizi/)
+
+## Releases
+
+### Release v0.1 (May 29, 2019)
+This is the first version of our ParsBERT NER!
--- a/model_cards/HooshvareLab/bert-base-parsbert-ner-uncased/README.md
+++ b/model_cards/HooshvareLab/bert-base-parsbert-ner-uncased/README.md
@@ -0,0 +1,124 @@
+## ParsBERT: Transformer-based Model for Persian Language Understanding
+
+ParsBERT is a monolingual language model based on Google’s BERT architecture with the same configurations as BERT-Base. 
+
+Paper presenting ParsBERT: [arXiv:2005.12515](https://arxiv.org/abs/2005.12515)
+
+All the models (downstream tasks) are uncased and trained with whole word masking. (coming soon stay tuned)
+
+
+## Persian NER [ARMAN, PEYMA, ARMAN+PEYMA]
+
+This task aims to extract named entities in the text, such as names and label with appropriate `NER` classes such as locations, organizations, etc. The datasets used for this task contain sentences that are marked with `IOB` format. In this format, tokens that are not part of an entity are tagged as `”O”` the `”B”`tag corresponds to the first word of an object, and the `”I”` tag corresponds to the rest of the terms of the same entity. Both `”B”` and `”I”` tags are followed by a hyphen (or underscore), followed by the entity category. Therefore, the NER task is a multi-class token classification problem that labels the tokens upon being fed a raw text. There are two primary datasets used in Persian NER, `ARMAN`, and `PEYMA`. In ParsBERT, we prepared ner for both datasets as well as a combination of both datasets.
+
+
+
+### PEYMA
+
+PEYMA dataset includes 7,145 sentences with a total of 302,530 tokens from which 41,148 tokens are tagged with seven different classes.
+
+1. Organization
+2. Money
+3. Location
+4. Date
+5. Time
+6. Person
+7. Percent
+
+
+|     Label    |   #   |
+|:------------:|:-----:|
+| Organization | 16964 |
+|     Money    |  2037 |
+|   Location   |  8782 |
+|     Date     |  4259 |
+|     Time     |  732  |
+|    Person    |  7675 |
+|    Percent   |  699  |
+
+
+
+**Download**
+You can download the dataset from [here](http://nsurl.org/tasks/task-7-named-entity-recognition-ner-for-farsi/)
+
+---
+
+### ARMAN
+
+ARMAN dataset holds 7,682 sentences with 250,015 sentences tagged over six different classes.
+
+1. Organization
+2. Location
+3. Facility
+4. Event
+5. Product
+6. Person
+
+
+|     Label    |   #   |
+|:------------:|:-----:|
+| Organization | 30108 |
+|   Location   | 12924 |
+|   Facility   |  4458 |
+|     Event    |  7557 |
+|    Product   |  4389 |
+|    Person    | 15645 |
+
+
+
+**Download**
+You can download the dataset from [here](https://github.com/HaniehP/PersianNER)
+
+
+
+## Results
+
+The following table summarizes the F1 score obtained by ParsBERT as compared to other models and architectures.
+
+| Dataset         | ParsBERT | MorphoBERT |  Beheshti-NER  |  LSTM-CRF  |  Rule-Based CRF  |  BiLSTM-CRF  |
+|:---------------:|:--------:|:----------:|:--------------:|:----------:|:----------------:|:------------:|
+|  ARMAN + PEYMA  |   95.13* |      -     |        -       |      -     |         -        |       -      |
+|  PEYMA          |   98.79* |      -     |      90.59     |      -     |       84.00      |       -      |
+|  ARMAN          |   93.10* |    89.9    |      84.03     |    86.55   |         -        |     77.45    |
+
+
+## How to use :hugs:
+| Notebook     |      Description      |   |
+|:----------|:-------------|------:|
+| [How to use Pipelines](https://github.com/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb)  | Simple and efficient way to use State-of-the-Art models on downstream tasks through transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb) |
+
+
+## Cite 
+
+Please cite the following paper in your publication if you are using [ParsBERT](https://arxiv.org/abs/2005.12515) in your research:
+
+```markdown
+@article{ParsBERT,
+    title={ParsBERT: Transformer-based Model for Persian Language Understanding},
+    author={Mehrdad Farahani, Mohammad Gharachorloo, Marzieh Farahani, Mohammad Manthouri},
+    journal={ArXiv},
+    year={2020},
+    volume={abs/2005.12515}
+}
+```
+
+
+## Acknowledgments
+
+We hereby, express our gratitude to the [Tensorflow Research Cloud (TFRC) program](https://tensorflow.org/tfrc) for providing us with the necessary computation resources. We also thank [Hooshvare](https://hooshvare.com) Research Group for facilitating dataset gathering and scraping online text resources.
+
+
+## Contributors
+
+- Mehrdad Farahani: [Linkedin](https://www.linkedin.com/in/m3hrdadfi/), [Twitter](https://twitter.com/m3hrdadfi), [Github](https://github.com/m3hrdadfi)
+- Mohammad Gharachorloo:  [Linkedin](https://www.linkedin.com/in/mohammad-gharachorloo/), [Twitter](https://twitter.com/MGharachorloo), [Github](https://github.com/baarsaam)
+- Marzieh Farahani:  [Linkedin](https://www.linkedin.com/in/marziehphi/), [Twitter](https://twitter.com/marziehphi), [Github](https://github.com/marziehphi)
+- Mohammad Manthouri:  [Linkedin](https://www.linkedin.com/in/mohammad-manthouri-aka-mansouri-07030766/), [Twitter](https://twitter.com/mmanthouri), [Github](https://github.com/mmanthouri)
+- Hooshvare Team:  [Official Website](https://hooshvare.com/), [Linkedin](https://www.linkedin.com/company/hooshvare), [Twitter](https://twitter.com/hooshvare), [Github](https://github.com/hooshvare), [Instagram](https://www.instagram.com/hooshvare/)
+
+ And a special thanks to Sara Tabrizi for her fantastic poster design. Follow her on: [Linkedin](https://www.linkedin.com/in/sara-tabrizi-64548b79/), [Behance](https://www.behance.net/saratabrizi), [Instagram](https://www.instagram.com/sara_b_tabrizi/)
+
+## Releases
+
+### Release v0.1 (May 29, 2019)
+This is the first version of our ParsBERT NER!
--- a/model_cards/HooshvareLab/bert-base-parsbert-peymaner-uncased/README.md
+++ b/model_cards/HooshvareLab/bert-base-parsbert-peymaner-uncased/README.md
@@ -0,0 +1,124 @@
+## ParsBERT: Transformer-based Model for Persian Language Understanding
+
+ParsBERT is a monolingual language model based on Google’s BERT architecture with the same configurations as BERT-Base. 
+
+Paper presenting ParsBERT: [arXiv:2005.12515](https://arxiv.org/abs/2005.12515)
+
+All the models (downstream tasks) are uncased and trained with whole word masking. (coming soon stay tuned)
+
+
+## Persian NER [ARMAN, PEYMA, ARMAN+PEYMA]
+
+This task aims to extract named entities in the text, such as names and label with appropriate `NER` classes such as locations, organizations, etc. The datasets used for this task contain sentences that are marked with `IOB` format. In this format, tokens that are not part of an entity are tagged as `”O”` the `”B”`tag corresponds to the first word of an object, and the `”I”` tag corresponds to the rest of the terms of the same entity. Both `”B”` and `”I”` tags are followed by a hyphen (or underscore), followed by the entity category. Therefore, the NER task is a multi-class token classification problem that labels the tokens upon being fed a raw text. There are two primary datasets used in Persian NER, `ARMAN`, and `PEYMA`. In ParsBERT, we prepared ner for both datasets as well as a combination of both datasets.
+
+
+
+### PEYMA
+
+PEYMA dataset includes 7,145 sentences with a total of 302,530 tokens from which 41,148 tokens are tagged with seven different classes.
+
+1. Organization
+2. Money
+3. Location
+4. Date
+5. Time
+6. Person
+7. Percent
+
+
+|     Label    |   #   |
+|:------------:|:-----:|
+| Organization | 16964 |
+|     Money    |  2037 |
+|   Location   |  8782 |
+|     Date     |  4259 |
+|     Time     |  732  |
+|    Person    |  7675 |
+|    Percent   |  699  |
+
+
+
+**Download**
+You can download the dataset from [here](http://nsurl.org/tasks/task-7-named-entity-recognition-ner-for-farsi/)
+
+---
+
+### ARMAN
+
+ARMAN dataset holds 7,682 sentences with 250,015 sentences tagged over six different classes.
+
+1. Organization
+2. Location
+3. Facility
+4. Event
+5. Product
+6. Person
+
+
+|     Label    |   #   |
+|:------------:|:-----:|
+| Organization | 30108 |
+|   Location   | 12924 |
+|   Facility   |  4458 |
+|     Event    |  7557 |
+|    Product   |  4389 |
+|    Person    | 15645 |
+
+
+
+**Download**
+You can download the dataset from [here](https://github.com/HaniehP/PersianNER)
+
+
+
+## Results
+
+The following table summarizes the F1 score obtained by ParsBERT as compared to other models and architectures.
+
+| Dataset         | ParsBERT | MorphoBERT |  Beheshti-NER  |  LSTM-CRF  |  Rule-Based CRF  |  BiLSTM-CRF  |
+|:---------------:|:--------:|:----------:|:--------------:|:----------:|:----------------:|:------------:|
+|  ARMAN + PEYMA  |   95.13* |      -     |        -       |      -     |         -        |       -      |
+|  PEYMA          |   98.79* |      -     |      90.59     |      -     |       84.00      |       -      |
+|  ARMAN          |   93.10* |    89.9    |      84.03     |    86.55   |         -        |     77.45    |
+
+
+## How to use :hugs:
+| Notebook     |      Description      |   |
+|:----------|:-------------|------:|
+| [How to use Pipelines](https://github.com/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb)  | Simple and efficient way to use State-of-the-Art models on downstream tasks through transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb) |
+
+
+## Cite 
+
+Please cite the following paper in your publication if you are using [ParsBERT](https://arxiv.org/abs/2005.12515) in your research:
+
+```markdown
+@article{ParsBERT,
+    title={ParsBERT: Transformer-based Model for Persian Language Understanding},
+    author={Mehrdad Farahani, Mohammad Gharachorloo, Marzieh Farahani, Mohammad Manthouri},
+    journal={ArXiv},
+    year={2020},
+    volume={abs/2005.12515}
+}
+```
+
+
+## Acknowledgments
+
+We hereby, express our gratitude to the [Tensorflow Research Cloud (TFRC) program](https://tensorflow.org/tfrc) for providing us with the necessary computation resources. We also thank [Hooshvare](https://hooshvare.com) Research Group for facilitating dataset gathering and scraping online text resources.
+
+
+## Contributors
+
+- Mehrdad Farahani: [Linkedin](https://www.linkedin.com/in/m3hrdadfi/), [Twitter](https://twitter.com/m3hrdadfi), [Github](https://github.com/m3hrdadfi)
+- Mohammad Gharachorloo:  [Linkedin](https://www.linkedin.com/in/mohammad-gharachorloo/), [Twitter](https://twitter.com/MGharachorloo), [Github](https://github.com/baarsaam)
+- Marzieh Farahani:  [Linkedin](https://www.linkedin.com/in/marziehphi/), [Twitter](https://twitter.com/marziehphi), [Github](https://github.com/marziehphi)
+- Mohammad Manthouri:  [Linkedin](https://www.linkedin.com/in/mohammad-manthouri-aka-mansouri-07030766/), [Twitter](https://twitter.com/mmanthouri), [Github](https://github.com/mmanthouri)
+- Hooshvare Team:  [Official Website](https://hooshvare.com/), [Linkedin](https://www.linkedin.com/company/hooshvare), [Twitter](https://twitter.com/hooshvare), [Github](https://github.com/hooshvare), [Instagram](https://www.instagram.com/hooshvare/)
+
+ And a special thanks to Sara Tabrizi for her fantastic poster design. Follow her on: [Linkedin](https://www.linkedin.com/in/sara-tabrizi-64548b79/), [Behance](https://www.behance.net/saratabrizi), [Instagram](https://www.instagram.com/sara_b_tabrizi/)
+
+## Releases
+
+### Release v0.1 (May 29, 2019)
+This is the first version of our ParsBERT NER!
--- a/model_cards/HooshvareLab/bert-base-parsbert-uncased/README.md
+++ b/model_cards/HooshvareLab/bert-base-parsbert-uncased/README.md
@@ -0,0 +1,124 @@
+## ParsBERT: Transformer-based Model for Persian Language Understanding
+
+ParsBERT is a monolingual language model based on Google’s BERT architecture with the same configurations as BERT-Base. 
+
+Paper presenting ParsBERT: [arXiv:2005.12515](https://arxiv.org/abs/2005.12515)
+
+All the models (downstream tasks) are uncased and trained with whole word masking. (coming soon stay tuned)
+
+
+---
+
+## Introduction
+
+This model is pre-trained on a large Persian corpus with various writing styles from numerous subjects (e.g., scientific, novels, news) with more than 2M documents. A large subset of this corpus was crawled manually.
+
+As a part of ParsBERT methodology, an extensive pre-processing combining POS tagging and WordPiece segmentation was carried out to bring the corpus into a proper format. This process produces more than 40M true sentences. 
+
+
+## Evaluation
+
+ParsBERT is evaluated on three NLP downstream tasks: Sentiment Analysis (SA), Text Classification, and Named Entity Recognition (NER). For this matter and due to insufficient resources, two large datasets for SA and two for text classification were manually composed, which are available for public use and benchmarking. ParsBERT outperformed all other language models, including multilingual BERT and other hybrid deep learning models for all tasks, improving the state-of-the-art performance in Persian language modeling.
+
+## Results
+
+The following table summarizes the F1 score obtained by ParsBERT as compared to other models and architectures.
+
+
+
+### Sentiment Analysis (SA) task
+
+|           Dataset          |  ParsBERT | mBERT | DeepSentiPers |
+|:--------------------------:|:---------:|:-----:|:-------------:|
+|   Digikala User Comments   |   81.74*  | 80.74 |       -       |
+|   SnappFood User Comments  |   88.12*  | 87.87 |       -       |
+|   SentiPers (Multi Class)  |   71.11*  |   -   |     69.33     |
+|  SentiPers (Binary Class)  |   92.13*  |   -   |     91.98     |
+
+
+
+### Text Classification (TC) task
+
+|      Dataset      | ParsBERT | mBERT |
+|:-----------------:|:--------:|:-----:|
+| Digikala Magazine |   93.59* | 90.72 |
+|    Persian News   |   97.19* | 95.79 |
+
+
+### Named Entity Recognition (NER) task
+
+| Dataset | ParsBERT |  mBERT   | MorphoBERT |  Beheshti-NER  |  LSTM-CRF  |  Rule-Based CRF  |  BiLSTM-CRF  |
+|:-------:|:--------:|:--------:|:----------:|:--------------:|:----------:|:----------------:|:------------:|
+|  PEYMA  |   93.10* |   86.64  |      -     |      90.59     |      -     |       84.00      |       -      |
+|  ARMAN  |   98.79* |   95.89  |    89.9    |      84.03     |    86.55   |         -        |     77.45    |
+
+
+**If you tested ParsBERT on a public dataset and you want to add your results to the table above, open a pull request or contact us. Also make sure to have your code available online so we can add it as a reference**
+
+## How to use
+
+### TensorFlow 2.0
+
+```python
+from transformers import AutoConfig, AutoTokenizer, TFAutoModel
+
+config = AutoConfig.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
+tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
+model = AutoModel.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
+
+text = "ما در هوشواره معتقدیم با انتقال صحیح دانش و آگاهی، همه افراد می‌توانند از ابزارهای هوشمند استفاده کنند. شعار ما هوش مصنوعی برای همه است."
+tokenizer.tokenize(text)
+
+>>> ['ما', 'در', 'هوش', '##واره', 'معتقدیم', 'با', 'انتقال', 'صحیح', 'دانش', 'و', 'اگاهی', '،', 'همه', 'افراد', 'میتوانند', 'از', 'ابزارهای', 'هوشمند', 'استفاده', 'کنند', '.', 'شعار', 'ما', 'هوش', 'مصنوعی', 'برای', 'همه', 'است', '.']
+
+```
+
+### Pytorch
+
+```python
+from transformers import AutoConfig, AutoTokenizer, AutoModel
+
+config = AutoConfig.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
+tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
+model = AutoModel.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
+```
+
+
+## NLP Tasks Tutorial 
+
+Coming soon stay tuned
+
+
+## Cite 
+
+Please cite the following paper in your publication if you are using [ParsBERT](https://arxiv.org/abs/2005.12515) in your research:
+
+```markdown
+@article{ParsBERT,
+    title={ParsBERT: Transformer-based Model for Persian Language Understanding},
+    author={Mehrdad Farahani, Mohammad Gharachorloo, Marzieh Farahani, Mohammad Manthouri},
+    journal={ArXiv},
+    year={2020},
+    volume={abs/2005.12515}
+}
+```
+
+
+## Acknowledgments
+
+We hereby, express our gratitude to the [Tensorflow Research Cloud (TFRC) program](https://tensorflow.org/tfrc) for providing us with the necessary computation resources. We also thank [Hooshvare](https://hooshvare.com) Research Group for facilitating dataset gathering and scraping online text resources.
+
+
+## Contributors
+
+- Mehrdad Farahani: [Linkedin](https://www.linkedin.com/in/m3hrdadfi/), [Twitter](https://twitter.com/m3hrdadfi), [Github](https://github.com/m3hrdadfi)
+- Mohammad Gharachorloo:  [Linkedin](https://www.linkedin.com/in/mohammad-gharachorloo/), [Twitter](https://twitter.com/MGharachorloo), [Github](https://github.com/baarsaam)
+- Marzieh Farahani:  [Linkedin](https://www.linkedin.com/in/marziehphi/), [Twitter](https://twitter.com/marziehphi), [Github](https://github.com/marziehphi)
+- Mohammad Manthouri:  [Linkedin](https://www.linkedin.com/in/mohammad-manthouri-aka-mansouri-07030766/), [Twitter](https://twitter.com/mmanthouri), [Github](https://github.com/mmanthouri)
+- Hooshvare Team:  [Official Website](https://hooshvare.com/), [Linkedin](https://www.linkedin.com/company/hooshvare), [Twitter](https://twitter.com/hooshvare), [Github](https://github.com/hooshvare), [Instagram](https://www.instagram.com/hooshvare/)
+
+
+## Releases
+
+### Release v0.1 (May 27, 2019)
+This is the first version of our ParsBERT based on BERT<sub>BASE</sub>
--- a/model_cards/Tereveni-AI/gpt2-124M-uk-fiction/README.md
+++ b/model_cards/Tereveni-AI/gpt2-124M-uk-fiction/README.md
@@ -0,0 +1,39 @@
+---
+language: ukrainian
+---
+
+Note: **default code snippet above won't work** because we are using `AlbertTokenizer` with `GPT2LMHeadModel`, see [issue](https://github.com/huggingface/transformers/issues/4285).
+
+## GPT2 124M Trained on Ukranian Fiction
+
+### Training details
+
+Model was trained on corpus of 4040 fiction books, 2.77 GiB in total.
+Evaluation on [brown-uk](https://github.com/brown-uk/corpus) gives perplexity of 50.16. 
+
+### Example usage:
+```python
+from transformers import AlbertTokenizer, GPT2LMHeadModel
+
+tokenizer = AlbertTokenizer.from_pretrained("Tereveni-AI/gpt2-124M-uk-fiction")
+model = GPT2LMHeadModel.from_pretrained("Tereveni-AI/gpt2-124M-uk-fiction")
+
+input_ids = tokenizer.encode("Но зла Юнона, суча дочка,", add_special_tokens=False, return_tensors='pt')
+
+outputs = model.generate(
+    input_ids,
+    do_sample=True,
+    num_return_sequences=3,
+    max_length=50
+)
+
+for i, out in enumerate(outputs):
+    print("{}: {}".format(i, tokenizer.decode(out)))
+```
+
+Prints something like this:
+```bash
+0: Но зла Юнона, суча дочка, яка затьмарила всі її таємниці: І хто з'їсть її душу, той помре». І, не дочекавшись гніву богів, посунула в пітьму, щоб не бачити перед собою. Але, за
+1: Но зла Юнона, суча дочка, і довела мене до божевілля. Але він не знав нічого. Після того як я його побачив, мені стало зле. Я втратив рівновагу. Але в мене не було часу на роздуми. Я вже втратив надію
+2: Но зла Юнона, суча дочка, не нарікала нам! — раптом вигукнула Юнона. — Це ти, старий йолопе! — мовила вона, не перестаючи сміятись. — Хіба ти не знаєш, що мені подобається ходити з тобою?
+```
--- a/model_cards/ViktorAlm/electra-base-norwegian-uncased-discriminator/README.md
+++ b/model_cards/ViktorAlm/electra-base-norwegian-uncased-discriminator/README.md
@@ -1,19 +1,25 @@
+---
+language: norwegian
+thumbnail: https://i.imgur.com/QqSEC5I.png
+---
+
 # Norwegian Electra
-Image incoming, im going to have som fun with this one.
+![Image of norwegian electra](https://i.imgur.com/QqSEC5I.png)

 Trained on Oscar + wikipedia + opensubtitles + some other data I had with the awesome power of TPUs(V3-8)

 Use with caution. I have no downstream tasks in Norwegian to test on so I have no idea of its performance yet.
-
+# Model
+## Electra: Pre-training Text Encoders as Discriminators Rather Than Generators
+Kevin Clark and Minh-Thang Luong and Quoc V. Le and Christopher D. Manning
+- https://openreview.net/pdf?id=r1xMH1BtvB
+- https://github.com/google-research/electra
 # Acknowledgments
-
 ### TensorFlow Research Cloud
 Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC). Thanks for providing access to the TFRC ❤️
 - https://www.tensorflow.org/tfrc
-
 #### OSCAR corpus
 - https://oscar-corpus.com/
-
 #### OPUS
 - http://opus.nlpl.eu/
 - http://www.opensubtitles.org/
--- a/model_cards/activebus/BERT-DK_laptop/README.md
+++ b/model_cards/activebus/BERT-DK_laptop/README.md
@@ -0,0 +1,43 @@
+# ReviewBERT
+
+BERT (post-)trained from review corpus to understand sentiment, options and various e-commence aspects.  
+
+`BERT-DK_laptop` is trained from 100MB laptop corpus under `Electronics/Computers & Accessories/Laptops`. 
+
+
+## Model Description
+
+The original model is from `BERT-base-uncased` trained from Wikipedia+BookCorpus.  
+Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).  
+
+`BERT-DK_laptop` is trained from 100MB laptop corpus under `Electronics/Computers & Accessories/Laptops`. 
+
+## Instructions
+Loading the post-trained weights are as simple as, e.g., 
+
+```python
+import torch
+from transformers import AutoModel, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("activebus/BERT-DK_laptop")
+model = AutoModel.from_pretrained("activebus/BERT-DK_laptop")
+
+```
+
+
+## Evaluation Results
+
+Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf) 
+
+
+## Citation
+If you find this work useful, please cite as following.
+```
+@inproceedings{xu_bert2019,
+    title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
+    author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
+    booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
+    month = "jun",
+    year = "2019",
+}
+```
--- a/model_cards/activebus/BERT-DK_rest/README.md
+++ b/model_cards/activebus/BERT-DK_rest/README.md
@@ -0,0 +1,41 @@
+# ReviewBERT
+
+BERT (post-)trained from review corpus to understand sentiment, options and various e-commence aspects.
+
+`BERT-DK_rest` is trained from 1G (19 types) restaurants from Yelp.  
+
+## Model Description
+
+The original model is from `BERT-base-uncased` trained from Wikipedia+BookCorpus.  
+Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).  
+
+
+## Instructions
+Loading the post-trained weights are as simple as, e.g., 
+
+```python
+import torch
+from transformers import AutoModel, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("activebus/BERT-DK_rest")
+model = AutoModel.from_pretrained("activebus/BERT-DK_rest")
+
+```
+
+
+## Evaluation Results
+
+Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf) 
+
+
+## Citation
+If you find this work useful, please cite as following.
+```
+@inproceedings{xu_bert2019,
+    title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
+    author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
+    booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
+    month = "jun",
+    year = "2019",
+}
+```
--- a/model_cards/activebus/BERT-PT_laptop/README.md
+++ b/model_cards/activebus/BERT-PT_laptop/README.md
@@ -0,0 +1,41 @@
+# ReviewBERT
+
+BERT (post-)trained from review corpus to understand sentiment, options and various e-commence aspects.  
+
+`BERT-DK_laptop` is trained from 100MB laptop corpus under `Electronics/Computers & Accessories/Laptops`. 
+`BERT-PT_*` addtionally uses SQuAD 1.1.  
+
+## Model Description
+
+The original model is from `BERT-base-uncased` trained from Wikipedia+BookCorpus.  
+Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).  
+
+
+## Instructions
+Loading the post-trained weights are as simple as, e.g., 
+
+```python
+import torch
+from transformers import AutoModel, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("activebus/BERT-PT_laptop")
+model = AutoModel.from_pretrained("activebus/BERT-PT_laptop")
+
+```
+
+## Evaluation Results
+
+Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf) 
+
+
+## Citation
+If you find this work useful, please cite as following.
+```
+@inproceedings{xu_bert2019,
+    title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
+    author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
+    booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
+    month = "jun",
+    year = "2019",
+}
+```
--- a/model_cards/activebus/BERT-PT_rest/README.md
+++ b/model_cards/activebus/BERT-PT_rest/README.md
@@ -0,0 +1,42 @@
+# ReviewBERT
+
+BERT (post-)trained from review corpus to understand sentiment, options and various e-commence aspects.  
+
+`BERT-DK_rest` is trained from 1G (19 types) restaurants from Yelp.
+`BERT-PT_*` addtionally uses SQuAD 1.1.  
+
+## Model Description
+
+The original model is from `BERT-base-uncased` trained from Wikipedia+BookCorpus.  
+Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).  
+
+
+## Instructions
+Loading the post-trained weights are as simple as, e.g., 
+
+```python
+import torch
+from transformers import AutoModel, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("activebus/BERT-PT_rest")
+model = AutoModel.from_pretrained("activebus/BERT-PT_rest")
+
+```
+
+
+## Evaluation Results
+
+Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf) 
+
+
+## Citation
+If you find this work useful, please cite as following.
+```
+@inproceedings{xu_bert2019,
+    title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
+    author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
+    booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
+    month = "jun",
+    year = "2019",
+}
+```
--- a/model_cards/activebus/BERT-XD_Review/README.md
+++ b/model_cards/activebus/BERT-XD_Review/README.md
@@ -0,0 +1,44 @@
+# ReviewBERT
+
+BERT (post-)trained from review corpus to understand sentiment, options and various e-commence aspects.  
+Please visit https://github.com/howardhsu/BERT-for-RRC-ABSA for details.  
+
+`BERT-XD_Review` is a cross-domain (beyond just `laptop` and `restaurant`) language model, where each example is from a single product / restaurant with the same rating, post-trained (fine-tuned) on a combination of 5-core Amazon reviews and all Yelp data, expected to be 22 G in total. It is trained for 4 epochs on `bert-base-uncased`.
+The preprocessing code [here](https://github.com/howardhsu/BERT-for-RRC-ABSA/transformers).
+
+## Model Description
+
+The original model is from `BERT-base-uncased`.  
+Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).  
+
+
+## Instructions
+Loading the post-trained weights are as simple as, e.g., 
+
+```python
+import torch
+from transformers import AutoModel, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("activebus/BERT-XD_Review")
+model = AutoModel.from_pretrained("activebus/BERT-XD_Review")
+
+```
+
+
+## Evaluation Results
+
+Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf) 
+`BERT_Review` is expected to have similar performance on domain-specific tasks (such as aspect extraction) as `BERT-DK`, but much better on general tasks such as aspect sentiment classification (different domains mostly share similar sentiment words).
+
+
+## Citation
+If you find this work useful, please cite as following.
+```
+@inproceedings{xu_bert2019,
+    title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
+    author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
+    booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
+    month = "jun",
+    year = "2019",
+}
+```
--- a/model_cards/activebus/BERT_Review/README.md
+++ b/model_cards/activebus/BERT_Review/README.md
@@ -0,0 +1,44 @@
+# ReviewBERT
+
+BERT (post-)trained from review corpus to understand sentiment, options and various e-commence aspects.  
+
+`BERT_Review` is cross-domain (beyond just `laptop` and `restaurant`) language model with one example from randomly mixed domains, post-trained (fine-tuned) on a combination of 5-core Amazon reviews and all Yelp data, expected to be 22 G in total. It is trained for 4 epochs on `bert-base-uncased`.
+The preprocessing code [here](https://github.com/howardhsu/BERT-for-RRC-ABSA/transformers).
+
+
+## Model Description
+
+The original model is from `BERT-base-uncased` trained from Wikipedia+BookCorpus.  
+Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).  
+
+
+## Instructions
+Loading the post-trained weights are as simple as, e.g., 
+
+```python
+import torch
+from transformers import AutoModel, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("activebus/BERT_Review")
+model = AutoModel.from_pretrained("activebus/BERT_Review")
+
+```
+
+
+## Evaluation Results
+
+Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf) 
+`BERT_Review` is expected to have similar performance on domain-specific tasks (such as aspect extraction) as `BERT-DK`, but much better on general tasks such as aspect sentiment classification (different domains mostly share similar sentiment words).
+
+
+## Citation
+If you find this work useful, please cite as following.
+```
+@inproceedings{xu_bert2019,
+    title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
+    author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
+    booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
+    month = "jun",
+    year = "2019",
+}
+```
--- a/model_cards/allenai/longformer-base-4096-extra.pos.embd.only/README.md
+++ b/model_cards/allenai/longformer-base-4096-extra.pos.embd.only/README.md
@@ -0,0 +1,20 @@
+
+# longformer-base-4096-extra.pos.embd.only
+
+This model is similar to `longformer-base-4096` but it was pretrained to preserve RoBERTa weights by freezing all RoBERTa weights and only train the additional position embeddings. 
+
+
+### Citing
+
+If you use `Longformer` in your research, please cite [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150).
+```
+@article{Beltagy2020Longformer,
+  title={Longformer: The Long-Document Transformer},
+  author={Iz Beltagy and Matthew E. Peters and Arman Cohan},
+  journal={arXiv:2004.05150},
+  year={2020},
+}
+```
+
+`Longformer` is an open-source project developed by [the Allen Institute for Artificial Intelligence (AI2)](http://www.allenai.org).
+AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.
--- a/model_cards/allenai/longformer-base-4096/README.md
+++ b/model_cards/allenai/longformer-base-4096/README.md
@@ -0,0 +1,24 @@
+
+# longformer-base-4096
+[Longformer](https://arxiv.org/abs/2004.05150) is a transformer model for long documents. 
+
+`longformer-base-4096` is a BERT-like model started from the RoBERTa checkpoint and pretrained for MLM on long documents. It supports sequences of length up to 4,096. 
+ 
+Longformer uses a combination of a sliding window (local) attention and global attention. Global attention is user-configured based on the task to allow the model to learn task-specific representations.
+Please refer to the examples in `modeling_longformer.py` and the paper for more details on how to set global attention.
+
+
+### Citing
+
+If you use `Longformer` in your research, please cite [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150).
+```
+@article{Beltagy2020Longformer,
+  title={Longformer: The Long-Document Transformer},
+  author={Iz Beltagy and Matthew E. Peters and Arman Cohan},
+  journal={arXiv:2004.05150},
+  year={2020},
+}
+```
+
+`Longformer` is an open-source project developed by [the Allen Institute for Artificial Intelligence (AI2)](http://www.allenai.org).
+AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.
--- a/model_cards/asafaya/bert-base-arabic/README.md
+++ b/model_cards/asafaya/bert-base-arabic/README.md
@@ -6,6 +6,17 @@ language: arabic

 Pretrained BERT base language model for Arabic

+_If you use this model in your work, please cite this paper (to appear in 2020):_
+
+```
+@inproceedings{
+  title={KUISAIL at SemEval-2020 Task 12: BERT-CNN for Offensive Speech Identification in Social Media},
+  author={Safaya, Ali and Abdullatif, Moutasem and Yuret, Deniz},
+  booktitle={Proceedings of the International Workshop on Semantic Evaluation (SemEval)},
+  year={2020}
+}
+```
+
 ## Pretraining Corpus

 `arabic-bert-base` model was pretrained on ~8.2 Billion words:
--- a/model_cards/aubmindlab/bert-base-arabert/README.md
+++ b/model_cards/aubmindlab/bert-base-arabert/README.md
@@ -3,8 +3,9 @@ language: arabic
 ---

 # AraBERT : Pre-training BERT for Arabic Language Understanding
+<img src="https://github.com/aub-mind/arabert/blob/master/arabert_logo.png" width="100" align="left"/>  

-**AraBERT** is an Arabic pretrained lanaguage model based on [Google's BERT architechture](https://github.com/google-research/bert). AraBERT uses the same BERT-Base config.
+**AraBERT** is an Arabic pretrained lanaguage model based on [Google's BERT architechture](https://github.com/google-research/bert). AraBERT uses the same BERT-Base config. More details are available in the [AraBERT PAPER](https://arxiv.org/abs/2003.00104v2) and in the [AraBERT Meetup](https://github.com/WissamAntoun/pydata_khobar_meetup)

 There are two version off the model AraBERTv0.1 and AraBERTv1, with the difference being that AraBERTv1 uses pre-segmented text where prefixes and suffixes were splitted using the [Farasa Segmenter](http://alt.qcri.org/farasa/segmenter.html).

@@ -12,28 +13,34 @@ The model was trained on ~70M sentences or ~23GB of Arabic text with ~3B words.

 We evalaute both AraBERT models on different downstream tasks and compare it to [mBERT]((https://github.com/google-research/bert/blob/master/multilingual.md)), and other state of the art models (*To the extent of our knowledge*). The Tasks were Sentiment Analysis on 6 different datasets ([HARD](https://github.com/elnagara/HARD-Arabic-Dataset), [ASTD-Balanced](https://www.aclweb.org/anthology/D15-1299), [ArsenTD-Lev](https://staff.aub.edu.lb/~we07/Publications/ArSentD-LEV_Sentiment_Corpus.pdf), [LABR](https://github.com/mohamedadaly/LABR), [ArSaS](http://lrec-conf.org/workshops/lrec2018/W30/pdf/22_W30.pdf)), Named Entity Recognition with the [ANERcorp](http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp), and Arabic Question Answering on [Arabic-SQuAD and ARCD](https://github.com/husseinmozannar/SOQAL)

+**Update 2 (21/5/2020) :**
+Added support for the farasapy segmenter https://github.com/MagedSaeed/farasapy in the ``preprocess_arabert.py`` which is ~6x faster than the ``py4j.java_gateway``, consider setting ``use_farasapy=True`` when calling preprocess and pass it an instance of ``FarasaSegmenter(interactive=True)`` with interactive set to ``True`` for faster segmentation.
+
+**Update 1 (21/4/2020) :** 
+Fixed an issue with ARCD fine-tuning which drastically improved performance. Initially we didn't account for the change of the ```answer_start``` during preprocessing.
 ## Results (Acc.)
 Task | prev. SOTA | mBERT | AraBERTv0.1 | AraBERTv1
 ---|:---:|:---:|:---:|:---:
-HARD |95.7 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)|95.7|96.2|96.1
-ASTD |86.5 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)| 80.1|92.2|92.6
-ArsenTD-Lev|52.4 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)|51|58.9|59.4
-AJGT|93 [Dahou et.al.](https://dl.acm.org/doi/fullHtml/10.1145/3314941)| 83.6|94.1|93.8
-LABR|87.5 [Dahou et.al.](https://dl.acm.org/doi/fullHtml/10.1145/3314941)|83|85.9|86.7
-ANERcorp|81.7 (BiLSTM-CRF)|78.4|84.2|81.9
-ARCD|mBERT|EM:34.2 F1: 61.3|EM:30.1 F1:61.2|EM:30.6 F1: 62.7
+HARD |95.7 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)|95.7|**96.2**|96.1
+ASTD |86.5 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)| 80.1|92.2|**92.6**
+ArsenTD-Lev|52.4 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)|51|58.9|**59.4**
+AJGT|93 [Dahou et.al.](https://dl.acm.org/doi/fullHtml/10.1145/3314941)| 83.6|93.1|**93.8**
+LABR|**87.5** [Dahou et.al.](https://dl.acm.org/doi/fullHtml/10.1145/3314941)|83|85.9|86.7
+ANERcorp|81.7 (BiLSTM-CRF)|78.4|**84.2**|81.9
+ARCD|mBERT|EM:34.2 F1: 61.3|EM:51.14 F1:82.13|**EM:54.84 F1: 82.15**

-*We would be extremly thankful if everyone can contibute to the Results table by adding more scores on different datasets*
+*If you tested AraBERT on a public dataset and you want to add your results to the table above, open a pull request or contact us. Also make sure to have your code available online so we can add it as a reference*

 ## How to use

-You can easily use AraBERT since it is almost fully compatible with existing codebases (You can use this repo instead of the official BERT one, the only difference is in the ```tokenization.py``` file where we modify the _is_punctuation function to make it compatible with the "+" symbol and the "[" and "]" characters)
-
-To use HuggingFace's Transformer repository you only need to provide a lost of token that forces the model to not split them, also make sure that the text is pre-segmented:
+You can easily use AraBERT since it is almost fully compatible with existing codebases (Use this repo instead of the official BERT one, the only difference is in the ```tokenization.py``` file where we modify the _is_punctuation function to make it compatible with the "+" symbol and the "[" and "]" characters)

+To use HuggingFace's Transformer repository you only need to provide a list of token that forces the model to not split them, also make sure that the text is pre-segmented:
+**Not all libraries built on top of transformers support the `never_split` argument**
 ```python
-from transformers import AutoTokenizer
-from preprocess_arabert import never_split_tokens
+from transformers import AutoTokenizer, AutoModel
+from arabert.preprocess_arabert import never_split_tokens, preprocess
+from farasa.segmenter import FarasaSegmenter

 arabert_tokenizer = AutoTokenizer.from_pretrained(
    "aubmindlab/bert-base-arabert",
@@ -42,27 +49,75 @@ arabert_tokenizer = AutoTokenizer.from_pretrained(
    never_split=never_split_tokens)
 arabert_model = AutoModel.from_pretrained("aubmindlab/bert-base-arabert")

-arabert_tokenizer.tokenize("و+ لن نبالغ إذا قل +نا إن هاتف أو كمبيوتر ال+ مكتب في زمن +نا هذا ضروري")
+#Preprocess the text to make it compatible with AraBERT using farasapy
+farasa_segmenter = FarasaSegmenter(interactive=True)
+
+#or you can use a py4j JavaGateway to the farasa Segmneter .jar but it's slower 
+#(see update 2)
+#from py4j.java_gateway import JavaGateway
+#gateway = JavaGateway.launch_gateway(classpath='./PATH_TO_FARASA/FarasaSegmenterJar.jar')
+#farasa = gateway.jvm.com.qcri.farasa.segmenter.Farasa()
+
+text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
+text_preprocessed = preprocess( text,
+                                do_farasa_tokenization = True,
+                                farasa = farasa_segmenter,
+                                use_farasapy = True)
+
+>>>text_preprocessed: "و+ لن نبالغ إذا قل +نا إن هاتف أو كمبيوتر ال+ مكتب في زمن +نا هذا ضروري"
+
+arabert_tokenizer.tokenize(text_preprocessed)

 >>> ['و+', 'لن', 'نبال', '##غ', 'إذا', 'قل', '+نا', 'إن', 'هاتف', 'أو', 'كمبيوتر', 'ال+', 'مكتب', 'في', 'زمن', '+نا', 'هذا', 'ضروري']
 ```

 **AraBERTv0.1 is compatible with all existing libraries, since it needs no pre-segmentation.**
 ```python
-from transformers import AutoTokenizer
-from preprocess_arabert import never_split_tokens
+from transformers import AutoTokenizer, AutoModel

 arabert_tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv01",do_lower_case=False)
 arabert_model = AutoModel.from_pretrained("aubmindlab/bert-base-arabertv01")

-arabert_tokenizer.tokenize("ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري")
+text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
+arabert_tokenizer.tokenize(text)

 >>> ['ولن', 'ن', '##بالغ', 'إذا', 'قلنا', 'إن', 'هاتف', 'أو', 'كمبيوتر', 'المكتب', 'في', 'زمن', '##ن', '##ا', 'هذا', 'ضروري']
 ```


-The ```araBERT_(initial_Demo_TF)_.ipynb``` Notebook is a small demo using the AJGT dataset using TensorFlow (GPU and TPU compatible).
+The ```araBERT_(Updated_Demo_TF).ipynb``` Notebook is a small demo using the AJGT dataset using TensorFlow (GPU and TPU compatible).

+**Coming Soon :** Fine-tunning demo using HuggingFace's Trainer API
+
+**AraBERT on ARCD**
+During the preprocessing step the ```answer_start``` character position needs to be recalculated. You can use the file ```arcd_preprocessing.py``` as shown below to clean, preprocess the ARCD dataset before running ```run_squad.py```. More detailed Colab notebook is available in the [SOQAL repo](https://github.com/husseinmozannar/SOQAL).
+```bash
+python arcd_preprocessing.py \
+    --input_file="/PATH_TO/arcd-test.json" \
+    --output_file="arcd-test-pre.json" \
+    --do_farasa_tokenization=True \
+    --use_farasapy=True \
+```
+```bash
+python SOQAL/bert/run_squad.py \
+  --vocab_file="/PATH_TO_PRETRAINED_TF_CKPT/vocab.txt" \
+  --bert_config_file="/PATH_TO_PRETRAINED_TF_CKPT/config.json" \
+  --init_checkpoint="/PATH_TO_PRETRAINED_TF_CKPT/" \
+  --do_train=True \
+  --train_file=turk_combined_all_pre.json \
+  --do_predict=True \
+  --predict_file=arcd-test-pre.json \
+  --train_batch_size=32 \
+  --predict_batch_size=24 \
+  --learning_rate=3e-5 \
+  --num_train_epochs=4 \
+  --max_seq_length=384 \
+  --doc_stride=128 \
+  --do_lower_case=False\
+  --output_dir="/PATH_TO/OUTPUT_PATH"/ \
+  --use_tpu=True \
+  --tpu_name=$TPU_ADDRESS \
+```
 ## Model Weights and Vocab Download
 Models | AraBERTv0.1 | AraBERTv1
 ---|:---:|:---:
@@ -73,21 +128,17 @@ PyTorch| [Drive_Link](https://drive.google.com/open?id=1-_3te42mQCPD8SxwZ3l-VBL7

 ## If you used this model please cite us as:
 ```
-@misc{antoun2020arabert,
-    title={AraBERT: Transformer-based Model for Arabic Language Understanding},
-    author={Wissam Antoun and Fady Baly and Hazem Hajj},
-    year={2020},
-    eprint={2003.00104},
-    archivePrefix={arXiv},
-    primaryClass={cs.CL}
+@inproceedings{antoun2020arabert,
+  title={AraBERT: Transformer-based Model for Arabic Language Understanding},
+  author={Antoun, Wissam and Baly, Fady and Hajj, Hazem},
+  booktitle={LREC 2020 Workshop Language Resources and Evaluation Conference 11--16 May 2020},
+  pages={9}
 }
 ```
 ## Acknowledgments 
-Thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs, couldn't have done it without this program, and to the [AUB MIND Lab](https://sites.aub.edu.lb/mindlab/) Members for the continous support. Also thanks to [Yakshof](https://www.yakshof.com/#/) and Assafir for data and storage access.
+Thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs, couldn't have done it without this program, and to the [AUB MIND Lab](https://sites.aub.edu.lb/mindlab/) Members for the continous support. Also thanks to [Yakshof](https://www.yakshof.com/#/) and Assafir for data and storage access. Another thanks for Habib Rahal (https://www.behance.net/rahalhabib), for putting a face to AraBERT.

 ## Contacts
 **Wissam Antoun**: [Linkedin](https://www.linkedin.com/in/giulio-ravasio-3a81a9110/) | [Twitter](https://twitter.com/wissam_antoun) | [Github](https://github.com/WissamAntoun) | <wfa07@mail.aub.edu> | <wissam.antoun@gmail.com>

-**Fady Baly**: [Linkedin](https://www.linkedin.com/in/fadybaly/) | [Twitter](https://twitter.com/BalyFady) | [Github](https://github.com/fadybaly) | <fgb06@mail.aub.edu> | <baly.fady@gmail.com>
-
-***We are looking for sponsors to train BERT-Large and other Transformer models, the sponsor only needs to cover to data storage and compute cost of the generating the pretraining data***
+**Fady Baly**: [Linkedin](https://www.linkedin.com/in/fadybaly/) | [Twitter](https://twitter.com/fadybaly) | [Github](https://github.com/fadybaly) | <fgb06@mail.aub.edu> | <baly.fady@gmail.com>
--- a/model_cards/aubmindlab/bert-base-arabertv01/README.md
+++ b/model_cards/aubmindlab/bert-base-arabertv01/README.md
@@ -3,8 +3,9 @@ language: arabic
 ---

 # AraBERT : Pre-training BERT for Arabic Language Understanding
+<img src="https://github.com/aub-mind/arabert/blob/master/arabert_logo.png" width="100" align="left"/>  

-**AraBERT** is an Arabic pretrained lanaguage model based on [Google's BERT architechture](https://github.com/google-research/bert). AraBERT uses the same BERT-Base config.
+**AraBERT** is an Arabic pretrained lanaguage model based on [Google's BERT architechture](https://github.com/google-research/bert). AraBERT uses the same BERT-Base config. More details are available in the [AraBERT PAPER](https://arxiv.org/abs/2003.00104v2) and in the [AraBERT Meetup](https://github.com/WissamAntoun/pydata_khobar_meetup)

 There are two version off the model AraBERTv0.1 and AraBERTv1, with the difference being that AraBERTv1 uses pre-segmented text where prefixes and suffixes were splitted using the [Farasa Segmenter](http://alt.qcri.org/farasa/segmenter.html).

@@ -12,28 +13,34 @@ The model was trained on ~70M sentences or ~23GB of Arabic text with ~3B words.

 We evalaute both AraBERT models on different downstream tasks and compare it to [mBERT]((https://github.com/google-research/bert/blob/master/multilingual.md)), and other state of the art models (*To the extent of our knowledge*). The Tasks were Sentiment Analysis on 6 different datasets ([HARD](https://github.com/elnagara/HARD-Arabic-Dataset), [ASTD-Balanced](https://www.aclweb.org/anthology/D15-1299), [ArsenTD-Lev](https://staff.aub.edu.lb/~we07/Publications/ArSentD-LEV_Sentiment_Corpus.pdf), [LABR](https://github.com/mohamedadaly/LABR), [ArSaS](http://lrec-conf.org/workshops/lrec2018/W30/pdf/22_W30.pdf)), Named Entity Recognition with the [ANERcorp](http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp), and Arabic Question Answering on [Arabic-SQuAD and ARCD](https://github.com/husseinmozannar/SOQAL)

+**Update 2 (21/5/2020) :**
+Added support for the farasapy segmenter https://github.com/MagedSaeed/farasapy in the ``preprocess_arabert.py`` which is ~6x faster than the ``py4j.java_gateway``, consider setting ``use_farasapy=True`` when calling preprocess and pass it an instance of ``FarasaSegmenter(interactive=True)`` with interactive set to ``True`` for faster segmentation.
+
+**Update 1 (21/4/2020) :** 
+Fixed an issue with ARCD fine-tuning which drastically improved performance. Initially we didn't account for the change of the ```answer_start``` during preprocessing.
 ## Results (Acc.)
 Task | prev. SOTA | mBERT | AraBERTv0.1 | AraBERTv1
 ---|:---:|:---:|:---:|:---:
-HARD |95.7 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)|95.7|96.2|96.1
-ASTD |86.5 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)| 80.1|92.2|92.6
-ArsenTD-Lev|52.4 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)|51|58.9|59.4
-AJGT|93 [Dahou et.al.](https://dl.acm.org/doi/fullHtml/10.1145/3314941)| 83.6|94.1|93.8
-LABR|87.5 [Dahou et.al.](https://dl.acm.org/doi/fullHtml/10.1145/3314941)|83|85.9|86.7
-ANERcorp|81.7 (BiLSTM-CRF)|78.4|84.2|81.9
-ARCD|mBERT|EM:34.2 F1: 61.3|EM:30.1 F1:61.2|EM:30.6 F1: 62.7
+HARD |95.7 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)|95.7|**96.2**|96.1
+ASTD |86.5 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)| 80.1|92.2|**92.6**
+ArsenTD-Lev|52.4 [ElJundi et.al.](https://www.aclweb.org/anthology/W19-4608/)|51|58.9|**59.4**
+AJGT|93 [Dahou et.al.](https://dl.acm.org/doi/fullHtml/10.1145/3314941)| 83.6|93.1|**93.8**
+LABR|**87.5** [Dahou et.al.](https://dl.acm.org/doi/fullHtml/10.1145/3314941)|83|85.9|86.7
+ANERcorp|81.7 (BiLSTM-CRF)|78.4|**84.2**|81.9
+ARCD|mBERT|EM:34.2 F1: 61.3|EM:51.14 F1:82.13|**EM:54.84 F1: 82.15**

-*We would be extremly thankful if everyone can contibute to the Results table by adding more scores on different datasets*
+*If you tested AraBERT on a public dataset and you want to add your results to the table above, open a pull request or contact us. Also make sure to have your code available online so we can add it as a reference*

 ## How to use

-You can easily use AraBERT since it is almost fully compatible with existing codebases (You can use this repo instead of the official BERT one, the only difference is in the ```tokenization.py``` file where we modify the _is_punctuation function to make it compatible with the "+" symbol and the "[" and "]" characters)
-
-To use HuggingFace's Transformer repository you only need to provide a lost of token that forces the model to not split them, also make sure that the text is pre-segmented:
+You can easily use AraBERT since it is almost fully compatible with existing codebases (Use this repo instead of the official BERT one, the only difference is in the ```tokenization.py``` file where we modify the _is_punctuation function to make it compatible with the "+" symbol and the "[" and "]" characters)

+To use HuggingFace's Transformer repository you only need to provide a list of token that forces the model to not split them, also make sure that the text is pre-segmented:
+**Not all libraries built on top of transformers support the `never_split` argument**
 ```python
-from transformers import AutoTokenizer
-from preprocess_arabert import never_split_tokens
+from transformers import AutoTokenizer, AutoModel
+from arabert.preprocess_arabert import never_split_tokens, preprocess
+from farasa.segmenter import FarasaSegmenter

 arabert_tokenizer = AutoTokenizer.from_pretrained(
    "aubmindlab/bert-base-arabert",
@@ -42,27 +49,75 @@ arabert_tokenizer = AutoTokenizer.from_pretrained(
    never_split=never_split_tokens)
 arabert_model = AutoModel.from_pretrained("aubmindlab/bert-base-arabert")

-arabert_tokenizer.tokenize("و+ لن نبالغ إذا قل +نا إن هاتف أو كمبيوتر ال+ مكتب في زمن +نا هذا ضروري")
+#Preprocess the text to make it compatible with AraBERT using farasapy
+farasa_segmenter = FarasaSegmenter(interactive=True)
+
+#or you can use a py4j JavaGateway to the farasa Segmneter .jar but it's slower 
+#(see update 2)
+#from py4j.java_gateway import JavaGateway
+#gateway = JavaGateway.launch_gateway(classpath='./PATH_TO_FARASA/FarasaSegmenterJar.jar')
+#farasa = gateway.jvm.com.qcri.farasa.segmenter.Farasa()
+
+text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
+text_preprocessed = preprocess( text,
+                                do_farasa_tokenization = True,
+                                farasa = farasa_segmenter,
+                                use_farasapy = True)
+
+>>>text_preprocessed: "و+ لن نبالغ إذا قل +نا إن هاتف أو كمبيوتر ال+ مكتب في زمن +نا هذا ضروري"
+
+arabert_tokenizer.tokenize(text_preprocessed)

 >>> ['و+', 'لن', 'نبال', '##غ', 'إذا', 'قل', '+نا', 'إن', 'هاتف', 'أو', 'كمبيوتر', 'ال+', 'مكتب', 'في', 'زمن', '+نا', 'هذا', 'ضروري']
 ```

 **AraBERTv0.1 is compatible with all existing libraries, since it needs no pre-segmentation.**
 ```python
-from transformers import AutoTokenizer
-from preprocess_arabert import never_split_tokens
+from transformers import AutoTokenizer, AutoModel

 arabert_tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv01",do_lower_case=False)
 arabert_model = AutoModel.from_pretrained("aubmindlab/bert-base-arabertv01")

-arabert_tokenizer.tokenize("ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري")
+text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
+arabert_tokenizer.tokenize(text)

 >>> ['ولن', 'ن', '##بالغ', 'إذا', 'قلنا', 'إن', 'هاتف', 'أو', 'كمبيوتر', 'المكتب', 'في', 'زمن', '##ن', '##ا', 'هذا', 'ضروري']
 ```


-The ```araBERT_(initial_Demo_TF)_.ipynb``` Notebook is a small demo using the AJGT dataset using TensorFlow (GPU and TPU compatible).
+The ```araBERT_(Updated_Demo_TF).ipynb``` Notebook is a small demo using the AJGT dataset using TensorFlow (GPU and TPU compatible).

+**Coming Soon :** Fine-tunning demo using HuggingFace's Trainer API
+
+**AraBERT on ARCD**
+During the preprocessing step the ```answer_start``` character position needs to be recalculated. You can use the file ```arcd_preprocessing.py``` as shown below to clean, preprocess the ARCD dataset before running ```run_squad.py```. More detailed Colab notebook is available in the [SOQAL repo](https://github.com/husseinmozannar/SOQAL).
+```bash
+python arcd_preprocessing.py \
+    --input_file="/PATH_TO/arcd-test.json" \
+    --output_file="arcd-test-pre.json" \
+    --do_farasa_tokenization=True \
+    --use_farasapy=True \
+```
+```bash
+python SOQAL/bert/run_squad.py \
+  --vocab_file="/PATH_TO_PRETRAINED_TF_CKPT/vocab.txt" \
+  --bert_config_file="/PATH_TO_PRETRAINED_TF_CKPT/config.json" \
+  --init_checkpoint="/PATH_TO_PRETRAINED_TF_CKPT/" \
+  --do_train=True \
+  --train_file=turk_combined_all_pre.json \
+  --do_predict=True \
+  --predict_file=arcd-test-pre.json \
+  --train_batch_size=32 \
+  --predict_batch_size=24 \
+  --learning_rate=3e-5 \
+  --num_train_epochs=4 \
+  --max_seq_length=384 \
+  --doc_stride=128 \
+  --do_lower_case=False\
+  --output_dir="/PATH_TO/OUTPUT_PATH"/ \
+  --use_tpu=True \
+  --tpu_name=$TPU_ADDRESS \
+```
 ## Model Weights and Vocab Download
 Models | AraBERTv0.1 | AraBERTv1
 ---|:---:|:---:
@@ -73,21 +128,17 @@ PyTorch| [Drive_Link](https://drive.google.com/open?id=1-_3te42mQCPD8SxwZ3l-VBL7

 ## If you used this model please cite us as:
 ```
-@misc{antoun2020arabert,
-    title={AraBERT: Transformer-based Model for Arabic Language Understanding},
-    author={Wissam Antoun and Fady Baly and Hazem Hajj},
-    year={2020},
-    eprint={2003.00104},
-    archivePrefix={arXiv},
-    primaryClass={cs.CL}
+@inproceedings{antoun2020arabert,
+  title={AraBERT: Transformer-based Model for Arabic Language Understanding},
+  author={Antoun, Wissam and Baly, Fady and Hajj, Hazem},
+  booktitle={LREC 2020 Workshop Language Resources and Evaluation Conference 11--16 May 2020},
+  pages={9}
 }
 ```
 ## Acknowledgments 
-Thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs, couldn't have done it without this program, and to the [AUB MIND Lab](https://sites.aub.edu.lb/mindlab/) Members for the continous support. Also thanks to [Yakshof](https://www.yakshof.com/#/) and Assafir for data and storage access.
+Thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs, couldn't have done it without this program, and to the [AUB MIND Lab](https://sites.aub.edu.lb/mindlab/) Members for the continous support. Also thanks to [Yakshof](https://www.yakshof.com/#/) and Assafir for data and storage access. Another thanks for Habib Rahal (https://www.behance.net/rahalhabib), for putting a face to AraBERT.

 ## Contacts
 **Wissam Antoun**: [Linkedin](https://www.linkedin.com/in/giulio-ravasio-3a81a9110/) | [Twitter](https://twitter.com/wissam_antoun) | [Github](https://github.com/WissamAntoun) | <wfa07@mail.aub.edu> | <wissam.antoun@gmail.com>

-**Fady Baly**: [Linkedin](https://www.linkedin.com/in/fadybaly/) | [Twitter](https://twitter.com/BalyFady) | [Github](https://github.com/fadybaly) | <fgb06@mail.aub.edu> | <baly.fady@gmail.com>
-
-***We are looking for sponsors to train BERT-Large and other Transformer models, the sponsor only needs to cover to data storage and compute cost of the generating the pretraining data***
+**Fady Baly**: [Linkedin](https://www.linkedin.com/in/fadybaly/) | [Twitter](https://twitter.com/fadybaly) | [Github](https://github.com/fadybaly) | <fgb06@mail.aub.edu> | <baly.fady@gmail.com>
--- a/model_cards/bayartsogt/albert-mongolian/README.md
+++ b/model_cards/bayartsogt/albert-mongolian/README.md
@@ -0,0 +1,55 @@
+# ALBERT-Mongolian
+[pretraining repo link](https://github.com/bayartsogt-ya/albert-mongolian)
+## Model description
+Here we provide pretrained ALBERT model and trained SentencePiece model for Mongolia text. Training data is the Mongolian wikipedia corpus from Wikipedia Downloads and Mongolian News corpus.
+
+## Evaluation Result:
+```
+loss = 1.7478163
+masked_lm_accuracy = 0.6838185
+masked_lm_loss = 1.6687671
+sentence_order_accuracy = 0.998125
+sentence_order_loss = 0.007942731
+```
+
+## Fine-tuning Result on Eduge Dataset:
+```
+                precision    recall  f1-score   support
+
+  байгал орчин       0.83      0.76      0.80       483
+     боловсрол       0.79      0.75      0.77       420
+         спорт       0.98      0.96      0.97      1391
+     технологи       0.85      0.83      0.84       543
+       улс төр       0.88      0.87      0.87      1336
+    урлаг соёл       0.89      0.94      0.91       726
+         хууль       0.87      0.83      0.85       840
+   эдийн засаг       0.80      0.84      0.82      1265
+    эрүүл мэнд       0.84      0.90      0.87       562
+
+      accuracy                           0.87      7566
+     macro avg       0.86      0.85      0.86      7566
+  weighted avg       0.87      0.87      0.87      7566
+```
+
+## Reference
+1. [ALBERT - official repo](https://github.com/google-research/albert)
+2. [WikiExtrator](https://github.com/attardi/wikiextractor)
+3. [Mongolian BERT](https://github.com/tugstugi/mongolian-bert)
+4. [ALBERT - Japanese](https://github.com/alinear-corp/albert-japanese)
+5. [Mongolian Text Classification](https://github.com/sharavsambuu/mongolian-text-classification)
+6. [You's paper](https://arxiv.org/abs/1904.00962)
+
+## Citation
+```
+@misc{albert-mongolian,
+  author = {Bayartsogt Yadamsuren},
+  title = {ALBERT Pretrained Model on Mongolian Datasets},
+  year = {2020},
+  publisher = {GitHub},
+  journal = {GitHub repository},
+  howpublished = {\url{https://github.com/bayartsogt-ya/albert-mongolian/}}
+}
+```
+
+## For More Information
+Please contact by bayartsogtyadamsuren@icloud.com
--- a/model_cards/bert-base-german-cased-README.md
+++ b/model_cards/bert-base-german-cased-README.md
@@ -18,13 +18,16 @@ tags:
 **Eval data:** Conll03 (NER), GermEval14 (NER), GermEval18 (Classification), GNAD (Classification)  
 **Infrastructure**: 1x TPU v2  
 **Published**: Jun 14th, 2019
+
+**Update April 3rd, 2020**: we updated the vocabulary file on deepset's s3 to conform with the default tokenization of punctuation tokens. 
+For details see the related [FARM issue](https://github.com/deepset-ai/FARM/issues/60). If you want to use the old vocab we have also uploaded a ["deepset/bert-base-german-cased-oldvocab"](https://huggingface.co/deepset/bert-base-german-cased-oldvocab) model.
 
 ## Details
 - We trained using Google's Tensorflow code on a single cloud TPU v2 with standard settings.
 - We trained 810k steps with a batch size of 1024 for sequence length 128 and 30k steps with sequence length 512. Training took about 9 days.
 - As training data we used the latest German Wikipedia dump (6GB of raw txt files), the OpenLegalData dump (2.4 GB) and news articles (3.6 GB).
 - We cleaned the data dumps with tailored scripts and segmented sentences with spacy v2.1. To create tensorflow records we used the recommended sentencepiece library for creating the word piece vocabulary and tensorflow scripts to convert the text to data usable by BERT.
- Update April 3rd, 2020: updated the vocab file on deepset s3 to adjust tokenization of punctuation.
+

 See https://deepset.ai/german-bert for more details

--- a/model_cards/deepset/bert-base-german-cased-oldvocab/README.md
+++ b/model_cards/deepset/bert-base-german-cased-oldvocab/README.md
@@ -0,0 +1,28 @@
+---
+language: german
+thumbnail: https://static.tildacdn.com/tild6438-3730-4164-b266-613634323466/german_bert.png
+tags:
+- exbert
+---
+
+<a href="https://huggingface.co/exbert/?model=bert-base-german-cased">
+	<img width="300px" src="https://hf-dinosaur.huggingface.co/exbert/button.png">
+</a>
+
+# German BERT with old vocabulary
+For details see the related [FARM issue](https://github.com/deepset-ai/FARM/issues/60).
+
+
+## About us
+![deepset logo](https://raw.githubusercontent.com/deepset-ai/FARM/master/docs/img/deepset_logo.png)
+
+We bring NLP to the industry via open source!  
+Our focus: Industry specific language models & large scale QA systems.  
+  
+Some of our work: 
+- [German BERT (aka "bert-base-german-cased")](https://deepset.ai/german-bert)
+- [FARM](https://github.com/deepset-ai/FARM)
+- [Haystack](https://github.com/deepset-ai/haystack/)
+
+Get in touch:
+[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Website](https://deepset.ai)  
--- a/model_cards/digitalepidemiologylab/covid-twitter-bert/README.md
+++ b/model_cards/digitalepidemiologylab/covid-twitter-bert/README.md
@@ -0,0 +1,18 @@
+# COVID-Twitter-BERT (CT-BERT)
+BERT-large-uncased model, pretrained on a corpus of messages from Twitter about COVID-19
+
+## Overview
+This model was trained on 160M tweets collected between January 12 and April 16, 2020 containing at least one of the keywords "wuhan", "ncov", "coronavirus", "covid", or "sars-cov-2". These tweets were filtered and preprocessed to reach a final sample of 22.5M tweets (containing 40.7M sentences and 633M tokens) which were used for training.
+
+This model was evaluated based on downstream classification tasks, but it could be used for any other NLP task which can leverage contextual embeddings. 
+
+In order to achieve best results, make sure to use the same text preprocessing as we did for pretraining. This involves replacing user mentions, urls and emojis. You can find a script on our projects [GitHub repo](https://github.com/digitalepidemiologylab/covid-twitter-bert).
+
+## Example usage
+```python
+tokenizer = AutoTokenizer.from_pretrained("digitalepidemiologylab/covid-twitter-bert")
+model = TFAutoModel.from_pretrained("digitalepidemiologylab/covid-twitter-bert")
+```
+
+## References
+[1] Martin Müller, Marcel Salaté, Per E Kummervold. "COVID-Twitter-BERT: A Natural Language Processing Model to Analyse COVID-19 Content on Twitter" arXiv preprint arXiv:2005.07503 (2020).
--- a/model_cards/dkleczek/bert-base-polish-cased-v1/README.md
+++ b/model_cards/dkleczek/bert-base-polish-cased-v1/README.md
@@ -0,0 +1,135 @@
+---
+language: polish
+thumbnail: https://raw.githubusercontent.com/kldarek/polbert/master/img/polbert.png
+---
+
+# Polbert - Polish BERT
+Polish version of BERT language model is here! It is now available in two variants: cased and uncased, both can be downloaded and used via HuggingFace transformers library. I recommend using the cased model, more info on the differences and benchmark results below. 
+
+![PolBERT image](https://raw.githubusercontent.com/kldarek/polbert/master/img/polbert.png)
+
+## Cased and uncased variants
+
+* I initially trained the uncased model, the corpus and training details are referenced below. Here are some issues I found after I published the uncased model:
+    * Some Polish characters and accents are not tokenized correctly through the BERT tokenizer when applying lowercase. This doesn't impact sequence classification much, but may influence token classfication tasks significantly.
+    * I noticed a lot of duplicates in the Open Subtitles dataset, which dominates the training corpus.
+    * I didn't use Whole Word Masking. 
+* The cased model improves on the uncased model in the following ways:
+    * All Polish characters and accents should now be tokenized correctly. 
+    * I removed duplicates from Open Subtitles dataset. The corpus is smaller, but more balanced now. 
+    * The model is trained with Whole Word Masking. 
+
+## Pre-training corpora
+
+Below is the list of corpora used along with the output of `wc` command (counting lines, words and characters). These corpora were divided into sentences with srxsegmenter (see references), concatenated and tokenized with HuggingFace BERT Tokenizer. 
+
+### Uncased
+
+| Tables        | Lines           | Words  | Characters  |
+| ------------- |--------------:| -----:| -----:|
+| [Polish subset of Open Subtitles](http://opus.nlpl.eu/OpenSubtitles-v2018.php)      | 236635408| 1431199601 | 7628097730 |
+| [Polish subset of ParaCrawl](http://opus.nlpl.eu/ParaCrawl.php)     | 8470950      |   176670885 | 1163505275 |
+| [Polish Parliamentary Corpus](http://clip.ipipan.waw.pl/PPC) | 9799859      |    121154785 | 938896963 |
+| [Polish Wikipedia - Feb 2020](https://dumps.wikimedia.org/plwiki/latest/plwiki-latest-pages-articles.xml.bz2) | 8014206      |    132067986 | 1015849191 |
+| Total | 262920423      |    1861093257 | 10746349159 |
+
+### Cased
+
+| Tables        | Lines           | Words  | Characters  |
+| ------------- |--------------:| -----:| -----:|
+| [Polish subset of Open Subtitles (Deduplicated) ](http://opus.nlpl.eu/OpenSubtitles-v2018.php)      | 41998942| 213590656 | 1424873235 |
+| [Polish subset of ParaCrawl](http://opus.nlpl.eu/ParaCrawl.php)     | 8470950      |   176670885 | 1163505275 |
+| [Polish Parliamentary Corpus](http://clip.ipipan.waw.pl/PPC) | 9799859      |    121154785 | 938896963 |
+| [Polish Wikipedia - Feb 2020](https://dumps.wikimedia.org/plwiki/latest/plwiki-latest-pages-articles.xml.bz2) | 8014206      |    132067986 | 1015849191 |
+| Total | 68283960      |    646479197 | 4543124667 |
+
+
+## Pre-training details
+
+### Uncased 
+
+* Polbert was trained with code provided in Google BERT's github repository (https://github.com/google-research/bert)
+* Currently released model follows bert-base-uncased model architecture (12-layer, 768-hidden, 12-heads, 110M parameters)
+* Training set-up: in total 1 million training steps: 
+    * 100.000 steps - 128 sequence length, batch size 512, learning rate 1e-4 (10.000 steps warmup)
+    * 800.000 steps - 128 sequence length, batch size 512, learning rate 5e-5
+    * 100.000 steps - 512 sequence length, batch size 256, learning rate 2e-5
+* The model was trained on a single Google Cloud TPU v3-8 
+
+### Cased
+
+* Same approach as uncased model, with the following differences:
+    * Whole Word Masking
+* Training set-up:
+    * 100.000 steps - 128 sequence length, batch size 2048, learning rate 1e-4 (10.000 steps warmup)
+    * 100.000 steps - 128 sequence length, batch size 2048, learning rate 5e-5
+    * 100.000 steps - 512 sequence length, batch size 256, learning rate 2e-5
+
+
+## Usage
+Polbert is released via [HuggingFace Transformers library](https://huggingface.co/transformers/).
+
+For an example use as language model, see [this notebook](/LM_testing.ipynb) file. 
+
+### Uncased
+
+```python
+from transformers import *
+model = BertForMaskedLM.from_pretrained("dkleczek/bert-base-polish-uncased-v1")
+tokenizer = BertTokenizer.from_pretrained("dkleczek/bert-base-polish-uncased-v1")
+nlp = pipeline('fill-mask', model=model, tokenizer=tokenizer)
+for pred in nlp(f"Adam Mickiewicz wielkim polskim {nlp.tokenizer.mask_token} był."):
+  print(pred)
+# Output:
+# {'sequence': '[CLS] adam mickiewicz wielkim polskim poeta był. [SEP]', 'score': 0.47196975350379944, 'token': 26596}
+# {'sequence': '[CLS] adam mickiewicz wielkim polskim bohaterem był. [SEP]', 'score': 0.09127858281135559, 'token': 10953}
+# {'sequence': '[CLS] adam mickiewicz wielkim polskim człowiekiem był. [SEP]', 'score': 0.0647173821926117, 'token': 5182}
+# {'sequence': '[CLS] adam mickiewicz wielkim polskim pisarzem był. [SEP]', 'score': 0.05232388526201248, 'token': 24293}
+# {'sequence': '[CLS] adam mickiewicz wielkim polskim politykiem był. [SEP]', 'score': 0.04554257541894913, 'token': 44095}
+```
+
+### Cased
+
+```python
+model = BertForMaskedLM.from_pretrained("dkleczek/bert-base-polish-cased-v1")
+tokenizer = BertTokenizer.from_pretrained("dkleczek/bert-base-polish-cased-v1")
+nlp = pipeline('fill-mask', model=model, tokenizer=tokenizer)
+for pred in nlp(f"Adam Mickiewicz wielkim polskim {nlp.tokenizer.mask_token} był."):
+  print(pred)
+# Output:
+# {'sequence': '[CLS] Adam Mickiewicz wielkim polskim pisarzem był. [SEP]', 'score': 0.5391148328781128, 'token': 37120}
+# {'sequence': '[CLS] Adam Mickiewicz wielkim polskim człowiekiem był. [SEP]', 'score': 0.11683262139558792, 'token': 6810}
+# {'sequence': '[CLS] Adam Mickiewicz wielkim polskim bohaterem był. [SEP]', 'score': 0.06021466106176376, 'token': 17709}
+# {'sequence': '[CLS] Adam Mickiewicz wielkim polskim mistrzem był. [SEP]', 'score': 0.051870670169591904, 'token': 14652}
+# {'sequence': '[CLS] Adam Mickiewicz wielkim polskim artystą był. [SEP]', 'score': 0.031787533313035965, 'token': 35680}
+```
+
+See the next section for an example usage of Polbert in downstream tasks. 
+
+## Evaluation
+Thanks to Allegro, we now have the [KLEJ benchmark](https://klejbenchmark.com/leaderboard/), a set of nine evaluation tasks for the Polish language understanding. The following results are achieved by running standard set of evaluation scripts (no tricks!) utilizing both cased and uncased variants of Polbert.
+
+| Model	| Average |	NKJP-NER | CDSC-E |	CDSC-R |	CBD	| PolEmo2.0-IN |	PolEmo2.0-OUT |	DYK	| PSC |	AR |
+| ------------- |--------------:|--------------:|--------------:|--------------:|--------------:|--------------:|--------------:|--------------:|--------------:|--------------:|
+| Polbert cased | 81.7 |	93.6 |	93.4 |	93.8 |	52.7 |	87.4 |	71.1 |	59.1 |	98.6 |	85.2 |
+| Polbert uncased | 81.4 |	90.1 |	93.9 |	93.5 |	55.0 |	88.1 |	68.8 |	59.4 |	98.8 |	85.4 |
+
+Note how the uncased model performs better than cased on some tasks? My guess this is because of the oversampling of Open Subtitles dataset and its similarity to data in some of these tasks. All these benchmark tasks are sequence classification, so the relative strength of the cased model is not so visible here. 
+
+## Bias
+The data used to train the model is biased. It may reflect stereotypes related to gender, ethnicity etc. Please be careful when using the model for downstream task to consider these biases and mitigate them.  
+
+## Acknowledgements
+* I'd like to express my gratitude to Google [TensorFlow Research Cloud (TFRC)](https://www.tensorflow.org/tfrc) for providing the free TPU credits - thank you!
+* Also appreciate the help from Timo Möller from [deepset](https://deepset.ai) for sharing tips and scripts based on their experience training German BERT model.
+* Big thanks to Allegro for releasing KLEJ Benchmark and specifically to Piotr Rybak for help with the evaluation and pointing out some issues with the tokenization. 
+* Finally, thanks to Rachel Thomas, Jeremy Howard and Sylvain Gugger from [fastai](https://www.fast.ai) for their NLP and Deep Learning courses! 
+
+## Author
+Darek Kłeczek - contact me on Twitter [@dk21](https://twitter.com/dk21)
+
+## References
+* https://github.com/google-research/bert
+* https://github.com/narusemotoki/srx_segmenter
+* SRX rules file for sentence splitting in Polish, written by Marcin Miłkowski: https://raw.githubusercontent.com/languagetool-org/languagetool/master/languagetool-core/src/main/resources/org/languagetool/resource/segment.srx
+* [KLEJ benchmark](https://klejbenchmark.com/leaderboard/)
--- a/model_cards/dkleczek/bert-base-polish-uncased-v1/README.md
+++ b/model_cards/dkleczek/bert-base-polish-uncased-v1/README.md
@@ -4,14 +4,27 @@ thumbnail: https://raw.githubusercontent.com/kldarek/polbert/master/img/polbert.
 ---

 # Polbert - Polish BERT
-Polish version of BERT language model is here! While this is still work in progress, I'm happy to share the first model, similar to BERT-Base and trained on a large Polish corpus. If you'd like to contribute to this project, please reach out to me!
+Polish version of BERT language model is here! It is now available in two variants: cased and uncased, both can be downloaded and used via HuggingFace transformers library. I recommend using the cased model, more info on the differences and benchmark results below. 

 ![PolBERT image](https://raw.githubusercontent.com/kldarek/polbert/master/img/polbert.png)

+## Cased and uncased variants
+
+* I initially trained the uncased model, the corpus and training details are referenced below. Here are some issues I found after I published the uncased model:
+    * Some Polish characters and accents are not tokenized correctly through the BERT tokenizer when applying lowercase. This doesn't impact sequence classification much, but may influence token classfication tasks significantly.
+    * I noticed a lot of duplicates in the Open Subtitles dataset, which dominates the training corpus.
+    * I didn't use Whole Word Masking. 
+* The cased model improves on the uncased model in the following ways:
+    * All Polish characters and accents should now be tokenized correctly. 
+    * I removed duplicates from Open Subtitles dataset. The corpus is smaller, but more balanced now. 
+    * The model is trained with Whole Word Masking. 
+
 ## Pre-training corpora

 Below is the list of corpora used along with the output of `wc` command (counting lines, words and characters). These corpora were divided into sentences with srxsegmenter (see references), concatenated and tokenized with HuggingFace BERT Tokenizer. 

+### Uncased
+
 | Tables        | Lines           | Words  | Characters  |
 | ------------- |--------------:| -----:| -----:|
 | [Polish subset of Open Subtitles](http://opus.nlpl.eu/OpenSubtitles-v2018.php)      | 236635408| 1431199601 | 7628097730 |
@@ -20,7 +33,21 @@ Below is the list of corpora used along with the output of `wc` command (countin
 | [Polish Wikipedia - Feb 2020](https://dumps.wikimedia.org/plwiki/latest/plwiki-latest-pages-articles.xml.bz2) | 8014206      |    132067986 | 1015849191 |
 | Total | 262920423      |    1861093257 | 10746349159 |

+### Cased
+
+| Tables        | Lines           | Words  | Characters  |
+| ------------- |--------------:| -----:| -----:|
+| [Polish subset of Open Subtitles (Deduplicated) ](http://opus.nlpl.eu/OpenSubtitles-v2018.php)      | 41998942| 213590656 | 1424873235 |
+| [Polish subset of ParaCrawl](http://opus.nlpl.eu/ParaCrawl.php)     | 8470950      |   176670885 | 1163505275 |
+| [Polish Parliamentary Corpus](http://clip.ipipan.waw.pl/PPC) | 9799859      |    121154785 | 938896963 |
+| [Polish Wikipedia - Feb 2020](https://dumps.wikimedia.org/plwiki/latest/plwiki-latest-pages-articles.xml.bz2) | 8014206      |    132067986 | 1015849191 |
+| Total | 68283960      |    646479197 | 4543124667 |
+
+
 ## Pre-training details
+
+### Uncased 
+
 * Polbert was trained with code provided in Google BERT's github repository (https://github.com/google-research/bert)
 * Currently released model follows bert-base-uncased model architecture (12-layer, 768-hidden, 12-heads, 110M parameters)
 * Training set-up: in total 1 million training steps: 
@@ -29,10 +56,22 @@ Below is the list of corpora used along with the output of `wc` command (countin
    * 100.000 steps - 512 sequence length, batch size 256, learning rate 2e-5
 * The model was trained on a single Google Cloud TPU v3-8 

+### Cased
+
+* Same approach as uncased model, with the following differences:
+    * Whole Word Masking
+* Training set-up:
+    * 100.000 steps - 128 sequence length, batch size 2048, learning rate 1e-4 (10.000 steps warmup)
+    * 100.000 steps - 128 sequence length, batch size 2048, learning rate 5e-5
+    * 100.000 steps - 512 sequence length, batch size 256, learning rate 2e-5
+
+
 ## Usage
 Polbert is released via [HuggingFace Transformers library](https://huggingface.co/transformers/).

-For an example use as language model, see [this notebook](https://github.com/kldarek/polbert/blob/master/LM_testing.ipynb) file. 
+For an example use as language model, see [this notebook](/LM_testing.ipynb) file. 
+
+### Uncased

 ```python
 from transformers import *
@@ -41,7 +80,6 @@ tokenizer = BertTokenizer.from_pretrained("dkleczek/bert-base-polish-uncased-v1"
 nlp = pipeline('fill-mask', model=model, tokenizer=tokenizer)
 for pred in nlp(f"Adam Mickiewicz wielkim polskim {nlp.tokenizer.mask_token} był."):
  print(pred)
-
 # Output:
 # {'sequence': '[CLS] adam mickiewicz wielkim polskim poeta był. [SEP]', 'score': 0.47196975350379944, 'token': 26596}
 # {'sequence': '[CLS] adam mickiewicz wielkim polskim bohaterem był. [SEP]', 'score': 0.09127858281135559, 'token': 10953}
@@ -50,23 +88,42 @@ for pred in nlp(f"Adam Mickiewicz wielkim polskim {nlp.tokenizer.mask_token} by
 # {'sequence': '[CLS] adam mickiewicz wielkim polskim politykiem był. [SEP]', 'score': 0.04554257541894913, 'token': 44095}
 ```

+### Cased
+
+```python
+model = BertForMaskedLM.from_pretrained("dkleczek/bert-base-polish-cased-v1")
+tokenizer = BertTokenizer.from_pretrained("dkleczek/bert-base-polish-cased-v1")
+nlp = pipeline('fill-mask', model=model, tokenizer=tokenizer)
+for pred in nlp(f"Adam Mickiewicz wielkim polskim {nlp.tokenizer.mask_token} był."):
+  print(pred)
+# Output:
+# {'sequence': '[CLS] Adam Mickiewicz wielkim polskim pisarzem był. [SEP]', 'score': 0.5391148328781128, 'token': 37120}
+# {'sequence': '[CLS] Adam Mickiewicz wielkim polskim człowiekiem był. [SEP]', 'score': 0.11683262139558792, 'token': 6810}
+# {'sequence': '[CLS] Adam Mickiewicz wielkim polskim bohaterem był. [SEP]', 'score': 0.06021466106176376, 'token': 17709}
+# {'sequence': '[CLS] Adam Mickiewicz wielkim polskim mistrzem był. [SEP]', 'score': 0.051870670169591904, 'token': 14652}
+# {'sequence': '[CLS] Adam Mickiewicz wielkim polskim artystą był. [SEP]', 'score': 0.031787533313035965, 'token': 35680}
+```
+
 See the next section for an example usage of Polbert in downstream tasks. 

 ## Evaluation
-I'd love to get some help from the Polish NLP community here! If you feel like evaluating Polbert on some benchmark tasks, it would be great if you can share the results. 
+Thanks to Allegro, we now have the [KLEJ benchmark](https://klejbenchmark.com/leaderboard/), a set of nine evaluation tasks for the Polish language understanding. The following results are achieved by running standard set of evaluation scripts (no tricks!) utilizing both cased and uncased variants of Polbert.

-So far, I've compared the performance of Polbert vs Multilingual BERT on PolEmo 2.0 sentiment classification, here are the results. These results are are produced with a linear classification layer on top of pooled output, trained for 10 epochs with learning rate 3e-5. The checkpoint with the lowest loss on validation set is evaluated on the test set. 
+| Model	| Average |	NKJP-NER | CDSC-E |	CDSC-R |	CBD	| PolEmo2.0-IN |	PolEmo2.0-OUT |	DYK	| PSC |	AR |
+| ------------- |--------------:|--------------:|--------------:|--------------:|--------------:|--------------:|--------------:|--------------:|--------------:|--------------:|
+| Polbert cased | 81.7 |	93.6 |	93.4 |	93.8 |	52.7 |	87.4 |	71.1 |	59.1 |	98.6 |	85.2 |
+| Polbert uncased | 81.4 |	90.1 |	93.9 |	93.5 |	55.0 |	88.1 |	68.8 |	59.4 |	98.8 |	85.4 |

-| PolEmo 2.0 Sentiment Classifcation | Test Accuracy | 
-| ------------- |--------------:|
-| Multilingual BERT | 0.78 |
-| Polbert | 0.85 |
+Note how the uncased model performs better than cased on some tasks? My guess this is because of the oversampling of Open Subtitles dataset and its similarity to data in some of these tasks. All these benchmark tasks are sequence classification, so the relative strength of the cased model is not so visible here. 

 ## Bias
 The data used to train the model is biased. It may reflect stereotypes related to gender, ethnicity etc. Please be careful when using the model for downstream task to consider these biases and mitigate them.  

 ## Acknowledgements
-I'd like to express my gratitude to Google [TensorFlow Research Cloud (TFRC)](https://www.tensorflow.org/tfrc) for providing the free TPU credits - thank you! Also appreciate the help from Timo Möller from [deepset](https://deepset.ai) for sharing tips and scripts based on their experience training German BERT model. Finally, thanks to Rachel Thomas, Jeremy Howard and Sylvain Gugger from [fastai](https://www.fast.ai) for their NLP and Deep Learning courses!
+* I'd like to express my gratitude to Google [TensorFlow Research Cloud (TFRC)](https://www.tensorflow.org/tfrc) for providing the free TPU credits - thank you!
+* Also appreciate the help from Timo Möller from [deepset](https://deepset.ai) for sharing tips and scripts based on their experience training German BERT model.
+* Big thanks to Allegro for releasing KLEJ Benchmark and specifically to Piotr Rybak for help with the evaluation and pointing out some issues with the tokenization. 
+* Finally, thanks to Rachel Thomas, Jeremy Howard and Sylvain Gugger from [fastai](https://www.fast.ai) for their NLP and Deep Learning courses! 

 ## Author
 Darek Kłeczek - contact me on Twitter [@dk21](https://twitter.com/dk21)
@@ -75,5 +132,4 @@ Darek Kłeczek - contact me on Twitter [@dk21](https://twitter.com/dk21)
 * https://github.com/google-research/bert
 * https://github.com/narusemotoki/srx_segmenter
 * SRX rules file for sentence splitting in Polish, written by Marcin Miłkowski: https://raw.githubusercontent.com/languagetool-org/languagetool/master/languagetool-core/src/main/resources/org/languagetool/resource/segment.srx
-* PolEmo 2.0 Sentiment Analysis Dataset for CoNLL: https://clarin-pl.eu/dspace/handle/11321/710
-
+* [KLEJ benchmark](https://klejbenchmark.com/leaderboard/)
--- a/model_cards/dumitrescustefan/bert-base-romanian-cased-v1/README.md
+++ b/model_cards/dumitrescustefan/bert-base-romanian-cased-v1/README.md
@@ -0,0 +1,48 @@
+---
+language: romanian
+---
+
+# bert-base-romanian-cased-v1
+
+The BERT **base**, **cased** model for Romanian, trained on a 15GB corpus, version ![v1.0](https://img.shields.io/badge/v1.0-21%20Apr%202020-ff6666)
+
+### How to use
+
+```python
+from transformers import AutoTokenizer, AutoModel
+import torch
+# load tokenizer and model
+tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
+model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
+# tokenize a sentence and run through the model
+input_ids = torch.tensor(tokenizer.encode("Acesta este un test.", add_special_tokens=True)).unsqueeze(0)  # Batch size 1
+outputs = model(input_ids)
+# get encoding
+last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
+```
+
+### Evaluation
+
+Evaluation is performed on Universal Dependencies [Romanian RRT](https://universaldependencies.org/treebanks/ro_rrt/index.html) UPOS, XPOS and LAS, and on a NER task based on [RONEC](https://github.com/dumitrescustefan/ronec). Details, as well as more in-depth tests not shown here, are given in the dedicated [evaluation page](https://github.com/dumitrescustefan/Romanian-Transformers/tree/master/evaluation/README.md). 
+
+The baseline is the [Multilingual BERT](https://github.com/google-research/bert/blob/master/multilingual.md) model ``bert-base-multilingual-(un)cased``, as at the time of writing it was the only available BERT model that works on Romanian.
+
+| Model                          |  UPOS |  XPOS  |  NER  |  LAS  |
+|--------------------------------|:-----:|:------:|:-----:|:-----:|
+| bert-base-multilingual-cased   | 97.87 |  96.16 | 84.13 | 88.04 |
+| bert-base-romanian-cased-v1    | **98.00** |  **96.46** | **85.88** | **89.69** |
+
+### Corpus 
+
+The model is trained on the following corpora (stats in the table below are after cleaning):
+
+| Corpus    	| Lines(M) 	| Words(M) 	| Chars(B) 	| Size(GB) 	|
+|-----------	|:--------:	|:--------:	|:--------:	|:--------:	|
+| OPUS      	|   55.05  	|  635.04  	|   4.045  	|    3.8   	|
+| OSCAR     	|   33.56  	|  1725.82 	|  11.411  	|    11    	|
+| Wikipedia 	|   1.54   	|   60.47  	|   0.411  	|    0.4   	|
+| **Total**     	|   **90.15**  	|  **2421.33** 	|  **15.867**  	|   **15.2**   	|
+
+#### Acknowledgements
+
+- We'd like to thank [Sampo Pyysalo](https://github.com/spyysalo) from TurkuNLP for helping us out with the compute needed to pretrain the v1.0 BERT models. He's awesome!
--- a/model_cards/dumitrescustefan/bert-base-romanian-uncased-v1/README.md
+++ b/model_cards/dumitrescustefan/bert-base-romanian-uncased-v1/README.md
@@ -0,0 +1,51 @@
+---
+language: romanian
+---
+
+# bert-base-romanian-uncased-v1
+
+The BERT **base**, **uncased** model for Romanian, trained on a 15GB corpus, version ![v1.0](https://img.shields.io/badge/v1.0-21%20Apr%202020-ff6666)
+
+### How to use
+
+```python
+from transformers import AutoTokenizer, AutoModel
+import torch
+
+# load tokenizer and model
+tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-uncased-v1", do_lower_case=True)
+model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-uncased-v1")
+
+# tokenize a sentence and run through the model
+input_ids = torch.tensor(tokenizer.encode("Acesta este un test.", add_special_tokens=True)).unsqueeze(0)  # Batch size 1
+outputs = model(input_ids)
+
+# get encoding
+last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
+```
+
+### Evaluation
+
+Evaluation is performed on Universal Dependencies [Romanian RRT](https://universaldependencies.org/treebanks/ro_rrt/index.html) UPOS, XPOS and LAS, and on a NER task based on [RONEC](https://github.com/dumitrescustefan/ronec). Details, as well as more in-depth tests not shown here, are given in the dedicated [evaluation page](https://github.com/dumitrescustefan/Romanian-Transformers/tree/master/evaluation/README.md). 
+
+The baseline is the [Multilingual BERT](https://github.com/google-research/bert/blob/master/multilingual.md) model ``bert-base-multilingual-(un)cased``, as at the time of writing it was the only available BERT model that works on Romanian.
+
+| Model                          |  UPOS |  XPOS  |  NER  |  LAS  |
+|--------------------------------|:-----:|:------:|:-----:|:-----:|
+| bert-base-multilingual-uncased | 97.65 |  95.72 | 83.91 | 87.65 |
+| bert-base-romanian-uncased-v1  | **98.18** |  **96.84** | **85.26** | **89.61** |
+
+### Corpus 
+
+The model is trained on the following corpora (stats in the table below are after cleaning):
+
+| Corpus    	| Lines(M) 	| Words(M) 	| Chars(B) 	| Size(GB) 	|
+|-----------	|:--------:	|:--------:	|:--------:	|:--------:	|
+| OPUS      	|   55.05  	|  635.04  	|   4.045  	|    3.8   	|
+| OSCAR     	|   33.56  	|  1725.82 	|  11.411  	|    11    	|
+| Wikipedia 	|   1.54   	|   60.47  	|   0.411  	|    0.4   	|
+| **Total**     	|   **90.15**  	|  **2421.33** 	|  **15.867**  	|   **15.2**   	|
+
+#### Acknowledgements
+
+- We'd like to thank [Sampo Pyysalo](https://github.com/spyysalo) from TurkuNLP for helping us out with the compute needed to pretrain the v1.0 BERT models. He's awesome!
--- a/model_cards/facebook/bart-large-cnn/README.md
+++ b/model_cards/facebook/bart-large-cnn/README.md
@@ -0,0 +1,6 @@
+---
+tags:
+- summarization
+
+license: mit
+---
--- a/model_cards/huseinzol05/t5-base-bahasa-cased/README.md
+++ b/model_cards/huseinzol05/t5-base-bahasa-cased/README.md
@@ -0,0 +1,74 @@
+---
+language: malay
+---
+
+# Bahasa T5 Model
+
+Pretrained T5 base language model for Malay and Indonesian. 
+
+## Pretraining Corpus
+
+`t5-base-bahasa-cased` model was pretrained on multiple tasks. Below is list of tasks we trained on,
+
+1. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [local Wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
+2. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
+3. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
+4. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
+5. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
+6. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
+7. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [local Wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
+8. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
+9. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
+10. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
+11. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
+12. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
+13. [Bahasa SNLI](https://github.com/huseinzol05/Malaya-Dataset#snli).
+14. [Bahasa Question Quora](https://github.com/huseinzol05/Malaya-Dataset#quora).
+15. [Bahasa Natural Questions](https://github.com/huseinzol05/Malaya-Dataset#natural-questions).
+16. [News title summarization](https://github.com/huseinzol05/Malaya-Dataset#crawled-news).
+17. [Stemming to original wikipedia](https://github.com/huseinzol05/Malaya/blob/master/pretrained-model/t5/generate-stemming.ipynb).
+18. [Synonym to original wikipedia](https://github.com/huseinzol05/Malaya/blob/master/pretrained-model/t5/generate-synonym.ipynb).
+
+Preprocessing steps can reproduce from here, [Malaya/pretrained-model/preprocess](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/preprocess).
+
+## Pretraining details
+
+- This model was trained using Google T5's github [repository](https://github.com/google-research/text-to-text-transfer-transformer) on v3-8 TPU.
+- All steps can reproduce from here, [Malaya/pretrained-model/t5](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/t5).
+
+## Load Pretrained Model
+
+You can use this model by installing `torch` or `tensorflow` and Huggingface library `transformers`. And you can use it directly by initializing it like this:  
+
+```python
+from transformers import T5Tokenizer, T5Model
+
+model = T5Model.from_pretrained('huseinzol05/t5-base-bahasa-cased')
+tokenizer = T5Tokenizer.from_pretrained('huseinzol05/t5-base-bahasa-cased')
+```
+
+## Example using T5ForConditionalGeneration
+
+```python
+from transformers import T5Tokenizer, T5ForConditionalGeneration
+
+tokenizer = T5Tokenizer.from_pretrained('huseinzol05/t5-base-bahasa-cased')
+model = T5ForConditionalGeneration.from_pretrained('huseinzol05/t5-base-bahasa-cased')
+input_ids = tokenizer.encode('soalan: siapakah perdana menteri malaysia?', return_tensors = 'pt')
+outputs = model.generate(input_ids)
+print(tokenizer.decode(outputs[0]))
+```
+
+Output is,
+
+```
+'Mahathir Mohamad'
+```
+
+## Results
+
+For further details on the model performance, simply checkout accuracy page from Malaya, https://malaya.readthedocs.io/en/latest/Accuracy.html, we compared with traditional models.
+
+## Acknowledgement
+
+Thanks to [Im Big](https://www.facebook.com/imbigofficial/), [LigBlou](https://www.facebook.com/ligblou), [Mesolitica](https://mesolitica.com/) and [KeyReply](https://www.keyreply.com/) for sponsoring AWS, Google and GPU clouds to train T5 for Bahasa. 
--- a/model_cards/huseinzol05/t5-small-bahasa-cased/README.md
+++ b/model_cards/huseinzol05/t5-small-bahasa-cased/README.md
@@ -0,0 +1,74 @@
+---
+language: malay
+---
+
+# Bahasa T5 Model
+
+Pretrained T5 small language model for Malay and Indonesian. 
+
+## Pretraining Corpus
+
+`t5-small-bahasa-cased` model was pretrained on multiple tasks. Below is list of tasks we trained on,
+
+1. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [local Wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
+2. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
+3. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
+4. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
+5. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
+6. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
+7. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [local Wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
+8. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
+9. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
+10. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
+11. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
+12. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
+13. [Bahasa SNLI](https://github.com/huseinzol05/Malaya-Dataset#snli).
+14. [Bahasa Question Quora](https://github.com/huseinzol05/Malaya-Dataset#quora).
+15. [Bahasa Natural Questions](https://github.com/huseinzol05/Malaya-Dataset#natural-questions).
+16. [News title summarization](https://github.com/huseinzol05/Malaya-Dataset#crawled-news).
+17. [Stemming to original wikipedia](https://github.com/huseinzol05/Malaya/blob/master/pretrained-model/t5/generate-stemming.ipynb).
+18. [Synonym to original wikipedia](https://github.com/huseinzol05/Malaya/blob/master/pretrained-model/t5/generate-synonym.ipynb).
+
+Preprocessing steps can reproduce from here, [Malaya/pretrained-model/preprocess](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/preprocess).
+
+## Pretraining details
+
+- This model was trained using Google T5's github [repository](https://github.com/google-research/text-to-text-transfer-transformer) on v3-8 TPU.
+- All steps can reproduce from here, [Malaya/pretrained-model/t5](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/t5).
+
+## Load Pretrained Model
+
+You can use this model by installing `torch` or `tensorflow` and Huggingface library `transformers`. And you can use it directly by initializing it like this:  
+
+```python
+from transformers import T5Tokenizer, T5Model
+
+model = T5Model.from_pretrained('huseinzol05/t5-small-bahasa-cased')
+tokenizer = T5Tokenizer.from_pretrained('huseinzol05/t5-small-bahasa-cased')
+```
+
+## Example using T5ForConditionalGeneration
+
+```python
+from transformers import T5Tokenizer, T5ForConditionalGeneration
+
+tokenizer = T5Tokenizer.from_pretrained('huseinzol05/t5-small-bahasa-cased')
+model = T5ForConditionalGeneration.from_pretrained('huseinzol05/t5-small-bahasa-cased')
+input_ids = tokenizer.encode('soalan: siapakah perdana menteri malaysia?', return_tensors = 'pt')
+outputs = model.generate(input_ids)
+print(tokenizer.decode(outputs[0]))
+```
+
+Output is,
+
+```
+'Mahathir Mohamad'
+```
+
+## Results
+
+For further details on the model performance, simply checkout accuracy page from Malaya, https://malaya.readthedocs.io/en/latest/Accuracy.html, we compared with traditional models.
+
+## Acknowledgement
+
+Thanks to [Im Big](https://www.facebook.com/imbigofficial/), [LigBlou](https://www.facebook.com/ligblou), [Mesolitica](https://mesolitica.com/) and [KeyReply](https://www.keyreply.com/) for sponsoring AWS, Google and GPU clouds to train T5 for Bahasa. 
--- a/model_cards/mrm8488/RuPERTa-base-finetuned-ner/README.md
+++ b/model_cards/mrm8488/RuPERTa-base-finetuned-ner/README.md
@@ -0,0 +1,92 @@
+---
+language: spanish
+thumbnail:
+---
+
+# RuPERTa-base  (Spanish RoBERTa) + NER 🎃🏷
+
+This model is a fine-tuned on [NER-C](https://www.kaggle.com/nltkdata/conll-corpora) version of [RuPERTa-base](https://huggingface.co/mrm8488/RuPERTa-base) for **NER** downstream task.
+
+## Details of the downstream task (NER) - Dataset
+
+- [Dataset:  CONLL Corpora ES](https://www.kaggle.com/nltkdata/conll-corpora) 📚
+
+| Dataset                | # Examples |
+| ---------------------- | ----- |
+| Train                  |  329 K |
+| Dev                    | 40 K |
+
+
+- [Fine-tune on NER script provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner.py)
+
+- Labels covered:
+
+```
+B-LOC
+B-MISC
+B-ORG
+B-PER
+I-LOC
+I-MISC
+I-ORG
+I-PER
+O
+```
+
+## Metrics on evaluation set 🧾
+
+|                                                      Metric                                                       |  # score  |
+| :------------------------------------------------------------------------------------: | :-------: |
+| F1                                       | **77.55**  
+| Precision                                | **75.53** | 
+| Recall                                   | **79.68** |    
+
+## Model in action 🔨
+
+
+Example of usage:
+
+```python
+import torch
+from transformers import AutoModelForTokenClassification, AutoTokenizer
+
+id2label = {
+    "0": "B-LOC",
+    "1": "B-MISC",
+    "2": "B-ORG",
+    "3": "B-PER",
+    "4": "I-LOC",
+    "5": "I-MISC",
+    "6": "I-ORG",
+    "7": "I-PER",
+    "8": "O"
+}
+
+text ="Julien, CEO de HF, nació en Francia."
+input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)
+
+outputs = model(input_ids)
+last_hidden_states = outputs[0]
+
+for m in last_hidden_states:
+  for index, n in enumerate(m):
+    if(index > 0 and index <= len(text.split(" "))):
+      print(text.split(" ")[index-1] + ": " + id2label[str(torch.argmax(n).item())])
+      
+'''
+Output:
+--------
+Julien,: I-PER
+CEO: O
+de: O
+HF,: B-ORG
+nació: I-PER
+en: I-PER
+Francia.: I-LOC
+'''
+```
+Yeah! Not too bad 🎉
+
+> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
+
+> Made with <span style="color: #e25555;">&hearts;</span> in Spain
--- a/model_cards/mrm8488/RuPERTa-base-finetuned-pos/README.md
+++ b/model_cards/mrm8488/RuPERTa-base-finetuned-pos/README.md
@@ -0,0 +1,111 @@
+---
+language: spanish
+thumbnail:
+---
+
+# RuPERTa-base  (Spanish RoBERTa) + POS 🎃🏷
+
+This model is a fine-tuned on [CONLL CORPORA](https://www.kaggle.com/nltkdata/conll-corpora) version of [RuPERTa-base](https://huggingface.co/mrm8488/RuPERTa-base) for **POS** downstream task.
+
+## Details of the downstream task (POS) - Dataset
+
+- [Dataset:  CONLL Corpora ES](https://www.kaggle.com/nltkdata/conll-corpora) 📚
+
+| Dataset                | # Examples |
+| ---------------------- | ----- |
+| Train                  | 445 K |
+| Dev                    | 55 K |
+
+- [Fine-tune on NER script provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner.py)
+
+- Labels covered:
+
+```
+ADJ
+ADP
+ADV
+AUX
+CCONJ
+DET
+INTJ
+NOUN
+NUM
+PART
+PRON
+PROPN
+PUNCT
+SCONJ
+SYM
+VERB
+```
+
+## Metrics on evaluation set 🧾
+
+|                                                      Metric                                                       |  # score  |
+| :------------------------------------------------------------------------------------: | :-------: |
+| F1                                       | **97.39**  
+| Precision                                | **97.47** | 
+| Recall                                   | **9732** |    
+
+## Model in action 🔨
+
+
+Example of usage
+
+```python
+import torch
+from transformers import AutoModelForTokenClassification, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained('mrm8488/RuPERTa-base-finetuned-pos')
+model = AutoModelForTokenClassification.from_pretrained('mrm8488/RuPERTa-base-finetuned-pos')
+
+id2label = {
+    "0": "O",
+    "1": "ADJ",
+    "2": "ADP",
+    "3": "ADV",
+    "4": "AUX",
+    "5": "CCONJ",
+    "6": "DET",
+    "7": "INTJ",
+    "8": "NOUN",
+    "9": "NUM",
+    "10": "PART",
+    "11": "PRON",
+    "12": "PROPN",
+    "13": "PUNCT",
+    "14": "SCONJ",
+    "15": "SYM",
+    "16": "VERB"
+}
+
+text ="Mis amigos están pensando viajar a Londres este verano."
+input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)
+
+outputs = model(input_ids)
+last_hidden_states = outputs[0]
+
+for m in last_hidden_states:
+  for index, n in enumerate(m):
+    if(index > 0 and index <= len(text.split(" "))):
+      print(text.split(" ")[index-1] + ": " + id2label[str(torch.argmax(n).item())])
+      
+'''
+Output:
+--------
+Mis: NUM
+amigos: PRON
+están: AUX
+pensando: ADV
+viajar: VERB
+a: ADP
+Londres: PROPN
+este: DET
+verano..: NOUN
+'''
+```
+Yeah! Not too bad 🎉
+
+> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
+
+> Made with <span style="color: #e25555;">&hearts;</span> in Spain
--- a/model_cards/mrm8488/bert-italian-finedtuned-squadv1-it-alfa/README.md
+++ b/model_cards/mrm8488/bert-italian-finedtuned-squadv1-it-alfa/README.md
@@ -0,0 +1,110 @@
+---
+language: italian
+thumbnail:
+---
+
+# Italian BERT fine-tuned on SQuAD_it v1
+
+[Italian BERT base cased](https://huggingface.co/dbmdz/bert-base-italian-cased) fine-tuned on [italian SQuAD](https://github.com/crux82/squad-it) for **Q&A** downstream task.
+
+## Details of Italian BERT
+
+The source data for the Italian BERT model consists of a recent Wikipedia dump and various texts from the OPUS corpora collection. The final training corpus has a size of 13GB and 2,050,057,573 tokens.
+
+For sentence splitting, we use NLTK (faster compared to spacy). Our cased and uncased models are training with an initial sequence length of 512 subwords for ~2-3M steps.
+
+For the XXL Italian models, we use the same training data from OPUS and extend it with data from the Italian part of the OSCAR corpus. Thus, the final training corpus has a size of 81GB and 13,138,379,147 tokens.
+More in its official [model card](https://huggingface.co/dbmdz/bert-base-italian-cased)
+
+Created by [Stefan](https://huggingface.co/stefan-it) at [MDZ](https://huggingface.co/dbmdz)
+
+## Details of the downstream task (Q&A) - Dataset 📚 🧐 ❓
+
+[Italian SQuAD v1.1](https://rajpurkar.github.io/SQuAD-explorer/) is derived from the SQuAD dataset and it is obtained through semi-automatic translation of the SQuAD dataset
+into Italian. It represents a large-scale dataset for open question answering processes on factoid questions in Italian.
+**The dataset contains more than 60,000 question/answer pairs derived from the original English dataset.** The dataset is split into training and test sets to support the replicability of the benchmarking of QA systems:
+
+- `SQuAD_it-train.json`: it contains training examples derived from the original SQuAD 1.1 trainig material.
+- `SQuAD_it-test.json`: it contains test/benchmarking examples derived from the origial SQuAD 1.1 development material.
+
+More details about SQuAD-it can be found in [Croce et al. 2018]. The original paper can be found at this [link](https://link.springer.com/chapter/10.1007/978-3-030-03840-3_29).
+
+## Model training 🏋️‍
+
+The model was trained on a Tesla P100 GPU and 25GB of RAM.
+The script for fine tuning can be found [here](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py)
+
+## Results 📝
+
+| Metric | # Value   |
+| ------ | --------- |
+| **EM** | **62.51** |
+| **F1** | **74.16** |
+
+### Raw metrics
+
+```json
+{
+  "exact": 62.5180707057432,
+  "f1": 74.16038329042492,
+  "total": 7609,
+  "HasAns_exact": 62.5180707057432,
+  "HasAns_f1": 74.16038329042492,
+  "HasAns_total": 7609,
+  "best_exact": 62.5180707057432,
+  "best_exact_thresh": 0.0,
+  "best_f1": 74.16038329042492,
+  "best_f1_thresh": 0.0
+}
+```
+
+## Comparison ⚖️
+
+| Model                                                                                                                            | EM        | F1 score  |
+| -------------------------------------------------------------------------------------------------------------------------------- | --------- | --------- |
+| [DrQA-it trained on SQuAD-it ](https://github.com/crux82/squad-it/blob/master/README.md#evaluating-a-neural-model-over-squad-it) | 56.1      | 65.9      |
+| This one                                                                                                                         | **62.51** | **74.16** |
+
+## Model in action 🚀
+
+Fast usage with **pipelines** 🧪
+
+```python
+from transformers import pipeline
+
+nlp_qa = pipeline(
+    'question-answering',
+    model='mrm8488/bert-italian-finedtuned-squadv1-it-alfa',
+    tokenizer='mrm8488/bert-italian-finedtuned-squadv1-it-alfa'
+)
+
+nlp_qa(
+    {
+        'question': 'Per quale lingua stai lavorando?',
+        'context': 'Manuel Romero è colaborando attivamente con HF / trasformatori per il trader del poder de las últimas ' +
+       'técnicas di procesamiento de lenguaje natural al idioma español'
+    }
+)
+
+# Output: {'answer': 'español', 'end': 174, 'score': 0.9925341537498156, 'start': 168}
+```
+
+> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
+
+> Made with <span style="color: #e25555;">&hearts;</span> in Spain
+
+Dataset citation
+
+<details>
+@InProceedings{10.1007/978-3-030-03840-3_29,
+	author="Croce, Danilo and Zelenanska, Alexandra and Basili, Roberto",
+	editor="Ghidini, Chiara and Magnini, Bernardo and Passerini, Andrea and Traverso, Paolo",
+	title="Neural Learning for Question Answering in Italian",
+	booktitle="AI*IA 2018 -- Advances in Artificial Intelligence",
+	year="2018",
+	publisher="Springer International Publishing",
+	address="Cham",
+	pages="389--402",
+	isbn="978-3-030-03840-3"
+}
+</detail>
--- a/model_cards/mrm8488/electricidad-small-discriminator/README.md
+++ b/model_cards/mrm8488/electricidad-small-discriminator/README.md
@@ -43,8 +43,8 @@ import torch
 discriminator = ElectraForPreTraining.from_pretrained("mrm8488/electricidad-small-discriminator")
 tokenizer = ElectraTokenizerFast.from_pretrained("mrm8488/electricidad-small-discriminator")

-sentence = "El rápido zorro marrón salta sobre el perro perezoso"
-fake_sentence = "El rápido zorro marrón falsea sobre el perro perezoso"
+sentence = "el zorro rojo es muy rápido"
+fake_sentence = "el zorro rojo es muy ser"

 fake_tokens = tokenizer.tokenize(sentence)
 fake_inputs = tokenizer.encode(sentence, return_tensors="pt")
@@ -53,9 +53,16 @@ predictions = torch.round((torch.sign(discriminator_outputs[0]) + 1) / 2)

 [print("%7s" % token, end="") for token in fake_tokens]

-[print("%7s" % prediction, end="") for prediction in predictions.tolist()]
+[print("%7s" % int(prediction), end="") for prediction in predictions.tolist()[1:-1]]
+
+# Output:
+'''
+el  zorro   rojo     es    muy    ser      0      0      0      0      0      1[None, None, None, None, None, None]
+'''
 ```

+As you can see there is a **1** in the place where the model detected the fake token (**ser**). So, it works! 🎉
+
 ## Acknowledgments

 I thank [🤗/transformers team](https://github.com/huggingface/transformers) for answering my doubts and Google for helping me with the [TensorFlow Research Cloud](https://www.tensorflow.org/tfrc) program.
--- a/model_cards/mrm8488/gpt2-imdb-neg/README.md
+++ b/model_cards/mrm8488/gpt2-imdb-neg/README.md
@@ -20,6 +20,9 @@ A few examples of the model response to a query before and after optimisation:
 |I have watched 3 episodes |with this guy and he is such a talented actor...|	but the show is just plain awful and there ne...|	2.681171|	-4.512792|
 |We know that firefighters and|	police officers are forced to become populari...|	other chains have going to get this disaster ...|	1.367811|	-3.34017|

+## Training logs and metrics <img src="https://gblobscdn.gitbook.com/spaces%2F-Lqya5RvLedGEWPhtkjU%2Favatar.png?alt=media" width="25" height="25">
+Watch the whole training logs and metrics on [W&B](https://app.wandb.ai/mrm8488/gpt2-sentiment-negative?workspace=user-mrm8488)
+


 > Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
--- a/model_cards/mrm8488/longformer-base-4096-finetuned-squadv2/README.md
+++ b/model_cards/mrm8488/longformer-base-4096-finetuned-squadv2/README.md
@@ -0,0 +1,65 @@
+---
+language: english
+thumbnail:
+---
+
+# Longformer-base-4096 fine-tuned on SQuAD v2
+
+[Longformer-base-4096 model](https://huggingface.co/allenai/longformer-base-4096) fine-tuned on [SQuAD v2](https://rajpurkar.github.io/SQuAD-explorer/) for **Q&A** downstream task.
+
+## Longformer-base-4096
+
+[Longformer](https://arxiv.org/abs/2004.05150) is a transformer model for long documents. 
+
+`longformer-base-4096` is a BERT-like model started from the RoBERTa checkpoint and pretrained for MLM on long documents. It supports sequences of length up to 4,096. 
+ 
+Longformer uses a combination of a sliding window (local) attention and global attention. Global attention is user-configured based on the task to allow the model to learn task-specific representations.
+
+## Details of the downstream task (Q&A) - Dataset 📚 🧐 ❓
+
+[SQuAD v2](https://rajpurkar.github.io/SQuAD-explorer/) combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.
+
+| Dataset  | Split | # samples |
+| -------- | ----- | --------- |
+| SQuAD2.0 | train | 130k      |
+| SQuAD2.0 | eval  | 12.3k     |
+
+
+
+## Model fine-tuning 🏋️‍
+
+The training script is a slightly modified version of [this one](https://colab.research.google.com/drive/1zEl5D-DdkBKva-DdreVOmN0hrAfzKG1o?usp=sharing)
+
+
+
+## Model in Action 🚀
+
+```python
+import torch
+from transformers import AutoTokenizer, AutoModelForQuestionAnswering
+
+tokenizer = AutoTokenizer.from_pretrained("mrm8488/longformer-base-4096-finetuned-squadv2")
+model = AutoModelForQuestionAnswering.from_pretrained("mrm8488/longformer-base-4096-finetuned-squadv2")
+
+text = "Huggingface has democratized NLP. Huge thanks to Huggingface for this."
+question = "What has Huggingface done ?"
+encoding = tokenizer.encode_plus(question, text, return_tensors="pt")
+input_ids = encoding["input_ids"]
+
+# default is local attention everywhere
+# the forward method will automatically set global attention on question tokens
+attention_mask = encoding["attention_mask"]
+
+start_scores, end_scores = model(input_ids, attention_mask=attention_mask)
+all_tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())
+
+answer_tokens = all_tokens[torch.argmax(start_scores) :torch.argmax(end_scores)+1]
+answer = tokenizer.decode(tokenizer.convert_tokens_to_ids(answer_tokens))
+
+# output => democratized NLP
+```
+If given the same context we ask something that is not there, the output for **no answer** will be ```<s>```
+
+> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
+
+> Made with <span style="color: #e25555;">&hearts;</span> in Spain
--- a/model_cards/mrm8488/t5-base-finetuned-squadv2/README.md
+++ b/model_cards/mrm8488/t5-base-finetuned-squadv2/README.md
@@ -0,0 +1,68 @@
+---
+language: english
+thumbnail:
+---
+
+# T5-base fine-tuned on SQuAD v2
+
+[Google's T5](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) fine-tuned on [SQuAD v2](https://rajpurkar.github.io/SQuAD-explorer/) for **Q&A** downstream task.
+
+## Details of T5
+
+The **T5** model was presented in [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683.pdf) by *Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu* in Here the abstract:
+
+Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.
+
+
+## Details of the downstream task (Q&A) - Dataset 📚 🧐 ❓
+
+[SQuAD v2](https://rajpurkar.github.io/SQuAD-explorer/) combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.
+
+| Dataset  | Split | # samples |
+| -------- | ----- | --------- |
+| SQuAD2.0 | train | 130k      |
+| SQuAD2.0 | eval  | 12.3k     |
+
+
+
+## Model fine-tuning 🏋️‍
+
+The training script is a slightly modified version of [this one](https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb)
+
+## Results 📝
+
+| Metric | # Value   |
+| ------ | --------- |
+| **EM** | **77.64** |
+| **F1** | **81.32** |
+
+
+
+## Model in Action 🚀
+
+```python
+from transformers import AutoModelWithLMHead, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("mrm8488/t5-base-finetuned-squadv2")
+model = AutoModelWithLMHead.from_pretrained("mrm8488/t5-base-finetuned-squadv2")
+
+def get_answer(question, context):
+  input_text = "question: %s  context: %s </s>" % (question, context)
+  features = tokenizer.batch_encode_plus([input_text], return_tensors='pt')
+
+  output = model.generate(input_ids=features['input_ids'], 
+               attention_mask=features['attention_mask'])
+  
+  return tokenizer.decode(output[0])
+
+context = "Manuel have created RuPERTa-base with the support of HF-Transformers and Google"
+question = "Who has supported Manuel?"
+
+get_answer(question, context)
+
+# output: 'HF-Transformers and Google'
+```
+
+> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
+
+> Made with <span style="color: #e25555;">&hearts;</span> in Spain
--- a/model_cards/oliverguhr/german-sentiment-bert/README.md
+++ b/model_cards/oliverguhr/german-sentiment-bert/README.md
@@ -0,0 +1,125 @@
+# German Sentiment Classification with Bert
+
+This model was trained for sentiment classification of German language texts. To achieve the best results all model inputs needs to be preprocessed with the same procedure, that was applied during the training. To simplify the usage of the model, 
+we provide a Python package that bundles the code need for the preprocessing and inferencing. 
+
+The model uses the Googles Bert architecture and was trained on 1.834 million German-language samples. The training data contains texts from various domains like Twitter, Facebook and movie, app and hotel reviews. 
+You can find more information about the dataset and the training process in the [paper](http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.201.pdf).
+
+## Using the Python package
+
+To get started install the package from [pypi](https://pypi.org/project/germansentiment/):
+
+```bash
+pip install germansentiment
+```
+
+```python
+from germansentiment import SentimentModel
+
+model = SentimentModel()
+
+texts = [
+    "Mit keinem guten Ergebniss","Das ist gar nicht mal so gut",
+    "Total awesome!","nicht so schlecht wie erwartet",
+    "Der Test verlief positiv.","Sie fährt ein grünes Auto."]
+       
+result = model.predict_sentiment(texts)
+print(result)
+```
+
+The code above will output following list:
+
+```python
+["negative","negative","positive","positive","neutral", "neutral"]
+```
+
+## A minimal working Sample
+
+
+```python
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+from typing import List
+import torch
+import re
+
+class SentimentModel():
+    def __init__(self, model_name: str):
+        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
+        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
+
+        self.clean_chars = re.compile(r'[^A-Za-züöäÖÜÄß ]', re.MULTILINE)
+        self.clean_http_urls = re.compile(r'https*\S+', re.MULTILINE)
+        self.clean_at_mentions = re.compile(r'@\S+', re.MULTILINE)
+
+    def predict_sentiment(self, texts: List[str])-> List[str]:
+        texts = [self.clean_text(text) for text in texts]
+        # Add special tokens takes care of adding [CLS], [SEP], <s>... tokens in the right way for each model.
+        input_ids = self.tokenizer.batch_encode_plus(texts,pad_to_max_length=True, add_special_tokens=True)
+        input_ids = torch.tensor(input_ids["input_ids"])
+
+        with torch.no_grad():
+            logits = self.model(input_ids)    
+
+        label_ids = torch.argmax(logits[0], axis=1)
+
+        labels = [self.model.config.id2label[label_id] for label_id in label_ids.tolist()]
+        return labels
+
+    def replace_numbers(self,text: str) -> str:
+            return text.replace("0"," null").replace("1"," eins").replace("2"," zwei").replace("3"," drei").replace("4"," vier").replace("5"," fünf").replace("6"," sechs").replace("7"," sieben").replace("8"," acht").replace("9"," neun")         
+
+    def clean_text(self,text: str)-> str:    
+            text = text.replace("\n", " ")        
+            text = self.clean_http_urls.sub('',text)
+            text = self.clean_at_mentions.sub('',text)        
+            text = self.replace_numbers(text)                
+            text = self.clean_chars.sub('', text) # use only text chars                          
+            text = ' '.join(text.split()) # substitute multiple whitespace with single whitespace   
+            text = text.strip().lower()
+            return text
+
+texts = ["Mit keinem guten Ergebniss","Das war unfair", "Das ist gar nicht mal so gut",
+        "Total awesome!","nicht so schlecht wie erwartet", "Das ist gar nicht mal so schlecht",
+        "Der Test verlief positiv.","Sie fährt ein grünes Auto.", "Der Fall wurde an die Polzei übergeben."]
+
+model = SentimentModel(model_name = "oliverguhr/german-sentiment-bert")
+
+print(model.predict_sentiment(texts))
+```
+
+## Model and Data
+
+If you are interested in code and data that was used to train this model please have a look at [this repository](https://github.com/oliverguhr/german-sentiment) and our [paper](http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.202.pdf). Here is a table of the F1 scores that his model achieves on following datasets. Since we trained this model on a newer version of the transformer library, the results are slightly better than reported in the paper.
+
+| Dataset                                                      | F1 micro Score |
+| :----------------------------------------------------------- | -------------: |
+| [holidaycheck](https://github.com/oliverguhr/german-sentiment) |         0.9568 |
+| [scare](https://www.romanklinger.de/scare/)                  |         0.9418 |
+| [filmstarts](https://github.com/oliverguhr/german-sentiment) |         0.9021 |
+| [germeval](https://sites.google.com/view/germeval2017-absa/home) |         0.7536 |
+| [PotTS](https://www.aclweb.org/anthology/L16-1181/)          |         0.6780 |
+| [emotions](https://github.com/oliverguhr/german-sentiment)  |         0.9649 |
+| [sb10k](https://www.spinningbytes.com/resources/germansentiment/) |         0.7376 |
+| [Leipzig Wikipedia Corpus 2016](https://wortschatz.uni-leipzig.de/de/download/german) |         0.9967 |
+| all                                                          |         0.9639 |
+
+## Cite
+
+For feedback and questions contact me view mail or Twitter [@oliverguhr](https://twitter.com/oliverguhr). Please cite us if you found this useful:
+
+```
+@InProceedings{guhr-EtAl:2020:LREC,
+  author    = {Guhr, Oliver  and  Schumann, Anne-Kathrin  and  Bahrmann, Frank  and  Böhme, Hans Joachim},
+  title     = {Training a Broad-Coverage German Sentiment Classification Model for Dialog Systems},
+  booktitle      = {Proceedings of The 12th Language Resources and Evaluation Conference},
+  month          = {May},
+  year           = {2020},
+  address        = {Marseille, France},
+  publisher      = {European Language Resources Association},
+  pages     = {1620--1625},
+  url       = {https://www.aclweb.org/anthology/2020.lrec-1.201}
+}
+```
+
+
--- a/model_cards/savasy/bert-base-turkish-sentiment-cased/README.md
+++ b/model_cards/savasy/bert-base-turkish-sentiment-cased/README.md
@@ -5,12 +5,12 @@ https://huggingface.co/savasy/bert-base-turkish-sentiment-cased
 This model is used for Sentiment Analysis, which is based on BERTurk for Turkish Language https://huggingface.co/dbmdz/bert-base-turkish-cased


-# Dataset
+## Dataset

-The dataset is taken from the studies [2] and [3] and merged.
+The dataset is taken from the studies [[2]](#paper-2) and [[3]](#paper-3), and merged.

 * The study [2] gathered movie and product reviews. The products are book, DVD, electronics, and kitchen.
-The movie dataset is taken from a cinema Web page (www.beyazperde.com) with
+The movie dataset is taken from a cinema Web page ([Beyazperde](www.beyazperde.com)) with
 5331 positive and 5331 negative sentences. Reviews in the Web page are marked in
 scale from 0 to 5 by the users who made the reviews. The study considered a review
 sentiment positive if the rating is equal to or bigger than 4, and negative if it is less
@@ -19,9 +19,9 @@ Web page. They constructed benchmark dataset consisting of reviews regarding som
 products (book, DVD, etc.). Likewise, reviews are marked in the range from 1 to 5,
 and majority class of reviews are 5. Each category has 700 positive and 700 negative
 reviews in which average rating of negative reviews is 2.27 and of positive reviews
-is 4.5. This dataset is also used the study [1]
+is 4.5. This dataset is also used by the study [[1]](#paper-1).

-* The study[3] collected tweet dataset. They proposed a new approach for automatically classifying the sentiment of microblog messages. The proposed approach is based on utilizing robust feature representation and fusion. 
+* The study [[3]](#paper-3) collected tweet dataset. They proposed a new approach for automatically classifying the sentiment of microblog messages. The proposed approach is based on utilizing robust feature representation and fusion. 

 *Merged Dataset* 

@@ -32,20 +32,21 @@ is 4.5. This dataset is also used the study [1]
 |  32000 |train.tsv|
 |  *48290* |*total*|

+### The dataset is used by following papers

-The dataset is used by following papers
- 
-* 1 Yildirim, Savaş. (2020). Comparing Deep Neural Networks to Traditional Models for Sentiment Analysis in Turkish Language. 10.1007/978-981-15-1216-2_12. 
-* 2 Demirtas, Erkin and Mykola Pechenizkiy. 2013. Cross-lingual polarity detection with machine translation. In Proceedings of the Second International Workshop on Issues of Sentiment
+<a id="paper-1">[1]</a> Yildirim, Savaş. (2020). Comparing Deep Neural Networks to Traditional Models for Sentiment Analysis in Turkish Language. 10.1007/978-981-15-1216-2_12. 
+
+<a id="paper-2">[2]</a> Demirtas, Erkin and Mykola Pechenizkiy. 2013. Cross-lingual polarity detection with machine translation. In Proceedings of the Second International Workshop on Issues of Sentiment
 Discovery and Opinion Mining (WISDOM ’13)
-* [3] Hayran, A.,   Sert, M. (2017), "Sentiment Analysis on Microblog Data based on Word Embedding and Fusion Techniques", IEEE 25th Signal Processing and Communications Applications Conference (SIU 2017), Belek, Turkey

-# Training
+<a id="paper-3">[3]</a> Hayran, A.,   Sert, M. (2017), "Sentiment Analysis on Microblog Data based on Word Embedding and Fusion Techniques", IEEE 25th Signal Processing and Communications Applications Conference (SIU 2017), Belek, Turkey

-```
+
+## Training
+
+```shell
 export GLUE_DIR="./sst-2-newall"
 export TASK_NAME=SST-2
- 

 python3 run_glue.py \
  --model_type bert \
@@ -59,88 +60,79 @@ python3 run_glue.py \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir "./model"
-
 ```


+## Results
+
+> 05/10/2020 17:00:43 - INFO - transformers.trainer -   \*\*\*\*\* Running Evaluation \*\*\*\*\*  
+> 05/10/2020 17:00:43 - INFO - transformers.trainer -     Num examples = 7999  
+> 05/10/2020 17:00:43 - INFO - transformers.trainer -     Batch size = 8  
+> Evaluation: 100% 1000/1000 [00:34<00:00, 29.04it/s]  
+> 05/10/2020 17:01:17 - INFO - \_\_main__ -   \*\*\*\*\* Eval results sst-2 \*\*\*\*\*  
+> 05/10/2020 17:01:17 - INFO - \_\_main__ -     acc = 0.9539942492811602  
+> 05/10/2020 17:01:17 - INFO - \_\_main__ -     loss = 0.16348013816401363
+
+Accuracy is about **95.4%**


-# Results
+## Code Usage

-> 05/10/2020 17:00:43 - INFO - transformers.trainer -   ***** Running Evaluation *****
-
-> 05/10/2020 17:00:43 - INFO - transformers.trainer -     Num examples = 7999
-
-> 05/10/2020 17:00:43 - INFO - transformers.trainer -     Batch size = 8
-
->Evaluation: 100% 1000/1000 [00:34<00:00, 29.04it/s]
-
->05/10/2020 17:01:17 - INFO - __main__ -   ***** Eval results sst-2 *****
-
->05/10/2020 17:01:17 - INFO - __main__ -     acc = 0.9539942492811602
-
->05/10/2020 17:01:17 - INFO - __main__ -     loss = 0.16348013816401363
-
-
-Accuracy is about *%95.4*
-# Code Usage
-
-```
+```python
 from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
+
 model = AutoModelForSequenceClassification.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
 tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
 sa= pipeline("sentiment-analysis", tokenizer=tokenizer, model=model)

-p= sa("bu telefon modelleri çok kaliteli , her parçası çok özel bence")
+p = sa("bu telefon modelleri çok kaliteli , her parçası çok özel bence")
 print(p)
-#[{'label': 'LABEL_1', 'score': 0.9871089}]
-print (p[0]['label']=='LABEL_1')
-#True
+# [{'label': 'LABEL_1', 'score': 0.9871089}]
+print(p[0]['label'] == 'LABEL_1')
+# True

-
-p= sa("Film çok kötü ve çok sahteydi")
+p = sa("Film çok kötü ve çok sahteydi")
 print(p)
-#[{'label': 'LABEL_0', 'score': 0.9975505}]
-print (p[0]['label']=='LABEL_1')
-#False
+# [{'label': 'LABEL_0', 'score': 0.9975505}]
+print(p[0]['label'] == 'LABEL_1')
+# False
 ```

-# Test your data
+
+## Test
+### Data

 Suppose your file has lots of lines of comment and label (1 or 0) at the end  (tab seperated)

-> comment1 ... \t label
-
-> comment2 ... \t label
- 
+> comment1 ... \t label  
+> comment2 ... \t label  
 > ...

+### Code

-
-```
+```python
 from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

-f="/path/to/your/file/yourfile.tsv"
 model = AutoModelForSequenceClassification.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
 tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
-sa= pipeline("sentiment-analysis", tokenizer=tokenizer, model=model)
+sa = pipeline("sentiment-analysis", tokenizer=tokenizer, model=model)

-i,crr=0,0
-for line in open(f):
- lines=line.strip().split("\t")
- if len(lines)==2:
-  i=i+1
-  if i%100==0:
-   print(i)
-  pred= sa(lines[0])
-  pred=pred[0]["label"].split("_")[1]
-  if pred== lines[1]:
-   crr=crr+1
+input_file = "/path/to/your/file/yourfile.tsv"
+
+i, crr = 0, 0
+for line in open(input_file):
+    lines = line.strip().split("\t")
+    if len(lines) == 2:
+        
+        i = i + 1
+        if i%100 == 0:
+            print(i)
+        
+        pred = sa(lines[0])
+        pred = pred[0]["label"].split("_")[1]
+        
+        if pred == lines[1]:
+        crr = crr + 1

 print(crr, i, crr/i)
 ```
-
-
-
-
-
--- a/model_cards/savasy/bert-base-turkish-squad/README.md
+++ b/model_cards/savasy/bert-base-turkish-squad/README.md
@@ -0,0 +1,67 @@
+---
+language: turkish
+---
+# Turkish SQuAD  Model : Question Answering
+
+I fine-tuned Turkish-Bert-Model for Question-Answering problem with Turkish version of SQuAD; TQuAD 
+* BERT-base: https://huggingface.co/dbmdz/bert-base-turkish-uncased
+* TQuAD dataset:  https://github.com/TQuad/turkish-nlp-qa-dataset
+
+
+# Training Code
+
+```
+!python3 run_squad.py \
+  --model_type bert \
+  --model_name_or_path dbmdz/bert-base-turkish-uncased\
+  --do_train \
+  --do_eval \
+  --train_file trainQ.json \
+  --predict_file dev1.json \
+  --per_gpu_train_batch_size 12 \
+  --learning_rate 3e-5 \
+  --num_train_epochs 5.0 \
+  --max_seq_length 384 \
+  --doc_stride 128 \
+  --output_dir "./model"
+```
+
+
+# Example Usage
+
+> Load Model
+```
+from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline
+import torch
+
+tokenizer = AutoTokenizer.from_pretrained("./model")
+model = AutoModelForQuestionAnswering.from_pretrained("./model")
+nlp=pipeline("question-answering", model=model, tokenizer=tokenizer)
+```
+
+> Apply the model
+```
+
+sait="ABASIYANIK, Sait Faik. Hikayeci (Adapazarı 23 Kasım 1906-İstanbul 11 Mayıs 1954). \
+İlk öğrenimine Adapazarı’nda Rehber-i Terakki Mektebi’nde başladı. İki yıl kadar Adapazarı İdadisi’nde okudu.\
+İstanbul Erkek Lisesi’nde devam ettiği orta öğrenimini Bursa Lisesi’nde tamamladı (1928). İstanbul Edebiyat \
+Fakültesi’ne iki yıl devam ettikten sonra babasının isteği üzerine iktisat öğrenimi için İsviçre’ye gitti. \
+Kısa süre sonra iktisat öğrenimini bırakarak Lozan’dan Grenoble’a geçti. Üç yıl başıboş bir edebiyat öğrenimi \
+gördükten sonra babası tarafından geri çağrıldı (1933). Bir müddet Halıcıoğlu Ermeni Yetim Mektebi'nde Türkçe \
+gurup dersleri öğretmenliği yaptı. Ticarete atıldıysa da tutunamadı. Bir ay Haber gazetesinde adliye muhabirliği\
+yaptı (1942). Babasının ölümü üzerine aileden kalan emlakin geliri ile avare bir hayata başladı. Evlenemedi.\
+Yazları Burgaz adasındaki köşklerinde, kışları Şişli’deki apartmanlarında annesi ile beraber geçen bu fazla \
+içkili bohem hayatı ömrünün sonuna kadar sürdü."
+
+print(nlp(question="Ne zaman avare bir hayata başladı?", context=sait))
+print(nlp(question="Sait Faik hangi Lisede orta öğrenimini tamamladı?", context=sait))
+
+```
+```
+# Ask your self ! type your question
+print(nlp(question="...?", context=sait))
+```
+
+
+Check My other Model
+https://huggingface.co/savasy
--- a/Show More
+++ b/Show More