Revert #4446 Since it introduces a new dependency

Release: v2.10.0
added functionality for electra classification head (#4257 )
2020-05-22 10:49:45 -04:00 · 2020-05-22 10:37:44 -04:00 · 2020-05-22 09:48:21 -04:00 · 2020-05-21 09:42:47 -04:00 · 2020-05-21 09:18:27 -04:00 · 2020-05-21 09:17:44 -04:00
118 changed files with 4560 additions and 939 deletions
--- a/.github/workflows/github-torch-hub.yml
+++ b/.github/workflows/github-torch-hub.yml
@@ -21,7 +21,7 @@ jobs:
    - name: Install dependencies
      run: |
        pip install torch
-        pip install numpy tokenizers filelock requests tqdm regex sentencepiece sacremoses
+        pip install numpy tokenizers filelock requests tqdm regex sentencepiece sacremoses packaging

    - name: Torch hub list
      run: |
--- a/.github/workflows/self-push.yml
+++ b/.github/workflows/self-push.yml
@@ -35,7 +35,7 @@ jobs:
    - name: Install dependencies
      run: |
        source .env/bin/activate
-        pip install torch==1.4.0
+        pip install torch
        pip install .[sklearn,testing]

    - name: Are GPUs recognized by our DL frameworks
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -198,11 +198,12 @@ Follow these steps to start contributing:
   are useful to avoid duplicated work, and to differentiate it from PRs ready
   to be merged;
 4. Make sure existing tests pass;
-5. Add high-coverage tests. No quality test, no merge. 
+5. Add high-coverage tests. No quality testing = no merge. 
 - If you are adding a new model, make sure that you use `ModelTester.all_model_classes = (MyModel, MyModelWithLMHead,...)`, which triggers the common tests.
 - If you are adding new `@slow` tests, make sure they pass using `RUN_SLOW=1 python -m pytest tests/test_my_new_model.py`. 
+ - If you are adding a new tokenizer, write tests, and make sure `RUN_SLOW=1 python -m pytest tests/test_tokenization_{your_model_name}.py` passes.
 CircleCI does not run them. 
-6. All public methods must have informative docstrings;
+6. All public methods must have informative docstrings that work nicely with sphinx. See `modeling_ctrl.py` for an example.

 ### Tests

--- a/README.md
+++ b/README.md
@@ -165,8 +165,9 @@ At some point in the future, you'll be able to seamlessly move from pre-training
 18. **[DialoGPT](https://huggingface.co/transformers/model_doc/dialogpt.html)** (from Microsoft Research) released with the paper [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://arxiv.org/abs/1911.00536) by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
 19. **[Reformer](https://huggingface.co/transformers/model_doc/reformer.html)** (from Google Research) released with the paper [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
 20. **[MarianMT](https://huggingface.co/transformers/model_doc/marian.html)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team.
-21. **[Other community models](https://huggingface.co/models)**, contributed by the [community](https://huggingface.co/users).
-22. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedbacks before starting your PR.
+21. **[Longformer](https://huggingface.co/transformers/model_doc/longformer.html)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan.
+22. **[Other community models](https://huggingface.co/models)**, contributed by the [community](https://huggingface.co/users).
+23. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedbacks before starting your PR.

 These implementations have been tested on several datasets (see the example scripts) and should match the performances of the original implementations (e.g. ~93 F1 on SQuAD for BERT Whole-Word-Masking, ~88 F1 on RocStories for OpenAI GPT, ~18.3 perplexity on WikiText 103 for Transformer-XL, ~0.916 Peason R coefficient on STS-B for XLNet). You can find more details on the performances in the Examples section of the [documentation](https://huggingface.co/transformers/examples.html).

--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -26,7 +26,7 @@ author = u'huggingface'
 # The short X.Y version
 version = u''
 # The full version, including alpha/beta/rc tags
-release = u'2.9.1'
+release = u'2.10.0'


 # -- General configuration ---------------------------------------------------
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -109,3 +109,4 @@ The library currently contains PyTorch and Tensorflow implementations, pre-train
    model_doc/dialogpt
    model_doc/reformer
    model_doc/marian
+    model_doc/longformer
--- a/docs/source/model_doc/albert.rst
+++ b/docs/source/model_doc/albert.rst
@@ -6,7 +6,7 @@ Overview

 The ALBERT model was proposed in `ALBERT: A Lite BERT for Self-supervised Learning of Language Representations <https://arxiv.org/abs/1909.11942>`_
 by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut. It presents
-two parameter-reduction techniques to lower memory consumption and increase the trainig speed of BERT:
+two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT:

 - Splitting the embedding matrix into two smaller matrices
 - Using repeating layers split among groups
--- a/docs/source/model_doc/longformer.rst
+++ b/docs/source/model_doc/longformer.rst
@@ -0,0 +1,69 @@
+Longformer
+----------------------------------------------------
+**DISCLAIMER:** This model is still a work in progress. If you see something strange,
+file a `GitHub Issue <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`_.
+
+Overview
+~~~~~~~~
+The Longformer model was presented in `Longformer: The Long-Document Transformer <https://arxiv.org/pdf/2004.05150.pdf>`_ by Iz Beltagy, Matthew E. Peters, Arman Cohan.
+The abstract from the paper is the following:
+
+*Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. Longformer's attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA.*
+
+The authors' code can be found `here <https://github.com/allenai/longformer>`_ .
+
+Longformer Self Attention
+~~~~~~~~~~~~~~~~~~~~~~~~~
+Longformer self attention employs self attention on both a "local" context and a "global" context.
+Most tokens only attend "locally" to each other, meaning that each token attends to its :math:`\frac{1}{2} w` previous tokens and :math:`\frac{1}{2} w` succeeding tokens, with :math:`w` being the window length as defined in `config.attention_window`. Note that `config.attention_window` can be of type ``list`` to define a different :math:`w` for each layer. 
+A selected few tokens attend "globally" to all other tokens, as is conventionally done for all tokens in *e.g.* `BertSelfAttention`.
+
+Note that "locally" and "globally" attending tokens are projected by different query, key and value matrices.
+Also note that every "locally" attending token not only attends to tokens within its window :math:`w`, but also to all "globally" attending tokens so that global attention is *symmetric*.
+
+The user can define which tokens are masked, which tokens attend "locally" and which tokens attend "globally" by setting the `attention_mask` `torch.Tensor` passed to the model appropriately. In contrast to other models, `Longformer` accepts the following values in `attention_mask`: `0` - the token is masked and not attended at all (as is done in other models), `1` - the token attends "locally", `2` - the token attends "globally". For more information, please also refer to the :func:`~transformers.LongformerModel.forward` method.
+
+Using Longformer self attention, the memory and time complexity of the query-key matmul operation, which usually represents the memory and time bottleneck, can be reduced from :math:`\mathcal{O}(n_s \times n_s)` to :math:`\mathcal{O}(n_s \times w)`, with :math:`n_s` being the sequence length and :math:`w` being the average window size. It is assumed that the number of "globally" attending tokens is insignificant as compared to the number of "locally" attending tokens.
+
+For more information, please refer to the official `paper <https://arxiv.org/pdf/2004.05150.pdf>`_ .
+
+
+Training
+~~~~~~~~~~~~~~~~~~~~
+``LongformerForMaskedLM`` is trained in exactly the same way as ``RobertaForMaskedLM`` and 
+should be used as follows:
+
+::
+
+  input_ids = tokenizer.encode('This is a sentence from [MASK] training data', return_tensors='pt')
+  mlm_labels = tokenizer.encode('This is a sentence from the training data', return_tensors='pt')
+
+  loss = model(input_ids, labels=input_ids, masked_lm_labels=mlm_labels)[0]
+
+
+LongformerConfig
+~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.LongformerConfig
+    :members:
+
+
+LongformerTokenizer
+~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.LongformerTokenizer
+    :members: 
+
+
+LongformerModel
+~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.LongformerModel
+    :members:
+
+
+LongformerForMaskedLM
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.LongformerForMaskedLM
+    :members:
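
Note on the Longformer documentation above: the following is a minimal plain-Python sketch of the two ideas it describes, namely the `0` / `1` / `2` attention-mask values and the reduction of the query-key score matrix from `n_s x n_s` to `n_s x w` entries. The helper names `attention_score_entries` and `build_attention_mask` are hypothetical, the window size of 512 is an assumed value, and the real model expects a batched `torch.Tensor` rather than a Python list.

```python
def attention_score_entries(seq_len, window=None):
    """Count query-key score entries per head.

    Full self-attention needs seq_len * seq_len entries; windowed
    ("local") attention needs roughly seq_len * window entries.
    """
    if window is None:
        return seq_len * seq_len
    return seq_len * window


def build_attention_mask(seq_len, num_padding=0, global_positions=()):
    """Build a mask using the value convention from the docs above.

    0 = masked (e.g. padding), 1 = attends locally, 2 = attends globally.
    Hypothetical helper for illustration only.
    """
    mask = [1] * seq_len
    for pos in global_positions:
        mask[pos] = 2
    # Padding occupies the last num_padding positions and overrides other values.
    for pos in range(seq_len - num_padding, seq_len):
        mask[pos] = 0
    return mask


# Assumed sizes: 4,096 tokens (as in the pretrained checkpoints) and a window of 512.
full = attention_score_entries(4096)
local = attention_score_entries(4096, window=512)
print(full, local, full // local)

# First token attends globally, last two positions are padding.
mask = build_attention_mask(8, num_padding=2, global_positions=[0])
print(mask)
```

Under these assumed sizes the windowed variant stores 8 times fewer score entries, which is the ratio `n_s / w`.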
--- a/docs/source/pretrained_models.rst
+++ b/docs/source/pretrained_models.rst
@@ -305,3 +305,9 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
 | MarianMT          | ``Helsinki-NLP/opus-mt-{src}-{tgt}``                       | | 12-layer, 512-hidden, 8-heads, ~74M parameter Machine translation models. Parameter counts vary depending on vocab size.            |
 |                   |                                                            | | (see `model list <https://huggingface.co/Helsinki-NLP>`_)                                                                           |
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+| Longformer        | ``longformer-base-4096``                                   | | 12-layer, 768-hidden, 12-heads, ~149M parameters                                                                                    |
+|                   |                                                            | | Starting from RoBERTa-base checkpoint, trained on documents of max length 4,096                                                     |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``longformer-large-4096``                                  | | 24-layer, 1024-hidden, 16-heads, ~435M parameters                                                                                   |
+|                   |                                                            | | Starting from RoBERTa-large checkpoint, trained on documents of max length 4,096                                                    |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
--- a/examples/README.md
+++ b/examples/README.md
@@ -1,6 +1,6 @@
-# Examples
+## Examples

-Version 2.9 of `transformers` introduces a new `Trainer` class for PyTorch, and its equivalent `TFTrainer` for TF 2.
+Version 2.9 of `transformers` introduces a new [`Trainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) class for PyTorch, and its equivalent [`TFTrainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer_tf.py) for TF 2.

 Here is the list of all our examples:
 - **grouped by task** (all official examples work for multiple models)
@@ -12,32 +12,24 @@ Here is the list of all our examples:
 This is still a work-in-progress – in particular documentation is still sparse – so please **contribute improvements/pull requests.**


-## Tasks built on Trainer
+# The Big Table of Tasks

-| Task | Example datasets | Trainer support | TFTrainer support | pytorch-lightning | Colab | One-click Deploy to Azure (wip) |
-|---|---|:---:|:---:|:---:|:---:|:---:|
-| [`language-modeling`](./language-modeling) | Raw text | ✅ | - | - | - | - |
-| [`text-classification`](./text-classification) | GLUE, XNLI | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/trainer/01_text_classification.ipynb) | [![Deploy to Azure](https://aka.ms/deploytoazurebutton)](https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2FAzure%2Fazure-quickstart-templates%2Fmaster%2F101-storage-account-create%2Fazuredeploy.json) |
-| [`token-classification`](./token-classification) | CoNLL NER | ✅ | ✅ | ✅ | - | - |
-| [`multiple-choice`](./multiple-choice) | SWAG, RACE, ARC | ✅ | ✅ | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ViktorAlm/notebooks/blob/master/MPC_GPU_Demo_for_TF_and_PT.ipynb) | - |
-| [`question-answering`](./question-answering) | SQuAD | - | ✅ | - | - | - |
+| Task | Example datasets | Trainer support | TFTrainer support | pytorch-lightning | Colab
+|---|---|:---:|:---:|:---:|:---:|
+| [**`language-modeling`**](./language-modeling)       | Raw text        | ✅ | -  | -  | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)
+| [**`text-classification`**](./text-classification)   | GLUE, XNLI      | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/trainer/01_text_classification.ipynb)
+| [**`token-classification`**](./token-classification) | CoNLL NER       | ✅ | ✅ | ✅ | -
+| [**`multiple-choice`**](./multiple-choice)           | SWAG, RACE, ARC | ✅ | ✅ | -  | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ViktorAlm/notebooks/blob/master/MPC_GPU_Demo_for_TF_and_PT.ipynb)
+| [**`question-answering`**](./question-answering)     | SQuAD           | -  | ✅ | -  | -
+| [**`text-generation`**](./text-generation)     | -           | -  | - | -  | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)
+| [**`distillation`**](./distillation)       | All               | -  | -  | -  | -
+| [**`summarization`**](./summarization)     | CNN/Daily Mail    | -  | -  | -  | -
+| [**`translation`**](./translation)         | WMT               | -  | -  | -  | -
+| [**`bertology`**](./bertology)             | -                 | -  | -  | -  | -
+| [**`adversarial`**](./adversarial)         | HANS              | -  | -  | -  | -


-
-## Other examples and how-to's
-
-| Section | Description |
-|---|---|
-| [TensorFlow 2.0 models on GLUE](./text-classification) | Examples running BERT TensorFlow 2.0 model on the GLUE tasks. |
-| [Running on TPUs](#running-on-tpus) | Examples on running fine-tuning tasks on Google TPUs to accelerate workloads. |
-| [Language Model training](./language-modeling) | Fine-tuning (or training from scratch) the library models for language modeling on a text dataset. Causal language modeling for GPT/GPT-2, masked language modeling for BERT/RoBERTa. |
-| [Language Generation](./text-generation) | Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL and XLNet. |
-| [GLUE](./text-classification) | Examples running BERT/XLM/XLNet/RoBERTa on the 9 GLUE tasks. Examples feature distributed training as well as half-precision. |
-| [SQuAD](./question-answering) | Using BERT/RoBERTa/XLNet/XLM for question answering, examples with distributed training. |
-| [Multiple Choice](./multiple-choice) | Examples running BERT/XLNet/RoBERTa on the SWAG/RACE/ARC tasks. |
-| [Named Entity Recognition](./token-classification) | Using BERT for Named Entity Recognition (NER) on the CoNLL 2003 dataset, examples with distributed training. |
-| [XNLI](./text-classification) | Examples running BERT/XLM on the XNLI benchmark. |
-| [Adversarial evaluation of model performances](./adversarial) | Testing a model with adversarial evaluation of natural language inference on the Heuristic Analysis for NLI Systems (HANS) dataset (McCoy et al., 2019.) |
+<br>

 ## Important note

@@ -52,6 +44,12 @@ pip install .
 pip install -r ./examples/requirements.txt
 ```

+## One-click Deploy to Cloud (wip)
+
+#### Azure
+
+[![Deploy to Azure](https://aka.ms/deploytoazurebutton)](https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2FAzure%2Fazure-quickstart-templates%2Fmaster%2F101-storage-account-create%2Fazuredeploy.json)
+
 ## Running on TPUs

 When using Tensorflow, TPUs are supported out of the box as a `tf.distribute.Strategy`.
--- a/examples/benchmarks.py
+++ b/examples/benchmarks.py
@@ -478,7 +478,7 @@ def _compute_pytorch(
                            dictionary[model_name]["memory"][batch_size][slice_size] = "N/A"

                        if not no_speed:
-                            print_fn("Going through model with sequence of shape".format(sequence.shape))
+                            print_fn("Going through model with sequence of shape {}".format(sequence.shape))
                            runtimes = timeit.repeat(lambda: inference(sequence), repeat=average_over, number=3)
                            average_time = sum(runtimes) / float(len(runtimes)) / 3.0
                            dictionary[model_name]["time"][batch_size][slice_size] = average_time
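
Note on the `benchmarks.py` fix above: `str.format` silently ignores positional arguments that have no matching `{}` placeholder, so the old call logged the message without the shape and raised no error. A small sketch, using a plain tuple as a stand-in for `sequence.shape` (an assumption; the real value is a `torch.Size`):

```python
# Stand-in for sequence.shape; a plain tuple is assumed here for illustration.
shape = (1, 128)

# Old call: no placeholder, so the argument is dropped without any error.
broken = "Going through model with sequence of shape".format(shape)

# New call: the {} placeholder receives the shape.
fixed = "Going through model with sequence of shape {}".format(shape)

print(broken)
print(fixed)
```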
--- a/examples/bertology/run_bertology.py
+++ b/examples/bertology/run_bertology.py
@@ -64,7 +64,7 @@ def print_2d_tensor(tensor):


 def compute_heads_importance(
-    args, model, eval_dataloader, compute_entropy=True, compute_importance=True, head_mask=None
+    args, model, eval_dataloader, compute_entropy=True, compute_importance=True, head_mask=None, actually_pruned=False
 ):
    """ This method shows how to compute:
        - head attention entropy
@@ -77,7 +77,12 @@ def compute_heads_importance(

    if head_mask is None:
        head_mask = torch.ones(n_layers, n_heads).to(args.device)
+
    head_mask.requires_grad_(requires_grad=True)
+    # If the attention heads were actually pruned, set head_mask to None to avoid a shape mismatch
+    if actually_pruned:
+        head_mask = None
+
    preds = None
    labels = None
    tot_tokens = 0.0
@@ -172,6 +177,7 @@ def mask_heads(args, model, eval_dataloader):
        new_head_mask = new_head_mask.view(-1)
        new_head_mask[current_heads_to_mask] = 0.0
        new_head_mask = new_head_mask.view_as(head_mask)
+        new_head_mask = new_head_mask.clone().detach()
        print_2d_tensor(new_head_mask)

        # Compute metric and head importance again
@@ -181,7 +187,7 @@ def mask_heads(args, model, eval_dataloader):
        preds = np.argmax(preds, axis=1) if args.output_mode == "classification" else np.squeeze(preds)
        current_score = glue_compute_metrics(args.task_name, preds, labels)[args.metric_name]
        logger.info(
-            "Masking: current score: %f, remaning heads %d (%.1f percents)",
+            "Masking: current score: %f, remaining heads %d (%.1f percents)",
            current_score,
            new_head_mask.sum(),
            new_head_mask.sum() / new_head_mask.numel() * 100,
@@ -209,14 +215,23 @@ def prune_heads(args, model, eval_dataloader, head_mask):
    original_time = datetime.now() - before_time

    original_num_params = sum(p.numel() for p in model.parameters())
-    heads_to_prune = dict((layer, (1 - head_mask[layer].long()).nonzero().tolist()) for layer in range(len(head_mask)))
+    heads_to_prune = dict(
+        (layer, (1 - head_mask[layer].long()).nonzero().squeeze().tolist()) for layer in range(len(head_mask))
+    )
+
    assert sum(len(h) for h in heads_to_prune.values()) == (1 - head_mask.long()).sum().item()
    model.prune_heads(heads_to_prune)
    pruned_num_params = sum(p.numel() for p in model.parameters())

    before_time = datetime.now()
    _, _, preds, labels = compute_heads_importance(
-        args, model, eval_dataloader, compute_entropy=False, compute_importance=False, head_mask=None
+        args,
+        model,
+        eval_dataloader,
+        compute_entropy=False,
+        compute_importance=False,
+        head_mask=None,
+        actually_pruned=True,
    )
    preds = np.argmax(preds, axis=1) if args.output_mode == "classification" else np.squeeze(preds)
    score_pruning = glue_compute_metrics(args.task_name, preds, labels)[args.metric_name]
@@ -404,7 +419,7 @@ def main():
    logger.info("Training/evaluation parameters %s", args)

    # Prepare dataset for the GLUE task
-    eval_dataset = GlueDataset(args, tokenizer=tokenizer, evaluate=True)
+    eval_dataset = GlueDataset(args, tokenizer=tokenizer, mode="dev")
    if args.data_subset > 0:
        eval_dataset = Subset(eval_dataset, list(range(min(args.data_subset, len(eval_dataset)))))
    eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset)
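
Note on the `prune_heads` change in `run_bertology.py` above: the `heads_to_prune` dictionary maps each layer index to the list of head indices whose mask entry is zero, which is the format `model.prune_heads` expects. The sketch below is a plain-Python analogue that uses nested lists instead of tensors; the helper name `heads_to_prune_from_mask` and the example mask are hypothetical.

```python
def heads_to_prune_from_mask(head_mask):
    """Map each layer index to the indices of its masked (zero) heads.

    head_mask is a nested list with one row per layer and one entry per
    head: 1 keeps the head, 0 masks it. Hypothetical helper, not the
    tensor-based code from the script.
    """
    return {
        layer: [head for head, keep in enumerate(row) if keep == 0]
        for layer, row in enumerate(head_mask)
    }


# Example mask: 3 layers with 4 heads each.
head_mask = [
    [1, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 1, 1],
]

heads_to_prune = heads_to_prune_from_mask(head_mask)
print(heads_to_prune)

# Same consistency check as in the script: the number of listed heads
# equals the number of zeros in the mask.
num_masked = sum(row.count(0) for row in head_mask)
assert sum(len(heads) for heads in heads_to_prune.values()) == num_masked
```

In this version every layer maps to a list, including layers with no masked head or exactly one, so the `len(...)` check applies uniformly.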
--- a/examples/contrib/run_transfo_xl.py
+++ b/examples/contrib/run_transfo_xl.py
@@ -80,7 +80,7 @@ def main():

    # Load a pre-trained model
    model = TransfoXLLMHeadModel.from_pretrained(args.model_name)
-    model = model.to(device)
+    model.to(device)

    logger.info(
        "Evaluating with bsz {} tgt_len {} ext_len {} mem_len {} clamp_len {}".format(
--- a/examples/distillation/distiller.py
+++ b/examples/distillation/distiller.py
@@ -80,7 +80,7 @@ class Distiller:

        self.mlm = params.mlm
        if self.mlm:
-            logger.info(f"Using MLM loss for LM step.")
+            logger.info("Using MLM loss for LM step.")
            self.mlm_mask_prop = params.mlm_mask_prop
            assert 0.0 <= self.mlm_mask_prop <= 1.0
            assert params.word_mask + params.word_keep + params.word_rand == 1.0
@@ -91,7 +91,7 @@ class Distiller:
                self.pred_probs = self.pred_probs.half()
                self.token_probs = self.token_probs.half()
        else:
-            logger.info(f"Using CLM loss for LM step.")
+            logger.info("Using CLM loss for LM step.")

        self.epoch = 0
        self.n_iter = 0
@@ -365,8 +365,8 @@ class Distiller:
            self.end_epoch()

        if self.is_master:
-            logger.info(f"Save very last checkpoint as `pytorch_model.bin`.")
-            self.save_checkpoint(checkpoint_name=f"pytorch_model.bin")
+            logger.info("Save very last checkpoint as `pytorch_model.bin`.")
+            self.save_checkpoint(checkpoint_name="pytorch_model.bin")
            logger.info("Training is finished")

    def step(self, input_ids: torch.tensor, attention_mask: torch.tensor, lm_labels: torch.tensor):
--- a/examples/distillation/scripts/binarized_data.py
+++ b/examples/distillation/scripts/binarized_data.py
@@ -60,7 +60,7 @@ def main():
    with open(args.file_path, "r", encoding="utf8") as fp:
        data = fp.readlines()

-    logger.info(f"Start encoding")
+    logger.info("Start encoding")
    logger.info(f"{len(data)} examples to process.")

    rslt = []
--- a/examples/distillation/scripts/extract.py
+++ b/examples/distillation/scripts/extract.py
@@ -93,7 +93,7 @@ if __name__ == "__main__":
    elif args.model_type == "gpt2":
        for w in ["weight", "bias"]:
            compressed_sd[f"{prefix}.ln_f.{w}"] = state_dict[f"{prefix}.ln_f.{w}"]
-        compressed_sd[f"lm_head.weight"] = state_dict[f"lm_head.weight"]
+        compressed_sd["lm_head.weight"] = state_dict["lm_head.weight"]

    print(f"N layers selected for distillation: {std_idx}")
    print(f"Number of params transfered for distillation: {len(compressed_sd.keys())}")
--- a/examples/distillation/scripts/extract_distilbert.py
+++ b/examples/distillation/scripts/extract_distilbert.py
@@ -37,7 +37,7 @@ if __name__ == "__main__":
        model = BertForMaskedLM.from_pretrained(args.model_name)
        prefix = "bert"
    else:
-        raise ValueError(f'args.model_type should be "bert".')
+        raise ValueError('args.model_type should be "bert".')

    state_dict = model.state_dict()
    compressed_sd = {}
@@ -78,8 +78,8 @@ if __name__ == "__main__":
            ]
        std_idx += 1

-    compressed_sd[f"vocab_projector.weight"] = state_dict[f"cls.predictions.decoder.weight"]
-    compressed_sd[f"vocab_projector.bias"] = state_dict[f"cls.predictions.bias"]
+    compressed_sd["vocab_projector.weight"] = state_dict["cls.predictions.decoder.weight"]
+    compressed_sd["vocab_projector.bias"] = state_dict["cls.predictions.bias"]
    if args.vocab_transform:
        for w in ["weight", "bias"]:
            compressed_sd[f"vocab_transform.{w}"] = state_dict[f"cls.predictions.transform.dense.{w}"]
--- a/examples/distillation/train.py
+++ b/examples/distillation/train.py
@@ -273,7 +273,7 @@ def main():
        token_probs = None

    train_lm_seq_dataset = LmSeqsDataset(params=args, data=data)
-    logger.info(f"Data loader created.")
+    logger.info("Data loader created.")

    # STUDENT #
    logger.info(f"Loading student config from {args.student_config}")
@@ -288,7 +288,7 @@ def main():

    if args.n_gpu > 0:
        student.to(f"cuda:{args.local_rank}")
-    logger.info(f"Student loaded.")
+    logger.info("Student loaded.")

    # TEACHER #
    teacher = teacher_model_class.from_pretrained(args.teacher_name, output_hidden_states=True)
--- a/examples/language-modeling/run_language_modeling.py
+++ b/examples/language-modeling/run_language_modeling.py
@@ -115,15 +115,13 @@ class DataTrainingArguments:
    )


-def get_dataset(args: DataTrainingArguments, tokenizer: PreTrainedTokenizer, evaluate=False, local_rank=-1):
+def get_dataset(args: DataTrainingArguments, tokenizer: PreTrainedTokenizer, evaluate=False):
    file_path = args.eval_data_file if evaluate else args.train_data_file
    if args.line_by_line:
-        return LineByLineTextDataset(
-            tokenizer=tokenizer, file_path=file_path, block_size=args.block_size, local_rank=local_rank
-        )
+        return LineByLineTextDataset(tokenizer=tokenizer, file_path=file_path, block_size=args.block_size)
    else:
        return TextDataset(
-            tokenizer=tokenizer, file_path=file_path, block_size=args.block_size, local_rank=local_rank,
+            tokenizer=tokenizer, file_path=file_path, block_size=args.block_size, overwrite_cache=args.overwrite_cache
        )


@@ -220,16 +218,9 @@ def main():
        data_args.block_size = min(data_args.block_size, tokenizer.max_len)

    # Get datasets
-    train_dataset = (
-        get_dataset(data_args, tokenizer=tokenizer, local_rank=training_args.local_rank)
-        if training_args.do_train
-        else None
-    )
-    eval_dataset = (
-        get_dataset(data_args, tokenizer=tokenizer, local_rank=training_args.local_rank, evaluate=True)
-        if training_args.do_eval
-        else None
-    )
+
+    train_dataset = get_dataset(data_args, tokenizer=tokenizer) if training_args.do_train else None
+    eval_dataset = get_dataset(data_args, tokenizer=tokenizer, evaluate=True) if training_args.do_eval else None
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=data_args.mlm, mlm_probability=data_args.mlm_probability
    )
@@ -260,7 +251,7 @@ def main():

    # Evaluation
    results = {}
-    if training_args.do_eval and training_args.local_rank in [-1, 0]:
+    if training_args.do_eval:
        logger.info("*** Evaluate ***")

        eval_output = trainer.evaluate()
@@ -269,11 +260,12 @@ def main():
        result = {"perplexity": perplexity}

        output_eval_file = os.path.join(training_args.output_dir, "eval_results_lm.txt")
-        with open(output_eval_file, "w") as writer:
-            logger.info("***** Eval results *****")
-            for key in sorted(result.keys()):
-                logger.info("  %s = %s", key, str(result[key]))
-                writer.write("%s = %s\n" % (key, str(result[key])))
+        if trainer.is_world_master():
+            with open(output_eval_file, "w") as writer:
+                logger.info("***** Eval results *****")
+                for key in sorted(result.keys()):
+                    logger.info("  %s = %s", key, str(result[key]))
+                    writer.write("%s = %s\n" % (key, str(result[key])))

        results.update(result)

--- a/examples/multiple-choice/run_multiple_choice.py
+++ b/examples/multiple-choice/run_multiple_choice.py
@@ -159,7 +159,6 @@ def main():
            max_seq_length=data_args.max_seq_length,
            overwrite_cache=data_args.overwrite_cache,
            mode=Split.train,
-            local_rank=training_args.local_rank,
        )
        if training_args.do_train
        else None
@@ -172,7 +171,6 @@ def main():
            max_seq_length=data_args.max_seq_length,
            overwrite_cache=data_args.overwrite_cache,
            mode=Split.dev,
-            local_rank=training_args.local_rank,
        )
        if training_args.do_eval
        else None
@@ -204,19 +202,20 @@ def main():

    # Evaluation
    results = {}
-    if training_args.do_eval and training_args.local_rank in [-1, 0]:
+    if training_args.do_eval:
        logger.info("*** Evaluate ***")

        result = trainer.evaluate()

        output_eval_file = os.path.join(training_args.output_dir, "eval_results.txt")
-        with open(output_eval_file, "w") as writer:
-            logger.info("***** Eval results *****")
-            for key, value in result.items():
-                logger.info("  %s = %s", key, value)
-                writer.write("%s = %s\n" % (key, value))
+        if trainer.is_world_master():
+            with open(output_eval_file, "w") as writer:
+                logger.info("***** Eval results *****")
+                for key, value in result.items():
+                    logger.info("  %s = %s", key, value)
+                    writer.write("%s = %s\n" % (key, value))

-            results.update(result)
+                results.update(result)

    return results
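
The hunk above moves the rank check out of the `if training_args.do_eval` condition and onto the file write, so every process runs `trainer.evaluate()` while only one writes `eval_results.txt`. The following is a minimal sketch of that gating. It assumes a simplified, hypothetical `is_world_master` helper; the real `Trainer.is_world_master()` is not shown in this diff and may check more than this. `write_eval_results` is likewise an illustrative name, not part of the example script.

```python
import os


def is_world_master(local_rank, global_rank=0):
    # Hypothetical stand-in for trainer.is_world_master():
    # treat non-distributed runs (local_rank == -1) and global rank 0 as master.
    return local_rank == -1 or global_rank == 0


def write_eval_results(output_dir, result, local_rank, global_rank=0):
    """Write metrics to eval_results.txt on the master process only.

    Returns the path that was written, or None on non-master processes.
    """
    if not is_world_master(local_rank, global_rank):
        return None
    output_eval_file = os.path.join(output_dir, "eval_results.txt")
    with open(output_eval_file, "w") as writer:
        for key, value in result.items():
            writer.write("%s = %s\n" % (key, value))
    return output_eval_file
```

With this shape, all ranks can take part in a distributed evaluation and no two processes write to the same file.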

--- a/examples/multiple-choice/utils_multiple_choice.py
+++ b/examples/multiple-choice/utils_multiple_choice.py
@@ -26,6 +26,7 @@ from enum import Enum
 from typing import List, Optional

 import tqdm
+from filelock import FileLock

 from transformers import PreTrainedTokenizer, is_tf_available, is_torch_available

@@ -77,7 +78,6 @@ class Split(Enum):
 if is_torch_available():
    import torch
    from torch.utils.data.dataset import Dataset
-    from transformers import torch_distributed_zero_first

    class MultipleChoiceDataset(Dataset):
        """
@@ -95,7 +95,6 @@ if is_torch_available():
            max_seq_length: Optional[int] = None,
            overwrite_cache=False,
            mode: Split = Split.train,
-            local_rank=-1,
        ):
            processor = processors[task]()

@@ -103,9 +102,11 @@ if is_torch_available():
                data_dir,
                "cached_{}_{}_{}_{}".format(mode.value, tokenizer.__class__.__name__, str(max_seq_length), task,),
            )
-            with torch_distributed_zero_first(local_rank):
-                # Make sure only the first process in distributed training processes the dataset,
-                # and the others will use the cache.
+
+            # Make sure only the first process in distributed training processes the dataset,
+            # and the others will use the cache.
+            lock_path = cached_features_file + ".lock"
+            with FileLock(lock_path):

                if os.path.exists(cached_features_file) and not overwrite_cache:
                    logger.info(f"Loading features from cached file {cached_features_file}")
@@ -130,9 +131,8 @@ if is_torch_available():
                        pad_token=tokenizer.pad_token_id,
                        pad_token_segment_id=tokenizer.pad_token_type_id,
                    )
-                    if local_rank in [-1, 0]:
-                        logger.info("Saving features into cached file %s", cached_features_file)
-                        torch.save(self.features, cached_features_file)
+                    logger.info("Saving features into cached file %s", cached_features_file)
+                    torch.save(self.features, cached_features_file)

        def __len__(self):
            return len(self.features)
@@ -535,7 +535,12 @@ def convert_examples_to_features(
                text_b = example.question + " " + ending

            inputs = tokenizer.encode_plus(
-                text_a, text_b, add_special_tokens=True, max_length=max_length, pad_to_max_length=True,
+                text_a,
+                text_b,
+                add_special_tokens=True,
+                max_length=max_length,
+                pad_to_max_length=True,
+                return_overflowing_tokens=True,
            )
            if "num_truncated_tokens" in inputs and inputs["num_truncated_tokens"] > 0:
                logger.info(
--- a/examples/text-classification/run_glue.py
+++ b/examples/text-classification/run_glue.py
@@ -135,7 +135,8 @@ def main():

    # Get datasets
    train_dataset = GlueDataset(data_args, tokenizer=tokenizer) if training_args.do_train else None
-    eval_dataset = GlueDataset(data_args, tokenizer=tokenizer, evaluate=True) if training_args.do_eval else None
+    eval_dataset = GlueDataset(data_args, tokenizer=tokenizer, mode="dev") if training_args.do_eval else None
+    test_dataset = GlueDataset(data_args, tokenizer=tokenizer, mode="test") if training_args.do_predict else None

    def compute_metrics(p: EvalPrediction) -> Dict:
        if output_mode == "classification":
@@ -165,31 +166,57 @@ def main():
            tokenizer.save_pretrained(training_args.output_dir)

    # Evaluation
-    results = {}
-    if training_args.do_eval and training_args.local_rank in [-1, 0]:
+    eval_results = {}
+    if training_args.do_eval:
        logger.info("*** Evaluate ***")

        # Loop to handle MNLI double evaluation (matched, mis-matched)
        eval_datasets = [eval_dataset]
        if data_args.task_name == "mnli":
            mnli_mm_data_args = dataclasses.replace(data_args, task_name="mnli-mm")
-            eval_datasets.append(GlueDataset(mnli_mm_data_args, tokenizer=tokenizer, evaluate=True))
+            eval_datasets.append(GlueDataset(mnli_mm_data_args, tokenizer=tokenizer, mode="dev"))

        for eval_dataset in eval_datasets:
-            result = trainer.evaluate(eval_dataset=eval_dataset)
+            eval_result = trainer.evaluate(eval_dataset=eval_dataset)

            output_eval_file = os.path.join(
                training_args.output_dir, f"eval_results_{eval_dataset.args.task_name}.txt"
            )
-            with open(output_eval_file, "w") as writer:
-                logger.info("***** Eval results {} *****".format(eval_dataset.args.task_name))
-                for key, value in result.items():
-                    logger.info("  %s = %s", key, value)
-                    writer.write("%s = %s\n" % (key, value))
+            if trainer.is_world_master():
+                with open(output_eval_file, "w") as writer:
+                    logger.info("***** Eval results {} *****".format(eval_dataset.args.task_name))
+                    for key, value in eval_result.items():
+                        logger.info("  %s = %s", key, value)
+                        writer.write("%s = %s\n" % (key, value))

-            results.update(result)
+            eval_results.update(eval_result)

-    return results
+    if training_args.do_predict:
+        logger.info("*** Test ***")
+        test_datasets = [test_dataset]
+        if data_args.task_name == "mnli":
+            mnli_mm_data_args = dataclasses.replace(data_args, task_name="mnli-mm")
+            test_datasets.append(GlueDataset(mnli_mm_data_args, tokenizer=tokenizer, mode="test"))
+
+        for test_dataset in test_datasets:
+            predictions = trainer.predict(test_dataset=test_dataset).predictions
+            if output_mode == "classification":
+                predictions = np.argmax(predictions, axis=1)
+
+            output_test_file = os.path.join(
+                training_args.output_dir, f"test_results_{test_dataset.args.task_name}.txt"
+            )
+            if trainer.is_world_master():
+                with open(output_test_file, "w") as writer:
+                    logger.info("***** Test results {} *****".format(test_dataset.args.task_name))
+                    writer.write("index\tprediction\n")
+                    for index, item in enumerate(predictions):
+                        if output_mode == "regression":
+                            writer.write("%d\t%3.3f\n" % (index, item))
+                        else:
+                            item = test_dataset.get_labels()[item]
+                            writer.write("%d\t%s\n" % (index, item))
+    return eval_results


 def _mp_fn(index):
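
The new `do_predict` branch in `run_glue.py` writes one tab-separated row per test example: a formatted float for regression tasks, and the label name for classification tasks. A rough, dependency-free sketch of that output format follows. It uses plain Python lists in place of `np.argmax`, and `argmax_rows` and `write_predictions` are illustrative names that do not exist in the script.

```python
import io


def argmax_rows(logits):
    # Plain-Python substitute for np.argmax(logits, axis=1).
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def write_predictions(writer, predictions, output_mode, label_list=None):
    """Write predictions in the index<TAB>prediction layout used by the diff."""
    writer.write("index\tprediction\n")
    for index, item in enumerate(predictions):
        if output_mode == "regression":
            writer.write("%d\t%3.3f\n" % (index, item))
        else:
            # Map the predicted class id back to its label name.
            writer.write("%d\t%s\n" % (index, label_list[item]))


buffer = io.StringIO()
class_ids = argmax_rows([[0.1, 0.9], [2.0, -1.0]])
write_predictions(buffer, class_ids, "classification", label_list=["neg", "pos"])
print(buffer.getvalue())
```

In the script itself the label list comes from `test_dataset.get_labels()`; the two-label list here is made up for the example.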
--- a/examples/token-classification/run_ner.py
+++ b/examples/token-classification/run_ner.py
@@ -171,7 +171,6 @@ def main():
            max_seq_length=data_args.max_seq_length,
            overwrite_cache=data_args.overwrite_cache,
            mode=Split.train,
-            local_rank=training_args.local_rank,
        )
        if training_args.do_train
        else None
@@ -185,7 +184,6 @@ def main():
            max_seq_length=data_args.max_seq_length,
            overwrite_cache=data_args.overwrite_cache,
            mode=Split.dev,
-            local_rank=training_args.local_rank,
        )
        if training_args.do_eval
        else None
@@ -237,22 +235,23 @@ def main():

    # Evaluation
    results = {}
-    if training_args.do_eval and training_args.local_rank in [-1, 0]:
+    if training_args.do_eval:
        logger.info("*** Evaluate ***")

        result = trainer.evaluate()

        output_eval_file = os.path.join(training_args.output_dir, "eval_results.txt")
-        with open(output_eval_file, "w") as writer:
-            logger.info("***** Eval results *****")
-            for key, value in result.items():
-                logger.info("  %s = %s", key, value)
-                writer.write("%s = %s\n" % (key, value))
+        if trainer.is_world_master():
+            with open(output_eval_file, "w") as writer:
+                logger.info("***** Eval results *****")
+                for key, value in result.items():
+                    logger.info("  %s = %s", key, value)
+                    writer.write("%s = %s\n" % (key, value))

            results.update(result)

    # Predict
-    if training_args.do_predict and training_args.local_rank in [-1, 0]:
+    if training_args.do_predict:
        test_dataset = NerDataset(
            data_dir=data_args.data_dir,
            tokenizer=tokenizer,
@@ -261,33 +260,36 @@ def main():
            max_seq_length=data_args.max_seq_length,
            overwrite_cache=data_args.overwrite_cache,
            mode=Split.test,
-            local_rank=training_args.local_rank,
        )

        predictions, label_ids, metrics = trainer.predict(test_dataset)
        preds_list, _ = align_predictions(predictions, label_ids)

        output_test_results_file = os.path.join(training_args.output_dir, "test_results.txt")
-        with open(output_test_results_file, "w") as writer:
-            for key, value in metrics.items():
-                logger.info("  %s = %s", key, value)
-                writer.write("%s = %s\n" % (key, value))
+        if trainer.is_world_master():
+            with open(output_test_results_file, "w") as writer:
+                for key, value in metrics.items():
+                    logger.info("  %s = %s", key, value)
+                    writer.write("%s = %s\n" % (key, value))

        # Save predictions
        output_test_predictions_file = os.path.join(training_args.output_dir, "test_predictions.txt")
-        with open(output_test_predictions_file, "w") as writer:
-            with open(os.path.join(data_args.data_dir, "test.txt"), "r") as f:
-                example_id = 0
-                for line in f:
-                    if line.startswith("-DOCSTART-") or line == "" or line == "\n":
-                        writer.write(line)
-                        if not preds_list[example_id]:
-                            example_id += 1
-                    elif preds_list[example_id]:
-                        output_line = line.split()[0] + " " + preds_list[example_id].pop(0) + "\n"
-                        writer.write(output_line)
-                    else:
-                        logger.warning("Maximum sequence length exceeded: No prediction for '%s'.", line.split()[0])
+        if trainer.is_world_master():
+            with open(output_test_predictions_file, "w") as writer:
+                with open(os.path.join(data_args.data_dir, "test.txt"), "r") as f:
+                    example_id = 0
+                    for line in f:
+                        if line.startswith("-DOCSTART-") or line == "" or line == "\n":
+                            writer.write(line)
+                            if not preds_list[example_id]:
+                                example_id += 1
+                        elif preds_list[example_id]:
+                            output_line = line.split()[0] + " " + preds_list[example_id].pop(0) + "\n"
+                            writer.write(output_line)
+                        else:
+                            logger.warning(
+                                "Maximum sequence length exceeded: No prediction for '%s'.", line.split()[0]
+                            )

    return results
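
The prediction-writing loop in `run_ner.py` walks `test.txt` line by line and pairs each token with the next unused prediction for the current sentence. The sketch below isolates that logic so it can be tried without a trained model. `align_predictions_to_lines` is a hypothetical helper name, and unlike the script it copies the prediction lists instead of consuming them in place.

```python
def align_predictions_to_lines(lines, preds_list):
    """Pair CoNLL-style token lines with per-sentence predictions.

    Returns (output_lines, skipped_tokens). Tokens are skipped when a
    sentence has more tokens than predictions, which happens when the
    sentence was truncated to the maximum sequence length.
    """
    remaining = [list(preds) for preds in preds_list]  # avoid mutating the input
    output_lines, skipped_tokens = [], []
    example_id = 0
    for line in lines:
        if line.startswith("-DOCSTART-") or line == "" or line == "\n":
            # Sentence boundary: copy it through and move to the next sentence
            # once the current sentence's predictions are used up.
            output_lines.append(line)
            if not remaining[example_id]:
                example_id += 1
        elif remaining[example_id]:
            output_lines.append(line.split()[0] + " " + remaining[example_id].pop(0) + "\n")
        else:
            skipped_tokens.append(line.split()[0])
    return output_lines, skipped_tokens


lines = ["EU B-ORG\n", "rejects O\n", "\n", "Peter B-PER\n", "\n"]
out, skipped = align_predictions_to_lines(lines, [["B-ORG", "O"], ["B-PER"]])
print("".join(out), skipped)
```

Like the loop in the diff, this assumes the input does not contain more sentence boundaries than there are prediction lists.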

--- a/examples/token-classification/utils_ner.py
+++ b/examples/token-classification/utils_ner.py
@@ -22,6 +22,8 @@ from dataclasses import dataclass
 from enum import Enum
 from typing import List, Optional, Union

+from filelock import FileLock
+
 from transformers import PreTrainedTokenizer, is_tf_available, is_torch_available


@@ -68,7 +70,6 @@ if is_torch_available():
    import torch
    from torch import nn
    from torch.utils.data.dataset import Dataset
-    from transformers import torch_distributed_zero_first

    class NerDataset(Dataset):
        """
@@ -90,16 +91,16 @@ if is_torch_available():
            max_seq_length: Optional[int] = None,
            overwrite_cache=False,
            mode: Split = Split.train,
-            local_rank=-1,
        ):
            # Load data features from cache or dataset file
            cached_features_file = os.path.join(
                data_dir, "cached_{}_{}_{}".format(mode.value, tokenizer.__class__.__name__, str(max_seq_length)),
            )

-            with torch_distributed_zero_first(local_rank):
-                # Make sure only the first process in distributed training processes the dataset,
-                # and the others will use the cache.
+            # Make sure only the first process in distributed training processes the dataset,
+            # and the others will use the cache.
+            lock_path = cached_features_file + ".lock"
+            with FileLock(lock_path):

                if os.path.exists(cached_features_file) and not overwrite_cache:
                    logger.info(f"Loading features from cached file {cached_features_file}")
@@ -125,9 +126,8 @@ if is_torch_available():
                        pad_token_segment_id=tokenizer.pad_token_type_id,
                        pad_token_label_id=self.pad_token_label_id,
                    )
-                    if local_rank in [-1, 0]:
-                        logger.info(f"Saving features into cached file {cached_features_file}")
-                        torch.save(self.features, cached_features_file)
+                    logger.info(f"Saving features into cached file {cached_features_file}")
+                    torch.save(self.features, cached_features_file)

        def __len__(self):
            return len(self.features)
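
Both dataset classes now guard the feature cache with `FileLock(cached_features_file + ".lock")`: the first process to take the lock builds and saves the features, and later processes find the cache file and load it. The sketch below illustrates the same load-or-build pattern. To stay dependency-free it substitutes a simple stand-in lock built on `os.open` with `O_EXCL` for the `filelock` package, and `pickle` for `torch.save`; `simple_file_lock` and `load_or_build_features` are hypothetical names.

```python
import contextlib
import os
import pickle
import time


@contextlib.contextmanager
def simple_file_lock(lock_path, poll_interval=0.05):
    # Stand-in for filelock.FileLock: retry until the lock file can be
    # created exclusively, then remove it on exit.
    while True:
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            time.sleep(poll_interval)
    try:
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)


def load_or_build_features(cached_features_file, build_fn, overwrite_cache=False):
    """Load cached features if present, otherwise build and cache them."""
    with simple_file_lock(cached_features_file + ".lock"):
        if os.path.exists(cached_features_file) and not overwrite_cache:
            with open(cached_features_file, "rb") as f:
                return pickle.load(f)
        features = build_fn()
        with open(cached_features_file, "wb") as f:
            pickle.dump(features, f)
        return features
```

Because the check and the save both happen while the lock is held, the save no longer needs a `local_rank in [-1, 0]` guard. One caveat of this stand-in is that a crashed process can leave a stale lock file behind, which the `filelock` package is designed to handle better.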
--- a/model_cards/Tereveni-AI/gpt2-124M-uk-fiction/README.md
+++ b/model_cards/Tereveni-AI/gpt2-124M-uk-fiction/README.md
@@ -0,0 +1,23 @@
+Note: **default code snippet above won't work** because we are using `AlbertTokenizer` with `GPT2LMHeadModel`, see [issue](https://github.com/huggingface/transformers/issues/4285).
+
+## GPT2 124M Trained on Ukrainian Fiction
+
+Example usage:
+```python
+from transformers import AlbertTokenizer, GPT2LMHeadModel
+
+tokenizer = AlbertTokenizer.from_pretrained("Tereveni-AI/gpt2-124M-uk-fiction")
+model = GPT2LMHeadModel.from_pretrained("Tereveni-AI/gpt2-124M-uk-fiction")
+
+input_ids = tokenizer.encode('Но зла Юнона, суча дочка,', add_special_tokens=False, return_tensors='pt')
+
+outputs = model.generate(
+    input_ids,
+    do_sample=True,
+    num_return_sequences=3,
+    max_length=50
+)
+
+for i, out in enumerate(outputs):
+    print('{}: {}'.format(i, tokenizer.decode(out)))
+```
--- a/model_cards/ViktorAlm/electra-base-norwegian-uncased-discriminator/README.md
+++ b/model_cards/ViktorAlm/electra-base-norwegian-uncased-discriminator/README.md
@@ -1,19 +1,25 @@
+---
+language: norwegian
+thumbnail: https://i.imgur.com/QqSEC5I.png
+---
+
 # Norwegian Electra
-Image incoming, im going to have som fun with this one.
+![Image of norwegian electra](https://i.imgur.com/QqSEC5I.png)

 Trained on Oscar + wikipedia + opensubtitles + some other data I had with the awesome power of TPUs(V3-8)

 Use with caution. I have no downstream tasks in Norwegian to test on so I have no idea of its performance yet.
-
+# Model
+## Electra: Pre-training Text Encoders as Discriminators Rather Than Generators
+Kevin Clark and Minh-Thang Luong and Quoc V. Le and Christopher D. Manning
+- https://openreview.net/pdf?id=r1xMH1BtvB
+- https://github.com/google-research/electra
 # Acknowledgments
-
 ### TensorFlow Research Cloud
 Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC). Thanks for providing access to the TFRC ❤️
 - https://www.tensorflow.org/tfrc
-
 #### OSCAR corpus
 - https://oscar-corpus.com/
-
 #### OPUS
 - http://opus.nlpl.eu/
 - http://www.opensubtitles.org/
--- a/model_cards/activebus/BERT-DK_laptop/README.md
+++ b/model_cards/activebus/BERT-DK_laptop/README.md
@@ -0,0 +1,43 @@
+# ReviewBERT
+
+BERT (post-)trained from review corpus to understand sentiment, options and various e-commerce aspects.  
+
+`BERT-DK_laptop` is trained from 100MB laptop corpus under `Electronics/Computers & Accessories/Laptops`. 
+
+
+## Model Description
+
+The original model is from `BERT-base-uncased` trained from Wikipedia+BookCorpus.  
+Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).  
+
+`BERT-DK_laptop` is trained from 100MB laptop corpus under `Electronics/Computers & Accessories/Laptops`. 
+
+## Instructions
+Loading the post-trained weights is as simple as, e.g., 
+
+```python
+import torch
+from transformers import AutoModel, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("activebus/BERT-DK_laptop")
+model = AutoModel.from_pretrained("activebus/BERT-DK_laptop")
+
+```
+
+
+## Evaluation Results
+
+Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf) 
+
+
+## Citation
+If you find this work useful, please cite it as follows.
+```
+@inproceedings{xu_bert2019,
+    title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
+    author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
+    booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
+    month = "jun",
+    year = "2019",
+}
+```
--- a/model_cards/activebus/BERT-DK_rest/README.md
+++ b/model_cards/activebus/BERT-DK_rest/README.md
@@ -0,0 +1,41 @@
+# ReviewBERT
+
+BERT (post-)trained from review corpus to understand sentiment, options and various e-commerce aspects.
+
+`BERT-DK_rest` is trained from 1G (19 types) restaurants from Yelp.  
+
+## Model Description
+
+The original model is from `BERT-base-uncased` trained from Wikipedia+BookCorpus.  
+Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).  
+
+
+## Instructions
+Loading the post-trained weights is as simple as, e.g., 
+
+```python
+import torch
+from transformers import AutoModel, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("activebus/BERT-DK_rest")
+model = AutoModel.from_pretrained("activebus/BERT-DK_rest")
+
+```
+
+
+## Evaluation Results
+
+Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf) 
+
+
+## Citation
+If you find this work useful, please cite it as follows.
+```
+@inproceedings{xu_bert2019,
+    title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
+    author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
+    booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
+    month = "jun",
+    year = "2019",
+}
+```
--- a/model_cards/activebus/BERT-PT_laptop/README.md
+++ b/model_cards/activebus/BERT-PT_laptop/README.md
@@ -0,0 +1,41 @@
+# ReviewBERT
+
+BERT (post-)trained from review corpus to understand sentiment, options and various e-commerce aspects.  
+
+`BERT-DK_laptop` is trained from 100MB laptop corpus under `Electronics/Computers & Accessories/Laptops`. 
+`BERT-PT_*` additionally uses SQuAD 1.1.  
+
+## Model Description
+
+The original model is from `BERT-base-uncased` trained from Wikipedia+BookCorpus.  
+Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).  
+
+
+## Instructions
+Loading the post-trained weights is as simple as, e.g., 
+
+```python
+import torch
+from transformers import AutoModel, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("activebus/BERT-PT_laptop")
+model = AutoModel.from_pretrained("activebus/BERT-PT_laptop")
+
+```
+
+## Evaluation Results
+
+Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf) 
+
+
+## Citation
+If you find this work useful, please cite it as follows.
+```
+@inproceedings{xu_bert2019,
+    title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
+    author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
+    booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
+    month = "jun",
+    year = "2019",
+}
+```
--- a/model_cards/activebus/BERT-PT_rest/README.md
+++ b/model_cards/activebus/BERT-PT_rest/README.md
@@ -0,0 +1,42 @@
+# ReviewBERT
+
+BERT (post-)trained from review corpus to understand sentiment, options and various e-commerce aspects.  
+
+`BERT-DK_rest` is trained from 1G (19 types) restaurants from Yelp.
+`BERT-PT_*` additionally uses SQuAD 1.1.  
+
+## Model Description
+
+The original model is from `BERT-base-uncased` trained from Wikipedia+BookCorpus.  
+Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).  
+
+
+## Instructions
+Loading the post-trained weights is as simple as, e.g., 
+
+```python
+import torch
+from transformers import AutoModel, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("activebus/BERT-PT_rest")
+model = AutoModel.from_pretrained("activebus/BERT-PT_rest")
+
+```
+
+
+## Evaluation Results
+
+Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf) 
+
+
+## Citation
+If you find this work useful, please cite it as follows.
+```
+@inproceedings{xu_bert2019,
+    title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
+    author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
+    booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
+    month = "jun",
+    year = "2019",
+}
+```
--- a/model_cards/activebus/BERT-XD_Review/README.md
+++ b/model_cards/activebus/BERT-XD_Review/README.md
@@ -0,0 +1,44 @@
+# ReviewBERT
+
+BERT (post-)trained from review corpus to understand sentiment, options and various e-commerce aspects.  
+Please visit https://github.com/howardhsu/BERT-for-RRC-ABSA for details.  
+
+`BERT-XD_Review` is a cross-domain (beyond just `laptop` and `restaurant`) language model, where each example is from a single product / restaurant with the same rating, post-trained (fine-tuned) on a combination of 5-core Amazon reviews and all Yelp data, expected to be 22 G in total. It is trained for 4 epochs on `bert-base-uncased`.
+The preprocessing code is available [here](https://github.com/howardhsu/BERT-for-RRC-ABSA/transformers).
+
+## Model Description
+
+The original model is from `BERT-base-uncased`.  
+Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).  
+
+
+## Instructions
+Loading the post-trained weights is as simple as, e.g., 
+
+```python
+import torch
+from transformers import AutoModel, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("activebus/BERT-XD_Review")
+model = AutoModel.from_pretrained("activebus/BERT-XD_Review")
+
+```
+
+
+## Evaluation Results
+
+Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf) 
+`BERT_Review` is expected to have similar performance on domain-specific tasks (such as aspect extraction) as `BERT-DK`, but much better on general tasks such as aspect sentiment classification (different domains mostly share similar sentiment words).
+
+
+## Citation
+If you find this work useful, please cite it as follows.
+```
+@inproceedings{xu_bert2019,
+    title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
+    author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
+    booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
+    month = "jun",
+    year = "2019",
+}
+```
--- a/model_cards/activebus/BERT_Review/README.md
+++ b/model_cards/activebus/BERT_Review/README.md
@@ -0,0 +1,44 @@
+# ReviewBERT
+
+BERT (post-)trained from review corpus to understand sentiment, options and various e-commerce aspects.  
+
+`BERT_Review` is a cross-domain (beyond just `laptop` and `restaurant`) language model with one example from randomly mixed domains, post-trained (fine-tuned) on a combination of 5-core Amazon reviews and all Yelp data, expected to be 22 G in total. It is trained for 4 epochs on `bert-base-uncased`.
+The preprocessing code is available [here](https://github.com/howardhsu/BERT-for-RRC-ABSA/transformers).
+
+
+## Model Description
+
+The original model is from `BERT-base-uncased` trained from Wikipedia+BookCorpus.  
+Models are post-trained from [Amazon Dataset](http://jmcauley.ucsd.edu/data/amazon/) and [Yelp Dataset](https://www.yelp.com/dataset/challenge/).  
+
+
+## Instructions
+Loading the post-trained weights is as simple as, e.g., 
+
+```python
+import torch
+from transformers import AutoModel, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("activebus/BERT_Review")
+model = AutoModel.from_pretrained("activebus/BERT_Review")
+
+```
+
+
+## Evaluation Results
+
+Check our [NAACL paper](https://www.aclweb.org/anthology/N19-1242.pdf) 
+`BERT_Review` is expected to have similar performance on domain-specific tasks (such as aspect extraction) as `BERT-DK`, but much better on general tasks such as aspect sentiment classification (different domains mostly share similar sentiment words).
+
+
+## Citation
+If you find this work useful, please cite it as follows.
+```
+@inproceedings{xu_bert2019,
+    title = "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis",
+    author = "Xu, Hu and Liu, Bing and Shu, Lei and Yu, Philip S.",
+    booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics",
+    month = "jun",
+    year = "2019",
+}
+```
--- a/model_cards/bert-base-german-cased-README.md
+++ b/model_cards/bert-base-german-cased-README.md
@@ -18,13 +18,16 @@ tags:
 **Eval data:** Conll03 (NER), GermEval14 (NER), GermEval18 (Classification), GNAD (Classification)  
 **Infrastructure**: 1x TPU v2  
 **Published**: Jun 14th, 2019
+
+**Update April 3rd, 2020**: we updated the vocabulary file on deepset's s3 to conform with the default tokenization of punctuation tokens. 
+For details see the related [FARM issue](https://github.com/deepset-ai/FARM/issues/60). If you want to use the old vocab we have also uploaded a ["deepset/bert-base-german-cased-oldvocab"](https://huggingface.co/deepset/bert-base-german-cased-oldvocab) model.
 
 ## Details
 - We trained using Google's Tensorflow code on a single cloud TPU v2 with standard settings.
 - We trained 810k steps with a batch size of 1024 for sequence length 128 and 30k steps with sequence length 512. Training took about 9 days.
 - As training data we used the latest German Wikipedia dump (6GB of raw txt files), the OpenLegalData dump (2.4 GB) and news articles (3.6 GB).
 - We cleaned the data dumps with tailored scripts and segmented sentences with spacy v2.1. To create tensorflow records we used the recommended sentencepiece library for creating the word piece vocabulary and tensorflow scripts to convert the text to data usable by BERT.
- Update April 3rd, 2020: updated the vocab file on deepset s3 to adjust tokenization of punctuation.
+

 See https://deepset.ai/german-bert for more details

--- a/model_cards/deepset/bert-base-german-cased-oldvocab/README.md
+++ b/model_cards/deepset/bert-base-german-cased-oldvocab/README.md
@@ -0,0 +1,28 @@
+---
+language: german
+thumbnail: https://static.tildacdn.com/tild6438-3730-4164-b266-613634323466/german_bert.png
+tags:
+- exbert
+---
+
+<a href="https://huggingface.co/exbert/?model=bert-base-german-cased">
+	<img width="300px" src="https://hf-dinosaur.huggingface.co/exbert/button.png">
+</a>
+
+# German BERT with old vocabulary
+For details see the related [FARM issue](https://github.com/deepset-ai/FARM/issues/60).
+
+
+## About us
+![deepset logo](https://raw.githubusercontent.com/deepset-ai/FARM/master/docs/img/deepset_logo.png)
+
+We bring NLP to the industry via open source!  
+Our focus: Industry specific language models & large scale QA systems.  
+  
+Some of our work: 
+- [German BERT (aka "bert-base-german-cased")](https://deepset.ai/german-bert)
+- [FARM](https://github.com/deepset-ai/FARM)
+- [Haystack](https://github.com/deepset-ai/haystack/)
+
+Get in touch:
+[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Website](https://deepset.ai)  
--- a/model_cards/digitalepidemiologylab/covid-twitter-bert/README.md
+++ b/model_cards/digitalepidemiologylab/covid-twitter-bert/README.md
@@ -0,0 +1,18 @@
+# COVID-Twitter-BERT (CT-BERT)
+BERT-large-uncased model, pretrained on a corpus of messages from Twitter about COVID-19
+
+## Overview
+This model was trained on 160M tweets collected between January 12 and April 16, 2020 containing at least one of the keywords "wuhan", "ncov", "coronavirus", "covid", or "sars-cov-2". These tweets were filtered and preprocessed to reach a final sample of 22.5M tweets (containing 40.7M sentences and 633M tokens) which were used for training.
+
+This model was evaluated based on downstream classification tasks, but it could be used for any other NLP task which can leverage contextual embeddings. 
+
+In order to achieve best results, make sure to use the same text preprocessing as we did for pretraining. This involves replacing user mentions, urls and emojis. You can find a script on our projects [GitHub repo](https://github.com/digitalepidemiologylab/covid-twitter-bert).
+
+## Example usage
+```python
+tokenizer = AutoTokenizer.from_pretrained("digitalepidemiologylab/covid-twitter-bert")
+model = TFAutoModel.from_pretrained("digitalepidemiologylab/covid-twitter-bert")
+```
+
+## References
+[1] Martin Müller, Marcel Salathé, Per E Kummervold. "COVID-Twitter-BERT: A Natural Language Processing Model to Analyse COVID-19 Content on Twitter" arXiv preprint arXiv:2005.07503 (2020).
--- a/model_cards/dumitrescustefan/bert-base-romanian-cased-v1/README.md
+++ b/model_cards/dumitrescustefan/bert-base-romanian-cased-v1/README.md
@@ -0,0 +1,48 @@
+---
+language: romanian
+---
+
+# bert-base-romanian-cased-v1
+
+The BERT **base**, **cased** model for Romanian, trained on a 15GB corpus, version ![v1.0](https://img.shields.io/badge/v1.0-21%20Apr%202020-ff6666)
+
+### How to use
+
+```python
+from transformers import AutoTokenizer, AutoModel
+import torch
+# load tokenizer and model
+tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
+model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
+# tokenize a sentence and run through the model
+input_ids = torch.tensor(tokenizer.encode("Acesta este un test.", add_special_tokens=True)).unsqueeze(0)  # Batch size 1
+outputs = model(input_ids)
+# get encoding
+last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
+```
+
+### Evaluation
+
+Evaluation is performed on Universal Dependencies [Romanian RRT](https://universaldependencies.org/treebanks/ro_rrt/index.html) UPOS, XPOS and LAS, and on a NER task based on [RONEC](https://github.com/dumitrescustefan/ronec). Details, as well as more in-depth tests not shown here, are given in the dedicated [evaluation page](https://github.com/dumitrescustefan/Romanian-Transformers/tree/master/evaluation/README.md). 
+
+The baseline is the [Multilingual BERT](https://github.com/google-research/bert/blob/master/multilingual.md) model ``bert-base-multilingual-(un)cased``, as at the time of writing it was the only available BERT model that works on Romanian.
+
+| Model                          |  UPOS |  XPOS  |  NER  |  LAS  |
+|--------------------------------|:-----:|:------:|:-----:|:-----:|
+| bert-base-multilingual-cased   | 97.87 |  96.16 | 84.13 | 88.04 |
+| bert-base-romanian-cased-v1    | **98.00** |  **96.46** | **85.88** | **89.69** |
+
+### Corpus 
+
+The model is trained on the following corpora (stats in the table below are after cleaning):
+
+| Corpus    	| Lines(M) 	| Words(M) 	| Chars(B) 	| Size(GB) 	|
+|-----------	|:--------:	|:--------:	|:--------:	|:--------:	|
+| OPUS      	|   55.05  	|  635.04  	|   4.045  	|    3.8   	|
+| OSCAR     	|   33.56  	|  1725.82 	|  11.411  	|    11    	|
+| Wikipedia 	|   1.54   	|   60.47  	|   0.411  	|    0.4   	|
+| **Total**     	|   **90.15**  	|  **2421.33** 	|  **15.867**  	|   **15.2**   	|
+
+#### Acknowledgements
+
+- We'd like to thank [Sampo Pyysalo](https://github.com/spyysalo) from TurkuNLP for helping us out with the compute needed to pretrain the v1.0 BERT models. He's awesome!
--- a/model_cards/dumitrescustefan/bert-base-romanian-uncased-v1/README.md
+++ b/model_cards/dumitrescustefan/bert-base-romanian-uncased-v1/README.md
@@ -0,0 +1,51 @@
+---
+language: romanian
+---
+
+# bert-base-romanian-uncased-v1
+
+The BERT **base**, **uncased** model for Romanian, trained on a 15GB corpus, version ![v1.0](https://img.shields.io/badge/v1.0-21%20Apr%202020-ff6666)
+
+### How to use
+
+```python
+from transformers import AutoTokenizer, AutoModel
+import torch
+
+# load tokenizer and model
+tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-uncased-v1", do_lower_case=True)
+model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-uncased-v1")
+
+# tokenize a sentence and run through the model
+input_ids = torch.tensor(tokenizer.encode("Acesta este un test.", add_special_tokens=True)).unsqueeze(0)  # Batch size 1
+outputs = model(input_ids)
+
+# get encoding
+last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
+```
+
+### Evaluation
+
+Evaluation is performed on Universal Dependencies [Romanian RRT](https://universaldependencies.org/treebanks/ro_rrt/index.html) UPOS, XPOS and LAS, and on a NER task based on [RONEC](https://github.com/dumitrescustefan/ronec). Details, as well as more in-depth tests not shown here, are given in the dedicated [evaluation page](https://github.com/dumitrescustefan/Romanian-Transformers/tree/master/evaluation/README.md). 
+
+The baseline is the [Multilingual BERT](https://github.com/google-research/bert/blob/master/multilingual.md) model ``bert-base-multilingual-(un)cased``, as at the time of writing it was the only available BERT model that works on Romanian.
+
+| Model                          |  UPOS |  XPOS  |  NER  |  LAS  |
+|--------------------------------|:-----:|:------:|:-----:|:-----:|
+| bert-base-multilingual-uncased | 97.65 |  95.72 | 83.91 | 87.65 |
+| bert-base-romanian-uncased-v1  | **98.18** |  **96.84** | **85.26** | **89.61** |
+
+### Corpus 
+
+The model is trained on the following corpora (stats in the table below are after cleaning):
+
+| Corpus    	| Lines(M) 	| Words(M) 	| Chars(B) 	| Size(GB) 	|
+|-----------	|:--------:	|:--------:	|:--------:	|:--------:	|
+| OPUS      	|   55.05  	|  635.04  	|   4.045  	|    3.8   	|
+| OSCAR     	|   33.56  	|  1725.82 	|  11.411  	|    11    	|
+| Wikipedia 	|   1.54   	|   60.47  	|   0.411  	|    0.4   	|
+| **Total**     	|   **90.15**  	|  **2421.33** 	|  **15.867**  	|   **15.2**   	|
+
+#### Acknowledgements
+
+- We'd like to thank [Sampo Pyysalo](https://github.com/spyysalo) from TurkuNLP for helping us out with the compute needed to pretrain the v1.0 BERT models. He's awesome!
--- a/model_cards/huseinzol05/t5-base-bahasa-cased/README.md
+++ b/model_cards/huseinzol05/t5-base-bahasa-cased/README.md
@@ -0,0 +1,74 @@
+---
+language: malay
+---
+
+# Bahasa T5 Model
+
+Pretrained T5 base language model for Malay and Indonesian. 
+
+## Pretraining Corpus
+
+The `t5-base-bahasa-cased` model was pretrained on multiple tasks. Below is the list of tasks we trained on:
+
+1. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [local Wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
+2. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
+3. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
+4. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
+5. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
+6. [Unsupervised](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1875) on [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
+7. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [local Wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
+8. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
+9. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
+10. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
+11. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
+12. [Next sentence prediction](https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1129) on [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
+13. [Bahasa SNLI](https://github.com/huseinzol05/Malaya-Dataset#snli).
+14. [Bahasa Question Quora](https://github.com/huseinzol05/Malaya-Dataset#quora).
+15. [Bahasa Natural Questions](https://github.com/huseinzol05/Malaya-Dataset#natural-questions).
+16. [News title summarization](https://github.com/huseinzol05/Malaya-Dataset#crawled-news).
+17. [Stemming to original wikipedia](https://github.com/huseinzol05/Malaya/blob/master/pretrained-model/t5/generate-stemming.ipynb).
+18. [Synonym to original wikipedia](https://github.com/huseinzol05/Malaya/blob/master/pretrained-model/t5/generate-synonym.ipynb).
+
+Preprocessing steps can be reproduced from [Malaya/pretrained-model/preprocess](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/preprocess).
+
+## Pretraining details
+
+- This model was trained using Google T5's GitHub [repository](https://github.com/google-research/text-to-text-transfer-transformer) on a v3-8 TPU.
+- All steps can be reproduced from [Malaya/pretrained-model/t5](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/t5).
+
+## Load Pretrained Model
+
+You can use this model by installing `torch` or `tensorflow` and the Hugging Face `transformers` library. You can then initialize it directly like this:
+
+```python
+from transformers import T5Tokenizer, T5Model
+
+model = T5Model.from_pretrained('huseinzol05/t5-base-bahasa-cased')
+tokenizer = T5Tokenizer.from_pretrained('huseinzol05/t5-base-bahasa-cased')
+```
+
+## Example using T5ForConditionalGeneration
+
+```python
+from transformers import T5Tokenizer, T5ForConditionalGeneration
+
+tokenizer = T5Tokenizer.from_pretrained('huseinzol05/t5-base-bahasa-cased')
+model = T5ForConditionalGeneration.from_pretrained('huseinzol05/t5-base-bahasa-cased')
+input_ids = tokenizer.encode('soalan: siapakah perdana menteri malaysia?', return_tensors = 'pt')
+outputs = model.generate(input_ids)
+print(tokenizer.decode(outputs[0]))
+```
+
+Output is,
+
+```
+'Mahathir Mohamad'
+```
+
+## Results
+
+For further details on model performance, see the accuracy page from Malaya, https://malaya.readthedocs.io/en/latest/Accuracy.html, where we compare it with traditional models.
+
+## Acknowledgement
+
+Thanks to [Im Big](https://www.facebook.com/imbigofficial/), [LigBlou](https://www.facebook.com/ligblou), [Mesolitica](https://mesolitica.com/) and [KeyReply](https://www.keyreply.com/) for sponsoring AWS, Google and GPU clouds to train T5 for Bahasa. 
--- a/model_cards/mrm8488/RuPERTa-base-finetuned-ner/README.md
+++ b/model_cards/mrm8488/RuPERTa-base-finetuned-ner/README.md
@@ -0,0 +1,92 @@
+---
+language: spanish
+thumbnail:
+---
+
+# RuPERTa-base  (Spanish RoBERTa) + NER 🎃🏷
+
+This model is a version of [RuPERTa-base](https://huggingface.co/mrm8488/RuPERTa-base) fine-tuned on [NER-C](https://www.kaggle.com/nltkdata/conll-corpora) for the **NER** downstream task.
+
+## Details of the downstream task (NER) - Dataset
+
+- [Dataset:  CONLL Corpora ES](https://www.kaggle.com/nltkdata/conll-corpora) 📚
+
+| Dataset                | # Examples |
+| ---------------------- | ----- |
+| Train                  |  329 K |
+| Dev                    | 40 K |
+
+
+- [Fine-tune on NER script provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner.py)
+
+- Labels covered:
+
+```
+B-LOC
+B-MISC
+B-ORG
+B-PER
+I-LOC
+I-MISC
+I-ORG
+I-PER
+O
+```
+
+## Metrics on evaluation set 🧾
+
+|                                                      Metric                                                       |  # score  |
+| :------------------------------------------------------------------------------------: | :-------: |
+| F1                                       | **77.55** |
+| Precision                                | **75.53** | 
+| Recall                                   | **79.68** |    
+
+## Model in action 🔨
+
+
+Example of usage:
+
+```python
+import torch
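+from transformers import AutoModelForTokenClassification, AutoTokenizer
+
+# load the fine-tuned tokenizer and model (both are used below)
+tokenizer = AutoTokenizer.from_pretrained('mrm8488/RuPERTa-base-finetuned-ner')
+model = AutoModelForTokenClassification.from_pretrained('mrm8488/RuPERTa-base-finetuned-ner')
+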
+id2label = {
+    "0": "B-LOC",
+    "1": "B-MISC",
+    "2": "B-ORG",
+    "3": "B-PER",
+    "4": "I-LOC",
+    "5": "I-MISC",
+    "6": "I-ORG",
+    "7": "I-PER",
+    "8": "O"
+}
+
+text ="Julien, CEO de HF, nació en Francia."
+input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)
+
+outputs = model(input_ids)
+last_hidden_states = outputs[0]
+
+for m in last_hidden_states:
+  for index, n in enumerate(m):
+    if(index > 0 and index <= len(text.split(" "))):
+      print(text.split(" ")[index-1] + ": " + id2label[str(torch.argmax(n).item())])
+      
+'''
+Output:
+--------
+Julien,: I-PER
+CEO: O
+de: O
+HF,: B-ORG
+nació: I-PER
+en: I-PER
+Francia.: I-LOC
+'''
+```
+Yeah! Not too bad 🎉
+
+> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
+
+> Made with <span style="color: #e25555;">&hearts;</span> in Spain
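A note on the usage loop in the NER card above: it pairs the prediction at token position `index` with the whitespace-separated word at `index-1`, which only lines up when every word maps to exactly one token. Below is a minimal sketch of first-sub-token alignment, assuming a per-token list of word indices is available. The `word_ids` list (with `None` for special tokens), the function name, and the example values are all hypothetical; no model is run, and this is not the card author's method.

```python
from typing import Dict, List, Optional


def labels_per_word(
    words: List[str],
    word_ids: List[Optional[int]],
    label_ids: List[int],
    id2label: Dict[str, str],
) -> List[str]:
    """Label each word with the prediction of its first sub-token."""
    labels: List[Optional[str]] = [None] * len(words)
    for word_id, label_id in zip(word_ids, label_ids):
        if word_id is None:  # special token such as <s> or </s>
            continue
        if labels[word_id] is None:  # keep only the first sub-token's label
            labels[word_id] = id2label[str(label_id)]
    # Words that received no token fall back to "O"
    return [label if label is not None else "O" for label in labels]


# Hypothetical values for illustration only (not real model output)
id2label = {"3": "B-PER", "8": "O"}
words = ["Julien", "nació"]
word_ids = [None, 0, 0, 1, None]  # "Julien" is split into two sub-tokens
label_ids = [8, 3, 8, 8, 8]
print(labels_per_word(words, word_ids, label_ids, id2label))  # ['B-PER', 'O']
```

Taking the first sub-token is one common convention; averaging or voting over sub-tokens would be an equally reasonable choice.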
--- a/model_cards/mrm8488/RuPERTa-base-finetuned-pos/README.md
+++ b/model_cards/mrm8488/RuPERTa-base-finetuned-pos/README.md
@@ -0,0 +1,111 @@
+---
+language: spanish
+thumbnail:
+---
+
+# RuPERTa-base  (Spanish RoBERTa) + POS 🎃🏷
+
+This model is a version of [RuPERTa-base](https://huggingface.co/mrm8488/RuPERTa-base) fine-tuned on [CONLL CORPORA](https://www.kaggle.com/nltkdata/conll-corpora) for the **POS** downstream task.
+
+## Details of the downstream task (POS) - Dataset
+
+- [Dataset:  CONLL Corpora ES](https://www.kaggle.com/nltkdata/conll-corpora) 📚
+
+| Dataset                | # Examples |
+| ---------------------- | ----- |
+| Train                  | 445 K |
+| Dev                    | 55 K |
+
+- [Fine-tune on NER script provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner.py)
+
+- Labels covered:
+
+```
+ADJ
+ADP
+ADV
+AUX
+CCONJ
+DET
+INTJ
+NOUN
+NUM
+PART
+PRON
+PROPN
+PUNCT
+SCONJ
+SYM
+VERB
+```
+
+## Metrics on evaluation set 🧾
+
+|                                                      Metric                                                       |  # score  |
+| :------------------------------------------------------------------------------------: | :-------: |
+| F1                                       | **97.39** |
+| Precision                                | **97.47** | 
+| Recall                                   | **97.32** |    
+
+## Model in action 🔨
+
+
+Example of usage
+
+```python
+import torch
+from transformers import AutoModelForTokenClassification, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained('mrm8488/RuPERTa-base-finetuned-pos')
+model = AutoModelForTokenClassification.from_pretrained('mrm8488/RuPERTa-base-finetuned-pos')
+
+id2label = {
+    "0": "O",
+    "1": "ADJ",
+    "2": "ADP",
+    "3": "ADV",
+    "4": "AUX",
+    "5": "CCONJ",
+    "6": "DET",
+    "7": "INTJ",
+    "8": "NOUN",
+    "9": "NUM",
+    "10": "PART",
+    "11": "PRON",
+    "12": "PROPN",
+    "13": "PUNCT",
+    "14": "SCONJ",
+    "15": "SYM",
+    "16": "VERB"
+}
+
+text ="Mis amigos están pensando viajar a Londres este verano."
+input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)
+
+outputs = model(input_ids)
+last_hidden_states = outputs[0]
+
+for m in last_hidden_states:
+  for index, n in enumerate(m):
+    if(index > 0 and index <= len(text.split(" "))):
+      print(text.split(" ")[index-1] + ": " + id2label[str(torch.argmax(n).item())])
+      
+'''
+Output:
+--------
+Mis: NUM
+amigos: PRON
+están: AUX
+pensando: ADV
+viajar: VERB
+a: ADP
+Londres: PROPN
+este: DET
+verano.: NOUN
+'''
+```
+Yeah! Not too bad 🎉
+
+> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
+
+> Made with <span style="color: #e25555;">&hearts;</span> in Spain
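The metric tables in the two RuPERTa cards above can be sanity-checked with the usual definition of F1 as the harmonic mean of precision and recall. The sketch below uses the reported figures, taking the POS recall as 97.32; the function name is my own and this is only a consistency check, not the evaluation script used for the cards.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# POS card: precision 97.47, recall taken as 97.32
print(round(f1_score(97.47, 97.32), 2))  # 97.39
# NER card: precision 75.53, recall 79.68
print(round(f1_score(75.53, 79.68), 2))  # 77.55
```

Both results agree with the F1 values listed in the tables.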
--- a/model_cards/mrm8488/electricidad-small-discriminator/README.md
+++ b/model_cards/mrm8488/electricidad-small-discriminator/README.md
@@ -43,8 +43,8 @@ import torch
 discriminator = ElectraForPreTraining.from_pretrained("mrm8488/electricidad-small-discriminator")
 tokenizer = ElectraTokenizerFast.from_pretrained("mrm8488/electricidad-small-discriminator")

-sentence = "El rápido zorro marrón salta sobre el perro perezoso"
-fake_sentence = "El rápido zorro marrón falsea sobre el perro perezoso"
+sentence = "el zorro rojo es muy rápido"
+fake_sentence = "el zorro rojo es muy ser"

 fake_tokens = tokenizer.tokenize(sentence)
 fake_inputs = tokenizer.encode(sentence, return_tensors="pt")
@@ -53,9 +53,16 @@ predictions = torch.round((torch.sign(discriminator_outputs[0]) + 1) / 2)

 [print("%7s" % token, end="") for token in fake_tokens]

-[print("%7s" % prediction, end="") for prediction in predictions.tolist()]
+[print("%7s" % int(prediction), end="") for prediction in predictions.tolist()[1:-1]]
+
+# Output:
+'''
+el  zorro   rojo     es    muy    ser      0      0      0      0      0      1[None, None, None, None, None, None]
+'''
 ```

+As you can see there is a **1** in the place where the model detected the fake token (**ser**). So, it works! 🎉
+
 ## Acknowledgments

 I thank [🤗/transformers team](https://github.com/huggingface/transformers) for answering my doubts and Google for helping me with the [TensorFlow Research Cloud](https://www.tensorflow.org/tfrc) program.
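The `predictions` line in the electricidad card above turns each discriminator logit into 0 (original) or 1 (replaced) via `round((sign(x) + 1) / 2)`, and the new `[1:-1]` slice drops the first and last positions. A plain-Python sketch of the same arithmetic follows; the logits are made up for illustration (an assumption, not model output) and the helper names are mine.

```python
def sign(x: float) -> int:
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def to_flags(logits):
    """Map each logit to 0 (kept) or 1 (flagged) via round((sign + 1) / 2)."""
    return [round((sign(x) + 1) / 2) for x in logits]


# Hypothetical logits: one leading special token, six word tokens, one trailing special token
logits = [-3.2, -2.5, -1.9, -2.2, -0.7, -1.1, 4.8, -3.0]
flags = to_flags(logits)[1:-1]  # drop the first and last positions
print(flags)  # [0, 0, 0, 0, 0, 1]
```

A logit of exactly 0 gives 0.5, which Python's `round` takes to 0. The trailing `[None, None, ...]` in the card's sample output is the value of the list comprehension around `print`, as echoed by an interactive session.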
--- a/model_cards/mrm8488/gpt2-imdb-neg/README.md
+++ b/model_cards/mrm8488/gpt2-imdb-neg/README.md
@@ -20,6 +20,9 @@ A few examples of the model response to a query before and after optimisation:
 |I have watched 3 episodes |with this guy and he is such a talented actor...|	but the show is just plain awful and there ne...|	2.681171|	-4.512792|
 |We know that firefighters and|	police officers are forced to become populari...|	other chains have going to get this disaster ...|	1.367811|	-3.34017|

+## Training logs and metrics <img src="https://gblobscdn.gitbook.com/spaces%2F-Lqya5RvLedGEWPhtkjU%2Favatar.png?alt=media" width="25" height="25">
+View the full training logs and metrics on [W&B](https://app.wandb.ai/mrm8488/gpt2-sentiment-negative?workspace=user-mrm8488)
+


 > Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
--- a/model_cards/oliverguhr/german-sentiment-bert/README.md
+++ b/model_cards/oliverguhr/german-sentiment-bert/README.md
@@ -0,0 +1,125 @@
+# German Sentiment Classification with Bert
+
+This model was trained for sentiment classification of German-language texts. To achieve the best results, all model inputs need to be preprocessed with the same procedure that was applied during training. To simplify usage of the model, 
+we provide a Python package that bundles the code needed for preprocessing and inference. 
+
+The model uses Google's BERT architecture and was trained on 1.834 million German-language samples. The training data contains texts from various domains such as Twitter, Facebook, and movie, app and hotel reviews. 
+You can find more information about the dataset and the training process in the [paper](http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.201.pdf).
+
+## Using the Python package
+
+To get started, install the package from [PyPI](https://pypi.org/project/germansentiment/):
+
+```bash
+pip install germansentiment
+```
+
+```python
+from germansentiment import SentimentModel
+
+model = SentimentModel()
+
+texts = [
+    "Mit keinem guten Ergebniss","Das ist gar nicht mal so gut",
+    "Total awesome!","nicht so schlecht wie erwartet",
+    "Der Test verlief positiv.","Sie fährt ein grünes Auto."]
+       
+result = model.predict_sentiment(texts)
+print(result)
+```
+
+The code above will output the following list:
+
+```python
+["negative","negative","positive","positive","neutral", "neutral"]
+```
+
+## Minimal working sample
+
+
+```python
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+from typing import List
+import torch
+import re
+
+class SentimentModel():
+    def __init__(self, model_name: str):
+        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
+        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
+
+        self.clean_chars = re.compile(r'[^A-Za-züöäÖÜÄß ]', re.MULTILINE)
+        self.clean_http_urls = re.compile(r'https*\S+', re.MULTILINE)
+        self.clean_at_mentions = re.compile(r'@\S+', re.MULTILINE)
+
+    def predict_sentiment(self, texts: List[str])-> List[str]:
+        texts = [self.clean_text(text) for text in texts]
+        # Add special tokens takes care of adding [CLS], [SEP], <s>... tokens in the right way for each model.
+        input_ids = self.tokenizer.batch_encode_plus(texts,pad_to_max_length=True, add_special_tokens=True)
+        input_ids = torch.tensor(input_ids["input_ids"])
+
+        with torch.no_grad():
+            logits = self.model(input_ids)    
+
+        label_ids = torch.argmax(logits[0], axis=1)
+
+        labels = [self.model.config.id2label[label_id] for label_id in label_ids.tolist()]
+        return labels
+
+    def replace_numbers(self,text: str) -> str:
+            return text.replace("0"," null").replace("1"," eins").replace("2"," zwei").replace("3"," drei").replace("4"," vier").replace("5"," fünf").replace("6"," sechs").replace("7"," sieben").replace("8"," acht").replace("9"," neun")         
+
+    def clean_text(self,text: str)-> str:    
+            text = text.replace("\n", " ")        
+            text = self.clean_http_urls.sub('',text)
+            text = self.clean_at_mentions.sub('',text)        
+            text = self.replace_numbers(text)                
+            text = self.clean_chars.sub('', text) # use only text chars                          
+            text = ' '.join(text.split()) # substitute multiple whitespace with single whitespace   
+            text = text.strip().lower()
+            return text
+
+texts = ["Mit keinem guten Ergebniss","Das war unfair", "Das ist gar nicht mal so gut",
+        "Total awesome!","nicht so schlecht wie erwartet", "Das ist gar nicht mal so schlecht",
+        "Der Test verlief positiv.","Sie fährt ein grünes Auto.", "Der Fall wurde an die Polzei übergeben."]
+
+model = SentimentModel(model_name = "oliverguhr/german-sentiment-bert")
+
+print(model.predict_sentiment(texts))
+```
+
+## Model and Data
+
+If you are interested in the code and data that were used to train this model, please have a look at [this repository](https://github.com/oliverguhr/german-sentiment) and our [paper](http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.201.pdf). Here is a table of the F1 scores that this model achieves on the following datasets. Since we trained this model on a newer version of the transformers library, the results are slightly better than those reported in the paper.
+
+| Dataset                                                      | F1 micro Score |
+| :----------------------------------------------------------- | -------------: |
+| [holidaycheck](https://github.com/oliverguhr/german-sentiment) |         0.9568 |
+| [scare](https://www.romanklinger.de/scare/)                  |         0.9418 |
+| [filmstarts](https://github.com/oliverguhr/german-sentiment) |         0.9021 |
+| [germeval](https://sites.google.com/view/germeval2017-absa/home) |         0.7536 |
+| [PotTS](https://www.aclweb.org/anthology/L16-1181/)          |         0.6780 |
+| [emotions](https://github.com/oliverguhr/german-sentiment)  |         0.9649 |
+| [sb10k](https://www.spinningbytes.com/resources/germansentiment/) |         0.7376 |
+| [Leipzig Wikipedia Corpus 2016](https://wortschatz.uni-leipzig.de/de/download/german) |         0.9967 |
+| all                                                          |         0.9639 |
+
+## Cite
+
+For feedback and questions, contact me via mail or Twitter [@oliverguhr](https://twitter.com/oliverguhr). Please cite us if you found this useful:
+
+```
+@InProceedings{guhr-EtAl:2020:LREC,
+  author    = {Guhr, Oliver  and  Schumann, Anne-Kathrin  and  Bahrmann, Frank  and  Böhme, Hans Joachim},
+  title     = {Training a Broad-Coverage German Sentiment Classification Model for Dialog Systems},
+  booktitle      = {Proceedings of The 12th Language Resources and Evaluation Conference},
+  month          = {May},
+  year           = {2020},
+  address        = {Marseille, France},
+  publisher      = {European Language Resources Association},
+  pages     = {1620--1625},
+  url       = {https://www.aclweb.org/anthology/2020.lrec-1.201}
+}
+```
+
+
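To make the preprocessing in the German sentiment card's `clean_text` concrete, here is the same sequence of steps as a standalone standard-library function applied to a made-up input. The regular expressions and digit words are copied from the sample above; the constant names and the input string are my own, and no model is loaded.

```python
import re

CLEAN_CHARS = re.compile(r"[^A-Za-züöäÖÜÄß ]")
CLEAN_URLS = re.compile(r"https*\S+")
CLEAN_MENTIONS = re.compile(r"@\S+")
DIGIT_WORDS = {
    "0": " null", "1": " eins", "2": " zwei", "3": " drei", "4": " vier",
    "5": " fünf", "6": " sechs", "7": " sieben", "8": " acht", "9": " neun",
}


def clean_text(text: str) -> str:
    text = text.replace("\n", " ")
    text = CLEAN_URLS.sub("", text)        # drop links
    text = CLEAN_MENTIONS.sub("", text)    # drop @mentions
    for digit, word in DIGIT_WORDS.items():
        text = text.replace(digit, word)   # spell out digits
    text = CLEAN_CHARS.sub("", text)       # keep letters and spaces only
    return " ".join(text.split()).lower()  # collapse whitespace, lowercase


print(clean_text("@max Das Hotel war 2 mal super!\nhttps://example.com"))
# das hotel war zwei mal super
```

Note that punctuation is removed entirely, so inputs that differ only in punctuation reach the model as identical strings.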
--- a/model_cards/savasy/bert-base-turkish-sentiment-cased/README.md
+++ b/model_cards/savasy/bert-base-turkish-sentiment-cased/README.md
@@ -5,12 +5,12 @@ https://huggingface.co/savasy/bert-base-turkish-sentiment-cased
 This model is used for Sentiment Analysis, which is based on BERTurk for Turkish Language https://huggingface.co/dbmdz/bert-base-turkish-cased


-# Dataset
+## Dataset

-The dataset is taken from the studies [2] and [3] and merged.
+The dataset is taken from the studies [[2]](#paper-2) and [[3]](#paper-3), and merged.

 * The study [2] gathered movie and product reviews. The products are book, DVD, electronics, and kitchen.
-The movie dataset is taken from a cinema Web page (www.beyazperde.com) with
+The movie dataset is taken from a cinema Web page ([Beyazperde](https://www.beyazperde.com)) with
 5331 positive and 5331 negative sentences. Reviews in the Web page are marked in
 scale from 0 to 5 by the users who made the reviews. The study considered a review
 sentiment positive if the rating is equal to or bigger than 4, and negative if it is less
@@ -19,9 +19,9 @@ Web page. They constructed benchmark dataset consisting of reviews regarding som
 products (book, DVD, etc.). Likewise, reviews are marked in the range from 1 to 5,
 and majority class of reviews are 5. Each category has 700 positive and 700 negative
 reviews in which average rating of negative reviews is 2.27 and of positive reviews
-is 4.5. This dataset is also used the study [1]
+is 4.5. This dataset is also used by the study [[1]](#paper-1).

-* The study[3] collected tweet dataset. They proposed a new approach for automatically classifying the sentiment of microblog messages. The proposed approach is based on utilizing robust feature representation and fusion. 
+* The study [[3]](#paper-3) collected a tweet dataset. They proposed a new approach for automatically classifying the sentiment of microblog messages. The proposed approach is based on utilizing robust feature representation and fusion. 

 *Merged Dataset* 

@@ -32,20 +32,21 @@ is 4.5. This dataset is also used the study [1]
 |  32000 |train.tsv|
 |  *48290* |*total*|

+### The dataset is used by the following papers

-The dataset is used by following papers
- 
-* 1 Yildirim, Savaş. (2020). Comparing Deep Neural Networks to Traditional Models for Sentiment Analysis in Turkish Language. 10.1007/978-981-15-1216-2_12. 
-* 2 Demirtas, Erkin and Mykola Pechenizkiy. 2013. Cross-lingual polarity detection with machine translation. In Proceedings of the Second International Workshop on Issues of Sentiment
+<a id="paper-1">[1]</a> Yildirim, Savaş. (2020). Comparing Deep Neural Networks to Traditional Models for Sentiment Analysis in Turkish Language. 10.1007/978-981-15-1216-2_12. 
+
+<a id="paper-2">[2]</a> Demirtas, Erkin and Mykola Pechenizkiy. 2013. Cross-lingual polarity detection with machine translation. In Proceedings of the Second International Workshop on Issues of Sentiment
 Discovery and Opinion Mining (WISDOM ’13)
-* [3] Hayran, A.,   Sert, M. (2017), "Sentiment Analysis on Microblog Data based on Word Embedding and Fusion Techniques", IEEE 25th Signal Processing and Communications Applications Conference (SIU 2017), Belek, Turkey

-# Training
+<a id="paper-3">[3]</a> Hayran, A.,   Sert, M. (2017), "Sentiment Analysis on Microblog Data based on Word Embedding and Fusion Techniques", IEEE 25th Signal Processing and Communications Applications Conference (SIU 2017), Belek, Turkey

-```
+
+## Training
+
+```shell
 export GLUE_DIR="./sst-2-newall"
 export TASK_NAME=SST-2
- 

 python3 run_glue.py \
  --model_type bert \
@@ -59,88 +60,79 @@ python3 run_glue.py \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir "./model"
-
 ```


+## Results
+
+> 05/10/2020 17:00:43 - INFO - transformers.trainer -   \*\*\*\*\* Running Evaluation \*\*\*\*\*  
+> 05/10/2020 17:00:43 - INFO - transformers.trainer -     Num examples = 7999  
+> 05/10/2020 17:00:43 - INFO - transformers.trainer -     Batch size = 8  
+> Evaluation: 100% 1000/1000 [00:34<00:00, 29.04it/s]  
+> 05/10/2020 17:01:17 - INFO - \_\_main__ -   \*\*\*\*\* Eval results sst-2 \*\*\*\*\*  
+> 05/10/2020 17:01:17 - INFO - \_\_main__ -     acc = 0.9539942492811602  
+> 05/10/2020 17:01:17 - INFO - \_\_main__ -     loss = 0.16348013816401363
+
+Accuracy is about **95.4%**


-# Results
+## Code Usage

-> 05/10/2020 17:00:43 - INFO - transformers.trainer -   ***** Running Evaluation *****
-
-> 05/10/2020 17:00:43 - INFO - transformers.trainer -     Num examples = 7999
-
-> 05/10/2020 17:00:43 - INFO - transformers.trainer -     Batch size = 8
-
->Evaluation: 100% 1000/1000 [00:34<00:00, 29.04it/s]
-
->05/10/2020 17:01:17 - INFO - __main__ -   ***** Eval results sst-2 *****
-
->05/10/2020 17:01:17 - INFO - __main__ -     acc = 0.9539942492811602
-
->05/10/2020 17:01:17 - INFO - __main__ -     loss = 0.16348013816401363
-
-
-Accuracy is about *%95.4*
-# Code Usage
-
-```
+```python
 from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
+
 model = AutoModelForSequenceClassification.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
 tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
 sa= pipeline("sentiment-analysis", tokenizer=tokenizer, model=model)

-p= sa("bu telefon modelleri çok kaliteli , her parçası çok özel bence")
+p = sa("bu telefon modelleri çok kaliteli , her parçası çok özel bence")
 print(p)
-#[{'label': 'LABEL_1', 'score': 0.9871089}]
-print (p[0]['label']=='LABEL_1')
-#True
+# [{'label': 'LABEL_1', 'score': 0.9871089}]
+print(p[0]['label'] == 'LABEL_1')
+# True

-
-p= sa("Film çok kötü ve çok sahteydi")
+p = sa("Film çok kötü ve çok sahteydi")
 print(p)
-#[{'label': 'LABEL_0', 'score': 0.9975505}]
-print (p[0]['label']=='LABEL_1')
-#False
+# [{'label': 'LABEL_0', 'score': 0.9975505}]
+print(p[0]['label'] == 'LABEL_1')
+# False
 ```

-# Test your data
+
+## Test
+### Data

 Suppose your file has lots of lines of comment and label (1 or 0) at the end  (tab seperated)

-> comment1 ... \t label
-
-> comment2 ... \t label
- 
+> comment1 ... \t label  
+> comment2 ... \t label  
 > ...

+### Code

-
-```
+```python
 from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

-f="/path/to/your/file/yourfile.tsv"
 model = AutoModelForSequenceClassification.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
 tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
-sa= pipeline("sentiment-analysis", tokenizer=tokenizer, model=model)
+sa = pipeline("sentiment-analysis", tokenizer=tokenizer, model=model)

-i,crr=0,0
-for line in open(f):
- lines=line.strip().split("\t")
- if len(lines)==2:
-  i=i+1
-  if i%100==0:
-   print(i)
-  pred= sa(lines[0])
-  pred=pred[0]["label"].split("_")[1]
-  if pred== lines[1]:
-   crr=crr+1
+input_file = "/path/to/your/file/yourfile.tsv"
+
+i, crr = 0, 0
+for line in open(input_file):
+    lines = line.strip().split("\t")
+    if len(lines) == 2:
+        
+        i = i + 1
+        if i%100 == 0:
+            print(i)
+        
+        pred = sa(lines[0])
+        pred = pred[0]["label"].split("_")[1]
+        
+        if pred == lines[1]:
+            crr = crr + 1

 print(crr, i, crr/i)
 ```
-
-
-
-
-
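The evaluation loop in the Turkish sentiment card above can be exercised without downloading the model by swapping the pipeline for a stand-in. The sketch below assumes the same `comment<TAB>label` line format; `evaluate_tsv`, `stub_classifier` and the sample lines are hypothetical names and data introduced only for illustration.

```python
from typing import Callable, Iterable, Tuple


def evaluate_tsv(
    lines: Iterable[str], classify: Callable[[str], str]
) -> Tuple[int, int, float]:
    """Return (correct, total, accuracy) over 'comment<TAB>label' lines."""
    correct = total = 0
    for line in lines:
        fields = line.strip().split("\t")
        if len(fields) != 2:  # skip malformed lines, as the card's loop does
            continue
        comment, gold = fields
        total += 1
        # Pipeline labels look like "LABEL_1"; keep the part after "_"
        predicted = classify(comment).split("_")[1]
        if predicted == gold:
            correct += 1
    accuracy = correct / total if total else 0.0
    return correct, total, accuracy


def stub_classifier(comment: str) -> str:
    """Hypothetical stand-in for sa(comment)[0]["label"]."""
    return "LABEL_1" if "kaliteli" in comment else "LABEL_0"


sample = ["çok kaliteli\t1", "çok kötü\t0", "fena değil\t1", "bozuk satır"]
print(evaluate_tsv(sample, stub_classifier))
```

With the real pipeline, `classify` would be something like `lambda text: sa(text)[0]["label"]`. Unlike the card's final `crr/i`, this version returns 0.0 instead of raising when no valid line is found.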
--- a/model_cards/savasy/bert-base-turkish-squad/README.md
+++ b/model_cards/savasy/bert-base-turkish-squad/README.md
@@ -0,0 +1,67 @@
+---
+language: turkish
+---
+# Turkish SQuAD  Model : Question Answering
+
+I fine-tuned a Turkish BERT model for question answering on TQuAD, the Turkish version of SQuAD.
+* BERT-base: https://huggingface.co/dbmdz/bert-base-turkish-uncased
+* TQuAD dataset:  https://github.com/TQuad/turkish-nlp-qa-dataset
+
+
+# Training Code
+
+```
+!python3 run_squad.py \
+  --model_type bert \
+  --model_name_or_path dbmdz/bert-base-turkish-uncased \
+  --do_train \
+  --do_eval \
+  --train_file trainQ.json \
+  --predict_file dev1.json \
+  --per_gpu_train_batch_size 12 \
+  --learning_rate 3e-5 \
+  --num_train_epochs 5.0 \
+  --max_seq_length 384 \
+  --doc_stride 128 \
+  --output_dir "./model"
+```
+
+
+# Example Usage
+
+> Load Model
+```
+from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline
+import torch
+
+tokenizer = AutoTokenizer.from_pretrained("./model")
+model = AutoModelForQuestionAnswering.from_pretrained("./model")
+nlp=pipeline("question-answering", model=model, tokenizer=tokenizer)
+```
+
+> Apply the model
+```
+
+sait="ABASIYANIK, Sait Faik. Hikayeci (Adapazarı 23 Kasım 1906-İstanbul 11 Mayıs 1954). \
+İlk öğrenimine Adapazarı’nda Rehber-i Terakki Mektebi’nde başladı. İki yıl kadar Adapazarı İdadisi’nde okudu. \
+İstanbul Erkek Lisesi’nde devam ettiği orta öğrenimini Bursa Lisesi’nde tamamladı (1928). İstanbul Edebiyat \
+Fakültesi’ne iki yıl devam ettikten sonra babasının isteği üzerine iktisat öğrenimi için İsviçre’ye gitti. \
+Kısa süre sonra iktisat öğrenimini bırakarak Lozan’dan Grenoble’a geçti. Üç yıl başıboş bir edebiyat öğrenimi \
+gördükten sonra babası tarafından geri çağrıldı (1933). Bir müddet Halıcıoğlu Ermeni Yetim Mektebi'nde Türkçe \
+gurup dersleri öğretmenliği yaptı. Ticarete atıldıysa da tutunamadı. Bir ay Haber gazetesinde adliye muhabirliği \
+yaptı (1942). Babasının ölümü üzerine aileden kalan emlakin geliri ile avare bir hayata başladı. Evlenemedi. \
+Yazları Burgaz adasındaki köşklerinde, kışları Şişli’deki apartmanlarında annesi ile beraber geçen bu fazla \
+içkili bohem hayatı ömrünün sonuna kadar sürdü."
+
+print(nlp(question="Ne zaman avare bir hayata başladı?", context=sait))
+print(nlp(question="Sait Faik hangi Lisede orta öğrenimini tamamladı?", context=sait))
+
+```
+```
+# Try it yourself: type your own question
+print(nlp(question="...?", context=sait))
+```
+
+
+Check out my other models:
+https://huggingface.co/savasy
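Note on the training command in the model card above: `--max_seq_length 384` and `--doc_stride 128` govern how a context that is too long for the model is split into overlapping windows. The sketch below follows the windowing loop of the original BERT `run_squad.py`; the script in this repository may differ in detail. The function name `doc_spans` and the parameter `max_doc_len` (the room left for context tokens once the question and special tokens are accounted for) are illustrative assumptions, not part of the script's interface.

```python
def doc_spans(num_tokens, max_doc_len, doc_stride):
    """Return (start, length) windows that cover a context of num_tokens tokens.

    Each window holds at most max_doc_len tokens, and consecutive windows
    start doc_stride tokens apart, so neighbouring windows overlap.
    """
    spans = []
    start = 0
    while start < num_tokens:
        length = min(max_doc_len, num_tokens - start)
        spans.append((start, length))
        if start + length == num_tokens:
            break  # the last window reaches the end of the context
        start += min(length, doc_stride)
    return spans


# Hypothetical numbers: a 500-token context with room for 300 context tokens.
print(doc_spans(500, 300, 128))
```

A smaller stride produces more windows and more overlap, which raises the chance that an answer falls entirely inside at least one window, at the cost of more features to train and evaluate on.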
--- a/model_cards/seiya/oubiobert-base-uncased/README.md
+++ b/model_cards/seiya/oubiobert-base-uncased/README.md
@@ -0,0 +1,51 @@
+---
+tags:
+- exbert
+license: apache-2.0
+---
+
+# ouBioBERT-Base, Uncased
+
+Bidirectional Encoder Representations from Transformers for Biomedical Text Mining by Osaka University (ouBioBERT) is a language model based on the BERT-Base (Devlin, et al., 2019) architecture. We pre-trained ouBioBERT on PubMed abstracts from the PubMed baseline (ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline) via our method.  
+
+The details of the pre-training procedure can be found in Wada, et al. (2020).  
+
+## Evaluation
+
+We evaluated the performance of ouBioBERT in terms of the biomedical language understanding evaluation (BLUE) benchmark (Peng, et al., 2019). The numbers are mean (standard deviation) on five different random seeds.  
+
+
+| Dataset         |  Task Type                   |  Score       |
+|:----------------|:-----------------------------|-------------:|
+| MedSTS          |  Sentence similarity         |  84.9 (0.6)  |
+| BIOSSES         |  Sentence similarity         |  92.3 (0.8)  |
+| BC5CDR-disease  |  Named-entity recognition    |  87.4 (0.1)  |
+| BC5CDR-chemical |  Named-entity recognition    |  93.7 (0.2)  |
+| ShARe/CLEFE     |  Named-entity recognition    |  80.1 (0.4)  |
+| DDI             |  Relation extraction         |  81.1 (1.5)  |
+| ChemProt        |  Relation extraction         |  75.0 (0.3)  |
+| i2b2 2010       |  Relation extraction         |  74.0 (0.8)  |
+| HoC             |  Document classification     |  86.4 (0.5)  |
+| MedNLI          |  Inference                   |  83.6 (0.7)  |
+| **Total**       |  Macro average of the scores |**83.8 (0.3)**|
+
+
+## Code for Fine-tuning
+We made the source code for fine-tuning freely available at [our repository](https://github.com/sy-wada/blue_benchmark_with_transformers).
+
+## Citation
+
+If you use our work in your research, please kindly cite the following paper:  
+
+```bibtex
+@misc{2005.07202,
+Author = {Shoya Wada and Toshihiro Takeda and Shiro Manabe and Shozo Konishi and Jun Kamohara and Yasushi Matsumura},
+Title = {A pre-training technique to localize medical BERT and enhance BioBERT},
+Year = {2020},
+Eprint = {arXiv:2005.07202},
+}
+```
+
+<a href="https://huggingface.co/exbert/?model=seiya/oubiobert-base-uncased&sentence=Coronavirus%20disease%20(COVID-19)%20is%20caused%20by%20SARS-COV2%20and%20represents%20the%20causative%20agent%20of%20a%20potentially%20fatal%20disease%20that%20is%20of%20great%20global%20public%20health%20concern.">
+	<img width="300px" src="https://hf-dinosaur.huggingface.co/exbert/button.png">
+</a>
--- a/model_cards/valhalla/t5-base-squad/README.md
+++ b/model_cards/valhalla/t5-base-squad/README.md
@@ -0,0 +1,38 @@
+# T5 for question-answering
+This is a T5-base model fine-tuned on SQuAD1.1 for question answering using a text-to-text approach.
+
+## Model training
+This model was trained on a Colab TPU with 35 GB of RAM for 4 epochs.
+
+## Results:
+| Metric      | Value   |
+|-------------|---------|
+| Exact Match | 81.5610 |
+| F1          | 89.9601 |
+
+## Model in Action 🚀
+```
+from transformers import AutoModelWithLMHead, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("valhalla/t5-base-squad")
+model = AutoModelWithLMHead.from_pretrained("valhalla/t5-base-squad")
+
+def get_answer(question, context):
+  input_text = "question: %s  context: %s </s>" % (question, context)
+  features = tokenizer.batch_encode_plus([input_text], return_tensors='pt')
+
+  out = model.generate(input_ids=features['input_ids'], 
+               attention_mask=features['attention_mask'])
+  
+  return tokenizer.decode(out[0])
+
+context = "In Norse mythology, Valhalla is a majestic, enormous hall located in Asgard, ruled over by the god Odin."
+question = "What is Valhalla ?"
+
+get_answer(question, context)
+# output: 'a majestic, enormous hall located in Asgard, ruled over by the god Odin'
+```
+Play with this model [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1a5xpJiUjZybfU9Mi-aDkOp116PZ9-wni?usp=sharing)
+
+> Created by Suraj Patil [![Github icon](https://cdn0.iconfinder.com/data/icons/octicons/1024/mark-github-32.png)](https://github.com/patil-suraj/)
+[![Twitter icon](https://cdn0.iconfinder.com/data/icons/shift-logotypes/32/Twitter-32.png)](https://twitter.com/psuraj28)
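Note on the T5 model card above: the model receives a single text string, so the question and context have to be packed into one prompt. The sketch below isolates that formatting step from `get_answer`; the helper name `build_qa_input` is an assumption for illustration, while the template string (including the double space before `context:` and the trailing `</s>`) is copied from the card.

```python
def build_qa_input(question, context):
    """Pack a question and its context into the text-to-text prompt used in the card."""
    return "question: %s  context: %s </s>" % (question, context)


def build_qa_batch(pairs):
    """Format several (question, context) pairs, e.g. for a batched tokenizer call."""
    return [build_qa_input(question, context) for question, context in pairs]


prompt = build_qa_input("What is Valhalla ?", "Valhalla is a hall located in Asgard.")
print(prompt)
```

Keeping the formatting in its own function makes it easier to reuse exactly the same template at training and inference time, which matters because the model only ever saw inputs in this shape.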
--- a/notebooks/02-transformers.ipynb
+++ b/notebooks/02-transformers.ipynb
@@ -3,7 +3,6 @@
  {
   "cell_type": "markdown",
   "metadata": {
-    "collapsed": true,
    "pycharm": {
     "is_executing": false,
     "name": "#%% md\n"
@@ -77,7 +76,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 1,
+   "execution_count": null,
   "metadata": {
    "pycharm": {
     "is_executing": false,
@@ -85,77 +84,7 @@
    },
    "scrolled": true
   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Requirement already satisfied: transformers in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (2.5.1)\n",
-      "Requirement already satisfied: filelock in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from transformers) (3.0.12)\n",
-      "Requirement already satisfied: sentencepiece in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from transformers) (0.1.83)\n",
-      "Requirement already satisfied: boto3 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from transformers) (1.12.0)\n",
-      "Requirement already satisfied: requests in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from transformers) (2.22.0)\n",
-      "Requirement already satisfied: numpy in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from transformers) (1.18.1)\n",
-      "Requirement already satisfied: sacremoses in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from transformers) (0.0.35)\n",
-      "Requirement already satisfied: tokenizers==0.5.2 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from transformers) (0.5.2)\n",
-      "Requirement already satisfied: regex!=2019.12.17 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from transformers) (2020.1.8)\n",
-      "Requirement already satisfied: tqdm>=4.27 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from transformers) (4.42.1)\n",
-      "Requirement already satisfied: s3transfer<0.4.0,>=0.3.0 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from boto3->transformers) (0.3.3)\n",
-      "Requirement already satisfied: botocore<1.16.0,>=1.15.0 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from boto3->transformers) (1.15.0)\n",
-      "Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from boto3->transformers) (0.9.4)\n",
-      "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from requests->transformers) (2019.11.28)\n",
-      "Requirement already satisfied: idna<2.9,>=2.5 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from requests->transformers) (2.8)\n",
-      "Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from requests->transformers) (1.25.8)\n",
-      "Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from requests->transformers) (3.0.4)\n",
-      "Requirement already satisfied: joblib in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from sacremoses->transformers) (0.14.0)\n",
-      "Requirement already satisfied: click in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from sacremoses->transformers) (7.0)\n",
-      "Requirement already satisfied: six in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from sacremoses->transformers) (1.14.0)\n",
-      "Requirement already satisfied: docutils<0.16,>=0.10 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from botocore<1.16.0,>=1.15.0->boto3->transformers) (0.15.2)\n",
-      "Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from botocore<1.16.0,>=1.15.0->boto3->transformers) (2.8.1)\n",
-      "Requirement already satisfied: tensorflow==2.1.0 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (2.1.0)\n",
-      "Requirement already satisfied: termcolor>=1.1.0 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorflow==2.1.0) (1.1.0)\n",
-      "Requirement already satisfied: keras-preprocessing>=1.1.0 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorflow==2.1.0) (1.1.0)\n",
-      "Requirement already satisfied: opt-einsum>=2.3.2 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorflow==2.1.0) (3.1.0)\n",
-      "Requirement already satisfied: protobuf>=3.8.0 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorflow==2.1.0) (3.11.4)\n",
-      "Requirement already satisfied: numpy<2.0,>=1.16.0 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorflow==2.1.0) (1.18.1)\n",
-      "Requirement already satisfied: tensorboard<2.2.0,>=2.1.0 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorflow==2.1.0) (2.1.0)\n",
-      "Requirement already satisfied: keras-applications>=1.0.8 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorflow==2.1.0) (1.0.8)\n",
-      "Requirement already satisfied: wrapt>=1.11.1 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorflow==2.1.0) (1.11.2)\n",
-      "Requirement already satisfied: six>=1.12.0 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorflow==2.1.0) (1.14.0)\n",
-      "Requirement already satisfied: tensorflow-estimator<2.2.0,>=2.1.0rc0 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorflow==2.1.0) (2.1.0)\n",
-      "Requirement already satisfied: scipy==1.4.1; python_version >= \"3\" in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorflow==2.1.0) (1.4.1)\n",
-      "Requirement already satisfied: google-pasta>=0.1.6 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorflow==2.1.0) (0.1.8)\n",
-      "Requirement already satisfied: wheel>=0.26; python_version >= \"3\" in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorflow==2.1.0) (0.34.2)\n",
-      "Requirement already satisfied: grpcio>=1.8.6 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorflow==2.1.0) (1.16.1)\n",
-      "Requirement already satisfied: absl-py>=0.7.0 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorflow==2.1.0) (0.9.0)\n",
-      "Requirement already satisfied: gast==0.2.2 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorflow==2.1.0) (0.2.2)\n",
-      "Requirement already satisfied: astor>=0.6.0 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorflow==2.1.0) (0.8.0)\n",
-      "Requirement already satisfied: setuptools in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from protobuf>=3.8.0->tensorflow==2.1.0) (45.2.0.post20200210)\n",
-      "Requirement already satisfied: google-auth<2,>=1.6.3 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0) (1.11.2)\n",
-      "Requirement already satisfied: google-auth-oauthlib<0.5,>=0.4.1 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0) (0.4.1)\n",
-      "Requirement already satisfied: markdown>=2.6.8 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0) (3.1.1)\n",
-      "Requirement already satisfied: werkzeug>=0.11.15 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0) (1.0.0)\n",
-      "Requirement already satisfied: requests<3,>=2.21.0 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0) (2.22.0)\n",
-      "Requirement already satisfied: h5py in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from keras-applications>=1.0.8->tensorflow==2.1.0) (2.10.0)\n",
-      "Requirement already satisfied: rsa<4.1,>=3.1.4 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from google-auth<2,>=1.6.3->tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0) (4.0)\n",
-      "Requirement already satisfied: cachetools<5.0,>=2.0.0 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from google-auth<2,>=1.6.3->tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0) (4.0.0)\n",
-      "Requirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from google-auth<2,>=1.6.3->tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0) (0.2.8)\n",
-      "Requirement already satisfied: requests-oauthlib>=0.7.0 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0) (1.3.0)\n",
-      "Requirement already satisfied: idna<2.9,>=2.5 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from requests<3,>=2.21.0->tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0) (2.8)\n",
-      "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from requests<3,>=2.21.0->tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0) (2019.11.28)\n",
-      "Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from requests<3,>=2.21.0->tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0) (3.0.4)\n"
-     ]
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from requests<3,>=2.21.0->tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0) (1.25.8)\r\n",
-      "Requirement already satisfied: pyasn1>=0.1.3 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from rsa<4.1,>=3.1.4->google-auth<2,>=1.6.3->tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0) (0.4.8)\r\n",
-      "Requirement already satisfied: oauthlib>=3.0.0 in /usr/local/Caskroom/miniconda/base/envs/huggingface/lib/python3.7/site-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0) (3.1.0)\r\n"
-     ]
-    }
-   ],
+   "outputs": [],
   "source": [
    "!pip install transformers\n",
    "!pip install tensorflow==2.1.0"
@@ -174,7 +103,7 @@
    {
     "data": {
      "text/plain": [
-       "<torch.autograd.grad_mode.set_grad_enabled at 0x102c0ce10>"
+       "<torch.autograd.grad_mode.set_grad_enabled at 0x7f10b441e890>"
      ]
     },
     "execution_count": 2,
@@ -441,7 +370,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 8,
   "metadata": {
    "pycharm": {
     "is_executing": false
@@ -458,13 +387,22 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 9,
   "metadata": {
    "pycharm": {
     "is_executing": false
    }
   },
-   "outputs": [],
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "output differences: 1.6236e-05\n",
+      "pooled differences: -1.3039e-08\n"
+     ]
+    }
+   ],
   "source": [
    "# transformers generates a ready to use dictionary with all the required parameters for the specific framework.\n",
    "input_tf = tokenizer.encode_plus(\"This is a sample input\", return_tensors=\"tf\")\n",
@@ -476,7 +414,7 @@
    "# Models outputs 2 values (The value for each tokens, the pooled representation of the input sentence)\n",
    "# Here we compare the output differences between PyTorch and TensorFlow.\n",
    "for name, o_tf, o_pt in zip([\"output\", \"pooled\"], output_tf, output_pt):\n",
-    "    print(\"{} differences: {}\".format(name, (o_tf.numpy() - o_pt.numpy()).sum()))"
+    "    print(\"{} differences: {:.5}\".format(name, (o_tf.numpy() - o_pt.numpy()).sum()))"
   ]
  },
  {
@@ -504,13 +442,24 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 10,
   "metadata": {
    "pycharm": {
     "is_executing": false
    }
   },
-   "outputs": [],
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "CPU times: user 232 ms, sys: 0 ns, total: 232 ms\n",
+      "Wall time: 21.1 ms\n",
+      "CPU times: user 511 ms, sys: 0 ns, total: 511 ms\n",
+      "Wall time: 43.9 ms\n"
+     ]
+    }
+   ],
   "source": [
    "from transformers import DistilBertModel\n",
    "\n",
@@ -541,13 +490,25 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 11,
   "metadata": {
    "pycharm": {
     "is_executing": false
    }
   },
-   "outputs": [],
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Tokens (int)      : [102, 12272, 9355, 5746, 30881, 215, 261, 5945, 4118, 212, 2414, 153, 1942, 232, 3532, 566, 103]\n",
+      "Tokens (str)      : ['[CLS]', 'Hug', '##ging', 'Fac', '##e', 'ist', 'eine', 'französische', 'Firma', 'mit', 'Sitz', 'in', 'New', '-', 'York', '.', '[SEP]']\n",
+      "Tokens (attn_mask): [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]\n",
+      "\n",
+      "Token wise output: torch.Size([1, 7, 768]), Pooled output: torch.Size([1, 768])\n"
+     ]
+    }
+   ],
   "source": [
    "# Let's load German BERT from the Bavarian State Library\n",
    "de_bert = BertModel.from_pretrained(\"dbmdz/bert-base-german-cased\")\n",
@@ -557,7 +518,14 @@
    "    \"Hugging Face ist eine französische Firma mit Sitz in New-York.\",\n",
    "    return_tensors=\"pt\"\n",
    ")\n",
-    "output_de, pooled_de = de_bert(**de_input)"
+    "print(\"Tokens (int)      : {}\".format(de_input['input_ids'].tolist()[0]))\n",
+    "print(\"Tokens (str)      : {}\".format([de_tokenizer.convert_ids_to_tokens(s) for s in de_input['input_ids'].tolist()[0]]))\n",
+    "print(\"Tokens (attn_mask): {}\".format(de_input['attention_mask'].tolist()[0]))\n",
+    "print()\n",
+    "\n",
+    "output_de, pooled_de = de_bert(**de_input)\n",
+    "\n",
+    "print(\"Token wise output: {}, Pooled output: {}\".format(output_de.shape, pooled_de.shape))"
   ]
  }
 ],
@@ -577,7 +545,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.7.6"
+   "version": "3.7.4"
  },
  "pycharm": {
   "stem_cell": {
@@ -590,5 +558,5 @@
  }
 },
 "nbformat": 4,
- "nbformat_minor": 1
+ "nbformat_minor": 4
 }
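Note on the PyTorch/TensorFlow comparison cell in the notebook above: it reports the signed sum of element-wise differences, where positive and negative errors can cancel each other out. As a hedged illustration only, the sketch below contrasts that metric with the maximum absolute difference on plain Python lists; the function names and the sample values are made up and do not come from the notebook.

```python
def signed_sum_diff(a, b):
    """Sum of element-wise differences; errors of opposite sign cancel out."""
    return sum(x - y for x, y in zip(a, b))


def max_abs_diff(a, b):
    """Largest element-wise deviation; cannot be hidden by cancellation."""
    return max(abs(x - y) for x, y in zip(a, b))


# Made-up values: two outputs that disagree by 0.5 in two positions.
out_a = [1.0, 2.0, 3.0]
out_b = [1.5, 1.5, 3.0]

print("signed sum:", signed_sum_diff(out_a, out_b))
print("max abs   :", max_abs_diff(out_a, out_b))
```

A signed sum near zero is therefore necessary but not sufficient evidence that two implementations agree; a small maximum absolute difference is the stronger check.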
--- a/notebooks/04-onnx-export.ipynb
+++ b/notebooks/04-onnx-export.ipynb
--- a/notebooks/README.md
+++ b/notebooks/README.md
@@ -4,15 +4,25 @@ You can find here a list of the official notebooks provided by Hugging Face.

 Also, we would like to list here interesting content created by the community. 
 If you wrote some notebook(s) leveraging transformers and would like to be listed here, please open a 
-Pull Request and we'll review it so it can be included here. 
+Pull Request so it can be included under the Community notebooks. 


 ## Hugging Face's notebooks :hugs:

 | Notebook     |      Description      |   |
-|:----------|:-------------:|------:|
+|:----------|:-------------|------:|
 | [Getting Started Tokenizers](https://github.com/huggingface/transformers/blob/master/notebooks/01-training-tokenizers.ipynb)  | How to train and use your very own tokenizer  |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/01-training-tokenizers.ipynb) |
 | [Getting Started Transformers](https://github.com/huggingface/transformers/blob/master/notebooks/02-transformers.ipynb)   | How to easily start using transformers  | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/02-transformers.ipynb) |
 | [How to use Pipelines](https://github.com/huggingface/transformers/blob/master/notebooks/03-pipelines.ipynb)  | Simple and efficient way to use State-of-the-Art models on downstream tasks through transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/03-pipelines.ipynb) |
 | [How to train a language model](https://github.com/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)| Highlight all the steps to effectively train Transformer model on custom data | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)|
 | [How to generate text](https://github.com/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)| How to use different decoding methods for language generation with transformers | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)|
+| [How to export model to ONNX](https://github.com/huggingface/transformers/blob/master/notebooks/04-onnx-export.ipynb) | Highlight how to export and run inference workloads through ONNX | |
+
+
+## Community notebooks:
+
+| Notebook     |      Description      |      Author      |      |
+|:----------|:-------------|:-------------|------:|
+| [Train T5 on TPU](https://github.com/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb)  | How to train T5 on SQUAD with Transformers and Nlp | [Suraj Patil](https://github.com/patil-suraj) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb#scrollTo=QLGiFCDqvuil) |
+| [Fine-tune T5 for Classification and Multiple Choice](https://github.com/patil-suraj/exploring-T5/blob/master/t5_fine_tuning.ipynb)  | How to fine-tune T5 for classification and multiple choice tasks using a text-to-text format with PyTorch Lightning |  [Suraj Patil](https://github.com/patil-suraj) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/t5_fine_tuning.ipynb) |
+| [Fine-tune DialoGPT on New Datasets and Languages](https://github.com/ncoop57/i-am-a-nerd/blob/master/_notebooks/2020-05-12-chatbot-part-1.ipynb)  | How to fine-tune the DialoGPT model on a new dataset for open-dialog conversational chatbots |  [Nathan Cooper](https://github.com/ncoop57) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ncoop57/i-am-a-nerd/blob/master/_notebooks/2020-05-12-chatbot-part-1.ipynb) |
--- a/setup.cfg
+++ b/setup.cfg
@@ -36,5 +36,5 @@ multi_line_output = 3
 use_parentheses = True

 [flake8]
-ignore = E203, E501, W503
+ignore = E203, E501, E741, W503
 max-line-length = 119
--- a/setup.py
+++ b/setup.py
@@ -67,8 +67,18 @@ extras = {}

 extras["mecab"] = ["mecab-python3"]
 extras["sklearn"] = ["scikit-learn"]
-extras["tf"] = ["tensorflow"]
-extras["tf-cpu"] = ["tensorflow-cpu"]
+
+# keras2onnx and onnxconverter-common versions are specified through a commit until 1.7.0 lands on PyPI
+extras["tf"] = [
+    "tensorflow",
+    "onnxconverter-common",
+    "keras2onnx"
+]
+extras["tf-cpu"] = [
+    "tensorflow-cpu",
+    "onnxconverter-common",
+    "keras2onnx"
+]
 extras["torch"] = ["torch"]

 extras["serving"] = ["pydantic", "uvicorn", "fastapi", "starlette"]
@@ -79,14 +89,14 @@ extras["docs"] = ["recommonmark", "sphinx", "sphinx-markdown-tables", "sphinx-rt
 extras["quality"] = [
    "black",
    "isort",
-    "flake8==3.7.9",
+    "flake8",
 ]
 extras["dev"] = extras["testing"] + extras["quality"] + ["mecab-python3", "scikit-learn", "tensorflow", "torch"]

 setup(
    name="transformers",
-    version="2.9.1",
-    author="Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Sam Shleifer, Google AI Language Team Authors, Open AI team Authors, Facebook AI Authors, Carnegie Mellon University Authors",
+    version="2.10.0",
+    author="Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Sam Shleifer, Patrick von Platen, Google AI Language Team Authors, Open AI team Authors, Facebook AI Authors, Carnegie Mellon University Authors",
    author_email="thomas@huggingface.co",
    description="State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch",
    long_description=open("README.md", "r", encoding="utf-8").read(),
--- a/src/transformers/__init__.py
+++ b/src/transformers/__init__.py
@@ -2,7 +2,7 @@
 # There's no way to ignore "F401 '...' imported but unused" warnings in this
 # module, but to preserve other warnings. So, don't check this module at all.

-__version__ = "2.9.1"
+__version__ = "2.10.0"

 # Work around to update TensorFlow's absl.logging threshold which alters the
 # default Python logging output behavior when present.
@@ -44,6 +44,7 @@ from .configuration_electra import ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP, Electr
 from .configuration_encoder_decoder import EncoderDecoderConfig
 from .configuration_flaubert import FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, FlaubertConfig
 from .configuration_gpt2 import GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP, GPT2Config
+from .configuration_longformer import LONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, LongformerConfig
 from .configuration_marian import MarianConfig
 from .configuration_mmbt import MMBTConfig
 from .configuration_openai import OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP, OpenAIGPTConfig
@@ -138,6 +139,7 @@ from .tokenization_distilbert import DistilBertTokenizer, DistilBertTokenizerFas
 from .tokenization_electra import ElectraTokenizer, ElectraTokenizerFast
 from .tokenization_flaubert import FlaubertTokenizer
 from .tokenization_gpt2 import GPT2Tokenizer, GPT2TokenizerFast
+from .tokenization_longformer import LongformerTokenizer
 from .tokenization_openai import OpenAIGPTTokenizer, OpenAIGPTTokenizerFast
 from .tokenization_reformer import ReformerTokenizer
 from .tokenization_roberta import RobertaTokenizer, RobertaTokenizerFast
@@ -319,6 +321,7 @@ if is_torch_available():
        ElectraForMaskedLM,
        ElectraForTokenClassification,
        ElectraPreTrainedModel,
+        ElectraForSequenceClassification,
        ElectraModel,
        load_tf_weights_in_electra,
        ELECTRA_PRETRAINED_MODEL_ARCHIVE_MAP,
@@ -332,6 +335,8 @@ if is_torch_available():
        REFORMER_PRETRAINED_MODEL_ARCHIVE_MAP,
    )

+    from .modeling_longformer import LONGFORMER_PRETRAINED_MODEL_ARCHIVE_MAP, LongformerModel, LongformerForMaskedLM
+
    # Optimization
    from .optimization import (
        AdamW,
--- a/src/transformers/configuration_auto.py
+++ b/src/transformers/configuration_auto.py
@@ -28,6 +28,7 @@ from .configuration_electra import ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP, Electr
 from .configuration_encoder_decoder import EncoderDecoderConfig
 from .configuration_flaubert import FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, FlaubertConfig
 from .configuration_gpt2 import GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP, GPT2Config
+from .configuration_longformer import LONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, LongformerConfig
 from .configuration_marian import MarianConfig
 from .configuration_openai import OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP, OpenAIGPTConfig
 from .configuration_reformer import ReformerConfig
@@ -62,6 +63,7 @@ ALL_PRETRAINED_CONFIG_ARCHIVE_MAP = dict(
        XLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,
        FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
        ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP,
+        LONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP,
    ]
    for key, value, in pretrained_map.items()
 )
@@ -77,6 +79,7 @@ CONFIG_MAPPING = OrderedDict(
        ("marian", MarianConfig,),
        ("bart", BartConfig,),
        ("reformer", ReformerConfig,),
+        ("longformer", LongformerConfig,),
        ("roberta", RobertaConfig,),
        ("flaubert", FlaubertConfig,),
        ("bert", BertConfig,),
@@ -133,6 +136,7 @@ class AutoConfig:
            - contains `albert`: :class:`~transformers.AlbertConfig` (ALBERT model)
            - contains `camembert`: :class:`~transformers.CamembertConfig` (CamemBERT model)
            - contains `xlm-roberta`: :class:`~transformers.XLMRobertaConfig` (XLM-RoBERTa model)
+            - contains `longformer`: :class:`~transformers.LongformerConfig` (Longformer model)
            - contains `roberta`: :class:`~transformers.RobertaConfig` (RoBERTa model)
            - contains `reformer`: :class:`~transformers.ReformerConfig` (Reformer model)
            - contains `bert`: :class:`~transformers.BertConfig` (Bert model)
@@ -145,7 +149,6 @@ class AutoConfig:
            - contains `flaubert` : :class:`~transformers.FlaubertConfig` (Flaubert model)
            - contains `electra` : :class:`~transformers.ElectraConfig` (ELECTRA model)

-
        Args:
            pretrained_model_name_or_path (:obj:`string`):
                Is either: \
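Note on the `configuration_auto.py` hunks above: the `AutoConfig` docstring describes resolution by checking which pattern the model name contains, and `CONFIG_MAPPING` is an ordered mapping, so more specific patterns need to come before patterns that are substrings of them. The sketch below is a simplified stand-in for that idea, not the library's code: it maps patterns to class names as plain strings, and `PATTERN_TO_CONFIG` and `resolve_config_name` are hypothetical names.

```python
from collections import OrderedDict

# Order matters: "xlm-roberta" contains "roberta", and "roberta" contains "bert".
PATTERN_TO_CONFIG = OrderedDict(
    [
        ("xlm-roberta", "XLMRobertaConfig"),
        ("longformer", "LongformerConfig"),
        ("roberta", "RobertaConfig"),
        ("bert", "BertConfig"),
    ]
)


def resolve_config_name(model_name):
    """Return the first config whose pattern occurs in the model name."""
    for pattern, config_name in PATTERN_TO_CONFIG.items():
        if pattern in model_name:
            return config_name
    raise ValueError("Unrecognized model name: {}".format(model_name))


for name in ["xlm-roberta-base", "longformer-base-4096", "roberta-base", "bert-base-uncased"]:
    print(name, "->", resolve_config_name(name))
```

If `"bert"` were listed first, every name in the loop except `longformer-base-4096` would resolve to `BertConfig`, which is why new entries are inserted at a deliberate position rather than appended.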
--- a/src/transformers/configuration_longformer.py
+++ b/src/transformers/configuration_longformer.py
@@ -0,0 +1,69 @@
+# coding=utf-8
+# Copyright 2020 The Allen Institute for AI team and The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" Longformer configuration """
+
+import logging
+from typing import List, Union
+
+from .configuration_roberta import RobertaConfig
+
+
+logger = logging.getLogger(__name__)
+
+LONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP = {
+    "longformer-base-4096": "https://s3.amazonaws.com/models.huggingface.co/bert/allenai/longformer-base-4096/config.json",
+    "longformer-large-4096": "https://s3.amazonaws.com/models.huggingface.co/bert/allenai/longformer-large-4096/config.json",
+}
+
+
+class LongformerConfig(RobertaConfig):
+    r"""
+        This is the configuration class to store the configuration of an :class:`~transformers.LongformerModel`.
+        It is used to instantiate a Longformer model according to the specified arguments, defining the model
+        architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
+        the RoBERTa `roberta-base <https://huggingface.co/roberta-base>`__ architecture with a sequence length of 4,096.
+
+        The :class:`~transformers.LongformerConfig` class directly inherits :class:`~transformers.RobertaConfig`.
+        It reuses the same defaults. Please check the parent class for more information.
+
+        Args:
+            attention_window (:obj:`int` or :obj:`List[int]`, optional, defaults to 512):
+                Size of an attention window around each token. If :obj:`int`, use the same size for all layers.
+                To specify a different window size for each layer, use a :obj:`List[int]` where
+                ``len(attention_window) == num_hidden_layers``.
+
+        Example::
+
+            from transformers import LongformerConfig, LongformerModel
+
+            # Initializing a Longformer configuration
+            configuration = LongformerConfig()
+
+            # Initializing a model from the configuration
+            model = LongformerModel(configuration)
+
+            # Accessing the model configuration
+            configuration = model.config
+
+        Attributes:
+            pretrained_config_archive_map (Dict[str, str]):
+                A dictionary containing all the available pre-trained checkpoints.
+    """
+    pretrained_config_archive_map = LONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP
+    model_type = "longformer"
+
+    def __init__(self, attention_window: Union[List[int], int] = 512, **kwargs):
+        super().__init__(**kwargs)
+        self.attention_window = attention_window
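The `attention_window` argument above accepts either one `int` or one value per layer. A minimal sketch of how a consumer might normalize it, assuming a hypothetical helper `expand_attention_window` that is not part of this diff:

```python
from typing import List, Union


def expand_attention_window(attention_window: Union[int, List[int]], num_hidden_layers: int) -> List[int]:
    """Hypothetical helper (not in the diff): return one window size per layer."""
    if isinstance(attention_window, int):
        # A single int means every layer uses the same window size.
        return [attention_window] * num_hidden_layers
    if len(attention_window) != num_hidden_layers:
        # Mirrors the documented constraint len(attention_window) == num_hidden_layers.
        raise ValueError(
            "expected {} window sizes, got {}".format(num_hidden_layers, len(attention_window))
        )
    return list(attention_window)


per_layer = expand_attention_window(512, num_hidden_layers=12)
```

Whether the model code performs this expansion itself is not shown in this part of the diff; the helper only illustrates the documented contract.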
--- a/src/transformers/configuration_roberta.py
+++ b/src/transformers/configuration_roberta.py
@@ -68,6 +68,6 @@ class RobertaConfig(BertConfig):
    model_type = "roberta"

    def __init__(self, pad_token_id=1, bos_token_id=0, eos_token_id=2, **kwargs):
-        """Constructs FlaubertConfig.
+        """Constructs RobertaConfig.
        """
        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
--- a/src/transformers/configuration_t5.py
+++ b/src/transformers/configuration_t5.py
@@ -39,10 +39,10 @@ class T5Config(PretrainedConfig):

        Arguments:
            vocab_size_or_config_json_file: Vocabulary size of `inputs_ids` in `T5Model`.
-            hidden_size: Size of the encoder layers and the pooler layer.
-            num_hidden_layers: Number of hidden layers in the Transformer encoder.
-            num_attention_heads: Number of attention heads for each attention layer in
-                the Transformer encoder.
+            d_model: Size of the encoder layers and the pooler layer. `d_model` can also be accessed via the property `hidden_size`.
+            num_layers: Number of hidden layers in the Transformer encoder. `num_layers` can also be accessed via the property `num_hidden_layers`.
+            num_heads: Number of attention heads for each attention layer in
+                the Transformer encoder. `num_heads` can also be accessed via the property `num_attention_heads`.
            intermediate_size: The size of the "intermediate" (i.e., feed-forward)
                layer in the Transformer encoder.
            hidden_act: The non-linear activation function (function or string) in the
@@ -51,9 +51,9 @@ class T5Config(PretrainedConfig):
                layers in the embeddings, encoder, and pooler.
            attention_probs_dropout_prob: The dropout ratio for the attention
                probabilities.
-            max_position_embeddings: The maximum sequence length that this model might
+            n_positions: The maximum sequence length that this model might
                ever be used with. Typically set this to something large just in case
-                (e.g., 512 or 1024 or 2048).
+                (e.g., 512 or 1024 or 2048). `n_positions` can also be accessed via the property `max_position_embeddings`.
            type_vocab_size: The vocabulary size of the `token_type_ids` passed into
                `T5Model`.
            initializer_factor: A factor for initializing all weight matrices (should be kept to 1.0, used for initialization testing).
--- a/src/transformers/convert_graph_to_onnx.py
+++ b/src/transformers/convert_graph_to_onnx.py
@@ -0,0 +1,220 @@
+from argparse import ArgumentParser
+from itertools import takewhile
+from os import listdir, makedirs
+from os.path import abspath, dirname, exists
+from typing import Dict, List, Optional, Tuple
+
+from transformers import is_tf_available, is_torch_available
+from transformers.pipelines import Pipeline, pipeline
+from transformers.tokenization_utils import BatchEncoding
+
+
+class OnnxConverterArgumentParser(ArgumentParser):
+    """
+    Wraps all the script arguments supported to export transformers models to ONNX IR
+    """
+
+    def __init__(self):
+        super(OnnxConverterArgumentParser, self).__init__("ONNX Converter")
+
+        self.add_argument("--model", type=str, required=True, help="Model's id or path (ex: bert-base-cased)")
+        self.add_argument("--tokenizer", type=str, help="Tokenizer's id or path (ex: bert-base-cased)")
+        self.add_argument("--framework", type=str, choices=["pt", "tf"], help="Framework for loading the model")
+        self.add_argument("--opset", type=int, default=11, help="ONNX opset to use")
+        self.add_argument("--check-loading", action="store_true", help="Check ONNX is able to load the model")
+        self.add_argument("--use-external-format", action="store_true", help="Allow exporting models >= 2Gb")
+        self.add_argument("output")
+
+
+def ensure_valid_input(model, tokens, input_names):
+    """
+    Ensure inputs are presented in the correct order, without any None
+    Args:
+        model: The model used to forward the input data
+        tokens: BatchEncoding holding the input data
+        input_names: The name of the inputs
+
+    Returns: Tuple
+
+    """
+    model_args_name = model.forward.__code__.co_varnames
+    model_args_pos = [(model_args_name.index(name) - 1, name) for name in input_names]
+    model_args = [None] * (max(map(lambda x: x[0], model_args_pos)) + 1)
+
+    for arg_pos, arg_name in model_args_pos:
+        model_args[arg_pos] = tokens[arg_name]
+
+    model_args = tuple(model_args)  # Need to be ordered
+    return tuple(takewhile(lambda arg: arg is not None, model_args))
+
+
+def infer_shapes(nlp: Pipeline, framework: str) -> Tuple[List[str], List[str], Dict, BatchEncoding]:
+    def build_shape_dict(tensor, is_input: bool, seq_len: int):
+        if isinstance(tensor, (tuple, list)):
+            return [build_shape_dict(t, is_input, seq_len) for t in tensor]
+
+        else:
+            # Let's assume batch is the first axis with only 1 element (~~ might not be always true ...)
+            axes = {[axis for axis, numel in enumerate(tensor.shape) if numel == 1][0]: "batch"}
+            if is_input:
+                if len(tensor.shape) == 2:
+                    axes[1] = "sequence"
+                else:
+                    raise ValueError("Unable to infer tensor axes ({})".format(len(tensor.shape)))
+            else:
+                seq_axes = [dim for dim, shape in enumerate(tensor.shape) if shape == seq_len]
+                axes.update({dim: "sequence" for dim in seq_axes})
+
+        return axes
+
+    tokens = nlp.tokenizer.encode_plus("This is a sample output", return_tensors=framework)
+    seq_len = tokens.input_ids.shape[-1]
+    outputs = nlp.model(**tokens) if framework == "pt" else nlp.model(tokens)
+
+    if not isinstance(outputs, (list, tuple)):
+        outputs = (outputs,)
+
+    # Generate input names & axes
+    input_vars = list(tokens.keys())
+    input_dynamic_axes = {k: build_shape_dict(v, True, seq_len) for k, v in tokens.items()}
+
+    # flatten potentially grouped outputs (past for gpt2, attentions)
+    outputs_flat = []
+    for output in outputs:
+        if isinstance(output, (tuple, list)):
+            outputs_flat.extend(output)
+        else:
+            outputs_flat.append(output)
+
+    # Generate output names & axes
+    output_names = ["output_{}".format(i) for i in range(len(outputs_flat))]
+    output_dynamic_axes = {k: build_shape_dict(v, False, seq_len) for k, v in zip(output_names, outputs_flat)}
+
+    # Create the aggregated axes representation
+    dynamic_axes = dict(input_dynamic_axes, **output_dynamic_axes)
+    return input_vars, output_names, dynamic_axes, tokens
+
+
+def load_graph_from_args(framework: str, model: str, tokenizer: Optional[str] = None) -> Pipeline:
+    # If no tokenizer provided
+    if tokenizer is None:
+        tokenizer = model
+
+    print("Loading pipeline (model: {}, tokenizer: {})".format(model, tokenizer))
+
+    # Allocate tokenizer and model
+    return pipeline("feature-extraction", model=model, tokenizer=tokenizer, framework=framework)
+
+
+def convert_pytorch(nlp: Pipeline, opset: int, output: str, use_external_format: bool):
+    if not is_torch_available():
+        raise Exception("Cannot convert because PyTorch is not installed. Please install torch first.")
+
+    import torch
+    from torch.onnx import export
+
+    print("PyTorch: {}".format(torch.__version__))
+
+    with torch.no_grad():
+        input_names, output_names, dynamic_axes, tokens = infer_shapes(nlp, "pt")
+        model_args = ensure_valid_input(nlp.model, tokens, input_names)
+
+        export(
+            nlp.model,
+            model_args,
+            f=output,
+            input_names=input_names,
+            output_names=output_names,
+            dynamic_axes=dynamic_axes,
+            do_constant_folding=True,
+            use_external_data_format=use_external_format,
+            enable_onnx_checker=True,
+            opset_version=opset,
+        )
+
+
+def convert_tensorflow(nlp: Pipeline, opset: int, output: str):
+    if not is_tf_available():
+        raise Exception(
+            "Cannot convert because TensorFlow is not installed. Please install tensorflow first."
+        )
+
+    print("/!\\ Please note TensorFlow doesn't support exporting model > 2Gb /!\\")
+
+    try:
+        import tensorflow as tf
+        from keras2onnx import convert_keras, save_model, __version__ as k2ov
+
+        print("TensorFlow: {}, keras2onnx: {}".format(tf.version.VERSION, k2ov))
+
+        # Build
+        input_names, output_names, dynamic_axes, tokens = infer_shapes(nlp, "tf")
+
+        # Forward
+        nlp.model.predict(tokens.data)
+        onnx_model = convert_keras(nlp.model, nlp.model.name, target_opset=opset)
+        save_model(onnx_model, output)
+
+    except ImportError as e:
+        raise Exception(
+            "Cannot import {} required to convert TF model to ONNX. Please install {} first.".format(e.name, e.name)
+        )
+
+
+def convert(
+    framework: str,
+    model: str,
+    output: str,
+    opset: int,
+    tokenizer: Optional[str] = None,
+    use_external_format: bool = False,
+):
+    print("ONNX opset version set to: {}".format(opset))
+
+    # Load the pipeline
+    nlp = load_graph_from_args(framework, model, tokenizer)
+
+    parent = dirname(output)
+    if not exists(parent):
+        print("Creating folder {}".format(parent))
+        makedirs(parent)
+    elif len(listdir(parent)) > 0:
+        raise Exception("Folder {} is not empty, aborting conversion".format(parent))
+
+    # Export the graph
+    if framework == "pt":
+        convert_pytorch(nlp, opset, output, use_external_format)
+    else:
+        convert_tensorflow(nlp, opset, output)
+
+
+def verify(path: str):
+    from onnxruntime import InferenceSession, SessionOptions
+    from onnxruntime.capi.onnxruntime_pybind11_state import RuntimeException
+
+    print("Checking ONNX model loading from: {}".format(path))
+    try:
+        onnx_options = SessionOptions()
+        _ = InferenceSession(path, onnx_options, providers=["CPUExecutionProvider"])
+        print("Model correctly loaded")
+    except RuntimeException as re:
+        print("Error while loading the model: {}".format(re))
+
+
+if __name__ == "__main__":
+    parser = OnnxConverterArgumentParser()
+    args = parser.parse_args()
+
+    # Make sure output is absolute path
+    args.output = abspath(args.output)
+
+    try:
+        # Convert
+        convert(args.framework, args.model, args.output, args.opset, args.tokenizer, args.use_external_format)
+
+        # And verify
+        if args.check_loading:
+            verify(args.output)
+    except Exception as e:
+        print("Error while converting the model: {}".format(e))
+        exit(1)
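The ordering step in `ensure_valid_input` can be tried in isolation. The sketch below reuses the same logic with a stand-in `ToyModel` (an assumption for illustration, not a real transformers model) to show that positional arguments stop at the first gap in the `forward` signature:

```python
from itertools import takewhile


class ToyModel:
    """Stand-in for a model; only the forward signature matters here."""

    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None):
        return input_ids


def order_inputs(model, tokens, input_names):
    # Position of each input in forward(), minus one to skip `self`.
    arg_names = model.forward.__code__.co_varnames
    positions = [(arg_names.index(name) - 1, name) for name in input_names]
    ordered = [None] * (max(pos for pos, _ in positions) + 1)
    for pos, name in positions:
        ordered[pos] = tokens[name]
    # Keep only the leading run of provided arguments.
    return tuple(takewhile(lambda arg: arg is not None, ordered))


tokens = {"token_type_ids": [[0, 0]], "input_ids": [[101, 102]]}
ordered = order_inputs(ToyModel(), tokens, list(tokens.keys()))
```

Here `ordered` contains only `input_ids`: `attention_mask` is missing, so `takewhile` stops before `token_type_ids`. Inputs that follow a gap are dropped from the positional arguments passed to the export call.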
--- a/src/transformers/convert_marian_to_pytorch.py
+++ b/src/transformers/convert_marian_to_pytorch.py
@@ -226,7 +226,7 @@ def lmap(f, x) -> List:
 def fetch_test_set(test_set_url):
    import wget

-    fname = wget.download(test_set_url, f"opus_test.txt")
+    fname = wget.download(test_set_url, "opus_test.txt")
    lns = Path(fname).open().readlines()
    src = lmap(str.strip, lns[::4])
    gold = lmap(str.strip, lns[1::4])
--- a/src/transformers/data/datasets/glue.py
+++ b/src/transformers/data/datasets/glue.py
@@ -2,7 +2,8 @@ import logging
 import os
 import time
 from dataclasses import dataclass, field
-from typing import List, Optional
+from enum import Enum
+from typing import List, Optional, Union

 import torch
 from filelock import FileLock
@@ -47,6 +48,12 @@ class GlueDataTrainingArguments:
        self.task_name = self.task_name.lower()


+class Split(Enum):
+    train = "train"
+    dev = "dev"
+    test = "test"
+
+
 class GlueDataset(Dataset):
    """
    This will be superseded by a framework-agnostic approach
@@ -62,16 +69,21 @@ class GlueDataset(Dataset):
        args: GlueDataTrainingArguments,
        tokenizer: PreTrainedTokenizer,
        limit_length: Optional[int] = None,
-        evaluate=False,
+        mode: Union[str, Split] = Split.train,
    ):
        self.args = args
-        processor = glue_processors[args.task_name]()
+        self.processor = glue_processors[args.task_name]()
        self.output_mode = glue_output_modes[args.task_name]
+        if isinstance(mode, str):
+            try:
+                mode = Split[mode]
+            except KeyError:
+                raise KeyError("mode is not a valid split name")
        # Load data features from cache or dataset file
        cached_features_file = os.path.join(
            args.data_dir,
            "cached_{}_{}_{}_{}".format(
-                "dev" if evaluate else "train", tokenizer.__class__.__name__, str(args.max_seq_length), args.task_name,
+                mode.value, tokenizer.__class__.__name__, str(args.max_seq_length), args.task_name,
            ),
        )

@@ -88,7 +100,7 @@ class GlueDataset(Dataset):
                )
            else:
                logger.info(f"Creating features from dataset file at {args.data_dir}")
-                label_list = processor.get_labels()
+                label_list = self.processor.get_labels()
                if args.task_name in ["mnli", "mnli-mm"] and tokenizer.__class__ in (
                    RobertaTokenizer,
                    RobertaTokenizerFast,
@@ -96,11 +108,12 @@ class GlueDataset(Dataset):
                ):
                    # HACK(label indices are swapped in RoBERTa pretrained model)
                    label_list[1], label_list[2] = label_list[2], label_list[1]
-                examples = (
-                    processor.get_dev_examples(args.data_dir)
-                    if evaluate
-                    else processor.get_train_examples(args.data_dir)
-                )
+                if mode == Split.dev:
+                    examples = self.processor.get_dev_examples(args.data_dir)
+                elif mode == Split.test:
+                    examples = self.processor.get_test_examples(args.data_dir)
+                else:
+                    examples = self.processor.get_train_examples(args.data_dir)
                if limit_length is not None:
                    examples = examples[:limit_length]
                self.features = glue_convert_examples_to_features(
@@ -114,7 +127,7 @@ class GlueDataset(Dataset):
                torch.save(self.features, cached_features_file)
                # ^ This seems to take a lot of time so I want to investigate why and how we can improve.
                logger.info(
-                    f"Saving features into cached file %s [took %.3f s]", cached_features_file, time.time() - start
+                    "Saving features into cached file %s [took %.3f s]", cached_features_file, time.time() - start
                )

    def __len__(self):
@@ -122,3 +135,6 @@ class GlueDataset(Dataset):

    def __getitem__(self, i) -> InputFeatures:
        return self.features[i]
+
+    def get_labels(self):
+        return self.processor.get_labels()
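The new `mode` argument accepts either a `Split` member or its name as a string. A minimal standalone sketch of that lookup, with `resolve_split` as a hypothetical wrapper name around the logic added to `GlueDataset.__init__`:

```python
from enum import Enum
from typing import Union


class Split(Enum):
    train = "train"
    dev = "dev"
    test = "test"


def resolve_split(mode: Union[str, Split]) -> Split:
    """Hypothetical wrapper: accept a Split member or its name."""
    if isinstance(mode, str):
        try:
            # Split[...] looks a member up by name, not by value.
            return Split[mode]
        except KeyError:
            raise KeyError("mode is not a valid split name")
    return mode


dev_split = resolve_split("dev")
```

Because member names and values coincide in this enum, `mode.value` can be used directly in the cache file name.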
--- a/src/transformers/data/datasets/language_modeling.py
+++ b/src/transformers/data/datasets/language_modeling.py
@@ -4,10 +4,10 @@ import pickle
 import time

 import torch
+from filelock import FileLock
 from torch.utils.data.dataset import Dataset

 from ...tokenization_utils import PreTrainedTokenizer
-from ...trainer import torch_distributed_zero_first


 logger = logging.getLogger(__name__)
@@ -20,7 +20,7 @@ class TextDataset(Dataset):
    """

    def __init__(
-        self, tokenizer: PreTrainedTokenizer, file_path: str, block_size: int, overwrite_cache=False, local_rank=-1,
+        self, tokenizer: PreTrainedTokenizer, file_path: str, block_size: int, overwrite_cache=False,
    ):
        assert os.path.isfile(file_path)

@@ -31,9 +31,10 @@ class TextDataset(Dataset):
            directory, "cached_lm_{}_{}_{}".format(tokenizer.__class__.__name__, str(block_size), filename,),
        )

-        with torch_distributed_zero_first(local_rank):
-            # Make sure only the first process in distributed training processes the dataset,
-            # and the others will use the cache.
+        # Make sure only the first process in distributed training processes the dataset,
+        # and the others will use the cache.
+        lock_path = cached_features_file + ".lock"
+        with FileLock(lock_path):

            if os.path.exists(cached_features_file) and not overwrite_cache:
                start = time.time()
@@ -64,7 +65,7 @@ class TextDataset(Dataset):
                with open(cached_features_file, "wb") as handle:
                    pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL)
                logger.info(
-                    f"Saving features into cached file %s [took %.3f s]", cached_features_file, time.time() - start
+                    "Saving features into cached file %s [took %.3f s]", cached_features_file, time.time() - start
                )

    def __len__(self):
@@ -80,7 +81,7 @@ class LineByLineTextDataset(Dataset):
    soon.
    """

-    def __init__(self, tokenizer: PreTrainedTokenizer, file_path: str, block_size: int, local_rank=-1):
+    def __init__(self, tokenizer: PreTrainedTokenizer, file_path: str, block_size: int):
        assert os.path.isfile(file_path)
        # Here, we do not cache the features, operating under the assumption
        # that we will soon use fast multithreaded tokenizers from the
--- a/src/transformers/data/processors/glue.py
+++ b/src/transformers/data/processors/glue.py
@@ -126,7 +126,9 @@ def _glue_convert_examples_to_features(

    label_map = {label: i for i, label in enumerate(label_list)}

-    def label_from_example(example: InputExample) -> Union[int, float]:
+    def label_from_example(example: InputExample) -> Union[int, float, None]:
+        if example.label is None:
+            return None
        if output_mode == "classification":
            return label_map[example.label]
        elif output_mode == "regression":
@@ -180,12 +182,16 @@ class MrpcProcessor(DataProcessor):
        """See base class."""
        return self._create_examples(self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

+    def get_test_examples(self, data_dir):
+        """See base class."""
+        return self._create_examples(self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")
+
    def get_labels(self):
        """See base class."""
        return ["0", "1"]

    def _create_examples(self, lines, set_type):
-        """Creates examples for the training and dev sets."""
+        """Creates examples for the training, dev and test sets."""
        examples = []
        for (i, line) in enumerate(lines):
            if i == 0:
@@ -193,7 +199,7 @@ class MrpcProcessor(DataProcessor):
            guid = "%s-%s" % (set_type, i)
            text_a = line[3]
            text_b = line[4]
-            label = line[0]
+            label = None if set_type == "test" else line[0]
            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples

@@ -218,12 +224,16 @@ class MnliProcessor(DataProcessor):
        """See base class."""
        return self._create_examples(self._read_tsv(os.path.join(data_dir, "dev_matched.tsv")), "dev_matched")

+    def get_test_examples(self, data_dir):
+        """See base class."""
+        return self._create_examples(self._read_tsv(os.path.join(data_dir, "test_matched.tsv")), "test_matched")
+
    def get_labels(self):
        """See base class."""
        return ["contradiction", "entailment", "neutral"]

    def _create_examples(self, lines, set_type):
-        """Creates examples for the training and dev sets."""
+        """Creates examples for the training, dev and test sets."""
        examples = []
        for (i, line) in enumerate(lines):
            if i == 0:
@@ -231,7 +241,7 @@ class MnliProcessor(DataProcessor):
            guid = "%s-%s" % (set_type, line[0])
            text_a = line[8]
            text_b = line[9]
-            label = line[-1]
+            label = None if set_type.startswith("test") else line[-1]
            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples

@@ -241,7 +251,11 @@ class MnliMismatchedProcessor(MnliProcessor):

    def get_dev_examples(self, data_dir):
        """See base class."""
-        return self._create_examples(self._read_tsv(os.path.join(data_dir, "dev_mismatched.tsv")), "dev_matched")
+        return self._create_examples(self._read_tsv(os.path.join(data_dir, "dev_mismatched.tsv")), "dev_mismatched")
+
+    def get_test_examples(self, data_dir):
+        """See base class."""
+        return self._create_examples(self._read_tsv(os.path.join(data_dir, "test_mismatched.tsv")), "test_mismatched")


 class ColaProcessor(DataProcessor):
@@ -264,17 +278,25 @@ class ColaProcessor(DataProcessor):
        """See base class."""
        return self._create_examples(self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

+    def get_test_examples(self, data_dir):
+        """See base class."""
+        return self._create_examples(self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")
+
    def get_labels(self):
        """See base class."""
        return ["0", "1"]

    def _create_examples(self, lines, set_type):
-        """Creates examples for the training and dev sets."""
+        """Creates examples for the training, dev and test sets."""
+        test_mode = set_type == "test"
+        if test_mode:
+            lines = lines[1:]
+        text_index = 1 if test_mode else 3
        examples = []
        for (i, line) in enumerate(lines):
            guid = "%s-%s" % (set_type, i)
-            text_a = line[3]
-            label = line[1]
+            text_a = line[text_index]
+            label = None if test_mode else line[1]
            examples.append(InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
        return examples

@@ -299,19 +321,23 @@ class Sst2Processor(DataProcessor):
        """See base class."""
        return self._create_examples(self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

+    def get_test_examples(self, data_dir):
+        """See base class."""
+        return self._create_examples(self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")
+
    def get_labels(self):
        """See base class."""
        return ["0", "1"]

    def _create_examples(self, lines, set_type):
-        """Creates examples for the training and dev sets."""
+        """Creates examples for the training, dev and test sets."""
        examples = []
        for (i, line) in enumerate(lines):
            if i == 0:
                continue
            guid = "%s-%s" % (set_type, i)
            text_a = line[0]
-            label = line[1]
+            label = None if set_type == "test" else line[1]
            examples.append(InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
        return examples

@@ -336,12 +362,16 @@ class StsbProcessor(DataProcessor):
        """See base class."""
        return self._create_examples(self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

+    def get_test_examples(self, data_dir):
+        """See base class."""
+        return self._create_examples(self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")
+
    def get_labels(self):
        """See base class."""
        return [None]

    def _create_examples(self, lines, set_type):
-        """Creates examples for the training and dev sets."""
+        """Creates examples for the training, dev and test sets."""
        examples = []
        for (i, line) in enumerate(lines):
            if i == 0:
@@ -349,7 +379,7 @@ class StsbProcessor(DataProcessor):
            guid = "%s-%s" % (set_type, line[0])
            text_a = line[7]
            text_b = line[8]
-            label = line[-1]
+            label = None if set_type == "test" else line[-1]
            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples

@@ -374,21 +404,28 @@ class QqpProcessor(DataProcessor):
        """See base class."""
        return self._create_examples(self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

+    def get_test_examples(self, data_dir):
+        """See base class."""
+        return self._create_examples(self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")
+
    def get_labels(self):
        """See base class."""
        return ["0", "1"]

    def _create_examples(self, lines, set_type):
-        """Creates examples for the training and dev sets."""
+        """Creates examples for the training, dev and test sets."""
+        test_mode = set_type == "test"
+        q1_index = 1 if test_mode else 3
+        q2_index = 2 if test_mode else 4
        examples = []
        for (i, line) in enumerate(lines):
            if i == 0:
                continue
            guid = "%s-%s" % (set_type, line[0])
            try:
-                text_a = line[3]
-                text_b = line[4]
-                label = line[5]
+                text_a = line[q1_index]
+                text_b = line[q2_index]
+                label = None if test_mode else line[5]
            except IndexError:
                continue
            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
@@ -413,14 +450,18 @@ class QnliProcessor(DataProcessor):

    def get_dev_examples(self, data_dir):
        """See base class."""
-        return self._create_examples(self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev_matched")
+        return self._create_examples(self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
+
+    def get_test_examples(self, data_dir):
+        """See base class."""
+        return self._create_examples(self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")

    def get_labels(self):
        """See base class."""
        return ["entailment", "not_entailment"]

    def _create_examples(self, lines, set_type):
-        """Creates examples for the training and dev sets."""
+        """Creates examples for the training, dev and test sets."""
        examples = []
        for (i, line) in enumerate(lines):
            if i == 0:
@@ -428,7 +469,7 @@ class QnliProcessor(DataProcessor):
            guid = "%s-%s" % (set_type, line[0])
            text_a = line[1]
            text_b = line[2]
-            label = line[-1]
+            label = None if set_type == "test" else line[-1]
            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples

@@ -453,12 +494,16 @@ class RteProcessor(DataProcessor):
        """See base class."""
        return self._create_examples(self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

+    def get_test_examples(self, data_dir):
+        """See base class."""
+        return self._create_examples(self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")
+
    def get_labels(self):
        """See base class."""
        return ["entailment", "not_entailment"]

    def _create_examples(self, lines, set_type):
-        """Creates examples for the training and dev sets."""
+        """Creates examples for the training, dev and test sets."""
        examples = []
        for (i, line) in enumerate(lines):
            if i == 0:
@@ -466,7 +511,7 @@ class RteProcessor(DataProcessor):
            guid = "%s-%s" % (set_type, line[0])
            text_a = line[1]
            text_b = line[2]
-            label = line[-1]
+            label = None if set_type == "test" else line[-1]
            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples

@@ -491,12 +536,16 @@ class WnliProcessor(DataProcessor):
        """See base class."""
        return self._create_examples(self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

+    def get_test_examples(self, data_dir):
+        """See base class."""
+        return self._create_examples(self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")
+
    def get_labels(self):
        """See base class."""
        return ["0", "1"]

    def _create_examples(self, lines, set_type):
-        """Creates examples for the training and dev sets."""
+        """Creates examples for the training, dev and test sets."""
        examples = []
        for (i, line) in enumerate(lines):
            if i == 0:
@@ -504,7 +553,7 @@ class WnliProcessor(DataProcessor):
            guid = "%s-%s" % (set_type, line[0])
            text_a = line[1]
            text_b = line[2]
-            label = line[-1]
+            label = None if set_type == "test" else line[-1]
            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples
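With test sets now producing `label=None`, the label conversion has to pass `None` through before touching the label map. A small sketch of that behavior, written as a free function with explicit parameters (an assumption made so it runs on its own; the diff uses a closure over `output_mode` and `label_map`):

```python
from typing import Dict, Optional, Union


def label_from_example(
    label: Optional[str], output_mode: str, label_map: Dict[str, int]
) -> Union[int, float, None]:
    # Test examples carry no label; return None instead of failing on the lookup.
    if label is None:
        return None
    if output_mode == "classification":
        return label_map[label]
    if output_mode == "regression":
        return float(label)
    raise KeyError(output_mode)


label_map = {"0": 0, "1": 1}
converted = [label_from_example(lbl, "classification", label_map) for lbl in ["1", None, "0"]]
```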

--- a/src/transformers/data/processors/squad.py
+++ b/src/transformers/data/processors/squad.py
@@ -195,18 +195,22 @@ def squad_convert_example_to_features(example, max_seq_length, doc_stride, max_q
        cls_index = span["input_ids"].index(tokenizer.cls_token_id)

        # p_mask: mask with 1 for token than cannot be in the answer (0 for token which can be in an answer)
-        # Original TF implem also keep the classification token (set to 0) (not sure why...)
-        p_mask = np.array(span["token_type_ids"])
-
-        p_mask = np.minimum(p_mask, 1)
-
+        # The original TF implementation also keeps the classification token (set to 0)
+        p_mask = np.ones_like(span["token_type_ids"])
        if tokenizer.padding_side == "right":
-            # Limit positive values to one
-            p_mask = 1 - p_mask
+            p_mask[len(truncated_query) + sequence_added_tokens :] = 0
+        else:
+            p_mask[-len(span["tokens"]) : -(len(truncated_query) + sequence_added_tokens)] = 0

-        p_mask[np.where(np.array(span["input_ids"]) == tokenizer.sep_token_id)[0]] = 1
+        pad_token_indices = np.where(np.asarray(span["input_ids"]) == tokenizer.pad_token_id)
+        special_token_indices = np.asarray(
+            tokenizer.get_special_tokens_mask(span["input_ids"], already_has_special_tokens=True)
+        ).nonzero()

-        # Set the CLS index to '0'
+        p_mask[pad_token_indices] = 1
+        p_mask[special_token_indices] = 1
+
+        # Set the cls index to 0: the CLS index can be used for impossible answers
        p_mask[cls_index] = 0

        span_is_impossible = example.is_impossible
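Reviewer note on the `p_mask` rewrite above: the new logic starts from all ones, zeroes the context region, re-masks padding and special tokens, and finally unmasks the CLS position. The following is a minimal pure-Python sketch of the right-padded branch only. The function name `build_p_mask`, its parameters, and the token ids are illustrative assumptions; the real code operates on NumPy arrays and takes the special-token mask from `tokenizer.get_special_tokens_mask`.

```python
def build_p_mask(input_ids, special_mask, query_len, added_tokens, pad_id, cls_index):
    """Return a list with 1 for positions that cannot hold the answer, else 0."""
    # Start with everything masked out.
    p_mask = [1] * len(input_ids)
    # Right-padded layout assumed: [CLS] query [SEP] context [SEP] [PAD]...
    # Everything after the query and its added tokens is a candidate position.
    for i in range(query_len + added_tokens, len(input_ids)):
        p_mask[i] = 0
    # Padding and special tokens can never be part of an answer.
    for i, token_id in enumerate(input_ids):
        if token_id == pad_id or special_mask[i] == 1:
            p_mask[i] = 1
    # The CLS position stays available so it can signal an impossible answer.
    p_mask[cls_index] = 0
    return p_mask


# Illustrative ids: 101 = [CLS], 102 = [SEP], 0 = [PAD]; two query and three context tokens.
input_ids = [101, 7, 8, 102, 20, 21, 22, 102, 0, 0]
special_mask = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
p_mask = build_p_mask(input_ids, special_mask, query_len=2, added_tokens=2, pad_id=0, cls_index=0)
```

In this toy input only the three context tokens and the CLS position end up with 0.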
--- a/src/transformers/data/processors/utils.py
+++ b/src/transformers/data/processors/utils.py
@@ -98,6 +98,10 @@ class DataProcessor:
        """Gets a collection of `InputExample`s for the dev set."""
        raise NotImplementedError()

+    def get_test_examples(self, data_dir):
+        """Gets a collection of `InputExample`s for the test set."""
+        raise NotImplementedError()
+
    def get_labels(self):
        """Gets the list of labels for this data set."""
        raise NotImplementedError()
--- a/src/transformers/modeling_albert.py
+++ b/src/transformers/modeling_albert.py
@@ -550,7 +550,7 @@ class AlbertModel(AlbertPreTrainedModel):
            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)

        extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
-        extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype)  # fp16 compatibility
+        extended_attention_mask = extended_attention_mask.to(dtype=self.dtype)  # fp16 compatibility
        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)

--- a/src/transformers/modeling_auto.py
+++ b/src/transformers/modeling_auto.py
@@ -30,6 +30,7 @@ from .configuration_auto import (
    EncoderDecoderConfig,
    FlaubertConfig,
    GPT2Config,
+    LongformerConfig,
    OpenAIGPTConfig,
    ReformerConfig,
    RobertaConfig,
@@ -87,6 +88,7 @@ from .modeling_electra import (
    ELECTRA_PRETRAINED_MODEL_ARCHIVE_MAP,
    ElectraForMaskedLM,
    ElectraForPreTraining,
+    ElectraForSequenceClassification,
    ElectraForTokenClassification,
    ElectraModel,
 )
@@ -99,6 +101,7 @@ from .modeling_flaubert import (
    FlaubertWithLMHeadModel,
 )
 from .modeling_gpt2 import GPT2_PRETRAINED_MODEL_ARCHIVE_MAP, GPT2LMHeadModel, GPT2Model
+from .modeling_longformer import LONGFORMER_PRETRAINED_MODEL_ARCHIVE_MAP, LongformerForMaskedLM, LongformerModel
 from .modeling_marian import MarianMTModel
 from .modeling_openai import OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP, OpenAIGPTLMHeadModel, OpenAIGPTModel
 from .modeling_reformer import ReformerModel, ReformerModelWithLMHead
@@ -162,6 +165,7 @@ ALL_PRETRAINED_MODEL_ARCHIVE_MAP = dict(
        FLAUBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
        XLM_ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP,
        ELECTRA_PRETRAINED_MODEL_ARCHIVE_MAP,
+        LONGFORMER_PRETRAINED_MODEL_ARCHIVE_MAP,
    ]
    for key, value, in pretrained_map.items()
 )
@@ -174,6 +178,7 @@ MODEL_MAPPING = OrderedDict(
        (CamembertConfig, CamembertModel),
        (XLMRobertaConfig, XLMRobertaModel),
        (BartConfig, BartModel),
+        (LongformerConfig, LongformerModel),
        (RobertaConfig, RobertaModel),
        (BertConfig, BertModel),
        (OpenAIGPTConfig, OpenAIGPTModel),
@@ -196,6 +201,7 @@ MODEL_FOR_PRETRAINING_MAPPING = OrderedDict(
        (CamembertConfig, CamembertForMaskedLM),
        (XLMRobertaConfig, XLMRobertaForMaskedLM),
        (BartConfig, BartForConditionalGeneration),
+        (LongformerConfig, LongformerForMaskedLM),
        (RobertaConfig, RobertaForMaskedLM),
        (BertConfig, BertForPreTraining),
        (OpenAIGPTConfig, OpenAIGPTLMHeadModel),
@@ -218,6 +224,7 @@ MODEL_WITH_LM_HEAD_MAPPING = OrderedDict(
        (XLMRobertaConfig, XLMRobertaForMaskedLM),
        (MarianConfig, MarianMTModel),
        (BartConfig, BartForConditionalGeneration),
+        (LongformerConfig, LongformerForMaskedLM),
        (RobertaConfig, RobertaForMaskedLM),
        (BertConfig, BertForMaskedLM),
        (OpenAIGPTConfig, OpenAIGPTLMHeadModel),
@@ -245,6 +252,7 @@ MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING = OrderedDict(
        (XLNetConfig, XLNetForSequenceClassification),
        (FlaubertConfig, FlaubertForSequenceClassification),
        (XLMConfig, XLMForSequenceClassification),
+        (ElectraConfig, ElectraForSequenceClassification),
    ]
 )

@@ -313,6 +321,7 @@ class AutoModel:
                The model class to instantiate is selected based on the configuration class:

                - isInstance of `distilbert` configuration class: :class:`~transformers.DistilBertModel` (DistilBERT model)
+                - isInstance of `longformer` configuration class: :class:`~transformers.LongformerModel` (Longformer model)
                - isInstance of `roberta` configuration class: :class:`~transformers.RobertaModel` (RoBERTa model)
                - isInstance of `bert` configuration class: :class:`~transformers.BertModel` (Bert model)
                - isInstance of `openai-gpt` configuration class: :class:`~transformers.OpenAIGPTModel` (OpenAI GPT model)
@@ -355,6 +364,7 @@ class AutoModel:
            - contains `albert`: :class:`~transformers.AlbertModel` (ALBERT model)
            - contains `camembert`: :class:`~transformers.CamembertModel` (CamemBERT model)
            - contains `xlm-roberta`: :class:`~transformers.XLMRobertaModel` (XLM-RoBERTa model)
+            - contains `longformer`: :class:`~transformers.LongformerModel` (Longformer model)
            - contains `roberta`: :class:`~transformers.RobertaModel` (RoBERTa model)
            - contains `bert`: :class:`~transformers.BertModel` (Bert model)
            - contains `openai-gpt`: :class:`~transformers.OpenAIGPTModel` (OpenAI GPT model)
@@ -463,6 +473,7 @@ class AutoModelForPreTraining:
                The model class to instantiate is selected based on the configuration class:

                - isInstance of `distilbert` configuration class: :class:`~transformers.DistilBertForMaskedLM` (DistilBERT model)
+                - isInstance of `longformer` configuration class: :class:`~transformers.LongformerForMaskedLM` (Longformer model)
                - isInstance of `roberta` configuration class: :class:`~transformers.RobertaForMaskedLM` (RoBERTa model)
                - isInstance of `bert` configuration class: :class:`~transformers.BertForPreTraining` (Bert model)
                - isInstance of `openai-gpt` configuration class: :class:`~transformers.OpenAIGPTLMHeadModel` (OpenAI GPT model)
@@ -504,6 +515,7 @@ class AutoModelForPreTraining:
            - contains `albert`: :class:`~transformers.AlbertForMaskedLM` (ALBERT model)
            - contains `camembert`: :class:`~transformers.CamembertForMaskedLM` (CamemBERT model)
            - contains `xlm-roberta`: :class:`~transformers.XLMRobertaForMaskedLM` (XLM-RoBERTa model)
+            - contains `longformer`: :class:`~transformers.LongformerForMaskedLM` (Longformer model)
            - contains `roberta`: :class:`~transformers.RobertaForMaskedLM` (RoBERTa model)
            - contains `bert`: :class:`~transformers.BertForPreTraining` (Bert model)
            - contains `openai-gpt`: :class:`~transformers.OpenAIGPTLMHeadModel` (OpenAI GPT model)
@@ -606,6 +618,7 @@ class AutoModelWithLMHead:
                The model class to instantiate is selected based on the configuration class:

                - isInstance of `distilbert` configuration class: :class:`~transformers.DistilBertForMaskedLM` (DistilBERT model)
+                - isInstance of `longformer` configuration class: :class:`~transformers.LongformerForMaskedLM` (Longformer model)
                - isInstance of `roberta` configuration class: :class:`~transformers.RobertaForMaskedLM` (RoBERTa model)
                - isInstance of `bert` configuration class: :class:`~transformers.BertForMaskedLM` (Bert model)
                - isInstance of `openai-gpt` configuration class: :class:`~transformers.OpenAIGPTLMHeadModel` (OpenAI GPT model)
@@ -648,6 +661,7 @@ class AutoModelWithLMHead:
            - contains `albert`: :class:`~transformers.AlbertForMaskedLM` (ALBERT model)
            - contains `camembert`: :class:`~transformers.CamembertForMaskedLM` (CamemBERT model)
            - contains `xlm-roberta`: :class:`~transformers.XLMRobertaForMaskedLM` (XLM-RoBERTa model)
+            - contains `longformer`: :class:`~transformers.LongformerForMaskedLM` (Longformer model)
            - contains `roberta`: :class:`~transformers.RobertaForMaskedLM` (RoBERTa model)
            - contains `bert`: :class:`~transformers.BertForMaskedLM` (Bert model)
            - contains `openai-gpt`: :class:`~transformers.OpenAIGPTLMHeadModel` (OpenAI GPT model)
--- a/src/transformers/modeling_bert.py
+++ b/src/transformers/modeling_bert.py
@@ -703,9 +703,7 @@ class BertModel(BertPreTrainedModel):

        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
        # ourselves in which case we just need to make it broadcastable to all heads.
-        extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(
-            attention_mask, input_shape, self.device
-        )
+        extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape, device)

        # If a 2D ou 3D attention mask is provided for the cross-attention
        # we need to make broadcastabe to [batch_size, num_heads, seq_length, seq_length]
--- a/src/transformers/modeling_electra.py
+++ b/src/transformers/modeling_electra.py
@@ -3,6 +3,7 @@ import os

 import torch
 import torch.nn as nn
+from torch.nn import CrossEntropyLoss, MSELoss

 from .activations import get_activation
 from .configuration_electra import ElectraConfig
@@ -330,6 +331,112 @@ class ElectraModel(ElectraPreTrainedModel):
        return hidden_states


+class ElectraClassificationHead(nn.Module):
+    """Head for sentence-level classification tasks."""
+
+    def __init__(self, config):
+        super().__init__()
+        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
+        self.dropout = nn.Dropout(config.hidden_dropout_prob)
+        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)
+
+    def forward(self, features, **kwargs):
+        x = features[:, 0, :]  # take the first token ([CLS])
+        x = self.dropout(x)
+        x = self.dense(x)
+        x = get_activation("gelu")(x)  # although BERT uses tanh here, it seems Electra authors used gelu here
+        x = self.dropout(x)
+        x = self.out_proj(x)
+        return x
+
+
+@add_start_docstrings(
+    """ELECTRA Model transformer with a sequence classification/regression head on top (a linear layer on top of
+    the pooled output) e.g. for GLUE tasks. """,
+    ELECTRA_START_DOCSTRING,
+)
+class ElectraForSequenceClassification(ElectraPreTrainedModel):
+    def __init__(self, config):
+        super().__init__(config)
+        self.num_labels = config.num_labels
+        self.electra = ElectraModel(config)
+        self.classifier = ElectraClassificationHead(config)
+
+        self.init_weights()
+
+    @add_start_docstrings_to_callable(ELECTRA_INPUTS_DOCSTRING)
+    def forward(
+        self,
+        input_ids=None,
+        attention_mask=None,
+        token_type_ids=None,
+        position_ids=None,
+        head_mask=None,
+        inputs_embeds=None,
+        labels=None,
+    ):
+        r"""
+        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):
+            Labels for computing the sequence classification/regression loss.
+            Indices should be in :obj:`[0, ..., config.num_labels - 1]`.
+            If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),
+            If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
+
+    Returns:
+        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.ElectraConfig`) and inputs:
+        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided):
+            Classification (or regression if config.num_labels==1) loss.
+        logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`):
+            Classification (or regression if config.num_labels==1) scores (before SoftMax).
+        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):
+            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
+            of shape :obj:`(batch_size, sequence_length, hidden_size)`.
+
+            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
+        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):
+            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
+            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
+
+            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
+            heads.
+
+    Examples::
+
+        from transformers import ElectraTokenizer, ElectraForSequenceClassification
+        import torch
+
+        tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')
+        model = ElectraForSequenceClassification.from_pretrained('google/electra-small-discriminator')
+
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0)  # Batch size 1
+        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids, labels=labels)
+
+        loss, logits = outputs[:2]
+
+        """
+        discriminator_hidden_states = self.electra(
+            input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds
+        )
+
+        sequence_output = discriminator_hidden_states[0]
+        logits = self.classifier(sequence_output)
+
+        outputs = (logits,) + discriminator_hidden_states[1:]  # add hidden states and attention if they are here
+
+        if labels is not None:
+            if self.num_labels == 1:
+                #  We are doing regression
+                loss_fct = MSELoss()
+                loss = loss_fct(logits.view(-1), labels.view(-1))
+            else:
+                loss_fct = CrossEntropyLoss()
+                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
+            outputs = (loss,) + outputs
+
+        return outputs  # (loss), logits, (hidden_states), (attentions)
+
+
@add_start_docstrings(
    """
    Electra model with a binary classification head on top as used during pre-training for identifying generated
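Reviewer note on `ElectraForSequenceClassification.forward` above: the loss branch picks mean-squared error when `num_labels == 1` and cross-entropy otherwise. The sketch below reproduces that selection in plain Python so the two cases can be checked by hand. `sequence_loss` is a hypothetical helper, not part of the library, and it assumes the default mean reduction of `MSELoss` and `CrossEntropyLoss`.

```python
import math


def sequence_loss(logits, labels, num_labels):
    """Mean loss over a batch: MSE for regression, cross-entropy for classification."""
    if num_labels == 1:
        # Regression: each row of logits holds a single predicted value.
        return sum((row[0] - y) ** 2 for row, y in zip(logits, labels)) / len(labels)
    total = 0.0
    for row, y in zip(logits, labels):
        # Numerically stable log-sum-exp over the class scores.
        row_max = max(row)
        log_z = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        # Negative log-likelihood of the target class.
        total += log_z - row[y]
    return total / len(labels)


regression_loss = sequence_loss([[0.5], [1.5]], [1.0, 1.0], num_labels=1)
classification_loss = sequence_loss([[0.0, 0.0]], [0], num_labels=2)
```

With two equal class scores the classification loss is `log(2)`, which is a quick sanity check for an untrained head.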
--- a/src/transformers/modeling_longformer.py
+++ b/src/transformers/modeling_longformer.py
@@ -0,0 +1,709 @@
+# coding=utf-8
+# Copyright 2020 The Allen Institute for AI team and The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""PyTorch Longformer model. """
+
+import logging
+import math
+
+import torch
+import torch.nn as nn
+from torch.nn import CrossEntropyLoss
+from torch.nn import functional as F
+
+from .configuration_longformer import LongformerConfig
+from .file_utils import add_start_docstrings, add_start_docstrings_to_callable
+from .modeling_bert import BertPreTrainedModel
+from .modeling_roberta import RobertaLMHead, RobertaModel
+
+
+logger = logging.getLogger(__name__)
+
+LONGFORMER_PRETRAINED_MODEL_ARCHIVE_MAP = {
+    "longformer-base-4096": "https://s3.amazonaws.com/models.huggingface.co/bert/allenai/longformer-base-4096/pytorch_model.bin",
+    "longformer-large-4096": "https://s3.amazonaws.com/models.huggingface.co/bert/allenai/longformer-large-4096/pytorch_model.bin",
+}
+
+
+class LongformerSelfAttention(nn.Module):
+    def __init__(self, config, layer_id):
+        super().__init__()
+        if config.hidden_size % config.num_attention_heads != 0:
+            raise ValueError(
+                "The hidden size (%d) is not a multiple of the number of attention "
+                "heads (%d)" % (config.hidden_size, config.num_attention_heads)
+            )
+        self.output_attentions = config.output_attentions
+        self.num_heads = config.num_attention_heads
+        self.head_dim = int(config.hidden_size / config.num_attention_heads)
+        self.embed_dim = config.hidden_size
+
+        self.query = nn.Linear(config.hidden_size, self.embed_dim)
+        self.key = nn.Linear(config.hidden_size, self.embed_dim)
+        self.value = nn.Linear(config.hidden_size, self.embed_dim)
+
+        # separate projection layers for tokens with global attention
+        self.query_global = nn.Linear(config.hidden_size, self.embed_dim)
+        self.key_global = nn.Linear(config.hidden_size, self.embed_dim)
+        self.value_global = nn.Linear(config.hidden_size, self.embed_dim)
+
+        self.dropout = config.attention_probs_dropout_prob
+
+        self.layer_id = layer_id
+        attention_window = config.attention_window[self.layer_id]
+        assert (
+            attention_window % 2 == 0
+        ), f"`attention_window` for layer {self.layer_id} has to be an even value. Given {attention_window}"
+        assert (
+            attention_window > 0
+        ), f"`attention_window` for layer {self.layer_id} has to be positive. Given {attention_window}"
+
+        self.one_sided_attention_window_size = attention_window // 2
+
+    @staticmethod
+    def _skew(x, direction):
+        """Convert diagonals into columns (or columns into diagonals, depending on `direction`)."""
+        x_padded = F.pad(x, direction)  # padding value is not important because it will be overwritten
+        x_padded = x_padded.view(*x_padded.size()[:-2], x_padded.size(-1), x_padded.size(-2))
+        return x_padded
+
+    @staticmethod
+    def _skew2(x):
+        """shift every row 1 step to right converting columns into diagonals"""
+        # X = B x C x M x L
+        B, C, M, L = x.size()
+        x = F.pad(x, (0, M + 1))  # B x C x M x (L+M+1). Padding value is not important because it'll be overwritten
+        x = x.view(B, C, -1)  # B x C x ML+MM+M
+        x = x[:, :, :-M]  # B x C x ML+MM
+        x = x.view(B, C, M, M + L)  # B x C, M x L+M
+        x = x[:, :, :, :-1]
+        return x
+
+    @staticmethod
+    def _chunk(x, w):
+        """Convert into overlapping chunks. Chunk size = 2w, overlap size = w"""
+
+        # non-overlapping chunks of size = 2w
+        x = x.view(x.size(0), x.size(1) // (w * 2), w * 2, x.size(2))
+
+        # use `as_strided` to make the chunks overlap with an overlap size = w
+        chunk_size = list(x.size())
+        chunk_size[1] = chunk_size[1] * 2 - 1
+
+        chunk_stride = list(x.stride())
+        chunk_stride[1] = chunk_stride[1] // 2
+        return x.as_strided(size=chunk_size, stride=chunk_stride)
+
+    def _mask_invalid_locations(self, input_tensor, w) -> torch.Tensor:
+        affected_seqlen = w
+        beginning_mask_2d = input_tensor.new_ones(w, w + 1).tril().flip(dims=[0])
+        beginning_mask = beginning_mask_2d[None, :, None, :]
+        ending_mask = beginning_mask.flip(dims=(1, 3))
+        seqlen = input_tensor.size(1)
+        beginning_input = input_tensor[:, :affected_seqlen, :, : w + 1]
+        beginning_mask = beginning_mask[:, :seqlen].expand(beginning_input.size())
+        beginning_input.masked_fill_(beginning_mask == 1, -float("inf"))  # `== 1` converts to bool or uint8
+        ending_input = input_tensor[:, -affected_seqlen:, :, -(w + 1) :]
+        ending_mask = ending_mask[:, -seqlen:].expand(ending_input.size())
+        ending_input.masked_fill_(ending_mask == 1, -float("inf"))  # `== 1` converts to bool or uint8
+
+    def _sliding_chunks_matmul_qk(self, q: torch.Tensor, k: torch.Tensor, w: int):
+        """Matrix multiplication of query x key tensors using a sliding window attention pattern.
+        This implementation splits the input into overlapping chunks of size 2w (e.g. 512 for pretrained Longformer)
+        with an overlap of size w"""
+        batch_size, seqlen, num_heads, head_dim = q.size()
+        assert seqlen % (w * 2) == 0, f"Sequence length should be multiple of {w * 2}. Given {seqlen}"
+        assert q.size() == k.size()
+
+        chunks_count = seqlen // w - 1
+
+        # group batch_size and num_heads dimensions into one, then chunk seqlen into chunks of size w * 2
+        q = q.transpose(1, 2).reshape(batch_size * num_heads, seqlen, head_dim)
+        k = k.transpose(1, 2).reshape(batch_size * num_heads, seqlen, head_dim)
+
+        chunk_q = self._chunk(q, w)
+        chunk_k = self._chunk(k, w)
+
+        # matrix multiplication
+        # bcxd: batch_size * num_heads x chunks x 2w x head_dim
+        # bcyd: batch_size * num_heads x chunks x 2w x head_dim
+        # bcxy: batch_size * num_heads x chunks x 2w x 2w
+        chunk_attn = torch.einsum("bcxd,bcyd->bcxy", (chunk_q, chunk_k))  # multiply
+
+        # convert diagonals into columns
+        diagonal_chunk_attn = self._skew(chunk_attn, direction=(0, 0, 0, 1))
+
+        # allocate space for the overall attention matrix where the chunks are combined. The last dimension
+        # has (w * 2 + 1) columns. The first (w) columns are the w lower triangles (attention from a word to
+        # w previous words). The following column is attention score from each word to itself, then
+        # followed by w columns for the upper triangle.
+
+        diagonal_attn = diagonal_chunk_attn.new_empty((batch_size * num_heads, chunks_count + 1, w, w * 2 + 1))
+
+        # copy parts from diagonal_chunk_attn into the combined matrix of attentions
+        # - copying the main diagonal and the upper triangle
+        diagonal_attn[:, :-1, :, w:] = diagonal_chunk_attn[:, :, :w, : w + 1]
+        diagonal_attn[:, -1, :, w:] = diagonal_chunk_attn[:, -1, w:, : w + 1]
+        # - copying the lower triangle
+        diagonal_attn[:, 1:, :, :w] = diagonal_chunk_attn[:, :, -(w + 1) : -1, w + 1 :]
+        diagonal_attn[:, 0, 1:w, 1:w] = diagonal_chunk_attn[:, 0, : w - 1, 1 - w :]
+
+        # separate batch_size and num_heads dimensions again
+        diagonal_attn = diagonal_attn.view(batch_size, num_heads, seqlen, 2 * w + 1).transpose(2, 1)
+
+        self._mask_invalid_locations(diagonal_attn, w)
+        return diagonal_attn
+
+    def _sliding_chunks_matmul_pv(self, prob: torch.Tensor, v: torch.Tensor, w: int):
+        """Same as _sliding_chunks_matmul_qk but for prob and value tensors. It is expecting the same output
+        format from _sliding_chunks_matmul_qk"""
+        batch_size, seqlen, num_heads, head_dim = v.size()
+        assert seqlen % (w * 2) == 0
+        assert prob.size()[:3] == v.size()[:3]
+        assert prob.size(3) == 2 * w + 1
+        chunks_count = seqlen // w - 1
+        # group batch_size and num_heads dimensions into one, then chunk seqlen into chunks of size w
+        chunk_prob = prob.transpose(1, 2).reshape(batch_size * num_heads, seqlen // w, w, 2 * w + 1)
+
+        # group batch_size and num_heads dimensions into one
+        v = v.transpose(1, 2).reshape(batch_size * num_heads, seqlen, head_dim)
+
+        # pad seqlen with w at the beginning of the sequence and another w at the end
+        padded_v = F.pad(v, (0, 0, w, w), value=-1)
+
+        # chunk padded_v into chunks of size 3w and an overlap of size w
+        chunk_v_size = (batch_size * num_heads, chunks_count + 1, 3 * w, head_dim)
+        chunk_v_stride = padded_v.stride()
+        chunk_v_stride = chunk_v_stride[0], w * chunk_v_stride[1], chunk_v_stride[1], chunk_v_stride[2]
+        chunk_v = padded_v.as_strided(size=chunk_v_size, stride=chunk_v_stride)
+
+        skewed_prob = self._skew2(chunk_prob)
+
+        context = torch.einsum("bcwd,bcdh->bcwh", (skewed_prob, chunk_v))
+        return context.view(batch_size, num_heads, seqlen, head_dim).transpose(1, 2)
+
+    def forward(
+        self,
+        hidden_states,
+        attention_mask=None,
+        head_mask=None,
+        encoder_hidden_states=None,
+        encoder_attention_mask=None,
+    ):
+        """
+        LongformerSelfAttention expects `len(hidden_states)` to be a multiple of `attention_window`.
+        Padding to `attention_window` happens in LongformerModel.forward to avoid redoing the padding on each layer.
+
+        The `attention_mask` is changed in `BertModel.forward` from 0, 1, 2 to
+            -ve: no attention
+              0: local attention
+            +ve: global attention
+
+        `encoder_hidden_states` and `encoder_attention_mask` are not supported and should be None
+        """
+        # TODO: add support for `encoder_hidden_states` and `encoder_attention_mask`
+        assert encoder_hidden_states is None, "`encoder_hidden_states` is not supported and should be None"
+        assert encoder_attention_mask is None, "`encoder_attention_mask` is not supported and should be None"
+
+        if attention_mask is not None:
+            attention_mask = attention_mask.squeeze(dim=2).squeeze(dim=1)
+            key_padding_mask = attention_mask < 0
+            extra_attention_mask = attention_mask > 0
+            remove_from_windowed_attention_mask = attention_mask != 0
+
+            num_extra_indices_per_batch = extra_attention_mask.long().sum(dim=1)
+            max_num_extra_indices_per_batch = num_extra_indices_per_batch.max()
+            if max_num_extra_indices_per_batch <= 0:
+                extra_attention_mask = None
+            else:
+                # To support the case of variable number of global attention in the rows of a batch,
+                # we use the following three selection masks to select global attention embeddings
+                # in a 3d tensor and pad it to `max_num_extra_indices_per_batch`
+                # 1) selecting embeddings that correspond to global attention
+                extra_attention_mask_nonzeros = extra_attention_mask.nonzero(as_tuple=True)
+                zero_to_max_range = torch.arange(
+                    0, max_num_extra_indices_per_batch, device=num_extra_indices_per_batch.device
+                )
+                # mask indicating which values are actually going to be padding
+                selection_padding_mask = zero_to_max_range < num_extra_indices_per_batch.unsqueeze(dim=-1)
+                # 2) location of the non-padding values in the selected global attention
+                selection_padding_mask_nonzeros = selection_padding_mask.nonzero(as_tuple=True)
+                # 3) location of the padding values in the selected global attention
+                selection_padding_mask_zeros = (selection_padding_mask == 0).nonzero(as_tuple=True)
+        else:
+            remove_from_windowed_attention_mask = None
+            extra_attention_mask = None
+            key_padding_mask = None
+
+        hidden_states = hidden_states.transpose(0, 1)
+        seqlen, batch_size, embed_dim = hidden_states.size()
+        assert embed_dim == self.embed_dim
+        q = self.query(hidden_states)
+        k = self.key(hidden_states)
+        v = self.value(hidden_states)
+        q /= math.sqrt(self.head_dim)
+
+        q = q.view(seqlen, batch_size, self.num_heads, self.head_dim).transpose(0, 1)
+        k = k.view(seqlen, batch_size, self.num_heads, self.head_dim).transpose(0, 1)
+        # attn_weights = (batch_size, seqlen, num_heads, window*2+1)
+        attn_weights = self._sliding_chunks_matmul_qk(q, k, self.one_sided_attention_window_size)
+        self._mask_invalid_locations(attn_weights, self.one_sided_attention_window_size)
+        if remove_from_windowed_attention_mask is not None:
+            # This implementation is fast and takes very little memory because num_heads x hidden_size = 1
+            # from (batch_size x seqlen) to (batch_size x seqlen x num_heads x hidden_size)
+            remove_from_windowed_attention_mask = remove_from_windowed_attention_mask.unsqueeze(dim=-1).unsqueeze(
+                dim=-1
+            )
+            # cast to fp32/fp16 then replace 1's with a large negative value (-10000.0)
+            float_mask = remove_from_windowed_attention_mask.type_as(q).masked_fill(
+                remove_from_windowed_attention_mask, -10000.0
+            )
+            ones = float_mask.new_ones(size=float_mask.size())  # tensor of ones
+            # diagonal mask with zeros everywhere and large negative values in place of padding
+            d_mask = self._sliding_chunks_matmul_qk(ones, float_mask, self.one_sided_attention_window_size)
+            attn_weights += d_mask
+        assert list(attn_weights.size()) == [
+            batch_size,
+            seqlen,
+            self.num_heads,
+            self.one_sided_attention_window_size * 2 + 1,
+        ]
+
+        # the extra attention
+        if extra_attention_mask is not None:
+            selected_k = k.new_zeros(batch_size, max_num_extra_indices_per_batch, self.num_heads, self.head_dim)
+            selected_k[selection_padding_mask_nonzeros] = k[extra_attention_mask_nonzeros]
+            # (batch_size, seqlen, num_heads, max_num_extra_indices_per_batch)
+            selected_attn_weights = torch.einsum("blhd,bshd->blhs", (q, selected_k))
+            selected_attn_weights[selection_padding_mask_zeros[0], :, :, selection_padding_mask_zeros[1]] = -10000
+            # concat to attn_weights
+            # (batch_size, seqlen, num_heads, extra attention count + 2*window+1)
+            attn_weights = torch.cat((selected_attn_weights, attn_weights), dim=-1)
+
+        attn_weights_fp32 = F.softmax(attn_weights, dim=-1, dtype=torch.float32)  # use fp32 for numerical stability
+        attn_weights = attn_weights_fp32.type_as(attn_weights)
+
+        if key_padding_mask is not None:
+            # softmax sometimes inserts NaN if all positions are masked, replace them with 0
+            attn_weights = torch.masked_fill(attn_weights, key_padding_mask.unsqueeze(-1).unsqueeze(-1), 0.0)
+
+        attn_probs = F.dropout(attn_weights, p=self.dropout, training=self.training)
+        v = v.view(seqlen, batch_size, self.num_heads, self.head_dim).transpose(0, 1)
+        attn = None
+        if extra_attention_mask is not None:
+            selected_attn_probs = attn_probs.narrow(-1, 0, max_num_extra_indices_per_batch)
+            selected_v = v.new_zeros(batch_size, max_num_extra_indices_per_batch, self.num_heads, self.head_dim)
+            selected_v[selection_padding_mask_nonzeros] = v[extra_attention_mask_nonzeros]
+            # use `matmul` because `einsum` crashes sometimes with fp16
+            # attn = torch.einsum('blhs,bshd->blhd', (selected_attn_probs, selected_v))
+            attn = torch.matmul(
+                selected_attn_probs.transpose(1, 2), selected_v.transpose(1, 2).type_as(selected_attn_probs)
+            ).transpose(1, 2)
+            attn_probs = attn_probs.narrow(
+                -1, max_num_extra_indices_per_batch, attn_probs.size(-1) - max_num_extra_indices_per_batch
+            ).contiguous()
+        if attn is None:
+            attn = self._sliding_chunks_matmul_pv(attn_probs, v, self.one_sided_attention_window_size)
+        else:
+            attn += self._sliding_chunks_matmul_pv(attn_probs, v, self.one_sided_attention_window_size)
+
+        assert attn.size() == (batch_size, seqlen, self.num_heads, self.head_dim), "Unexpected size"
+        attn = attn.transpose(0, 1).reshape(seqlen, batch_size, embed_dim).contiguous()
+
+        # For this case, we'll just recompute the attention for these indices
+        # and overwrite the attn tensor.
+        # TODO: remove the redundant computation
+        if extra_attention_mask is not None:
+            selected_hidden_states = hidden_states.new_zeros(max_num_extra_indices_per_batch, batch_size, embed_dim)
+            selected_hidden_states[selection_padding_mask_nonzeros[::-1]] = hidden_states[
+                extra_attention_mask_nonzeros[::-1]
+            ]
+
+            q = self.query_global(selected_hidden_states)
+            k = self.key_global(hidden_states)
+            v = self.value_global(hidden_states)
+            q /= math.sqrt(self.head_dim)
+
+            q = (
+                q.contiguous()
+                .view(max_num_extra_indices_per_batch, batch_size * self.num_heads, self.head_dim)
+                .transpose(0, 1)
+            )  # (batch_size * self.num_heads, max_num_extra_indices_per_batch, head_dim)
+            k = (
+                k.contiguous().view(-1, batch_size * self.num_heads, self.head_dim).transpose(0, 1)
+            )  # (batch_size * self.num_heads, seqlen, head_dim)
+            v = (
+                v.contiguous().view(-1, batch_size * self.num_heads, self.head_dim).transpose(0, 1)
+            )  # (batch_size * self.num_heads, seqlen, head_dim)
+            attn_weights = torch.bmm(q, k.transpose(1, 2))
+            assert list(attn_weights.size()) == [batch_size * self.num_heads, max_num_extra_indices_per_batch, seqlen]
+
+            attn_weights = attn_weights.view(batch_size, self.num_heads, max_num_extra_indices_per_batch, seqlen)
+            attn_weights[selection_padding_mask_zeros[0], :, selection_padding_mask_zeros[1], :] = -10000.0
+            if key_padding_mask is not None:
+                attn_weights = attn_weights.masked_fill(key_padding_mask.unsqueeze(1).unsqueeze(2), -10000.0,)
+            attn_weights = attn_weights.view(batch_size * self.num_heads, max_num_extra_indices_per_batch, seqlen)
+            attn_weights_float = F.softmax(
+                attn_weights, dim=-1, dtype=torch.float32
+            )  # use fp32 for numerical stability
+            attn_probs = F.dropout(attn_weights_float.type_as(attn_weights), p=self.dropout, training=self.training)
+            selected_attn = torch.bmm(attn_probs, v)
+            assert list(selected_attn.size()) == [
+                batch_size * self.num_heads,
+                max_num_extra_indices_per_batch,
+                self.head_dim,
+            ]
+
+            selected_attn_4d = selected_attn.view(
+                batch_size, self.num_heads, max_num_extra_indices_per_batch, self.head_dim
+            )
+            nonzero_selected_attn = selected_attn_4d[
+                selection_padding_mask_nonzeros[0], :, selection_padding_mask_nonzeros[1]
+            ]
+            attn[extra_attention_mask_nonzeros[::-1]] = nonzero_selected_attn.view(
+                len(selection_padding_mask_nonzeros[0]), -1
+            ).type_as(hidden_states)
+
+        context_layer = attn.transpose(0, 1)
+        if self.output_attentions:
+            if extra_attention_mask is not None:
+                # With global attention, return global attention probabilities only
+                # batch_size x num_heads x max_num_global_attention_tokens x sequence_length
+                # which is the attention weights from tokens with global attention to all tokens
+                # It does not return local attention
+                # In case of a variable number of global attention tokens in the rows of a batch,
+                # attn_weights are padded with -10000.0 attention scores
+                attn_weights = attn_weights.view(batch_size, self.num_heads, max_num_extra_indices_per_batch, seqlen)
+            else:
+                # without global attention, return local attention probabilities
+                # batch_size x num_heads x sequence_length x window_size
+                # which is the attention weights of every token attending to its neighbours
+                attn_weights = attn_weights.permute(0, 2, 1, 3)
+        outputs = (context_layer, attn_weights) if self.output_attentions else (context_layer,)
+        return outputs
+
+
+LONGFORMER_START_DOCSTRING = r"""
+
+    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.
+    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general
+    usage and behavior.
+
+    Parameters:
+        config (:class:`~transformers.LongformerConfig`): Model configuration class with all the parameters of the
+            model. Initializing with a config file does not load the weights associated with the model, only the configuration.
+            Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model weights.
+"""
+
+LONGFORMER_INPUTS_DOCSTRING = r"""
+    Args:
+        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
+            Indices of input sequence tokens in the vocabulary.
+
+            Indices can be obtained using :class:`transformers.LongformerTokenizer`.
+            See :func:`transformers.PreTrainedTokenizer.encode` and
+            :func:`transformers.PreTrainedTokenizer.encode_plus` for details.
+
+            `What are input IDs? <../glossary.html#input-ids>`__
+        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
+            Mask that selects the kind of attention given to each token: local attention, global attention, or no attention (for padding tokens).
+            Tokens with global attention attend to all other tokens, and all other tokens attend to them. This is important for
+            task-specific finetuning because it makes the model more flexible at representing the task. For example,
+            for classification, the <s> token should be given global attention. For QA, all question tokens should also have
+            global attention. Please refer to the Longformer paper https://arxiv.org/abs/2004.05150 for more details.
+            Mask values selected in ``[0, 1, 2]``:
+            ``0`` for no attention (padding tokens),
+            ``1`` for local attention (a sliding window attention),
+            ``2`` for global attention (tokens that attend to all other tokens, and all other tokens attend to them).
+
+            `What are attention masks? <../glossary.html#attention-mask>`__
+        token_type_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
+            Segment token indices to indicate first and second portions of the inputs.
+            Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
+            corresponds to a `sentence B` token
+
+            `What are token type IDs? <../glossary.html#token-type-ids>`_
+        position_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
+            Indices of positions of each input sequence token in the position embeddings.
+            Selected in the range ``[0, config.max_position_embeddings - 1]``.
+
+            `What are position IDs? <../glossary.html#position-ids>`_
+        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):
+            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
+            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
+            than the model's internal embedding lookup matrix.
+"""
+
+
+@add_start_docstrings(
+    "The bare Longformer Model outputting raw hidden-states without any specific head on top.",
+    LONGFORMER_START_DOCSTRING,
+)
+class LongformerModel(RobertaModel):
+    """
+    This class overrides :class:`~transformers.RobertaModel` to provide the ability to process
+    long sequences following the self-attention approach described in `Longformer: the Long-Document Transformer`_ by
+    Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer self-attention combines local (sliding window)
+    and global attention to extend to long documents without the O(n^2) increase in memory and compute.
+
+    The self-attention module `LongformerSelfAttention` implemented here supports the combination of local and
+    global attention, but it lacks support for autoregressive attention and dilated attention. Autoregressive
+    and dilated attention are more relevant for autoregressive language modeling than for finetuning on downstream
+    tasks. A future release will add support for autoregressive attention, but support for dilated attention
+    requires a custom CUDA kernel to be memory and compute efficient.
+
+    .. _`Longformer: the Long-Document Transformer`:
+        https://arxiv.org/abs/2004.05150
+
+    """
+
+    config_class = LongformerConfig
+    pretrained_model_archive_map = LONGFORMER_PRETRAINED_MODEL_ARCHIVE_MAP
+    base_model_prefix = "longformer"
+
+    def __init__(self, config):
+        super().__init__(config)
+
+        if isinstance(config.attention_window, int):
+            assert config.attention_window % 2 == 0, "`config.attention_window` has to be an even value"
+            assert config.attention_window > 0, "`config.attention_window` has to be positive"
+            config.attention_window = [config.attention_window] * config.num_hidden_layers  # one value per layer
+        else:
+            assert len(config.attention_window) == config.num_hidden_layers, (
+                "`len(config.attention_window)` should equal `config.num_hidden_layers`. "
+                f"Expected {config.num_hidden_layers}, given {len(config.attention_window)}"
+            )
+
+        for i, layer in enumerate(self.encoder.layer):
+            # replace the `modeling_bert.BertSelfAttention` object with `LongformerSelfAttention`
+            layer.attention.self = LongformerSelfAttention(config, layer_id=i)
+
+        self.init_weights()
+
+    def _pad_to_window_size(
+        self,
+        input_ids: torch.Tensor,
+        attention_mask: torch.Tensor,
+        token_type_ids: torch.Tensor,
+        position_ids: torch.Tensor,
+        inputs_embeds: torch.Tensor,
+        attention_window: int,
+        pad_token_id: int,
+    ):
+        """A helper function to pad tokens and mask to work with implementation of Longformer selfattention."""
+
+        assert attention_window % 2 == 0, f"`attention_window` should be an even value. Given {attention_window}"
+        input_shape = input_ids.shape if input_ids is not None else inputs_embeds.shape
+        batch_size, seqlen = input_shape[:2]
+
+        padding_len = (attention_window - seqlen % attention_window) % attention_window
+        if padding_len > 0:
+            logger.info(
+                "Input ids are automatically padded from {} to {} to be a multiple of `config.attention_window`: {}".format(
+                    seqlen, seqlen + padding_len, attention_window
+                )
+            )
+            if input_ids is not None:
+                input_ids = F.pad(input_ids, (0, padding_len), value=pad_token_id)
+            if attention_mask is not None:
+                attention_mask = F.pad(
+                    attention_mask, (0, padding_len), value=False
+                )  # no attention on the padding tokens
+            if token_type_ids is not None:
+                token_type_ids = F.pad(token_type_ids, (0, padding_len), value=0)  # pad with token_type_id = 0
+            if position_ids is not None:
+                # pad with position_id = pad_token_id as in modeling_roberta.RobertaEmbeddings
+                position_ids = F.pad(position_ids, (0, padding_len), value=pad_token_id)
+            if inputs_embeds is not None:
+                input_ids_padding = inputs_embeds.new_full(
+                    (batch_size, padding_len), self.config.pad_token_id, dtype=torch.long,
+                )
+                inputs_embeds_padding = self.embeddings(input_ids_padding)
+                inputs_embeds = torch.cat([inputs_embeds, inputs_embeds_padding], dim=-2)
+
+        return padding_len, input_ids, attention_mask, token_type_ids, position_ids, inputs_embeds
+
+    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING)
+    def forward(
+        self,
+        input_ids=None,
+        attention_mask=None,
+        token_type_ids=None,
+        position_ids=None,
+        inputs_embeds=None,
+        masked_lm_labels=None,
+    ):
+        r"""
+
+    Returns:
+        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.LongformerConfig`) and inputs:
+        sequence_output (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):
+            Sequence of hidden-states at the output of the last layer of the model.
+        pooled_output (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, hidden_size)`):
+            Last layer hidden-state of the first token of the sequence, further processed by the pooler.
+        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):
+            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
+            of shape :obj:`(batch_size, sequence_length, hidden_size)`.
+
+            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
+        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):
+            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
+            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
+
+            Attention weights after the attention softmax, used to compute the weighted average in the self-attention
+            heads.
+
+    Examples::
+
+        import torch
+        from transformers import LongformerModel, LongformerTokenizer
+
+        model = LongformerModel.from_pretrained('longformer-base-4096')
+        tokenizer = LongformerTokenizer.from_pretrained('longformer-base-4096')
+
+        SAMPLE_TEXT = ' '.join(['Hello world! '] * 1000)  # long input document
+        input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0)  # batch of size 1
+
+        # Attention mask values -- 0: no attention, 1: local attention, 2: global attention
+        attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device) # initialize to local attention
+        attention_mask[:, [1, 4, 21,]] = 2  # Set global attention based on the task. For example,
+                                            # classification: the <s> token
+                                            # QA: question tokens
+                                            # LM: potentially on the beginning of sentences and paragraphs
+        sequence_output, pooled_output = model(input_ids, attention_mask=attention_mask)
+        """
+
+        # padding
+        attention_window = (
+            self.config.attention_window
+            if isinstance(self.config.attention_window, int)
+            else max(self.config.attention_window)
+        )
+        padding_len, input_ids, attention_mask, token_type_ids, position_ids, inputs_embeds = self._pad_to_window_size(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            token_type_ids=token_type_ids,
+            position_ids=position_ids,
+            inputs_embeds=inputs_embeds,
+            attention_window=attention_window,
+            pad_token_id=self.config.pad_token_id,
+        )
+
+        # embed
+        output = super().forward(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            token_type_ids=token_type_ids,
+            position_ids=position_ids,
+            head_mask=None,
+            inputs_embeds=inputs_embeds,
+            encoder_hidden_states=None,
+            encoder_attention_mask=None,
+        )
+
+        # undo padding
+        if padding_len > 0:
+            # `output` has the following tensors: sequence_output, pooled_output, (hidden_states), (attentions)
+            # `sequence_output`: unpad because the calling function is expecting a length == input_ids.size(1)
+            # `pooled_output`: independent of the sequence length
+            # `hidden_states`: mainly used for debugging and analysis, so keep the padding
+            # `attentions`: mainly used for debugging and analysis, so keep the padding
+            output = output[0][:, :-padding_len], *output[1:]
+
+        return output
+
+
+@add_start_docstrings("""Longformer Model with a `language modeling` head on top. """, LONGFORMER_START_DOCSTRING)
+class LongformerForMaskedLM(BertPreTrainedModel):
+    config_class = LongformerConfig
+    pretrained_model_archive_map = LONGFORMER_PRETRAINED_MODEL_ARCHIVE_MAP
+    base_model_prefix = "longformer"
+
+    def __init__(self, config):
+        super().__init__(config)
+
+        self.longformer = LongformerModel(config)
+        self.lm_head = RobertaLMHead(config)
+
+        self.init_weights()
+
+    @add_start_docstrings_to_callable(LONGFORMER_INPUTS_DOCSTRING)
+    def forward(
+        self,
+        input_ids=None,
+        attention_mask=None,
+        token_type_ids=None,
+        position_ids=None,
+        inputs_embeds=None,
+        masked_lm_labels=None,
+    ):
+        r"""
+        masked_lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
+            Labels for computing the masked language modeling loss.
+            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
+            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
+            in ``[0, ..., config.vocab_size]``
+
+    Returns:
+        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.LongformerConfig`) and inputs:
+        masked_lm_loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
+            Masked language modeling loss.
+        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)
+            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
+        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):
+            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
+            of shape :obj:`(batch_size, sequence_length, hidden_size)`.
+
+            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
+        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):
+            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
+            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
+
+            Attention weights after the attention softmax, used to compute the weighted average in the self-attention
+            heads.
+
+    Examples::
+
+        import torch
+        from transformers import LongformerForMaskedLM, LongformerTokenizer
+
+        model = LongformerForMaskedLM.from_pretrained('longformer-base-4096')
+        tokenizer = LongformerTokenizer.from_pretrained('longformer-base-4096')
+
+        SAMPLE_TEXT = ' '.join(['Hello world! '] * 1000)  # long input document
+        input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0)  # batch of size 1
+
+        attention_mask = None  # default is local attention everywhere, which is a good choice for MaskedLM
+                               # check ``LongformerModel.forward`` for more details on how to set `attention_mask`
+        loss, prediction_scores = model(input_ids, attention_mask=attention_mask, masked_lm_labels=input_ids)
+        """
+
+        outputs = self.longformer(
+            input_ids,
+            attention_mask=attention_mask,
+            token_type_ids=token_type_ids,
+            position_ids=position_ids,
+            inputs_embeds=inputs_embeds,
+        )
+        sequence_output = outputs[0]
+        prediction_scores = self.lm_head(sequence_output)
+
+        outputs = (prediction_scores,) + outputs[2:]  # Add hidden states and attention if they are here
+
+        if masked_lm_labels is not None:
+            loss_fct = CrossEntropyLoss()
+            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))
+            outputs = (masked_lm_loss,) + outputs
+
+        return outputs  # (masked_lm_loss), prediction_scores, (hidden_states), (attentions)
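The padding arithmetic in `_pad_to_window_size` and the `0`/`1`/`2` attention mask convention from the docstring can be sketched in plain Python, without PyTorch. The helper name `pad_to_window` and the sample token ids are hypothetical and exist only for this illustration; this is a sketch of the idea for a single sequence, not the code added in this diff.

```python
from typing import List, Tuple

NO_ATTENTION, LOCAL_ATTENTION, GLOBAL_ATTENTION = 0, 1, 2  # mask values from the docstring


def pad_to_window(
    input_ids: List[int], attention_mask: List[int], attention_window: int, pad_token_id: int
) -> Tuple[int, List[int], List[int]]:
    """Hypothetical list-based analogue of `_pad_to_window_size` for one sequence."""
    if attention_window <= 0 or attention_window % 2 != 0:
        raise ValueError("attention_window must be a positive even value")
    seqlen = len(input_ids)
    # Same formula as in the diff: 0 when seqlen is already a multiple of the window.
    padding_len = (attention_window - seqlen % attention_window) % attention_window
    padded_ids = input_ids + [pad_token_id] * padding_len
    padded_mask = attention_mask + [NO_ATTENTION] * padding_len  # no attention on padding
    return padding_len, padded_ids, padded_mask


input_ids = [0, 11, 12, 13, 14, 15, 2]  # made-up ids, 7 tokens
attention_mask = [LOCAL_ATTENTION] * len(input_ids)
attention_mask[0] = GLOBAL_ATTENTION  # e.g. the <s> token for classification

padding_len, padded_ids, padded_mask = pad_to_window(input_ids, attention_mask, 4, pad_token_id=1)
# Undo the padding, mirroring the slice applied to `sequence_output` in `forward`.
unpadded_ids = padded_ids[:-padding_len] if padding_len > 0 else padded_ids
```

The `padding_len > 0` guard matters when undoing the padding: slicing with `[:-0]` would return an empty sequence rather than the full one.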
--- a/src/transformers/modeling_t5.py
+++ b/src/transformers/modeling_t5.py
@@ -149,8 +149,12 @@ class T5LayerNorm(nn.Module):
        self.variance_epsilon = eps

    def forward(self, x):
-        variance = x.pow(2).mean(-1, keepdim=True)
+        # layer norm should always be calculated in float32
+        variance = x.to(torch.float32).pow(2).mean(-1, keepdim=True)
        x = x / torch.sqrt(variance + self.variance_epsilon)
+
+        if self.weight.dtype == torch.float16:
+            x = x.to(torch.float16)
        return self.weight * x


@@ -691,14 +695,16 @@ class T5Stack(T5PreTrainedModel):
            attention_mask = torch.ones(batch_size, mask_seq_length).to(inputs_embeds.device)
        if self.is_decoder and encoder_attention_mask is None and encoder_hidden_states is not None:
            encoder_seq_length = encoder_hidden_states.shape[1]
-            encoder_attention_mask = torch.ones(batch_size, encoder_seq_length).to(inputs_embeds.device)
+            encoder_attention_mask = torch.ones(
+                batch_size, encoder_seq_length, device=inputs_embeds.device, dtype=torch.long
+            )

        # initialize past_key_value_states with `None` if past does not exist
        if past_key_value_states is None:
            past_key_value_states = [None] * len(self.block)

        # ourselves in which case we just need to make it broadcastable to all heads.
-        extended_attention_mask = self.get_extended_attention_mask(attention_mask, input_shape, self.device)
+        extended_attention_mask = self.get_extended_attention_mask(attention_mask, input_shape, inputs_embeds.device)

        if self.is_decoder and encoder_attention_mask is not None:
            encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)
@@ -733,6 +739,7 @@ class T5Stack(T5PreTrainedModel):
            # layer_outputs is a tuple with:
            # hidden-states, key-value-states, (self-attention weights), (self-attention position bias), (cross-attention weights), (cross-attention position bias)
            hidden_states, present_key_value_state = layer_outputs[:2]
+
            if i == 0:
                # We share the position biases between the layers - the first layer store them
                # layer_outputs = hidden-states, key-value-states (self-attention weights), (self-attention position bias), (cross-attention weights), (cross-attention position bias)
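The `T5LayerNorm` change above upcasts to float32 before squaring. One plausible motivation is that float16 tops out at 65504, so squaring an activation of only a few hundred already exceeds the representable range. The sketch below uses plain Python floats rather than torch tensors to show the normalization formula from `forward`; `rms_norm`, `square_overflows_fp16`, `FP16_MAX` and the default `eps` value are illustrative assumptions, not part of the library.

```python
import math
from typing import List

FP16_MAX = 65504.0  # largest finite float16 value


def rms_norm(x: List[float], weight: List[float], eps: float = 1e-6) -> List[float]:
    """Scale x by the reciprocal root of its mean square (no mean subtraction)."""
    variance = sum(v * v for v in x) / len(x)  # mean of squares over the last dimension
    scale = 1.0 / math.sqrt(variance + eps)
    return [w * v * scale for w, v in zip(weight, x)]


def square_overflows_fp16(value: float) -> bool:
    """True if value**2 exceeds the largest finite float16 value."""
    return value * value > FP16_MAX


normalized = rms_norm([3.0, 4.0], weight=[1.0, 1.0], eps=0.0)
```

With this check, `square_overflows_fp16(300.0)` is true while `square_overflows_fp16(200.0)` is false, which gives a rough sense of how small an activation can be and still overflow when squared in half precision.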
--- a/src/transformers/modeling_tf_t5.py
+++ b/src/transformers/modeling_tf_t5.py
@@ -537,7 +537,7 @@ class TFT5MainLayer(tf.keras.layers.Layer):

    def call(
        self,
-        input_ids,
+        inputs,
        attention_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
@@ -548,19 +548,19 @@ class TFT5MainLayer(tf.keras.layers.Layer):
        training=False,
    ):

-        if input_ids is not None and inputs_embeds is not None:
-            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
-        elif input_ids is not None:
-            input_shape = shape_list(input_ids)
-            input_ids = tf.reshape(input_ids, (-1, input_shape[-1]))
+        if inputs is not None and inputs_embeds is not None:
+            raise ValueError("You cannot specify both inputs and inputs_embeds at the same time")
+        elif inputs is not None:
+            input_shape = shape_list(inputs)
+            inputs = tf.reshape(inputs, (-1, input_shape[-1]))
        elif inputs_embeds is not None:
            input_shape = shape_list(inputs_embeds)[:-1]
        else:
-            raise ValueError("You have to specify either input_ids or inputs_embeds")
+            raise ValueError("You have to specify either inputs or inputs_embeds")

        if inputs_embeds is None:
            assert self.embed_tokens is not None, "You have to intialize the model with valid token embeddings"
-            inputs_embeds = self.embed_tokens(input_ids)
+            inputs_embeds = self.embed_tokens(inputs)

        batch_size, seq_length = input_shape

@@ -725,11 +725,11 @@ class TFT5PreTrainedModel(TFPreTrainedModel):

    @property
    def dummy_inputs(self):
-        input_ids = tf.constant(DUMMY_INPUTS)
+        inputs = tf.constant(DUMMY_INPUTS)
        input_mask = tf.constant(DUMMY_MASK)
        dummy_inputs = {
-            "inputs": input_ids,
-            "decoder_input_ids": input_ids,
+            "inputs": inputs,
+            "decoder_input_ids": inputs,
            "decoder_attention_mask": input_mask,
        }
        return dummy_inputs
@@ -759,11 +759,11 @@ T5_START_DOCSTRING = r"""    The T5 model was proposed in

        If you choose this second option, there are three possibilities you can use to gather all the input Tensors in the first positional argument :

-        - a single Tensor with input_ids only and nothing else: `model(inputs_ids)
+        - a single Tensor with inputs only and nothing else: `model(inputs)`
        - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:
-            `model([input_ids, attention_mask])` or `model([input_ids, attention_mask, token_type_ids])`
+            `model([inputs, attention_mask])` or `model([inputs, attention_mask, token_type_ids])`
        - a dictionary with one or several input Tensors associaed to the input names given in the docstring:
-            `model({'input_ids': input_ids, 'token_type_ids': token_type_ids})`
+            `model({'inputs': inputs, 'token_type_ids': token_type_ids})`

    Parameters:
        config (:class:`~transformers.T5Config`): Model configuration class with all the parameters of the model.
@@ -780,7 +780,7 @@ T5_INPUTS_DOCSTRING = r"""
            T5 is a model with relative position embeddings so you should be able to pad the inputs on
            the right or the left.
            Indices can be obtained using :class:`transformers.T5Tokenizer`.
-            To know more on how to prepare :obj:`input_ids` for pre-training take a look at
+            To know more on how to prepare :obj:`inputs` for pre-training take a look at
            `T5 Training <./t5.html#training>`_ .
            See :func:`transformers.PreTrainedTokenizer.encode` and
            :func:`transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details.
@@ -805,8 +805,8 @@ T5_INPUTS_DOCSTRING = r"""
        use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
            If `use_cache` is True, `decoder_past_key_value_states` are returned and can be used to speed up decoding (see `decoder_past_key_value_states`).
        inputs_embeds (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):
-            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
-            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
+            Optionally, instead of passing :obj:`inputs` you can choose to directly pass an embedded representation.
+            This is useful if you want more control over how to convert `inputs` indices into associated vectors
            than the model's internal embedding lookup matrix.
        decoder_inputs_embeds (:obj:`tf.Tensor` of shape :obj:`(batch_size, target_sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):
            Optionally, instead of passing :obj:`decoder_input_ids` you can choose to directly pass an embedded representation.
@@ -885,8 +885,8 @@ class TFT5Model(TFT5PreTrainedModel):

        tokenizer = T5Tokenizer.from_pretrained('t5-small')
        model = TFT5Model.from_pretrained('t5-small')
-        input_ids = tokenizer.encode("Hello, my dog is cute", return_tensors="tf")  # Batch size 1
-        outputs = model(input_ids, decoder_input_ids=input_ids)
+        inputs = tokenizer.encode("Hello, my dog is cute", return_tensors="tf")  # Batch size 1
+        outputs = model(inputs, decoder_input_ids=inputs)
        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

        """
@@ -897,7 +897,7 @@ class TFT5Model(TFT5PreTrainedModel):
            kwargs["inputs"] = inputs

        # retrieve arguments
-        input_ids = kwargs.get("inputs", None)
+        inputs = kwargs.get("inputs", None)
        inputs_embeds = kwargs.get("inputs_embeds", None)
        attention_mask = kwargs.get("attention_mask", None)
        encoder_outputs = kwargs.get("encoder_outputs", None)
@@ -911,7 +911,7 @@ class TFT5Model(TFT5PreTrainedModel):
        # Encode if needed (training, first prediction pass)
        if encoder_outputs is None:
            encoder_outputs = self.encoder(
-                input_ids, attention_mask=attention_mask, inputs_embeds=inputs_embeds, head_mask=head_mask,
+                inputs, attention_mask=attention_mask, inputs_embeds=inputs_embeds, head_mask=head_mask,
            )

        hidden_states = encoder_outputs[0]
@@ -1006,14 +1006,14 @@ class TFT5ForConditionalGeneration(TFT5PreTrainedModel):

        tokenizer = T5Tokenizer.from_pretrained('t5-small')
        model = TFT5ForConditionalGeneration.from_pretrained('t5-small')
-        input_ids = tokenizer.encode("Hello, my dog is cute", return_tensors="tf")  # Batch size 1
-        outputs = model(input_ids, decoder_input_ids=input_ids)
+        inputs = tokenizer.encode("Hello, my dog is cute", return_tensors="tf")  # Batch size 1
+        outputs = model(inputs, decoder_input_ids=inputs)
        prediction_scores = outputs[0]

        tokenizer = T5Tokenizer.from_pretrained('t5-small')
        model = TFT5ForConditionalGeneration.from_pretrained('t5-small')
-        input_ids = tokenizer.encode("summarize: Hello, my dog is cute", return_tensors="tf")  # Batch size 1
-        model.generate(input_ids)
+        inputs = tokenizer.encode("summarize: Hello, my dog is cute", return_tensors="tf")  # Batch size 1
+        model.generate(inputs)

        """

@@ -1023,7 +1023,7 @@ class TFT5ForConditionalGeneration(TFT5PreTrainedModel):
            kwargs["inputs"] = inputs

        # retrieve arguments
-        input_ids = kwargs.get("inputs", None)
+        inputs = kwargs.get("inputs", None)
        decoder_input_ids = kwargs.get("decoder_input_ids", None)
        attention_mask = kwargs.get("attention_mask", None)
        encoder_outputs = kwargs.get("encoder_outputs", None)
@@ -1038,7 +1038,7 @@ class TFT5ForConditionalGeneration(TFT5PreTrainedModel):
        if encoder_outputs is None:
            # Convert encoder inputs in embeddings if needed
            encoder_outputs = self.encoder(
-                input_ids, attention_mask=attention_mask, inputs_embeds=inputs_embeds, head_mask=head_mask,
+                inputs, attention_mask=attention_mask, inputs_embeds=inputs_embeds, head_mask=head_mask,
            )

        hidden_states = encoder_outputs[0]
@@ -1076,7 +1076,7 @@ class TFT5ForConditionalGeneration(TFT5PreTrainedModel):

        return decoder_outputs + encoder_outputs

-    def prepare_inputs_for_generation(self, input_ids, past, attention_mask, use_cache, **kwargs):
+    def prepare_inputs_for_generation(self, inputs, past, attention_mask, use_cache, **kwargs):
        assert past is not None, "past has to be defined for encoder_outputs"

        # first step
@@ -1087,7 +1087,7 @@ class TFT5ForConditionalGeneration(TFT5PreTrainedModel):

        return {
            "inputs": None,  # inputs don't have to be defined, but still need to be passed to make Keras.layer.__call__ happy
-            "decoder_input_ids": input_ids,  # input_ids are the decoder_input_ids
+            "decoder_input_ids": inputs,  # inputs are the decoder_input_ids
            "decoder_past_key_value_states": decoder_past_key_value_states,
            "encoder_outputs": encoder_outputs,
            "attention_mask": attention_mask,
--- a/src/transformers/modeling_tf_utils.py
+++ b/src/transformers/modeling_tf_utils.py
@@ -929,7 +929,9 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
            else:
                tokens_to_add = next_token

+            # add token and increase length by one
            input_ids = tf.concat([input_ids, tf.expand_dims(tokens_to_add, -1)], 1)
+            cur_len = cur_len + 1

            if eos_token_id is not None:
                eos_in_sents = tokens_to_add == eos_token_id
@@ -955,8 +957,6 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
                    [attention_mask, tf.ones((shape_list(attention_mask)[0], 1), dtype=tf.int32)], axis=-1
                )

-            cur_len = cur_len + 1
-
        # if there are different sentences lengths in the batch, some batches have to be padded
        min_sent_length = tf.math.reduce_min(sent_lengths)
        max_sent_length = tf.math.reduce_max(sent_lengths)
@@ -970,7 +970,7 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
                tf.expand_dims(sent_lengths, -1), [batch_size, max_sent_length]
            )
            broad_casted_range = tf.transpose(
-                tf.broadcast_to(tf.expand_dims(tf.range(max_length), -1), [max_length, batch_size])
+                tf.broadcast_to(tf.expand_dims(tf.range(max_sent_length), -1), [max_sent_length, batch_size])
            )

            decoded = tf.where(broad_casted_range < broad_casted_sent_lengths, input_ids, padding)
@@ -1205,9 +1205,11 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
            beam_tokens = tf.convert_to_tensor([x[1] for x in next_batch_beam], dtype=tf.int32)
            beam_idx = tf.convert_to_tensor([x[2] for x in next_batch_beam], dtype=tf.int32)

-            # re-order batch
+            # re-order batch and update current length
            input_ids = tf.stack([tf.identity(input_ids[x, :]) for x in beam_idx])
            input_ids = tf.concat([input_ids, tf.expand_dims(beam_tokens, 1)], axis=-1)
+            cur_len = cur_len + 1
+
            # re-order internal states
            if past is not None:
                past = self._reorder_cache(past, beam_idx)
@@ -1218,9 +1220,6 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
                    [attention_mask, tf.ones((shape_list(attention_mask)[0], 1), dtype=tf.int32)], axis=-1
                )

-            # update current length
-            cur_len = cur_len + 1
-
        # finalize all open beam hypotheses and end to generated hypotheses
        for batch_idx in range(batch_size):
            # Add all open beam hypothesis to generated_hyps
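Note on the padding fix in `modeling_tf_utils.py` above: the range compared against `sent_lengths` now spans `max_sent_length` instead of `max_length`. The masking step can be sketched in plain Python as follows. This is only an illustration under stated assumptions: nested lists stand in for tensors, every row is assumed to already have the same width, and the helper name `pad_after_sentence_end` is hypothetical.

```python
from typing import List


def pad_after_sentence_end(
    input_ids: List[List[int]], sent_lengths: List[int], pad_token_id: int
) -> List[List[int]]:
    """Keep tokens at positions below each sentence length, pad the rest.

    Hypothetical helper: mirrors the idea of
    `tf.where(range < sent_lengths, input_ids, padding)` using lists.
    """
    decoded = []
    for row, length in zip(input_ids, sent_lengths):
        decoded.append(
            [token if position < length else pad_token_id for position, token in enumerate(row)]
        )
    return decoded
```

The comparison is done per position, so the range and the token matrix must have the same width for the selection to line up.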
--- a/src/transformers/modeling_utils.py
+++ b/src/transformers/modeling_utils.py
@@ -17,7 +17,7 @@
 import inspect
 import logging
 import os
-from typing import Callable, Tuple
+from typing import Callable, List, Tuple

 import torch
 from torch import Tensor, device, dtype, nn
@@ -110,11 +110,33 @@ class ModuleUtilsMixin:

    @property
    def device(self) -> device:
-        return next(self.parameters()).device
+        try:
+            return next(self.parameters()).device
+        except StopIteration:
+            # For nn.DataParallel compatibility in PyTorch 1.5
+
+            def find_tensor_attributes(module: nn.Module) -> List[Tuple[str, Tensor]]:
+                tuples = [(k, v) for k, v in module.__dict__.items() if torch.is_tensor(v)]
+                return tuples
+
+            gen = self._named_members(get_members_fn=find_tensor_attributes)
+            first_tuple = next(gen)
+            return first_tuple[1].device

    @property
    def dtype(self) -> dtype:
-        return next(self.parameters()).dtype
+        try:
+            return next(self.parameters()).dtype
+        except StopIteration:
+            # For nn.DataParallel compatibility in PyTorch 1.5
+
+            def find_tensor_attributes(module: nn.Module) -> List[Tuple[str, Tensor]]:
+                tuples = [(k, v) for k, v in module.__dict__.items() if torch.is_tensor(v)]
+                return tuples
+
+            gen = self._named_members(get_members_fn=find_tensor_attributes)
+            first_tuple = next(gen)
+            return first_tuple[1].dtype

    def invert_attention_mask(self, encoder_attention_mask: Tensor) -> Tensor:
        """type: torch.Tensor -> torch.Tensor"""
@@ -128,7 +150,18 @@ class ModuleUtilsMixin:
        # encoder_extended_attention_mask = (encoder_extended_attention_mask ==
        # encoder_extended_attention_mask.transpose(-1, -2))
        encoder_extended_attention_mask = encoder_extended_attention_mask.to(dtype=self.dtype)  # fp16 compatibility
-        encoder_extended_attention_mask = (1.0 - encoder_extended_attention_mask) * -1e9
+
+        if self.dtype == torch.float16:
+            encoder_extended_attention_mask = (1.0 - encoder_extended_attention_mask) * -1e4
+        elif self.dtype == torch.float32:
+            encoder_extended_attention_mask = (1.0 - encoder_extended_attention_mask) * -1e9
+        else:
+            raise ValueError(
+                "{} not recognized. `dtype` should be set to either `torch.float32` or `torch.float16`".format(
+                    self.dtype
+                )
+            )
+
        return encoder_extended_attention_mask

    def get_extended_attention_mask(self, attention_mask: Tensor, input_shape: tuple, device: device):
@@ -737,7 +770,7 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
            import torch_xla.core.xla_model as xm

            model = xm.send_cpu_data_to_device(model, xm.xla_device())
-            model = model.to(xm.xla_device())
+            model.to(xm.xla_device())

        return model

@@ -1236,13 +1269,15 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
            else:
                tokens_to_add = next_token

+            # add token and increase length by one
            input_ids = torch.cat([input_ids, tokens_to_add.unsqueeze(-1)], dim=-1)
+            cur_len = cur_len + 1

            if eos_token_id is not None:
                eos_in_sents = tokens_to_add == eos_token_id
                # if sentence is unfinished and the token to add is eos, sent_lengths is filled with current length
                is_sents_unfinished_and_token_to_add_is_eos = unfinished_sents.mul(eos_in_sents.long()).bool()
-                sent_lengths.masked_fill_(is_sents_unfinished_and_token_to_add_is_eos, cur_len + 1)
+                sent_lengths.masked_fill_(is_sents_unfinished_and_token_to_add_is_eos, cur_len)
                # unfinished_sents is set to zero if eos in sentence
                unfinished_sents.mul_((~eos_in_sents).long())

@@ -1256,8 +1291,6 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
                    [attention_mask, attention_mask.new_ones((attention_mask.shape[0], 1))], dim=-1
                )

-            cur_len = cur_len + 1
-
        # if there are different sentences lengths in the batch, some batches have to be padded
        if sent_lengths.min().item() != sent_lengths.max().item():
            assert pad_token_id is not None, "`Pad_token_id` has to be defined if batches have different lengths"
@@ -1473,9 +1506,11 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
            beam_tokens = input_ids.new([x[1] for x in next_batch_beam])
            beam_idx = input_ids.new([x[2] for x in next_batch_beam])

-            # re-order batch
+            # re-order batch and update current length
            input_ids = input_ids[beam_idx, :]
            input_ids = torch.cat([input_ids, beam_tokens.unsqueeze(1)], dim=-1)
+            cur_len = cur_len + 1
+
            # re-order internal states
            if past is not None:
                past = self._reorder_cache(past, beam_idx)
@@ -1486,9 +1521,6 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
                    [attention_mask, attention_mask.new_ones((attention_mask.shape[0], 1))], dim=-1
                )

-            # update current length
-            cur_len = cur_len + 1
-
        # finalize all open beam hypotheses and end to generated hypotheses
        for batch_idx in range(batch_size):
            if done[batch_idx]:
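Note on the dtype-dependent mask constant in `invert_attention_mask` above: the largest finite half-precision value is 65504, so -1e9 is out of range for float16 while -1e4 is not, which is presumably why the float16 branch uses the smaller constant. The sketch below checks this with the standard-library `struct` half-precision format instead of PyTorch. The names `fits_in_float16` and `mask_fill_value` are hypothetical, and the dtype is passed as a plain string for illustration.

```python
import struct

FLOAT16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def fits_in_float16(value: float) -> bool:
    """Return True if `value` can be packed as a finite half-precision float."""
    try:
        struct.pack("<e", value)
    except OverflowError:
        return False
    return True


def mask_fill_value(dtype_name: str) -> float:
    """Mirror the dtype switch from the diff (hypothetical string-based variant)."""
    if dtype_name == "float16":
        return -1e4
    if dtype_name == "float32":
        return -1e9
    raise ValueError("{} not recognized".format(dtype_name))
```

`struct` raises on overflow, whereas a tensor library may instead produce an infinite value; the sketch only shows which constants are representable.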
--- a/src/transformers/modeling_xlnet.py
+++ b/src/transformers/modeling_xlnet.py
@@ -623,7 +623,7 @@ class XLNetModel(XLNetPreTrainedModel):
            mask_lo = torch.tril(attn_mask, diagonal=-1)
            ret = torch.cat([ret[:, :qlen] + mask_lo, ret[:, qlen:]], dim=1)

-        ret = ret.to(next(self.parameters()))
+        ret = ret.to(self.device)
        return ret

    def cache_mem(self, curr_out, prev_mem):
@@ -685,7 +685,7 @@ class XLNetModel(XLNetPreTrainedModel):
                fwd_pos_seq = fwd_pos_seq.clamp(-self.clamp_len, self.clamp_len)
            pos_emb = self.positional_embedding(fwd_pos_seq, inv_freq, bsz)

-        pos_emb = pos_emb.to(next(self.parameters()))
+        pos_emb = pos_emb.to(self.device)
        return pos_emb

    @add_start_docstrings_to_callable(XLNET_INPUTS_DOCSTRING)
@@ -761,8 +761,8 @@ class XLNetModel(XLNetPreTrainedModel):
        mlen = mems[0].shape[0] if mems is not None and mems[0] is not None else 0
        klen = mlen + qlen

-        dtype_float = next(self.parameters()).dtype
-        device = next(self.parameters()).device
+        dtype_float = self.dtype
+        device = self.device

        # Attention mask
        # causal attention mask
--- a/src/transformers/optimization.py
+++ b/src/transformers/optimization.py
@@ -152,8 +152,8 @@ class AdamW(Optimizer):

                # Decay the first and second moment running average coefficient
                # In-place operations to update the averages at the same time
-                exp_avg.mul_(beta1).add_(1.0 - beta1, grad)
-                exp_avg_sq.mul_(beta2).addcmul_(1.0 - beta2, grad, grad)
+                exp_avg.mul_(beta1).add_(grad, alpha=1.0 - beta1)
+                exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1.0 - beta2)
                denom = exp_avg_sq.sqrt().add_(group["eps"])

                step_size = group["lr"]
@@ -162,7 +162,7 @@ class AdamW(Optimizer):
                    bias_correction2 = 1.0 - beta2 ** state["step"]
                    step_size = step_size * math.sqrt(bias_correction2) / bias_correction1

-                p.data.addcdiv_(-step_size, exp_avg, denom)
+                p.data.addcdiv_(exp_avg, denom, value=-step_size)

                # Just adding the square of the weights to the loss function is *not*
                # the correct way of using L2 regularization/weight decay with Adam,
@@ -173,6 +173,6 @@ class AdamW(Optimizer):
                # of the weights to the loss with plain (non-momentum) SGD.
                # Add weight decay at the end (fixed version)
                if group["weight_decay"] > 0.0:
-                    p.data.add_(-group["lr"] * group["weight_decay"], p.data)
+                    p.data.add_(p.data, alpha=-group["lr"] * group["weight_decay"])

        return loss
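Note on the `optimization.py` hunks above: the in-place calls now pass the scaling factor by keyword (`alpha=` for `add_`, `value=` for `addcmul_` and `addcdiv_`) instead of as a leading positional argument; the arithmetic is unchanged. A scalar sketch of one update step, written out in plain Python, may help when reading the hunk. It is not the library implementation: the function name `adamw_step`, the dict-based `state`, and the `correct_bias` flag are assumptions made for illustration.

```python
import math


def adamw_step(param, grad, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-6,
               weight_decay=0.0, correct_bias=True):
    """One AdamW update on a single scalar parameter (illustrative only)."""
    beta1, beta2 = betas
    state["step"] += 1

    # exp_avg.mul_(beta1).add_(grad, alpha=1.0 - beta1)
    state["exp_avg"] = state["exp_avg"] * beta1 + (1.0 - beta1) * grad
    # exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1.0 - beta2)
    state["exp_avg_sq"] = state["exp_avg_sq"] * beta2 + (1.0 - beta2) * grad * grad
    denom = math.sqrt(state["exp_avg_sq"]) + eps

    step_size = lr
    if correct_bias:
        bias_correction1 = 1.0 - beta1 ** state["step"]
        bias_correction2 = 1.0 - beta2 ** state["step"]
        step_size = step_size * math.sqrt(bias_correction2) / bias_correction1

    # p.data.addcdiv_(exp_avg, denom, value=-step_size)
    param = param - step_size * state["exp_avg"] / denom

    # Decoupled weight decay, applied after the Adam update:
    # p.data.add_(p.data, alpha=-lr * weight_decay)
    if weight_decay > 0.0:
        param = param - lr * weight_decay * param
    return param
```

A fresh state for this sketch would be `{"step": 0, "exp_avg": 0.0, "exp_avg_sq": 0.0}`.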
--- a/src/transformers/optimization_tf.py
+++ b/src/transformers/optimization_tf.py
@@ -75,7 +75,7 @@ def create_optimizer(init_lr, num_train_steps, num_warmup_steps, end_lr=0.0, opt
        beta_1=0.9,
        beta_2=0.999,
        epsilon=1e-6,
-        exclude_from_weight_decay=["layer_norm", "bias"],
+        exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"],
    )

    return optimizer
@@ -217,7 +217,7 @@ class GradientAccumulator(object):
        """The accumulated gradients on the current replica."""
        if not self._gradients:
            raise ValueError("The accumulator should be called first to initialize the gradients")
-        return list(gradient.value() for gradient in self._gradients)
+        return list(gradient.value() if gradient is not None else gradient for gradient in self._gradients)

    def __call__(self, gradients):
        """Accumulates :obj:`gradients` on the current replica."""
@@ -231,6 +231,8 @@ class GradientAccumulator(object):
                        synchronization=tf.VariableSynchronization.ON_READ,
                        aggregation=tf.VariableAggregation.ONLY_FIRST_REPLICA,
                    )
+                    if gradient is not None
+                    else gradient
                    for gradient in gradients
                ]
            )
@@ -238,7 +240,8 @@ class GradientAccumulator(object):
            raise ValueError("Expected %s gradients, but got %d" % (len(self._gradients), len(gradients)))

        for accum_gradient, gradient in zip(self._gradients, gradients):
-            accum_gradient.assign_add(gradient)
+            if accum_gradient is not None and gradient is not None:
+                accum_gradient.assign_add(gradient)

        self._accum_steps.assign_add(1)

@@ -248,4 +251,5 @@ class GradientAccumulator(object):
            return
        self._accum_steps.assign(0)
        for gradient in self._gradients:
-            gradient.assign(tf.zeros_like(gradient))
+            if gradient is not None:
+                gradient.assign(tf.zeros_like(gradient))
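Note on the `GradientAccumulator` hunks above: every code path now tolerates `None` entries, which appear when a variable receives no gradient. The same control flow can be sketched without TensorFlow. In this sketch floats stand in for variables, and the class name `ListGradientAccumulator` is hypothetical.

```python
from typing import List, Optional, Sequence


class ListGradientAccumulator:
    """Pure-Python stand-in: each 'gradient' is a float or None."""

    def __init__(self) -> None:
        self._gradients: List[Optional[float]] = []
        self._accum_steps = 0

    @property
    def step(self) -> int:
        return self._accum_steps

    @property
    def gradients(self) -> List[Optional[float]]:
        if not self._gradients:
            raise ValueError("The accumulator should be called first to initialize the gradients")
        return list(self._gradients)

    def __call__(self, gradients: Sequence[Optional[float]]) -> None:
        if not self._gradients:
            # Allocate a zero slot per real gradient and keep None placeholders
            self._gradients = [0.0 if g is not None else None for g in gradients]
        if len(gradients) != len(self._gradients):
            raise ValueError("Expected %s gradients, but got %d" % (len(self._gradients), len(gradients)))
        for i, gradient in enumerate(gradients):
            if self._gradients[i] is not None and gradient is not None:
                self._gradients[i] += gradient
        self._accum_steps += 1

    def reset(self) -> None:
        self._accum_steps = 0
        self._gradients = [0.0 if g is not None else None for g in self._gradients]
```

Keeping the `None` placeholders in place preserves the positional alignment between gradients and variables.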
--- a/src/transformers/pipelines.py
+++ b/src/transformers/pipelines.py
@@ -24,7 +24,7 @@ from abc import ABC, abstractmethod
 from contextlib import contextmanager
 from itertools import chain
 from os.path import abspath, exists
-from typing import Any, Dict, Iterable, List, Optional, Sequence, Tuple, Union
+from typing import TYPE_CHECKING, Any, Dict, Iterable, List, Optional, Sequence, Tuple, Union

 import numpy as np

@@ -58,6 +58,10 @@ if is_torch_available():
        AutoModelWithLMHead,
    )

+if TYPE_CHECKING:
+    from .modeling_utils import PreTrainedModel
+    from .modeling_tf_utils import TFPreTrainedModel
+

 logger = logging.getLogger(__name__)

@@ -864,6 +868,7 @@ class NerPipeline(Pipeline):
        binary_output: bool = False,
        ignore_labels=["O"],
        task: str = "",
+        grouped_entities: bool = False,
    ):
        super().__init__(
            model=model,
@@ -878,6 +883,7 @@ class NerPipeline(Pipeline):

        self._basic_tokenizer = BasicTokenizer(do_lower_case=False)
        self.ignore_labels = ignore_labels
+        self.grouped_entities = grouped_entities

    def __call__(self, *args, **kwargs):
        inputs = self._args_parser(*args, **kwargs)
@@ -907,23 +913,74 @@ class NerPipeline(Pipeline):
            score = np.exp(entities) / np.exp(entities).sum(-1, keepdims=True)
            labels_idx = score.argmax(axis=-1)

-            answer = []
-            for idx, label_idx in enumerate(labels_idx):
-                if self.model.config.id2label[label_idx] not in self.ignore_labels:
-                    answer += [
-                        {
-                            "word": self.tokenizer.convert_ids_to_tokens(int(input_ids[idx])),
-                            "score": score[idx][label_idx].item(),
-                            "entity": self.model.config.id2label[label_idx],
-                        }
-                    ]
+            entities = []
+            entity_groups = []
+            entity_group_disagg = []
+            # Filter to labels not in `self.ignore_labels`
+            filtered_labels_idx = [
+                (idx, label_idx)
+                for idx, label_idx in enumerate(labels_idx)
+                if self.model.config.id2label[label_idx] not in self.ignore_labels
+            ]
+
+            for idx, label_idx in filtered_labels_idx:
+
+                entity = {
+                    "word": self.tokenizer.convert_ids_to_tokens(int(input_ids[idx])),
+                    "score": score[idx][label_idx].item(),
+                    "entity": self.model.config.id2label[label_idx],
+                    "index": idx,
+                }
+                last_idx, _ = filtered_labels_idx[-1]
+                if self.grouped_entities:
+                    if not entity_group_disagg:
+                        entity_group_disagg += [entity]
+                        if idx == last_idx:
+                            entity_groups += [self.group_entities(entity_group_disagg)]
+                        continue
+
+                    # If the current entity has the same label as the previous entity and is adjacent to it, append it to the disaggregated entity group
+                    if (
+                        entity["entity"] == entity_group_disagg[-1]["entity"]
+                        and entity["index"] == entity_group_disagg[-1]["index"] + 1
+                    ):
+                        entity_group_disagg += [entity]
+                        # Group the entities at the last entity
+                        if idx == last_idx:
+                            entity_groups += [self.group_entities(entity_group_disagg)]
+                    # Otherwise (different label or not adjacent), aggregate the disaggregated entity group and start a new one
+                    else:
+                        entity_groups += [self.group_entities(entity_group_disagg)]
+                        entity_group_disagg = [entity]
+
+                entities += [entity]

            # Append
-            answers += [answer]
+            if self.grouped_entities:
+                answers += [entity_groups]
+            else:
+                answers += [entities]
+
        if len(answers) == 1:
            return answers[0]
        return answers

+    def group_entities(self, entities):
+        """
+        Returns grouped entities
+        """
+        # Use the label of the last entity in the entity group
+        label = entities[-1]["entity"]
+        score = np.mean([e["score"] for e in entities])
+        tokens = [e["word"] for e in entities]
+
+        entity_group = {
+            "entity_group": label,
+            "score": score,
+            "word": self.tokenizer.convert_tokens_to_string(tokens),
+        }
+        return entity_group
+

 TokenClassificationPipeline = NerPipeline

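Note on the `grouped_entities` option above: tokens are merged into one group when they carry the same label and sit at consecutive indices. A compact sketch of that rule follows. It is a simplified variant, not the pipeline code: the function names are hypothetical, words are joined with a space instead of `convert_tokens_to_string`, and the last open group is flushed once after the loop.

```python
from statistics import fmean
from typing import Dict, List


def merge_group(group: List[Dict]) -> Dict:
    """Collapse one run of token-level entities into a single entry."""
    return {
        "entity_group": group[-1]["entity"],
        "score": fmean([e["score"] for e in group]),
        "word": " ".join(e["word"] for e in group),  # assumption: whitespace join
    }


def group_adjacent_entities(entities: List[Dict]) -> List[Dict]:
    """Group entities with the same label at consecutive indices."""
    groups: List[Dict] = []
    current: List[Dict] = []
    for entity in entities:
        if current:
            same_label = entity["entity"] == current[-1]["entity"]
            adjacent = entity["index"] == current[-1]["index"] + 1
            if not (same_label and adjacent):
                groups.append(merge_group(current))
                current = []
        current.append(entity)
    if current:
        groups.append(merge_group(current))
    return groups
```

Each input entry is expected to have the `word`, `score`, `entity` and `index` keys that the pipeline builds per token.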
@@ -1509,7 +1566,7 @@ class TranslationPipeline(Pipeline):
            return results


-# Register all the supported task here
+# Register all the supported tasks here
 SUPPORTED_TASKS = {
    "feature-extraction": {
        "impl": FeatureExtractionPipeline,
@@ -1572,9 +1629,9 @@ SUPPORTED_TASKS = {
        "tf": TFAutoModelWithLMHead if is_tf_available() else None,
        "pt": AutoModelWithLMHead if is_torch_available() else None,
        "default": {
-            "model": {"pt": "bart-large-cnn", "tf": None},
+            "model": {"pt": "bart-large-cnn", "tf": "t5-small"},
            "config": None,
-            "tokenizer": ("bart-large-cnn", {"use_fast": False}),
+            "tokenizer": {"pt": ("bart-large-cnn", {"use_fast": False}), "tf": "t5-small"},
        },
    },
    "translation_en_to_fr": {
--- a/src/transformers/tokenization_auto.py
+++ b/src/transformers/tokenization_auto.py
@@ -29,6 +29,7 @@ from .configuration_auto import (
    ElectraConfig,
    FlaubertConfig,
    GPT2Config,
+    LongformerConfig,
    OpenAIGPTConfig,
    ReformerConfig,
    RobertaConfig,
@@ -50,6 +51,7 @@ from .tokenization_distilbert import DistilBertTokenizer, DistilBertTokenizerFas
 from .tokenization_electra import ElectraTokenizer, ElectraTokenizerFast
 from .tokenization_flaubert import FlaubertTokenizer
 from .tokenization_gpt2 import GPT2Tokenizer, GPT2TokenizerFast
+from .tokenization_longformer import LongformerTokenizer
 from .tokenization_marian import MarianTokenizer
 from .tokenization_openai import OpenAIGPTTokenizer, OpenAIGPTTokenizerFast
 from .tokenization_reformer import ReformerTokenizer
@@ -73,6 +75,7 @@ TOKENIZER_MAPPING = OrderedDict(
        (XLMRobertaConfig, (XLMRobertaTokenizer, None)),
        (MarianConfig, (MarianTokenizer, None)),
        (BartConfig, (BartTokenizer, None)),
+        (LongformerConfig, (LongformerTokenizer, None)),
        (RobertaConfig, (RobertaTokenizer, RobertaTokenizerFast)),
        (ReformerConfig, (ReformerTokenizer, None)),
        (ElectraConfig, (ElectraTokenizer, ElectraTokenizerFast)),
@@ -105,6 +108,7 @@ class AutoTokenizer:
            - contains `albert`: AlbertTokenizer (ALBERT model)
            - contains `camembert`: CamembertTokenizer (CamemBERT model)
            - contains `xlm-roberta`: XLMRobertaTokenizer (XLM-RoBERTa model)
+            - contains `longformer`: LongformerTokenizer (AllenAI Longformer model)
            - contains `roberta`: RobertaTokenizer (RoBERTa model)
            - contains `bert`: BertTokenizer (Bert model)
            - contains `openai-gpt`: OpenAIGPTTokenizer (OpenAI GPT model)
@@ -136,6 +140,7 @@ class AutoTokenizer:
            - contains `albert`: AlbertTokenizer (ALBERT model)
            - contains `camembert`: CamembertTokenizer (CamemBERT model)
            - contains `xlm-roberta`: XLMRobertaTokenizer (XLM-RoBERTa model)
+            - contains `longformer`: LongformerTokenizer (AllenAI Longformer model)
            - contains `roberta`: RobertaTokenizer (RoBERTa model)
            - contains `bert-base-japanese`: BertJapaneseTokenizer (Bert model)
            - contains `bert`: BertTokenizer (Bert model)
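Note on the position of the new `LongformerConfig` entry in `TOKENIZER_MAPPING` above: it is inserted before `RobertaConfig`. The ordering matters if the lookup is a first-match scan and the Longformer config derives from the RoBERTa config. Neither detail is shown in this diff, so both are assumptions here, and all class and function names in the sketch are hypothetical.

```python
from collections import OrderedDict


class RobertaLikeConfig:
    """Hypothetical parent config."""


class LongformerLikeConfig(RobertaLikeConfig):
    """Hypothetical child config (assumed to subclass the parent)."""


TOKENIZER_NAMES = OrderedDict(
    [
        (LongformerLikeConfig, "LongformerTokenizer"),  # subclass listed first
        (RobertaLikeConfig, "RobertaTokenizer"),
    ]
)


def pick_tokenizer(config, mapping=TOKENIZER_NAMES) -> str:
    """Return the first tokenizer name whose config class matches `config`."""
    for config_class, tokenizer_name in mapping.items():
        if isinstance(config, config_class):
            return tokenizer_name
    raise ValueError("Unrecognized configuration class {}".format(type(config)))
```

With the two entries swapped, a child config would match the parent entry first and never reach its own.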
--- a/src/transformers/tokenization_bart.py
+++ b/src/transformers/tokenization_bart.py
@@ -27,8 +27,6 @@ vocab_url = "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-v
 merges_url = "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-merges.txt"
 _all_bart_models = ["bart-large", "bart-large-mnli", "bart-large-cnn", "bart-large-xsum"]

-VOCAB_FILES_NAMES = {"vocab_file": "sentence.bpe.model"}
-

 class BartTokenizer(RobertaTokenizer):
    # merges and vocab same as Roberta
@@ -44,6 +42,6 @@ SPM_URL = "https://s3.amazonaws.com/models.huggingface.co/bert/facebook/mbart-la


 class MBartTokenizer(XLMRobertaTokenizer):
-    vocab_files_names = VOCAB_FILES_NAMES
+    vocab_files_names = {"vocab_file": "sentencepiece.bpe.model"}
    max_model_input_sizes = {m: 1024 for m in _all_mbart_models}
    pretrained_vocab_files_map = {"vocab_file": {m: SPM_URL for m in _all_mbart_models}}
--- a/src/transformers/tokenization_longformer.py
+++ b/src/transformers/tokenization_longformer.py
@@ -0,0 +1,42 @@
+# coding=utf-8
+# Copyright 2020 The Allen Institute for AI team and The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import logging
+
+from .tokenization_roberta import RobertaTokenizer
+
+
+logger = logging.getLogger(__name__)
+
+
+# vocab and merges same as roberta
+vocab_url = "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-vocab.json"
+merges_url = "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-merges.txt"
+_all_longformer_models = ["longformer-base-4096", "longformer-large-4096"]
+
+
+PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
+    "longformer-base-4096": 4096,
+    "longformer-large-4096": 4096,
+}
+
+
+class LongformerTokenizer(RobertaTokenizer):
+    # merges and vocab same as Roberta
+    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
+    pretrained_vocab_files_map = {
+        "vocab_file": {m: vocab_url for m in _all_longformer_models},
+        "merges_file": {m: merges_url for m in _all_longformer_models},
+    }
--- a/src/transformers/tokenization_marian.py
+++ b/src/transformers/tokenization_marian.py
@@ -1,7 +1,9 @@
 import json
 import re
 import warnings
-from typing import Dict, List, Optional, Union
+from pathlib import Path
+from shutil import copyfile
+from typing import Dict, List, Optional, Tuple, Union

 import sentencepiece

@@ -15,7 +17,7 @@ vocab_files_names = {
    "vocab": "vocab.json",
    "tokenizer_config_file": "tokenizer_config.json",
 }
-MODEL_NAMES = ("opus-mt-en-de",)  # TODO(SS): the only required constant is vocab_files_names
+MODEL_NAMES = ("opus-mt-en-de",)  # TODO(SS): delete this, the only required constant is vocab_files_names
 PRETRAINED_VOCAB_FILES_MAP = {
    k: {m: f"{S3_BUCKET_PREFIX}/Helsinki-NLP/{m}/{fname}" for m in MODEL_NAMES}
    for k, fname in vocab_files_names.items()
@@ -55,14 +57,16 @@ class MarianTokenizer(PreTrainedTokenizer):
        eos_token="</s>",
        pad_token="<pad>",
        max_len=512,
+        **kwargs,
    ):

        super().__init__(
-            # bos_token=bos_token,
+            # bos_token=bos_token,  unused. Start decoding with config.decoder_start_token_id
            max_len=max_len,
            eos_token=eos_token,
            unk_token=unk_token,
            pad_token=pad_token,
+            **kwargs,
        )
        self.encoder = load_json(vocab)
        if self.unk_token not in self.encoder:
@@ -72,21 +76,23 @@ class MarianTokenizer(PreTrainedTokenizer):

        self.source_lang = source_lang
        self.target_lang = target_lang
+        self.supported_language_codes: list = [k for k in self.encoder if k.startswith(">>") and k.endswith("<<")]
+        self.spm_files = [source_spm, target_spm]

        # load SentencePiece model for pre-processing
-        self.spm_source = sentencepiece.SentencePieceProcessor()
-        self.spm_source.Load(source_spm)
-
-        self.spm_target = sentencepiece.SentencePieceProcessor()
-        self.spm_target.Load(target_spm)
+        self.spm_source = load_spm(source_spm)
+        self.spm_target = load_spm(target_spm)
+        self.current_spm = self.spm_source

        # Multilingual target side: default to using first supported language code.
-        self.supported_language_codes: list = [k for k in self.encoder if k.startswith(">>") and k.endswith("<<")]

+        self._setup_normalizer()
+
+    def _setup_normalizer(self):
        try:
            from mosestokenizer import MosesPunctuationNormalizer

-            self.punc_normalizer = MosesPunctuationNormalizer(source_lang)
+            self.punc_normalizer = MosesPunctuationNormalizer(self.source_lang)
        except ImportError:
            warnings.warn("Recommended: pip install mosestokenizer")
            self.punc_normalizer = lambda x: x
@@ -124,9 +130,6 @@ class MarianTokenizer(PreTrainedTokenizer):
        # We don't expect to process pairs, but leave the pair logic for API consistency
        return token_ids_0 + token_ids_1 + [self.eos_token_id]

-    def batch_decode(self, token_ids, **kwargs) -> List[str]:
-        return [self.decode(ids, **kwargs) for ids in token_ids]
-
    def prepare_translation_batch(
        self,
        src_texts: List[str],
@@ -179,6 +182,65 @@ class MarianTokenizer(PreTrainedTokenizer):
    def vocab_size(self) -> int:
        return len(self.encoder)

+    def save_vocabulary(self, save_directory: str) -> Tuple[str]:
+        """Save the vocab to a JSON file and copy the spm files from their original paths."""
+        save_dir = Path(save_directory)
+        assert save_dir.is_dir(), f"{save_directory} should be a directory"
+        save_json(self.encoder, save_dir / self.vocab_files_names["vocab"])
+
+        for f in self.spm_files:
+            dest_path = save_dir / Path(f).name
+            if not dest_path.exists():
+                copyfile(f, dest_path)
+        return tuple(save_dir / f for f in self.vocab_files_names)
+
+    def get_vocab(self) -> Dict:
+        vocab = self.encoder.copy()
+        vocab.update(self.added_tokens_encoder)
+        return vocab
+
+    def __getstate__(self) -> Dict:
+        state = self.__dict__.copy()
+        state.update({k: None for k in ["spm_source", "spm_target", "current_spm", "punc_normalizer"]})
+        return state
+
+    def __setstate__(self, d: Dict) -> None:
+        self.__dict__ = d
+        self.spm_source, self.spm_target = (load_spm(f) for f in self.spm_files)
+        self.current_spm = self.spm_source
+        self._setup_normalizer()
+
+    def num_special_tokens_to_add(self, **unused):
+        """Just EOS"""
+        return 1
+
+    def _special_token_mask(self, seq):
+        all_special_ids = set(self.all_special_ids)  # call it once instead of inside list comp
+        all_special_ids.remove(self.unk_token_id)  # <unk> is only sometimes special
+        return [1 if x in all_special_ids else 0 for x in seq]
+
+    def get_special_tokens_mask(
+        self, token_ids_0: List, token_ids_1: Optional[List] = None, already_has_special_tokens: bool = False
+    ) -> List[int]:
+        """Get a list whose entries are 1 for special tokens (e.g. eos or pad) and 0 otherwise."""
+        if already_has_special_tokens:
+            return self._special_token_mask(token_ids_0)
+        elif token_ids_1 is None:
+            return self._special_token_mask(token_ids_0) + [1]
+        else:
+            return self._special_token_mask(token_ids_0 + token_ids_1) + [1]
+
+
+def load_spm(path: str) -> sentencepiece.SentencePieceProcessor:
+    spm = sentencepiece.SentencePieceProcessor()
+    spm.Load(path)
+    return spm
+
+
+def save_json(data, path: str) -> None:
+    with open(path, "w") as f:
+        json.dump(data, f, indent=2)
+

 def load_json(path: str) -> Union[Dict, List]:
    with open(path, "r") as f:
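
Note on the special-token mask added in the Marian tokenizer hunks above: the following is a minimal standalone sketch of the same logic, not the tokenizer's actual code. The token ids (`EOS_ID`, `PAD_ID`, `UNK_ID`) and the module-level functions are hypothetical stand-ins for the tokenizer's attributes and methods. It shows that `<unk>` is excluded from the special set and that one trailing `1` is appended for the EOS token that would be added later.

```python
from typing import List, Optional

# Hypothetical ids chosen for illustration only; the real tokenizer reads them from its vocab.
EOS_ID, PAD_ID, UNK_ID = 0, 1, 2
SPECIAL_IDS = {EOS_ID, PAD_ID}  # <unk> is deliberately left out of the special set


def special_token_mask(seq: List[int]) -> List[int]:
    """Return 1 for every id in the special set, 0 otherwise."""
    return [1 if x in SPECIAL_IDS else 0 for x in seq]


def get_special_tokens_mask(
    token_ids_0: List[int],
    token_ids_1: Optional[List[int]] = None,
    already_has_special_tokens: bool = False,
) -> List[int]:
    """Mirror the three branches of the diff: pre-formatted, single sequence, pair."""
    if already_has_special_tokens:
        return special_token_mask(token_ids_0)
    if token_ids_1 is None:
        # One extra slot for the EOS token that has not been appended yet.
        return special_token_mask(token_ids_0) + [1]
    return special_token_mask(token_ids_0 + token_ids_1) + [1]


print(get_special_tokens_mask([5, UNK_ID, 7]))
print(get_special_tokens_mask([5, 7, EOS_ID], already_has_special_tokens=True))
```

The mask length matches the final input length in every branch, which is why only the not-yet-formatted cases get the extra trailing entry.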
--- a/src/transformers/tokenization_roberta.py
+++ b/src/transformers/tokenization_roberta.py
@@ -199,7 +199,7 @@ class RobertaTokenizer(GPT2Tokenizer):
            if token_ids_1 is not None:
                raise ValueError(
                    "You should not supply a second sequence if the provided sequence of "
-                    "ids is already formated with special tokens for the model."
+                    "ids is already formatted with special tokens for the model."
                )
            return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))

--- a/src/transformers/tokenization_utils.py
+++ b/src/transformers/tokenization_utils.py
@@ -771,26 +771,26 @@ class PreTrainedTokenizer(SpecialTokensMixin):
        raise NotImplementedError

    @property
-    def is_fast(self):
+    def is_fast(self) -> bool:
        return False

    @property
-    def max_len(self):
+    def max_len(self) -> int:
        """ Kept here for backward compatibility.
            Now renamed to `model_max_length` to avoid ambiguity.
        """
        return self.model_max_length

    @property
-    def max_len_single_sentence(self):
+    def max_len_single_sentence(self) -> int:
        return self.model_max_length - self.num_special_tokens_to_add(pair=False)

    @property
-    def max_len_sentences_pair(self):
+    def max_len_sentences_pair(self) -> int:
        return self.model_max_length - self.num_special_tokens_to_add(pair=True)

    @max_len_single_sentence.setter
-    def max_len_single_sentence(self, value):
+    def max_len_single_sentence(self, value) -> int:
        """ For backward compatibility, allow to try to setup 'max_len_single_sentence' """
        if value == self.model_max_length - self.num_special_tokens_to_add(pair=False):
            logger.warning(
@@ -802,7 +802,7 @@ class PreTrainedTokenizer(SpecialTokensMixin):
            )

    @max_len_sentences_pair.setter
-    def max_len_sentences_pair(self, value):
+    def max_len_sentences_pair(self, value) -> int:
        """ For backward compatibility, allow to try to setup 'max_len_sentences_pair' """
        if value == self.model_max_length - self.num_special_tokens_to_add(pair=True):
            logger.warning(
@@ -1118,7 +1118,7 @@ class PreTrainedTokenizer(SpecialTokensMixin):

        return vocab_files + (special_tokens_map_file, added_tokens_file)

-    def save_vocabulary(self, save_directory):
+    def save_vocabulary(self, save_directory) -> Tuple[str]:
        """ Save the tokenizer vocabulary to a directory. This method does *NOT* save added tokens
            and special token mappings.

@@ -1128,7 +1128,7 @@ class PreTrainedTokenizer(SpecialTokensMixin):
        """
        raise NotImplementedError

-    def add_tokens(self, new_tokens):
+    def add_tokens(self, new_tokens: Union[str, List[str]]) -> int:
        """
        Add a list of new tokens to the tokenizer class. If the new tokens are not in the
        vocabulary, they are added to it with indices starting from length of the current vocabulary.
@@ -1156,7 +1156,7 @@ class PreTrainedTokenizer(SpecialTokensMixin):
        if not isinstance(new_tokens, list):
            new_tokens = [new_tokens]

-        to_add_tokens = []
+        tokens_to_add = []
        for token in new_tokens:
            assert isinstance(token, str)
            if self.init_kwargs.get("do_lower_case", False) and token not in self.all_special_tokens:
@@ -1164,18 +1164,18 @@ class PreTrainedTokenizer(SpecialTokensMixin):
            if (
                token != self.unk_token
                and self.convert_tokens_to_ids(token) == self.convert_tokens_to_ids(self.unk_token)
-                and token not in to_add_tokens
+                and token not in tokens_to_add
            ):
-                to_add_tokens.append(token)
+                tokens_to_add.append(token)
                logger.info("Adding %s to the vocabulary", token)

-        added_tok_encoder = dict((tok, len(self) + i) for i, tok in enumerate(to_add_tokens))
+        added_tok_encoder = dict((tok, len(self) + i) for i, tok in enumerate(tokens_to_add))
        added_tok_decoder = {v: k for k, v in added_tok_encoder.items()}
        self.added_tokens_encoder.update(added_tok_encoder)
        self.unique_added_tokens_encoder = set(self.added_tokens_encoder.keys()).union(set(self.all_special_tokens))
        self.added_tokens_decoder.update(added_tok_decoder)

-        return len(to_add_tokens)
+        return len(tokens_to_add)

    def num_special_tokens_to_add(self, pair=False):
        """
@@ -2080,10 +2080,7 @@ class PreTrainedTokenizer(SpecialTokensMixin):
    def build_inputs_with_special_tokens(self, token_ids_0: List, token_ids_1: Optional[List] = None) -> List:
        """
        Build model inputs from a sequence or a pair of sequence for sequence classification tasks
-        by concatenating and adding special tokens.
-        A RoBERTa sequence has the following format:
-            single sequence: <s> X </s>
-            pair of sequences: <s> A </s></s> B </s>
+        by concatenating and adding special tokens. This implementation does not add special tokens.
        """
        if token_ids_1 is None:
            return token_ids_0
@@ -2183,6 +2180,9 @@ class PreTrainedTokenizer(SpecialTokensMixin):
        else:
            return text

+    def batch_decode(self, sequences: List[List[int]], **kwargs) -> List[str]:
+        return [self.decode(seq, **kwargs) for seq in sequences]
+
    @staticmethod
    def clean_up_tokenization(out_string: str) -> str:
        """ Clean up a list of simple English tokenization artifacts like spaces before punctuations and abreviated forms.
--- a/src/transformers/trainer.py
+++ b/src/transformers/trainer.py
@@ -1,12 +1,13 @@
 import json
 import logging
+import math
 import os
 import random
 import re
 import shutil
 from contextlib import contextmanager
 from pathlib import Path
-from typing import Callable, Dict, List, Optional, Tuple, Union
+from typing import Callable, Dict, List, Optional, Tuple

 import numpy as np
 import torch
@@ -14,7 +15,7 @@ from torch import nn
 from torch.utils.data.dataloader import DataLoader
 from torch.utils.data.dataset import Dataset
 from torch.utils.data.distributed import DistributedSampler
-from torch.utils.data.sampler import RandomSampler
+from torch.utils.data.sampler import RandomSampler, Sampler, SequentialSampler
 from tqdm.auto import tqdm, trange

 from .data.data_collator import DataCollator, DefaultDataCollator
@@ -89,7 +90,7 @@ def set_seed(seed: int):
 @contextmanager
 def torch_distributed_zero_first(local_rank: int):
    """
-    Decorator to make all processes in distributed training wait for the first one (locally) to do something.
+    Context manager to make all processes in distributed training wait for each local_master to do something.
    """
    if local_rank not in [-1, 0]:
        torch.distributed.barrier()
@@ -98,6 +99,50 @@ def torch_distributed_zero_first(local_rank: int):
        torch.distributed.barrier()


+class SequentialDistributedSampler(Sampler):
+    """
+    Distributed Sampler that subsamples indices sequentially,
+    making it easier to collate all results at the end.
+
+    Even though we only use this sampler for eval and predict (no training),
+    which means that the model params won't have to be synced (i.e. will not hang
+    for synchronization even if varied number of forward passes), we still add extra
+    samples to the sampler to make it evenly divisible (like in `DistributedSampler`)
+    to make it easy to `gather` or `reduce` resulting tensors at the end of the loop.
+    """
+
+    def __init__(self, dataset, num_replicas=None, rank=None):
+        if num_replicas is None:
+            if not torch.distributed.is_available():
+                raise RuntimeError("Requires distributed package to be available")
+            num_replicas = torch.distributed.get_world_size()
+        if rank is None:
+            if not torch.distributed.is_available():
+                raise RuntimeError("Requires distributed package to be available")
+            rank = torch.distributed.get_rank()
+        self.dataset = dataset
+        self.num_replicas = num_replicas
+        self.rank = rank
+        self.num_samples = int(math.ceil(len(self.dataset) * 1.0 / self.num_replicas))
+        self.total_size = self.num_samples * self.num_replicas
+
+    def __iter__(self):
+        indices = list(range(len(self.dataset)))
+
+        # add extra samples to make it evenly divisible
+        indices += indices[: (self.total_size - len(indices))]
+        assert len(indices) == self.total_size
+
+        # subsample
+        indices = indices[self.rank * self.num_samples : (self.rank + 1) * self.num_samples]
+        assert len(indices) == self.num_samples
+
+        return iter(indices)
+
+    def __len__(self):
+        return self.num_samples
+
+
 def get_tpu_sampler(dataset: Dataset):
    if xm.xrt_world_size() <= 1:
        return RandomSampler(dataset)
@@ -142,7 +187,7 @@ class Trainer:
            prediction_loss_only:
                (Optional) in evaluation and prediction, only return the loss
        """
-        self.model = model
+        self.model = model.to(args.device)
        self.args = args
        if data_collator is not None:
            self.data_collator = data_collator
@@ -155,7 +200,7 @@ class Trainer:
        self.optimizers = optimizers
        if tb_writer is not None:
            self.tb_writer = tb_writer
-        elif is_tensorboard_available() and self.args.local_rank in [-1, 0]:
+        elif is_tensorboard_available() and self.is_world_master():
            self.tb_writer = SummaryWriter(log_dir=self.args.logging_dir)
        if not is_tensorboard_available():
            logger.warning(
@@ -170,7 +215,7 @@ class Trainer:
            )
        set_seed(self.args.seed)
        # Create output directory if needed
-        if self.is_local_master():
+        if self.is_world_master():
            os.makedirs(self.args.output_dir, exist_ok=True)
        if is_tpu_available():
            # Set an xla_device flag on the model's config.
@@ -196,9 +241,6 @@ class Trainer:
            collate_fn=self.data_collator.collate_batch,
        )

-        if is_tpu_available():
-            data_loader = pl.ParallelLoader(data_loader, [self.args.device]).per_device_loader(self.args.device)
-
        return data_loader

    def get_eval_dataloader(self, eval_dataset: Optional[Dataset] = None) -> DataLoader:
@@ -207,36 +249,42 @@ class Trainer:

        eval_dataset = eval_dataset if eval_dataset is not None else self.eval_dataset

-        sampler = get_tpu_sampler(eval_dataset) if is_tpu_available() else None
+        if is_tpu_available():
+            sampler = SequentialDistributedSampler(
+                eval_dataset, num_replicas=xm.xrt_world_size(), rank=xm.get_ordinal()
+            )
+        elif self.args.local_rank != -1:
+            sampler = SequentialDistributedSampler(eval_dataset)
+        else:
+            sampler = SequentialSampler(eval_dataset)

        data_loader = DataLoader(
            eval_dataset,
            sampler=sampler,
            batch_size=self.args.eval_batch_size,
-            shuffle=False,
            collate_fn=self.data_collator.collate_batch,
        )

-        if is_tpu_available():
-            data_loader = pl.ParallelLoader(data_loader, [self.args.device]).per_device_loader(self.args.device)
-
        return data_loader

    def get_test_dataloader(self, test_dataset: Dataset) -> DataLoader:
        # We use the same batch_size as for eval.
-        sampler = get_tpu_sampler(test_dataset) if is_tpu_available() else None
+        if is_tpu_available():
+            sampler = SequentialDistributedSampler(
+                test_dataset, num_replicas=xm.xrt_world_size(), rank=xm.get_ordinal()
+            )
+        elif self.args.local_rank != -1:
+            sampler = SequentialDistributedSampler(test_dataset)
+        else:
+            sampler = SequentialSampler(test_dataset)

        data_loader = DataLoader(
            test_dataset,
            sampler=sampler,
            batch_size=self.args.eval_batch_size,
-            shuffle=False,
            collate_fn=self.data_collator.collate_batch,
        )

-        if is_tpu_available():
-            data_loader = pl.ParallelLoader(data_loader, [self.args.device]).per_device_loader(self.args.device)
-
        return data_loader

    def get_optimizers(
@@ -293,15 +341,11 @@ class Trainer:
                self.model, log=os.getenv("WANDB_WATCH", "gradients"), log_freq=max(100, self.args.logging_steps)
            )

-    def num_examples(self, dataloader: Union[DataLoader, "pl.PerDeviceLoader"]) -> int:
+    def num_examples(self, dataloader: DataLoader) -> int:
        """
        Helper to get num of examples from a DataLoader, by accessing its Dataset.
        """
-        if is_tpu_available():
-            assert isinstance(dataloader, pl.PerDeviceLoader)
-            return len(dataloader._loader._loader.dataset)
-        else:
-            return len(dataloader.dataset)
+        return len(dataloader.dataset)

    def train(self, model_path: Optional[str] = None):
        """
@@ -331,11 +375,12 @@ class Trainer:
            and os.path.isfile(os.path.join(model_path, "scheduler.pt"))
        ):
            # Load in optimizer and scheduler states
-            optimizer.load_state_dict(torch.load(os.path.join(model_path, "optimizer.pt")))
+            optimizer.load_state_dict(
+                torch.load(os.path.join(model_path, "optimizer.pt"), map_location=self.args.device)
+            )
            scheduler.load_state_dict(torch.load(os.path.join(model_path, "scheduler.pt")))

        model = self.model
-        model.to(self.args.device)
        if self.args.fp16:
            if not is_apex_available():
                raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
@@ -404,7 +449,17 @@ class Trainer:
            epochs_trained, int(num_train_epochs), desc="Epoch", disable=not self.is_local_master()
        )
        for epoch in train_iterator:
-            epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=not self.is_local_master())
+            if isinstance(train_dataloader, DataLoader) and isinstance(train_dataloader.sampler, DistributedSampler):
+                train_dataloader.sampler.set_epoch(epoch)
+
+            if is_tpu_available():
+                parallel_loader = pl.ParallelLoader(train_dataloader, [self.args.device]).per_device_loader(
+                    self.args.device
+                )
+                epoch_iterator = tqdm(parallel_loader, desc="Iteration", disable=not self.is_local_master())
+            else:
+                epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=not self.is_local_master())
+
            for step, inputs in enumerate(epoch_iterator):

                # Skip past any already trained steps if resuming training
@@ -434,37 +489,43 @@ class Trainer:
                    self.global_step += 1
                    self.epoch = epoch + (step + 1) / len(epoch_iterator)

-                    if self.is_local_master():
-                        if (self.args.logging_steps > 0 and self.global_step % self.args.logging_steps == 0) or (
-                            self.global_step == 1 and self.args.logging_first_step
-                        ):
-                            logs: Dict[str, float] = {}
-                            logs["loss"] = (tr_loss - logging_loss) / self.args.logging_steps
-                            logs["learning_rate"] = scheduler.get_last_lr()[0]
-                            logging_loss = tr_loss
+                    if (self.args.logging_steps > 0 and self.global_step % self.args.logging_steps == 0) or (
+                        self.global_step == 1 and self.args.logging_first_step
+                    ):
+                        logs: Dict[str, float] = {}
+                        logs["loss"] = (tr_loss - logging_loss) / self.args.logging_steps
+                        logs["learning_rate"] = (
+                            scheduler.get_last_lr()[0]
+                        )
+                        logging_loss = tr_loss

-                            self._log(logs)
+                        self._log(logs)

-                            if self.args.evaluate_during_training:
-                                self.evaluate()
+                        if self.args.evaluate_during_training:
+                            self.evaluate()

-                        if self.args.save_steps > 0 and self.global_step % self.args.save_steps == 0:
-                            # In all cases (even distributed/parallel), self.model is always a reference
-                            # to the model we want to save.
-                            if hasattr(model, "module"):
-                                assert model.module is self.model
-                            else:
-                                assert model is self.model
-                            # Save model checkpoint
-                            output_dir = os.path.join(
-                                self.args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{self.global_step}"
-                            )
+                    if self.args.save_steps > 0 and self.global_step % self.args.save_steps == 0:
+                        # In all cases (even distributed/parallel), self.model is always a reference
+                        # to the model we want to save.
+                        if hasattr(model, "module"):
+                            assert model.module is self.model
+                        else:
+                            assert model is self.model
+                        # Save model checkpoint
+                        output_dir = os.path.join(self.args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{self.global_step}")

-                            self.save_model(output_dir)
+                        self.save_model(output_dir)
+
+                        if self.is_world_master():
                            self._rotate_checkpoints()
+
+                        if is_tpu_available():
+                            xm.rendezvous("saving_optimizer_states")
+                            xm.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
+                            xm.save(scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
+                        elif self.is_world_master():
                            torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
                            torch.save(scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
-                            logger.info("Saving optimizer and scheduler states to %s", output_dir)

                if self.args.max_steps > 0 and self.global_step > self.args.max_steps:
                    epoch_iterator.close()
@@ -540,11 +601,30 @@ class Trainer:
        Saving best-practices: if you use default names for the model,
        you can reload it using from_pretrained().

-        Will only save from the master process.
+        Will only save from the world_master process (unless in TPUs).
        """
-        if self.is_world_master():
+
+        if is_tpu_available():
+            self._save_tpu(output_dir)
+        elif self.is_world_master():
            self._save(output_dir)

+    def _save_tpu(self, output_dir: Optional[str] = None):
+        output_dir = output_dir if output_dir is not None else self.args.output_dir
+        logger.info("Saving model checkpoint to %s", output_dir)
+
+        if xm.is_master_ordinal():
+            os.makedirs(output_dir, exist_ok=True)
+            torch.save(self.args, os.path.join(output_dir, "training_args.bin"))
+
+        # Save a trained model and configuration using `save_pretrained()`.
+        # They can then be reloaded using `from_pretrained()`
+        if not isinstance(self.model, PreTrainedModel):
+            raise ValueError("Trainer.model appears to not be a PreTrainedModel")
+
+        xm.rendezvous("saving_checkpoint")
+        self.model.save_pretrained(output_dir)
+
    def _save(self, output_dir: Optional[str] = None):
        output_dir = output_dir if output_dir is not None else self.args.output_dir
        os.makedirs(output_dir, exist_ok=True)
@@ -627,6 +707,7 @@ class Trainer:
        In that case, this method will also return metrics, like in evaluate().
        """
        test_dataloader = self.get_test_dataloader(test_dataset)
+
        return self._prediction_loop(test_dataloader, description="Prediction")

    def _prediction_loop(
@@ -640,27 +721,29 @@ class Trainer:

        prediction_loss_only = prediction_loss_only if prediction_loss_only is not None else self.prediction_loss_only

+        model = self.model
        # multi-gpu eval
-        if self.args.n_gpu > 1 and not isinstance(self.model, torch.nn.DataParallel):
-            model = torch.nn.DataParallel(self.model)
+        if self.args.n_gpu > 1:
+            model = torch.nn.DataParallel(model)
        else:
            model = self.model
-        model.to(self.args.device)
+        # Note: in torch.distributed mode, there's no point in wrapping the model
+        # inside a DistributedDataParallel as we'll be under `no_grad` anyways.

-        if is_tpu_available():
-            batch_size = dataloader._loader._loader.batch_size
-        else:
-            batch_size = dataloader.batch_size
+        batch_size = dataloader.batch_size
        logger.info("***** Running %s *****", description)
        logger.info("  Num examples = %d", self.num_examples(dataloader))
        logger.info("  Batch size = %d", batch_size)
        eval_losses: List[float] = []
-        preds: np.ndarray = None
-        label_ids: np.ndarray = None
+        preds: torch.Tensor = None
+        label_ids: torch.Tensor = None
        model.eval()

+        if is_tpu_available():
+            dataloader = pl.ParallelLoader(dataloader, [self.args.device]).per_device_loader(self.args.device)
+
        for inputs in tqdm(dataloader, desc=description):
-            has_labels = any(inputs.get(k) is not None for k in ["labels", "masked_lm_labels"])
+            has_labels = any(inputs.get(k) is not None for k in ["labels", "lm_labels", "masked_lm_labels"])

            for k, v in inputs.items():
                inputs[k] = v.to(self.args.device)
@@ -675,19 +758,33 @@ class Trainer:

            if not prediction_loss_only:
                if preds is None:
-                    preds = logits.detach().cpu().numpy()
+                    preds = logits.detach()
                else:
-                    preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
+                    preds = torch.cat((preds, logits.detach()), dim=0)
                if inputs.get("labels") is not None:
                    if label_ids is None:
-                        label_ids = inputs["labels"].detach().cpu().numpy()
+                        label_ids = inputs["labels"].detach()
                    else:
-                        label_ids = np.append(label_ids, inputs["labels"].detach().cpu().numpy(), axis=0)
+                        label_ids = torch.cat((label_ids, inputs["labels"].detach()), dim=0)

-        if is_tpu_available():
+        if self.args.local_rank != -1:
+            # In distributed mode, concatenate all results from all nodes:
+            if preds is not None:
+                preds = self.distributed_concat(preds, num_total_examples=self.num_examples(dataloader))
+            if label_ids is not None:
+                label_ids = self.distributed_concat(label_ids, num_total_examples=self.num_examples(dataloader))
+        elif is_tpu_available():
            # tpu-comment: Get all predictions and labels from all worker shards of eval dataset
-            preds = xm.mesh_reduce("eval_preds", preds, np.concatenate)
-            label_ids = xm.mesh_reduce("eval_out_label_ids", label_ids, np.concatenate)
+            if preds is not None:
+                preds = xm.mesh_reduce("eval_preds", preds, torch.cat)
+            if label_ids is not None:
+                label_ids = xm.mesh_reduce("eval_label_ids", label_ids, torch.cat)
+
+        # Finally, turn the aggregated tensors into numpy arrays.
+        if preds is not None:
+            preds = preds.cpu().numpy()
+        if label_ids is not None:
+            label_ids = label_ids.cpu().numpy()

        if self.compute_metrics is not None and preds is not None and label_ids is not None:
            metrics = self.compute_metrics(EvalPrediction(predictions=preds, label_ids=label_ids))
@@ -702,3 +799,15 @@ class Trainer:
                metrics[f"eval_{key}"] = metrics.pop(key)

        return PredictionOutput(predictions=preds, label_ids=label_ids, metrics=metrics)
+
+    def distributed_concat(self, tensor: torch.Tensor, num_total_examples: int) -> torch.Tensor:
+        assert self.args.local_rank != -1
+
+        output_tensors = [tensor.clone() for _ in range(torch.distributed.get_world_size())]
+        torch.distributed.all_gather(output_tensors, tensor)
+
+        concat = torch.cat(output_tensors, dim=0)
+
+        # truncate the dummy elements added by SequentialDistributedSampler
+        output = concat[:num_total_examples]
+        return output
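
Note on how `SequentialDistributedSampler` and `distributed_concat` fit together in the trainer.py hunks above: the sketch below reproduces only the index arithmetic in plain Python, with no `torch` dependency. The helper names `shard_indices` and `gather_and_truncate` are hypothetical and do not exist in the library, and the list concatenation stands in for `torch.distributed.all_gather` followed by `torch.cat`. It assumes the dataset is at least as long as the padding that is needed.

```python
import math
from typing import List


def shard_indices(dataset_len: int, num_replicas: int, rank: int) -> List[int]:
    """Return the contiguous slice of dataset indices that one rank evaluates."""
    num_samples = int(math.ceil(dataset_len / num_replicas))
    total_size = num_samples * num_replicas
    indices = list(range(dataset_len))
    # Pad by wrapping around to the start so every rank gets the same count.
    indices += indices[: total_size - len(indices)]
    return indices[rank * num_samples : (rank + 1) * num_samples]


def gather_and_truncate(shards: List[List[int]], dataset_len: int) -> List[int]:
    """Concatenate per-rank results in rank order, then drop the padded tail."""
    concat = [x for shard in shards for x in shard]
    return concat[:dataset_len]


# 10 examples over 4 replicas: each rank gets 3, so 2 padded entries are added.
shards = [shard_indices(10, 4, rank) for rank in range(4)]
print(shards)
print(gather_and_truncate(shards, 10))
```

Because each rank takes a contiguous slice and the padding sits at the very end, concatenating in rank order restores the original dataset order, so a single truncation to the dataset length removes the duplicates.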
--- a/src/transformers/trainer_tf.py
+++ b/src/transformers/trainer_tf.py
@@ -141,7 +141,7 @@ class TFTrainer:
                self.optimizer = tf.keras.optimizers.get(
                    {"class_name": self.args.optimizer_name, "config": {"learning_rate": self.args.learning_rate}}
                )
-        logger.info("Created an/a {} optimizer".format(self.optimizer))
+        logger.info("Created an/a {} optimizer".format(self.args.optimizer_name))

    def _create_checkpoint_manager(self, max_to_keep: int = 5, load_model: bool = True) -> None:
        """
@@ -335,12 +335,8 @@ class TFTrainer:
            gradient / tf.cast(gradient_scale, gradient.dtype) for gradient in self.gradient_accumulator.gradients
        ]
        gradients = [(tf.clip_by_value(grad, -self.args.max_grad_norm, self.args.max_grad_norm)) for grad in gradients]
-        vars = self.model.trainable_variables

-        if self.args.mode in ["token-classification", "question-answering"]:
-            vars = [var for var in self.model.trainable_variables if "pooler" not in var.name]
-
-        self.optimizer.apply_gradients(list(zip(gradients, vars)))
+        self.optimizer.apply_gradients(list(zip(gradients, self.model.trainable_variables)))
        self.gradient_accumulator.reset()

    def _accumulate_next_gradients(self):
@@ -375,12 +371,10 @@ class TFTrainer:
    def _forward(self, features, labels):
        """Forwards a training example and accumulates the gradients."""
        per_example_loss, _ = self._run_model(features, labels, True)
-        vars = self.model.trainable_variables
-
-        if self.args.mode in ["token-classification", "question-answering"]:
-            vars = [var for var in self.model.trainable_variables if "pooler" not in var.name]
-
-        gradients = self.optimizer.get_gradients(per_example_loss, vars)
+        gradients = tf.gradients(per_example_loss, self.model.trainable_variables)
+        gradients = [
+            g if g is not None else tf.zeros_like(v) for g, v in zip(gradients, self.model.trainable_variables)
+        ]

        self.gradient_accumulator(gradients)

--- a/tests/test_modeling_auto.py
+++ b/tests/test_modeling_auto.py
@@ -80,8 +80,9 @@ class AutoModelTest(unittest.TestCase):
            model, loading_info = AutoModelForPreTraining.from_pretrained(model_name, output_loading_info=True)
            self.assertIsNotNone(model)
            self.assertIsInstance(model, BertForPreTraining)
-            for value in loading_info.values():
-                self.assertEqual(len(value), 0)
+            for key, value in loading_info.items():
+                # Only one value should not be initialized and in the missing keys.
+                self.assertEqual(len(value), 1 if key == "missing_keys" else 0)

    @slow
    def test_lmhead_model_from_pretrained(self):
--- a/tests/test_modeling_bart.py
+++ b/tests/test_modeling_bart.py
@@ -231,7 +231,7 @@ class BartTranslationTests(unittest.TestCase):
        """Only load the model if needed."""
        if self._model is None:
            model = BartForConditionalGeneration.from_pretrained("mbart-large-en-ro")
-            self._model = model
+            self._model = model.to(torch_device)
        return self._model

    @slow
@@ -257,10 +257,7 @@ class BartTranslationTests(unittest.TestCase):
            )
        }
        translated_tokens = model.generate(input_ids=inputs["input_ids"].to(torch_device), num_beams=5,)
-        decoded = [
-            self.tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False)
-            for g in translated_tokens
-        ]
+        decoded = self.tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)
        self.assertEqual(expected_translation_romanian, decoded[0])

    def test_mbart_enro_config(self):
@@ -576,11 +573,13 @@ class BartModelIntegrationTests(unittest.TestCase):

        PGE_ARTICLE = """ PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."""
        EXPECTED_SUMMARY = "California's largest power company has begun shutting off power to tens of thousands of homes and businesses in the state."
-        dct = tok.batch_encode_plus([PGE_ARTICLE], max_length=1024, pad_to_max_length=True, return_tensors="pt",)
+        dct = tok.batch_encode_plus([PGE_ARTICLE], max_length=1024, pad_to_max_length=True, return_tensors="pt",).to(
+            torch_device
+        )

        hypotheses_batch = model.generate(
-            input_ids=dct["input_ids"].to(torch_device),
-            attention_mask=dct["attention_mask"].to(torch_device),
+            input_ids=dct["input_ids"],
+            attention_mask=dct["attention_mask"],
            num_beams=2,
            max_length=62,
            min_length=11,
@@ -590,9 +589,7 @@ class BartModelIntegrationTests(unittest.TestCase):
            decoder_start_token_id=model.config.eos_token_id,
        )

-        decoded = [
-            tok.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in hypotheses_batch
-        ]
+        decoded = tok.batch_decode(hypotheses_batch, skip_special_tokens=True,)
        self.assertEqual(EXPECTED_SUMMARY, decoded[0])

    def test_xsum_config_generation_params(self):
--- a/tests/test_modeling_camembert.py
+++ b/tests/test_modeling_camembert.py
@@ -30,6 +30,7 @@ class CamembertModelIntegrationTest(unittest.TestCase):
    @slow
    def test_output_embeds_base_model(self):
        model = CamembertModel.from_pretrained("camembert-base")
+        model.to(torch_device)

        input_ids = torch.tensor(
            [[5, 121, 11, 660, 16, 730, 25543, 110, 83, 6]], device=torch_device, dtype=torch.long,
--- a/tests/test_modeling_common.py
+++ b/tests/test_modeling_common.py
@@ -23,7 +23,7 @@ from typing import List

 from transformers import is_torch_available

-from .utils import require_torch, slow, torch_device
+from .utils import require_multigpu, require_torch, slow, torch_device


 if is_torch_available():
@@ -758,6 +758,31 @@ class ModelTesterMixin:
                        return True
        return False

+    @require_multigpu
+    def test_multigpu_data_parallel_forward(self):
+        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
+
+        # some params shouldn't be scattered by nn.DataParallel
+        # so just remove them if they are present.
+        blacklist_non_batched_params = ["head_mask"]
+        for k in blacklist_non_batched_params:
+            inputs_dict.pop(k, None)
+
+        # move input tensors to cuda:0
+        for k, v in inputs_dict.items():
+            if torch.is_tensor(v):
+                inputs_dict[k] = v.to(0)
+
+        for model_class in self.all_model_classes:
+            model = model_class(config=config)
+            model.to(0)
+            model.eval()
+
+            # Wrap model in nn.DataParallel
+            model = torch.nn.DataParallel(model)
+            with torch.no_grad():
+                _ = model(**inputs_dict)
+

 global_rng = random.Random()
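
Note on the `blacklist_non_batched_params` step in `test_multigpu_data_parallel_forward`: `nn.DataParallel` splits tensor inputs along dimension 0 across devices, which only makes sense for inputs whose first dimension is the batch. The following is a simplified, torch-free sketch of that behaviour using plain lists; `chunk_along_dim0` and `scatter_inputs` are hypothetical helpers written for this illustration and are not part of PyTorch or transformers.

```python
import math


def chunk_along_dim0(rows, num_devices):
    """Split a list of rows into contiguous chunks, one per simulated device."""
    size = math.ceil(len(rows) / num_devices)
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def scatter_inputs(inputs_dict, num_devices):
    """Hand each simulated device one chunk of every input (assumed behaviour)."""
    chunks = {k: chunk_along_dim0(v, num_devices) for k, v in inputs_dict.items()}
    return [{k: c[d] for k, c in chunks.items()} for d in range(num_devices)]


batch_size, num_layers, num_heads = 4, 2, 3
inputs_dict = {
    # first dimension is the batch: safe to split
    "input_ids": [[1, 2, 3] for _ in range(batch_size)],
    # first dimension is the layer index, not the batch: splitting is wrong
    "head_mask": [[1.0] * num_heads for _ in range(num_layers)],
}

# Without the blacklist, each device would see only part of the layers.
with_mask = scatter_inputs(inputs_dict, num_devices=2)

# Same removal step as in the test above.
blacklist_non_batched_params = ["head_mask"]
for k in blacklist_non_batched_params:
    inputs_dict.pop(k, None)
without_mask = scatter_inputs(inputs_dict, num_devices=2)
```

In this toy setup every simulated device receives a `head_mask` covering one layer instead of two, which is why the test drops that key before calling the wrapped model.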

--- a/tests/test_modeling_ctrl.py
+++ b/tests/test_modeling_ctrl.py
@@ -41,7 +41,7 @@ class CTRLModelTest(ModelTesterMixin, unittest.TestCase):
        def __init__(
            self,
            parent,
-            batch_size=13,
+            batch_size=14,
            seq_length=7,
            is_training=True,
            use_token_type_ids=True,
@@ -219,6 +219,7 @@ class CTRLModelLanguageGenerationTest(unittest.TestCase):
    @slow
    def test_lm_generate_ctrl(self):
        model = CTRLLMHeadModel.from_pretrained("ctrl")
+        model.to(torch_device)
        input_ids = torch.tensor(
            [[11859, 0, 1611, 8]], dtype=torch.long, device=torch_device
        )  # Legal the president is
--- a/tests/test_modeling_electra.py
+++ b/tests/test_modeling_electra.py
@@ -30,6 +30,7 @@ if is_torch_available():
        ElectraForMaskedLM,
        ElectraForTokenClassification,
        ElectraForPreTraining,
+        ElectraForSequenceClassification,
    )
    from transformers.modeling_electra import ELECTRA_PRETRAINED_MODEL_ARCHIVE_MAP

@@ -242,6 +243,31 @@ class ElectraModelTest(ModelTesterMixin, unittest.TestCase):
            self.parent.assertListEqual(list(result["logits"].size()), [self.batch_size, self.seq_length])
            self.check_loss_output(result)

+        def create_and_check_electra_for_sequence_classification(
+            self,
+            config,
+            input_ids,
+            token_type_ids,
+            input_mask,
+            sequence_labels,
+            token_labels,
+            choice_labels,
+            fake_token_labels,
+        ):
+            config.num_labels = self.num_labels
+            model = ElectraForSequenceClassification(config)
+            model.to(torch_device)
+            model.eval()
+            loss, logits = model(
+                input_ids, attention_mask=input_mask, token_type_ids=token_type_ids, labels=sequence_labels
+            )
+            result = {
+                "loss": loss,
+                "logits": logits,
+            }
+            self.parent.assertListEqual(list(result["logits"].size()), [self.batch_size, self.num_labels])
+            self.check_loss_output(result)
+
        def prepare_config_and_inputs_for_common(self):
            config_and_inputs = self.prepare_config_and_inputs()
            (
@@ -280,6 +306,10 @@ class ElectraModelTest(ModelTesterMixin, unittest.TestCase):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
        self.model_tester.create_and_check_electra_for_pretraining(*config_and_inputs)

+    def test_for_sequence_classification(self):
+        config_and_inputs = self.model_tester.prepare_config_and_inputs()
+        self.model_tester.create_and_check_electra_for_sequence_classification(*config_and_inputs)
+
    @slow
    def test_model_from_pretrained(self):
        for model_name in list(ELECTRA_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]:
--- a/tests/test_modeling_encoder_decoder.py
+++ b/tests/test_modeling_encoder_decoder.py
@@ -329,5 +329,5 @@ class EncoderDecoderModelTest(unittest.TestCase):

    @slow
    def test_real_bert_model_from_pretrained(self):
-        model = EncoderDecoderModel.from_pretrained("bert-base-uncased", "bert-base-uncased")
+        model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")
        self.assertIsNotNone(model)
--- a/tests/test_modeling_gpt2.py
+++ b/tests/test_modeling_gpt2.py
@@ -46,7 +46,7 @@ class GPT2ModelTest(ModelTesterMixin, unittest.TestCase):
        def __init__(
            self,
            parent,
-            batch_size=13,
+            batch_size=14,
            seq_length=7,
            is_training=True,
            use_token_type_ids=True,
@@ -343,6 +343,7 @@ class GPT2ModelLanguageGenerationTest(unittest.TestCase):
    @slow
    def test_lm_generate_gpt2(self):
        model = GPT2LMHeadModel.from_pretrained("gpt2")
+        model.to(torch_device)
        input_ids = torch.tensor([[464, 3290]], dtype=torch.long, device=torch_device)  # The dog
        expected_output_ids = [
            464,
@@ -372,6 +373,7 @@ class GPT2ModelLanguageGenerationTest(unittest.TestCase):
    @slow
    def test_lm_generate_distilgpt2(self):
        model = GPT2LMHeadModel.from_pretrained("distilgpt2")
+        model.to(torch_device)
        input_ids = torch.tensor([[464, 1893]], dtype=torch.long, device=torch_device)  # The president
        expected_output_ids = [
            464,
--- a/tests/test_modeling_longformer.py
+++ b/tests/test_modeling_longformer.py
@@ -0,0 +1,253 @@
+# coding=utf-8
+# Copyright 2018 The Google AI Language Team Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+import unittest
+
+from transformers import is_torch_available
+
+from .test_configuration_common import ConfigTester
+from .test_modeling_common import ModelTesterMixin, ids_tensor
+from .utils import require_torch, slow, torch_device
+
+
+if is_torch_available():
+    import torch
+    from transformers import (
+        LongformerConfig,
+        LongformerModel,
+        LongformerForMaskedLM,
+    )
+
+
+class LongformerModelTester(object):
+    def __init__(
+        self,
+        parent,
+        batch_size=13,
+        seq_length=7,
+        is_training=True,
+        use_input_mask=True,
+        use_token_type_ids=True,
+        use_labels=True,
+        vocab_size=99,
+        hidden_size=32,
+        num_hidden_layers=5,
+        num_attention_heads=4,
+        intermediate_size=37,
+        hidden_act="gelu",
+        hidden_dropout_prob=0.1,
+        attention_probs_dropout_prob=0.1,
+        max_position_embeddings=512,
+        type_vocab_size=16,
+        type_sequence_label_size=2,
+        initializer_range=0.02,
+        num_labels=3,
+        num_choices=4,
+        scope=None,
+        attention_window=4,
+    ):
+        self.parent = parent
+        self.batch_size = batch_size
+        self.seq_length = seq_length
+        self.is_training = is_training
+        self.use_input_mask = use_input_mask
+        self.use_token_type_ids = use_token_type_ids
+        self.use_labels = use_labels
+        self.vocab_size = vocab_size
+        self.hidden_size = hidden_size
+        self.num_hidden_layers = num_hidden_layers
+        self.num_attention_heads = num_attention_heads
+        self.intermediate_size = intermediate_size
+        self.hidden_act = hidden_act
+        self.hidden_dropout_prob = hidden_dropout_prob
+        self.attention_probs_dropout_prob = attention_probs_dropout_prob
+        self.max_position_embeddings = max_position_embeddings
+        self.type_vocab_size = type_vocab_size
+        self.type_sequence_label_size = type_sequence_label_size
+        self.initializer_range = initializer_range
+        self.num_labels = num_labels
+        self.num_choices = num_choices
+        self.scope = scope
+        self.attention_window = attention_window
+
+        # `ModelTesterMixin.test_attention_outputs` is expecting attention tensors to be of size
+        # [num_attention_heads, encoder_seq_length, encoder_key_length], but LongformerSelfAttention
+        # returns attention of shape [num_attention_heads, encoder_seq_length, self.attention_window + 1]
+        # because its local attention only attends to `self.attention_window + 1` locations
+        self.key_length = self.attention_window + 1
+
+        # because of padding, `encoder_seq_length` is different from `seq_length`. Relevant for
+        # the `test_attention_outputs` and `test_hidden_states_output` tests
+        self.encoder_seq_length = (
+            self.seq_length + (self.attention_window - self.seq_length % self.attention_window) % self.attention_window
+        )
+
+    def prepare_config_and_inputs(self):
+        input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
+
+        input_mask = None
+        if self.use_input_mask:
+            input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2)
+
+        token_type_ids = None
+        if self.use_token_type_ids:
+            token_type_ids = ids_tensor([self.batch_size, self.seq_length], self.type_vocab_size)
+
+        sequence_labels = None
+        token_labels = None
+        choice_labels = None
+        if self.use_labels:
+            sequence_labels = ids_tensor([self.batch_size], self.type_sequence_label_size)
+            token_labels = ids_tensor([self.batch_size, self.seq_length], self.num_labels)
+            choice_labels = ids_tensor([self.batch_size], self.num_choices)
+
+        config = LongformerConfig(
+            vocab_size=self.vocab_size,
+            hidden_size=self.hidden_size,
+            num_hidden_layers=self.num_hidden_layers,
+            num_attention_heads=self.num_attention_heads,
+            intermediate_size=self.intermediate_size,
+            hidden_act=self.hidden_act,
+            hidden_dropout_prob=self.hidden_dropout_prob,
+            attention_probs_dropout_prob=self.attention_probs_dropout_prob,
+            max_position_embeddings=self.max_position_embeddings,
+            type_vocab_size=self.type_vocab_size,
+            initializer_range=self.initializer_range,
+            attention_window=self.attention_window,
+        )
+
+        return config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
+
+    def check_loss_output(self, result):
+        self.parent.assertListEqual(list(result["loss"].size()), [])
+
+    def create_and_check_longformer_model(
+        self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
+    ):
+        model = LongformerModel(config=config)
+        model.to(torch_device)
+        model.eval()
+        sequence_output, pooled_output = model(input_ids, attention_mask=input_mask, token_type_ids=token_type_ids)
+        sequence_output, pooled_output = model(input_ids, token_type_ids=token_type_ids)
+        sequence_output, pooled_output = model(input_ids)
+
+        result = {
+            "sequence_output": sequence_output,
+            "pooled_output": pooled_output,
+        }
+        self.parent.assertListEqual(
+            list(result["sequence_output"].size()), [self.batch_size, self.seq_length, self.hidden_size]
+        )
+        self.parent.assertListEqual(list(result["pooled_output"].size()), [self.batch_size, self.hidden_size])
+
+    def create_and_check_longformer_for_masked_lm(
+        self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
+    ):
+        model = LongformerForMaskedLM(config=config)
+        model.to(torch_device)
+        model.eval()
+        loss, prediction_scores = model(
+            input_ids, attention_mask=input_mask, token_type_ids=token_type_ids, masked_lm_labels=token_labels
+        )
+        result = {
+            "loss": loss,
+            "prediction_scores": prediction_scores,
+        }
+        self.parent.assertListEqual(
+            list(result["prediction_scores"].size()), [self.batch_size, self.seq_length, self.vocab_size]
+        )
+        self.check_loss_output(result)
+
+    def prepare_config_and_inputs_for_common(self):
+        config_and_inputs = self.prepare_config_and_inputs()
+        (
+            config,
+            input_ids,
+            token_type_ids,
+            input_mask,
+            sequence_labels,
+            token_labels,
+            choice_labels,
+        ) = config_and_inputs
+        inputs_dict = {"input_ids": input_ids, "token_type_ids": token_type_ids, "attention_mask": input_mask}
+        return config, inputs_dict
+
+
+@require_torch
+class LongformerModelTest(ModelTesterMixin, unittest.TestCase):
+    test_pruning = False  # pruning is not supported
+    test_headmasking = False  # head masking is not supported
+    test_torchscript = False
+
+    all_model_classes = (LongformerForMaskedLM, LongformerModel) if is_torch_available() else ()
+
+    def setUp(self):
+        self.model_tester = LongformerModelTester(self)
+        self.config_tester = ConfigTester(self, config_class=LongformerConfig, hidden_size=37)
+
+    def test_config(self):
+        self.config_tester.run_common_tests()
+
+    def test_longformer_model(self):
+        config_and_inputs = self.model_tester.prepare_config_and_inputs()
+        self.model_tester.create_and_check_longformer_model(*config_and_inputs)
+
+    def test_longformer_for_masked_lm(self):
+        config_and_inputs = self.model_tester.prepare_config_and_inputs()
+        self.model_tester.create_and_check_longformer_for_masked_lm(*config_and_inputs)
+
+
+class LongformerModelIntegrationTest(unittest.TestCase):
+    @slow
+    def test_inference_no_head(self):
+        model = LongformerModel.from_pretrained("longformer-base-4096")
+        model.to(torch_device)
+
+        # 'Hello world! ' repeated 1000 times
+        input_ids = torch.tensor(
+            [[0] + [20920, 232, 328, 1437] * 1000 + [2]], dtype=torch.long, device=torch_device
+        )  # long input
+
+        attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device)
+        attention_mask[:, [1, 4, 21]] = 2  # Set global attention on a few random positions
+
+        output = model(input_ids, attention_mask=attention_mask)[0]
+
+        expected_output_sum = torch.tensor(74585.8594, device=torch_device)
+        expected_output_mean = torch.tensor(0.0243, device=torch_device)
+        self.assertTrue(torch.allclose(output.sum(), expected_output_sum, atol=1e-4))
+        self.assertTrue(torch.allclose(output.mean(), expected_output_mean, atol=1e-4))
+
+    @slow
+    def test_inference_masked_lm(self):
+        model = LongformerForMaskedLM.from_pretrained("longformer-base-4096")
+        model.to(torch_device)
+
+        # 'Hello world! ' repeated 1000 times
+        input_ids = torch.tensor(
+            [[0] + [20920, 232, 328, 1437] * 1000 + [2]], dtype=torch.long, device=torch_device
+        )  # long input
+
+        loss, prediction_scores = model(input_ids, masked_lm_labels=input_ids)
+
+        expected_loss = torch.tensor(0.0620, device=torch_device)
+        expected_prediction_scores_sum = torch.tensor(-6.1599e08, device=torch_device)
+        expected_prediction_scores_mean = torch.tensor(-3.0622, device=torch_device)
+        input_ids = input_ids.to(torch_device)
+
+        self.assertTrue(torch.allclose(loss, expected_loss, atol=1e-4))
+        self.assertTrue(torch.allclose(prediction_scores.sum(), expected_prediction_scores_sum, atol=1e-4))
+        self.assertTrue(torch.allclose(prediction_scores.mean(), expected_prediction_scores_mean, atol=1e-4))
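
Note on `encoder_seq_length` in `LongformerModelTester.__init__`: the expression rounds `seq_length` up to the next multiple of `attention_window`. A minimal standalone sketch of the same arithmetic, where `padded_seq_length` is a hypothetical helper name used only for this illustration:

```python
def padded_seq_length(seq_length, attention_window):
    """Round seq_length up to the next multiple of attention_window."""
    # The outer modulo keeps the padding at 0 when seq_length is already a multiple.
    padding = (attention_window - seq_length % attention_window) % attention_window
    return seq_length + padding


# Tester defaults from the file above: seq_length=7, attention_window=4.
encoder_seq_length = padded_seq_length(7, 4)

# Local attention attends to attention_window + 1 positions.
key_length = 4 + 1
```

With the tester defaults this gives an `encoder_seq_length` of 8 and a `key_length` of 5, the sizes that the comments say `test_attention_outputs` and `test_hidden_states_output` rely on.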
--- a/tests/test_modeling_marian.py
+++ b/tests/test_modeling_marian.py
@@ -129,11 +129,6 @@ class TestMarian_EN_DE_More(MarianIntegrationTest):
        max_indices = logits.argmax(-1)
        self.tokenizer.batch_decode(max_indices)

-    def test_tokenizer_equivalence(self):
-        batch = self.tokenizer.prepare_translation_batch(["I am a small frog"]).to(torch_device)
-        expected = [38, 121, 14, 697, 38848, 0]
-        self.assertListEqual(expected, batch.input_ids[0].tolist())
-
    def test_unk_support(self):
        t = self.tokenizer
        ids = t.prepare_translation_batch(["||"]).to(torch_device).input_ids[0].tolist()
--- a/tests/test_modeling_openai.py
+++ b/tests/test_modeling_openai.py
@@ -227,6 +227,7 @@ class OPENAIGPTModelLanguageGenerationTest(unittest.TestCase):
    @slow
    def test_lm_generate_openai_gpt(self):
        model = OpenAIGPTLMHeadModel.from_pretrained("openai-gpt")
+        model.to(torch_device)
        input_ids = torch.tensor([[481, 4735, 544]], dtype=torch.long, device=torch_device)  # the president is
        expected_output_ids = [
            481,
--- a/tests/test_modeling_reformer.py
+++ b/tests/test_modeling_reformer.py
@@ -19,7 +19,7 @@ from transformers import is_torch_available

 from .test_configuration_common import ConfigTester
 from .test_modeling_common import ModelTesterMixin, floats_tensor, ids_tensor
-from .utils import require_torch, slow, torch_device
+from .utils import require_multigpu, require_torch, slow, torch_device


 if is_torch_available():
@@ -448,9 +448,14 @@ class ReformerTesterMixin:
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
        self.model_tester.create_and_check_reformer_model_fp16_generate(*config_and_inputs)

+    @require_multigpu
+    def test_multigpu_data_parallel_forward(self):
+        # Opt-out of this test.
+        pass
+

@require_torch
-class ReformerLocalAttnModelTest(ModelTesterMixin, ReformerTesterMixin, unittest.TestCase):
+class ReformerLocalAttnModelTest(ReformerTesterMixin, ModelTesterMixin, unittest.TestCase):
    all_model_classes = (ReformerModel, ReformerModelWithLMHead) if is_torch_available() else ()
    all_generative_model_classes = (ReformerModelWithLMHead,) if is_torch_available() else ()
    test_pruning = False
@@ -504,7 +509,7 @@ class ReformerLocalAttnModelTest(ModelTesterMixin, ReformerTesterMixin, unittest


@require_torch
-class ReformerLSHAttnModelTest(ModelTesterMixin, unittest.TestCase, ReformerTesterMixin):
+class ReformerLSHAttnModelTest(ReformerTesterMixin, ModelTesterMixin, unittest.TestCase):
    all_model_classes = (ReformerModel, ReformerModelWithLMHead) if is_torch_available() else ()
    all_generative_model_classes = (ReformerModelWithLMHead,) if is_torch_available() else ()
    test_pruning = False
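
Note on the base-class reordering above: the likely reason `ReformerTesterMixin` now comes before `ModelTesterMixin` is Python's method resolution order, which searches base classes left to right. The opt-out `test_multigpu_data_parallel_forward` only takes effect if its mixin is listed first. A minimal sketch with hypothetical stand-in classes (none of these names exist in the test suite):

```python
class CommonTests:
    """Stand-in for a shared mixin that defines the real test."""

    def test_multigpu(self):
        return "common test ran"


class OptOutMixin:
    """Stand-in for a model-specific mixin that opts out of the test."""

    def test_multigpu(self):
        return "opted out"


class OldOrder(CommonTests, OptOutMixin):
    """Shared mixin first: its test wins and the opt-out is ignored."""


class NewOrder(OptOutMixin, CommonTests):
    """Opt-out mixin first: its override wins."""


old_result = OldOrder().test_multigpu()
new_result = NewOrder().test_multigpu()
```

Here `old_result` is `"common test ran"` and `new_result` is `"opted out"`, which mirrors why the order of the mixins matters in the two Reformer test classes.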
--- a/tests/test_modeling_t5.py
+++ b/tests/test_modeling_t5.py
@@ -304,6 +304,16 @@ class T5ModelTest(ModelTesterMixin, unittest.TestCase):
            output_with_past_cache = model.generate(input_ids[:1], num_beams=2, max_length=5, do_sample=True)
            self.parent.assertTrue(torch.all(output_with_past_cache == output_without_past_cache))

+        def create_and_check_t5_model_fp16_forward(
+            self, config, input_ids, decoder_input_ids, attention_mask, decoder_attention_mask, lm_labels,
+        ):
+            model = T5Model(config=config)
+            model.to(torch_device)
+            model.half()
+            model.eval()
+            output = model(input_ids, decoder_input_ids=input_ids, attention_mask=attention_mask)[0]
+            self.parent.assertFalse(torch.isnan(output).any().item())
+
        def prepare_config_and_inputs_for_common(self):
            config_and_inputs = self.prepare_config_and_inputs()
            (
@@ -355,6 +365,11 @@ class T5ModelTest(ModelTesterMixin, unittest.TestCase):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
        self.model_tester.create_t5_and_check_t5_generate_with_past_key_value_states(*config_and_inputs)

+    @unittest.skipIf(torch_device == "cpu", "Can't do half precision")
+    def test_t5_model_fp16_forward(self):
+        config_and_inputs = self.model_tester.prepare_config_and_inputs()
+        self.model_tester.create_and_check_t5_model_fp16_forward(*config_and_inputs)
+
    @slow
    def test_model_from_pretrained(self):
        for model_name in list(T5_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]:
@@ -429,6 +444,7 @@ class T5ModelIntegrationTests(unittest.TestCase):
        )

        input_ids = tok.encode(model.config.prefix + original_input, return_tensors="pt")
+        input_ids = input_ids.to(torch_device)

        output = model.generate(
            input_ids=input_ids,
@@ -456,6 +472,7 @@ class T5ModelIntegrationTests(unittest.TestCase):
        expected_translation = "Cette section d'images provenant de l'enregistrement infrarouge effectué par le télescope Spitzer montre un « portrait familial » de générations innombrables de étoiles : les plus anciennes sont observées sous forme de pointes bleues, alors que les « nouveau-nés » de couleur rose dans la salle des accouchements doivent être plus difficiles "

        input_ids = tok.encode(model.config.prefix + original_input, return_tensors="pt")
+        input_ids = input_ids.to(torch_device)

        output = model.generate(
            input_ids=input_ids,
@@ -483,6 +500,7 @@ class T5ModelIntegrationTests(unittest.TestCase):
        expected_translation = "Taco Bell a declarat că intenţionează să adauge 2 000 de locaţii în SUA până în 2022."

        input_ids = tok.encode(model.config.prefix + original_input, return_tensors="pt")
+        input_ids = input_ids.to(torch_device)

        output = model.generate(
            input_ids=input_ids,
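
Note on the `unittest.skipIf` guard used for `test_t5_model_fp16_forward`: the decorator marks the test as skipped, rather than failed, when its condition is true. A minimal standard-library sketch, assuming a module-level `torch_device` string like the one imported from `.utils` (here hard-coded to `"cpu"` for illustration; `HalfPrecisionTest` and `run_suite` are hypothetical names):

```python
import io
import unittest

torch_device = "cpu"  # assumption: stands in for the value exported by tests/utils.py


class HalfPrecisionTest(unittest.TestCase):
    @unittest.skipIf(torch_device == "cpu", "half precision is not exercised on CPU")
    def test_fp16_forward(self):
        # Never reached when torch_device == "cpu".
        self.fail("should have been skipped on CPU")

    def test_always_runs(self):
        self.assertEqual(1 + 1, 2)


def run_suite():
    """Run the test case quietly and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(HalfPrecisionTest)
    runner = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0)
    return runner.run(suite)


result = run_suite()
```

The skipped test is reported in `result.skipped` with its reason string, and the run still counts as successful.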
--- a/tests/test_modeling_transfo_xl.py
+++ b/tests/test_modeling_transfo_xl.py
@@ -21,7 +21,7 @@ from transformers import is_torch_available

 from .test_configuration_common import ConfigTester
 from .test_modeling_common import ModelTesterMixin, ids_tensor
-from .utils import require_torch, slow, torch_device
+from .utils import require_multigpu, require_torch, slow, torch_device


 if is_torch_available():
@@ -43,7 +43,7 @@ class TransfoXLModelTest(ModelTesterMixin, unittest.TestCase):
        def __init__(
            self,
            parent,
-            batch_size=13,
+            batch_size=14,
            seq_length=7,
            mem_len=30,
            clamp_len=15,
@@ -207,6 +207,11 @@ class TransfoXLModelTest(ModelTesterMixin, unittest.TestCase):
        output_result = self.model_tester.create_transfo_xl_lm_head(*config_and_inputs)
        self.model_tester.check_transfo_xl_lm_head_output(output_result)

+    @require_multigpu
+    def test_multigpu_data_parallel_forward(self):
+        # Opt-out of this test.
+        pass
+
    @slow
    def test_model_from_pretrained(self):
        for model_name in list(TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]:
@@ -218,6 +223,7 @@ class TransfoXLModelLanguageGenerationTest(unittest.TestCase):
    @slow
    def test_lm_generate_transfo_xl_wt103(self):
        model = TransfoXLLMHeadModel.from_pretrained("transfo-xl-wt103")
+        model.to(torch_device)
        input_ids = torch.tensor(
            [
                [
--- a/tests/test_modeling_xlm.py
+++ b/tests/test_modeling_xlm.py
@@ -434,6 +434,7 @@ class XLMModelLanguageGenerationTest(unittest.TestCase):
    @slow
    def test_lm_generate_xlm_mlm_en_2048(self):
        model = XLMWithLMHeadModel.from_pretrained("xlm-mlm-en-2048")
+        model.to(torch_device)
        input_ids = torch.tensor([[14, 447]], dtype=torch.long, device=torch_device)  # the president
        expected_output_ids = [
            14,
@@ -459,4 +460,4 @@ class XLMModelLanguageGenerationTest(unittest.TestCase):
        ]  # the president the president the president the president the president the president the president the president the president the president
        # TODO(PVP): this and other input_ids I tried for generation give pretty bad results. Not sure why. Model might just not be made for auto-regressive inference
        output_ids = model.generate(input_ids, do_sample=False)
-        self.assertListEqual(output_ids[0].numpy().tolist(), expected_output_ids)
+        self.assertListEqual(output_ids[0].cpu().numpy().tolist(), expected_output_ids)
Author	SHA1	Message	Date
Lysandre	10d72390c0	Revert #4446 Since it introduces a new dependency	2020-05-22 10:49:45 -04:00
Lysandre	e0db6bbd65	Release: v2.10.0	2020-05-22 10:37:44 -04:00
Frankie Liuzzi	bd6e301832	added functionality for electra classification head (#4257 ) * added functionality for electra classification head * unneeded dropout * Test ELECTRA for sequence classification * Style Co-authored-by: Frankie <frankie@frase.io> Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>	2020-05-22 09:48:21 -04:00
Lysandre	a086527727	Unused Union should not be imported	2020-05-21 09:42:47 -04:00
Lysandre Debut	9d2ce253de	TPU hangs when saving optimizer/scheduler (#4467 ) * TPU hangs when saving optimizer/scheduler * Style * ParallelLoader is not a DataLoader * Style * Addressing @julien-c's comments	2020-05-21 09:18:27 -04:00
Zhangyx	49296533ca	Adds predict stage for glue tasks, and generate result files which can be submitted to gluebenchmark.com (#4463 ) * Adds predict stage for glue tasks, and generate result files which could be submitted to gluebenchmark.com website. * Use Split enum + always output the label name Co-authored-by: Julien Chaumond <chaumond@gmail.com>	2020-05-21 09:17:44 -04:00
Tobias Lee	271bedb485	[examples] fix no grad in second pruning in run_bertology (#4479 ) * fix no grad in second pruning and typo * fix prune heads attention mismatch problem * fix * fix * fix * run make style * run make style	2020-05-21 09:17:03 -04:00
Julien Chaumond	865d4d595e	[ci] Close #4481	2020-05-20 18:27:42 -04:00
Julien Chaumond	a3af8e86cb	Update test_trainer_distributed.py	2020-05-20 18:26:51 -04:00
Cola	eacea530c1	🚨 Remove warning of deprecation (#4477 ) Remove warning of deprecated overload of addcdiv_ Fix #4451	2020-05-20 16:48:29 -04:00
Julien Plu	fa2fbed3e5	Better None gradients handling in TF Trainer (#4469 ) * Better None gradients handling * Apply Style * Apply Style	2020-05-20 16:46:21 -04:00
Oliver Åstrand	e708bb75bf	Correct TF formatting to exclude LayerNorms from weight decay (#4448 ) * Exclude LayerNorms from weight decay * Include both formats of layer norm	2020-05-20 16:45:59 -04:00
Rens	49c06132df	pass on tokenizer to pipeline (#4489 )	2020-05-20 22:23:21 +02:00
Nathan Cooper	cacb654c7f	Add Fine-tune DialoGPT on new datasets notebook (#4473 )	2020-05-20 16:17:52 -04:00
Timo Moeller	30a09f3827	Adjust german bert model card, add new model card (#4488 )	2020-05-20 16:08:29 -04:00
Lysandre Debut	14cb5b35fa	Fix slow gpu tests lysandre (#4487 ) * There is one missing key in BERT * Correct device for CamemBERT model * RoBERTa tokenization adding prefix space * Style	2020-05-20 11:59:45 -04:00
Manuel Romero	6dc52c78d8	Create README.md (#4482 )	2020-05-20 09:45:50 -04:00
Manuel Romero	ed5456daf4	Model card for RuPERTa-base fine-tuned for NER (#4466 )	2020-05-20 09:45:24 -04:00
Oleksandr Bushkovskyi	c76450e20c	Model card for Tereveni-AI/gpt2-124M-uk-fiction (#4470 ) Create model card for "Tereveni-AI/gpt2-124M-uk-fiction" model	2020-05-20 09:44:26 -04:00
Hu Xu	9907dc523a	add BERT trained from review corpus. (#4405 ) * add model_cards for BERT trained on reviews. * add link to repository. * refine README.md for each review model	2020-05-20 09:42:35 -04:00
Sam Shleifer	efbc1c5a9d	[MarianTokenizer] implement save_vocabulary and other common methods (#4389 )	2020-05-19 19:45:49 -04:00
Sam Shleifer	956c4c4eb4	[gpu slow tests] fix mbart-large-enro gpu tests (#4472 )	2020-05-19 19:45:31 -04:00
Patrick von Platen	48c3a70b4e	[Longformer] Docs and clean API (#4464 ) * add longformer docs * improve docs	2020-05-19 21:52:36 +02:00
Patrick von Platen	aa925a52fa	[Tests, GPU, SLOW] fix a bunch of GPU hardcoded tests in Pytorch (#4468 ) * fix gpu slow tests in pytorch * change model to device syntax	2020-05-19 21:35:04 +02:00
Suraj Patil	5856999a9f	add T5 fine-tuning notebook [Community notebooks] (#4462 ) * add T5 fine-tuning notebook [Community notebooks] * Update README.md Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>	2020-05-19 18:26:28 +02:00
Sam Shleifer	07dd7c2fd8	[cleanup] test_tokenization_common.py (#4390 )	2020-05-19 10:46:55 -04:00
Iz Beltagy	8f1d047148	Longformer (#4352 ) * first commit * bug fixes * better examples * undo padding * remove wrong VOCAB_FILES_NAMES * License * make style * make isort happy * unit tests * integration test * make `black` happy by undoing `isort` changes!! * lint * no need for the padding value * batch_size not bsz * remove unused type casting * seqlen not seq_len * staticmethod * `bert` selfattention instead of `n2` * uint8 instead of bool + lints * pad inputs_embeds using embeddings not a constant * black * unit test with padding * fix unit tests * remove redundant unit test * upload model weights * resolve todo * simpler _mask_invalid_locations without lru_cache + backward compatible masked_fill_ * increase unittest coverage	2020-05-19 16:04:43 +02:00
Girishkumar	31eedff5a0	Refactored the README.md file (#4427 )	2020-05-19 09:56:24 -04:00
Shaoyen	384f0eb2f9	Map optimizer to correct device after loading from checkpoint. (#4403 ) * Map optimizer to correct device after loading from checkpoint. * Make style test pass Co-authored-by: Julien Chaumond <chaumond@gmail.com>	2020-05-18 23:16:05 -04:00
Julien Chaumond	bf14ef75f1	[Trainer] move model to device before setting optimizer (#4450 )	2020-05-18 23:13:33 -04:00
Julien Chaumond	5e7fe8b585	Distributed eval: SequentialDistributedSampler + gather all results (#4243 ) * Distributed eval: SequentialDistributedSampler + gather all results * For consistency only write to disk from world_master Close https://github.com/huggingface/transformers/issues/4272 * Working distributed eval * Hook into scripts * Fix #3721 again * TPU.mesh_reduce: stay in tensor space Thanks @jysohn23 * Just a small comment * whitespace * torch.hub: pip install packaging * Add test scenarii	2020-05-18 22:02:39 -04:00
Julien Chaumond	4c06893610	Fix nn.DataParallel compatibility in PyTorch 1.5 (#4300 ) * Test case for #3936 * multigpu tests pass on pytorch 1.4.0 * Fixup * multigpu tests pass on pytorch 1.5.0 * Update src/transformers/modeling_utils.py * Update src/transformers/modeling_utils.py * rename multigpu to require_multigpu * mode doc	2020-05-18 20:34:50 -04:00
Rakesh Chada	9de4afa897	Make get_last_lr in trainer backward compatible (#4446 ) * makes fetching last learning rate in trainer backward compatible * split comment to multiple lines * fixes black styling issue * uses version to create a more explicit logic	2020-05-18 20:17:36 -04:00
Stefan Dumitrescu	42e8fbfc51	Added model cards for Romanian BERT models (#4437 ) * Create README.md * Create README.md * Update README.md * Update README.md * Apply suggestions from code review Co-authored-by: Julien Chaumond <chaumond@gmail.com>	2020-05-18 18:48:56 -04:00
Oliver Guhr	54065d68b8	added model card for german-sentiment-bert (#4435 )	2020-05-18 18:44:41 -04:00
Martin Müller	e28b7e2311	Create README.md (#4433 )	2020-05-18 18:41:34 -04:00
sy-wada	09b933f19d	Update README.md (model_card) (#4424 ) - add a citation. - modify the table of the BLUE benchmark. The table of the first version was not displayed correctly on https://huggingface.co/seiya/oubiobert-base-uncased. Could you please confirm that this fix will allow you to display it correctly?	2020-05-18 18:18:17 -04:00
Manuel Romero	235777ccc9	Modify example of usage (#4413 ) I followed the google example of usage for its electra small model but i have seen it is not meaningful, so i created a better example	2020-05-18 18:17:33 -04:00
Suraj Patil	9ddd3a6548	add model card for t5-base-squad (#4409 ) * add model card for t5-base-squad * Update model_cards/valhalla/t5-base-squad/README.md Co-authored-by: Julien Chaumond <chaumond@gmail.com>	2020-05-18 18:17:14 -04:00
HUSEIN ZOLKEPLI	c5aa114392	Added README huseinzol05/t5-base-bahasa-cased (#4377 ) * add bert bahasa readme * update readme * update readme * added xlnet * added tiny-bert and fix xlnet readme * added albert base * added albert tiny * added electra model * added gpt2 117m bahasa readme * added gpt2 345m bahasa readme * added t5-base-bahasa * fix readme * Update model_cards/huseinzol05/t5-base-bahasa-cased/README.md Co-authored-by: Julien Chaumond <chaumond@gmail.com>	2020-05-18 18:10:23 -04:00
Funtowicz Morgan	ca4a3f4da9	Adding optimizations block from ONNXRuntime. (#4431 ) * Adding optimizations block from ONNXRuntime. * Turn off external data format by default for PyTorch export. * Correct the way use_external_format is passed through the cmdline args.	2020-05-18 20:32:33 +02:00
Patrick von Platen	24538df919	[Community notebooks] General notebooks (#4441 ) * Update README.md * Update README.md * Update README.md * Update README.md	2020-05-18 20:23:57 +02:00
Sam Shleifer	a699525d25	[test_pipelines] Mark tests > 10s @slow, small speedups (#4421 )	2020-05-18 12:23:21 -04:00
Boris Dayma	d9ece8233d	fix(run_language_modeling): use arg overwrite_cache (#4407 )	2020-05-18 11:37:35 -04:00
Patrick von Platen	d39bf0ac2d	better naming in tf t5 (#4401 )	2020-05-18 11:34:00 -04:00
Patrick von Platen	590adb130b	improve docstring (#4422 )	2020-05-18 11:31:35 -04:00
Patrick von Platen	026a5d0888	[T5 fp16] Fix fp16 in T5 (#4436 ) * fix fp16 in t5 * make style * refactor invert_attention_mask fn * fix typo	2020-05-18 17:25:58 +02:00
Soham Chatterjee	fa6113f9a0	Fixed spelling of training (#4416 )	2020-05-18 11:23:29 -04:00
Julien Chaumond	757baee846	Fix un-prefixed f-string see https://github.com/huggingface/transformers/pull/4367#discussion_r426356693 Hat/tip @girishponkiya	2020-05-18 11:20:46 -04:00
Patrick von Platen	a27c795908	fix (#4419 )	2020-05-18 15:51:40 +02:00
Funtowicz Morgan	31c799a0c9	Tag onnx export tests as slow (#4432 )	2020-05-18 09:24:41 -04:00
Mehrad Moradshahi	8581a670e3	[MbartTokenizer] save to sentencepiece.bpe.model (#4335 )	2020-05-18 08:54:04 -04:00
Lorenzo Ampil	18d233d525	Allow the creation of "entity groups" for NerPipeline #3548 (#3957 ) * Add index to be returned by NerPipeline to allow for the creation of * Add entity groups * Convert entity list to dict * Add entity to entity_group_disagg after updating entity groups * Change 'group' parameter to 'grouped_entities' * Add unit tests for grouped NER pipeline case * Correct variable name typo for NER_FINETUNED_MODELS * Sync grouped tests to recent test updates	2020-05-17 09:25:17 +02:00
Julien Chaumond	3e0f062106	Fix addcmul_	2020-05-15 17:44:17 -04:00
Julien Chaumond	fc2a4c88ce	Fix: one more try	2020-05-15 17:38:48 -04:00
Julien Chaumond	55bda52555	Same fix for `addcmul_`	2020-05-15 17:23:48 -04:00
Julien Chaumond	ad02c961c6	Fix UserWarning: This overload of add_ is deprecated in pytorch==1.5.0	2020-05-15 17:09:11 -04:00
Julien Chaumond	15550ce0d1	[skip ci] remove local rank	2020-05-15 17:08:38 -04:00
Nikita	62427d0815	rerun notebook 02-transformers (#4341 )	2020-05-15 10:33:08 -04:00
Jared T Nielsen	34706ba050	Allow for None gradients in GradientAccumulator. (#4372 )	2020-05-15 09:52:00 -04:00
Lysandre Debut	edf9ac11d4	Should return overflowing information for the log (#4385 )	2020-05-15 09:49:11 -04:00
Funtowicz Morgan	b908f2e9dd	Attempt to unpin torch version for Github Action. (#4384 )	2020-05-15 15:47:15 +02:00
Julien Chaumond	af2e6bf87c	[examples] Streamline doc	2020-05-14 20:34:31 -04:00
Lysandre Debut	7defc6670f	p_mask in SQuAD pre-processing (#4049 ) * Better p_mask building * Addressing @mfuntowicz comments	2020-05-14 17:07:52 -04:00
Morgan Funtowicz	84894974bd	Updated ONNX notebook link in README.	2020-05-14 22:40:59 +02:00
Funtowicz Morgan	db0076a9df	Conversion script to export transformers models to ONNX IR. (#4253 ) * Added generic ONNX conversion script for PyTorch model. * WIP initial TF support. * TensorFlow/Keras ONNX export working. * Print framework version info * Add possibility to check the model is correctly loading on ONNX runtime. * Remove quantization option. * Specify ONNX opset version when exporting. * Formatting. * Remove unused imports. * Make functions more generally reusable from other parts of the code. * isort happy. * flake happy * Export only feature-extraction for now * Correctly check inputs order / filter before export. * Removed task variable * Fix invalid args call in load_graph_from_args. * Fix invalid args call in convert. * Fix invalid args call in infer_shapes. * Raise exception and catch in caller function instead of exit. * Add 04-onnx-export.ipynb notebook * More WIP on the notebook * Remove unused imports * Simplify & remove unused constants. * Export with constant_folding in PyTorch * Let's try to put function args in the right order this time ... * Disable external_data_format temporarily * ONNX notebook draft ready. * Updated notebooks charts + wording * Correct error while exporting last chart in notebook. * Addressing @LysandreJik comment. * Set ONNX opset to 11 as default value. * Set opset param mandatory * Added ONNX export unittests * Quality. * flake8 happy * Add keras2onnx dependency on extras["tf"] * Pin keras2onnx on github master to v1.6.5 * Second attempt. * Third attempt. * Use the right repo URL this time ... * Do the same for onnxconverter-common * Added keras2onnx and onnxconverter-common to 1.7.0 to support TF2.2 * Correct commit hash. * Addressing PR review: Optimizations are enabled by default. * Addressing PR review: small changes in the notebook * setup.py comment about keras2onnx versioning.	2020-05-14 16:35:52 -04:00
Suraj Patil	2d05480174	Fix trainer evaluation (#4363 ) * fix loss calculation in evaluation * fix evaluation on TPU when prediction_loss_only is True	2020-05-14 14:39:44 -04:00
Savaş Yıldırım	035678efdb	Create README.md (#4359 ) * Create README.md * Update model_cards/savasy/bert-base-turkish-squad/README.md Co-authored-by: Julien Chaumond <chaumond@gmail.com>	2020-05-14 14:07:32 -04:00
sy-wada	b9c9e05381	Create README.md (#4357 )	2020-05-14 14:06:10 -04:00
Sam Shleifer	9535bf1977	Tokenizer.batch_decode convenience method (#4159 )	2020-05-14 13:50:47 -04:00
Sam Shleifer	7822cd38a0	[tests] make pipelines tests faster with smaller models (#4238 ) covers torch and tf. Also fixes a failing @slow test	2020-05-14 13:36:02 -04:00
Julien Chaumond	448c467256	Fix: unpin flake8 and fix cs errors (#4367 ) * Fix: unpin flake8 and fix cs errors * Ok we still need to quote those	2020-05-14 13:14:26 -04:00
Julien Chaumond	c547f15a17	Use Filelock to ensure distributed barriers see context in https://github.com/huggingface/transformers/pull/4223	2020-05-14 11:58:32 -04:00
Julien Chaumond	015f7812ed	[ci skip] Pin isort	2020-05-14 10:12:18 -04:00
Lysandre Debut	ef46ccb05c	TPU needs a rendezvous (#4339 )	2020-05-14 08:59:52 -04:00
Viktor Alm	94cb73c2d2	Add image and metadata (#4345 ) Unfortunately I accidentally orphaned my other PR	2020-05-13 20:05:15 -04:00
Manuel Romero	a0eebdc404	Add link to W&B to see whole training logs (#4348 )	2020-05-13 20:04:57 -04:00