Release: v2.9.1

[Marian Fixes] prevent predicting pad_token_id before softmax, support language codes, name multilingual models (#4290)
[Docs, Notebook] Include generation pipeline (#4295)
2020-05-13 17:38:50 -04:00 · 2020-05-13 17:29:41 -04:00 · 2020-05-13 14:24:08 -04:00 · 2020-05-13 10:22:03 -04:00 · 2020-05-13 09:22:31 -04:00 · 2020-05-13 14:32:57 +02:00
120 changed files with 3810 additions and 550 deletions
--- a/.github/workflows/self-push.yml
+++ b/.github/workflows/self-push.yml
@@ -1,9 +1,13 @@
 name: Self-hosted runner (push)

 on: 
-  # push:
-  #   branches:
-  #     - master
+  push:
+    branches:
+      - master
+    paths: 
+      - "src/**"
+      - "tests/**"
+      - ".github/**"
  # pull_request:
  repository_dispatch:

@@ -31,8 +35,8 @@ jobs:
    - name: Install dependencies
      run: |
        source .env/bin/activate
-        pip install .[sklearn,tf,torch,testing]
-        pip uninstall -y tensorflow
+        pip install torch==1.4.0
+        pip install .[sklearn,testing]

    - name: Are GPUs recognized by our DL frameworks
      run: |
--- a/README.md
+++ b/README.md
@@ -164,8 +164,9 @@ At some point in the future, you'll be able to seamlessly move from pre-training
 17. **[ELECTRA](https://huggingface.co/transformers/model_doc/electra.html)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
 18. **[DialoGPT](https://huggingface.co/transformers/model_doc/dialogpt.html)** (from Microsoft Research) released with the paper [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://arxiv.org/abs/1911.00536) by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
 19. **[Reformer](https://huggingface.co/transformers/model_doc/reformer.html)** (from Google Research) released with the paper [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
-20. **[Other community models](https://huggingface.co/models)**, contributed by the [community](https://huggingface.co/users).
-21. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedbacks before starting your PR.
+20. **[MarianMT](https://huggingface.co/transformers/model_doc/marian.html)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team.
+21. **[Other community models](https://huggingface.co/models)**, contributed by the [community](https://huggingface.co/users).
+22. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedbacks before starting your PR.

 These implementations have been tested on several datasets (see the example scripts) and should match the performances of the original implementations (e.g. ~93 F1 on SQuAD for BERT Whole-Word-Masking, ~88 F1 on RocStories for OpenAI GPT, ~18.3 perplexity on WikiText 103 for Transformer-XL, ~0.916 Peason R coefficient on STS-B for XLNet). You can find more details on the performances in the Examples section of the [documentation](https://huggingface.co/transformers/examples.html).

@@ -414,7 +415,7 @@ Training with these hyper-parameters gave us the following results:
 This example code fine-tunes BERT on the SQuAD dataset using distributed training on 8 V100 GPUs and Bert Whole Word Masking uncased model to reach a F1 > 93 on SQuAD:

 ```bash
-python -m torch.distributed.launch --nproc_per_node=8 ./examples/run_squad.py \
+python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answering/run_squad.py \
    --model_type bert \
    --model_name_or_path bert-large-uncased-whole-word-masking \
    --do_train \
@@ -447,7 +448,7 @@ The generation script includes the [tricks](https://github.com/rusiaaman/XLNet-g
 Here is how to run the script with the small version of OpenAI GPT-2 model:

 ```shell
-python ./examples/run_generation.py \
+python ./examples/text-generation/run_generation.py \
    --model_type=gpt2 \
    --length=20 \
    --model_name_or_path=gpt2 \
@@ -455,7 +456,7 @@ python ./examples/run_generation.py \

 and from the Salesforce CTRL model:
 ```shell
-python ./examples/run_generation.py \
+python ./examples/text-generation/run_generation.py \
    --model_type=ctrl \
    --length=20 \
    --model_name_or_path=ctrl \
--- a/docs/README.md
+++ b/docs/README.md
@@ -67,3 +67,131 @@ It should build the static app that will be available under `/docs/_build/html`

 Accepted files are reStructuredText (.rst) and Markdown (.md). Create a file with its extension and put it
 in the source directory. You can then link it to the toc-tree by putting the filename without the extension.
+
+## Writing Documentation - Specification
+
+The `huggingface/transformers` documentation follows the
+[Google documentation](https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html) style. It is
+mostly written in ReStructuredText 
+([Sphinx simple documentation](https://www.sphinx-doc.org/en/master/usage/restructuredtext/index.html), 
+[Sourceforge complete documentation](https://docutils.sourceforge.io/docs/ref/rst/restructuredtext.html))
+
+### Adding a new section
+
+A section is a page held in the `Notes` toc-tree on the documentation. Adding a new section is done in two steps:
+
+- Add a new file under `./source`. This file can either be ReStructuredText (.rst) or Markdown (.md).
+- Link that file in `./source/index.rst` on the correct toc-tree.
+
+### Adding a new model
+
+When adding a new model:
+ 
+- Create a file `xxx.rst` under `./source/model_doc`. 
+- Link that file in `./source/index.rst` on the `model_doc` toc-tree.
+- Write a short overview of the model:
+    - Overview with paper & authors
+    - Paper abstract
+    - Tips and tricks and how to use it best
+- Add the classes that should be linked in the model. This generally includes the configuration, the tokenizer, and
+  every model of that class (the base model, alongside models with additional heads), both in PyTorch and TensorFlow.
+  The order is generally: 
+    - Configuration, 
+    - Tokenizer
+    - PyTorch base model
+    - PyTorch head models
+    - TensorFlow base model
+    - TensorFlow head models
+
+These classes should be added using the RST syntax. Usually as follows:
+```
+XXXConfig
+~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.XXXConfig
+    :members:
+```
+
+This will include every public method of the configuration. If for some reason you wish for a method not to be displayed
+in the documentation, you can do so by specifying which methods should be in the docs:
+
+```
+XXXTokenizer
+~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.XXXTokenizer
+    :members: build_inputs_with_special_tokens, get_special_tokens_mask,
+        create_token_type_ids_from_sequences, save_vocabulary
+
+```
+
+### Writing source documentation
+
+Values that should be put in `code` should either be surrounded by double backticks: \`\`like so\`\` or be written as an object
+using the :obj: syntax: :obj:\`like so\`.
+
+When mentioning a class, it is recommended to use the :class: syntax as the mentioned class will be automatically
+linked by Sphinx: :class:\`transformers.XXXClass\`
+
+When mentioning a function, it is recommended to use the :func: syntax as the mentioned method will be automatically
+linked by Sphinx: :func:\`transformers.XXXClass.method\`
+
+Links should be done as so (note the double underscore at the end): \`text for the link <./local-link-or-global-link#loc>\`__
+
+#### Defining arguments in a method
+
+Arguments should be defined with the `Args:` prefix, followed by a line return and an indentation. 
+The argument should be followed by its type, with its shape if it is a tensor, and a line return.
+Another indentation is necessary before writing the description of the argument.
+
+Here's an example showcasing everything so far:
+
+```
+    Args:
+        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
+            Indices of input sequence tokens in the vocabulary.
+
+            Indices can be obtained using :class:`transformers.AlbertTokenizer`.
+            See :func:`transformers.PreTrainedTokenizer.encode` and
+            :func:`transformers.PreTrainedTokenizer.encode_plus` for details.
+
+            `What are input IDs? <../glossary.html#input-ids>`__
+```
+
+#### Writing a multi-line code block 
+
+Multi-line code blocks can be useful for displaying examples. They are done like so:
+
+```
+Example::
+
+    # first line of code
+    # second line
+    # etc
+```
+
+The `Example` string at the beginning can be replaced by anything as long as there are two colons following it.
+
+#### Writing a return block
+
+The return value should be defined with the `Returns:` prefix, followed by a line return and an indentation. 
+The first line should be the type of the return, followed by a line return. No need to indent further for the elements
+building the return.
+
+Here's an example for tuple return, comprising several objects:
+
+```
+    Returns:
+        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:
+        loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
+            Total loss as the sum of the masked language modeling loss and the next sequence prediction (classification) loss.
+        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)
+            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
+```
+
+Here's an example for a single value return:
+
+```
+    Returns:
+        A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
+```
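Reviewer note on the docstring specification added above: the sketch below puts the `Args:`, `Returns:` and `Example::` conventions together in one docstring. The function name `special_tokens_mask` and its parameters are hypothetical and chosen only for illustration; this is not a function from the library.

```python
from typing import List


def special_tokens_mask(token_ids: List[int], special_ids: List[int]) -> List[int]:
    """
    Flag which positions of a sequence hold special tokens.

    Args:
        token_ids (:obj:`List[int]`):
            Indices of input sequence tokens in the vocabulary.
        special_ids (:obj:`List[int]`):
            Indices that should be treated as special tokens.

    Returns:
        A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.

    Example::

        mask = special_tokens_mask([101, 7592, 102], [101, 102])
    """
    # Hypothetical helper: membership in a set decides the flag for each position.
    special = set(special_ids)
    return [1 if token_id in special else 0 for token_id in token_ids]
```

Each argument sits one indentation level under `Args:`, with its type on the same line and its description indented once more, while the single-value return needs no further indentation.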
--- a/docs/source/bertology.rst
+++ b/docs/source/bertology.rst
@@ -15,4 +15,4 @@ In order to help this new field develop, we have included a few additional featu
 * accessing all the attention weights for each head of BERT/GPT/GPT-2,
 * retrieving heads output values and gradients to be able to compute head importance score and prune head as explained in https://arxiv.org/abs/1905.10650.

-To help you understand and use these features, we have added a specific example script: `bertology.py <https://github.com/huggingface/transformers/blob/master/examples/run_bertology.py>`_ while extract information and prune a model pre-trained on GLUE.
+To help you understand and use these features, we have added a specific example script: `bertology.py <https://github.com/huggingface/transformers/blob/master/examples/bertology/run_bertology.py>`_ which extracts information from and prunes a model pre-trained on GLUE.
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -26,7 +26,7 @@ author = u'huggingface'
 # The short X.Y version
 version = u''
 # The full version, including alpha/beta/rc tags
-release = u'2.9.0'
+release = u'2.9.1'


 # -- General configuration ---------------------------------------------------
--- a/docs/source/converting_tensorflow_models.rst
+++ b/docs/source/converting_tensorflow_models.rst
@@ -12,7 +12,7 @@ A command-line interface is provided to convert original Bert/GPT/GPT-2/Transfor
 BERT
 ^^^^

-You can convert any TensorFlow checkpoint for BERT (in particular `the pre-trained models released by Google <https://github.com/google-research/bert#pre-trained-models>`_\ ) in a PyTorch save file by using the `convert_tf_checkpoint_to_pytorch.py <https://github.com/huggingface/transformers/blob/master/transformers/convert_tf_checkpoint_to_pytorch.py>`_ script.
+You can convert any TensorFlow checkpoint for BERT (in particular `the pre-trained models released by Google <https://github.com/google-research/bert#pre-trained-models>`_\ ) in a PyTorch save file by using the `convert_bert_original_tf_checkpoint_to_pytorch.py <https://github.com/huggingface/transformers/blob/master/src/transformers/convert_bert_original_tf_checkpoint_to_pytorch.py>`_ script.

 This CLI takes as input a TensorFlow checkpoint (three files starting with ``bert_model.ckpt``\ ) and the associated configuration file (\ ``bert_config.json``\ ), and creates a PyTorch model for this configuration, loads the weights from the TensorFlow checkpoint in the PyTorch model and saves the resulting model in a standard PyTorch save file that can be imported using ``torch.load()`` (see examples in `run_bert_extract_features.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_extract_features.py>`_\ , `run_bert_classifier.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_classifier.py>`_ and `run_bert_squad.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_squad.py>`_\ ).

@@ -33,6 +33,26 @@ Here is an example of the conversion process for a pre-trained ``BERT-Base Uncas

 You can download Google's pre-trained models for the conversion `here <https://github.com/google-research/bert#pre-trained-models>`__.

+ALBERT
+^^^^^^
+
+Convert TensorFlow model checkpoints of ALBERT to PyTorch using the `convert_albert_original_tf_checkpoint_to_pytorch.py <https://github.com/huggingface/transformers/blob/master/src/transformers/convert_albert_original_tf_checkpoint_to_pytorch.py>`_ script.
+
+The CLI takes as input a TensorFlow checkpoint (three files starting with ``model.ckpt-best``\ ) and the accompanying configuration file (\ ``albert_config.json``\ ), then creates and saves a PyTorch model. To run this conversion you will need to have TensorFlow and PyTorch installed.
+
+Here is an example of the conversion process for the pre-trained ``ALBERT Base`` model:
+
+.. code-block:: shell
+
+   export ALBERT_BASE_DIR=/path/to/albert/albert_base
+
+   transformers-cli convert --model_type albert \
+     --tf_checkpoint $ALBERT_BASE_DIR/model.ckpt-best \
+     --config $ALBERT_BASE_DIR/albert_config.json \
+     --pytorch_dump_output $ALBERT_BASE_DIR/pytorch_model.bin
+
+You can download Google's pre-trained models for the conversion `here <https://github.com/google-research/albert#pre-trained-models>`__.
+
 OpenAI GPT
 ^^^^^^^^^^

--- a/docs/source/examples.md
+++ b/docs/source/examples.md
@@ -23,13 +23,13 @@ pip install -r ./examples/requirements.txt
 | [GLUE](#glue) | Examples running BERT/XLM/XLNet/RoBERTa on the 9 GLUE tasks. Examples feature distributed training as well as half-precision. |
 | [SQuAD](#squad) | Using BERT/RoBERTa/XLNet/XLM for question answering, examples with distributed training. |
 | [Multiple Choice](#multiple-choice) | Examples running BERT/XLNet/RoBERTa on the SWAG/RACE/ARC tasks. |
-| [Named Entity Recognition](https://github.com/huggingface/transformers/tree/master/examples/ner) | Using BERT for Named Entity Recognition (NER) on the CoNLL 2003 dataset, examples with distributed training. |
+| [Named Entity Recognition](https://github.com/huggingface/transformers/tree/master/examples/token-classification) | Using BERT for Named Entity Recognition (NER) on the CoNLL 2003 dataset, examples with distributed training. |
 | [XNLI](#xnli) | Examples running BERT/XLM on the XNLI benchmark. |
 | [Adversarial evaluation of model performances](#adversarial-evaluation-of-model-performances) | Testing a model with adversarial evaluation of natural language inference on the Heuristic Analysis for NLI Systems (HANS) dataset (McCoy et al., 2019.) |

 ## TensorFlow 2.0 Bert models on GLUE

-Based on the script [`run_tf_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/run_tf_glue.py).
+Based on the script [`run_tf_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_tf_glue.py).

 Fine-tuning the library TensorFlow 2.0 Bert model for sequence classification on the  MRPC task of the GLUE benchmark: [General Language Understanding Evaluation](https://gluebenchmark.com/).

@@ -93,7 +93,7 @@ python run_glue_tpu.py \

 ## Language model training

-Based on the script [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py).
+Based on the script [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py).

 Fine-tuning (or training from scratch) the library models for language modeling on a text dataset for GPT, GPT-2, BERT and RoBERTa (DistilBERT
 to be added soon). GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa
@@ -155,7 +155,7 @@ python run_language_modeling.py \

 ## Language generation

-Based on the script [`run_generation.py`](https://github.com/huggingface/transformers/blob/master/examples/run_generation.py).
+Based on the script [`run_generation.py`](https://github.com/huggingface/transformers/blob/master/examples/text-generation/run_generation.py).

 Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL, XLNet, CTRL.
 A similar script is used for our official demo [Write With Transfomer](https://transformer.huggingface.co), where you
@@ -364,7 +364,7 @@ Download [swag](https://github.com/rowanz/swagaf/tree/master/data) data
 ```bash
 #training on 4 tesla V100(16GB) GPUS
 export SWAG_DIR=/path/to/swag_data_dir
-python ./examples/run_multiple_choice.py \
+python ./examples/multiple-choice/run_multiple_choice.py \
 --task_name swag \
 --model_name_or_path roberta-base \
 --do_train \
@@ -388,7 +388,7 @@ eval_loss = 0.44457291918821606

 ## SQuAD

-Based on the script [`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/run_squad.py).
+Based on the script [`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py).

 #### Fine-tuning BERT on SQuAD1.0

@@ -437,7 +437,7 @@ exact_match = 81.22
 Here is an example using distributed training on 8 V100 GPUs and Bert Whole Word Masking uncased model to reach a F1 > 93 on SQuAD1.1:

 ```bash
-python -m torch.distributed.launch --nproc_per_node=8 ./examples/run_squad.py \
+python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answering/run_squad.py \
    --model_type bert \
    --model_name_or_path bert-large-uncased-whole-word-masking \
    --do_train \
@@ -548,7 +548,7 @@ Larger batch size may improve the performance while costing more memory.

 ## XNLI

-Based on the script [`run_xnli.py`](https://github.com/huggingface/transformers/blob/master/examples/run_xnli.py).
+Based on the script [`run_xnli.py`](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_xnli.py).

 [XNLI](https://www.nyu.edu/projects/bowman/xnli/) is crowd-sourced dataset based on [MultiNLI](http://www.nyu.edu/projects/bowman/multinli/). It is an evaluation benchmark for cross-lingual text representations. Pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-resource language such as English and low-resource languages such as Swahili).

--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -108,3 +108,4 @@ The library currently contains PyTorch and Tensorflow implementations, pre-train
    model_doc/electra
    model_doc/dialogpt
    model_doc/reformer
+    model_doc/marian
--- a/docs/source/main_classes/processors.rst
+++ b/docs/source/main_classes/processors.rst
@@ -74,7 +74,7 @@ This library hosts the processor to load the XNLI data:
 Please note that since the gold labels are available on the test set, evaluation is performed on the test set.

 An example using these processors is given in the
-`run_xnli.py <https://github.com/huggingface/pytorch-transformers/blob/master/examples/run_xnli.py>`__ script.
+`run_xnli.py <https://github.com/huggingface/pytorch-transformers/blob/master/examples/text-classification/run_xnli.py>`__ script.


 SQuAD
@@ -150,4 +150,4 @@ Example::


 Another example using these processors is given in the
-`run_squad.py <https://github.com/huggingface/transformers/blob/master/examples/run_squad.py>`__ script.
+`run_squad.py <https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py>`__ script.
--- a/docs/source/migration.md
+++ b/docs/source/migration.md
@@ -1,5 +1,18 @@
-# Migrating from pytorch-pretrained-bert
+# Migrating from previous packages

+## Migrating from pytorch-transformers to transformers
+
+Here is a quick summary of what you should take care of when migrating from `pytorch-transformers` to `transformers`.
+
+### Positional order of some models' keywords inputs (`attention_mask`, `token_type_ids`...) changed
+
+To be able to use Torchscript (see #1010, #1204 and #1195) the specific order of some models **keywords inputs** (`attention_mask`, `token_type_ids`...) has been changed.
+
+If you used to call the models with keyword names for keyword arguments, e.g. `model(inputs_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)`, this should not cause any change.
+
+If you used to call the models with positional inputs for keyword arguments, e.g. `model(inputs_ids, attention_mask, token_type_ids)`, you may have to double check the exact order of input arguments.
+
+## Migrating from pytorch-pretrained-bert

 Here is a quick summary of what you should take care of when migrating from `pytorch-pretrained-bert` to `transformers`

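Reviewer note on the keyword-order change described above: a minimal sketch of why keyword calls are unaffected by a reordering while positional calls are not. `forward_old` and `forward_new` are hypothetical stand-ins, and the old ordering shown is an assumption for illustration; neither is the real signature of any model in the library.

```python
# Hypothetical signatures before and after a reordering of keyword inputs.
def forward_old(input_ids, token_type_ids=None, attention_mask=None):
    return {"token_type_ids": token_type_ids, "attention_mask": attention_mask}


def forward_new(input_ids, attention_mask=None, token_type_ids=None):
    return {"token_type_ids": token_type_ids, "attention_mask": attention_mask}


input_ids, mask, segments = [5, 6, 7], [1, 1, 1], [0, 0, 1]

# Keyword calls bind by name, so both signatures receive the same values.
by_keyword_old = forward_old(input_ids, attention_mask=mask, token_type_ids=segments)
by_keyword_new = forward_new(input_ids, attention_mask=mask, token_type_ids=segments)

# Positional calls bind by position, so the same call silently swaps the inputs.
by_position_old = forward_old(input_ids, segments, mask)
by_position_new = forward_new(input_ids, segments, mask)
```

With the positional call, `forward_new` receives the segment ids as its attention mask without raising any error, which is why the order is worth double checking.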
--- a/docs/source/model_doc/bart.rst
+++ b/docs/source/model_doc/bart.rst
@@ -1,6 +1,6 @@
 Bart
 ----------------------------------------------------
-**DISCLAIMER:** This model is still a work in progress, if you see something strange,
+**DISCLAIMER:** If you see something strange,
 file a `Github Issue <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ and assign
@sshleifer

--- a/docs/source/model_doc/marian.rst
+++ b/docs/source/model_doc/marian.rst
@@ -0,0 +1,105 @@
+MarianMT
+----------------------------------------------------
+**DISCLAIMER:** If you see something strange,
+file a `Github Issue <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ and assign
+@sshleifer. Translations should be similar, but not identical to, output in the test set linked to in each model card.
+
+Implementation Notes
+~~~~~~~~~~~~~~~~~~~~
+- Each model is about 298 MB on disk; there are 1,000+ models.
+- The list of supported language pairs can be found `here <https://huggingface.co/Helsinki-NLP>`__.
+- The 1,000+ models were originally trained by `Jörg Tiedemann <https://researchportal.helsinki.fi/en/persons/j%C3%B6rg-tiedemann>`__ using the `Marian <https://marian-nmt.github.io/>`_ C++ library, which supports fast training and translation.
+- All models are transformer encoder-decoders with 6 layers in each component. Each model's performance is documented in a model card.
+- The 80 OPUS models that require BPE preprocessing are not supported.
+- The modeling code is the same as ``BartForConditionalGeneration`` with a few minor modifications:
+    - static (sinusoid) positional embeddings (``MarianConfig.static_position_embeddings=True``)
+    - a new final_logits_bias (``MarianConfig.add_bias_logits=True``)
+    - no layernorm_embedding (``MarianConfig.normalize_embedding=False``)
+    - the model starts generating with pad_token_id (which has 0 token_embedding) as the prefix. (Bart uses <s/>)
+- Code to bulk convert models can be found in ``convert_marian_to_pytorch.py``
+
+Naming
+~~~~~~
+- All model names use the following format: ``Helsinki-NLP/opus-mt-{src}-{tgt}``
+- The language codes used to name models are inconsistent. Two-letter codes can usually be found `here <https://developers.google.com/admin-sdk/directory/v1/languages>`_; three-letter codes require googling "language code {code}".
+- Codes formatted like ``es_AR`` are usually ``code_{region}``. That one denotes Spanish documents from Argentina.
+
+
+Multilingual Models
+~~~~~~~~~~~~~~~~~~~~
+
+All model names use the following format: ``Helsinki-NLP/opus-mt-{src}-{tgt}``:
+    - If ``src`` is in all caps, the model supports multiple input languages. You can figure out which ones by looking at the model card, or the Group Members `mapping <https://gist.github.com/sshleifer/6d20e7761931b08e73c3219027b97b8a>`_ .
+    - If ``tgt`` is in all caps, the model can output multiple languages, and you should specify a language code by prepending the desired output language to the ``src_text``.
+    - You can see a tokenizer's supported language codes in ``tokenizer.supported_language_codes``.
+
+Example of translating English to many Romance languages, using language codes:
+
+.. code-block:: python
+
+    from transformers import MarianMTModel, MarianTokenizer
+    src_text = [
+        '>>fr<< this is a sentence in english that we want to translate to french',
+        '>>pt<< This should go to portuguese',
+        '>>es<< And this to Spanish'
+    ]
+
+    model_name = 'Helsinki-NLP/opus-mt-en-ROMANCE'
+    tokenizer = MarianTokenizer.from_pretrained(model_name)
+    print(tokenizer.supported_language_codes)
+    model = MarianMTModel.from_pretrained(model_name)
+    translated = model.generate(**tokenizer.prepare_translation_batch(src_text))
+    tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
+    # ["c'est une phrase en anglais que nous voulons traduire en français",
+    # 'Isto deve ir para o português.',
+    # 'Y esto al español']
+
+Sometimes, models were trained on collections of languages that do not resolve to a group. In this case, ``_`` is used as a separator for src or tgt, as in ``'Helsinki-NLP/opus-mt-en_el_es_fi-en_el_es_fi'``. These still require language codes.
+There are many supported regional language codes, like ``>>es_ES<<`` (Spain) and ``>>es_AR<<`` (Argentina), but they do not seem to produce different translations than just using ``>>es<<``.
+
+For example:
+    - ``Helsinki-NLP/opus-mt-NORTH_EU-NORTH_EU``: translates from all NORTH_EU languages (see `mapping <https://gist.github.com/sshleifer/6d20e7761931b08e73c3219027b97b8a>`_) to all NORTH_EU languages. Use a special language code like ``>>de<<`` to specify output language.
+    - ``Helsinki-NLP/opus-mt-ROMANCE-en``: translates from many Romance languages to English. No codes are needed since there is only 1 tgt language.
+
+
+
+.. code-block:: python
+
+    GROUP_MEMBERS = {
+     'ZH': ['cmn', 'cn', 'yue', 'ze_zh', 'zh_cn', 'zh_CN', 'zh_HK', 'zh_tw', 'zh_TW', 'zh_yue', 'zhs', 'zht', 'zh'],
+     'ROMANCE': ['fr', 'fr_BE', 'fr_CA', 'fr_FR', 'wa', 'frp', 'oc', 'ca', 'rm', 'lld', 'fur', 'lij', 'lmo', 'es', 'es_AR', 'es_CL', 'es_CO', 'es_CR', 'es_DO', 'es_EC', 'es_ES', 'es_GT', 'es_HN', 'es_MX', 'es_NI', 'es_PA', 'es_PE', 'es_PR', 'es_SV', 'es_UY', 'es_VE', 'pt', 'pt_br', 'pt_BR', 'pt_PT', 'gl', 'lad', 'an', 'mwl', 'it', 'it_IT', 'co', 'nap', 'scn', 'vec', 'sc', 'ro', 'la'],
+     'NORTH_EU': ['de', 'nl', 'fy', 'af', 'da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
+     'SCANDINAVIA': ['da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
+     'SAMI': ['se', 'sma', 'smj', 'smn', 'sms'],
+     'NORWAY': ['nb_NO', 'nb', 'nn_NO', 'nn', 'nog', 'no_nb', 'no'],
+     'CELTIC': ['ga', 'cy', 'br', 'gd', 'kw', 'gv']
+    }
+
+Code to see available pretrained models:
+
+.. code-block:: python
+
+    from transformers.hf_api import HfApi
+    model_list = HfApi().model_list()
+    org = "Helsinki-NLP"
+    model_ids = [x.modelId for x in model_list if x.modelId.startswith(org)]
+    suffix = [x.split('/')[1] for x in model_ids]
+    multi_models = [f'{org}/{s}' for s in suffix if s != s.lower()]
+
+MarianMTModel
+~~~~~~~~~~~~~
+
+PyTorch version of marian-nmt's transformer.h (C++). Designed for the OPUS-NMT translation checkpoints.
+Model API is identical to ``BartForConditionalGeneration``.
+Available models are listed at `Model List <https://huggingface.co/models?search=Helsinki-NLP>`__.
+This class inherits all functionality from ``BartForConditionalGeneration``, see that page for method signatures.
+
+.. autoclass:: transformers.MarianMTModel
+    :members:
+
+
+MarianTokenizer
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.MarianTokenizer
+    :members: prepare_translation_batch
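Reviewer note on the naming rules documented above: a small sketch of how the ``Helsinki-NLP/opus-mt-{src}-{tgt}`` convention could be applied before calling the tokenizer. The helpers `split_model_name` and `add_language_codes` are hypothetical and not part of the library. The sketch only covers the all-caps group rule; underscore-joined collections such as ``en_el_es_fi`` also need language codes and are not handled here.

```python
from typing import List, Tuple

PREFIX = "Helsinki-NLP/opus-mt-"


def split_model_name(model_name: str) -> Tuple[str, str]:
    """Return (src, tgt) from a name shaped like Helsinki-NLP/opus-mt-{src}-{tgt}."""
    if not model_name.startswith(PREFIX):
        raise ValueError(f"unexpected model name: {model_name}")
    parts = model_name[len(PREFIX):].split("-")
    if len(parts) != 2:
        raise ValueError(f"expected exactly one src and one tgt in: {model_name}")
    return parts[0], parts[1]


def add_language_codes(model_name: str, sentences: List[str], lang: str) -> List[str]:
    """Prepend a >>lang<< code to each sentence when tgt is an all-caps group."""
    _, tgt = split_model_name(model_name)
    if not tgt.isupper():
        # A single target language needs no code.
        return list(sentences)
    return [f">>{lang}<< {sentence}" for sentence in sentences]


src_text = add_language_codes(
    "Helsinki-NLP/opus-mt-en-ROMANCE", ["And this to Spanish"], "es"
)
```

The resulting `src_text` has the same shape as the list passed to `prepare_translation_batch` in the documented example.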
--- a/docs/source/model_doc/xlnet.rst
+++ b/docs/source/model_doc/xlnet.rst
@@ -29,7 +29,7 @@ Tips:
  XLNet is pretrained using only a sub-set of the output tokens as target which are selected
  with the `target_mapping` input.
 - To use XLNet for sequential decoding (i.e. not in fully bi-directional setting), use the `perm_mask` and
-  `target_mapping` inputs to control the attention span and outputs (see examples in `examples/run_generation.py`)
+  `target_mapping` inputs to control the attention span and outputs (see examples in `examples/text-generation/run_generation.py`)
 - XLNet is one of the few models that has no sequence length limit.

 The original code can be found `here <https://github.com/zihangdai/xlnet/>`_.
--- a/docs/source/multilingual.rst
+++ b/docs/source/multilingual.rst
@@ -80,7 +80,7 @@ You can then feed it all as input to your model:
    outputs = model(input_ids, langs=langs)


-The example `run_generation.py <https://github.com/huggingface/transformers/blob/master/examples/run_generation.py>`__
+The example `run_generation.py <https://github.com/huggingface/transformers/blob/master/examples/text-generation/run_generation.py>`__
 can generate text using the CLM checkpoints from XLM, using the language embeddings.

 XLM without Language Embeddings
--- a/docs/source/pretrained_models.rst
+++ b/docs/source/pretrained_models.rst
@@ -275,7 +275,7 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
 |                   |                                                            | | FlauBERT large architecture                                                                                                         |
 |                   |                                                            | (see `details <https://github.com/getalp/Flaubert>`__)                                                                                |
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-| Bart              | ``bart-large``                                             | | 12-layer, 1024-hidden, 16-heads, 406M parameters                                                                                    |
+| Bart              | ``bart-large``                                             | | 24-layer, 1024-hidden, 16-heads, 406M parameters                                                                                    |
 |                   |                                                            | (see `details <https://github.com/pytorch/fairseq/tree/master/examples/bart>`_)                                                       |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
 |                   | ``bart-large-mnli``                                        | | Adds a 2 layer classification head with 1 million parameters                                                                        |
@@ -296,6 +296,12 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
 |                   | ``DialoGPT-large``                                         | | 36-layer, 1280-hidden, 20-heads, 774M parameters                                                                                    |
 |                   |                                                            | | Trained on English text: 147M conversation-like exchanges extracted from Reddit.                                                    |
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-| Reformer          | ``reformer-crime-and-punishment``                          | | 6-layer, 256-hidden, 2-heads, 3M parameters                                                                                         |
-|                   |                                                            | | Trained on English text: Crime and Punishment novel by Fyodor Dostoyevsky                                                           |
+| Reformer          | ``reformer-enwik8``                                        | | 12-layer, 1024-hidden, 8-heads, 149M parameters                                                                                     |
+|                   |                                                            | | Trained on English Wikipedia data - enwik8.                                                                                         |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``reformer-crime-and-punishment``                          | | 6-layer, 256-hidden, 2-heads, 3M parameters                                                                                         |
+|                   |                                                            | | Trained on English text: Crime and Punishment novel by Fyodor Dostoyevsky.                                                          |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+| MarianMT          | ``Helsinki-NLP/opus-mt-{src}-{tgt}``                       | | 12-layer, 512-hidden, 8-heads, ~74M parameter Machine translation models. Parameter counts vary depending on vocab size.            |
+|                   |                                                            | | (see `model list <https://huggingface.co/Helsinki-NLP>`_)                                                                           |
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
--- a/docs/source/quickstart.md
+++ b/docs/source/quickstart.md
@@ -8,7 +8,7 @@ The library was designed with two strong goals in mind:

 - be as easy and fast to use as possible:

-  - we strongly limited the number of user-facing abstractions to learn, in fact there are almost no abstractions, just three standard classes required to use each model: configuration, models and tokenizer,
+  - we strongly limited the number of user-facing abstractions to learn, in fact, there are almost no abstractions, just three standard classes required to use each model: configuration, models and tokenizer,
  - all of these classes can be initialized in a simple and unified way from pretrained instances by using a common `from_pretrained()` instantiation method which will take care of downloading (if needed), caching and loading the related class from a pretrained instance supplied in the library or your own saved instance.
  - as a consequence, this library is NOT a modular toolbox of building blocks for neural nets. If you want to extend/build-upon the library, just use regular Python/PyTorch modules and inherit from the base classes of the library to reuse functionalities like model loading/saving.

@@ -31,27 +31,27 @@ A few other goals:

 ## Main concepts

-The library is build around three type of classes for each models:
+The library is built around three types of classes for each model:

- **model classes** which are PyTorch models (`torch.nn.Modules`) of the 8 models architectures currently provided in the library, e.g. `BertModel`
- **configuration classes** which store all the parameters required to build a model, e.g. `BertConfig`. You don't always need to instantiate these your-self, in particular if you are using a pretrained model without any modification, creating the model will automatically take care of instantiating the configuration (which is part of the model)
- **tokenizer classes** which store the vocabulary for each model and provide methods for encoding/decoding strings in list of token embeddings indices to be fed to a model, e.g. `BertTokenizer`
+- **model classes**, e.g., `BertModel`, which are 20+ PyTorch models (`torch.nn.Module`) that work with the pretrained weights provided in the library. In TF2, these are `tf.keras.Model`.
+- **configuration classes** which store all the parameters required to build a model, e.g., `BertConfig`. You don't always need to instantiate these yourself. In particular, if you are using a pretrained model without any modification, creating the model will automatically take care of instantiating the configuration (which is part of the model)
+- **tokenizer classes** which store the vocabulary for each model and provide methods for encoding/decoding strings into a list of token embedding indices to be fed to a model, e.g., `BertTokenizer`

 All these classes can be instantiated from pretrained instances and saved locally using two methods:

 - `from_pretrained()` let you instantiate a model/configuration/tokenizer from a pretrained version either provided by the library itself (currently 27 models are provided as listed [here](https://huggingface.co/transformers/pretrained_models.html)) or stored locally (or on a server) by the user,
 - `save_pretrained()` let you save a model/configuration/tokenizer locally so that it can be reloaded using `from_pretrained()`.

-We'll finish this quickstart tour by going through a few simple quick-start examples to see how we can instantiate and use these classes. The rest of the documentation is organized in two parts:
+We'll finish this quickstart tour by going through a few simple quick-start examples to see how we can instantiate and use these classes. The rest of the documentation is organized into two parts:

 - the **MAIN CLASSES** section details the common functionalities/method/attributes of the three main type of classes (configuration, model, tokenizer) plus some optimization related classes provided as utilities for training,
- the **PACKAGE REFERENCE** section details all the variants of each class for each model architectures and in particular the input/output that you should expect when calling each of them.
+- the **PACKAGE REFERENCE** section details all the variants of each class for each model architecture and, in particular, the input/output that you should expect when calling each of them.

 ## Quick tour: Usage

 Here are two examples showcasing a few `Bert` and `GPT2` classes and pre-trained models.

-See full API reference for examples for each model class.
+See the full API reference for examples of each model class.

 ### BERT example

@@ -191,7 +191,7 @@ Examples for each model class of each model architecture (Bert, GPT, GPT-2, Tran

 #### Using the past

-GPT-2 as well as some other models (GPT, XLNet, Transfo-XL, CTRL) make use of a `past` or `mems` attribute which can be used to prevent re-computing the key/value pairs when using sequential decoding. It is useful when generating sequences as a big part of the attention mechanism benefits from previous computations.
+GPT-2, as well as some other models (GPT, XLNet, Transfo-XL, CTRL), make use of a `past` or `mems` attribute which can be used to prevent re-computing the key/value pairs when using sequential decoding. It is useful when generating sequences as a big part of the attention mechanism benefits from previous computations.

 Here is a fully-working example using the `past` with `GPT2LMHeadModel` and argmax decoding (which should only be used as an example, as argmax decoding introduces a lot of repetition):

--- a/docs/source/usage.rst
+++ b/docs/source/usage.rst
@@ -45,7 +45,7 @@ Sequence classification is the task of classifying sequences according to a give
 of sequence classification is the GLUE dataset, which is entirely based on that task. If you would like to fine-tune
 a model on a GLUE sequence classification task, you may leverage the
 `run_glue.py <https://github.com/huggingface/transformers/tree/master/examples/text-classification/run_glue.py>`_ or
-`run_tf_glue.py <https://github.com/huggingface/transformers/tree/master/examples/run_tf_glue.py>`_ scripts.
+`run_tf_glue.py <https://github.com/huggingface/transformers/tree/master/examples/text-classification/run_tf_glue.py>`_ scripts.

 Here is an example using the pipelines do to sentiment analysis: identifying if a sequence is positive or negative.
 It leverages a fine-tuned model on sst2, which is a GLUE task.
@@ -404,48 +404,150 @@ Causal language modeling is the task of predicting the token following a sequenc
 model only attends to the left context (tokens on the left of the mask). Such a training is particularly interesting
 for generation tasks.

-There is currently no pipeline to do causal language modeling/generation.
+Usually, the next token is predicted by sampling from the logits of the last hidden state the model produces from the input sequence.

-Here is an example using the tokenizer and model. leveraging the :func:`~transformers.PreTrainedModel.generate` method
-to generate the tokens following the initial sequence in PyTorch, and creating a simple loop in TensorFlow.
+Here is an example that uses the tokenizer and model together with the :func:`~transformers.top_k_top_p_filtering` function to sample the next token following an input sequence of tokens.
+
+::
+
+    ## PYTORCH CODE
+    from transformers import AutoModelWithLMHead, AutoTokenizer, top_k_top_p_filtering
+    import torch
+    from torch.nn import functional as F
+
+
+    tokenizer = AutoTokenizer.from_pretrained("gpt2")
+    model = AutoModelWithLMHead.from_pretrained("gpt2")
+
+    sequence = f"Hugging Face is based in DUMBO, New York City, and "
+
+    input_ids = tokenizer.encode(sequence, return_tensors="pt")
+
+    # get logits of last hidden state
+    next_token_logits = model(input_ids)[0][:, -1, :]
+
+    # filter
+    filtered_next_token_logits = top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)
+
+    # sample
+    probs = F.softmax(filtered_next_token_logits, dim=-1)
+    next_token = torch.multinomial(probs, num_samples=1)
+
+    generated = torch.cat([input_ids, next_token], dim=-1)
+
+    resulting_string = tokenizer.decode(generated.tolist()[0])
+    print(resulting_string)
+    ## TENSORFLOW CODE
+    from transformers import TFAutoModelWithLMHead, AutoTokenizer, tf_top_k_top_p_filtering
+    import tensorflow as tf
+
+    tokenizer = AutoTokenizer.from_pretrained("gpt2")
+    model = TFAutoModelWithLMHead.from_pretrained("gpt2")
+
+    sequence = f"Hugging Face is based in DUMBO, New York City, and "
+
+    input_ids = tokenizer.encode(sequence, return_tensors="tf")
+
+    # get logits of last hidden state
+    next_token_logits = model(input_ids)[0][:, -1, :]
+
+    # filter
+    filtered_next_token_logits = tf_top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)
+
+    # sample
+    next_token = tf.random.categorical(filtered_next_token_logits, dtype=tf.int32, num_samples=1)
+
+    generated = tf.concat([input_ids, next_token], axis=1)
+
+    resulting_string = tokenizer.decode(generated.numpy().tolist()[0])
+    print(resulting_string)
+
+
+This outputs a (hopefully) coherent next token following the original sequence, which in our case is the word *has*:
+
+::
+
+    Hugging Face is based in DUMBO, New York City, and has
+
+In the next section, we show how this functionality is leveraged in :func:`~transformers.PreTrainedModel.generate` to generate multiple tokens up to a user-defined length.
+
+Text Generation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In text generation (*a.k.a.* *open-ended text generation*), the goal is to create a coherent portion of text that is a continuation of the given context. The following example shows how *GPT-2* can be used in pipelines to generate text. By default, all models apply *Top-K* sampling when used in pipelines, as configured in their respective configurations (see `gpt-2 config <https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config.json>`_ for example).
+
+::
+
+    from transformers import pipeline
+
+    text_generator = pipeline("text-generation")
+    print(text_generator("As far as I am concerned, I will", max_length=50))
+
+
+Here the model generates a random text with a total maximal length of *50* tokens from context *"As far as I am concerned, I will"*.
+The default arguments of ``PreTrainedModel.generate()`` can be overridden directly in the pipeline, as shown above for the argument ``max_length``.
+
+Here is an example of text generation using XLNet and its tokenizer.

 ::

    ## PYTORCH CODE
    from transformers import AutoModelWithLMHead, AutoTokenizer

-    tokenizer = AutoTokenizer.from_pretrained("gpt2")
-    model = AutoModelWithLMHead.from_pretrained("gpt2")
+    model = AutoModelWithLMHead.from_pretrained("xlnet-base-cased")
+    tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")

-    sequence = f"Hugging Face is based in DUMBO, New York City, and is"
+    # Padding text helps XLNet with short prompts - proposed by Aman Rusia in https://github.com/rusiaaman/XLNet-gen#methodology
+    PADDING_TEXT = """In 1991, the remains of Russian Tsar Nicholas II and his family
+    (except for Alexei and Maria) are discovered.
+    The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the
+    remainder of the story. 1883 Western Siberia,
+    a young Grigori Rasputin is asked by his father and a group of men to perform magic.
+    Rasputin has a vision and denounces one of the men as a horse thief. Although his
+    father initially slaps him for making such an accusation, Rasputin watches as the
+    man is chased outside and beaten. Twenty years later, Rasputin sees a vision of
+    the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,
+    with people, even a bishop, begging for his blessing. <eod> </s> <eos>""" 

-    input = tokenizer.encode(sequence, return_tensors="pt")
-    generated = model.generate(input, max_length=50, do_sample=True)
+    prompt = "Today the weather is really nice and I am planning on "
+    inputs = tokenizer.encode(PADDING_TEXT + prompt, add_special_tokens=False, return_tensors="pt")
+    
+    prompt_length = len(tokenizer.decode(inputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True))
+    outputs = model.generate(inputs, max_length=250, do_sample=True, top_p=0.95, top_k=60)
+    generated = prompt + tokenizer.decode(outputs[0])[prompt_length:]

-    resulting_string = tokenizer.decode(generated.tolist()[0])
-    print(resulting_string)
+    print(generated)
    ## TENSORFLOW CODE
    from transformers import TFAutoModelWithLMHead, AutoTokenizer
-    import tensorflow as tf

-    tokenizer = AutoTokenizer.from_pretrained("gpt2")
-    model = TFAutoModelWithLMHead.from_pretrained("gpt2")
+    model = TFAutoModelWithLMHead.from_pretrained("xlnet-base-cased")
+    tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")

-    sequence = f"Hugging Face is based in DUMBO, New York City, and is"
-    input = tokenizer.encode(sequence, return_tensors="tf")
-    generated = model.generate(input, max_length=50, do_sample=True)
+    # Padding text helps XLNet with short prompts - proposed by Aman Rusia in https://github.com/rusiaaman/XLNet-gen#methodology
+    PADDING_TEXT = """In 1991, the remains of Russian Tsar Nicholas II and his family
+    (except for Alexei and Maria) are discovered.
+    The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the
+    remainder of the story. 1883 Western Siberia,
+    a young Grigori Rasputin is asked by his father and a group of men to perform magic.
+    Rasputin has a vision and denounces one of the men as a horse thief. Although his
+    father initially slaps him for making such an accusation, Rasputin watches as the
+    man is chased outside and beaten. Twenty years later, Rasputin sees a vision of
+    the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,
+    with people, even a bishop, begging for his blessing. <eod> </s> <eos>""" 

-    resulting_string = tokenizer.decode(generated.tolist()[0])
-    print(resulting_string)
+    prompt = "Today the weather is really nice and I am planning on "
+    inputs = tokenizer.encode(PADDING_TEXT + prompt, add_special_tokens=False, return_tensors="tf")

+    prompt_length = len(tokenizer.decode(inputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True))
+    outputs = model.generate(inputs, max_length=250, do_sample=True, top_p=0.95, top_k=60)
+    generated = prompt + tokenizer.decode(outputs[0])[prompt_length:]

-This outputs a (hopefully) coherent string from the original sequence, as the
-:func:`~transformers.PreTrainedModel.generate` samples from a top_p/tok_k distribution:
+    print(generated)

-::
+Text generation is currently possible with *GPT-2*, *OpenAi-GPT*, *CTRL*, *XLNet*, *Transfo-XL* and *Reformer* in PyTorch and for most models in TensorFlow as well. As can be seen in the example above, *XLNet* and *Transfo-XL* often need to be padded to work well.
+GPT-2 is usually a good choice for *open-ended text generation* because it was trained on millions of webpages with a causal language modeling objective.

-    Hugging Face is based in DUMBO, New York City, and is a live-action TV series based on the novel by John
-    Carpenter, and its producers, David Kustlin and Steve Pichar. The film is directed by!
+For more information on how to apply different decoding strategies for text generation, please also refer to our generation blog post `here <https://huggingface.co/blog/how-to-generate>`_.


 Named Entity Recognition
--- a/examples/README.md
+++ b/examples/README.md
@@ -1,10 +1,48 @@
 # Examples

-In this section a few examples are put together. All of these examples work for several models, making use of the very
-similar API between the different models.
+Version 2.9 of `transformers` introduces a new `Trainer` class for PyTorch, and its equivalent `TFTrainer` for TF 2.
+
+Here is the list of all our examples:
+- **grouped by task** (all official examples work for multiple models)
+- with information on whether they are **built on top of `Trainer`/`TFTrainer`** (if not, they still work, they might just lack some features),
+- whether they also include examples for **`pytorch-lightning`**, which is a great fully-featured, general-purpose training library for PyTorch,
+- links to **Colab notebooks** to walk through the scripts and run them easily,
+- links to **Cloud deployments** to be able to deploy large-scale trainings in the Cloud with little to no setup.
+
+This is still a work-in-progress – in particular documentation is still sparse – so please **contribute improvements/pull requests.**
+
+
+## Tasks built on Trainer
+
+| Task | Example datasets | Trainer support | TFTrainer support | pytorch-lightning | Colab | One-click Deploy to Azure (wip) |
+|---|---|:---:|:---:|:---:|:---:|:---:|
+| [`language-modeling`](./language-modeling) | Raw text | ✅ | - | - | - | - |
+| [`text-classification`](./text-classification) | GLUE, XNLI | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/trainer/01_text_classification.ipynb) | [![Deploy to Azure](https://aka.ms/deploytoazurebutton)](https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2FAzure%2Fazure-quickstart-templates%2Fmaster%2F101-storage-account-create%2Fazuredeploy.json) |
+| [`token-classification`](./token-classification) | CoNLL NER | ✅ | ✅ | ✅ | - | - |
+| [`multiple-choice`](./multiple-choice) | SWAG, RACE, ARC | ✅ | ✅ | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ViktorAlm/notebooks/blob/master/MPC_GPU_Demo_for_TF_and_PT.ipynb) | - |
+| [`question-answering`](./question-answering) | SQuAD | - | ✅ | - | - | - |
+
+
+
+## Other examples and how-to's
+
+| Section | Description |
+|---|---|
+| [TensorFlow 2.0 models on GLUE](./text-classification) | Examples running BERT TensorFlow 2.0 model on the GLUE tasks. |
+| [Running on TPUs](#running-on-tpus) | Examples on running fine-tuning tasks on Google TPUs to accelerate workloads. |
+| [Language Model training](./language-modeling) | Fine-tuning (or training from scratch) the library models for language modeling on a text dataset. Causal language modeling for GPT/GPT-2, masked language modeling for BERT/RoBERTa. |
+| [Language Generation](./text-generation) | Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL and XLNet. |
+| [GLUE](./text-classification) | Examples running BERT/XLM/XLNet/RoBERTa on the 9 GLUE tasks. Examples feature distributed training as well as half-precision. |
+| [SQuAD](./question-answering) | Using BERT/RoBERTa/XLNet/XLM for question answering, examples with distributed training. |
+| [Multiple Choice](./multiple-choice) | Examples running BERT/XLNet/RoBERTa on the SWAG/RACE/ARC tasks. |
+| [Named Entity Recognition](./token-classification) | Using BERT for Named Entity Recognition (NER) on the CoNLL 2003 dataset, examples with distributed training. |
+| [XNLI](./text-classification) | Examples running BERT/XLM on the XNLI benchmark. |
+| [Adversarial evaluation of model performances](./adversarial) | Testing a model with adversarial evaluation of natural language inference on the Heuristic Analysis for NLI Systems (HANS) dataset (McCoy et al., 2019.) |
+
+## Important note

 **Important**
-To run the latest versions of the examples, you have to install from source and install some specific requirements for the examples.
+To make sure you can successfully run the latest versions of the example scripts, you have to install the library from source and install some example-specific requirements.
 Execute the following steps in a new virtual environment:

 ```bash
@@ -14,16 +52,30 @@ pip install .
 pip install -r ./examples/requirements.txt
 ```

-| Section                    | Description                                                                                                                                                |
-|----------------------------|-----------------------------------------------------
-| [TensorFlow 2.0 models on GLUE](#TensorFlow-2.0-Bert-models-on-GLUE) | Examples running BERT TensorFlow 2.0 model on the GLUE tasks. |
-| [Running on TPUs](#running-on-tpus) | Examples on running fine-tuning tasks on Google TPUs to accelerate workloads. |
-| [Language Model training](#language-model-training) | Fine-tuning (or training from scratch) the library models for language modeling on a text dataset. Causal language modeling for GPT/GPT-2, masked language modeling for BERT/RoBERTa. |
-| [Language Generation](#language-generation) | Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL and XLNet. |
-| [GLUE](#glue) | Examples running BERT/XLM/XLNet/RoBERTa on the 9 GLUE tasks. Examples feature distributed training as well as half-precision. |
-| [SQuAD](#squad) | Using BERT/RoBERTa/XLNet/XLM for question answering, examples with distributed training. |
-| [Multiple Choice](#multiple-choice) | Examples running BERT/XLNet/RoBERTa on the SWAG/RACE/ARC tasks. |
-| [Named Entity Recognition](https://github.com/huggingface/transformers/tree/master/examples/ner) | Using BERT for Named Entity Recognition (NER) on the CoNLL 2003 dataset, examples with distributed training. |
-| [XNLI](#xnli) | Examples running BERT/XLM on the XNLI benchmark. |
-| [Adversarial evaluation of model performances](#adversarial-evaluation-of-model-performances) | Testing a model with adversarial evaluation of natural language inference on the Heuristic Analysis for NLI Systems (HANS) dataset (McCoy et al., 2019.) |
+## Running on TPUs

+When using TensorFlow, TPUs are supported out of the box as a `tf.distribute.Strategy`.
+
+When using PyTorch, we support TPUs thanks to `pytorch/xla`. For more context and information on how to set up your TPU environment, refer to Google's documentation and to the
+very detailed [pytorch/xla README](https://github.com/pytorch/xla/blob/master/README.md).
+
+In this repo, we provide a very simple launcher script named [xla_spawn.py](./xla_spawn.py) that lets you run our example scripts on multiple TPU cores without any boilerplate.
+Just pass a `--num_cores` flag to this script, then your regular training script with its arguments (this is similar to the `torch.distributed.launch` helper for torch.distributed).
+
+For example, for `run_glue`:
+
+```bash
+python examples/xla_spawn.py --num_cores 8 \
+	examples/text-classification/run_glue.py \
+	--model_name_or_path bert-base-cased \
+	--task_name mnli \
+	--data_dir ./data/glue_data/MNLI \
+	--output_dir ./models/tpu \
+	--overwrite_output_dir \
+	--do_train \
+	--do_eval \
+	--num_train_epochs 1 \
+	--save_steps 20000
+```
+
+Feedback, additional use cases, and benchmarks involving TPUs are welcome; please share them with the community.
--- a/examples/bertology/run_bertology.py
+++ b/examples/bertology/run_bertology.py
@@ -404,7 +404,7 @@ def main():
    logger.info("Training/evaluation parameters %s", args)

    # Prepare dataset for the GLUE task
-    eval_dataset = GlueDataset(args, tokenizer=tokenizer, evaluate=True, local_rank=args.local_rank)
+    eval_dataset = GlueDataset(args, tokenizer=tokenizer, evaluate=True)
    if args.data_subset > 0:
        eval_dataset = Subset(eval_dataset, list(range(min(args.data_subset, len(eval_dataset)))))
    eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset)
--- a/examples/distillation/run_squad_w_distillation.py
+++ b/examples/distillation/run_squad_w_distillation.py
@@ -13,7 +13,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-""" This is the exact same script as `examples/run_squad.py` (as of 2020, January 8th) with an additional and optional step of distillation."""
+""" This is the exact same script as `examples/question-answering/run_squad.py` (as of 2020, January 8th) with an additional and optional step of distillation."""

 import argparse
 import glob
--- a/examples/language-modeling/README.md
+++ b/examples/language-modeling/README.md
@@ -1,7 +1,7 @@

 ## Language model training

-Based on the script [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py).
+Based on the script [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py).

 Fine-tuning (or training from scratch) the library models for language modeling on a text dataset for GPT, GPT-2, BERT and RoBERTa (DistilBERT
 to be added soon). GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa
--- a/examples/language-modeling/run_language_modeling.py
+++ b/examples/language-modeling/run_language_modeling.py
@@ -265,7 +265,7 @@ def main():

        eval_output = trainer.evaluate()

-        perplexity = math.exp(eval_output["loss"])
+        perplexity = math.exp(eval_output["eval_loss"])
        result = {"perplexity": perplexity}

        output_eval_file = os.path.join(training_args.output_dir, "eval_results_lm.txt")
@@ -280,5 +280,10 @@ def main():
    return results


+def _mp_fn(index):
+    # For xla_spawn (TPUs)
+    main()
+
+
 if __name__ == "__main__":
    main()
--- a/examples/multiple-choice/README.md
+++ b/examples/multiple-choice/README.md
@@ -8,7 +8,7 @@ Download [swag](https://github.com/rowanz/swagaf/tree/master/data) data
 ```bash
 #training on 4 tesla V100(16GB) GPUS
 export SWAG_DIR=/path/to/swag_data_dir
-python ./examples/run_multiple_choice.py \
+python ./examples/multiple-choice/run_multiple_choice.py \
 --task_name swag \
 --model_name_or_path roberta-base \
 --do_train \
@@ -29,3 +29,28 @@ Training with the defined hyper-parameters yields the following results:
 eval_acc = 0.8338998300509847
 eval_loss = 0.44457291918821606
 ```
+
+
+## TensorFlow
+
+```bash
+export SWAG_DIR=/path/to/swag_data_dir
+python ./examples/multiple-choice/run_tf_multiple_choice.py \
+--task_name swag \
+--model_name_or_path bert-base-cased \
+--do_train \
+--do_eval \
+--data_dir $SWAG_DIR \
+--learning_rate 5e-5 \
+--num_train_epochs 3 \
+--max_seq_length 80 \
+--output_dir models_bert/swag_base \
+--per_gpu_eval_batch_size=16 \
+--per_gpu_train_batch_size=16 \
+--logging_dir logs \
+--gradient_accumulation_steps 2 \
+--overwrite_output_dir
+```
+
+## Run it in Colab
+[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ViktorAlm/notebooks/blob/master/MPC_GPU_Demo_for_TF_and_PT.ipynb)
--- a/examples/multiple-choice/run_multiple_choice.py
+++ b/examples/multiple-choice/run_multiple_choice.py
@@ -221,5 +221,10 @@ def main():
    return results


+def _mp_fn(index):
+    # For xla_spawn (TPUs)
+    main()
+
+
 if __name__ == "__main__":
    main()
--- a/examples/multiple-choice/run_tf_multiple_choice.py
+++ b/examples/multiple-choice/run_tf_multiple_choice.py
@@ -0,0 +1,211 @@
+# coding=utf-8
+# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
+# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" Finetuning the library models for multiple choice (Bert, Roberta, XLNet)."""
+
+
+import logging
+import os
+from dataclasses import dataclass, field
+from typing import Dict, Optional
+
+import numpy as np
+
+from transformers import (
+    AutoConfig,
+    AutoTokenizer,
+    EvalPrediction,
+    HfArgumentParser,
+    TFAutoModelForMultipleChoice,
+    TFTrainer,
+    TFTrainingArguments,
+    set_seed,
+)
+from utils_multiple_choice import Split, TFMultipleChoiceDataset, processors
+
+
+logger = logging.getLogger(__name__)
+
+
+def simple_accuracy(preds, labels):
+    return (preds == labels).mean()
+
+
+@dataclass
+class ModelArguments:
+    """
+    Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
+    """
+
+    model_name_or_path: str = field(
+        metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
+    )
+    config_name: Optional[str] = field(
+        default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
+    )
+    tokenizer_name: Optional[str] = field(
+        default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
+    )
+    cache_dir: Optional[str] = field(
+        default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
+    )
+
+
+@dataclass
+class DataTrainingArguments:
+    """
+    Arguments pertaining to what data we are going to input our model for training and eval.
+    """
+
+    task_name: str = field(metadata={"help": "The name of the task to train on: " + ", ".join(processors.keys())})
+    data_dir: str = field(metadata={"help": "Should contain the data files for the task."})
+    max_seq_length: int = field(
+        default=128,
+        metadata={
+            "help": "The maximum total input sequence length after tokenization. Sequences longer "
+            "than this will be truncated, sequences shorter will be padded."
+        },
+    )
+    overwrite_cache: bool = field(
+        default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
+    )
+
+
+def main():
+    # See all possible arguments in src/transformers/training_args.py
+    # or by passing the --help flag to this script.
+    # We now keep distinct sets of args, for a cleaner separation of concerns.
+
+    parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TFTrainingArguments))
+    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
+
+    if (
+        os.path.exists(training_args.output_dir)
+        and os.listdir(training_args.output_dir)
+        and training_args.do_train
+        and not training_args.overwrite_output_dir
+    ):
+        raise ValueError(
+            f"Output directory ({training_args.output_dir}) already exists and is not empty. Use --overwrite_output_dir to overcome."
+        )
+
+    # Setup logging
+    logging.basicConfig(
+        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
+        datefmt="%m/%d/%Y %H:%M:%S",
+        level=logging.INFO,
+    )
+    logger.warning(
+        "device: %s, n_gpu: %s, 16-bits training: %s", training_args.device, training_args.n_gpu, training_args.fp16,
+    )
+    logger.info("Training/evaluation parameters %s", training_args)
+
+    # Set seed
+    set_seed(training_args.seed)
+
+    try:
+        processor = processors[data_args.task_name]()
+        label_list = processor.get_labels()
+        num_labels = len(label_list)
+    except KeyError:
+        raise ValueError("Task not found: %s" % (data_args.task_name))
+
+    # Load pretrained model and tokenizer
+    #
+    # Distributed training:
+    # The .from_pretrained methods guarantee that only one local process can concurrently
+    # download model & vocab.
+    config = AutoConfig.from_pretrained(
+        model_args.config_name if model_args.config_name else model_args.model_name_or_path,
+        num_labels=num_labels,
+        finetuning_task=data_args.task_name,
+        cache_dir=model_args.cache_dir,
+    )
+    tokenizer = AutoTokenizer.from_pretrained(
+        model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
+        cache_dir=model_args.cache_dir,
+    )
+    with training_args.strategy.scope():
+        model = TFAutoModelForMultipleChoice.from_pretrained(
+            model_args.model_name_or_path,
+            from_pt=bool(".bin" in model_args.model_name_or_path),
+            config=config,
+            cache_dir=model_args.cache_dir,
+        )
+    # Get datasets
+    train_dataset = (
+        TFMultipleChoiceDataset(
+            data_dir=data_args.data_dir,
+            tokenizer=tokenizer,
+            task=data_args.task_name,
+            max_seq_length=data_args.max_seq_length,
+            overwrite_cache=data_args.overwrite_cache,
+            mode=Split.train,
+        )
+        if training_args.do_train
+        else None
+    )
+    eval_dataset = (
+        TFMultipleChoiceDataset(
+            data_dir=data_args.data_dir,
+            tokenizer=tokenizer,
+            task=data_args.task_name,
+            max_seq_length=data_args.max_seq_length,
+            overwrite_cache=data_args.overwrite_cache,
+            mode=Split.dev,
+        )
+        if training_args.do_eval
+        else None
+    )
+
+    def compute_metrics(p: EvalPrediction) -> Dict:
+        preds = np.argmax(p.predictions, axis=1)
+        return {"acc": simple_accuracy(preds, p.label_ids)}
+
+    # Initialize our Trainer
+    trainer = TFTrainer(
+        model=model,
+        args=training_args,
+        train_dataset=train_dataset.get_dataset() if train_dataset else None,
+        eval_dataset=eval_dataset.get_dataset() if eval_dataset else None,
+        compute_metrics=compute_metrics,
+    )
+
+    # Training
+    if training_args.do_train:
+        trainer.train()
+        trainer.save_model()
+        tokenizer.save_pretrained(training_args.output_dir)
+    # Evaluation
+    results = {}
+    if training_args.do_eval:
+        logger.info("*** Evaluate ***")
+
+        result = trainer.evaluate()
+
+        output_eval_file = os.path.join(training_args.output_dir, "eval_results.txt")
+        with open(output_eval_file, "w") as writer:
+            logger.info("***** Eval results *****")
+            for key, value in result.items():
+                logger.info("  %s = %s", key, value)
+                writer.write("%s = %s\n" % (key, value))
+
+            results.update(result)
+
+    return results
+
+
+if __name__ == "__main__":
+    main()
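
The output-directory guard at the top of `main()` combines four conditions, and it only fires when all of them hold. The sketch below restates that check as a standalone function so each case can be tried in isolation. `should_refuse_output_dir` is a hypothetical helper name, not part of the script, and the temporary directory stands in for `--output_dir`.

```python
import os
import tempfile


def should_refuse_output_dir(output_dir: str, do_train: bool, overwrite_output_dir: bool) -> bool:
    """Return True when training would write into a non-empty output directory."""
    return (
        os.path.exists(output_dir)
        and bool(os.listdir(output_dir))
        and do_train
        and not overwrite_output_dir
    )


with tempfile.TemporaryDirectory() as tmp:
    # An existing but empty directory is acceptable.
    empty_ok = not should_refuse_output_dir(tmp, do_train=True, overwrite_output_dir=False)

    # Simulate the leftovers of a previous run.
    with open(os.path.join(tmp, "eval_results.txt"), "w") as f:
        f.write("acc = 0.5\n")

    refused = should_refuse_output_dir(tmp, do_train=True, overwrite_output_dir=False)
    overridden = should_refuse_output_dir(tmp, do_train=True, overwrite_output_dir=True)
    eval_only = should_refuse_output_dir(tmp, do_train=False, overwrite_output_dir=False)

print(empty_ok, refused, overridden, eval_only)
```

In this sketch an evaluation-only run is never refused, which matches the `training_args.do_train` term in the script's condition.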
--- a/examples/multiple-choice/utils_multiple_choice.py
+++ b/examples/multiple-choice/utils_multiple_choice.py
@@ -25,11 +25,9 @@ from dataclasses import dataclass
 from enum import Enum
 from typing import List, Optional

-import torch
 import tqdm
-from torch.utils.data.dataset import Dataset

-from transformers import PreTrainedTokenizer, torch_distributed_zero_first
+from transformers import PreTrainedTokenizer, is_tf_available, is_torch_available


 logger = logging.getLogger(__name__)
@@ -76,66 +74,160 @@ class Split(Enum):
    test = "test"


-class MultipleChoiceDataset(Dataset):
-    """
-    This will be superseded by a framework-agnostic approach
-    soon.
-    """
+if is_torch_available():
+    import torch
+    from torch.utils.data.dataset import Dataset
+    from transformers import torch_distributed_zero_first

-    features: List[InputFeatures]
+    class MultipleChoiceDataset(Dataset):
+        """
+        This will be superseded by a framework-agnostic approach
+        soon.
+        """

-    def __init__(
-        self,
-        data_dir: str,
-        tokenizer: PreTrainedTokenizer,
-        task: str,
-        max_seq_length: Optional[int] = None,
-        overwrite_cache=False,
-        mode: Split = Split.train,
-        local_rank=-1,
-    ):
-        processor = processors[task]()
+        features: List[InputFeatures]

-        cached_features_file = os.path.join(
-            data_dir,
-            "cached_{}_{}_{}_{}".format(mode.value, tokenizer.__class__.__name__, str(max_seq_length), task,),
-        )
-        with torch_distributed_zero_first(local_rank):
-            # Make sure only the first process in distributed training processes the dataset,
-            # and the others will use the cache.
+        def __init__(
+            self,
+            data_dir: str,
+            tokenizer: PreTrainedTokenizer,
+            task: str,
+            max_seq_length: Optional[int] = None,
+            overwrite_cache=False,
+            mode: Split = Split.train,
+            local_rank=-1,
+        ):
+            processor = processors[task]()

-            if os.path.exists(cached_features_file) and not overwrite_cache:
-                logger.info(f"Loading features from cached file {cached_features_file}")
-                self.features = torch.load(cached_features_file)
-            else:
-                logger.info(f"Creating features from dataset file at {data_dir}")
-                label_list = processor.get_labels()
-                if mode == Split.dev:
-                    examples = processor.get_dev_examples(data_dir)
-                elif mode == Split.test:
-                    examples = processor.get_test_examples(data_dir)
+            cached_features_file = os.path.join(
+                data_dir,
+                "cached_{}_{}_{}_{}".format(mode.value, tokenizer.__class__.__name__, str(max_seq_length), task,),
+            )
+            with torch_distributed_zero_first(local_rank):
+                # Make sure only the first process in distributed training processes the dataset,
+                # and the others will use the cache.
+
+                if os.path.exists(cached_features_file) and not overwrite_cache:
+                    logger.info(f"Loading features from cached file {cached_features_file}")
+                    self.features = torch.load(cached_features_file)
                else:
-                    examples = processor.get_train_examples(data_dir)
-                logger.info("Training examples: %s", len(examples))
-                # TODO clean up all this to leverage built-in features of tokenizers
-                self.features = convert_examples_to_features(
-                    examples,
-                    label_list,
-                    max_seq_length,
-                    tokenizer,
-                    pad_on_left=bool(tokenizer.padding_side == "left"),
-                    pad_token=tokenizer.pad_token_id,
-                    pad_token_segment_id=tokenizer.pad_token_type_id,
-                )
-                if local_rank in [-1, 0]:
-                    logger.info("Saving features into cached file %s", cached_features_file)
-                    torch.save(self.features, cached_features_file)
+                    logger.info(f"Creating features from dataset file at {data_dir}")
+                    label_list = processor.get_labels()
+                    if mode == Split.dev:
+                        examples = processor.get_dev_examples(data_dir)
+                    elif mode == Split.test:
+                        examples = processor.get_test_examples(data_dir)
+                    else:
+                        examples = processor.get_train_examples(data_dir)
+                    logger.info("Training examples: %s", len(examples))
+                    # TODO clean up all this to leverage built-in features of tokenizers
+                    self.features = convert_examples_to_features(
+                        examples,
+                        label_list,
+                        max_seq_length,
+                        tokenizer,
+                        pad_on_left=bool(tokenizer.padding_side == "left"),
+                        pad_token=tokenizer.pad_token_id,
+                        pad_token_segment_id=tokenizer.pad_token_type_id,
+                    )
+                    if local_rank in [-1, 0]:
+                        logger.info("Saving features into cached file %s", cached_features_file)
+                        torch.save(self.features, cached_features_file)

-    def __len__(self):
-        return len(self.features)
+        def __len__(self):
+            return len(self.features)

-    def __getitem__(self, i) -> InputFeatures:
-        return self.features[i]
+        def __getitem__(self, i) -> InputFeatures:
+            return self.features[i]
+
+
+if is_tf_available():
+    import tensorflow as tf
+
+    class TFMultipleChoiceDataset:
+        """
+        This will be superseded by a framework-agnostic approach
+        soon.
+        """
+
+        features: List[InputFeatures]
+
+        def __init__(
+            self,
+            data_dir: str,
+            tokenizer: PreTrainedTokenizer,
+            task: str,
+            max_seq_length: Optional[int] = 128,
+            overwrite_cache=False,
+            mode: Split = Split.train,
+        ):
+            processor = processors[task]()
+
+            logger.info(f"Creating features from dataset file at {data_dir}")
+            label_list = processor.get_labels()
+            if mode == Split.dev:
+                examples = processor.get_dev_examples(data_dir)
+            elif mode == Split.test:
+                examples = processor.get_test_examples(data_dir)
+            else:
+                examples = processor.get_train_examples(data_dir)
+            logger.info("Training examples: %s", len(examples))
+            # TODO clean up all this to leverage built-in features of tokenizers
+            self.features = convert_examples_to_features(
+                examples,
+                label_list,
+                max_seq_length,
+                tokenizer,
+                pad_on_left=bool(tokenizer.padding_side == "left"),
+                pad_token=tokenizer.pad_token_id,
+                pad_token_segment_id=tokenizer.pad_token_type_id,
+            )
+
+            def gen():
+                for (ex_index, ex) in tqdm.tqdm(enumerate(self.features), desc="convert examples to features"):
+                    if ex_index % 10000 == 0:
+                        logger.info("Writing example %d of %d" % (ex_index, len(examples)))
+
+                    yield (
+                        {
+                            "example_id": 0,
+                            "input_ids": ex.input_ids,
+                            "attention_mask": ex.attention_mask,
+                            "token_type_ids": ex.token_type_ids,
+                        },
+                        ex.label,
+                    )
+
+            self.dataset = tf.data.Dataset.from_generator(
+                gen,
+                (
+                    {
+                        "example_id": tf.int32,
+                        "input_ids": tf.int32,
+                        "attention_mask": tf.int32,
+                        "token_type_ids": tf.int32,
+                    },
+                    tf.int64,
+                ),
+                (
+                    {
+                        "example_id": tf.TensorShape([]),
+                        "input_ids": tf.TensorShape([None, None]),
+                        "attention_mask": tf.TensorShape([None, None]),
+                        "token_type_ids": tf.TensorShape([None, None]),
+                    },
+                    tf.TensorShape([]),
+                ),
+            )
+
+        def get_dataset(self):
+            return self.dataset
+
+        def __len__(self):
+            return len(self.features)
+
+        def __getitem__(self, i) -> InputFeatures:
+            return self.features[i]


 class DataProcessor:
@@ -225,6 +317,52 @@ class RaceProcessor(DataProcessor):
        return examples


+class SynonymProcessor(DataProcessor):
+    """Processor for the Synonym data set."""
+
+    def get_train_examples(self, data_dir):
+        """See base class."""
+        logger.info("LOOKING AT {} train".format(data_dir))
+        return self._create_examples(self._read_csv(os.path.join(data_dir, "mctrain.csv")), "train")
+
+    def get_dev_examples(self, data_dir):
+        """See base class."""
+        logger.info("LOOKING AT {} dev".format(data_dir))
+        return self._create_examples(self._read_csv(os.path.join(data_dir, "mchp.csv")), "dev")
+
+    def get_test_examples(self, data_dir):
+        """See base class."""
+        logger.info("LOOKING AT {} test".format(data_dir))
+
+        return self._create_examples(self._read_csv(os.path.join(data_dir, "mctest.csv")), "test")
+
+    def get_labels(self):
+        """See base class."""
+        return ["0", "1", "2", "3", "4"]
+
+    def _read_csv(self, input_file):
+        with open(input_file, "r", encoding="utf-8") as f:
+            return list(csv.reader(f))
+
+    def _create_examples(self, lines: List[List[str]], type: str):
+        """Creates examples for the training and dev sets."""
+
+        examples = [
+            InputExample(
+                example_id=line[0],
+                question="",  # no separate question here:
+                # line[1] is the shared context, repeated
+                # once per candidate ending.
+                contexts=[line[1], line[1], line[1], line[1], line[1]],
+                endings=[line[2], line[3], line[4], line[5], line[6]],
+                label=line[7],
+            )
+            for line in lines  # every row is used; no header row is skipped
+        ]
+
+        return examples
+
+
 class SwagProcessor(DataProcessor):
    """Processor for the SWAG data set."""

@@ -435,7 +573,5 @@ def convert_examples_to_features(
    return features


-processors = {"race": RaceProcessor, "swag": SwagProcessor, "arc": ArcProcessor}
-
-
-MULTIPLE_CHOICE_TASKS_NUM_LABELS = {"race", 4, "swag", 4, "arc", 4}
+processors = {"race": RaceProcessor, "swag": SwagProcessor, "arc": ArcProcessor, "syn": SynonymProcessor}
+MULTIPLE_CHOICE_TASKS_NUM_LABELS = {"race": 4, "swag": 4, "arc": 4, "syn": 5}
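
The row layout that `SynonymProcessor._create_examples` indexes into (id, context, five endings, label) can be illustrated without the library. This is a minimal sketch, assuming a header-less CSV with eight columns. `Example` is a hypothetical stand-in for `InputExample`, `rows_to_examples` is an illustrative name, and the sample row is made up.

```python
import csv
import io
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    """Hypothetical stand-in for InputExample, holding the same fields."""

    example_id: str
    question: str
    contexts: List[str]
    endings: List[str]
    label: str


def rows_to_examples(rows: List[List[str]]) -> List[Example]:
    """Map each CSV row to one five-way multiple-choice example."""
    return [
        Example(
            example_id=row[0],
            question="",  # the processor leaves the question empty
            contexts=[row[1]] * 5,  # same context paired with every ending
            endings=row[2:7],  # columns 2..6 are the five candidates
            label=row[7],  # column 7 is the label, kept as a string
        )
        for row in rows
    ]


# Made-up row: id, context, five candidate endings, label.
sample_csv = "q1,happy,sad,glad,angry,tired,bored,1\n"
rows = list(csv.reader(io.StringIO(sample_csv)))
examples = rows_to_examples(rows)
print(examples[0].endings, examples[0].label)
```

Because every row is converted, a file that does carry a header row would turn that header into an example under this layout.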
--- a/examples/question-answering/README.md
+++ b/examples/question-answering/README.md
@@ -2,7 +2,7 @@

 ## SQuAD

-Based on the script [`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/run_squad.py).
+Based on the script [`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py).

 #### Fine-tuning BERT on SQuAD1.0

@@ -51,7 +51,7 @@ exact_match = 81.22
 Here is an example using distributed training on 8 V100 GPUs and Bert Whole Word Masking uncased model to reach a F1 > 93 on SQuAD1.1:

 ```bash
-python -m torch.distributed.launch --nproc_per_node=8 ./examples/run_squad.py \
+python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answering/run_squad.py \
    --model_type bert \
    --model_name_or_path bert-large-uncased-whole-word-masking \
    --do_train \
@@ -157,3 +157,23 @@ Larger batch size may improve the performance while costing more memory.
 }
 ```

+## SQuAD with the Tensorflow Trainer
+
+```bash
+python run_tf_squad.py \
+    --model_name_or_path bert-base-uncased \
+    --output_dir model \
+    --max_seq_length 384 \
+    --num_train_epochs 2 \
+    --per_gpu_train_batch_size 8 \
+    --per_gpu_eval_batch_size 16 \
+    --do_train \
+    --logging_dir logs \
+    --mode question-answering \
+    --logging_steps 10 \
+    --learning_rate 3e-5 \
+    --doc_stride 128 \
+    --optimizer_name adamw
+```
+
+For the moment, only training is available with the Tensorflow Trainer; evaluation is not yet supported.
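
Flag spelling matters for commands like the one in this README, because `HfArgumentParser` builds on Python's `argparse`. The sketch below uses plain `argparse` with two option names taken from the scripts; it is an illustration of standard-library behaviour under that assumption, not a reproduction of `HfArgumentParser` itself.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--max_seq_length", type=int, default=128)
parser.add_argument("--overwrite_output_dir", action="store_true")

# The underscore spelling matches the declared option.
ok, extra_ok = parser.parse_known_args(["--max_seq_length", "384"])

# A hyphenated spelling is a different string and is left unparsed,
# so the default value stays in place.
hyphen, extra_hyphen = parser.parse_known_args(["--max-seq-length", "384"])

# A unique prefix is accepted as an abbreviation (allow_abbrev defaults to True).
abbrev, extra_abbrev = parser.parse_known_args(["--overwrite_output"])

print(ok.max_seq_length, hyphen.max_seq_length, abbrev.overwrite_output_dir)
```

With `parse_args` instead of `parse_known_args`, the unrecognized hyphenated flag would stop the program with a usage error rather than being collected.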
--- a/examples/question-answering/run_tf_squad.py
+++ b/examples/question-answering/run_tf_squad.py
@@ -0,0 +1,237 @@
+# coding=utf-8
+# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
+# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" Fine-tuning the library models for question-answering."""
+
+
+import logging
+import os
+from dataclasses import dataclass, field
+from typing import Optional
+
+from transformers import (
+    AutoConfig,
+    AutoTokenizer,
+    HfArgumentParser,
+    TFAutoModelForQuestionAnswering,
+    TFTrainer,
+    TFTrainingArguments,
+    squad_convert_examples_to_features,
+)
+from transformers.data.processors.squad import SquadV1Processor, SquadV2Processor
+
+
+logger = logging.getLogger(__name__)
+
+
+@dataclass
+class ModelArguments:
+    """
+    Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
+    """
+
+    model_name_or_path: str = field(
+        metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
+    )
+    config_name: Optional[str] = field(
+        default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
+    )
+    tokenizer_name: Optional[str] = field(
+        default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
+    )
+    use_fast: bool = field(default=False, metadata={"help": "Set this flag to use fast tokenization."})
+    # If you want to tweak more attributes on your tokenizer, you should do it in a distinct script,
+    # or just modify its tokenizer_config.json.
+    cache_dir: Optional[str] = field(
+        default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
+    )
+
+
+@dataclass
+class DataTrainingArguments:
+    """
+    Arguments pertaining to what data we are going to input our model for training and eval.
+    """
+
+    data_dir: Optional[str] = field(
+        default=None, metadata={"help": "The input data dir. Should contain the .json files for the SQuAD task."}
+    )
+    max_seq_length: int = field(
+        default=128,
+        metadata={
+            "help": "The maximum total input sequence length after tokenization. Sequences longer "
+            "than this will be truncated, sequences shorter will be padded."
+        },
+    )
+    doc_stride: int = field(
+        default=128,
+        metadata={"help": "When splitting up a long document into chunks, how much stride to take between chunks."},
+    )
+    max_query_length: int = field(
+        default=64,
+        metadata={
+            "help": "The maximum number of tokens for the question. Questions longer than this will "
+            "be truncated to this length."
+        },
+    )
+    max_answer_length: int = field(
+        default=30,
+        metadata={
+            "help": "The maximum length of an answer that can be generated. This is needed because the start "
+            "and end predictions are not conditioned on one another."
+        },
+    )
+    overwrite_cache: bool = field(
+        default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
+    )
+    version_2_with_negative: bool = field(
+        default=False, metadata={"help": "If true, the SQuAD examples contain some that do not have an answer."}
+    )
+    null_score_diff_threshold: float = field(
+        default=0.0, metadata={"help": "If null_score - best_non_null is greater than the threshold predict null."}
+    )
+    n_best_size: int = field(
+        default=20, metadata={"help": "The total number of n-best predictions to generate."}
+    )
+    lang_id: int = field(
+        default=0,
+        metadata={
+            "help": "language id of input for language-specific xlm models (see tokenization_xlm.PRETRAINED_INIT_CONFIGURATION)"
+        },
+    )
+
+
+def main():
+    # See all possible arguments in src/transformers/training_args.py
+    # or by passing the --help flag to this script.
+    # We now keep distinct sets of args, for a cleaner separation of concerns.
+    parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TFTrainingArguments))
+    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
+
+    if (
+        os.path.exists(training_args.output_dir)
+        and os.listdir(training_args.output_dir)
+        and training_args.do_train
+        and not training_args.overwrite_output_dir
+    ):
+        raise ValueError(
+            f"Output directory ({training_args.output_dir}) already exists and is not empty. Use --overwrite_output_dir to overcome."
+        )
+
+    # Setup logging
+    logging.basicConfig(
+        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
+        datefmt="%m/%d/%Y %H:%M:%S",
+        level=logging.INFO,
+    )
+    logger.info(
+        "n_gpu: %s, distributed training: %s, 16-bits training: %s",
+        training_args.n_gpu,
+        bool(training_args.n_gpu > 1),
+        training_args.fp16,
+    )
+    logger.info("Training/evaluation parameters %s", training_args)
+
+    # Prepare Question-Answering task
+    # Load pretrained model and tokenizer
+    #
+    # Distributed training:
+    # The .from_pretrained methods guarantee that only one local process can concurrently
+    # download model & vocab.
+
+    config = AutoConfig.from_pretrained(
+        model_args.config_name if model_args.config_name else model_args.model_name_or_path,
+        cache_dir=model_args.cache_dir,
+    )
+    tokenizer = AutoTokenizer.from_pretrained(
+        model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
+        cache_dir=model_args.cache_dir,
+        use_fast=model_args.use_fast,
+    )
+
+    with training_args.strategy.scope():
+        model = TFAutoModelForQuestionAnswering.from_pretrained(
+            model_args.model_name_or_path,
+            from_pt=bool(".bin" in model_args.model_name_or_path),
+            config=config,
+            cache_dir=model_args.cache_dir,
+        )
+
+    # Get datasets
+    if not data_args.data_dir:
+        if data_args.version_2_with_negative:
+            logger.warning("tensorflow_datasets does not handle version 2 of SQuAD. Switching to version 1 automatically.")
+
+        try:
+            import tensorflow_datasets as tfds
+        except ImportError:
+            raise ImportError("If no data_dir is specified, tensorflow_datasets needs to be installed.")
+
+        tfds_examples = tfds.load("squad")
+        train_examples = (
+            SquadV1Processor().get_examples_from_dataset(tfds_examples, evaluate=False)
+            if training_args.do_train
+            else None
+        )
+        eval_examples = (
+            SquadV1Processor().get_examples_from_dataset(tfds_examples, evaluate=True)
+            if training_args.do_eval
+            else None
+        )
+    else:
+        processor = SquadV2Processor() if data_args.version_2_with_negative else SquadV1Processor()
+        train_examples = processor.get_train_examples(data_args.data_dir) if training_args.do_train else None
+        eval_examples = processor.get_dev_examples(data_args.data_dir) if training_args.do_eval else None
+
+    train_dataset = (
+        squad_convert_examples_to_features(
+            examples=train_examples,
+            tokenizer=tokenizer,
+            max_seq_length=data_args.max_seq_length,
+            doc_stride=data_args.doc_stride,
+            max_query_length=data_args.max_query_length,
+            is_training=True,
+            return_dataset="tf",
+        )
+        if training_args.do_train
+        else None
+    )
+
+    eval_dataset = (
+        squad_convert_examples_to_features(
+            examples=eval_examples,
+            tokenizer=tokenizer,
+            max_seq_length=data_args.max_seq_length,
+            doc_stride=data_args.doc_stride,
+            max_query_length=data_args.max_query_length,
+            is_training=False,
+            return_dataset="tf",
+        )
+        if training_args.do_eval
+        else None
+    )
+
+    # Initialize our Trainer
+    trainer = TFTrainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset,)
+
+    # Training
+    if training_args.do_train:
+        trainer.train()
+        trainer.save_model()
+        tokenizer.save_pretrained(training_args.output_dir)
+
+
+if __name__ == "__main__":
+    main()
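
The `doc_stride` help text in `DataTrainingArguments` describes splitting a long document into chunks. The sketch below shows one common reading of that idea: a sliding window where `doc_stride` is the step between consecutive chunk starts. This is an assumption for illustration only; `sliding_windows` is a hypothetical helper, and `squad_convert_examples_to_features` may compute its windows differently (for instance, it also has to reserve room for the question and special tokens).

```python
from typing import List, Tuple


def sliding_windows(num_tokens: int, max_len: int, doc_stride: int) -> List[Tuple[int, int]]:
    """Return (start, length) spans that together cover num_tokens tokens."""
    if max_len <= 0 or doc_stride <= 0:
        raise ValueError("max_len and doc_stride must be positive")
    spans = []
    start = 0
    while start < num_tokens:
        length = min(max_len, num_tokens - start)
        spans.append((start, length))
        if start + length == num_tokens:
            break  # the last window reached the end of the document
        # Never step further than the window just emitted, so no token is skipped.
        start += min(length, doc_stride)
    return spans


spans = sliding_windows(num_tokens=10, max_len=4, doc_stride=2)
print(spans)
```

With a stride smaller than the window length, neighbouring chunks overlap, so an answer that straddles one chunk boundary can still appear whole in the next chunk.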
--- a/examples/test_examples.py
+++ b/examples/test_examples.py
@@ -72,7 +72,7 @@ class ExamplesTests(unittest.TestCase):
            """.split()
        with patch.object(sys, "argv", testargs):
            result = run_glue.main()
-            del result["loss"]
+            del result["eval_loss"]
            for value in result.values():
                self.assertGreaterEqual(value, 0.75)

--- a/examples/text-classification/README.md
+++ b/examples/text-classification/README.md
@@ -2,7 +2,7 @@

 # Run TensorFlow 2.0 version

-Based on the script [`run_tf_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/run_tf_glue.py).
+Based on the script [`run_tf_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_tf_glue.py).

 Fine-tuning the library TensorFlow 2.0 Bert model for sequence classification on the  MRPC task of the GLUE benchmark: [General Language Understanding Evaluation](https://gluebenchmark.com/).

@@ -85,10 +85,12 @@ CoLA, SST-2. The following section provides details on how to run half-precision
 said, there shouldn’t be any issues in running half-precision training with the remaining GLUE tasks as well,
 since the data processor for each task inherits from the base class DataProcessor.

-## Running on TPUs
+## Running on TPUs in PyTorch

-You can accelerate your workloads on Google's TPUs. For information on how to setup your TPU environment refer to this
-[README](https://github.com/pytorch/xla/blob/master/README.md).
+**Update**: read the more up-to-date [Running on TPUs](../README.md#running-on-tpus) in the main README.md instead.
+
+Even when running PyTorch, you can accelerate your workloads on Google's TPUs, using `pytorch/xla`. For information on how to set up your TPU environment, refer to the
+[pytorch/xla README](https://github.com/pytorch/xla/blob/master/README.md).

 The following are some examples of running the `*_tpu.py` finetuning scripts on TPUs. All steps for data preparation are
 identical to your normal GPU + Huggingface setup.
@@ -101,7 +103,6 @@ export GLUE_DIR=/path/to/glue
 export TASK_NAME=MNLI

 python run_glue_tpu.py \
-  --model_type bert \
  --model_name_or_path bert-base-cased \
  --task_name $TASK_NAME \
  --do_train \
@@ -115,8 +116,7 @@ python run_glue_tpu.py \
  --overwrite_output_dir \
  --logging_steps 50 \
  --save_steps 200 \
-  --num_cores=8 \
-  --only_log_master
+  --num_cores=8
 ```

 ### MRPC
@@ -256,7 +256,7 @@ TEST RESULTS {'val_loss': tensor(0.0707), 'precision': 0.852427800698191, 'recal

 # XNLI

-Based on the script [`run_xnli.py`](https://github.com/huggingface/transformers/blob/master/examples/run_xnli.py).
+Based on the script [`run_xnli.py`](https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_xnli.py).

 [XNLI](https://www.nyu.edu/projects/bowman/xnli/) is a crowd-sourced dataset based on [MultiNLI](http://www.nyu.edu/projects/bowman/multinli/). It is an evaluation benchmark for cross-lingual text representations. Pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-resource languages such as English and low-resource languages such as Swahili).

--- a/examples/text-classification/run_glue.py
+++ b/examples/text-classification/run_glue.py
@@ -134,16 +134,8 @@ def main():
    )

    # Get datasets
-    train_dataset = (
-        GlueDataset(data_args, tokenizer=tokenizer, local_rank=training_args.local_rank)
-        if training_args.do_train
-        else None
-    )
-    eval_dataset = (
-        GlueDataset(data_args, tokenizer=tokenizer, local_rank=training_args.local_rank, evaluate=True)
-        if training_args.do_eval
-        else None
-    )
+    train_dataset = GlueDataset(data_args, tokenizer=tokenizer) if training_args.do_train else None
+    eval_dataset = GlueDataset(data_args, tokenizer=tokenizer, evaluate=True) if training_args.do_eval else None

    def compute_metrics(p: EvalPrediction) -> Dict:
        if output_mode == "classification":
@@ -181,9 +173,7 @@ def main():
        eval_datasets = [eval_dataset]
        if data_args.task_name == "mnli":
            mnli_mm_data_args = dataclasses.replace(data_args, task_name="mnli-mm")
-            eval_datasets.append(
-                GlueDataset(mnli_mm_data_args, tokenizer=tokenizer, local_rank=training_args.local_rank, evaluate=True)
-            )
+            eval_datasets.append(GlueDataset(mnli_mm_data_args, tokenizer=tokenizer, evaluate=True))

        for eval_dataset in eval_datasets:
            result = trainer.evaluate(eval_dataset=eval_dataset)
--- a/examples/text-generation/README.md
+++ b/examples/text-generation/README.md
@@ -1,6 +1,6 @@
 ## Language generation

-Based on the script [`run_generation.py`](https://github.com/huggingface/transformers/blob/master/examples/run_generation.py).
+Based on the script [`run_generation.py`](https://github.com/huggingface/transformers/blob/master/examples/text-generation/run_generation.py).

 Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL, XLNet, CTRL.
 A similar script is used for our official demo [Write With Transformer](https://transformer.huggingface.co), where you
--- a/examples/text-generation/pplm/run_pplm.py
+++ b/examples/text-generation/pplm/run_pplm.py
@@ -17,10 +17,10 @@

 """
 Example command with bag of words:
-python examples/run_pplm.py -B space --cond_text "The president" --length 100 --gamma 1.5 --num_iterations 3 --num_samples 10 --stepsize 0.01 --window_length 5 --kl_scale 0.01 --gm_scale 0.95
+python run_pplm.py -B space --cond_text "The president" --length 100 --gamma 1.5 --num_iterations 3 --num_samples 10 --stepsize 0.01 --window_length 5 --kl_scale 0.01 --gm_scale 0.95

 Example command with discriminator:
-python examples/run_pplm.py -D sentiment --class_label 3 --cond_text "The lake" --length 10 --gamma 1.0 --num_iterations 30 --num_samples 10 --stepsize 0.01 --kl_scale 0.01 --gm_scale 0.95
+python run_pplm.py -D sentiment --class_label 3 --cond_text "The lake" --length 10 --gamma 1.0 --num_iterations 30 --num_samples 10 --stepsize 0.01 --kl_scale 0.01 --gm_scale 0.95
 """

 import argparse
--- a/examples/token-classification/README.md
+++ b/examples/token-classification/README.md
@@ -1,7 +1,7 @@
 ## Named Entity Recognition

-Based on the scripts [`run_ner.py`](https://github.com/huggingface/transformers/blob/master/examples/ner/run_ner.py) for Pytorch and
-[`run_tf_ner.py`](https://github.com/huggingface/transformers/blob/master/examples/ner/run_tf_ner.py) for Tensorflow 2.
+Based on the scripts [`run_ner.py`](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner.py) for Pytorch and
+[`run_tf_ner.py`](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_tf_ner.py) for Tensorflow 2.
 This example fine-tunes BERT Multilingual on GermEval 2014 (German NER).
 Details and results for the fine-tuning provided by @stefan-it.

--- a/examples/token-classification/run_ner.py
+++ b/examples/token-classification/run_ner.py
@@ -292,5 +292,10 @@ def main():
    return results


+def _mp_fn(index):
+    # For xla_spawn (TPUs)
+    main()
+
+
 if __name__ == "__main__":
    main()
--- a/examples/token-classification/test_ner_examples.py
+++ b/examples/token-classification/test_ner_examples.py
@@ -6,7 +6,7 @@ from unittest.mock import patch
 import run_ner


-logging.basicConfig(level=logging.DEBUG)
+logging.basicConfig(level=logging.INFO)

 logger = logging.getLogger()

@@ -30,4 +30,4 @@ class ExamplesTests(unittest.TestCase):
            """.split()
        with patch.object(sys, "argv", ["run.py"] + testargs):
            result = run_ner.main()
-            self.assertLess(result["loss"], 1.5)
+            self.assertLess(result["eval_loss"], 1.5)
--- a/examples/xla_spawn.py
+++ b/examples/xla_spawn.py
@@ -12,17 +12,13 @@ Inspired by https://github.com/pytorch/pytorch/blob/master/torch/distributed/lau


 import importlib
-import os
 import sys
 from argparse import REMAINDER, ArgumentParser
+from pathlib import Path

 import torch_xla.distributed.xla_multiprocessing as xmp


-def trim_suffix(s: str, suffix: str):
-    return s if not s.endswith(suffix) or len(suffix) == 0 else s[: -len(suffix)]
-
-
 def parse_args():
    """
    Helper function parsing the command line options
@@ -44,7 +40,7 @@ def parse_args():
        "training_script",
        type=str,
        help=(
-            "The full module name to the single TPU training "
+            "The full path to the single TPU training "
            "program/script to be launched in parallel, "
            "followed by all the arguments for the "
            "training script"
@@ -61,7 +57,9 @@ def main():
    args = parse_args()

    # Import training_script as a module.
-    mod_name = trim_suffix(os.path.basename(args.training_script), ".py")
+    script_fpath = Path(args.training_script)
+    sys.path.append(str(script_fpath.parent.resolve()))
+    mod_name = script_fpath.stem
    mod = importlib.import_module(mod_name)

    # Patch sys.argv
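
The path-based import in the hunk above can be tried outside a TPU environment. The following is a minimal sketch, assuming a throwaway script named `demo_train_script.py` (hypothetical) that exposes an `_mp_fn(index)` entry point like the one added to `run_ner.py`; it does not use `torch_xla` and only illustrates how `Path.parent` and `Path.stem` replace the old `trim_suffix` helper.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_script_as_module(script_path: str):
    """Import a standalone script by file path, following the Path-based logic in the diff."""
    script_fpath = Path(script_path)
    # Make the script's directory importable, then import it by its file stem.
    sys.path.append(str(script_fpath.parent.resolve()))
    return importlib.import_module(script_fpath.stem)


# Hypothetical stand-in for a training script; the real scripts call main() inside _mp_fn.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_script = Path(tmp_dir) / "demo_train_script.py"
    demo_script.write_text("def _mp_fn(index):\n    return f'worker {index}'\n")
    mod = import_script_as_module(str(demo_script))
    result = mod._mp_fn(0)
```

Because the parent directory is appended to `sys.path`, the launcher can be pointed at a script in any folder (for example one of the new `examples/<task>/` subdirectories) rather than only at scripts in the current working directory.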
--- a/model_cards/LorenzoDeMattei/GePpeTto/README.md
+++ b/model_cards/LorenzoDeMattei/GePpeTto/README.md
@@ -59,56 +59,64 @@ tokenizer = GPT2Tokenizer.from_pretrained(
 ## Example using GPT2LMHeadModel

 ```python
-from transformers import GPT2Tokenizer, GPT2LMHeadModel
+from transformers import AutoTokenizer, AutoModelWithLMHead, pipeline, GPT2Tokenizer

-tokenizer = GPT2Tokenizer.from_pretrained('LorenzoDeMattei/GePpeTto')
-model = GPT2LMHeadModel.from_pretrained(
-    'LorenzoDeMattei/GePpeTto', pad_token_id = tokenizer.eos_token_id
+tokenizer = AutoTokenizer.from_pretrained("LorenzoDeMattei/GePpeTto")
+model = AutoModelWithLMHead.from_pretrained("LorenzoDeMattei/GePpeTto")
+
+text_generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
+prompts = [
+    "Wikipedia Geppetto",
+    "Maestro Ciliegia regala il pezzo di legno al suo amico Geppetto, il quale lo prende per fabbricarsi un burattino maraviglioso"]
+
+
+samples_outputs = text_generator(
+    prompts,
+    do_sample=True,
+    max_length=50,
+    top_k=50,
+    top_p=0.95,
+    num_return_sequences=3
 )

-input_ids = tokenizer.encode(
-    'Wikipedia Geppetto', return_tensors = 'pt'
-)
-sample_outputs = model.generate(
-    input_ids,
-    do_sample = True,
-    max_length = 50,
-    top_k = 50,
-    top_p = 0.95,
-    num_return_sequences = 3,
-)

-print('Output:\n' + 100 * '-')
-for i, sample_output in enumerate(sample_outputs):
-    print(
-        '{}: {}'.format(
-            i, tokenizer.decode(sample_output, skip_special_tokens = True)
-        )
-    )
+for i, sample_outputs in enumerate(samples_outputs):
+    print(100 * '-')
+    print("Prompt:", prompts[i])
+    for sample_output in sample_outputs:
+        print("Sample:", sample_output['generated_text'])
+        print()
+
 ```

 Output is,

-```text
-Output:
+```
 ----------------------------------------------------------------------------------------------------
-0: Wikipedia Geppetto
+Prompt: Wikipedia Geppetto
+Sample: Wikipedia Geppetto rosso (film 1920)

-Geppetto è una città degli Stati Uniti d'America, situata nello Stato dell'Iowa, nella Contea di Greene.
+Geppetto rosso ("The Smokes in the Black") è un film muto del 1920 diretto da Henry H. Leonard.

-Wikipedia The Sax
+Il film fu prodotto dalla Selig Poly

-The Sax è il primo album discografico
-2: Wikipedia Geppetto/Passione
+Sample: Wikipedia Geppetto

-Geppetto è il primo album in studio dei Saturday Night Live, pubblicato dalla Iron Maiden nel 1974.
+Geppetto ("Geppetto" in piemontese) è un comune italiano di 978 abitanti della provincia di Cuneo in Piemonte.

-L'album è un lavoro di debutto che lo porta a definire
-3: Wikipedia Geppetto
+L'abitato, che si trova nel versante valtellinese, si sviluppa nella

-Geppetto ("Fenëvëv" in calabrese) è un comune italiano di abitanti della regione Calabria.
+Sample: Wikipedia Geppetto di Natale (romanzo)

-Zona di particolare pregio storico-artistico, paesaggistico, storico-artistico,
+Geppetto di Natale è un romanzo di Mario Caiano, pubblicato nel 2012.
+
+----------------------------------------------------------------------------------------------------
+Prompt: Maestro Ciliegia regala il pezzo di legno al suo amico Geppetto, il quale lo prende per fabbricarsi un burattino maraviglioso
+Sample: Maestro Ciliegia regala il pezzo di legno al suo amico Geppetto, il quale lo prende per fabbricarsi un burattino maraviglioso. Il burattino riesce a scappare. Dopo aver trovato un prezioso sacchetto si reca
+
+Sample: Maestro Ciliegia regala il pezzo di legno al suo amico Geppetto, il quale lo prende per fabbricarsi un burattino maraviglioso, e l'unico che lo possiede, ma, di fronte a tutte queste prove
+
+Sample: Maestro Ciliegia regala il pezzo di legno al suo amico Geppetto, il quale lo prende per fabbricarsi un burattino maraviglioso: - A voi gli occhi, le guance! A voi il mio pezzo!
 ```

 ## Citation
--- a/model_cards/ViktorAlm/electra-base-norwegian-uncased-discriminator/README.md
+++ b/model_cards/ViktorAlm/electra-base-norwegian-uncased-discriminator/README.md
@@ -0,0 +1,19 @@
+# Norwegian Electra
+Image incoming; I'm going to have some fun with this one.
+
+Trained on Oscar + wikipedia + opensubtitles + some other data I had with the awesome power of TPUs (v3-8)
+
+Use with caution. I have no downstream tasks in Norwegian to test on so I have no idea of its performance yet.
+
+# Acknowledgments
+
+### TensorFlow Research Cloud
+Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC). Thanks for providing access to the TFRC ❤️
+- https://www.tensorflow.org/tfrc
+
+#### OSCAR corpus
+- https://oscar-corpus.com/
+
+#### OPUS
+- http://opus.nlpl.eu/
+- http://www.opensubtitles.org/
--- a/model_cards/allegro/herbert-klej-cased-tokenizer-v1/README.md
+++ b/model_cards/allegro/herbert-klej-cased-tokenizer-v1/README.md
@@ -0,0 +1,44 @@
+---
+language: polish
+---
+
+# HerBERT tokenizer
+
+**[HerBERT](https://en.wikipedia.org/wiki/Zbigniew_Herbert)** tokenizer is a character-level byte-pair encoding with a
+vocabulary size of 50k tokens. The tokenizer was trained on [Wolne Lektury](https://wolnelektury.pl/) and a publicly available subset of
+[National Corpus of Polish](http://nkjp.pl/index.php?page=14&lang=0) with the [fastBPE](https://github.com/glample/fastBPE) library.
+The tokenizer uses the `XLMTokenizer` implementation from [transformers](https://github.com/huggingface/transformers).
+
+## Tokenizer usage
+The HerBERT tokenizer should be used together with the [HerBERT model](https://huggingface.co/allegro/herbert-klej-cased-v1):
+```python
+from transformers import XLMTokenizer, RobertaModel
+
+tokenizer = XLMTokenizer.from_pretrained("allegro/herbert-klej-cased-tokenizer-v1")
+model = RobertaModel.from_pretrained("allegro/herbert-klej-cased-v1")
+
+encoded_input = tokenizer.encode("Kto ma lepszą sztukę, ma lepszy rząd – to jasne.", return_tensors='pt')
+outputs = model(encoded_input)
+```
+
+## License
+CC BY-SA 4.0
+
+## Citation
+If you use this tokenizer, please cite the following paper:
+```
+@misc{rybak2020klej,
+    title={KLEJ: Comprehensive Benchmark for Polish Language Understanding},
+    author={Piotr Rybak and Robert Mroczkowski and Janusz Tracz and Ireneusz Gawlik},
+    year={2020},
+    eprint={2005.00630},
+    archivePrefix={arXiv},
+    primaryClass={cs.CL}
+}
+```
+The paper has been accepted at ACL 2020; we will update the BibTeX as soon as the proceedings appear.
+
+## Authors
+The tokenizer was created by the **Allegro Machine Learning Research** team.
+
+You can contact us at: <a href="mailto:klejbenchmark@allegro.pl">klejbenchmark@allegro.pl</a>
--- a/model_cards/allegro/herbert-klej-cased-v1/README.md
+++ b/model_cards/allegro/herbert-klej-cased-v1/README.md
@@ -0,0 +1,85 @@
+---
+language: polish
+---
+
+# HerBERT 
+**[HerBERT](https://en.wikipedia.org/wiki/Zbigniew_Herbert)** is a BERT-based language model trained on Polish corpora
+using only the MLM objective with dynamic masking of whole words. For more details, please refer to: 
+[KLEJ: Comprehensive Benchmark for Polish Language Understanding](https://arxiv.org/abs/2005.00630).
+
+## Dataset
+**HerBERT** training dataset is a combination of several publicly available corpora for Polish language:
+
+| Corpus | Tokens | Texts |
+| :------ | ------: | ------: |
+| [OSCAR](https://traces1.inria.fr/oscar/)| 6710M  | 145M |
+| [Open Subtitles](http://opus.nlpl.eu/OpenSubtitles-v2018.php) | 1084M  | 1.1M |
+| [Wikipedia](https://dumps.wikimedia.org/) | 260M  | 1.5M |
+| [Wolne Lektury](https://wolnelektury.pl/) | 41M  | 5.5k |
+| [Allegro Articles](https://allegro.pl/artykuly) | 18M  | 33k |
+
+## Tokenizer
+The training dataset was tokenized into subwords using the [HerBERT Tokenizer](https://huggingface.co/allegro/herbert-klej-cased-tokenizer-v1), a character-level byte-pair encoding with
+a vocabulary size of 50k tokens. The tokenizer itself was trained on [Wolne Lektury](https://wolnelektury.pl/) and a publicly available subset of 
+[National Corpus of Polish](http://nkjp.pl/index.php?page=14&lang=0) with the [fastBPE](https://github.com/glample/fastBPE) library.
+
+The tokenizer uses the `XLMTokenizer` implementation; for that reason, one should load it as `allegro/herbert-klej-cased-tokenizer-v1`.
+
+## HerBERT models summary
+| Model | WWM | Cased | Tokenizer | Vocab Size  | Batch Size | Train Steps |
+| :------ | ------: | ------: | ------: | ------: | ------: | ------: |
+| herbert-klej-cased-v1 | YES | YES | BPE | 50K | 570 | 180k | 
+
+## Model evaluation
+HerBERT was evaluated on the [KLEJ](https://klejbenchmark.com/) benchmark, a publicly available set of nine evaluation tasks for Polish language understanding.
+It had the best average performance and obtained the best results for three of the tasks.
+
+| Model | Average | NKJP-NER | CDSC-E | CDSC-R | CBD | PolEmo2.0-IN | PolEmo2.0-OUT | DYK | PSC | AR |
+| :------ | ------: | ------: | ------: | ------: | ------: | ------: | ------: | ------: | ------: | ------: |
+| herbert-klej-cased-v1 | **80.5** | 92.7 | 92.5 | 91.9 | **50.3** | **89.2** | **76.3** | 52.1 | 95.3 | 84.5 |
+
+Full leaderboard is available [online](https://klejbenchmark.com/leaderboard). 
+
+
+## HerBERT usage
+Model training and experiments were conducted with [transformers](https://github.com/huggingface/transformers) in version 2.0.
+
+Example code:
+```python
+from transformers import XLMTokenizer, RobertaModel
+
+tokenizer = XLMTokenizer.from_pretrained("allegro/herbert-klej-cased-tokenizer-v1")
+model = RobertaModel.from_pretrained("allegro/herbert-klej-cased-v1")
+
+encoded_input = tokenizer.encode("Kto ma lepszą sztukę, ma lepszy rząd – to jasne.", return_tensors='pt')
+outputs = model(encoded_input)
+```
+
+HerBERT can also be loaded using `AutoTokenizer` and `AutoModel`:
+
+```python
+tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-klej-cased-tokenizer-v1")
+model = AutoModel.from_pretrained("allegro/herbert-klej-cased-v1")
+```
+
+## License
+CC BY-SA 4.0
+
+## Citation
+If you use this model, please cite the following paper:
+```
+@misc{rybak2020klej,
+    title={KLEJ: Comprehensive Benchmark for Polish Language Understanding},
+    author={Piotr Rybak and Robert Mroczkowski and Janusz Tracz and Ireneusz Gawlik},
+    year={2020},
+    eprint={2005.00630},
+    archivePrefix={arXiv},
+    primaryClass={cs.CL}
+}
+```
+The paper has been accepted at ACL 2020; we will update the BibTeX as soon as the proceedings appear.
+
+## Authors
+The model was trained by the **Allegro Machine Learning Research** team.
+
+You can contact us at: <a href="mailto:klejbenchmark@allegro.pl">klejbenchmark@allegro.pl</a>
--- a/model_cards/dbmdz/electra-base-turkish-cased-discriminator/README.md
+++ b/model_cards/dbmdz/electra-base-turkish-cased-discriminator/README.md
@@ -0,0 +1,79 @@
+---
+language: turkish
+license: mit
+---
+
+# 🤗 + 📚 dbmdz Turkish ELECTRA model
+
+In this repository the MDZ Digital Library team (dbmdz) at the Bavarian State
+Library open sources a cased ELECTRA base model for Turkish 🎉
+
+# Turkish ELECTRA model
+
+We release a base ELEC**TR**A model for Turkish, that was trained on the same data as *BERTurk*.
+
+> ELECTRA is a new method for self-supervised language representation learning. It can be used to
+> pre-train transformer networks using relatively little compute. ELECTRA models are trained to
+> distinguish "real" input tokens vs "fake" input tokens generated by another neural network, similar to
+> the discriminator of a GAN.
+
+More details about ELECTRA can be found in the [ICLR paper](https://openreview.net/forum?id=r1xMH1BtvB)
+or in the [official ELECTRA repository](https://github.com/google-research/electra) on GitHub.
+
+## Stats
+
+The current version of the model is trained on a filtered and sentence
+segmented version of the Turkish [OSCAR corpus](https://traces1.inria.fr/oscar/),
+a recent Wikipedia dump, various [OPUS corpora](http://opus.nlpl.eu/) and a
+special corpus provided by [Kemal Oflazer](http://www.andrew.cmu.edu/user/ko/).
+
+The final training corpus has a size of 35GB and 4,404,976,662 tokens.
+
+Thanks to Google's TensorFlow Research Cloud (TFRC) we could train a cased model
+on a TPU v3-8 for 1M steps.
+
+## Model weights
+
+[Transformers](https://github.com/huggingface/transformers)
+compatible weights for both PyTorch and TensorFlow are available.
+
+| Model                                            | Downloads
+| ------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------
+| `dbmdz/electra-base-turkish-cased-discriminator` | [`config.json`](https://cdn.huggingface.co/dbmdz/electra-base-turkish-cased-discriminator/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/electra-base-turkish-cased-discriminator/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/electra-base-turkish-cased-discriminator/vocab.txt)
+
+## Usage
+
+With Transformers >= 2.8 our ELECTRA base cased model can be loaded like:
+
+```python
+from transformers import AutoModelWithLMHead, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("dbmdz/electra-base-turkish-cased-discriminator")
+model = AutoModelWithLMHead.from_pretrained("dbmdz/electra-base-turkish-cased-discriminator")
+```
+
+## Results
+
+For results on PoS tagging or NER tasks, please refer to
+[this repository](https://github.com/stefan-it/turkish-bert/electra).
+
+# Huggingface model hub
+
+All models are available on the [Huggingface model hub](https://huggingface.co/dbmdz).
+
+# Contact (Bugs, Feedback, Contribution and more)
+
+For questions about our ELECTRA models just open an issue
+[here](https://github.com/dbmdz/berts/issues/new) 🤗
+
+# Acknowledgments
+
+Thanks to [Kemal Oflazer](http://www.andrew.cmu.edu/user/ko/) for providing us
+additional large corpora for Turkish. Many thanks to Reyyan Yeniterzi for providing
+us the Turkish NER dataset for evaluation.
+
+Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
+Thanks for providing access to the TFRC ❤️
+
+Thanks to the generous support from the [Hugging Face](https://huggingface.co/) team,
+it is possible to download both cased and uncased models from their S3 storage 🤗
--- a/model_cards/dbmdz/electra-small-turkish-cased-discriminator/README.md
+++ b/model_cards/dbmdz/electra-small-turkish-cased-discriminator/README.md
@@ -0,0 +1,79 @@
+---
+language: turkish
+license: mit
+---
+
+# 🤗 + 📚 dbmdz Turkish ELECTRA model
+
+In this repository the MDZ Digital Library team (dbmdz) at the Bavarian State
+Library open sources a cased ELECTRA small model for Turkish 🎉
+
+# Turkish ELECTRA model
+
+We release a small ELEC**TR**A model for Turkish, that was trained on the same data as *BERTurk*.
+
+> ELECTRA is a new method for self-supervised language representation learning. It can be used to
+> pre-train transformer networks using relatively little compute. ELECTRA models are trained to
+> distinguish "real" input tokens vs "fake" input tokens generated by another neural network, similar to
+> the discriminator of a GAN.
+
+More details about ELECTRA can be found in the [ICLR paper](https://openreview.net/forum?id=r1xMH1BtvB)
+or in the [official ELECTRA repository](https://github.com/google-research/electra) on GitHub.
+
+## Stats
+
+The current version of the model is trained on a filtered and sentence
+segmented version of the Turkish [OSCAR corpus](https://traces1.inria.fr/oscar/),
+a recent Wikipedia dump, various [OPUS corpora](http://opus.nlpl.eu/) and a
+special corpus provided by [Kemal Oflazer](http://www.andrew.cmu.edu/user/ko/).
+
+The final training corpus has a size of 35GB and 4,404,976,662 tokens.
+
+Thanks to Google's TensorFlow Research Cloud (TFRC) we could train a cased model
+on a TPU v3-8 for 1M steps.
+
+## Model weights
+
+[Transformers](https://github.com/huggingface/transformers)
+compatible weights for both PyTorch and TensorFlow are available.
+
+| Model                                             | Downloads
+| ------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------
+| `dbmdz/electra-small-turkish-cased-discriminator` | [`config.json`](https://cdn.huggingface.co/dbmdz/electra-small-turkish-cased-discriminator/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/electra-small-turkish-cased-discriminator/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/electra-small-turkish-cased-discriminator/vocab.txt)
+
+## Usage
+
+With Transformers >= 2.8 our ELECTRA small cased model can be loaded like:
+
+```python
+from transformers import AutoModelWithLMHead, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("dbmdz/electra-small-turkish-cased-discriminator")
+model = AutoModelWithLMHead.from_pretrained("dbmdz/electra-small-turkish-cased-discriminator")
+```
+
+## Results
+
+For results on PoS tagging or NER tasks, please refer to
+[this repository](https://github.com/stefan-it/turkish-bert/electra).
+
+# Huggingface model hub
+
+All models are available on the [Huggingface model hub](https://huggingface.co/dbmdz).
+
+# Contact (Bugs, Feedback, Contribution and more)
+
+For questions about our ELECTRA models just open an issue
+[here](https://github.com/dbmdz/berts/issues/new) 🤗
+
+# Acknowledgments
+
+Thanks to [Kemal Oflazer](http://www.andrew.cmu.edu/user/ko/) for providing us
+additional large corpora for Turkish. Many thanks to Reyyan Yeniterzi for providing
+us the Turkish NER dataset for evaluation.
+
+Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
+Thanks for providing access to the TFRC ❤️
+
+Thanks to the generous support from the [Hugging Face](https://huggingface.co/) team,
+it is possible to download both cased and uncased models from their S3 storage 🤗
--- a/model_cards/fmikaelian/camembert-base-fquad/README.md
+++ b/model_cards/fmikaelian/camembert-base-fquad/README.md
@@ -11,7 +11,7 @@ A baseline model for question-answering in french ([CamemBERT](https://camembert
 ## Training hyperparameters

 ```shell
-python3 ./examples/run_squad.py \
+python3 ./examples/question-answering/run_squad.py \
 --model_type camembert \
 --model_name_or_path camembert-base \
 --do_train \
--- a/model_cards/fmikaelian/camembert-base-squad/README.md
+++ b/model_cards/fmikaelian/camembert-base-squad/README.md
@@ -11,7 +11,7 @@ A baseline model for question-answering in french ([CamemBERT](https://camembert
 ## Training hyperparameters

 ```shell
-python3 ./examples/run_squad.py \
+python3 ./examples/question-answering/run_squad.py \
 --model_type camembert \
 --model_name_or_path camembert-base \
 --do_train \
--- a/model_cards/fmikaelian/flaubert-base-uncased-squad/README.md
+++ b/model_cards/fmikaelian/flaubert-base-uncased-squad/README.md
@@ -11,7 +11,7 @@ A baseline model for question-answering in french ([flaubert](https://github.com
 ## Training hyperparameters

 ```shell
-python3 ./examples/run_squad.py \
+python3 ./examples/question-answering/run_squad.py \
 --model_type flaubert \
 --model_name_or_path flaubert-base-uncased \
 --do_train \
--- a/model_cards/google/electra-base-generator/README.md
+++ b/model_cards/google/electra-base-generator/README.md
@@ -25,7 +25,7 @@ fill_mask = pipeline(
 )

 print(
-	fill_mask(f"HuggingFace is creating a {nlp.tokenizer.mask_token} that the community uses to solve NLP tasks.")
+	fill_mask(f"HuggingFace is creating a {fill_mask.tokenizer.mask_token} that the community uses to solve NLP tasks.")
 )

 ```
--- a/model_cards/google/reformer-enwik8/README.md
+++ b/model_cards/google/reformer-enwik8/README.md
@@ -0,0 +1,57 @@
+## Reformer character-level language model trained on enwik8
+
+*enwik8* is a dataset based on Wikipedia and is often used to measure the model's ability to *compress* data, *e.g.* in 
+the scope of the *Hutter prize*: https://en.wikipedia.org/wiki/Hutter_Prize.
+
+`reformer-enwik8` was pretrained on the first 90M chars of *enwik8*, where the text was chunked into batches of size 65536 chars (= 2^16).
+The model's weights were taken from https://console.cloud.google.com/storage/browser/trax-ml/reformer/enwik8 and converted 
+to Hugging Face's PyTorch ReformerLM model `ReformerModelWithLMHead`.
+
+The model is a language model that operates on characters. 
+Therefore, this model does not need a tokenizer. The following function can instead be used for **encoding** and **decoding**:
+
+```python
+import torch
+
+# Encoding
+def encode(list_of_strings, pad_to_max_length=True, pad_token_id=0):
+    max_length = max([len(string) for string in list_of_strings])
+
+    # create empty tensors
+    attention_masks = torch.zeros((len(list_of_strings), max_length), dtype=torch.long)
+    input_ids = torch.full((len(list_of_strings), max_length), pad_token_id, dtype=torch.long)
+
+    for idx, string in enumerate(list_of_strings):
+        # make sure string is in byte format
+        if not isinstance(string, bytes):
+            string = str.encode(string)
+
+        input_ids[idx, :len(string)] = torch.tensor([x + 2 for x in string])
+        attention_masks[idx, :len(string)] = 1
+
+    return input_ids, attention_masks
+    
+# Decoding
+def decode(outputs_ids):
+    decoded_outputs = []
+    for output_ids in outputs_ids.tolist():
+        # transform ids back to chars; IDs < 2 are simply transformed to ""
+        decoded_outputs.append("".join([chr(x - 2) if x > 1 else "" for x in output_ids]))
+    return decoded_outputs
+```
+
+Text can be generated as follows:
+
+```python
+from transformers import ReformerModelWithLMHead
+
+model = ReformerModelWithLMHead.from_pretrained("google/reformer-enwik8")
+encoded, attention_masks = encode(["In 1965, Brooks left IBM to found the Department of"])
+decode(model.generate(encoded, do_sample=True, max_length=150))
+
+# gives:
+# In 1965, Brooks left IBM to found the Department of Journalism in 1968. IBM had jurisdiction himself in 1980, while Brooks resolved, nevertheless thro
+
+```
+
+***Note***: Language generation using `ReformerModelWithLMHead` is not optimized yet and is rather slow.
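
The byte-offset scheme used by `encode` and `decode` in the card above can be checked without `torch`. This is a minimal sketch in plain Python lists, assuming (as the card's `x + 2` shift suggests) that ids 0 and 1 are reserved and id 0 is used for padding; the helper names `encode_text`, `pad_batch` and `decode_ids` are illustrative and not part of the library.

```python
PAD_ID = 0   # assumption: padding id, matching the card's pad_token_id=0 default
OFFSET = 2   # assumption: ids 0 and 1 are reserved, matching the card's `x + 2` shift


def encode_text(text):
    """Map each UTF-8 byte of the text to an id shifted by OFFSET."""
    return [byte + OFFSET for byte in text.encode("utf-8")]


def pad_batch(batch):
    """Right-pad every id list to the longest one and build the matching attention mask."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [PAD_ID] * (max_len - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return input_ids, attention_mask


def decode_ids(ids):
    """Drop reserved ids (< OFFSET) and map the rest back to characters."""
    return "".join(chr(i - OFFSET) for i in ids if i >= OFFSET)


batch = [encode_text("IBM"), encode_text("In 1965")]
input_ids, attention_mask = pad_batch(batch)
```

Note that decoding with `chr` maps each byte to one code point, so the round trip is exact for ASCII text only; a multi-byte UTF-8 character would come back as several separate characters.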
--- a/model_cards/jplu/tf-xlm-r-ner-40-lang/README.md
+++ b/model_cards/jplu/tf-xlm-r-ner-40-lang/README.md
@@ -1,3 +1,4 @@
+
 # XLM-R + NER

 This model is a fine-tuned [XLM-Roberta-base](https://arxiv.org/abs/1911.02116) over the 40 languages proposed in [XTREME](https://github.com/google-research/xtreme) from [Wikiann](https://aclweb.org/anthology/P17-1178). This is still ongoing work, and the results will be updated every time an improvement is reached. 
@@ -12,6 +13,7 @@ O

 ## Metrics on evaluation set:
 ### Average over the 40 languages
+Number of documents: 262300
 ```
           precision    recall  f1-score   support

@@ -24,6 +26,7 @@ macro avg       0.86      0.87      0.87    333298
 ```

 ### Afrikaans
+Number of documents: 1000
 ```
           precision    recall  f1-score   support

@@ -36,6 +39,7 @@ macro avg       0.87      0.91      0.89      1469
 ``` 

 ### Arabic
+Number of documents: 10000
 ```
           precision    recall  f1-score   support

@@ -48,6 +52,7 @@ macro avg       0.87      0.88      0.88     10754
 ```

 ### Basque
+Number of documents: 10000
 ```
           precision    recall  f1-score   support

@@ -60,6 +65,7 @@ macro avg       0.89      0.89      0.89     12954
 ```

 ### Bengali
+Number of documents: 1000
 ```
           precision    recall  f1-score   support

@@ -72,6 +78,7 @@ macro avg       0.91      0.92      0.91      1095
 ```

 ### Bulgarian
+Number of documents: 1000
 ```
           precision    recall  f1-score   support

@@ -84,6 +91,7 @@ macro avg       0.91      0.92      0.91     14116
 ```

 ### Burmese
+Number of documents: 100
 ```
           precision    recall  f1-score   support

@@ -96,6 +104,7 @@ macro avg       0.57      0.65      0.60       103
 ```

 ### Chinese
+Number of documents: 10000
 ```
           precision    recall  f1-score   support

@@ -108,6 +117,7 @@ macro avg       0.76      0.78      0.77     11558
 ```

 ### Dutch
+Number of documents: 10000
 ```
           precision    recall  f1-score   support

@@ -120,6 +130,7 @@ macro avg       0.91      0.92      0.91     13120
 ```

 ### English
+Number of documents: 10000
 ```
           precision    recall  f1-score   support

@@ -132,6 +143,7 @@ macro avg       0.82      0.83      0.83     13973
 ```

 ### Estonian
+Number of documents: 10000
 ```
           precision    recall  f1-score   support

@@ -144,6 +156,7 @@ macro avg       0.90      0.91      0.90     13558
 ```

 ### Finnish
+Number of documents: 10000
 ```
           precision    recall  f1-score   support

@@ -156,6 +169,7 @@ macro avg       0.89      0.89      0.89     13930
 ```

 ### French
+Number of documents: 10000
 ```
           precision    recall  f1-score   support

@@ -168,6 +182,7 @@ macro avg       0.89      0.90      0.90     12933
 ```

 ### Georgian
+Number of documents: 10000
 ```
           precision    recall  f1-score   support

@@ -180,6 +195,7 @@ macro avg       0.84      0.86      0.85     12615
 ```

 ### German
+Number of documents: 10000
 ```
           precision    recall  f1-score   support

@@ -192,6 +208,7 @@ macro avg       0.86      0.86      0.86     13638
 ```

 ### Greek
+Number of documents: 10000
 ```
           precision    recall  f1-score   support

@@ -204,6 +221,7 @@ macro avg       0.88      0.90      0.89     12101
 ```

 ### Hebrew
+Number of documents: 10000
 ```
           precision    recall  f1-score   support

@@ -216,6 +234,7 @@ macro avg       0.82      0.83      0.83     12934
 ```

 ### Hindi
+Number of documents: 1000
 ```
           precision    recall  f1-score   support

@@ -228,6 +247,7 @@ macro avg       0.84      0.87      0.85      1211
 ```

 ### Hungarian
+Number of documents: 10000
 ```
           precision    recall  f1-score   support

@@ -240,6 +260,7 @@ macro avg       0.91      0.92      0.91     13879
 ```

 ### Indonesian
+Number of documents: 10000
 ```
           precision    recall  f1-score   support

@@ -252,6 +273,7 @@ macro avg       0.91      0.92      0.92     11376
 ```

 ### Italian
+Number of documents: 10000
 ```
           precision    recall  f1-score   support

@@ -264,6 +286,7 @@ macro avg       0.90      0.90      0.90     13412
 ```

 ### Japanese
+Number of documents: 10000
 ```
           precision    recall  f1-score   support

@@ -276,6 +299,7 @@ macro avg       0.69      0.72      0.70     12277
 ```

 ### Javanese
+Number of documents: 100
 ```
           precision    recall  f1-score   support

@@ -288,6 +312,7 @@ macro avg       0.78      0.82      0.80       112
 ```

 ### Kazakh
+Number of documents: 1000
 ```
           precision    recall  f1-score   support

@@ -300,6 +325,7 @@ macro avg       0.81      0.83      0.81      1135
 ```

 ### Korean
+Number of documents: 10000
 ```
           precision    recall  f1-score   support

@@ -312,6 +338,7 @@ macro avg       0.83      0.83      0.83     13329
 ```

 ### Malay
+Number of documents: 1000
 ```
           precision    recall  f1-score   support

@@ -324,6 +351,7 @@ macro avg       0.91      0.92      0.91      1088
 ```

 ### Malayalam
+Number of documents: 1000
 ```
           precision    recall  f1-score   support

@@ -336,6 +364,7 @@ macro avg       0.78      0.80      0.79      1155
 ```

 ### Marathi
+Number of documents: 1000
 ```
           precision    recall  f1-score   support

@@ -348,6 +377,7 @@ macro avg       0.85      0.86      0.85      1190
 ```

 ### Persian
+Number of documents: 10000
 ```
           precision    recall  f1-score   support

@@ -360,6 +390,7 @@ macro avg       0.92      0.92      0.92     10494
 ```

 ### Portuguese
+Number of documents: 10000
 ```
           precision    recall  f1-score   support

@@ -372,6 +403,7 @@ macro avg       0.90      0.91      0.90     12673
 ```

 ### Russian
+Number of documents: 10000
 ```
           precision    recall  f1-score   support

@@ -384,6 +416,7 @@ macro avg       0.87      0.88      0.88     12051
 ```

 ### Spanish
+Number of documents: 10000
 ```
           precision    recall  f1-score   support

@@ -396,6 +429,7 @@ macro avg       0.90      0.91      0.90     12153
 ```

 ### Swahili
+Number of documents: 1000
 ```
           precision    recall  f1-score   support

@@ -408,6 +442,7 @@ macro avg       0.88      0.89      0.88      1202
 ```

 ### Tagalog
+Number of documents: 1000
 ```
           precision    recall  f1-score   support

@@ -420,6 +455,7 @@ macro avg       0.90      0.92      0.91      1027
 ```

 ### Tamil
+Number of documents: 1000
 ```
           precision    recall  f1-score   support

@@ -432,6 +468,7 @@ macro avg       0.82      0.83      0.82      1183
 ```

 ### Telugu
+Number of documents: 1000
 ```
           precision    recall  f1-score   support

@@ -444,6 +481,7 @@ macro avg       0.73      0.77      0.75      1193
 ```

 ### Thai
+Number of documents: 10000
 ```
           precision    recall  f1-score   support

@@ -456,6 +494,7 @@ macro avg       0.68      0.74      0.71     14722
 ```

 ### Turkish
+Number of documents: 10000
 ```
           precision    recall  f1-score   support

@@ -468,6 +507,7 @@ macro avg       0.91      0.92      0.91     13360
 ```

 ### Urdu
+Number of documents: 1000
 ```
           precision    recall  f1-score   support

@@ -480,6 +520,7 @@ macro avg       0.92      0.94      0.93      1011
 ```

 ### Vietnamese
+Number of documents: 10000
 ```
           precision    recall  f1-score   support

@@ -492,6 +533,7 @@ macro avg       0.89      0.90      0.90     11107
 ```

 ### Yoruba
+Number of documents: 100
 ```
           precision    recall  f1-score   support

@@ -504,7 +546,7 @@ macro avg       0.63      0.68      0.63       107
 ```

 ## Reproduce the results
-Download and prepare the dataset from the [[https://github.com/google-research/xtreme#download-the-data](https://github.com/google-research/xtreme#download-the-data)](XTREME repo). Next, from the root of the transformers repo run:
+Download and prepare the dataset from the [XTREME repo](https://github.com/google-research/xtreme#download-the-data). Next, from the root of the transformers repo run:
 ```
 cd examples/ner
 python run_tf_ner.py \
@@ -533,8 +575,9 @@ nlp_ner = pipeline(
    model="jplu/tf-xlm-r-ner-40-lang",
    tokenizer=(
        'jplu/tf-xlm-r-ner-40-lang',  
-        {"use_fast": True}
-))
+        {"use_fast": True}),
+    framework="tf"
+)

 text_fr = "Barack Obama est né à Hawaï."
 text_en = "Barack Obama was born in Hawaii."
@@ -553,4 +596,4 @@ nlp_ner(test_zh)
 nlp_ner(test_ar)
 #Output: [{'word': '▁با', 'score': 0.9903655648231506, 'entity': 'PER'}, {'word': 'راك', 'score': 0.9850614666938782, 'entity': 'PER'}, {'word': '▁أوباما', 'score': 0.9850308299064636, 'entity': 'PER'}, {'word': '▁ها', 'score': 0.9477543234825134, 'entity': 'LOC'}, {'word': 'وا', 'score': 0.9428229928016663, 'entity': 'LOC'}, {'word': 'ي', 'score': 0.9319471716880798, 'entity': 'LOC'}]

-```
+```
--- a/model_cards/ktrapeznikov/albert-xlarge-v2-squad-v2/README.md
+++ b/model_cards/ktrapeznikov/albert-xlarge-v2-squad-v2/README.md
@@ -1,5 +1,5 @@
 ### Model
-**[`albert-xlarge-v2`](https://huggingface.co/albert-xlarge-v2)** fine-tuned on **[`SQuAD V2`](https://rajpurkar.github.io/SQuAD-explorer/)** using **[`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/run_squad.py)**
+**[`albert-xlarge-v2`](https://huggingface.co/albert-xlarge-v2)** fine-tuned on **[`SQuAD V2`](https://rajpurkar.github.io/SQuAD-explorer/)** using **[`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py)**

 ### Training Parameters
 Trained on 4 NVIDIA GeForce RTX 2080 Ti 11Gb
--- a/model_cards/ktrapeznikov/biobert_v1.1_pubmed_squad_v2/README.md
+++ b/model_cards/ktrapeznikov/biobert_v1.1_pubmed_squad_v2/README.md
@@ -1,5 +1,5 @@
 ### Model
-**[`monologg/biobert_v1.1_pubmed`](https://huggingface.co/monologg/biobert_v1.1_pubmed)** fine-tuned on **[`SQuAD V2`](https://rajpurkar.github.io/SQuAD-explorer/)** using **[`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/run_squad.py)**
+**[`monologg/biobert_v1.1_pubmed`](https://huggingface.co/monologg/biobert_v1.1_pubmed)** fine-tuned on **[`SQuAD V2`](https://rajpurkar.github.io/SQuAD-explorer/)** using **[`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py)**

 This model is cased.

--- a/model_cards/ktrapeznikov/scibert_scivocab_uncased_squad_v2/README.md
+++ b/model_cards/ktrapeznikov/scibert_scivocab_uncased_squad_v2/README.md
@@ -1,5 +1,5 @@
 ### Model
-**[`allenai/scibert_scivocab_uncased`](https://huggingface.co/allenai/scibert_scivocab_uncased)** fine-tuned on **[`SQuAD V2`](https://rajpurkar.github.io/SQuAD-explorer/)** using **[`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/run_squad.py)**
+**[`allenai/scibert_scivocab_uncased`](https://huggingface.co/allenai/scibert_scivocab_uncased)** fine-tuned on **[`SQuAD V2`](https://rajpurkar.github.io/SQuAD-explorer/)** using **[`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py)**

 ### Training Parameters
 Trained on 4 NVIDIA GeForce RTX 2080 Ti 11Gb
--- a/model_cards/lserinol/bert-turkish-question-answering/README.md
+++ b/model_cards/lserinol/bert-turkish-question-answering/README.md
@@ -0,0 +1,61 @@
+---
+language: turkish
+---
+
+# bert-turkish-question-answering
+
+## Usage
+
+```python
+from transformers import pipeline
+nlp = pipeline('question-answering', model='lserinol/bert-turkish-question-answering', tokenizer='lserinol/bert-turkish-question-answering')
+nlp({
+    'question': "Ankara'da kaç ilçe vardır?",
+    'context': r"""Türkiye'nin başkenti Ankara'dır. Ülkenin en büyük idari birimleri illerdir ve 81 il vardır. Bu iller ilçelere ayrılmıştır, toplamda 973 ilçe mevcuttur."""
+})
+```
+
+```python
+from transformers import AutoTokenizer, AutoModelForQuestionAnswering
+import torch
+
+tokenizer = AutoTokenizer.from_pretrained("lserinol/bert-turkish-question-answering")
+model = AutoModelForQuestionAnswering.from_pretrained("lserinol/bert-turkish-question-answering")
+text = r"""
+Ankara'nın başkent ilan edilmesinin ardından (13 Ekim 1923) şehir hızla gelişmiş ve Türkiye'nin ikinci en kalabalık ili olmuştur.
+Türkiye Cumhuriyeti'nin ilk yıllarında ekonomisi tarım ve hayvancılığa dayanan ilin topraklarının yarısı hâlâ tarım amaçlı 
+kullanılmaktadır. Ekonomik etkinlik büyük oranda ticaret ve sanayiye dayalıdır. Tarım ve hayvancılığın ağırlığı ise giderek 
+azalmaktadır. Ankara ve civarındaki gerek kamu sektörü gerek özel sektör yatırımları, başka illerden büyük bir nüfus göçünü 
+teşvik etmiştir. Cumhuriyetin kuruluşundan günümüze, nüfusu ülke nüfusunun iki katı hızda artmıştır. Nüfusun yaklaşık dörtte 
+üçü hizmet sektörü olarak tanımlanabilecek memuriyet, ulaşım, haberleşme ve ticaret benzeri işlerde, dörtte biri sanayide, 
+%2'si ise tarım alanında çalışır. Sanayi, özellikle tekstil, gıda ve inşaat sektörlerinde yoğunlaşmıştır. Günümüzde ise en çok 
+savunma, metal ve motor sektörlerinde yatırım yapılmaktadır. Türkiye'nin en çok sayıda üniversiteye sahip ili olan Ankara'da 
+ayrıca, üniversite diplomalı kişi oranı ülke ortalamasının iki katıdır. Bu eğitimli nüfus, teknoloji ağırlıklı yatırımların 
+gereksinim duyduğu iş gücünü oluşturur. Ankara'dan otoyollar, demir yolu ve hava yoluyla Türkiye'nin diğer şehirlerine ulaşılır.
+Ankara aynı zamanda başkent olarak Türkiye Büyük Millet Meclisi (TBMM)'ye de ev sahipliği yapmaktadır.
+"""
+
+questions = [
+    "Ankara kaç yılında başkent oldu?",
+    "Ankara ne zaman başkent oldu?",
+    "Ankara'dan başka şehirlere nasıl ulaşılır?",
+    "TBMM neyin kısaltmasıdır?"
+]
+
+for question in questions:
+    inputs = tokenizer.encode_plus(question, text, add_special_tokens=True, return_tensors="pt")
+    input_ids = inputs["input_ids"].tolist()[0]
+
+    text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
+    answer_start_scores, answer_end_scores = model(**inputs)
+
+    answer_start = torch.argmax(
+        answer_start_scores
+    )  # Get the most likely beginning of answer with the argmax of the score
+    answer_end = torch.argmax(answer_end_scores) + 1  # Get the most likely end of answer with the argmax of the score
+
+    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
+
+    print(f"Question: {question}")
+    print(f"Answer: {answer}\n")
+  ```
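The loop in the card above takes the argmax of the start scores and of the end scores independently, so in principle the end position can land before the start. A minimal sketch of one common alternative, picking the highest-scoring pair with start at or before end, is given below in plain Python. The function name `best_span`, the `max_len` cap, and the toy scores are illustrative assumptions; they are not part of the model card or of the `transformers` API.

```python
from typing import List, Tuple


def best_span(start_scores: List[float], end_scores: List[float], max_len: int = 30) -> Tuple[int, int]:
    """Return (start, end_exclusive) maximizing start score + end score with start <= end."""
    best = (0, 1)
    best_score = float("-inf")
    for s, s_score in enumerate(start_scores):
        # Only consider end positions at or after the start, at most max_len tokens long
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score = score
                best = (s, e + 1)  # exclusive end, matching the slice used in the card
    return best


# Toy scores (hypothetical): independent argmax would pick start=3 and end=1
start_scores = [0.1, 0.2, 0.3, 2.0, 0.1]
end_scores = [0.0, 1.5, 0.2, 0.4, 1.0]
span = best_span(start_scores, end_scores)
print(span)
```

The returned pair can be used directly as `input_ids[start:end]`, as in the card's loop.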
--- a/model_cards/mrm8488/GPT-2-finetuned-CORD19/README.md
+++ b/model_cards/mrm8488/GPT-2-finetuned-CORD19/README.md
@@ -40,7 +40,7 @@ python run_language_modeling.py \

 ## Model in action / Example of usage ✒

-You can get the following script [here](https://github.com/huggingface/transformers/blob/master/examples/run_generation.py)
+You can get the following script [here](https://github.com/huggingface/transformers/blob/master/examples/text-generation/run_generation.py)

 ```bash
 python run_generation.py \
--- a/model_cards/mrm8488/GPT-2-finetuned-covid-bio-medrxiv/README.md
+++ b/model_cards/mrm8488/GPT-2-finetuned-covid-bio-medrxiv/README.md
@@ -37,7 +37,7 @@ python run_language_modeling.py \

 ## Model in action / Example of usage: ✒

-You can get the following script [here](https://github.com/huggingface/transformers/blob/master/examples/run_generation.py)
+You can get the following script [here](https://github.com/huggingface/transformers/blob/master/examples/text-generation/run_generation.py)

 ```bash
 python run_generation.py \
--- a/model_cards/mrm8488/TinyBERT-spanish-uncased-finetuned-ner/README.md
+++ b/model_cards/mrm8488/TinyBERT-spanish-uncased-finetuned-ner/README.md
@@ -19,7 +19,7 @@ I preprocessed the dataset and splitted it as train / dev (80/20)
 | Dev                    | 2.2 K |


- [Fine-tune on NER script provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/run_ner.py)
+- [Fine-tune on NER script provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner.py)

 - Labels covered:

--- a/model_cards/mrm8488/bert-base-spanish-wwm-cased-finetuned-spa-squad2-es/README.md
+++ b/model_cards/mrm8488/bert-base-spanish-wwm-cased-finetuned-spa-squad2-es/README.md
@@ -29,7 +29,7 @@ The model was trained on a Tesla P100 GPU and 25GB of RAM with the following com

 ```bash
 export SQUAD_DIR=path/to/nl_squad
-python transformers/examples/run_squad.py \
+python transformers/examples/question-answering/run_squad.py \
  --model_type bert \
  --model_name_or_path dccuchile/bert-base-spanish-wwm-cased \
  --do_train \
--- a/model_cards/mrm8488/bert-medium-finetuned-squadv2/README.md
+++ b/model_cards/mrm8488/bert-medium-finetuned-squadv2/README.md
@@ -29,7 +29,7 @@ The smaller BERT models are intended for environments with restricted computatio
 ## Model training

 The model was trained on a Tesla P100 GPU and 25GB of RAM.
-The script for fine tuning can be found [here](https://github.com/huggingface/transformers/blob/master/examples/run_squad.py)
+The script for fine tuning can be found [here](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py)

 ## Results:

--- a/model_cards/mrm8488/bert-mini-finetuned-squadv2/README.md
+++ b/model_cards/mrm8488/bert-mini-finetuned-squadv2/README.md
@@ -29,7 +29,7 @@ The smaller BERT models are intended for environments with restricted computatio
 ## Model training

 The model was trained on a Tesla P100 GPU and 25GB of RAM.
-The script for fine tuning can be found [here](https://github.com/huggingface/transformers/blob/master/examples/run_squad.py)
+The script for fine tuning can be found [here](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py)

 ## Results:

--- a/model_cards/mrm8488/bert-small-finetuned-squadv2/README.md
+++ b/model_cards/mrm8488/bert-small-finetuned-squadv2/README.md
@@ -29,7 +29,7 @@ The smaller BERT models are intended for environments with restricted computatio
 ## Model training

 The model was trained on a Tesla P100 GPU and 25GB of RAM.
-The script for fine tuning can be found [here](https://github.com/huggingface/transformers/blob/master/examples/run_squad.py)
+The script for fine tuning can be found [here](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py)

 ## Results:

--- a/model_cards/mrm8488/bert-small-finetuned-typo-detection/README.md
+++ b/model_cards/mrm8488/bert-small-finetuned-typo-detection/README.md
@@ -11,7 +11,7 @@ thumbnail:

 - Dataset: [GitHub Typo Corpus](https://github.com/mhagiwara/github-typo-corpus) 📚

- [Fine-tune script on NER dataset provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/run_ner.py) 🏋️‍♂️
+- [Fine-tune script on NER dataset provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner.py) 🏋️‍♂️

 ## Metrics on test set 📋

--- a/model_cards/mrm8488/bert-spanish-cased-finetuned-ner/README.md
+++ b/model_cards/mrm8488/bert-spanish-cased-finetuned-ner/README.md
@@ -19,7 +19,7 @@ I preprocessed the dataset and splitted it as train / dev (80/20)
 | Dev                    | 2.2 K |


- [Fine-tune on NER script provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/run_ner.py)
+- [Fine-tune on NER script provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner.py)

 - Labels covered:

--- a/model_cards/mrm8488/bert-spanish-cased-finetuned-pos-syntax/README.md
+++ b/model_cards/mrm8488/bert-spanish-cased-finetuned-pos-syntax/README.md
@@ -11,7 +11,7 @@ This model is a fine-tuned version of the Spanish BERT [(BETO)](https://github.c

 - [Dataset: CONLL Corpora ES](https://www.kaggle.com/nltkdata/conll-corpora)

-#### [Fine-tune script on NER dataset provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/run_ner.py)
+#### [Fine-tune script on NER dataset provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner.py)

 #### 21 Syntax annotations (Labels) covered:

--- a/model_cards/mrm8488/bert-spanish-cased-finetuned-pos/README.md
+++ b/model_cards/mrm8488/bert-spanish-cased-finetuned-pos/README.md
@@ -19,7 +19,7 @@ I preprocessed the dataset and splitted it as train / dev (80/20)
 | Dev                    | 50 K |


- [Fine-tune on NER script provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/run_ner.py)
+- [Fine-tune on NER script provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner.py)

 - **60** Labels covered:

--- a/model_cards/mrm8488/bert-tiny-finetuned-squadv2/README.md
+++ b/model_cards/mrm8488/bert-tiny-finetuned-squadv2/README.md
@@ -29,7 +29,7 @@ The smaller BERT models are intended for environments with restricted computatio
 ## Model training

 The model was trained on a Tesla P100 GPU and 25GB of RAM.
-The script for fine tuning can be found [here](https://github.com/huggingface/transformers/blob/master/examples/run_squad.py)
+The script for fine tuning can be found [here](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py)

 ## Results:

--- a/model_cards/mrm8488/chEMBL_smiles_v1/README.md
+++ b/model_cards/mrm8488/chEMBL_smiles_v1/README.md
@@ -0,0 +1,77 @@
+# *De Novo* Drug Design with MLM
+
+## What is it?
+
+An approximation of [Generative Recurrent Networks for De Novo Drug Design](https://onlinelibrary.wiley.com/doi/full/10.1002/minf.201700111), but training an MLM (RoBERTa-like) from scratch instead of an RNN.
+
+## Why?
+
+As mentioned in the paper:
+Generative artificial intelligence models present a fresh approach to chemogenomics and de novo drug design, as they provide researchers with the ability to narrow down their search of the chemical space and focus on regions of interest.
+They used a generative *recurrent neural network (RNN)* containing long short‐term memory (LSTM) cell to capture the syntax of molecular representations in terms of SMILES strings.
+The learned pattern probabilities can be used for de novo SMILES generation. This molecular design concept **eliminates the need for virtual compound library enumeration** and **enables virtual compound design without requiring secondary or external activity prediction**.
+
+
+## My Goal 🎯
+
+By training an MLM from scratch on 438552 (cleaned*) SMILES strings, I wanted to build a model that learns these molecular patterns, so that given a partial SMILES string it can generate plausible completions that could be proposed as new drug candidates.
+By cleaned SMILES I mean that I used their [SMILES cleaning script](https://github.com/topazape/LSTM_Chem/blob/master/cleanup_smiles.py) to remove duplicates, salts, and stereochemical information.
+You can see the detailed process of gathering the data, preprocessing it, and training the LSTM in their [repo](https://github.com/topazape/LSTM_Chem).
+
+## Fast usage with ```pipelines``` 🧪
+
+```python
+from transformers import pipeline
+
+fill_mask = pipeline(
+    "fill-mask",
+    model='mrm8488/chEMBL_smiles_v1',
+    tokenizer='mrm8488/chEMBL_smiles_v1'
+)
+
+# CC(C)CN(CC(OP(=O)(O)O)C(Cc1ccccc1)NC(=O)OC1CCOC1)S(=O)(=O)c1ccc(N)cc1 Fosamprenavir
+smile1 = "CC(C)CN(CC(OP(=O)(O)O)C(Cc1ccccc1)NC(=O)OC1CCOC1)S(=O)(=O)c1ccc(N)<mask>"
+
+fill_mask(smile1)
+
+# Output:
+'''
+[{'score': 0.6040295958518982,
+  'sequence': '<s> CC(C)CN(CC(OP(=O)(O)O)C(Cc1ccccc1)NC(=O)OC1CCOC1)S(=O)(=O)c1ccc(N)nc</s>',
+  'token': 265},
+ {'score': 0.2185731679201126,
+  'sequence': '<s> CC(C)CN(CC(OP(=O)(O)O)C(Cc1ccccc1)NC(=O)OC1CCOC1)S(=O)(=O)c1ccc(N)N</s>',
+  'token': 50},
+ {'score': 0.0642734169960022,
+  'sequence': '<s> CC(C)CN(CC(OP(=O)(O)O)C(Cc1ccccc1)NC(=O)OC1CCOC1)S(=O)(=O)c1ccc(N)cc</s>',
+  'token': 261},
+ {'score': 0.01932266168296337,
+  'sequence': '<s> CC(C)CN(CC(OP(=O)(O)O)C(Cc1ccccc1)NC(=O)OC1CCOC1)S(=O)(=O)c1ccc(N)CCCl</s>',
+  'token': 452},
+ {'score': 0.005068355705589056,
+  'sequence': '<s> CC(C)CN(CC(OP(=O)(O)O)C(Cc1ccccc1)NC(=O)OC1CCOC1)S(=O)(=O)c1ccc(N)C</s>',
+  'token': 39}]
+'''
+```
+## More
+I also created a [second version](https://huggingface.co/mrm8488/chEMBL26_smiles_v2) without applying the SMILES cleaning script mentioned above. You can use it in the same way as this one.
+
+```python
+fill_mask = pipeline(
+    "fill-mask",
+    model='mrm8488/chEMBL26_smiles_v2',
+    tokenizer='mrm8488/chEMBL26_smiles_v2'
+)
+```
+
+ [Original paper](https://www.ncbi.nlm.nih.gov/pubmed/29095571) Authors:
+ <details>
+Swiss Federal Institute of Technology (ETH), Department of Chemistry and Applied Biosciences, Vladimir–Prelog–Weg 4, 8093, Zurich, Switzerland,
+Stanford University, Department of Computer Science, 450 Sierra Mall, Stanford, CA, 94305, USA,
+inSili.com GmbH, 8049, Zurich, Switzerland,
+Gisbert Schneider, Email: hc.zhte@trebsig.
+</details>
+  
+ > Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
+
+> Made with <span style="color: #e25555;">&hearts;</span> in Spain
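The `fill-mask` output shown in the card above is a list of dictionaries with `score`, `sequence`, and `token` keys, where each sequence is still wrapped in `<s>` and `</s>` markers. As a rough sketch of post-processing that output into plain SMILES candidates, the snippet below uses two entries copied from the card. The helper names and the `0.05` score threshold are assumptions made for illustration, and nothing here checks that a candidate is chemically valid.

```python
from typing import Dict, List

# Two entries copied from the example output in the card above
outputs: List[Dict] = [
    {"score": 0.6040295958518982,
     "sequence": "<s> CC(C)CN(CC(OP(=O)(O)O)C(Cc1ccccc1)NC(=O)OC1CCOC1)S(=O)(=O)c1ccc(N)nc</s>",
     "token": 265},
    {"score": 0.0642734169960022,
     "sequence": "<s> CC(C)CN(CC(OP(=O)(O)O)C(Cc1ccccc1)NC(=O)OC1CCOC1)S(=O)(=O)c1ccc(N)cc</s>",
     "token": 261},
]


def clean_sequence(sequence: str) -> str:
    """Strip the <s> and </s> markers and any surrounding whitespace."""
    return sequence.replace("<s>", "").replace("</s>", "").strip()


def candidates_above(results: List[Dict], threshold: float) -> List[str]:
    """Keep the cleaned sequences whose score is at least `threshold` (hypothetical cutoff)."""
    return [clean_sequence(r["sequence"]) for r in results if r["score"] >= threshold]


smiles = candidates_above(outputs, threshold=0.05)
for s in smiles:
    print(s)
```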
--- a/model_cards/mrm8488/distilbert-base-multi-cased-finetuned-typo-detection/README.md
+++ b/model_cards/mrm8488/distilbert-base-multi-cased-finetuned-typo-detection/README.md
@@ -11,7 +11,7 @@ thumbnail:

 - Dataset: [GitHub Typo Corpus](https://github.com/mhagiwara/github-typo-corpus) 📚 for 15 languages

- [Fine-tune script on NER dataset provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/run_ner.py) 🏋️‍♂️
+- [Fine-tune script on NER dataset provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner.py) 🏋️‍♂️

 ## Metrics on test set 📋

--- a/model_cards/mrm8488/distilbert-multi-finetuned-for-xqua-on-tydiqa/README.md
+++ b/model_cards/mrm8488/distilbert-multi-finetuned-for-xqua-on-tydiqa/README.md
@@ -31,7 +31,7 @@ The model was fine-tuned on a Tesla P100 GPU and 25GB of RAM.
 The script is the following:

 ```python
-python transformers/examples/run_squad.py \
+python transformers/examples/question-answering/run_squad.py \
  --model_type distilbert \
  --model_name_or_path distilbert-base-multilingual-cased \
  --do_train \
--- a/model_cards/mrm8488/electricidad-small-discriminator/README.md
+++ b/model_cards/mrm8488/electricidad-small-discriminator/README.md
@@ -0,0 +1,67 @@
+---
+language: spanish
+thumbnail: https://i.imgur.com/uxAvBfh.png
+
+
+---
+
+## ELECTRICIDAD: The Spanish Electra [Imgur](https://imgur.com/uxAvBfh)
+
+**ELECTRICIDAD** is a small ELECTRA-like model (the discriminator in this case) trained on more than 20 GB of the [OSCAR](https://oscar-corpus.com/) Spanish corpus.
+
+As mentioned in the original [paper](https://openreview.net/pdf?id=r1xMH1BtvB):
+**ELECTRA** is a new method for self-supervised language representation learning. It can be used to pre-train transformer networks using relatively little compute. ELECTRA models are trained to distinguish "real" input tokens vs "fake" input tokens generated by another neural network, similar to the discriminator of a [GAN](https://arxiv.org/pdf/1406.2661.pdf). At small scale, ELECTRA achieves strong results even when trained on a single GPU. At large scale, ELECTRA achieves state-of-the-art results on the [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/) dataset.
+
+For a detailed description and experimental results, please refer to the paper [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://openreview.net/pdf?id=r1xMH1BtvB).
+
+## Model details ⚙
+
+|Param| # Value|
+|-----|--------|
+|Layers|	12   |
+|Hidden |256 	|
+|Params| 14M|
+
+## Evaluation metrics (for discriminator) 🧾
+
+|Metric | # Score |
+|-------|---------|
+|Accuracy| 0.94|
+|Precision| 0.76|
+|AUC | 0.92|
+
+## Benchmarks 🔨
+
+WIP 🚧
+
+## How to use the discriminator in `transformers`
+
+```python
+from transformers import ElectraForPreTraining, ElectraTokenizerFast
+import torch
+
+discriminator = ElectraForPreTraining.from_pretrained("mrm8488/electricidad-small-discriminator")
+tokenizer = ElectraTokenizerFast.from_pretrained("mrm8488/electricidad-small-discriminator")
+
+sentence = "El rápido zorro marrón salta sobre el perro perezoso"
+fake_sentence = "El rápido zorro marrón falsea sobre el perro perezoso"
+
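+# Tokenize the corrupted sentence so the discriminator can flag the replaced token
+fake_tokens = tokenizer.tokenize(fake_sentence)
+fake_inputs = tokenizer.encode(fake_sentence, return_tensors="pt")
+discriminator_outputs = discriminator(fake_inputs)
+# Map each logit to 1.0 (predicted replaced) or 0.0 (predicted original)
+predictions = torch.round((torch.sign(discriminator_outputs[0]) + 1) / 2)
+
+print("".join("%7s" % token for token in fake_tokens))
+
+# Drop the [CLS] and [SEP] positions so the predictions line up with fake_tokens
+print("".join("%7s" % int(prediction) for prediction in predictions.squeeze().tolist()[1:-1]))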
+```
+
+## Acknowledgments
+
+I thank the [🤗/transformers team](https://github.com/huggingface/transformers) for answering my questions and Google for supporting me through the [TensorFlow Research Cloud](https://www.tensorflow.org/tfrc) program.
+
+
+
+> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
+
+> Made with <span style="color: #e25555;">&hearts;</span> in Spain
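The expression `torch.round((torch.sign(x) + 1) / 2)` in the card above turns each discriminator logit into a 0/1 flag. The sketch below reproduces that mapping in plain Python on made-up logits, assuming (as in the ELECTRA setup the card describes) that a positive logit means "replaced" and a negative one means "original". The function names and the toy values are hypothetical.

```python
from typing import List


def sign(x: float) -> int:
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def logits_to_flags(logits: List[float]) -> List[int]:
    """1 = predicted replaced ("fake") token, 0 = predicted original token."""
    # (sign + 1) / 2 maps -1 -> 0.0 and 1 -> 1.0; a logit of exactly 0 gives 0.5,
    # which round() sends to 0 because Python rounds halves to the nearest even number
    return [int(round((sign(x) + 1) / 2)) for x in logits]


toy_logits = [-4.2, -3.1, 2.7, -0.5]  # made-up values, one per token
flags = logits_to_flags(toy_logits)
print(flags)
```

In effect the mapping is the same as testing `x > 0` for each logit.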
--- a/model_cards/mrm8488/spanbert-finetuned-squadv1/README.md
+++ b/model_cards/mrm8488/spanbert-finetuned-squadv1/README.md
@@ -26,7 +26,7 @@ thumbnail:
 ## Model training

 The model was trained on a Tesla P100 GPU and 25GB of RAM.
-The script for fine tuning can be found [here](https://github.com/huggingface/transformers/blob/master/examples/run_squad.py)
+The script for fine tuning can be found [here](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py)

 ## Results:

--- a/model_cards/mrm8488/spanbert-finetuned-squadv2/README.md
+++ b/model_cards/mrm8488/spanbert-finetuned-squadv2/README.md
@@ -23,7 +23,7 @@ thumbnail:
 ## Model training

 The model was trained on a Tesla P100 GPU and 25GB of RAM.
-The script for fine tuning can be found [here](https://github.com/huggingface/transformers/blob/master/examples/run_squad.py)
+The script for fine tuning can be found [here](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py)

 ## Results:

--- a/model_cards/savasy/bert-base-turkish-ner-cased/README.md
+++ b/model_cards/savasy/bert-base-turkish-ner-cased/README.md
@@ -0,0 +1,90 @@
+
+# An easy-to-use NER application for the Turkish language
+**An easy Python NER (Named Entity Recognition) model for Turkish (BERT + transfer learning).**
+
+
+Thanks to @stefan-it, I applied the following steps for training:
+
+
+```
+cd tr-data
+
+for file in train.txt dev.txt test.txt labels.txt
+do
+  wget https://schweter.eu/storage/turkish-bert-wikiann/$file
+done
+
+cd ..
+```
+This downloads the pre-processed dataset with training, dev, and test splits and puts the files in the `tr-data` folder.
+
+# Run fine-tuning
+After downloading the dataset, fine-tuning can be started. First set the following environment variables:
+```
+export MAX_LENGTH=128
+export BERT_MODEL=dbmdz/bert-base-turkish-cased 
+export OUTPUT_DIR=tr-new-model
+export BATCH_SIZE=32
+export NUM_EPOCHS=3
+export SAVE_STEPS=625
+export SEED=1
+```
+Then run fine-tuning:
+```
+python3 run_ner.py --data_dir ./tr-data3 \
+--model_type bert \
+--labels ./tr-data/labels.txt \
+--model_name_or_path $BERT_MODEL \
+--output_dir $OUTPUT_DIR-$SEED \
+--max_seq_length $MAX_LENGTH \
+--num_train_epochs $NUM_EPOCHS \
+--per_gpu_train_batch_size $BATCH_SIZE \
+--save_steps $SAVE_STEPS \
+--seed $SEED \
+--do_train \
+--do_eval \
+--do_predict \
+--fp16
+```
+
+
+# Usage
+
+```
+from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer
+model = AutoModelForTokenClassification.from_pretrained("savasy/bert-base-turkish-ner-cased")
+tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-ner-cased")
+ner=pipeline('ner', model=model, tokenizer=tokenizer)
+ner("Mustafa Kemal Atatürk 19 Mayıs 1919'da Samsun'a ayak bastı.")
+```
+# Some results
+Data1: the dataset downloaded above.
+
+Eval Results:
+
+* precision = 0.916400580551524
+* recall = 0.9342309684101502
+* f1 = 0.9252298787412536
+* loss = 0.11335893666411284
+
+Test Results:
+* precision = 0.9192058759362955
+* recall = 0.9303010230367262
+* f1 = 0.9247201697271198
+* loss = 0.11182546521618497
+
+
+
+Data2:
+https://github.com/stefan-it/turkish-bert/files/4558187/nerdata.txt
+
+The performance on the data provided by @kemalaraz is as follows.
+
+Eval Results (`eval_results.txt`):
+* precision = 0.9461980692049029
+* recall = 0.959309358847465
+* f1 = 0.9527086063783312
+* loss = 0.037054269206847804
+
+Test Results (`test_results.txt`):
+* precision = 0.9458370635631155
+* recall = 0.9588201928530913
+* f1 = 0.952284378344882
+* loss = 0.035431676572445225
+
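The precision, recall, and F1 values reported in the card above can be cross-checked, since F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the reported numbers. The helper name `f1_score` and the dictionary layout are illustrative, and the training script's own metric code may be organized differently.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported f1) copied from the card above
reported = {
    "Data1 eval": (0.916400580551524, 0.9342309684101502, 0.9252298787412536),
    "Data1 test": (0.9192058759362955, 0.9303010230367262, 0.9247201697271198),
    "Data2 eval": (0.9461980692049029, 0.959309358847465, 0.9527086063783312),
    "Data2 test": (0.9458370635631155, 0.9588201928530913, 0.952284378344882),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: recomputed={f1_score(p, r):.6f} reported={f1:.6f}")
```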
--- a/model_cards/savasy/bert-base-turkish-sentiment-cased/README.md
+++ b/model_cards/savasy/bert-base-turkish-sentiment-cased/README.md
@@ -0,0 +1,146 @@
+# Bert-base Turkish Sentiment Model
+
+https://huggingface.co/savasy/bert-base-turkish-sentiment-cased
+
+This model is used for sentiment analysis. It is based on BERTurk, a BERT model for the Turkish language: https://huggingface.co/dbmdz/bert-base-turkish-cased
+
+
+# Dataset
+
+The dataset is a merge of the datasets from studies [2] and [3].
+
+* Study [2] gathered movie and product reviews. The products are books, DVDs, electronics, and kitchen items.
+The movie dataset is taken from a cinema web page (www.beyazperde.com) and has
+5331 positive and 5331 negative sentences. Reviews on the web page are rated on a
+scale from 0 to 5 by the users who wrote them. The study considered a review
+positive if its rating is greater than or equal to 4, and negative if it is less than
+or equal to 2. They also built a Turkish product review dataset from an online retailer's
+web page, constructing a benchmark dataset of reviews for several
+products (books, DVDs, etc.). Likewise, these reviews are rated from 1 to 5,
+and the majority class is 5. Each category has 700 positive and 700 negative
+reviews; the average rating is 2.27 for negative reviews and 4.5 for positive
+reviews. This dataset is also used in study [1].
+
+* Study [3] collected a tweet dataset. They proposed a new approach for automatically classifying the sentiment of microblog messages, based on robust feature representation and fusion.
+
+*Merged Dataset* 
+
+| *size*   | *data* |
+|--------|----|
+|   8000 |dev.tsv|
+|   8262 |test.tsv|
+|  32000 |train.tsv|
+|  *48290* |*total*|
+
+
+The dataset is used in the following papers:
+ 
+* [1] Yildirim, Savaş. (2020). Comparing Deep Neural Networks to Traditional Models for Sentiment Analysis in Turkish Language. 10.1007/978-981-15-1216-2_12.
+* [2] Demirtas, Erkin and Mykola Pechenizkiy. 2013. Cross-lingual polarity detection with machine translation. In Proceedings of the Second International Workshop on Issues of Sentiment Discovery and Opinion Mining (WISDOM ’13)
+* [3] Hayran, A., Sert, M. (2017), "Sentiment Analysis on Microblog Data based on Word Embedding and Fusion Techniques", IEEE 25th Signal Processing and Communications Applications Conference (SIU 2017), Belek, Turkey
+
+# Training
+
+```
+export GLUE_DIR="./sst-2-newall"
+export TASK_NAME=SST-2
+
+python3 run_glue.py \
+  --model_type bert \
+  --model_name_or_path dbmdz/bert-base-turkish-uncased \
+  --task_name "$TASK_NAME" \
+  --do_train \
+  --do_eval \
+  --data_dir "$GLUE_DIR" \
+  --max_seq_length 128 \
+  --per_gpu_train_batch_size 32 \
+  --learning_rate 2e-5 \
+  --num_train_epochs 3.0 \
+  --output_dir "./model"
+```
+
+
+
+
+# Results
+
+> 05/10/2020 17:00:43 - INFO - transformers.trainer -   ***** Running Evaluation *****
+
+> 05/10/2020 17:00:43 - INFO - transformers.trainer -     Num examples = 7999
+
+> 05/10/2020 17:00:43 - INFO - transformers.trainer -     Batch size = 8
+
+>Evaluation: 100% 1000/1000 [00:34<00:00, 29.04it/s]
+
+>05/10/2020 17:01:17 - INFO - __main__ -   ***** Eval results sst-2 *****
+
+>05/10/2020 17:01:17 - INFO - __main__ -     acc = 0.9539942492811602
+
+>05/10/2020 17:01:17 - INFO - __main__ -     loss = 0.16348013816401363
+
+
+Accuracy is about *95.4%*.
+
+# Code Usage
+
+```
+from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
+
+# Load the fine-tuned model and its tokenizer, then wrap them in a pipeline
+model = AutoModelForSequenceClassification.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
+tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
+sa = pipeline("sentiment-analysis", tokenizer=tokenizer, model=model)
+
+# Positive review: "these phone models are very high quality, I think every part is very special"
+p = sa("bu telefon modelleri çok kaliteli , her parçası çok özel bence")
+print(p)
+# [{'label': 'LABEL_1', 'score': 0.9871089}]
+print(p[0]["label"] == "LABEL_1")
+# True
+
+# Negative review: "The film was very bad and very fake"
+p = sa("Film çok kötü ve çok sahteydi")
+print(p)
+# [{'label': 'LABEL_0', 'score': 0.9975505}]
+print(p[0]["label"] == "LABEL_1")
+# False
+```
+
+# Test your data
+
+Suppose your file has many lines, each holding a comment followed by its label (1 or 0), separated by a tab:
+
+> comment1 ... \t label
+
+> comment2 ... \t label
+ 
+> ...
+
+
+
+```
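+from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
+
+f = "/path/to/your/file/yourfile.tsv"
+model = AutoModelForSequenceClassification.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
+tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
+sa = pipeline("sentiment-analysis", tokenizer=tokenizer, model=model)
+
+i, crr = 0, 0  # examples seen, correct predictions
+with open(f, encoding="utf-8") as fin:  # assumes the file is UTF-8 encoded
+    for line in fin:
+        fields = line.strip().split("\t")
+        if len(fields) != 2:
+            continue  # skip lines that are not "comment<TAB>label"
+        i += 1
+        if i % 100 == 0:
+            print(i)  # progress indicator
+        # "LABEL_1" -> "1", so it can be compared with the label column
+        pred = sa(fields[0])[0]["label"].split("_")[1]
+        if pred == fields[1]:
+            crr += 1
+
+# correct, total, accuracy (guarded against an empty file)
+print(crr, i, crr / i if i else 0.0)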
+```
+
+
+
+
+
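Note on the evaluation log in the Results section of the model card above: the reported figures are mutually consistent, which a few lines of arithmetic can confirm. The sketch below assumes only the numbers printed in the log (7999 examples, batch size 8, acc = 0.9539942492811602); the variable names are illustrative.

```python
import math

# Figures copied from the evaluation log above.
num_examples = 7999
batch_size = 8
acc = 0.9539942492811602

# The last batch is partial, so the number of evaluation steps is rounded up.
num_batches = math.ceil(num_examples / batch_size)

# Accuracy is correct / total, so the count of correct predictions can be recovered.
num_correct = round(acc * num_examples)

print(num_batches)  # 1000, matching "1000/1000" in the progress bar
print(num_correct)  # 7631
```

So the rounded "about 95.4%" corresponds to 7631 correct predictions out of 7999 dev examples.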
--- a/notebooks/03-pipelines.ipynb
+++ b/notebooks/03-pipelines.ipynb
@@ -30,7 +30,8 @@
    },
    "colab": {
      "name": "03-pipelines.ipynb",
-      "provenance": []
+      "provenance": [],
+      "include_colab_link": true
    },
    "widgets": {
      "application/vnd.jupyter.widget-state+json": {
@@ -1504,6 +1505,251 @@
            "left": null
          }
        },
+        "3c86415352574190b71e1fe5a15d36f1": {
+          "model_module": "@jupyter-widgets/controls",
+          "model_name": "HBoxModel",
+          "state": {
+            "_view_name": "HBoxView",
+            "_dom_classes": [],
+            "_model_name": "HBoxModel",
+            "_view_module": "@jupyter-widgets/controls",
+            "_model_module_version": "1.5.0",
+            "_view_count": null,
+            "_view_module_version": "1.5.0",
+            "box_style": "",
+            "layout": "IPY_MODEL_dd2c9dd935754cf2802233053554c21c",
+            "_model_module": "@jupyter-widgets/controls",
+            "children": [
+              "IPY_MODEL_8ae3be32d9c845e59fdb1c47884d48aa",
+              "IPY_MODEL_4dea0031f3554752ad5aad01fe516a60"
+            ]
+          }
+        },
+        "dd2c9dd935754cf2802233053554c21c": {
+          "model_module": "@jupyter-widgets/base",
+          "model_name": "LayoutModel",
+          "state": {
+            "_view_name": "LayoutView",
+            "grid_template_rows": null,
+            "right": null,
+            "justify_content": null,
+            "_view_module": "@jupyter-widgets/base",
+            "overflow": null,
+            "_model_module_version": "1.2.0",
+            "_view_count": null,
+            "flex_flow": null,
+            "width": null,
+            "min_width": null,
+            "border": null,
+            "align_items": null,
+            "bottom": null,
+            "_model_module": "@jupyter-widgets/base",
+            "top": null,
+            "grid_column": null,
+            "overflow_y": null,
+            "overflow_x": null,
+            "grid_auto_flow": null,
+            "grid_area": null,
+            "grid_template_columns": null,
+            "flex": null,
+            "_model_name": "LayoutModel",
+            "justify_items": null,
+            "grid_row": null,
+            "max_height": null,
+            "align_content": null,
+            "visibility": null,
+            "align_self": null,
+            "height": null,
+            "min_height": null,
+            "padding": null,
+            "grid_auto_rows": null,
+            "grid_gap": null,
+            "max_width": null,
+            "order": null,
+            "_view_module_version": "1.2.0",
+            "grid_template_areas": null,
+            "object_position": null,
+            "object_fit": null,
+            "grid_auto_columns": null,
+            "margin": null,
+            "display": null,
+            "left": null
+          }
+        },
+        "8ae3be32d9c845e59fdb1c47884d48aa": {
+          "model_module": "@jupyter-widgets/controls",
+          "model_name": "FloatProgressModel",
+          "state": {
+            "_view_name": "ProgressView",
+            "style": "IPY_MODEL_1efb96d931a446de92f1930b973ae846",
+            "_dom_classes": [],
+            "description": "Downloading: 100%",
+            "_model_name": "FloatProgressModel",
+            "bar_style": "success",
+            "max": 230,
+            "_view_module": "@jupyter-widgets/controls",
+            "_model_module_version": "1.5.0",
+            "value": 230,
+            "_view_count": null,
+            "_view_module_version": "1.5.0",
+            "orientation": "horizontal",
+            "min": 0,
+            "description_tooltip": null,
+            "_model_module": "@jupyter-widgets/controls",
+            "layout": "IPY_MODEL_6a4f5aab5ba949fd860b5a35bba7db9c"
+          }
+        },
+        "4dea0031f3554752ad5aad01fe516a60": {
+          "model_module": "@jupyter-widgets/controls",
+          "model_name": "HTMLModel",
+          "state": {
+            "_view_name": "HTMLView",
+            "style": "IPY_MODEL_4b02b2e964ad49af9f7ce7023131ceb8",
+            "_dom_classes": [],
+            "description": "",
+            "_model_name": "HTMLModel",
+            "placeholder": "",
+            "_view_module": "@jupyter-widgets/controls",
+            "_model_module_version": "1.5.0",
+            "value": " 230/230 [00:00&lt;00:00, 8.69kB/s]",
+            "_view_count": null,
+            "_view_module_version": "1.5.0",
+            "description_tooltip": null,
+            "_model_module": "@jupyter-widgets/controls",
+            "layout": "IPY_MODEL_0ae8a68c3668401da8d8a6d5ec9cac8f"
+          }
+        },
+        "1efb96d931a446de92f1930b973ae846": {
+          "model_module": "@jupyter-widgets/controls",
+          "model_name": "ProgressStyleModel",
+          "state": {
+            "_view_name": "StyleView",
+            "_model_name": "ProgressStyleModel",
+            "description_width": "initial",
+            "_view_module": "@jupyter-widgets/base",
+            "_model_module_version": "1.5.0",
+            "_view_count": null,
+            "_view_module_version": "1.2.0",
+            "bar_color": null,
+            "_model_module": "@jupyter-widgets/controls"
+          }
+        },
+        "6a4f5aab5ba949fd860b5a35bba7db9c": {
+          "model_module": "@jupyter-widgets/base",
+          "model_name": "LayoutModel",
+          "state": {
+            "_view_name": "LayoutView",
+            "grid_template_rows": null,
+            "right": null,
+            "justify_content": null,
+            "_view_module": "@jupyter-widgets/base",
+            "overflow": null,
+            "_model_module_version": "1.2.0",
+            "_view_count": null,
+            "flex_flow": null,
+            "width": null,
+            "min_width": null,
+            "border": null,
+            "align_items": null,
+            "bottom": null,
+            "_model_module": "@jupyter-widgets/base",
+            "top": null,
+            "grid_column": null,
+            "overflow_y": null,
+            "overflow_x": null,
+            "grid_auto_flow": null,
+            "grid_area": null,
+            "grid_template_columns": null,
+            "flex": null,
+            "_model_name": "LayoutModel",
+            "justify_items": null,
+            "grid_row": null,
+            "max_height": null,
+            "align_content": null,
+            "visibility": null,
+            "align_self": null,
+            "height": null,
+            "min_height": null,
+            "padding": null,
+            "grid_auto_rows": null,
+            "grid_gap": null,
+            "max_width": null,
+            "order": null,
+            "_view_module_version": "1.2.0",
+            "grid_template_areas": null,
+            "object_position": null,
+            "object_fit": null,
+            "grid_auto_columns": null,
+            "margin": null,
+            "display": null,
+            "left": null
+          }
+        },
+        "4b02b2e964ad49af9f7ce7023131ceb8": {
+          "model_module": "@jupyter-widgets/controls",
+          "model_name": "DescriptionStyleModel",
+          "state": {
+            "_view_name": "StyleView",
+            "_model_name": "DescriptionStyleModel",
+            "description_width": "",
+            "_view_module": "@jupyter-widgets/base",
+            "_model_module_version": "1.5.0",
+            "_view_count": null,
+            "_view_module_version": "1.2.0",
+            "_model_module": "@jupyter-widgets/controls"
+          }
+        },
+        "0ae8a68c3668401da8d8a6d5ec9cac8f": {
+          "model_module": "@jupyter-widgets/base",
+          "model_name": "LayoutModel",
+          "state": {
+            "_view_name": "LayoutView",
+            "grid_template_rows": null,
+            "right": null,
+            "justify_content": null,
+            "_view_module": "@jupyter-widgets/base",
+            "overflow": null,
+            "_model_module_version": "1.2.0",
+            "_view_count": null,
+            "flex_flow": null,
+            "width": null,
+            "min_width": null,
+            "border": null,
+            "align_items": null,
+            "bottom": null,
+            "_model_module": "@jupyter-widgets/base",
+            "top": null,
+            "grid_column": null,
+            "overflow_y": null,
+            "overflow_x": null,
+            "grid_auto_flow": null,
+            "grid_area": null,
+            "grid_template_columns": null,
+            "flex": null,
+            "_model_name": "LayoutModel",
+            "justify_items": null,
+            "grid_row": null,
+            "max_height": null,
+            "align_content": null,
+            "visibility": null,
+            "align_self": null,
+            "height": null,
+            "min_height": null,
+            "padding": null,
+            "grid_auto_rows": null,
+            "grid_gap": null,
+            "max_width": null,
+            "order": null,
+            "_view_module_version": "1.2.0",
+            "grid_template_areas": null,
+            "object_position": null,
+            "object_fit": null,
+            "grid_auto_columns": null,
+            "margin": null,
+            "display": null,
+            "left": null
+          }
+        },
        "fd44cf6ab17e4b768b2e1d5cb8ce5af9": {
          "model_module": "@jupyter-widgets/controls",
          "model_name": "HBoxModel",
@@ -2105,6 +2351,16 @@
    }
  },
  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "view-in-github",
+        "colab_type": "text"
+      },
+      "source": [
+        "<a href=\"https://colab.research.google.com/github/huggingface/transformers/blob/generation_pipeline_docs/notebooks/03-pipelines.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
+      ]
+    },
    {
      "cell_type": "markdown",
      "metadata": {
@@ -2170,13 +2426,29 @@
        },
        "id": "4maAknWNrl_N",
        "colab_type": "code",
-        "colab": {}
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 102
+        },
+        "outputId": "467e3cc8-a069-47da-8029-86e4142c7dde"
      },
      "source": [
        "!pip install -q transformers"
      ],
-      "execution_count": 0,
-      "outputs": []
+      "execution_count": 2,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "text": [
+            "\u001b[K     |████████████████████████████████| 645kB 4.4MB/s \n",
+            "\u001b[K     |████████████████████████████████| 3.8MB 11.7MB/s \n",
+            "\u001b[K     |████████████████████████████████| 890kB 51.5MB/s \n",
+            "\u001b[K     |████████████████████████████████| 1.0MB 46.0MB/s \n",
+            "\u001b[?25h  Building wheel for sacremoses (setup.py) ... \u001b[?25l\u001b[?25hdone\n"
+          ],
+          "name": "stdout"
+        }
+      ]
    },
    {
      "cell_type": "code",
@@ -2219,6 +2491,7 @@
        },
        "id": "AMRXHQw9rl_d",
        "colab_type": "code",
+        "outputId": "a7a10851-b71e-4553-9afc-04066120410d",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 83,
@@ -2232,14 +2505,13 @@
            "ad84da685cf44abb90d17d9d2e023b48",
            "a246f9eea2d7440cb979e728741d2e32"
          ]
-        },
-        "outputId": "a7a10851-b71e-4553-9afc-04066120410d"
+        }
      },
      "source": [
        "nlp_sentence_classif = pipeline('sentiment-analysis')\n",
        "nlp_sentence_classif('Such a nice weather outside !')"
      ],
-      "execution_count": 3,
+      "execution_count": 0,
      "outputs": [
        {
          "output_type": "display_data",
@@ -2300,6 +2572,7 @@
        },
        "id": "B3BDRX_Krl_n",
        "colab_type": "code",
+        "outputId": "a6b90b11-a272-4ecb-960d-4c682551b399",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 185,
@@ -2313,14 +2586,13 @@
            "405afa5bb8b840d8bc0850e02f593ce4",
            "78c718e3d5fa4cb892217260bea6d540"
          ]
-        },
-        "outputId": "a6b90b11-a272-4ecb-960d-4c682551b399"
+        }
      },
      "source": [
        "nlp_token_class = pipeline('ner')\n",
        "nlp_token_class('Hugging Face is a French company based in New-York.')"
      ],
-      "execution_count": 4,
+      "execution_count": 0,
      "outputs": [
        {
          "output_type": "display_data",
@@ -2384,6 +2656,7 @@
        },
        "id": "ND_8LzQKrl_u",
        "colab_type": "code",
+        "outputId": "c59ae695-c465-4de6-fa6e-181d8f1a3992",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 117,
@@ -2397,14 +2670,13 @@
            "cd64e3f20b23483daa79712bde6622ea",
            "67cbaa1f55d24e62ad6b022af36bca56"
          ]
-        },
-        "outputId": "c59ae695-c465-4de6-fa6e-181d8f1a3992"
+        }
      },
      "source": [
        "nlp_qa = pipeline('question-answering')\n",
        "nlp_qa(context='Hugging Face is a French company based in New-York.', question='Where is based Hugging Face ?')"
      ],
-      "execution_count": 5,
+      "execution_count": 0,
      "outputs": [
        {
          "output_type": "display_data",
@@ -2470,6 +2742,7 @@
        },
        "id": "zpJQ2HXNrl_4",
        "colab_type": "code",
+        "outputId": "3fb62e7a-25a6-4b06-ced8-51eb8aa6bf33",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 321,
@@ -2483,14 +2756,13 @@
            "a35703cc8ff44e93a8c0eb413caddc40",
            "9df7014c99b343f3b178fa020ff56010"
          ]
-        },
-        "outputId": "3fb62e7a-25a6-4b06-ced8-51eb8aa6bf33"
+        }
      },
      "source": [
        "nlp_fill = pipeline('fill-mask')\n",
        "nlp_fill('Hugging Face is a French company based in ' + nlp_fill.tokenizer.mask_token)"
      ],
-      "execution_count": 6,
+      "execution_count": 0,
      "outputs": [
        {
          "output_type": "display_data",
@@ -2560,11 +2832,11 @@
      "metadata": {
        "id": "8BaOgzi1u1Yc",
        "colab_type": "code",
+        "outputId": "2168e437-cfba-4247-a38c-07f02f555c6e",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 88
-        },
-        "outputId": "2168e437-cfba-4247-a38c-07f02f555c6e"
+        }
      },
      "source": [
        "TEXT_TO_SUMMARIZE = \"\"\" \n",
@@ -2590,7 +2862,7 @@
        "summarizer = pipeline('summarization')\n",
        "summarizer(TEXT_TO_SUMMARIZE)"
      ],
-      "execution_count": 7,
+      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
@@ -2631,6 +2903,7 @@
      "metadata": {
        "id": "8FwayP4nwV3Z",
        "colab_type": "code",
+        "outputId": "66956816-c924-4718-fe58-cabef7d51974",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 83,
@@ -2644,15 +2917,14 @@
            "ad78042ee71a41fd989e4b4ce9d2e3c1",
            "40c8d2617f3d4c84b923b140456fa5da"
          ]
-        },
-        "outputId": "66956816-c924-4718-fe58-cabef7d51974"
+        }
      },
      "source": [
        "# English to French\n",
        "translator = pipeline('translation_en_to_fr')\n",
        "translator(\"HuggingFace is a French company that is based in New York City. HuggingFace's mission is to solve NLP one commit at a time\")"
      ],
-      "execution_count": 8,
+      "execution_count": 0,
      "outputs": [
        {
          "output_type": "display_data",
@@ -2696,6 +2968,7 @@
      "metadata": {
        "colab_type": "code",
        "id": "ra0-WfznwoIW",
+        "outputId": "278a3d5f-cc42-40bc-a9db-c92ec5a3a2f0",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 83,
@@ -2709,15 +2982,14 @@
            "4486f8a2efc34b9aab3864eb5ad2ba48",
            "d6228324f3444aa6bd1323d65ae4ff75"
          ]
-        },
-        "outputId": "278a3d5f-cc42-40bc-a9db-c92ec5a3a2f0"
+        }
      },
      "source": [
        "# English to German\n",
        "translator = pipeline('translation_en_to_de')\n",
        "translator(\"The history of natural language processing (NLP) generally started in the 1950s, although work can be found from earlier periods.\")"
      ],
-      "execution_count": 9,
+      "execution_count": 0,
      "outputs": [
        {
          "output_type": "display_data",
@@ -2756,6 +3028,89 @@
        }
      ]
    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qPUpg0M8hCtB",
+        "colab_type": "text"
+      },
+      "source": [
+        "## 7. Text Generation\n",
+        "\n",
+        "Text generation is currently supported by GPT-2, OpenAI GPT, TransfoXL, XLNet, CTRL and Reformer."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "5pKfxTxohXuZ",
+        "colab_type": "code",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 120,
+          "referenced_widgets": [
+            "3c86415352574190b71e1fe5a15d36f1",
+            "dd2c9dd935754cf2802233053554c21c",
+            "8ae3be32d9c845e59fdb1c47884d48aa",
+            "4dea0031f3554752ad5aad01fe516a60",
+            "1efb96d931a446de92f1930b973ae846",
+            "6a4f5aab5ba949fd860b5a35bba7db9c",
+            "4b02b2e964ad49af9f7ce7023131ceb8",
+            "0ae8a68c3668401da8d8a6d5ec9cac8f"
+          ]
+        },
+        "outputId": "8705f6b4-2413-4ac6-f72d-e5ecce160662"
+      },
+      "source": [
+        "text_generator = pipeline(\"text-generation\")\n",
+        "text_generator(\"Today is a beautiful day and I will\")"
+      ],
+      "execution_count": 5,
+      "outputs": [
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/vnd.jupyter.widget-view+json": {
+              "model_id": "3c86415352574190b71e1fe5a15d36f1",
+              "version_minor": 0,
+              "version_major": 2
+            },
+            "text/plain": [
+              "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=230.0, style=ProgressStyle(description_…"
+            ]
+          },
+          "metadata": {
+            "tags": []
+          }
+        },
+        {
+          "output_type": "stream",
+          "text": [
+            "\n"
+          ],
+          "name": "stdout"
+        },
+        {
+          "output_type": "stream",
+          "text": [
+            "Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence\n"
+          ],
+          "name": "stderr"
+        },
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "[{'generated_text': 'Today is a beautiful day and I will celebrate my birthday!\"\\n\\nThe mother told CNN the two had planned their meal together. After dinner, she added that she and I walked down the street and stopped at a diner near her home. \"He'}]"
+            ]
+          },
+          "metadata": {
+            "tags": []
+          },
+          "execution_count": 5
+        }
+      ]
+    },
    {
      "cell_type": "markdown",
      "metadata": {
@@ -2763,7 +3118,7 @@
        "colab_type": "text"
      },
      "source": [
-        "## 7. Projection - Features Extraction "
+        "## 8. Projection - Features Extraction "
      ]
    },
    {
@@ -2775,6 +3130,7 @@
        },
        "id": "O4SjR1QQrl__",
        "colab_type": "code",
+        "outputId": "2ce966d5-7a89-4488-d48f-626d1c2a8222",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 83,
@@ -2788,8 +3144,7 @@
            "31d97ecf78fa412c99e6659196d82828",
            "c6be5d48ec3c4c799d1445607e5f1ac6"
          ]
-        },
-        "outputId": "2ce966d5-7a89-4488-d48f-626d1c2a8222"
+        }
      },
      "source": [
        "import numpy as np\n",
@@ -2797,7 +3152,7 @@
        "output = nlp_features('Hugging Face is a French company based in Paris')\n",
        "np.array(output).shape   # (Samples, Tokens, Vector Size)\n"
      ],
-      "execution_count": 10,
+      "execution_count": 0,
      "outputs": [
        {
          "output_type": "display_data",
@@ -2861,6 +3216,7 @@
        },
        "id": "yFlBPQHtrmAH",
        "colab_type": "code",
+        "outputId": "03cc3207-a7e8-49fd-904a-63a7a1d0eb7a",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 116,
@@ -2872,8 +3228,7 @@
            "62b10ca525cc4ac68f3a006434eb7416",
            "211109537fbe4e60b89a238c89db1346"
          ]
-        },
-        "outputId": "03cc3207-a7e8-49fd-904a-63a7a1d0eb7a"
+        }
      },
      "source": [
        "task = widgets.Dropdown(\n",
@@ -2906,7 +3261,7 @@
        "input.on_submit(forward)\n",
        "display(task, input)"
      ],
-      "execution_count": 11,
+      "execution_count": 0,
      "outputs": [
        {
          "output_type": "display_data",
@@ -2958,6 +3313,7 @@
        },
        "id": "GCoKbBTYrmAN",
        "colab_type": "code",
+        "outputId": "57c3a647-160a-4b3a-e852-e7a1daf1294a",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 143,
@@ -2969,8 +3325,7 @@
            "d305ba1662e3466c93ab5cca7ebf8f33",
            "879f7a3747ad455d810c7a29918648ee"
          ]
-        },
-        "outputId": "57c3a647-160a-4b3a-e852-e7a1daf1294a"
+        }
      },
      "source": [
        "context = widgets.Textarea(\n",
@@ -2995,7 +3350,7 @@
        "query.on_submit(forward)\n",
        "display(context, query)"
      ],
-      "execution_count": 12,
+      "execution_count": 0,
      "outputs": [
        {
          "output_type": "display_data",
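Note on the text-generation cell above: the stderr line "Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence" indicates that the loaded model defines no padding token (50256 is GPT-2's end-of-sequence id), so generation falls back to the end-of-sequence id for padding. A minimal sketch of that fallback, assuming a standalone helper; `resolve_pad_token_id` is a hypothetical name for illustration, not the library's API.

```python
from typing import Optional


def resolve_pad_token_id(pad_token_id: Optional[int], eos_token_id: Optional[int]) -> Optional[int]:
    """Hypothetical helper: reuse the EOS id for padding when no pad id is set."""
    if pad_token_id is None and eos_token_id is not None:
        print(f"Setting `pad_token_id` to {eos_token_id} (first `eos_token_id`) to generate sequence")
        return eos_token_id
    return pad_token_id


# GPT-2 style: no pad token, EOS id 50256 (the value shown in the log above).
gpt2_pad = resolve_pad_token_id(None, 50256)
# A model that already defines a pad id keeps it unchanged.
other_pad = resolve_pad_token_id(0, 2)
```

The message is therefore informational; it can be avoided by passing an explicit pad id when one is available.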
--- a/setup.py
+++ b/setup.py
@@ -79,13 +79,13 @@ extras["docs"] = ["recommonmark", "sphinx", "sphinx-markdown-tables", "sphinx-rt
 extras["quality"] = [
    "black",
    "isort",
-    "flake8",
+    "flake8==3.7.9",
 ]
 extras["dev"] = extras["testing"] + extras["quality"] + ["mecab-python3", "scikit-learn", "tensorflow", "torch"]

 setup(
    name="transformers",
-    version="2.9.0",
+    version="2.9.1",
    author="Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Sam Shleifer, Google AI Language Team Authors, Open AI team Authors, Facebook AI Authors, Carnegie Mellon University Authors",
    author_email="thomas@huggingface.co",
    description="State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch",
--- a/src/transformers/__init__.py
+++ b/src/transformers/__init__.py
@@ -2,7 +2,7 @@
 # There's no way to ignore "F401 '...' imported but unused" warnings in this
 # module, but to preserve other warnings. So, don't check this module at all.

-__version__ = "2.9.0"
+__version__ = "2.9.1"

 # Work around to update TensorFlow's absl.logging threshold which alters the
 # default Python logging output behavior when present.
@@ -248,7 +248,7 @@ if is_torch_available():
        BART_PRETRAINED_MODEL_ARCHIVE_MAP,
    )
    from .modeling_marian import MarianMTModel
-    from .tokenization_marian import MarianSentencePieceTokenizer
+    from .tokenization_marian import MarianTokenizer
    from .modeling_roberta import (
        RobertaForMaskedLM,
        RobertaModel,
@@ -287,6 +287,7 @@ if is_torch_available():
    from .modeling_albert import (
        AlbertPreTrainedModel,
        AlbertModel,
+        AlbertForPreTraining,
        AlbertForMaskedLM,
        AlbertForSequenceClassification,
        AlbertForQuestionAnswering,
@@ -358,6 +359,7 @@ if is_tf_available():
    from .modeling_tf_auto import (
        TFAutoModel,
        TFAutoModelForPreTraining,
+        TFAutoModelForMultipleChoice,
        TFAutoModelForSequenceClassification,
        TFAutoModelForQuestionAnswering,
        TFAutoModelWithLMHead,
@@ -490,7 +492,9 @@ if is_tf_available():
        TFAlbertPreTrainedModel,
        TFAlbertMainLayer,
        TFAlbertModel,
+        TFAlbertForPreTraining,
        TFAlbertForMaskedLM,
+        TFAlbertForMultipleChoice,
        TFAlbertForSequenceClassification,
        TFAlbertForQuestionAnswering,
        TF_ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
--- a/src/transformers/activations.py
+++ b/src/transformers/activations.py
@@ -26,7 +26,7 @@ def gelu_new(x):
    """ Implementation of the gelu activation function currently in Google Bert repo (identical to OpenAI GPT).
        Also see https://arxiv.org/abs/1606.08415
    """
-    return 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))
+    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))


 if torch.__version__ < "1.4.0":
@@ -36,7 +36,7 @@ else:


 def gelu_fast(x):
-    return 0.5 * x * (1 + torch.tanh(x * 0.7978845608 * (1 + 0.044715 * x * x)))
+    return 0.5 * x * (1.0 + torch.tanh(x * 0.7978845608 * (1.0 + 0.044715 * x * x)))


 ACT2FN = {
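Note on `gelu_new` and `gelu_fast`: both are the tanh approximation of GELU. `gelu_fast` only folds sqrt(2/pi) into the constant 0.7978845608 and factors x out of the cubic term, since x + 0.044715·x³ = x·(1 + 0.044715·x²). The sketch below checks this numerically with plain `math` on scalars instead of `torch` tensors (a simplification for illustration; the `_scalar` function names are hypothetical).

```python
import math


def gelu_new_scalar(x: float) -> float:
    # Same expression as gelu_new in the diff, applied to a Python float.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def gelu_fast_scalar(x: float) -> float:
    # Same expression as gelu_fast: sqrt(2/pi) is pre-computed and x is factored out.
    return 0.5 * x * (1.0 + math.tanh(x * 0.7978845608 * (1.0 + 0.044715 * x * x)))


for x in (-3.0, -1.0, 0.0, 0.5, 2.0):
    diff = abs(gelu_new_scalar(x) - gelu_fast_scalar(x))
    print(f"x={x:+.1f}  gelu_new={gelu_new_scalar(x):+.6f}  |diff|={diff:.2e}")
```

Any remaining difference comes from the truncated constant, so the two functions should agree far below typical float32 precision.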
--- a/src/transformers/commands/convert.py
+++ b/src/transformers/commands/convert.py
@@ -62,7 +62,21 @@ class ConvertCommand(BaseTransformersCLICommand):
        self._finetuning_task_name = finetuning_task_name

    def run(self):
-        if self._model_type == "bert":
+        if self._model_type == "albert":
+            try:
+                from transformers.convert_albert_original_tf_checkpoint_to_pytorch import (
+                    convert_tf_checkpoint_to_pytorch,
+                )
+            except ImportError:
+                msg = (
+                    "transformers can only be used from the commandline to convert TensorFlow models in PyTorch, "
+                    "In that case, it requires TensorFlow to be installed. Please see "
+                    "https://www.tensorflow.org/install/ for installation instructions."
+                )
+                raise ImportError(msg)
+
+            convert_tf_checkpoint_to_pytorch(self._tf_checkpoint, self._config, self._pytorch_dump_output)
+        elif self._model_type == "bert":
            try:
                from transformers.convert_bert_original_tf_checkpoint_to_pytorch import (
                    convert_tf_checkpoint_to_pytorch,
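Note on the new `albert` branch: like the existing `bert` branch, it imports the converter inside `run()`, so the import cost and any TensorFlow dependency are only incurred when a conversion is actually requested, and a failure is reported with an installation hint. A generic sketch of that lazy-import pattern follows; the helper `import_converter` and the module name used in the demo are hypothetical.

```python
import importlib


def import_converter(module_name: str, func_name: str = "convert_tf_checkpoint_to_pytorch"):
    """Hypothetical helper: import a converter on demand, with a readable error."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"Could not import {module_name!r}; converting TensorFlow checkpoints "
            "requires TensorFlow. See https://www.tensorflow.org/install/ for instructions."
        ) from err
    return getattr(module, func_name)


# Demo with a module that does not exist, to exercise the error path.
try:
    import_converter("hypothetical_missing_converter_module")
except ImportError as err:
    message = str(err)
    print(message)
```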
--- a/src/transformers/configuration_auto.py
+++ b/src/transformers/configuration_auto.py
@@ -28,6 +28,7 @@ from .configuration_electra import ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP, Electr
 from .configuration_encoder_decoder import EncoderDecoderConfig
 from .configuration_flaubert import FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, FlaubertConfig
 from .configuration_gpt2 import GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP, GPT2Config
+from .configuration_marian import MarianConfig
 from .configuration_openai import OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP, OpenAIGPTConfig
 from .configuration_reformer import ReformerConfig
 from .configuration_roberta import ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, RobertaConfig
@@ -73,6 +74,7 @@ CONFIG_MAPPING = OrderedDict(
        ("albert", AlbertConfig,),
        ("camembert", CamembertConfig,),
        ("xlm-roberta", XLMRobertaConfig,),
+        ("marian", MarianConfig,),
        ("bart", BartConfig,),
        ("reformer", ReformerConfig,),
        ("roberta", RobertaConfig,),
--- a/src/transformers/configuration_marian.py
+++ b/src/transformers/configuration_marian.py
@@ -23,4 +23,5 @@ PRETRAINED_CONFIG_ARCHIVE_MAP = {


 class MarianConfig(BartConfig):
+    model_type = "marian"
    pretrained_config_archive_map = PRETRAINED_CONFIG_ARCHIVE_MAP
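Note on the `CONFIG_MAPPING` change: `MarianConfig` subclasses `BartConfig`, and the `("marian", ...)` entry is inserted before `("bart", ...)`. One plausible reason is that any lookup that walks an ordered mapping and stops at the first `isinstance` match has to see subclasses before their parents. The sketch below illustrates that ordering effect with toy classes; the class and function names are stand-ins, not the library's code.

```python
from collections import OrderedDict


class ToyBartConfig:
    model_type = "bart"


class ToyMarianConfig(ToyBartConfig):  # subclass, like MarianConfig(BartConfig)
    model_type = "marian"


def first_match(config, mapping):
    """Return the key of the first class in `mapping` that `config` is an instance of."""
    for name, config_class in mapping.items():
        if isinstance(config, config_class):
            return name
    raise ValueError("unrecognized configuration")


subclass_first = OrderedDict([("marian", ToyMarianConfig), ("bart", ToyBartConfig)])
parent_first = OrderedDict([("bart", ToyBartConfig), ("marian", ToyMarianConfig)])

cfg = ToyMarianConfig()
print(first_match(cfg, subclass_first))  # marian
print(first_match(cfg, parent_first))    # bart (the parent shadows the subclass)
```

Setting `model_type = "marian"` on the class gives such a configuration an explicit key of its own, which avoids relying on name matching alone.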
--- a/src/transformers/configuration_reformer.py
+++ b/src/transformers/configuration_reformer.py
@@ -24,7 +24,8 @@ from .configuration_utils import PretrainedConfig
 logger = logging.getLogger(__name__)

 REFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP = {
-    "google/reformer-crime-and-punishment": "https://cdn.huggingface.co/google/reformer-crime-and-punishment/config.json"
+    "google/reformer-crime-and-punishment": "https://cdn.huggingface.co/google/reformer-crime-and-punishment/config.json",
+    "google/reformer-enwik8": "https://cdn.huggingface.co/google/reformer-enwik8/config.json",
 }


--- a/src/transformers/convert_albert_original_tf_checkpoint_to_pytorch.py
+++ b/src/transformers/convert_albert_original_tf_checkpoint_to_pytorch.py
@@ -20,7 +20,7 @@ import logging

 import torch

-from transformers import AlbertConfig, AlbertForMaskedLM, load_tf_weights_in_albert
+from transformers import AlbertConfig, AlbertForPreTraining, load_tf_weights_in_albert


 logging.basicConfig(level=logging.INFO)
@@ -30,7 +30,7 @@ def convert_tf_checkpoint_to_pytorch(tf_checkpoint_path, albert_config_file, pyt
    # Initialise PyTorch model
    config = AlbertConfig.from_json_file(albert_config_file)
    print("Building PyTorch model from configuration: {}".format(str(config)))
-    model = AlbertForMaskedLM(config)
+    model = AlbertForPreTraining(config)

    # Load weights from tf checkpoint
    load_tf_weights_in_albert(model, config, tf_checkpoint_path)
--- a/src/transformers/convert_marian_to_pytorch.py
+++ b/src/transformers/convert_marian_to_pytorch.py
@@ -11,7 +11,8 @@ import numpy as np
 import torch
 from tqdm import tqdm

-from transformers import MarianConfig, MarianMTModel, MarianSentencePieceTokenizer
+from transformers import MarianConfig, MarianMTModel, MarianTokenizer
+from transformers.hf_api import HfApi


 def remove_prefix(text: str, prefix: str):
@@ -38,6 +39,19 @@ def load_layers_(layer_lst: torch.nn.ModuleList, opus_state: dict, converter, is
        layer.load_state_dict(sd, strict=True)


+def find_pretrained_model(src_lang: str, tgt_lang: str) -> List[str]:
+    """Find models that can accept src_lang as input and return tgt_lang as output."""
+    prefix = "Helsinki-NLP/opus-mt-"
+    api = HfApi()
+    model_list = api.model_list()
+    model_ids = [x.modelId for x in model_list if x.modelId.startswith("Helsinki-NLP")]
+    src_and_targ = [
+        remove_prefix(m, prefix).lower().split("-") for m in model_ids if "+" not in m
+    ]  # + cant be loaded.
+    matching = [f"{prefix}{a}-{b}" for (a, b) in src_and_targ if src_lang in a and tgt_lang in b]
+    return matching
+
+
 def add_emb_entries(wemb, final_bias, n_special_tokens=1):
    vsize, d_model = wemb.shape
    embs_to_add = np.zeros((n_special_tokens, d_model))
@@ -81,7 +95,103 @@ def find_model_file(dest_dir):  # this one better
    return model_file


-def parse_readmes(repo_path):
+# Group Names Logic: change long opus model names to something shorter, like opus-mt-en-ROMANCE
+ROM_GROUP = "fr+fr_BE+fr_CA+fr_FR+wa+frp+oc+ca+rm+lld+fur+lij+lmo+es+es_AR+es_CL+es_CO+es_CR+es_DO+es_EC+es_ES+es_GT+es_HN+es_MX+es_NI+es_PA+es_PE+es_PR+es_SV+es_UY+es_VE+pt+pt_br+pt_BR+pt_PT+gl+lad+an+mwl+it+it_IT+co+nap+scn+vec+sc+ro+la"
+GROUPS = [
+    ("cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh", "ZH"),
+    (ROM_GROUP, "ROMANCE"),
+    ("de+nl+fy+af+da+fo+is+no+nb+nn+sv", "NORTH_EU"),
+    ("da+fo+is+no+nb+nn+sv", "SCANDINAVIA"),
+    ("se+sma+smj+smn+sms", "SAMI"),
+    ("nb_NO+nb+nn_NO+nn+nog+no_nb+no", "NORWAY"),
+    ("ga+cy+br+gd+kw+gv", "CELTIC"),  # https://en.wikipedia.org/wiki/Insular_Celtic_languages
+]
+GROUP_TO_OPUS_NAME = {
+    "opus-mt-ZH-de": "cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh-de",
+    "opus-mt-ZH-fi": "cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh-fi",
+    "opus-mt-ZH-sv": "cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh-sv",
+    "opus-mt-SCANDINAVIA-SCANDINAVIA": "da+fo+is+no+nb+nn+sv-da+fo+is+no+nb+nn+sv",
+    "opus-mt-NORTH_EU-NORTH_EU": "de+nl+fy+af+da+fo+is+no+nb+nn+sv-de+nl+fy+af+da+fo+is+no+nb+nn+sv",
+    "opus-mt-de-ZH": "de-cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh",
+    "opus-mt-en_el_es_fi-en_el_es_fi": "en+el+es+fi-en+el+es+fi",
+    "opus-mt-en-ROMANCE": "en-fr+fr_BE+fr_CA+fr_FR+wa+frp+oc+ca+rm+lld+fur+lij+lmo+es+es_AR+es_CL+es_CO+es_CR+es_DO"
+    "+es_EC+es_ES+es_GT+es_HN+es_MX+es_NI+es_PA+es_PE+es_PR+es_SV+es_UY+es_VE+pt+pt_br+pt_BR"
+    "+pt_PT+gl+lad+an+mwl+it+it_IT+co+nap+scn+vec+sc+ro+la",
+    "opus-mt-en-CELTIC": "en-ga+cy+br+gd+kw+gv",
+    "opus-mt-es-NORWAY": "es-nb_NO+nb+nn_NO+nn+nog+no_nb+no",
+    "opus-mt-fi_nb_no_nn_ru_sv_en-SAMI": "fi+nb+no+nn+ru+sv+en-se+sma+smj+smn+sms",
+    "opus-mt-fi-ZH": "fi-cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh",
+    "opus-mt-fi-NORWAY": "fi-nb_NO+nb+nn_NO+nn+nog+no_nb+no",
+    "opus-mt-ROMANCE-en": "fr+fr_BE+fr_CA+fr_FR+wa+frp+oc+ca+rm+lld+fur+lij+lmo+es+es_AR+es_CL+es_CO+es_CR+es_DO"
+    "+es_EC+es_ES+es_GT+es_HN+es_MX+es_NI+es_PA+es_PE+es_PR+es_SV+es_UY+es_VE+pt+pt_br+pt_BR"
+    "+pt_PT+gl+lad+an+mwl+it+it_IT+co+nap+scn+vec+sc+ro+la-en",
+    "opus-mt-CELTIC-en": "ga+cy+br+gd+kw+gv-en",
+    "opus-mt-sv-ZH": "sv-cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh",
+    "opus-mt-sv-NORWAY": "sv-nb_NO+nb+nn_NO+nn+nog+no_nb+no",
+}
+OPUS_GITHUB_URL = "https://github.com/Helsinki-NLP/OPUS-MT-train/blob/master/models/"
+ORG_NAME = "Helsinki-NLP/"
+
+
+def convert_opus_name_to_hf_name(x):
+    for substr, grp_name in GROUPS:
+        x = x.replace(substr, grp_name)
+    return x.replace("+", "_")
+
+
+def convert_hf_name_to_opus_name(hf_model_name):
+    """Relies on the assumption that there are no language codes like pt_br in models that are not in GROUP_TO_OPUS_NAME."""
+    hf_model_name = remove_prefix(hf_model_name, ORG_NAME)
+    if hf_model_name in GROUP_TO_OPUS_NAME:
+        opus_w_prefix = GROUP_TO_OPUS_NAME[hf_model_name]
+    else:
+        opus_w_prefix = hf_model_name.replace("_", "+")
+    return remove_prefix(opus_w_prefix, "opus-mt-")
+
+
+def write_model_card(
+    hf_model_name: str,
+    repo_path="OPUS-MT-train/models/",
+    dry_run=False,
+    model_card_dir=Path("marian_converted/model_cards/Helsinki-NLP/"),
+) -> str:
+    """Copy the most recent model's readme section from opus, and add metadata.
+    upload command: s3cmd sync --recursive model_card_dir s3://models.huggingface.co/bert/Helsinki-NLP/
+    """
+    hf_model_name = remove_prefix(hf_model_name, ORG_NAME)
+    opus_name: str = convert_hf_name_to_opus_name(hf_model_name)
+    opus_src, opus_tgt = [x.split("+") for x in opus_name.split("-")]
+    readme_url = OPUS_GITHUB_URL + f"{opus_name}/README.md"
+    s, t = ",".join(opus_src), ",".join(opus_tgt)
+    extra_markdown = f"### {hf_model_name}\n\n* source languages: {s}\n* target languages: {t}\n*  OPUS readme: [{opus_name}]({readme_url})\n"
+    # combine with opus markdown
+    opus_readme_path = Path(f"{repo_path}{opus_name}/README.md")
+    assert opus_readme_path.exists(), opus_readme_path
+    content = opus_readme_path.open().read()
+    content = content.split("\n# ")[-1]  # Get the lowest level 1 header in the README -- the most recent model.
+    content = "*".join(content.split("*")[1:])
+    content = extra_markdown + "\n* " + content.replace("download", "download original weights")
+    if dry_run:
+        return content
+    # Save string to model_cards/hf_model_name/readme.md
+    model_card_dir.mkdir(exist_ok=True)
+    sub_dir = model_card_dir / hf_model_name
+    sub_dir.mkdir(exist_ok=True)
+    dest = sub_dir / "README.md"
+    dest.open("w").write(content)
+    return content
+
+
+def get_clean_model_id_mapping(multiling_model_ids):
+    return {x: convert_opus_name_to_hf_name(x) for x in multiling_model_ids}
+
+
+def make_registry(repo_path="Opus-MT-train/models"):
+    if not (Path(repo_path) / "fr-en" / "README.md").exists():
+        raise ValueError(
+            f"repo_path:{repo_path} does not exist: "
+            "You must run: git clone git@github.com:Helsinki-NLP/Opus-MT-train.git before calling."
+        )
    results = {}
    for p in Path(repo_path).ls():
        n_dash = p.name.count("-")
@@ -90,22 +200,48 @@ def parse_readmes(repo_path):
        else:
            lns = list(open(p / "README.md").readlines())
            results[p.name] = _parse_readme(lns)
-    return results
+    return [(k, v["pre-processing"], v["download"], v["download"][:-4] + ".test.txt") for k, v in results.items()]


-def download_all_sentencepiece_models(repo_path="Opus-MT-train/models"):
+def convert_all_sentencepiece_models(model_list=None, repo_path=None):
    """Requires 300GB"""
    save_dir = Path("marian_ckpt")
-    if not Path(repo_path).exists():
-        raise ValueError("You must run: git clone git@github.com:Helsinki-NLP/Opus-MT-train.git")
-    results: dict = parse_readmes(repo_path)
-    for k, v in tqdm(list(results.items())):
-        if os.path.exists(save_dir / k):
-            print(f"already have path {k}")
+    dest_dir = Path("marian_converted")
+    dest_dir.mkdir(exist_ok=True)
+    if model_list is None:
+        model_list: list = make_registry(repo_path=repo_path)
+    for k, prepro, download, test_set_url in tqdm(model_list):
+        if "SentencePiece" not in prepro:  # don't convert BPE models.
            continue
-        if "SentencePiece" not in v["pre-processing"]:
+        if not os.path.exists(save_dir / k / "pytorch_model.bin"):
+            download_and_unzip(download, save_dir / k)
+        pair_name = convert_opus_name_to_hf_name(k)
+        convert(save_dir / k, dest_dir / f"opus-mt-{pair_name}")
+
+
+def lmap(f, x) -> List:
+    return list(map(f, x))
+
+
+def fetch_test_set(test_set_url):
+    import wget
+
+    fname = wget.download(test_set_url, "opus_test.txt")
+    lns = Path(fname).open().readlines()
+    src = lmap(str.strip, lns[::4])
+    gold = lmap(str.strip, lns[1::4])
+    mar_model = lmap(str.strip, lns[2::4])
+    assert len(gold) == len(mar_model) == len(src)
+    os.remove(fname)
+    return src, mar_model, gold
+
+
+def convert_whole_dir(path=Path("marian_ckpt/")):
+    for subdir in tqdm(list(path.ls())):
+        dest_dir = Path("marian_converted") / subdir.name
+        if (dest_dir / "pytorch_model.bin").exists():
            continue
-        download_and_unzip(v["download"], save_dir / k)
+        convert(subdir, dest_dir)


 def _parse_readme(lns):
@@ -131,7 +267,7 @@ def _parse_readme(lns):
    return subres


-def write_metadata(dest_dir: Path):
+def save_tokenizer_config(dest_dir: Path):
    dname = dest_dir.name.split("-")
    dct = dict(target_lang=dname[-1], source_lang="-".join(dname[:-1]))
    save_json(dct, dest_dir / "tokenizer_config.json")
@@ -148,13 +284,17 @@ def add_to_vocab_(vocab: Dict[str, int], special_tokens: List[str]):
    return added


+def find_vocab_file(model_dir):
+    return list(model_dir.glob("*vocab.yml"))[0]
+
+
 def add_special_tokens_to_vocab(model_dir: Path) -> None:
-    vocab = load_yaml(model_dir / "opus.spm32k-spm32k.vocab.yml")
+    vocab = load_yaml(find_vocab_file(model_dir))
    vocab = {k: int(v) for k, v in vocab.items()}
    num_added = add_to_vocab_(vocab, ["<pad>"])
    print(f"added {num_added} tokens to vocab")
    save_json(vocab, model_dir / "vocab.json")
-    write_metadata(model_dir)
+    save_tokenizer_config(model_dir)


 def save_tokenizer(self, save_directory):
@@ -251,7 +391,6 @@ class OpusState:

        # Process decoder.yml
        decoder_yml = cast_marian_config(load_yaml(source_dir / "decoder.yml"))
-        # TODO: what are normalize and word-penalty?
        check_marian_cfg_assumptions(cfg)
        self.hf_config = MarianConfig(
            vocab_size=cfg["vocab_size"],
@@ -273,6 +412,9 @@ class OpusState:
            dropout=0.1,  # see opus-mt-train repo/transformer-dropout param.
            # default: add_final_layer_norm=False,
            num_beams=decoder_yml["beam-size"],
+            decoder_start_token_id=self.pad_token_id,
+            bad_words_ids=[[self.pad_token_id]],
+            max_length=512,
        )

    def _check_layer_entries(self):
@@ -349,12 +491,12 @@ def download_and_unzip(url, dest_dir):
    os.remove(filename)


-def main(source_dir, dest_dir):
+def convert(source_dir: Path, dest_dir):
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(exist_ok=True)

    add_special_tokens_to_vocab(source_dir)
-    tokenizer = MarianSentencePieceTokenizer.from_pretrained(str(source_dir))
+    tokenizer = MarianTokenizer.from_pretrained(str(source_dir))
    save_tokenizer(tokenizer, dest_dir)

    opus_state = OpusState(source_dir)
@@ -377,7 +519,7 @@ if __name__ == "__main__":
    source_dir = Path(args.src)
    assert source_dir.exists()
    dest_dir = f"converted-{source_dir.name}" if args.dest is None else args.dest
-    main(source_dir, dest_dir)
+    convert(source_dir, dest_dir)


 def load_yaml(path):
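The group-name logic added in this file can be tried in isolation. The following is a minimal sketch, not the converter itself: it copies the two name-conversion functions from the diff, uses only a small subset of `GROUPS` and `GROUP_TO_OPUS_NAME`, and assumes that `remove_prefix` (whose body is not shown in this diff) returns the text unchanged when the prefix is absent.

```python
# Subset of the GROUPS table from the diff. Order matters: NORTH_EU must come
# before SCANDINAVIA because the Scandinavian list is a suffix of the North EU list.
GROUPS = [
    ("de+nl+fy+af+da+fo+is+no+nb+nn+sv", "NORTH_EU"),
    ("da+fo+is+no+nb+nn+sv", "SCANDINAVIA"),
    ("ga+cy+br+gd+kw+gv", "CELTIC"),
]
# Subset of GROUP_TO_OPUS_NAME from the diff.
GROUP_TO_OPUS_NAME = {
    "opus-mt-en-CELTIC": "en-ga+cy+br+gd+kw+gv",
    "opus-mt-CELTIC-en": "ga+cy+br+gd+kw+gv-en",
}
ORG_NAME = "Helsinki-NLP/"


def remove_prefix(text: str, prefix: str) -> str:
    # Assumed behaviour: strip the prefix if present, otherwise return text as-is.
    return text[len(prefix):] if text.startswith(prefix) else text


def convert_opus_name_to_hf_name(x: str) -> str:
    # Replace each known language list by its group name, then make "+" URL-safe.
    for substr, grp_name in GROUPS:
        x = x.replace(substr, grp_name)
    return x.replace("+", "_")


def convert_hf_name_to_opus_name(hf_model_name: str) -> str:
    # Group names need the lookup table; other names only swap "_" back to "+".
    hf_model_name = remove_prefix(hf_model_name, ORG_NAME)
    if hf_model_name in GROUP_TO_OPUS_NAME:
        opus_w_prefix = GROUP_TO_OPUS_NAME[hf_model_name]
    else:
        opus_w_prefix = hf_model_name.replace("_", "+")
    return remove_prefix(opus_w_prefix, "opus-mt-")


celtic = convert_opus_name_to_hf_name("en-ga+cy+br+gd+kw+gv")
multi = convert_opus_name_to_hf_name("en+el+es+fi-en+el+es+fi")
round_trip = convert_hf_name_to_opus_name("Helsinki-NLP/opus-mt-" + celtic)
```

As the docstring in the diff notes, the reverse direction is only safe while no ungrouped model name contains a language code with an underscore, such as `pt_br`.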
--- a/src/transformers/convert_pytorch_checkpoint_to_tf2.py
+++ b/src/transformers/convert_pytorch_checkpoint_to_tf2.py
@@ -46,7 +46,7 @@ from transformers import (
    OpenAIGPTConfig,
    RobertaConfig,
    T5Config,
-    TFAlbertForMaskedLM,
+    TFAlbertForPreTraining,
    TFBertForPreTraining,
    TFBertForQuestionAnswering,
    TFBertForSequenceClassification,
@@ -109,7 +109,7 @@ if is_torch_available():
        DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
        CTRLLMHeadModel,
        CTRL_PRETRAINED_MODEL_ARCHIVE_MAP,
-        AlbertForMaskedLM,
+        AlbertForPreTraining,
        ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
        T5ForConditionalGeneration,
        T5_PRETRAINED_MODEL_ARCHIVE_MAP,
@@ -148,7 +148,7 @@ else:
        DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
        CTRLLMHeadModel,
        CTRL_PRETRAINED_MODEL_ARCHIVE_MAP,
-        AlbertForMaskedLM,
+        AlbertForPreTraining,
        ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
        T5ForConditionalGeneration,
        T5_PRETRAINED_MODEL_ARCHIVE_MAP,
@@ -318,8 +318,8 @@ MODEL_CLASSES = {
    ),
    "albert": (
        AlbertConfig,
-        TFAlbertForMaskedLM,
-        AlbertForMaskedLM,
+        TFAlbertForPreTraining,
+        AlbertForPreTraining,
        ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
        ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
    ),
--- a/src/transformers/convert_reformer_trax_checkpoint_to_pytorch.py
+++ b/src/transformers/convert_reformer_trax_checkpoint_to_pytorch.py
@@ -93,7 +93,7 @@ def set_block_weights_in_torch(weights, torch_block, hidden_size):
        set_layer_weights_in_torch_local(attn_weights, torch_block.attention, hidden_size)

    # intermediate weighs
-    intermediate_weights = weights[2][0][2][2]
+    intermediate_weights = weights[2][0][1][2]

    # Chunked Feed Forward
    if len(intermediate_weights) == 4:
@@ -145,19 +145,16 @@ def set_model_weights_in_torch(weights, torch_model, hidden_size):
            position_embeddings.weights[emb_idx] = torch.nn.Parameter(torch.tensor(emb_weights))

    trax_layer_weights = weights[5]
-    assert len(torch_model_reformer.encoder.layers) * 4 + 1 == len(
+    assert len(torch_model_reformer.encoder.layers) * 4 == len(
        trax_layer_weights
    ), "HF and trax model do not have the same number of layers"
    for layer_idx, layer in enumerate(torch_model_reformer.encoder.layers):
        block_weights = trax_layer_weights[4 * layer_idx : 4 * (layer_idx + 1)]
        set_block_weights_in_torch(block_weights, layer, hidden_size)

-    # output weights
-    out_weights = weights[6]
-
    # output layer norm
-    layer_norm_out_weight = np.asarray(out_weights[0][0])
-    layer_norm_out_bias = np.asarray(out_weights[0][1])
+    layer_norm_out_weight = np.asarray(weights[7][0])
+    layer_norm_out_bias = np.asarray(weights[7][1])
    set_param(
        torch_model_reformer.encoder.layer_norm,
        torch.tensor(layer_norm_out_weight),
@@ -165,8 +162,8 @@ def set_model_weights_in_torch(weights, torch_model, hidden_size):
    )

    # output embeddings
-    output_embed_weights = np.asarray(out_weights[2][0])
-    output_embed_bias = np.asarray(out_weights[2][1])
+    output_embed_weights = np.asarray(weights[9][0])
+    output_embed_bias = np.asarray(weights[9][1])
    set_param(
        torch_model.lm_head.decoder,
        torch.tensor(output_embed_weights).transpose(0, 1).contiguous(),
--- a/src/transformers/data/datasets/glue.py
+++ b/src/transformers/data/datasets/glue.py
@@ -5,12 +5,12 @@ from dataclasses import dataclass, field
 from typing import List, Optional

 import torch
+from filelock import FileLock
 from torch.utils.data.dataset import Dataset

 from ...tokenization_roberta import RobertaTokenizer, RobertaTokenizerFast
 from ...tokenization_utils import PreTrainedTokenizer
 from ...tokenization_xlm_roberta import XLMRobertaTokenizer
-from ...trainer import torch_distributed_zero_first
 from ..processors.glue import glue_convert_examples_to_features, glue_output_modes, glue_processors
 from ..processors.utils import InputFeatures

@@ -63,7 +63,6 @@ class GlueDataset(Dataset):
        tokenizer: PreTrainedTokenizer,
        limit_length: Optional[int] = None,
        evaluate=False,
-        local_rank=-1,
    ):
        self.args = args
        processor = glue_processors[args.task_name]()
@@ -75,9 +74,11 @@ class GlueDataset(Dataset):
                "dev" if evaluate else "train", tokenizer.__class__.__name__, str(args.max_seq_length), args.task_name,
            ),
        )
-        with torch_distributed_zero_first(local_rank):
-            # Make sure only the first process in distributed training processes the dataset,
-            # and the others will use the cache.
+
+        # Make sure only the first process in distributed training processes the dataset,
+        # and the others will use the cache.
+        lock_path = cached_features_file + ".lock"
+        with FileLock(lock_path):

            if os.path.exists(cached_features_file) and not args.overwrite_cache:
                start = time.time()
@@ -109,13 +110,12 @@ class GlueDataset(Dataset):
                    label_list=label_list,
                    output_mode=self.output_mode,
                )
-                if local_rank in [-1, 0]:
-                    start = time.time()
-                    torch.save(self.features, cached_features_file)
-                    # ^ This seems to take a lot of time so I want to investigate why and how we can improve.
-                    logger.info(
-                        f"Saving features into cached file %s [took %.3f s]", cached_features_file, time.time() - start
-                    )
+                start = time.time()
+                torch.save(self.features, cached_features_file)
+                # ^ This seems to take a lot of time so I want to investigate why and how we can improve.
+                logger.info(
+                    "Saving features into cached file %s [took %.3f s]", cached_features_file, time.time() - start
+                )

    def __len__(self):
        return len(self.features)
--- a/src/transformers/file_utils.py
+++ b/src/transformers/file_utils.py
@@ -15,6 +15,7 @@ import tempfile
 from contextlib import contextmanager
 from functools import partial, wraps
 from hashlib import sha256
+from pathlib import Path
 from typing import Optional
 from urllib.parse import urlparse
 from zipfile import ZipFile, is_zipfile
@@ -68,19 +69,10 @@ except ImportError:
    )
 default_cache_path = os.path.join(torch_cache_home, "transformers")

-try:
-    from pathlib import Path

-    PYTORCH_PRETRAINED_BERT_CACHE = Path(
-        os.getenv("PYTORCH_TRANSFORMERS_CACHE", os.getenv("PYTORCH_PRETRAINED_BERT_CACHE", default_cache_path))
-    )
-except (AttributeError, ImportError):
-    PYTORCH_PRETRAINED_BERT_CACHE = os.getenv(
-        "PYTORCH_TRANSFORMERS_CACHE", os.getenv("PYTORCH_PRETRAINED_BERT_CACHE", default_cache_path)
-    )
-
-PYTORCH_TRANSFORMERS_CACHE = PYTORCH_PRETRAINED_BERT_CACHE  # Kept for backward compatibility
-TRANSFORMERS_CACHE = PYTORCH_PRETRAINED_BERT_CACHE  # Kept for backward compatibility
+PYTORCH_PRETRAINED_BERT_CACHE = os.getenv("PYTORCH_PRETRAINED_BERT_CACHE", default_cache_path)
+PYTORCH_TRANSFORMERS_CACHE = os.getenv("PYTORCH_TRANSFORMERS_CACHE", PYTORCH_PRETRAINED_BERT_CACHE)
+TRANSFORMERS_CACHE = os.getenv("TRANSFORMERS_CACHE", PYTORCH_TRANSFORMERS_CACHE)

 WEIGHTS_NAME = "pytorch_model.bin"
 TF2_WEIGHTS_NAME = "tf_model.h5"
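The three cache variables above now form a fallback chain in which the newest name wins. A minimal sketch of that lookup order follows; `resolve_cache_dir` is a hypothetical helper written for illustration and takes a plain dict in place of `os.environ` so the order can be checked without touching the real environment.

```python
def resolve_cache_dir(env: dict, default_cache_path: str) -> str:
    # Mirrors the three assignments in the diff: each newer variable
    # falls back to the value resolved for the older one.
    bert_cache = env.get("PYTORCH_PRETRAINED_BERT_CACHE", default_cache_path)
    pytorch_cache = env.get("PYTORCH_TRANSFORMERS_CACHE", bert_cache)
    return env.get("TRANSFORMERS_CACHE", pytorch_cache)


default = "/home/user/.cache/torch/transformers"  # illustrative path only
no_vars = resolve_cache_dir({}, default)
oldest_only = resolve_cache_dir({"PYTORCH_PRETRAINED_BERT_CACHE": "/a"}, default)
newest_wins = resolve_cache_dir(
    {"PYTORCH_PRETRAINED_BERT_CACHE": "/a", "TRANSFORMERS_CACHE": "/c"}, default
)
```

Note that the resolved values are now plain strings; the removed code wrapped them in `Path`.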
--- a/src/transformers/modeling_albert.py
+++ b/src/transformers/modeling_albert.py
@@ -111,7 +111,8 @@ def load_tf_weights_in_albert(model, config, tf_checkpoint_path):

        # No ALBERT model currently handles the next sentence prediction task
        if "seq_relationship" in name:
-            continue
+            name = name.replace("seq_relationship/output_", "sop_classifier/classifier/")
+            name = name.replace("weights", "weight")

        name = name.split("/")

@@ -174,7 +175,7 @@ class AlbertEmbeddings(BertEmbeddings):
    def __init__(self, config):
        super().__init__(config)

-        self.word_embeddings = nn.Embedding(config.vocab_size, config.embedding_size, padding_idx=0)
+        self.word_embeddings = nn.Embedding(config.vocab_size, config.embedding_size, padding_idx=config.pad_token_id)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.embedding_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.embedding_size)
        self.LayerNorm = torch.nn.LayerNorm(config.embedding_size, eps=config.layer_norm_eps)
@@ -568,6 +569,115 @@ class AlbertModel(AlbertPreTrainedModel):
        return outputs


+@add_start_docstrings(
+    """Albert Model with two heads on top as done during the pre-training: a `masked language modeling` head and
+    a `sentence order prediction (classification)` head. """,
+    ALBERT_START_DOCSTRING,
+)
+class AlbertForPreTraining(AlbertPreTrainedModel):
+    def __init__(self, config):
+        super().__init__(config)
+
+        self.albert = AlbertModel(config)
+        self.predictions = AlbertMLMHead(config)
+        self.sop_classifier = AlbertSOPHead(config)
+
+        self.init_weights()
+        self.tie_weights()
+
+    def tie_weights(self):
+        self._tie_or_clone_weights(self.predictions.decoder, self.albert.embeddings.word_embeddings)
+
+    def get_output_embeddings(self):
+        return self.predictions.decoder
+
+    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)
+    def forward(
+        self,
+        input_ids=None,
+        attention_mask=None,
+        token_type_ids=None,
+        position_ids=None,
+        head_mask=None,
+        inputs_embeds=None,
+        masked_lm_labels=None,
+        sentence_order_label=None,
+    ):
+        r"""
+        masked_lm_labels (``torch.LongTensor`` of shape ``(batch_size, sequence_length)``, `optional`, defaults to :obj:`None`):
+            Labels for computing the masked language modeling loss.
+            Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
+            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
+            in ``[0, ..., config.vocab_size]``
+        sentence_order_label (``torch.LongTensor`` of shape ``(batch_size,)``, `optional`, defaults to :obj:`None`):
+            Labels for computing the sentence order prediction (classification) loss. Input should be a sequence pair (see :obj:`input_ids` docstring)
+            Indices should be in ``[0, 1]``.
+            ``0`` indicates original order (sequence A, then sequence B),
+            ``1`` indicates switched order (sequence B, then sequence A).
+
+    Returns:
+        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.AlbertConfig`) and inputs:
+        loss (`optional`, returned when ``masked_lm_labels`` and ``sentence_order_label`` are provided) ``torch.FloatTensor`` of shape ``(1,)``:
+            Total loss as the sum of the masked language modeling loss and the sentence order prediction (classification) loss.
+        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)
+            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
+        sop_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, 2)`):
+            Prediction scores of the sentence order prediction (classification) head (scores of original/switched
+            order before SoftMax).
+        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):
+            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
+            of shape :obj:`(batch_size, sequence_length, hidden_size)`.
+
+            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
+        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):
+            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
+            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
+
+            Attention weights after the attention softmax, used to compute the weighted average in the self-attention
+            heads.
+
+
+    Examples::
+
+        from transformers import AlbertTokenizer, AlbertForPreTraining
+        import torch
+
+        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
+        model = AlbertForPreTraining.from_pretrained('albert-base-v2')
+
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids)
+
+        prediction_scores, sop_scores = outputs[:2]
+
+        """
+
+        outputs = self.albert(
+            input_ids,
+            attention_mask=attention_mask,
+            token_type_ids=token_type_ids,
+            position_ids=position_ids,
+            head_mask=head_mask,
+            inputs_embeds=inputs_embeds,
+        )
+
+        sequence_output, pooled_output = outputs[:2]
+
+        prediction_scores = self.predictions(sequence_output)
+        sop_scores = self.sop_classifier(pooled_output)
+
+        outputs = (prediction_scores, sop_scores,) + outputs[2:]  # add hidden states and attention if they are here
+
+        if masked_lm_labels is not None and sentence_order_label is not None:
+            loss_fct = CrossEntropyLoss()
+            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))
+            sentence_order_loss = loss_fct(sop_scores.view(-1, 2), sentence_order_label.view(-1))
+            total_loss = masked_lm_loss + sentence_order_loss
+            outputs = (total_loss,) + outputs
+
+        return outputs  # (loss), prediction_scores, sop_scores, (hidden_states), (attentions)
+
+
 class AlbertMLMHead(nn.Module):
    def __init__(self, config):
        super().__init__()
@@ -592,6 +702,19 @@ class AlbertMLMHead(nn.Module):
        return prediction_scores


+class AlbertSOPHead(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+
+        self.dropout = nn.Dropout(config.classifier_dropout_prob)
+        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
+
+    def forward(self, pooled_output):
+        dropout_pooled_output = self.dropout(pooled_output)
+        logits = self.classifier(dropout_pooled_output)
+        return logits
+
+
@add_start_docstrings(
    "Albert Model with a `language modeling` head on top.", ALBERT_START_DOCSTRING,
 )
@@ -932,7 +1055,7 @@ class AlbertForQuestionAnswering(AlbertPreTrainedModel):
    Examples::

        # The checkpoint albert-base-v2 is not fine-tuned for question answering. Please see the
-        # examples/run_squad.py example to see how to fine-tune a model to a question answering task.
+        # examples/question-answering/run_squad.py example to see how to fine-tune a model to a question answering task.

        from transformers import AlbertTokenizer, AlbertForQuestionAnswering
        import torch
--- a/src/transformers/modeling_auto.py
+++ b/src/transformers/modeling_auto.py
@@ -39,10 +39,12 @@ from .configuration_auto import (
    XLMRobertaConfig,
    XLNetConfig,
 )
+from .configuration_marian import MarianConfig
 from .configuration_utils import PretrainedConfig
 from .modeling_albert import (
    ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
    AlbertForMaskedLM,
+    AlbertForPreTraining,
    AlbertForQuestionAnswering,
    AlbertForSequenceClassification,
    AlbertForTokenClassification,
@@ -97,6 +99,7 @@ from .modeling_flaubert import (
    FlaubertWithLMHeadModel,
 )
 from .modeling_gpt2 import GPT2_PRETRAINED_MODEL_ARCHIVE_MAP, GPT2LMHeadModel, GPT2Model
+from .modeling_marian import MarianMTModel
 from .modeling_openai import OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP, OpenAIGPTLMHeadModel, OpenAIGPTModel
 from .modeling_reformer import ReformerModel, ReformerModelWithLMHead
 from .modeling_roberta import (
@@ -189,7 +192,7 @@ MODEL_FOR_PRETRAINING_MAPPING = OrderedDict(
    [
        (T5Config, T5ForConditionalGeneration),
        (DistilBertConfig, DistilBertForMaskedLM),
-        (AlbertConfig, AlbertForMaskedLM),
+        (AlbertConfig, AlbertForPreTraining),
        (CamembertConfig, CamembertForMaskedLM),
        (XLMRobertaConfig, XLMRobertaForMaskedLM),
        (BartConfig, BartForConditionalGeneration),
@@ -213,6 +216,7 @@ MODEL_WITH_LM_HEAD_MAPPING = OrderedDict(
        (AlbertConfig, AlbertForMaskedLM),
        (CamembertConfig, CamembertForMaskedLM),
        (XLMRobertaConfig, XLMRobertaForMaskedLM),
+        (MarianConfig, MarianMTModel),
        (BartConfig, BartForConditionalGeneration),
        (RobertaConfig, RobertaForMaskedLM),
        (BertConfig, BertForMaskedLM),
@@ -902,7 +906,7 @@ class AutoModelForQuestionAnswering:
        Examples::

            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.
-            model = AutoModelForSequenceClassification.from_config(config)  # E.g. model was saved using `save_pretrained('./test/saved_model/')`
+            model = AutoModelForQuestionAnswering.from_config(config)  # E.g. model was saved using `save_pretrained('./test/saved_model/')`
        """
        for config_class, model_class in MODEL_FOR_QUESTION_ANSWERING_MAPPING.items():
            if isinstance(config, config_class):
--- a/src/transformers/modeling_bart.py
+++ b/src/transformers/modeling_bart.py
@@ -886,7 +886,7 @@ class BartForConditionalGeneration(PretrainedBartModel):
        if new_num_tokens <= old_num_tokens:
            new_bias = self.final_logits_bias[:, :new_num_tokens]
        else:
-            extra_bias = torch.zeros((1, new_num_tokens - old_num_tokens))
+            extra_bias = torch.zeros((1, new_num_tokens - old_num_tokens), device=self.final_logits_bias.device)
            new_bias = torch.cat([self.final_logits_bias, extra_bias], dim=1)
        self.register_buffer("final_logits_bias", new_bias)

@@ -980,12 +980,12 @@ class BartForConditionalGeneration(PretrainedBartModel):
            "use_cache": use_cache,  # change this to avoid caching (presumably for debugging)
        }

-    def prepare_scores_for_generation(self, scores, cur_len, max_length):
+    def prepare_logits_for_generation(self, logits, cur_len, max_length):
        if cur_len == 1:
-            self._force_token_ids_generation(scores, self.config.bos_token_id)
+            self._force_token_ids_generation(logits, self.config.bos_token_id)
        if cur_len == max_length - 1 and self.config.eos_token_id is not None:
-            self._force_token_ids_generation(scores, self.config.eos_token_id)
-        return scores
+            self._force_token_ids_generation(logits, self.config.eos_token_id)
+        return logits

    def _force_token_ids_generation(self, scores, token_ids) -> None:
        """force one of token_ids to be generated by setting prob of all other tokens to 0"""
--- a/src/transformers/modeling_distilbert.py
+++ b/src/transformers/modeling_distilbert.py
@@ -61,7 +61,7 @@ def create_sinusoidal_embeddings(n_pos, dim, out):
 class Embeddings(nn.Module):
    def __init__(self, config):
        super().__init__()
-        self.word_embeddings = nn.Embedding(config.vocab_size, config.dim, padding_idx=0)
+        self.word_embeddings = nn.Embedding(config.vocab_size, config.dim, padding_idx=config.pad_token_id)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.dim)
        if config.sinusoidal_pos_embds:
            create_sinusoidal_embeddings(
--- a/src/transformers/modeling_gpt2.py
+++ b/src/transformers/modeling_gpt2.py
@@ -142,10 +142,10 @@ class Attention(nn.Module):
    def _attn(self, q, k, v, attention_mask=None, head_mask=None):
        w = torch.matmul(q, k)
        if self.scale:
-            w = w / (v.size(-1) ** 0.5)
+            w = w / (float(v.size(-1)) ** 0.5)
        nd, ns = w.size(-2), w.size(-1)
        mask = self.bias[:, :, ns - nd : ns, :ns]
-        w = torch.where(mask, w, self.masked_bias)
+        w = torch.where(mask.bool(), w, self.masked_bias.to(w.dtype))

        if attention_mask is not None:
            # Apply the attention mask
--- a/src/transformers/modeling_marian.py
+++ b/src/transformers/modeling_marian.py
@@ -18,18 +18,33 @@
 from transformers.modeling_bart import BartForConditionalGeneration


-PRETRAINED_MODEL_ARCHIVE_MAP = {
-    "opus-mt-en-de": "https://cdn.huggingface.co/Helsinki-NLP/opus-mt-en-de/pytorch_model.bin",
-}
-
-
 class MarianMTModel(BartForConditionalGeneration):
-    """Pytorch version of marian-nmt's transformer.h (c++). Designed for the OPUS-NMT translation checkpoints.
-    Model API is identical to BartForConditionalGeneration"""
+    r"""
+    Pytorch version of marian-nmt's transformer.h (c++). Designed for the OPUS-NMT translation checkpoints.
+    Model API is identical to BartForConditionalGeneration.
+    Available models are listed at `Model List <https://huggingface.co/models?search=Helsinki-NLP>`__

-    pretrained_model_archive_map = PRETRAINED_MODEL_ARCHIVE_MAP
+    Examples::

-    def prepare_scores_for_generation(self, scores, cur_len, max_length):
+        from transformers import MarianTokenizer, MarianMTModel
+        from typing import List
+        src = 'fr'  # source language
+        trg = 'en'  # target language
+        sample_text = "où est l'arrêt de bus ?"
+        mname = f'Helsinki-NLP/opus-mt-{src}-{trg}'
+
+        model = MarianMTModel.from_pretrained(mname)
+        tok = MarianTokenizer.from_pretrained(mname)
+        batch = tok.prepare_translation_batch(src_texts=[sample_text])  # don't need tgt_text for inference
+        gen = model.generate(**batch)  # for forward pass: model(**batch)
+        words: List[str] = tok.batch_decode(gen, skip_special_tokens=True)  # returns "Where is the the bus stop ?"
+
+    """
+
+    pretrained_model_archive_map = {}  # see https://huggingface.co/models?search=Helsinki-NLP
+
+    def prepare_logits_for_generation(self, logits, cur_len, max_length):
+        logits[:, self.config.pad_token_id] = float("-inf")
        if cur_len == max_length - 1 and self.config.eos_token_id is not None:
-            self._force_token_ids_generation(scores, self.config.eos_token_id)
-        return scores
+            self._force_token_ids_generation(logits, self.config.eos_token_id)
+        return logits
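
Review note on the Marian hunk above: the pad logit is now set to `-inf` before `log_softmax` runs, so the remaining tokens are renormalized over the full probability mass. The sketch below illustrates that ordering with plain Python lists instead of torch tensors. The helper names, the toy logits, and the `PAD` index are assumptions made for illustration and are not part of the library; `normalize_then_mask` is shown only for contrast and is not what the previous code did.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a flat list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def mask_then_normalize(logits, pad_token_id):
    """Set the pad logit to -inf first, then normalize (the order used in the hunk)."""
    masked = list(logits)
    masked[pad_token_id] = float("-inf")
    return log_softmax(masked)


def normalize_then_mask(logits, pad_token_id):
    """Normalize first, then overwrite the pad score (contrast case only)."""
    scores = log_softmax(logits)
    scores[pad_token_id] = float("-inf")
    return scores


PAD = 0  # hypothetical pad_token_id for this toy vocabulary
toy_logits = [2.0, 1.0, 0.5]

before = [math.exp(s) for s in mask_then_normalize(toy_logits, PAD)]
after = [math.exp(s) for s in normalize_then_mask(toy_logits, PAD)]

print(sum(before))  # the non-pad tokens share the whole probability mass
print(sum(after))   # the mass that belonged to the pad token is lost
```

With the mask applied first, the scores passed on to beam search still form a proper distribution over the non-pad tokens.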
--- a/src/transformers/modeling_reformer.py
+++ b/src/transformers/modeling_reformer.py
@@ -36,7 +36,8 @@ from .modeling_utils import PreTrainedModel, apply_chunking_to_forward
 logger = logging.getLogger(__name__)

 REFORMER_PRETRAINED_MODEL_ARCHIVE_MAP = {
-    "google/reformer-crime-and-punishment": "https://cdn.huggingface.co/google/reformer-crime-and-punishment/pytorch_model.bin"
+    "google/reformer-crime-and-punishment": "https://cdn.huggingface.co/google/reformer-crime-and-punishment/pytorch_model.bin",
+    "google/reformer-enwik8": "https://cdn.huggingface.co/google/reformer-enwik8/pytorch_model.bin",
 }


@@ -561,8 +562,8 @@ class LSHSelfAttention(nn.Module, EfficientAttentionMixin):

        # get correct mask values depending on precision
        if query_key_dots.dtype == torch.float16:
-            self_mask_value = self.self_mask_value_float16
-            mask_value = self.mask_value_float16
+            self_mask_value = self.self_mask_value_float16.half()
+            mask_value = self.mask_value_float16.half()
        else:
            self_mask_value = self.self_mask_value_float32
            mask_value = self.mask_value_float32
@@ -833,7 +834,7 @@ class LocalSelfAttention(nn.Module, EfficientAttentionMixin):
        if mask is not None:
            # get mask tensor depending on half precision or not
            if query_key_dots.dtype == torch.float16:
-                mask_value = self.mask_value_float16
+                mask_value = self.mask_value_float16.half()
            else:
                mask_value = self.mask_value_float32

--- a/src/transformers/modeling_roberta.py
+++ b/src/transformers/modeling_roberta.py
@@ -643,7 +643,7 @@ class RobertaForQuestionAnswering(BertPreTrainedModel):
    Examples::

        # The checkpoint roberta-large is not fine-tuned for question answering. Please see the
-        # examples/run_squad.py example to see how to fine-tune a model to a question answering task.
+        # examples/question-answering/run_squad.py example to see how to fine-tune a model to a question answering task.

        from transformers import RobertaTokenizer, RobertaForQuestionAnswering
        import torch
--- a/src/transformers/modeling_tf_albert.py
+++ b/src/transformers/modeling_tf_albert.py
@@ -21,7 +21,7 @@ import logging
 import tensorflow as tf

 from .configuration_albert import AlbertConfig
-from .file_utils import add_start_docstrings, add_start_docstrings_to_callable
+from .file_utils import MULTIPLE_CHOICE_DUMMY_INPUTS, add_start_docstrings, add_start_docstrings_to_callable
 from .modeling_tf_bert import ACT2FN, TFBertSelfAttention
 from .modeling_tf_utils import TFPreTrainedModel, get_initializer, keras_serializable, shape_list
 from .tokenization_utils import BatchEncoding
@@ -475,7 +475,6 @@ class TFAlbertMLMHead(tf.keras.layers.Layer):
        hidden_states = self.activation(hidden_states)
        hidden_states = self.LayerNorm(hidden_states)
        hidden_states = self.decoder(hidden_states, mode="linear") + self.decoder_bias
-        hidden_states = hidden_states + self.bias
        return hidden_states


@@ -718,6 +717,73 @@ class TFAlbertModel(TFAlbertPreTrainedModel):
        return outputs


+@add_start_docstrings(
+    """Albert Model with two heads on top for pre-training:
+    a `masked language modeling` head and a `sentence order prediction` (classification) head. """,
+    ALBERT_START_DOCSTRING,
+)
+class TFAlbertForPreTraining(TFAlbertPreTrainedModel):
+    def __init__(self, config, *inputs, **kwargs):
+        super().__init__(config, *inputs, **kwargs)
+        self.num_labels = config.num_labels
+
+        self.albert = TFAlbertMainLayer(config, name="albert")
+        self.predictions = TFAlbertMLMHead(config, self.albert.embeddings, name="predictions")
+        self.sop_classifier = TFAlbertSOPHead(config, name="sop_classifier")
+
+    def get_output_embeddings(self):
+        return self.albert.embeddings
+
+    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)
+    def call(self, inputs, **kwargs):
+        r"""
+    Return:
+        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.AlbertConfig`) and inputs:
+        prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):
+            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
+        sop_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, 2)`):
+            Prediction scores of the sentence order prediction (classification) head (scores of True/False continuation before SoftMax).
+        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):
+            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
+            of shape :obj:`(batch_size, sequence_length, hidden_size)`.
+            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
+        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):
+            tuple of :obj:`tf.Tensor` (one for each layer) of shape
+            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:
+            Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
+    Examples::
+        import tensorflow as tf
+        from transformers import AlbertTokenizer, TFAlbertForPreTraining
+        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
+        model = TFAlbertForPreTraining.from_pretrained('albert-base-v2')
+        input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True))[None, :]  # Batch size 1
+        outputs = model(input_ids)
+        prediction_scores, sop_scores = outputs[:2]
+        """
+
+        outputs = self.albert(inputs, **kwargs)
+        sequence_output, pooled_output = outputs[:2]
+        prediction_scores = self.predictions(sequence_output)
+        sop_scores = self.sop_classifier(pooled_output, training=kwargs.get("training", False))
+        outputs = (prediction_scores, sop_scores) + outputs[2:]
+        return outputs
+
+
+class TFAlbertSOPHead(tf.keras.layers.Layer):
+    def __init__(self, config, **kwargs):
+        super().__init__(**kwargs)
+
+        self.dropout = tf.keras.layers.Dropout(config.classifier_dropout_prob)
+        self.classifier = tf.keras.layers.Dense(
+            config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="classifier",
+        )
+
+    def call(self, pooled_output, training: bool):
+        dropout_pooled_output = self.dropout(pooled_output, training=training)
+        logits = self.classifier(dropout_pooled_output)
+        return logits
+
+
@add_start_docstrings("""Albert Model with a `language modeling` head on top. """, ALBERT_START_DOCSTRING)
 class TFAlbertForMaskedLM(TFAlbertPreTrainedModel):
    def __init__(self, config, *inputs, **kwargs):
@@ -865,7 +931,7 @@ class TFAlbertForQuestionAnswering(TFAlbertPreTrainedModel):
    Examples::

        # The checkpoint albert-base-v2 is not fine-tuned for question answering. Please see the
-        # examples/run_squad.py example to see how to fine-tune a model to a question answering task.
+        # examples/question-answering/run_squad.py example to see how to fine-tune a model to a question answering task.

        import tensorflow as tf
        from transformers import AlbertTokenizer, TFAlbertForQuestionAnswering
@@ -891,3 +957,127 @@ class TFAlbertForQuestionAnswering(TFAlbertPreTrainedModel):
        outputs = (start_logits, end_logits,) + outputs[2:]

        return outputs  # start_logits, end_logits, (hidden_states), (attentions)
+
+
+@add_start_docstrings(
+    """Albert Model with a multiple choice classification head on top (a linear layer on top of
+    the pooled output and a softmax) e.g. for RocStories/SWAG tasks. """,
+    ALBERT_START_DOCSTRING,
+)
+class TFAlbertForMultipleChoice(TFAlbertPreTrainedModel):
+    def __init__(self, config, *inputs, **kwargs):
+        super().__init__(config, *inputs, **kwargs)
+
+        self.albert = TFAlbertMainLayer(config, name="albert")
+        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)
+        self.classifier = tf.keras.layers.Dense(
+            1, kernel_initializer=get_initializer(config.initializer_range), name="classifier"
+        )
+
+    @property
+    def dummy_inputs(self):
+        """ Dummy inputs to build the network.
+
+        Returns:
+            tf.Tensor with dummy inputs
+        """
+        return {"input_ids": tf.constant(MULTIPLE_CHOICE_DUMMY_INPUTS)}
+
+    @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING)
+    def call(
+        self,
+        inputs,
+        attention_mask=None,
+        token_type_ids=None,
+        position_ids=None,
+        head_mask=None,
+        inputs_embeds=None,
+        training=False,
+    ):
+        r"""
+    Return:
+        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.AlbertConfig`) and inputs:
+        classification_scores (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, num_choices)`):
+            `num_choices` is the size of the second dimension of the input tensors. (see `input_ids` above).
+
+            Classification scores (before SoftMax).
+        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`):
+            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
+            of shape :obj:`(batch_size, sequence_length, hidden_size)`.
+
+            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
+        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):
+            tuple of :obj:`tf.Tensor` (one for each layer) of shape
+            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:
+
+            Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
+
+    Examples::
+
+        import tensorflow as tf
+        from transformers import AlbertTokenizer, TFAlbertForMultipleChoice
+
+        tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
+        model = TFAlbertForMultipleChoice.from_pretrained('albert-base-v2')
+
+        example1 = ["This is a context", "Is it a context? Yes"]
+        example2 = ["This is a context", "Is it a context? No"]
+        encoding = tokenizer.batch_encode_plus([example1, example2], return_tensors='tf', truncation_strategy="only_first", pad_to_max_length=True, max_length=128)
+        outputs = model(encoding["input_ids"][None, :])
+        logits = outputs[0]
+
+        """
+        if isinstance(inputs, (tuple, list)):
+            input_ids = inputs[0]
+            attention_mask = inputs[1] if len(inputs) > 1 else attention_mask
+            token_type_ids = inputs[2] if len(inputs) > 2 else token_type_ids
+            position_ids = inputs[3] if len(inputs) > 3 else position_ids
+            head_mask = inputs[4] if len(inputs) > 4 else head_mask
+            inputs_embeds = inputs[5] if len(inputs) > 5 else inputs_embeds
+            assert len(inputs) <= 6, "Too many inputs."
+        elif isinstance(inputs, dict):
+            print("isdict(1)")
+            input_ids = inputs.get("input_ids")
+            print(input_ids)
+
+            attention_mask = inputs.get("attention_mask", attention_mask)
+            token_type_ids = inputs.get("token_type_ids", token_type_ids)
+            position_ids = inputs.get("position_ids", position_ids)
+            head_mask = inputs.get("head_mask", head_mask)
+            inputs_embeds = inputs.get("inputs_embeds", inputs_embeds)
+            assert len(inputs) <= 6, "Too many inputs."
+        else:
+            input_ids = inputs
+
+        if input_ids is not None:
+            num_choices = shape_list(input_ids)[1]
+            seq_length = shape_list(input_ids)[2]
+        else:
+            num_choices = shape_list(inputs_embeds)[1]
+            seq_length = shape_list(inputs_embeds)[2]
+
+        flat_input_ids = tf.reshape(input_ids, (-1, seq_length)) if input_ids is not None else None
+        flat_attention_mask = tf.reshape(attention_mask, (-1, seq_length)) if attention_mask is not None else None
+        flat_token_type_ids = tf.reshape(token_type_ids, (-1, seq_length)) if token_type_ids is not None else None
+        flat_position_ids = tf.reshape(position_ids, (-1, seq_length)) if position_ids is not None else None
+
+        flat_inputs = [
+            flat_input_ids,
+            flat_attention_mask,
+            flat_token_type_ids,
+            flat_position_ids,
+            head_mask,
+            inputs_embeds,
+        ]
+
+        outputs = self.albert(flat_inputs, training=training)
+
+        pooled_output = outputs[1]
+
+        pooled_output = self.dropout(pooled_output, training=training)
+        logits = self.classifier(pooled_output)
+        reshaped_logits = tf.reshape(logits, (-1, num_choices))
+
+        outputs = (reshaped_logits,) + outputs[2:]  # add hidden states and attention if they are here
+
+        return outputs  # reshaped_logits, (hidden_states), (attentions)
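
Review note on `TFAlbertForMultipleChoice.call` above: the method folds the choice dimension into the batch dimension before the encoder runs, then unfolds the per-sequence scores back to `(batch_size, num_choices)`. The sketch below reproduces only that reshaping logic with nested Python lists. `toy_scorer` is a hypothetical stand-in for the encoder plus the `Dense(1)` classifier, and the helper names are illustrative, not library functions.

```python
def flatten_choices(input_ids):
    """(batch, num_choices, seq_len) -> (batch * num_choices, seq_len)."""
    return [choice for example in input_ids for choice in example]


def unflatten_scores(flat_scores, num_choices):
    """(batch * num_choices,) -> (batch, num_choices)."""
    return [flat_scores[i:i + num_choices] for i in range(0, len(flat_scores), num_choices)]


def toy_scorer(token_ids):
    """Hypothetical stand-in for encoder + Dense(1): one score per sequence."""
    return float(sum(token_ids))


# A batch of 2 examples, 2 choices each, sequence length 3.
input_ids = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [1, 1, 1]],
]
num_choices = len(input_ids[0])

flat = flatten_choices(input_ids)                # 4 sequences of length 3
flat_scores = [toy_scorer(seq) for seq in flat]  # one score per flattened sequence
logits = unflatten_scores(flat_scores, num_choices)
print(logits)  # [[6.0, 15.0], [24.0, 3.0]]
```

Each row of `logits` then holds one score per choice, so a softmax over the last axis picks the answer for that example.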
--- a/src/transformers/modeling_tf_auto.py
+++ b/src/transformers/modeling_tf_auto.py
@@ -36,6 +36,8 @@ from .configuration_utils import PretrainedConfig
 from .modeling_tf_albert import (
    TF_ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
    TFAlbertForMaskedLM,
+    TFAlbertForMultipleChoice,
+    TFAlbertForPreTraining,
    TFAlbertForQuestionAnswering,
    TFAlbertForSequenceClassification,
    TFAlbertModel,
@@ -43,6 +45,7 @@ from .modeling_tf_albert import (
 from .modeling_tf_bert import (
    TF_BERT_PRETRAINED_MODEL_ARCHIVE_MAP,
    TFBertForMaskedLM,
+    TFBertForMultipleChoice,
    TFBertForPreTraining,
    TFBertForQuestionAnswering,
    TFBertForSequenceClassification,
@@ -132,7 +135,7 @@ TF_MODEL_FOR_PRETRAINING_MAPPING = OrderedDict(
    [
        (T5Config, TFT5ForConditionalGeneration),
        (DistilBertConfig, TFDistilBertForMaskedLM),
-        (AlbertConfig, TFAlbertForMaskedLM),
+        (AlbertConfig, TFAlbertForPreTraining),
        (RobertaConfig, TFRobertaForMaskedLM),
        (BertConfig, TFBertForPreTraining),
        (OpenAIGPTConfig, TFOpenAIGPTLMHeadModel),
@@ -171,6 +174,10 @@ TF_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING = OrderedDict(
    ]
 )

+TF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING = OrderedDict(
+    [(BertConfig, TFBertForMultipleChoice), (AlbertConfig, TFAlbertForMultipleChoice)]
+)
+
 TF_MODEL_FOR_QUESTION_ANSWERING_MAPPING = OrderedDict(
    [
        (DistilBertConfig, TFDistilBertForQuestionAnswering),
@@ -412,7 +419,7 @@ class TFAutoModelForPreTraining(object):
        in the `pretrained_model_name_or_path` string (in the following order):
            - contains `t5`: :class:`~transformers.TFT5ModelWithLMHead` (T5 model)
            - contains `distilbert`: :class:`~transformers.TFDistilBertForMaskedLM` (DistilBERT model)
-            - contains `albert`: :class:`~transformers.TFAlbertForMaskedLM` (ALBERT model)
+            - contains `albert`: :class:`~transformers.TFAlbertForPreTraining` (ALBERT model)
            - contains `roberta`: :class:`~transformers.TFRobertaForMaskedLM` (RoBERTa model)
            - contains `bert`: :class:`~transformers.TFBertForPreTraining` (Bert model)
            - contains `openai-gpt`: :class:`~transformers.TFOpenAIGPTLMHeadModel` (OpenAI GPT model)
@@ -661,6 +668,153 @@ class TFAutoModelWithLMHead(object):
        )


+class TFAutoModelForMultipleChoice:
+    r"""
+        :class:`~transformers.TFAutoModelForMultipleChoice` is a generic model class
+        that will be instantiated as one of the multiple choice model classes of the library
+        when created with the `TFAutoModelForMultipleChoice.from_pretrained(pretrained_model_name_or_path)`
+        class method.
+
+        The `from_pretrained()` method takes care of returning the correct model class instance
+        based on the `model_type` property of the config object, or when it's missing,
+        falling back to using pattern matching on the `pretrained_model_name_or_path` string.
+
+        The model class to instantiate is selected as the first pattern matching
+        in the `pretrained_model_name_or_path` string (in the following order):
+            - contains `albert`: TFAlbertForMultipleChoice (Albert model)
+            - contains `bert`: TFBertForMultipleChoice (Bert model)
+
+        This class cannot be instantiated using `__init__()` (throws an error).
+    """
+
+    def __init__(self):
+        raise EnvironmentError(
+            "TFAutoModelForMultipleChoice is designed to be instantiated "
+            "using the `TFAutoModelForMultipleChoice.from_pretrained(pretrained_model_name_or_path)` or "
+            "`TFAutoModelForMultipleChoice.from_config(config)` methods."
+        )
+
+    @classmethod
+    def from_config(cls, config):
+        r""" Instantiates one of the multiple choice model classes of the library
+        from a configuration.
+
+            config: (`optional`) instance of a class derived from :class:`~transformers.PretrainedConfig`:
+                The model class to instantiate is selected based on the configuration class:
+                    - isInstance of `albert` configuration class: TFAlbertForMultipleChoice (Albert model)
+                    - isInstance of `bert` configuration class: TFBertForMultipleChoice (Bert model)
+
+        Examples::
+
+            config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.
+            model = TFAutoModelForMultipleChoice.from_config(config)  # E.g. model was saved using `save_pretrained('./test/saved_model/')`
+        """
+        for config_class, model_class in TF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING.items():
+            if isinstance(config, config_class):
+                return model_class(config)
+        raise ValueError(
+            "Unrecognized configuration class {} for this kind of TFAutoModel: {}.\n"
+            "Model type should be one of {}.".format(
+                config.__class__,
+                cls.__name__,
+                ", ".join(c.__name__ for c in TF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING.keys()),
+            )
+        )
+
+    @classmethod
+    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):
+        r""" Instantiates one of the multiple choice model classes of the library
+        from a pre-trained model configuration.
+
+        The `from_pretrained()` method takes care of returning the correct model class instance
+        based on the `model_type` property of the config object, or when it's missing,
+        falling back to using pattern matching on the `pretrained_model_name_or_path` string.
+
+        The model class to instantiate is selected as the first pattern matching
+        in the `pretrained_model_name_or_path` string (in the following order):
+            - contains `albert`: TFAlbertForMultipleChoice (Albert model)
+            - contains `bert`: TFBertForMultipleChoice (Bert model)
+
+        The model is set in evaluation mode by default using `model.eval()` (Dropout modules are deactivated)
+        To train the model, you should first set it back in training mode with `model.train()`
+
+        Params:
+            pretrained_model_name_or_path: either:
+
+                - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.
+                - a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``.
+                - a path to a `directory` containing model weights saved using :func:`~transformers.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.
+                - a path or url to a `PyTorch, TF 1.X or TF 2.0 checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In the case of a PyTorch checkpoint, ``from_pt`` should be set to True and a configuration object should be provided as ``config`` argument.
+
+            from_pt: (`Optional`) Boolean
+                Set to True if the Checkpoint is a PyTorch checkpoint.
+
+            model_args: (`optional`) Sequence of positional arguments:
+                All remaining positional arguments will be passed to the underlying model's ``__init__`` method
+
+            config: (`optional`) instance of a class derived from :class:`~transformers.PretrainedConfig`:
+                Configuration for the model to use instead of an automatically loaded configuration. Configuration can be automatically loaded when:
+
+                - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or
+                - the model was saved using :func:`~transformers.PreTrainedModel.save_pretrained` and is reloaded by supplying the save directory.
+                - the model is loaded by supplying a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory.
+
+            state_dict: (`optional`) dict:
+                an optional state dictionary for the model to use instead of a state dictionary loaded from a saved weights file.
+                This option can be used if you want to create a model from a pretrained configuration but load your own weights.
+                In this case though, you should check if using :func:`~transformers.PreTrainedModel.save_pretrained` and :func:`~transformers.PreTrainedModel.from_pretrained` is not a simpler option.
+
+            cache_dir: (`optional`) string:
+                Path to a directory in which a downloaded pre-trained model
+                configuration should be cached if the standard cache should not be used.
+
+            force_download: (`optional`) boolean, default False:
+                Force to (re-)download the model weights and configuration files and override the cached versions if they exist.
+
+            resume_download: (`optional`) boolean, default False:
+                Do not delete an incompletely received file. Attempt to resume the download if such a file exists.
+
+            proxies: (`optional`) dict, default None:
+                A dictionary of proxy servers to use by protocol or endpoint, e.g.: {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.
+                The proxies are used on each request.
+
+            output_loading_info: (`optional`) boolean:
+                Set to ``True`` to also return a dictionary containing missing keys, unexpected keys and error messages.
+
+            kwargs: (`optional`) Remaining dictionary of keyword arguments:
+                Can be used to update the configuration object (after it is loaded) and to instantiate the model (e.g. ``output_attentions=True``). Behaves differently depending on whether a `config` is provided or automatically loaded:
+
+                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)
+                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~transformers.PretrainedConfig.from_pretrained`). Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.
+
+        Examples::
+
+            model = TFAutoModelForMultipleChoice.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.
+            model = TFAutoModelForMultipleChoice.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`
+            model = TFAutoModelForMultipleChoice.from_pretrained('bert-base-uncased', output_attentions=True)  # Update configuration during loading
+            assert model.config.output_attentions == True
+            # Loading from a PyTorch checkpoint file instead of a TensorFlow model (slower)
+            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')
+            model = TFAutoModelForMultipleChoice.from_pretrained('./pt_model/bert_pytorch_model.bin', from_pt=True, config=config)
+
+        """
+        config = kwargs.pop("config", None)
+        if not isinstance(config, PretrainedConfig):
+            config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
+
+        for config_class, model_class in TF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING.items():
+            if isinstance(config, config_class):
+                return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)
+        raise ValueError(
+            "Unrecognized configuration class {} for this kind of TFAutoModel: {}.\n"
+            "Model type should be one of {}.".format(
+                config.__class__,
+                cls.__name__,
+                ", ".join(c.__name__ for c in TF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING.keys()),
+            )
+        )
+
+
 class TFAutoModelForSequenceClassification(object):
    r"""
        :class:`~transformers.TFAutoModelForSequenceClassification` is a generic model class
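
Review note on `TFAutoModelForMultipleChoice` above: both `from_config` and `from_pretrained` walk an ordered mapping of config classes and return the model class of the first `isinstance` match. The sketch below shows that dispatch pattern in isolation. All class names prefixed with `Toy`, as well as `BaseConfig` and `model_from_config`, are hypothetical stand-ins and not the library's classes.

```python
from collections import OrderedDict


class BaseConfig:
    """Hypothetical stand-in for PretrainedConfig."""


class ToyBertConfig(BaseConfig):
    pass


class ToyAlbertConfig(BaseConfig):
    pass


class ToyBertForMultipleChoice:
    def __init__(self, config):
        self.config = config


class ToyAlbertForMultipleChoice:
    def __init__(self, config):
        self.config = config


# Same shape as TF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING: config class -> model class.
TOY_MULTIPLE_CHOICE_MAPPING = OrderedDict(
    [(ToyBertConfig, ToyBertForMultipleChoice), (ToyAlbertConfig, ToyAlbertForMultipleChoice)]
)


def model_from_config(config):
    """Return an instance of the first model class whose config class matches."""
    for config_class, model_class in TOY_MULTIPLE_CHOICE_MAPPING.items():
        if isinstance(config, config_class):
            return model_class(config)
    raise ValueError(
        "Unrecognized configuration class {}. Model type should be one of {}.".format(
            config.__class__.__name__,
            ", ".join(c.__name__ for c in TOY_MULTIPLE_CHOICE_MAPPING),
        )
    )


model = model_from_config(ToyAlbertConfig())
print(type(model).__name__)  # ToyAlbertForMultipleChoice
```

Because matching relies on `isinstance`, a config class that subclasses another entry's config class would match the parent's entry if the parent is listed first, so the order of the mapping matters in that case.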
--- a/src/transformers/modeling_tf_roberta.py
+++ b/src/transformers/modeling_tf_roberta.py
@@ -481,7 +481,7 @@ class TFRobertaForQuestionAnswering(TFRobertaPreTrainedModel):
    Examples::

        # The checkpoint roberta-base is not fine-tuned for question answering. Please see the
-        # examples/run_squad.py example to see how to fine-tune a model to a question answering task.
+        # examples/question-answering/run_squad.py example to see how to fine-tune a model to a question answering task.

        import tensorflow as tf
        from transformers import RobertaTokenizer, TFRobertaForQuestionAnswering
--- a/src/transformers/modeling_utils.py
+++ b/src/transformers/modeling_utils.py
@@ -744,8 +744,8 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
    def prepare_inputs_for_generation(self, input_ids, **kwargs):
        return {"input_ids": input_ids}

-    def prepare_scores_for_generation(self, scores, **kwargs):
-        return scores
+    def prepare_logits_for_generation(self, logits, **kwargs):
+        return logits

    def _use_cache(self, outputs, use_cache):
        """During generation, decide whether to pass the `past` variable to the next forward pass."""
@@ -857,7 +857,7 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
                ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
                Defaults to `None`.

-            `What are attention masks? <../glossary.html#attention-mask>`__
+                `What are attention masks? <../glossary.html#attention-mask>`__

            decoder_start_token_id=None: (`optional`) int
                If an encoder-decoder model starts decoding with a different token than BOS.
@@ -1342,10 +1342,13 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
            if temperature != 1.0:
                next_token_logits = next_token_logits / temperature

-            scores = F.log_softmax(next_token_logits, dim=-1)  # (batch_size * num_beams, vocab_size)
            if self.config.is_encoder_decoder and do_sample is False:
-                # TODO (PVP) still a bit hacky here - there might be a better solutino
-                scores = self.prepare_scores_for_generation(scores, cur_len=cur_len, max_length=max_length)
+                # TODO (PVP) still a bit hacky here - there might be a better solution
+                next_token_logits = self.prepare_logits_for_generation(
+                    next_token_logits, cur_len=cur_len, max_length=max_length
+                )
+
+            scores = F.log_softmax(next_token_logits, dim=-1)  # (batch_size * num_beams, vocab_size)

            # set eos token prob to zero if min_length is not reached
            if eos_token_id is not None and cur_len < min_length:
--- a/src/transformers/optimization_tf.py
+++ b/src/transformers/optimization_tf.py
@@ -204,7 +204,10 @@ class GradientAccumulator(object):
        """Number of accumulated steps."""
        if self._accum_steps is None:
            self._accum_steps = tf.Variable(
-                tf.constant(0, dtype=tf.int64), trainable=False, synchronization=tf.VariableSynchronization.ON_READ,
+                tf.constant(0, dtype=tf.int64),
+                trainable=False,
+                synchronization=tf.VariableSynchronization.ON_READ,
+                aggregation=tf.VariableAggregation.ONLY_FIRST_REPLICA,
            )

        return self._accum_steps.value()
@@ -223,7 +226,10 @@ class GradientAccumulator(object):
            self._gradients.extend(
                [
                    tf.Variable(
-                        tf.zeros_like(gradient), trainable=False, synchronization=tf.VariableSynchronization.ON_READ,
+                        tf.zeros_like(gradient),
+                        trainable=False,
+                        synchronization=tf.VariableSynchronization.ON_READ,
+                        aggregation=tf.VariableAggregation.ONLY_FIRST_REPLICA,
                    )
                    for gradient in gradients
                ]
Author	SHA1	Message	Date
Lysandre	7cb203fae4	Release: v2.9.1	2020-05-13 17:38:50 -04:00
Sam Shleifer	9a687ebb77	[Marian Fixes] prevent predicting pad_token_id before softmax, support language codes, name multilingual models (#4290 )	2020-05-13 17:29:41 -04:00
Patrick von Platen	839bfaedb2	[Docs, Notebook] Include generation pipeline (#4295 ) * add first text for generation * add generation pipeline to usage * Created using Colaboratory * correct docstring * finish	2020-05-13 14:24:08 -04:00
Elyes Manai	2d184cb553	wrong variable name used (#4328 )	2020-05-13 10:22:03 -04:00
Julien Plu	ca13618681	Question Answering for TF trainer (#4320 ) * Add QA trainer example for TF * Make data_dir optional * Fix parameter logic * Fix feature convert * Update the READMEs to add the question-answering task * Apply style * Change 'sequence-classification' to 'text-classification' and prefix with 'eval' all the metric names * Apply style * Apply style	2020-05-13 09:22:31 -04:00
Denis	1e51bb717c	Fix for #3865 . PretrainedTokenizer mapped " do not" into " don't" when .decode(...) is called. Removed the " do not" --> " don't" mapping from clean_up_tokenization(...). (#4024 )	2020-05-13 14:32:57 +02:00
Julien Chaumond	241759101e	(v2) Improvements to the wandb integration (#4324 ) * Improvements to the wandb integration * small reorg + no global necessary * feat(trainer): log epoch and final metrics * Simplify logging a bit * Fixup * Fix crash when just running eval Co-authored-by: Chris Van Pelt <vanpelt@gmail.com> Co-authored-by: Boris Dayma <boris.dayma@gmail.com>	2020-05-12 21:52:01 -04:00
Funtowicz Morgan	7d7fe4997f	Allow BatchEncoding to be initialized empty. (#4316 ) * Allow BatchEncoding to be initialized empty. This is required by recent changes introduced in TF 2.2. * Attempt to unpin Tensorflow to 2.2 with the previous commit.	2020-05-12 15:02:46 -04:00
Savaş Yıldırım	0a97f6312a	Update README.md (#4313 )	2020-05-12 15:01:45 -04:00
Savaş Yıldırım	15a121fec5	Update README.md (#4315 )	2020-05-12 15:01:34 -04:00
Stefan Schweter	15d45211f7	[model_cards]: 🇹🇷 Add new ELECTRA small and base models for Turkish (#4318 )	2020-05-12 15:01:17 -04:00
Viktor Alm	8a017cbb5a	Add modelcard with acknowledgements (#4321 )	2020-05-12 15:00:56 -04:00
Julien Chaumond	4bf5042240	Fix BART tests on GPU (#4298 )	2020-05-12 09:11:50 -04:00
Viktor Alm	e4512aab3b	Add MultipleChoice to TFTrainer [WIP] (#4270 ) * catch gpu len 1 set to gpu0 * Add mpc to trainer * Add MPC for TF * fix TF automodel for MPC and add Albert * Apply style * Fix import * Note to self: double check * Make shape None, None for datasetgenerator output shapes * Add from_pt bool which doesn't seem to work * Original checkpoint dir * Fix docstrings for automodel * Update readme and apply style * Colab should probably not be from users * Colabs should probably not be from users * Add colab * Update README.md * Update README.md * Cleanup __init__ * Cleanup flake8 trailing comma * Update src/transformers/training_args_tf.py * Update src/transformers/modeling_tf_auto.py Co-authored-by: Viktor Alm <viktoralm@pop-os.localdomain> Co-authored-by: Julien Chaumond <chaumond@gmail.com>	2020-05-12 08:48:48 -04:00
Levent Serinol	65be574aec	fixed missing torch module import (#4305 ) fixed missing torch module import in example usage code	2020-05-12 08:34:17 -04:00
Jangwon Park	31e67dd19f	Remove hard-coded pad token id in distilbert and albert (#3965 )	2020-05-12 08:32:44 -04:00
Lysandre Debut	30e343862f	pin TF to 2.1 (#4297 ) * pin TF to 2.1 * Pin flake8 as well	2020-05-11 21:03:30 -04:00
Julien Chaumond	56e8ef632f	[ci] Restrict GPU tests to actual code commits	2020-05-11 20:40:41 -04:00
Julien Chaumond	ba6f6e44a8	[ci] Re-enable torch GPU tests	2020-05-12 00:05:36 +00:00
Lysandre Debut	9524956819	Documentation specification (#4294 )	2020-05-11 16:43:57 -04:00
Bram Vanroy	61d22f9cc7	Simplify cache vars and allow for TRANSFORMERS_CACHE env (#4226 ) * simplify cache vars and allow for TRANSFORMERS_CACHE env As it currently stands, "TRANSFORMERS_CACHE" is not an accepted variable. It seems that these variables were not updated when moving from version pytorch_transformers to transformers. In addition, the fallback procedure could be improved and simplified. Pathlib seems redundant here. * Update file_utils.py	2020-05-11 15:24:02 -04:00
Lysandre Debut	cd40cb8879	Fix special token doc (#4292 )	2020-05-11 15:05:36 -04:00
Tianlei Wu	82601f4c1a	Allow gpt2 to be exported to valid ONNX (#4244 ) * allow gpt2 to be exported to valid ONNX model * cast size from int to float explicitly	2020-05-11 14:55:55 -04:00
Guo, Quan	39994051e4	Add migrating from `pytorch-transformers` (#4273 ) "Migrating from pytorch-transformers to transformers" is missing in the main document. It is available in the main `readme` though. Just move it to the document.	2020-05-11 13:35:13 -04:00
Lysandre Debut	051dcb2a07	CamemBERT does not make use of Token Type IDs (#4289 )	2020-05-11 13:31:03 -04:00
fgaim	41e8291217	Add ALBERT to the Tensorflow to Pytorch model conversion cli (#3933 ) * Add ALBERT to convert command of transformers-cli * Document ALBERT tf to pytorch model conversion	2020-05-11 13:10:00 -04:00
Stefan Schweter	3f42eb979f	Documentation: fix links to NER examples (#4279 ) * docs: fix link to token classification (NER) example * examples: fix links to NER scripts	2020-05-11 12:48:21 -04:00
Funtowicz Morgan	8fdb7997c6	Align sentiment-analysis' tokenizer (currently uncased) to the model (uncased). (#4264 )	2020-05-11 12:45:53 -04:00
Sam Shleifer	4658896ee1	[Marian] Fix typo in docstring (#4284 )	2020-05-11 11:47:51 -04:00
Levent Serinol	bf64b8cf09	Model card for bert-turkish-question-answering question-answering model (#4281 ) * Create README.md * Update model_cards/lserinol/bert-turkish-question-answering/README.md Co-authored-by: Julien Chaumond <chaumond@gmail.com>	2020-05-11 11:32:25 -04:00
Julien Plu	94b57bf796	[TF 2.2 compat] use tf.VariableAggregation.ONLY_FIRST_REPLICA (#4283 ) * Fix the issue to properly run the accumulator with TF 2.2 * Apply style * Fix training_args_tf for TF 2.2 * Fix the TF training args when only one GPU is available * Remove the fixed version of TF in setup.py	2020-05-11 11:28:37 -04:00
Savaş Yıldırım	cffbb3d8ed	Update README.md (#4276 )	2020-05-11 11:24:41 -04:00
Julien Plu	5f50d619dd	Fix XTREME link + add number of eval documents + fix usage code (#4280 )	2020-05-11 11:24:10 -04:00
theblackcat102	7751be7cee	fix reformer apex scaling issue (#4242 )	2020-05-11 16:53:42 +02:00
Patrick von Platen	ac7d5f67a2	[Reformer] Add Enwiki8 Reformer Model - Adapt convert script (#4282 ) * adapt convert script * update convert script * finish * fix marian pretrained docs	2020-05-11 16:38:07 +02:00
Patrick von Platen	336116d960	Reformer enwik8 - Model card (#4286 )	2020-05-11 16:22:08 +02:00
flozi00	b290c32e16	[docs] fix typo (#4249 )	2020-05-10 14:07:08 -04:00
Sam Shleifer	3487be75ef	[Marian] documentation and AutoModel support (#4152 ) - MarianSentencepieceTokenizer - > MarianTokenizer - Start using unk token. - add docs page - add better generation params to MarianConfig - more conversion utilities	2020-05-10 13:54:57 -04:00
Girishkumar	9d2f467bfb	[README] Corrected some grammatical mistakes (#4199 )	2020-05-10 09:02:36 -04:00
Julien Chaumond	7b75aa9fa5	[TPU] Doc, fix xla_spawn.py, only preprocess dataset once (#4223 ) * [TPU] Doc, fix xla_spawn.py, only preprocess dataset once * Update examples/README.md * [xla_spawn] Add `_mp_fn` to other Trainer scripts * [TPU] Fix: eval dataloader was None	2020-05-08 14:10:05 -04:00
Julien Chaumond	274d850d34	Fix #4098	2020-05-08 12:39:46 -04:00
Lorenzo De Mattei	26dad0a9fa	example updated to use generation pipeline (#4230 ) * example updated to use generation pipeline * Update model_cards/LorenzoDeMattei/GePpeTto/README.md Co-authored-by: Julien Chaumond <chaumond@gmail.com>	2020-05-08 09:45:10 -04:00
rmroczkowski	9ebb5b2a54	Model card for allegro/herbert-klej-cased-tokenizer-v1 (#4184 )	2020-05-08 09:42:43 -04:00
rmroczkowski	9e54efd004	Model card for allegro/herbert-klej-cased-v1 (#4183 )	2020-05-08 09:42:28 -04:00
Manuel Romero	a8b798e6c4	Model card for spanish electra small (#4196 )	2020-05-08 09:30:15 -04:00
Savaş Yıldırım	242005d762	Create README.md (#4132 ) * Create README.md * Adding code fence around code block	2020-05-08 09:27:29 -04:00
Manuel Romero	5940c73bbb	Create README.md (#4179 ) model card for my De Novo Drug discovery model using MLM	2020-05-08 09:25:36 -04:00
Patrick von Platen	cf08830c28	[Pipeline, Generation] tf generation pipeline bug (#4217 ) * fix PR * move tests to correct place	2020-05-08 08:30:05 -04:00
Jared T Nielsen	8bf7312654	Add AlbertForPreTraining and TFAlbertForPreTraining models. (#4057 ) * Add AlbertForPreTraining and TFAlbertForPreTraining models. * PyTorch conversion * TensorFlow conversion * style Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>	2020-05-07 19:44:51 -04:00
Julien Chaumond	c99fe0386b	[doc] Fix broken links + remove crazy big notebook	2020-05-07 18:44:18 -04:00
Savaş Yıldırım	66113bd626	Create README.md (#4202 )	2020-05-07 18:31:22 -04:00
Julien Chaumond	6669915b65	[examples] Add column for pytorch-lightning support	2020-05-07 15:26:58 -04:00
Julien Chaumond	612fa1b10b	Examples readme.md (#4215 ) * README * Update README.md	2020-05-07 15:00:06 -04:00
Lysandre	2e57824374	Pin isort and tf <= 2.1.0	2020-05-07 14:42:00 -04:00