Test correct tokenizers after default switch (#3003)

False by default (#3002)
Release: v2.5.1
2020-02-24 18:45:53 -05:00 · 2020-02-24 18:30:57 -05:00 · 2020-02-24 18:22:54 -05:00 · 2020-02-24 18:20:42 -05:00 · 2020-02-24 17:50:24 -05:00 · 2020-02-24 15:42:38 -05:00
150 changed files with 9343 additions and 1102 deletions
--- a/.circleci/config.yml
+++ b/.circleci/config.yml
@@ -10,7 +10,7 @@ jobs:
        parallelism: 1
        steps:
            - checkout
-            - run: sudo pip install .[sklearn,tf,torch,testing]
+            - run: sudo pip install .[sklearn,tf-cpu,torch,testing]
            - run: sudo pip install codecov pytest-cov
            - run: python -m pytest -n 8 --dist=loadfile -s -v ./tests/ --cov
            - run: codecov
@@ -26,8 +26,10 @@ jobs:
        parallelism: 1
        steps:
            - checkout
-            - run: sudo pip install .[mecab,sklearn,tf,torch,testing]
+            - run: sudo pip install .[mecab,sklearn,tf-cpu,torch,testing]
            - run: python -m pytest -n 8 --dist=loadfile -s -v ./tests/
+            - no_output_timeout: 4h
+
    run_tests_torch:
        working_directory: ~/transformers
        docker:
@@ -52,7 +54,7 @@ jobs:
        parallelism: 1
        steps:
            - checkout
-            - run: sudo pip install .[sklearn,tf,testing]
+            - run: sudo pip install .[sklearn,tf-cpu,testing]
            - run: sudo pip install codecov pytest-cov
            - run: python -m pytest -n 8 --dist=loadfile -s -v ./tests/ --cov
            - run: codecov
--- a/.circleci/deploy.sh
+++ b/.circleci/deploy.sh
@@ -25,4 +25,5 @@ deploy_doc "fc9faa8" v2.0.0
 deploy_doc "3ddce1d" v2.1.1
 deploy_doc "3616209" v2.2.0
 deploy_doc "d0f8b9a" v2.3.0
-deploy_doc "6664ea9" v2.4.0
+deploy_doc "6664ea9" v2.4.0
+deploy_doc "fb560dc" v2.5.0
--- a/.github/ISSUE_TEMPLATE/bug-report.md
+++ b/.github/ISSUE_TEMPLATE/bug-report.md
@@ -39,12 +39,14 @@ Steps to reproduce the behavior:

 <!-- A clear and concise description of what you would expect to happen. -->

-## Environment
-
-* OS:
-* Python version:
-* PyTorch version:
-* `transformers` version (or branch):
-* Using GPU ?
-* Distributed or parallel setup ?
-* Any other relevant information:
+## Environment info
+<!-- You can run the command `python transformers-cli env` and copy-and-paste its output below.
+     Don't forget to fill out the missing fields in that output! -->
+     
+- `transformers` version:
+- Platform:
+- Python version:
+- PyTorch version (GPU?):
+- Tensorflow version (GPU?):
+- Using GPU in script?:
+- Using distributed or parallel set-up in script?:
--- a/.github/ISSUE_TEMPLATE/migration.md
+++ b/.github/ISSUE_TEMPLATE/migration.md
@@ -33,16 +33,21 @@ The tasks I am working on is:
    Do not use screenshots, as they are hard to read and (more importantly) don't allow others to copy-and-paste your code.
    -->

-## Environment
+## Environment info
+<!-- You can run the command `python transformers-cli env` and copy-and-paste its output below.
+     Don't forget to fill out the missing fields in that output! -->
+ 
+- `transformers` version:
+- Platform:
+- Python version:
+- PyTorch version (GPU?):
+- Tensorflow version (GPU?):
+- Using GPU in script?:
+- Using distributed or parallel set-up in script?:

-* OS:
-* Python version:
-* PyTorch version:
+<!-- IMPORTANT: which version of the former library do you use? -->
 * `pytorch-transformers` or `pytorch-pretrained-bert` version (or branch):
-* `transformers` version (or branch):
-* Using GPU?
-* Distributed or parallel setup?
-* Any other relevant information:
+

 ## Checklist

--- a/.gitignore
+++ b/.gitignore
@@ -139,3 +139,9 @@ serialization_dir
 # emacs
 *.*~
 debug.env
+
+# vim
+.*.swp
+
+#ctags
+tags
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -41,14 +41,10 @@ Did not find it? :( So we can act quickly on it, please follow these steps:
  less than 30s;
 * Provide the *full* traceback if an exception is raised.

-To get the OS and software versions, execute the following code and copy-paste
-the output:
+To get the OS and software versions automatically, you can run the following command:

-```
-import platform; print("Platform", platform.platform())
-import sys; print("Python", sys.version)
-import torch; print("PyTorch", torch.__version__)
-import tensorflow; print("Tensorflow", tensorflow.__version__)
+```bash
+python transformers-cli env
 ```

 ### Do you want to implement a new model?
@@ -202,11 +198,13 @@ Follow these steps to start contributing:
 3. To indicate a work in progress please prefix the title with `[WIP]`. These
   are useful to avoid duplicated work, and to differentiate it from PRs ready
   to be merged;
-4. Make sure pre-existing tests still pass;
-5. Add high-coverage tests. No quality test, no merge;
+4. Make sure existing tests pass;
+5. Add high-coverage tests. No quality test, no merge. 
+ - If you are adding a new model, make sure that you use `ModelTester.all_model_classes = (MyModel, MyModelWithLMHead,...)`, which triggers the common tests.
+ - If you are adding new `@slow` tests, make sure they pass using `RUN_SLOW=1 python -m pytest tests/test_my_new_model.py`. 
+CircleCI does not run them. 
 6. All public methods must have informative docstrings;

-
 ### Tests

 You can run 🤗 Transformers tests with `unittest` or `pytest`.
--- a/README.md
+++ b/README.md
@@ -24,6 +24,8 @@

 🤗 Transformers (formerly known as `pytorch-transformers` and `pytorch-pretrained-bert`) provides state-of-the-art general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet, CTRL...) for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch.

+[![](https://sourcerer.io/fame/clmnt/huggingface/transformers/images/0)](https://sourcerer.io/fame/clmnt/huggingface/transformers/links/0)[![](https://sourcerer.io/fame/clmnt/huggingface/transformers/images/1)](https://sourcerer.io/fame/clmnt/huggingface/transformers/links/1)[![](https://sourcerer.io/fame/clmnt/huggingface/transformers/images/2)](https://sourcerer.io/fame/clmnt/huggingface/transformers/links/2)[![](https://sourcerer.io/fame/clmnt/huggingface/transformers/images/3)](https://sourcerer.io/fame/clmnt/huggingface/transformers/links/3)[![](https://sourcerer.io/fame/clmnt/huggingface/transformers/images/4)](https://sourcerer.io/fame/clmnt/huggingface/transformers/links/4)[![](https://sourcerer.io/fame/clmnt/huggingface/transformers/images/5)](https://sourcerer.io/fame/clmnt/huggingface/transformers/links/5)[![](https://sourcerer.io/fame/clmnt/huggingface/transformers/images/6)](https://sourcerer.io/fame/clmnt/huggingface/transformers/links/6)[![](https://sourcerer.io/fame/clmnt/huggingface/transformers/images/7)](https://sourcerer.io/fame/clmnt/huggingface/transformers/links/7)
+
 ### Features

 - As easy to use as pytorch-transformers
@@ -60,7 +62,7 @@ Choose the right framework for every part of a model's lifetime
 | [Quick tour: Share your models ](#Quick-tour-of-model-sharing) | Upload and share your fine-tuned models with the community |
 | [Migrating from pytorch-transformers to transformers](#Migrating-from-pytorch-transformers-to-transformers) | Migrating your code from pytorch-transformers to transformers |
 | [Migrating from pytorch-pretrained-bert to pytorch-transformers](#Migrating-from-pytorch-pretrained-bert-to-transformers) | Migrating your code from pytorch-pretrained-bert to transformers |
-| [Documentation][(v2.4.0)](https://huggingface.co/transformers/v2.4.0)[(v2.3.0)](https://huggingface.co/transformers/v2.3.0)[(v2.2.0/v2.2.1/v2.2.2)](https://huggingface.co/transformers/v2.2.0) [(v2.1.1)](https://huggingface.co/transformers/v2.1.1) [(v2.0.0)](https://huggingface.co/transformers/v2.0.0) [(v1.2.0)](https://huggingface.co/transformers/v1.2.0) [(v1.1.0)](https://huggingface.co/transformers/v1.1.0) [(v1.0.0)](https://huggingface.co/transformers/v1.0.0) [(master)](https://huggingface.co/transformers) | Full API documentation and more |
+| [Documentation][(v2.5.0)](https://huggingface.co/transformers/v2.5.0)[(v2.4.0/v2.4.1)](https://huggingface.co/transformers/v2.4.0)[(v2.3.0)](https://huggingface.co/transformers/v2.3.0)[(v2.2.0/v2.2.1/v2.2.2)](https://huggingface.co/transformers/v2.2.0) [(v2.1.1)](https://huggingface.co/transformers/v2.1.1) [(v2.0.0)](https://huggingface.co/transformers/v2.0.0) [(v1.2.0)](https://huggingface.co/transformers/v1.2.0) [(v1.1.0)](https://huggingface.co/transformers/v1.1.0) [(v1.0.0)](https://huggingface.co/transformers/v1.0.0) [(master)](https://huggingface.co/transformers) | Full API documentation and more |

 ## Installation

@@ -193,7 +195,7 @@ MODELS = [(BertModel,       BertTokenizer,       'bert-base-uncased'),
          (TransfoXLModel,  TransfoXLTokenizer,  'transfo-xl-wt103'),
          (XLNetModel,      XLNetTokenizer,      'xlnet-base-cased'),
          (XLMModel,        XLMTokenizer,        'xlm-mlm-enfr-1024'),
-          (DistilBertModel, DistilBertTokenizer, 'distilbert-base-uncased'),
+          (DistilBertModel, DistilBertTokenizer, 'distilbert-base-cased'),
          (RobertaModel,    RobertaTokenizer,    'roberta-base'),
          (XLMRobertaModel, XLMRobertaTokenizer, 'xlm-roberta-base'),
         ]
@@ -493,19 +495,22 @@ Your model will then be accessible through its identifier, a concatenation of yo
 "username/pretrained_model"
 ```

+**Please add a README.md model card** to the repo under `model_cards/` with: model description, training params (dataset, preprocessing, hyperparameters), evaluation results, intended uses & limitations, etc.
+
+Your model now has a page on huggingface.co/models 🔥
+
 Anyone can load it from code:
 ```python
 tokenizer = AutoTokenizer.from_pretrained("username/pretrained_model")
 model = AutoModel.from_pretrained("username/pretrained_model")
 ```

-Finally, list all your files on S3:
+List all your files on S3:
 ```shell
 transformers-cli s3 ls
-# List all your S3 objects.
 ```

-You can also delete files:
+You can also delete unneeded files:

 ```shell
 transformers-cli s3 rm …
@@ -673,7 +678,7 @@ for batch in train_data:
 ## Citation

 We now have a paper you can cite for the 🤗 Transformers library:
-```
+```bibtex
@article{Wolf2019HuggingFacesTS,
  title={HuggingFace's Transformers: State-of-the-art Natural Language Processing},
  author={Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and R'emi Louf and Morgan Funtowicz and Jamie Brew},
--- a/docs/source/_static/css/huggingface.css
+++ b/docs/source/_static/css/huggingface.css
@@ -194,3 +194,41 @@ h2, .rst-content .toctree-wrapper p.caption, h3, h4, h5, h6, legend{
    src: url(./Calibre-Thin.otf);
    font-weight:400;
 }
+
+
+/**
+ * Nav Links to other parts of huggingface.co
+ */
+ div.menu {
+    position: absolute;
+    top: 0;
+    right: 0;
+    padding-top: 20px;
+    padding-right: 20px;
+    z-index: 1000;
+}
+div.menu a {
+    font-size: 14px;
+    letter-spacing: 0.3px;
+    text-transform: uppercase;
+    color: white;
+    -webkit-font-smoothing: antialiased;
+    background: linear-gradient(0deg, #6671ffb8, #9a66ffb8 50%);
+    padding: 10px 16px 6px 16px;
+    border-radius: 3px;
+    margin-left: 12px;
+    position: relative;
+}
+div.menu a:active {
+    top: 1px;
+}
+@media (min-width: 768px) and (max-width: 1750px) {
+    .wy-breadcrumbs {
+        margin-top: 32px;
+    }
+}
+@media (max-width: 768px) {
+    div.menu {
+        display: none;
+    }
+}
--- a/docs/source/_static/js/custom.js
+++ b/docs/source/_static/js/custom.js
@@ -58,6 +58,16 @@ function addGithubButton() {
    document.querySelector(".wy-side-nav-search .icon-home").insertAdjacentHTML('afterend', div);
 }

+function addHfMenu() {
+    const div = `
+    <div class="menu">
+        <a href="/welcome">🔥 Sign in</a>
+        <a href="/models">🚀 Models</a>
+    </div>
+    `;
+    document.body.insertAdjacentHTML('afterbegin', div);
+}
+
 /*!
 * github-buttons v2.2.10
 * (c) 2019 なつき
@@ -74,6 +84,7 @@ function onLoad() {
    addCustomFooter();
    addGithubButton();
    parseGithubButtons();
+    addHfMenu();
 }

 window.addEventListener("load", onLoad);
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -26,7 +26,7 @@ author = u'huggingface'
 # The short X.Y version
 version = u''
 # The full version, including alpha/beta/rc tags
-release = u'2.4.1'
+release = u'2.5.1'


 # -- General configuration ---------------------------------------------------
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -99,4 +99,5 @@ The library currently contains PyTorch and Tensorflow implementations, pre-train
    model_doc/camembert
    model_doc/albert
    model_doc/xlmroberta
-    model_doc/flaubert
+    model_doc/flaubert
+    model_doc/bart
--- a/docs/source/main_classes/processors.rst
+++ b/docs/source/main_classes/processors.rst
@@ -63,7 +63,7 @@ XNLI
 `The Cross-Lingual NLI Corpus (XNLI) <https://www.nyu.edu/projects/bowman/xnli/>`__ is a benchmark that evaluates
 the quality of cross-lingual text representations. 
 XNLI is crowd-sourced dataset based on `MultiNLI <http://www.nyu.edu/projects/bowman/multinli/>`: pairs of text are labeled with textual entailment 
-annotations for 15 different languages (including both high-ressource language such as English and low-ressource languages such as Swahili).
+annotations for 15 different languages (including both high-resource language such as English and low-resource languages such as Swahili).

 It was released together with the paper
 `XNLI: Evaluating Cross-lingual Sentence Representations <https://arxiv.org/abs/1809.05053>`__
--- a/docs/source/model_doc/bart.rst
+++ b/docs/source/model_doc/bart.rst
@@ -0,0 +1,52 @@
+Bart
+----------------------------------------------------
+**DISCLAIMER:** This model is still a work in progress, if you see something strange,
+file a `Github Issue <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ and assign
+@sshleifer
+
+The Bart model was `proposed <https://arxiv.org/abs/1910.13461>`_ by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, Luke Zettlemoyer on 29 Oct, 2019.
+It is a sequence to sequence model where both encoder and decoder are transformers. The paper also introduces a novel pretraining objective, and demonstrates excellent summarization results.
+The authors released their code `here <https://github.com/pytorch/fairseq/tree/master/examples/bart>`_
+
+**Abstract:**
+
+*We present BART, a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard Tranformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes. We evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of the original sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token. BART is particularly effective when fine tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE. BART also provides a 1.1 BLEU increase over a back-translation system for machine translation, with only target language pretraining. We also report ablation experiments that replicate other pretraining schemes within the BART framework, to better measure which factors most influence end-task performance.*
+`BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension`
+
+
+Notes:
+- Bart doesn't use :obj:`token_type_ids`, for sequence classification just use BartTokenizer.encode to get the proper splitting.
+- Inputs to the decoder are created by BartModel.forward if they are not passed. This is different than some other model APIs.
+- Model predictions are intended to be identical to the original implementation. This only works, however, if the string you pass to fairseq.encode starts with a space.
+
+BartModel
+~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.BartModel
+    :members: forward
+
+
+BartForMaskedLM
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.BartForMaskedLM
+    :members: forward
+
+
+BartForSequenceClassification
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.BartForSequenceClassification
+    :members: forward
+
+BartConfig
+~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.BartConfig
+    :members:
+
+Automatic Creation of Decoder Inputs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+This is enabled by default
+
+.. autofunction:: transformers.modeling_bart._prepare_bart_decoder_inputs
--- a/docs/source/model_doc/roberta.rst
+++ b/docs/source/model_doc/roberta.rst
@@ -23,6 +23,9 @@ Tips:

 - This implementation is the same as :class:`~transformers.BertModel` with a tiny embeddings tweak as well as a
  setup for Roberta pretrained models.
+- RoBERTa has the same architecture as BERT, but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a
+  different pre-training scheme.
+- RoBERTa doesn't have `token_type_ids`, you don't need to indicate which token belongs to which segment. Just separate your segments with the separation token `tokenizer.sep_token` (or `</s>`)
 - `Camembert <./camembert.html>`__ is a wrapper around RoBERTa. Refer to this page for usage examples.

 RobertaConfig
--- a/docs/source/model_doc/xlmroberta.rst
+++ b/docs/source/model_doc/xlmroberta.rst
@@ -22,6 +22,9 @@ and XNLI benchmarks. We will make XLM-R code, data, and models publicly availabl

 Tips:

+- XLM-R is a multilingual model trained on 100 different languages. Unlike some XLM multilingual models, it does
+  not require `lang` tensors to understand which language is used, and should be able to determine the correct
+  language from the input ids.
 - This implementation is the same as RoBERTa. Refer to the `documentation of RoBERTa <./roberta.html>`__ for usage
  examples as well as the information relative to the inputs and outputs.

--- a/docs/source/model_sharing.md
+++ b/docs/source/model_sharing.md
@@ -26,19 +26,22 @@ Your model will then be accessible through its identifier, a concatenation of yo
 "username/pretrained_model"
 ```

+**Please add a README.md model card** to the repo under `model_cards/` with: model description, training params (dataset, preprocessing, hyperparameters), evaluation results, intended uses & limitations, etc.
+
+Your model now has a page on huggingface.co/models 🔥
+
 Anyone can load it from code:
 ```python
 tokenizer = AutoTokenizer.from_pretrained("username/pretrained_model")
 model = AutoModel.from_pretrained("username/pretrained_model")
 ```

-Finally, list all your files on S3:
+List all your files on S3:
 ```shell
 transformers-cli s3 ls
-# List all your S3 objects.
 ```

-You can also delete files:
+You can also delete unneeded files:

 ```shell
 transformers-cli s3 rm …
--- a/docs/source/pretrained_models.rst
+++ b/docs/source/pretrained_models.rst
@@ -179,6 +179,14 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
 |                   |                                                            | | The DistilBERT model distilled from the BERT model `bert-base-uncased` checkpoint, with an additional linear layer.                 |
 |                   |                                                            | (see `details <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__)                                     |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``distilbert-base-cased``                                  | | 6-layer, 768-hidden, 12-heads, 65M parameters                                                                                       |
+|                   |                                                            | | The DistilBERT model distilled from the BERT model `bert-base-cased` checkpoint                                                     |
+|                   |                                                            | (see `details <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__)                                     |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``distilbert-base-cased-distilled-squad``                  | | 6-layer, 768-hidden, 12-heads, 65M parameters                                                                                       |
+|                   |                                                            | | The DistilBERT model distilled from the BERT model `bert-base-cased` checkpoint, with an additional question answering layer.       |
+|                   |                                                            | (see `details <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__)                                     |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
 |                   | ``distilgpt2``                                             | | 6-layer, 768-hidden, 12-heads, 82M parameters                                                                                       |
 |                   |                                                            | | The DistilGPT2 model distilled from the GPT2 model `gpt2` checkpoint.                                                               |
 |                   |                                                            | (see `details <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__)                                     |
@@ -267,6 +275,13 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
 |                   |                                                            | | FlauBERT large architecture                                                                                                         |
 |                   |                                                            | (see `details <https://github.com/getalp/Flaubert>`__)                                                                                |
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+| Bart              | ``bart-large``                                             | | 12-layer, 1024-hidden, 16-heads, 406M parameters                                                                                    |
+|                   |                                                            | (see `details <https://github.com/pytorch/fairseq/tree/master/examples/bart>`_)                                                       |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``bart-large-mnli``                                        | | Adds a 2 layer classification head with 1 million parameters                                                                        |
+|                   |                                                            | | bart-large base architecture with a classification head                                                                             |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+


 .. <https://huggingface.co/transformers/examples.html>`__
--- a/docs/source/quickstart.md
+++ b/docs/source/quickstart.md
@@ -209,7 +209,7 @@ past = None
 for i in range(100):
    print(i)
    output, past = model(context, past=past)
-    token = torch.argmax(output[0, :])
+    token = torch.argmax(output[..., -1, :])

    generated += [token.tolist()]
    context = token.unsqueeze(0)
@@ -299,8 +299,8 @@ model = Model2Model.from_pretrained('fine-tuned-weights')
 model.eval()

 # If you have a GPU, put everything on cuda
-question_tensor = encoded_question.to('cuda')
-answer_tensor = encoded_answer.to('cuda')
+question_tensor = question_tensor.to('cuda')
+answer_tensor = answer_tensor.to('cuda')
 model.to('cuda')

 # Predict all tokens
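
Note on the `docs/source/quickstart.md` hunk above: the `+` line takes the argmax over the logits of the last sequence position only. A minimal sketch of that selection in plain Python follows, assuming `output` has shape (batch, sequence, vocabulary). Nested lists stand in for the tensor, the values are made up, and `greedy_next_token` is a hypothetical helper that is not part of the library.

```python
# Hypothetical logits with shape (batch=1, seq_len=3, vocab=4).
# The numbers are illustrative only and do not come from any model.
logits = [[
    [0.1, 0.9, 0.0, 0.0],  # position 0
    [0.2, 0.1, 0.6, 0.1],  # position 1
    [0.0, 0.1, 0.2, 0.7],  # position 2 (last position)
]]


def greedy_next_token(batch_logits):
    """For each batch item, return the vocabulary index with the highest
    logit at the last sequence position (greedy decoding)."""
    next_tokens = []
    for sequence in batch_logits:
        last_step = sequence[-1]  # logits for the most recent position
        best = max(range(len(last_step)), key=last_step.__getitem__)
        next_tokens.append(best)
    return next_tokens


next_tokens = greedy_next_token(logits)
print(next_tokens)
```

For a batch of one, this corresponds to what `torch.argmax(output[..., -1, :])` selects. The earlier positions also have logits, but they score tokens that are already in the context, so they should not influence the choice of the next token.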
--- a/examples/README.md
+++ b/examples/README.md
@@ -3,7 +3,7 @@
 In this section a few examples are put together. All of these examples work for several models, making use of the very
 similar API between the different models.

-**Important**  
+**Important**
 To run the latest versions of the examples, you have to install from source and install some specific requirements for the examples.
 Execute the following steps in a new virtual environment:

@@ -15,17 +15,16 @@ pip install -r ./examples/requirements.txt
 ```

 | Section                    | Description                                                                                                                                                |
-|----------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| [TensorFlow 2.0 models on GLUE](#TensorFlow-2.0-Bert-models-on-GLUE) | Examples running BERT TensorFlow 2.0 model on the GLUE tasks. 
-| [Language Model fine-tuning](#language-model-fine-tuning) | Fine-tuning the library models for language modeling on a text dataset. Causal language modeling for GPT/GPT-2, masked language modeling for BERT/RoBERTa. |
-| [Language Generation](#language-generation) | Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL and XLNet.                                         |
-| [GLUE](#glue) | Examples running BERT/XLM/XLNet/RoBERTa on the 9 GLUE tasks. Examples feature distributed training as well as half-precision.                              |
-| [SQuAD](#squad) | Using BERT/RoBERTa/XLNet/XLM for question answering, examples with distributed training.                                                                                  |
-| [Multiple Choice](#multiple-choice) | Examples running BERT/XLNet/RoBERTa on the SWAG/RACE/ARC tasks. 
-| [Named Entity Recognition](#named-entity-recognition) | Using BERT for Named Entity Recognition (NER) on the CoNLL 2003 dataset, examples with distributed training.                                                                                  |
+|----------------------------|------------------------------------------------------------------------------------------------------------------------------------------
+| [TensorFlow 2.0 models on GLUE](#TensorFlow-2.0-Bert-models-on-GLUE) | Examples running BERT TensorFlow 2.0 model on the GLUE tasks. |
+| [Language Model training](#language-model-training) | Fine-tuning (or training from scratch) the library models for language modeling on a text dataset. Causal language modeling for GPT/GPT-2, masked language modeling for BERT/RoBERTa. |
+| [Language Generation](#language-generation) | Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL and XLNet. |
+| [GLUE](#glue) | Examples running BERT/XLM/XLNet/RoBERTa on the 9 GLUE tasks. Examples feature distributed training as well as half-precision. |
+| [SQuAD](#squad) | Using BERT/RoBERTa/XLNet/XLM for question answering, examples with distributed training. |
+| [Multiple Choice](#multiple-choice) | Examples running BERT/XLNet/RoBERTa on the SWAG/RACE/ARC tasks. |
+| [Named Entity Recognition](#named-entity-recognition) | Using BERT for Named Entity Recognition (NER) on the CoNLL 2003 dataset, examples with distributed training. |
 | [XNLI](#xnli) | Examples running BERT/XLM on the XNLI benchmark. |
-| [Adversarial evaluation of model performances](#adversarial-evaluation-of-model-performances) | Testing a model with adversarial evaluation of natural language
-inference on the Heuristic Analysis for NLI Systems (HANS) dataset (McCoy et al., 2019.) |
+| [Adversarial evaluation of model performances](#adversarial-evaluation-of-model-performances) | Testing a model with adversarial evaluation of natural language inference on the Heuristic Analysis for NLI Systems (HANS) dataset (McCoy et al., 2019.) |

 ## TensorFlow 2.0 Bert models on GLUE

@@ -49,16 +48,16 @@ Quick benchmarks from the script (no other modifications):

 Mixed precision (AMP) reduces the training time considerably for the same hardware and hyper-parameters (same batch size was used).

-## Language model fine-tuning
+## Language model training

-Based on the script [`run_lm_finetuning.py`](https://github.com/huggingface/transformers/blob/master/examples/run_lm_finetuning.py).
+Based on the script [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py).

-Fine-tuning the library models for language modeling on a text dataset for GPT, GPT-2, BERT and RoBERTa (DistilBERT 
+Fine-tuning (or training from scratch) the library models for language modeling on a text dataset for GPT, GPT-2, BERT and RoBERTa (DistilBERT 
 to be added soon). GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa 
 are fine-tuned using a masked language modeling (MLM) loss.

 Before running the following example, you should get a file that contains text on which the language model will be
-fine-tuned. A good example of such text is the [WikiText-2 dataset](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/).
+trained or fine-tuned. A good example of such text is the [WikiText-2 dataset](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/).

 We will refer to two different files: `$TRAIN_FILE`, which contains text for training, and `$TEST_FILE`, which contains
 text that will be used for evaluation.
@@ -72,7 +71,7 @@ the tokenization). The loss here is that of causal language modeling.
 export TRAIN_FILE=/path/to/dataset/wiki.train.raw
 export TEST_FILE=/path/to/dataset/wiki.test.raw

-python run_lm_finetuning.py \
+python run_language_modeling.py \
    --output_dir=output \
    --model_type=gpt2 \
    --model_name_or_path=gpt2 \
@@ -89,7 +88,7 @@ a score of ~20 perplexity once fine-tuned on the dataset.

 The following example fine-tunes RoBERTa on WikiText-2. Here too, we're using the raw WikiText-2. The loss is different
 as BERT/RoBERTa have a bidirectional mechanism; we're therefore using the same loss that was used during their
-pre-training: masked language modeling. 
+pre-training: masked language modeling.

 In accordance with the RoBERTa paper, we use dynamic masking rather than static masking. The model may, therefore, converge
 slightly slower (over-fitting takes more epochs).
@@ -100,7 +99,7 @@ We use the `--mlm` flag so that the script may change its loss function.
 export TRAIN_FILE=/path/to/dataset/wiki.train.raw
 export TEST_FILE=/path/to/dataset/wiki.test.raw

-python run_lm_finetuning.py \
+python run_language_modeling.py \
    --output_dir=output \
    --model_type=roberta \
    --model_name_or_path=roberta-base \
@@ -131,8 +130,8 @@ python run_generation.py \

 Based on the script [`run_glue.py`](https://github.com/huggingface/transformers/blob/master/examples/run_glue.py).

-Fine-tuning the library models for sequence classification on the GLUE benchmark: [General Language Understanding 
-Evaluation](https://gluebenchmark.com/). This script can fine-tune the following models: BERT, XLM, XLNet and RoBERTa. 
+Fine-tuning the library models for sequence classification on the GLUE benchmark: [General Language Understanding
+Evaluation](https://gluebenchmark.com/). This script can fine-tune the following models: BERT, XLM, XLNet and RoBERTa.

 GLUE is made up of a total of 9 different tasks. We get the following results on the dev set of the benchmark with an
 uncased BERT base model (the checkpoint `bert-base-uncased`). All experiments ran on single V100 GPUs with a total train
@@ -154,7 +153,7 @@ between different runs. We report the median on 5 runs (with different seeds) fo
 Some of these results are significantly different from the ones reported on the test set
 of GLUE benchmark on the website. For QQP and WNLI, please refer to [FAQ #12](https://gluebenchmark.com/faq) on the website.

-Before running anyone of these GLUE tasks you should download the
+Before running any one of these GLUE tasks you should download the
 [GLUE data](https://gluebenchmark.com/tasks) by running
 [this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
 and unpack it to some directory `$GLUE_DIR`.
@@ -180,23 +179,23 @@ python run_glue.py \

 where task name can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI.

-The dev set results will be present within the text file `eval_results.txt` in the specified output_dir. 
-In case of MNLI, since there are two separate dev sets (matched and mismatched), there will be a separate 
+The dev set results will be present within the text file `eval_results.txt` in the specified output_dir.
+In case of MNLI, since there are two separate dev sets (matched and mismatched), there will be a separate
 output folder called `/tmp/MNLI-MM/` in addition to `/tmp/MNLI/`.

-The code has not been tested with half-precision training with apex on any GLUE task apart from MRPC, MNLI, 
-CoLA, SST-2. The following section provides details on how to run half-precision training with MRPC. With that being 
-said, there shouldn’t be any issues in running half-precision training with the remaining GLUE tasks as well, 
+The code has not been tested with half-precision training with apex on any GLUE task apart from MRPC, MNLI,
+CoLA, SST-2. The following section provides details on how to run half-precision training with MRPC. With that being
+said, there shouldn’t be any issues in running half-precision training with the remaining GLUE tasks as well,
 since the data processor for each task inherits from the base class DataProcessor.

 ### MRPC

 #### Fine-tuning example

-The following examples fine-tune BERT on the Microsoft Research Paraphrase Corpus (MRPC) corpus and runs in less 
+The following example fine-tunes BERT on the Microsoft Research Paraphrase Corpus (MRPC) and runs in less
 than 10 minutes on a single K-80 and in 27 seconds (!) on single tesla V100 16GB with apex installed.

-Before running anyone of these GLUE tasks you should download the
+Before running any one of these GLUE tasks you should download the
 [GLUE data](https://gluebenchmark.com/tasks) by running
 [this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
 and unpack it to some directory `$GLUE_DIR`.
@@ -220,12 +219,12 @@ python run_glue.py \
 ```

 Our test ran on a few seeds with [the original implementation hyper-
-parameters](https://github.com/google-research/bert#sentence-and-sentence-pair-classification-tasks) gave evaluation 
+parameters](https://github.com/google-research/bert#sentence-and-sentence-pair-classification-tasks) gave evaluation
 results between 84% and 88%.

 #### Using Apex and mixed-precision

-Using Apex and 16 bit precision, the fine-tuning on MRPC only takes 27 seconds. First install 
+Using Apex and 16 bit precision, the fine-tuning on MRPC only takes 27 seconds. First install
 [apex](https://github.com/NVIDIA/apex), then run the following example:

 ```bash
@@ -361,8 +360,8 @@ Based on the script [`run_squad.py`](https://github.com/huggingface/transformers

 #### Fine-tuning BERT on SQuAD1.0

-This example code fine-tunes BERT on the SQuAD1.0 dataset. It runs in 24 min (with BERT-base) or 68 min (with BERT-large) 
-on a single tesla V100 16GB. The data for SQuAD can be downloaded with the following links and should be saved in a 
+This example code fine-tunes BERT on the SQuAD1.0 dataset. It runs in 24 min (with BERT-base) or 68 min (with BERT-large)
+on a single tesla V100 16GB. The data for SQuAD can be downloaded with the following links and should be saved in a
 $SQUAD_DIR directory.

 * [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)
@@ -443,14 +442,14 @@ This example code fine-tunes XLNet on both SQuAD1.0 and SQuAD2.0 dataset. See ab
 ```bash
 export SQUAD_DIR=/path/to/SQUAD

-python /data/home/hlu/transformers/examples/run_squad.py \
+python run_squad.py \
    --model_type xlnet \
    --model_name_or_path xlnet-large-cased \
    --do_train \
    --do_eval \
    --do_lower_case \
-    --train_file /data/home/hlu/notebooks/NLP/examples/question_answering/train-v1.1.json \
-    --predict_file /data/home/hlu/notebooks/NLP/examples/question_answering/dev-v1.1.json \
+    --train_file $SQUAD_DIR/train-v1.1.json \
+    --predict_file $SQUAD_DIR/dev-v1.1.json \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --max_seq_length 384 \
@@ -517,196 +516,17 @@ Larger batch size may improve the performance while costing more memory.



-## Named Entity Recognition
-
-Based on the scripts [`run_ner.py`](https://github.com/huggingface/transformers/blob/master/examples/run_ner.py) for Pytorch and
-[`run_tf_ner.py`](https://github.com/huggingface/transformers/blob/master/examples/run_tf_ner.py) for Tensorflow 2.
-This example fine-tune Bert Multilingual on GermEval 2014 (German NER).
-Details and results for the fine-tuning provided by @stefan-it.
-
-### Data (Download and pre-processing steps)
-
-Data can be obtained from the [GermEval 2014](https://sites.google.com/site/germeval2014ner/data) shared task page.
-
-Here are the commands for downloading and pre-processing train, dev and test datasets. The original data format has four (tab-separated) columns, in a pre-processing step only the two relevant columns (token and outer span NER annotation) are extracted:
-
-```bash
-curl -L 'https://sites.google.com/site/germeval2014ner/data/NER-de-train.tsv?attredirects=0&d=1' \
-| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > train.txt.tmp
-curl -L 'https://sites.google.com/site/germeval2014ner/data/NER-de-dev.tsv?attredirects=0&d=1' \
-| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > dev.txt.tmp
-curl -L 'https://sites.google.com/site/germeval2014ner/data/NER-de-test.tsv?attredirects=0&d=1' \
-| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > test.txt.tmp
-```
-
-The GermEval 2014 dataset contains some strange "control character" tokens like `'\x96', '\u200e', '\x95', '\xad' or '\x80'`. One problem with these tokens is, that `BertTokenizer` returns an empty token for them, resulting in misaligned `InputExample`s. I wrote a script that a) filters these tokens and b) splits longer sentences into smaller ones (once the max. subtoken length is reached).
-
-```bash
-wget "https://raw.githubusercontent.com/stefan-it/fine-tuned-berts-seq/master/scripts/preprocess.py"
-```
-Let's define some variables that we need for further pre-processing steps and training the model:
-
-```bash
-export MAX_LENGTH=128
-export BERT_MODEL=bert-base-multilingual-cased
-```
-
-Run the pre-processing script on training, dev and test datasets:
-
-```bash
-python3 preprocess.py train.txt.tmp $BERT_MODEL $MAX_LENGTH > train.txt
-python3 preprocess.py dev.txt.tmp $BERT_MODEL $MAX_LENGTH > dev.txt
-python3 preprocess.py test.txt.tmp $BERT_MODEL $MAX_LENGTH > test.txt
-```
-
-The GermEval 2014 dataset has much more labels than CoNLL-2002/2003 datasets, so an own set of labels must be used:
-
-```bash
-cat train.txt dev.txt test.txt | cut -d " " -f 2 | grep -v "^$"| sort | uniq > labels.txt
-```
-
-### Prepare the run
-
-Additional environment variables must be set:
-
-```bash
-export OUTPUT_DIR=germeval-model
-export BATCH_SIZE=32
-export NUM_EPOCHS=3
-export SAVE_STEPS=750
-export SEED=1
-```
-
-### Run the Pytorch version
-
-To start training, just run:
-
-```bash
-python3 run_ner.py --data_dir ./ \
--model_type bert \
--labels ./labels.txt \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR \
--max_seq_length  $MAX_LENGTH \
--num_train_epochs $NUM_EPOCHS \
--per_gpu_train_batch_size $BATCH_SIZE \
--save_steps $SAVE_STEPS \
--seed $SEED \
--do_train \
--do_eval \
--do_predict
-```
-
-If your GPU supports half-precision training, just add the `--fp16` flag. After training, the model will be both evaluated on development and test datasets.
-
-#### Evaluation
-
-Evaluation on development dataset outputs the following for our example:
-
-```bash
-10/04/2019 00:42:06 - INFO - __main__ -   ***** Eval results  *****
-10/04/2019 00:42:06 - INFO - __main__ -     f1 = 0.8623348017621146
-10/04/2019 00:42:06 - INFO - __main__ -     loss = 0.07183869666975543
-10/04/2019 00:42:06 - INFO - __main__ -     precision = 0.8467916366258111
-10/04/2019 00:42:06 - INFO - __main__ -     recall = 0.8784592370979806
-```
-
-On the test dataset the following results could be achieved:
-
-```bash
-10/04/2019 00:42:42 - INFO - __main__ -   ***** Eval results  *****
-10/04/2019 00:42:42 - INFO - __main__ -     f1 = 0.8614389652384803
-10/04/2019 00:42:42 - INFO - __main__ -     loss = 0.07064602487454782
-10/04/2019 00:42:42 - INFO - __main__ -     precision = 0.8604651162790697
-10/04/2019 00:42:42 - INFO - __main__ -     recall = 0.8624150210424085
-```
-
-#### Comparing BERT (large, cased), RoBERTa (large, cased) and DistilBERT (base, uncased)
-
-Here is a small comparison between BERT (large, cased), RoBERTa (large, cased) and DistilBERT (base, uncased) with the same hyperparameters as specified in the [example documentation](https://huggingface.co/transformers/examples.html#named-entity-recognition) (one run):
-
-| Model | F-Score Dev | F-Score Test
-| --------------------------------- | ------- | --------
-| `bert-large-cased`            | 95.59 | 91.70
-| `roberta-large`                  | 95.96 | 91.87
-| `distilbert-base-uncased` | 94.34 | 90.32
-
-### Run the Tensorflow 2 version
-
-To start training, just run:
-
-```bash
-python3 run_tf_ner.py --data_dir ./ \
--model_type bert \
--labels ./labels.txt \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR \
--max_seq_length  $MAX_LENGTH \
--num_train_epochs $NUM_EPOCHS \
--per_device_train_batch_size $BATCH_SIZE \
--save_steps $SAVE_STEPS \
--seed $SEED \
--do_train \
--do_eval \
--do_predict
-```
-
-Such as the Pytorch version, if your GPU supports half-precision training, just add the `--fp16` flag. After training, the model will be both evaluated on development and test datasets.
-
-#### Evaluation
-
-Evaluation on development dataset outputs the following for our example:
-```bash
-           precision    recall  f1-score   support
-
- LOCderiv     0.7619    0.6154    0.6809        52
-  PERpart     0.8724    0.8997    0.8858      4057
-  OTHpart     0.9360    0.9466    0.9413       711
-  ORGpart     0.7015    0.6989    0.7002       269
-  LOCpart     0.7668    0.8488    0.8057       496
-      LOC     0.8745    0.9191    0.8963       235
- ORGderiv     0.7723    0.8571    0.8125        91
- OTHderiv     0.4800    0.6667    0.5581        18
-      OTH     0.5789    0.6875    0.6286        16
- PERderiv     0.5385    0.3889    0.4516        18
-      PER     0.5000    0.5000    0.5000         2
-      ORG     0.0000    0.0000    0.0000         3
-
-micro avg     0.8574    0.8862    0.8715      5968
-macro avg     0.8575    0.8862    0.8713      5968
-```
-
-On the test dataset the following results could be achieved:
-```bash
-           precision    recall  f1-score   support
-
-  PERpart     0.8847    0.8944    0.8896      9397
-  OTHpart     0.9376    0.9353    0.9365      1639
-  ORGpart     0.7307    0.7044    0.7173       697
-      LOC     0.9133    0.9394    0.9262       561
-  LOCpart     0.8058    0.8157    0.8107      1150
-      ORG     0.0000    0.0000    0.0000         8
- OTHderiv     0.5882    0.4762    0.5263        42
- PERderiv     0.6571    0.5227    0.5823        44
-      OTH     0.4906    0.6667    0.5652        39
- ORGderiv     0.7016    0.7791    0.7383       172
- LOCderiv     0.8256    0.6514    0.7282       109
-      PER     0.0000    0.0000    0.0000        11
-
-micro avg     0.8722    0.8774    0.8748     13869
-macro avg     0.8712    0.8774    0.8740     13869
-```

 ## XNLI

 Based on the script [`run_xnli.py`](https://github.com/huggingface/transformers/blob/master/examples/run_xnli.py).

-[XNLI](https://www.nyu.edu/projects/bowman/xnli/) is crowd-sourced dataset based on [MultiNLI](http://www.nyu.edu/projects/bowman/multinli/). It is an evaluation benchmark for cross-lingual text representations. Pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-ressource language such as English and low-ressource languages such as Swahili).
+[XNLI](https://www.nyu.edu/projects/bowman/xnli/) is a crowd-sourced dataset based on [MultiNLI](http://www.nyu.edu/projects/bowman/multinli/). It is an evaluation benchmark for cross-lingual text representations. Pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-resource languages such as English and low-resource languages such as Swahili).

 #### Fine-tuning on XNLI

 This example code fine-tunes mBERT (multi-lingual BERT) on the XNLI dataset. It runs in 106 mins
-on a single tesla V100 16GB. The data for XNLI can be downloaded with the following links and should be both saved (and un-zipped) in a 
+on a single tesla V100 16GB. The data for XNLI can be downloaded with the following links and should be both saved (and un-zipped) in a
 `$XNLI_DIR` directory.

 * [XNLI 1.0](https://www.nyu.edu/projects/bowman/xnli/XNLI-1.0.zip)
@@ -773,7 +593,7 @@ export HANS_DIR=path-to-hans
 export MODEL_TYPE=type-of-the-model-e.g.-bert-roberta-xlnet-etc
 export MODEL_PATH=path-to-the-model-directory-that-is-trained-on-NLI-e.g.-by-using-run_glue.py

-python examples/test_hans.py \
+python examples/hans/test_hans.py \
        --task_name hans \
        --model_type $MODEL_TYPE \
        --do_eval \
@@ -781,7 +601,7 @@ python examples/test_hans.py \
        --data_dir $HANS_DIR \
        --model_name_or_path $MODEL_PATH \
        --max_seq_length 128 \
-        -output_dir $MODEL_PATH \
+        --output_dir $MODEL_PATH \
 ```

 This will create the hans_predictions.txt file in MODEL_PATH, which can then be evaluated using hans/evaluate_heur_output.py from the HANS dataset.
--- a/examples/distillation/README.md
+++ b/examples/distillation/README.md
@@ -31,8 +31,10 @@ Here are the results on the dev sets of GLUE:

 | Model                     | Macro-score                    | CoLA | MNLI | MRPC | QNLI | QQP  | RTE  | SST-2| STS-B| WNLI              |
 | :---:                     |    :---:                       | :---:| :---:| :---:| :---:| :---:| :---:| :---:| :---:| :---:             |
-| BERT-base-uncased         |  **77.6**                      | 49.2 | 80.8 | 87.4 | 87.5 | 86.4 | 61.7 | 92.0 | 83.8 | 45.1              |
-| DistilBERT-base-uncased   |  **76.8**                      | 43.6 | 79.0 | 87.5 | 85.3 | 84.9 | 59.9 | 90.7 | 81.2 | 56.3              |
+| BERT-base-uncased         |  **74.9**                      | 49.2 | 80.8 | 87.4 | 87.5 | 86.4 | 61.7 | 92.0 | 83.8 | 45.1              |
+| DistilBERT-base-uncased   |  **74.3**                      | 43.6 | 79.0 | 87.5 | 85.3 | 84.9 | 59.9 | 90.7 | 81.2 | 56.3              |
+| BERT-base-cased           |  **78.2**                      | 58.2 | 83.9 | 87.8 | 91.0 | 89.2 | 66.1 | 91.7 | 89.2 | 46.5              |
+| DistilBERT-base-cased     |  **75.9**                      | 47.2 | 81.5 | 85.6 | 88.2 | 87.8 | 60.6 | 90.4 | 85.5 | 56.3              |
 | ---                       |    ---                         |  --- |  --- |  --- |  --- |  --- |  --- |  --- |  --- |  ---              |
 | RoBERTa-base (reported)   |  **83.2**/**86.4**<sup>2</sup> | 63.6 | 87.6 | 90.2 | 92.8 | 91.9 | 78.7 | 94.8 | 91.2 | 57.7<sup>3</sup>  |
 | DistilRoBERTa<sup>1</sup> |  **79.0**/**82.3**<sup>2</sup> | 59.3 | 84.0 | 86.6 | 90.8 | 89.4 | 67.9 | 92.5 | 88.3 | 52.1              |
@@ -63,7 +65,9 @@ This part of the library has only be tested with Python3.6+. There are few speci
 Transformers includes five pre-trained Distil* models, currently only provided for English and German (we are investigating the possibility to train and release a multilingual version of DistilBERT):

 - `distilbert-base-uncased`: DistilBERT English language model pretrained on the same data used to pretrain Bert (concatenation of the Toronto Book Corpus and full English Wikipedia) using distillation with the supervision of the `bert-base-uncased` version of Bert. The model has 6 layers, 768 dimension and 12 heads, totalizing 66M parameters.
- `distilbert-base-uncased-distilled-squad`: A finetuned version of `distilbert-base-uncased` finetuned using (a second step of) knwoledge distillation on SQuAD 1.0. This model reaches a F1 score of 86.9 on the dev set (for comparison, Bert `bert-base-uncased` version reaches a 88.5 F1 score).
+- `distilbert-base-uncased-distilled-squad`: A version of `distilbert-base-uncased` fine-tuned using (a second step of) knowledge distillation on SQuAD 1.0. This model reaches an F1 score of 79.8 on the dev set (for comparison, Bert `bert-base-uncased` version reaches a 82.3 F1 score).
+- `distilbert-base-cased`: DistilBERT English language model pretrained on the same data used to pretrain Bert (concatenation of the Toronto Book Corpus and full English Wikipedia) using distillation with the supervision of the `bert-base-cased` version of Bert. The model has 6 layers, 768 dimension and 12 heads, totaling 65M parameters.
+- `distilbert-base-cased-distilled-squad`: A version of `distilbert-base-cased` fine-tuned using (a second step of) knowledge distillation on SQuAD 1.0. This model reaches an F1 score of 87.1 on the dev set (for comparison, Bert `bert-base-cased` version reaches a 88.7 F1 score).
 - `distilbert-base-german-cased`: DistilBERT German language model pretrained on 1/2 of the data used to pretrain Bert using distillation with the supervision of the `bert-base-german-dbmdz-cased` version of German DBMDZ Bert. For NER tasks the model reaches a F1 score of 83.49 on the CoNLL-2003 test set (for comparison, `bert-base-german-dbmdz-cased` reaches a 84.52 F1 score), and a F1 score of 85.23 on the GermEval 2014 test set (`bert-base-german-dbmdz-cased` reaches a 86.89 F1 score).
 - `distilgpt2`: DistilGPT2 English language model pretrained with the supervision of `gpt2` (the smallest version of GPT2) on [OpenWebTextCorpus](https://skylion007.github.io/OpenWebTextCorpus/), a reproduction of OpenAI's WebText dataset. The model has 6 layers, 768 dimension and 12 heads, totalizing 82M parameters (compared to 124M parameters for GPT2). On average, DistilGPT2 is two times faster than GPT2.
 - `distilroberta-base`: DistilRoBERTa English language model pretrained with the supervision of `roberta-base` solely on [OpenWebTextCorpus](https://skylion007.github.io/OpenWebTextCorpus/), a reproduction of OpenAI's WebText dataset (it is ~4 times less training data than the teacher RoBERTa). The model has 6 layers, 768 dimension and 12 heads, totalizing 82M parameters (compared to 125M parameters for RoBERTa-base). On average DistilRoBERTa is twice as fast as Roberta-base.
@@ -72,8 +76,8 @@ Transformers includes five pre-trained Distil* models, currently only provided f
 Using DistilBERT is very similar to using BERT. DistilBERT shares the same tokenizer as BERT's `bert-base-uncased` even though we provide a link to this tokenizer under the `DistilBertTokenizer` name to have a consistent naming between the library models.

 ```python
-tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
-model = DistilBertModel.from_pretrained('distilbert-base-uncased')
+tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')
+model = DistilBertModel.from_pretrained('distilbert-base-cased')

 input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)
 outputs = model(input_ids)
@@ -81,6 +85,7 @@ last_hidden_states = outputs[0]  # The last hidden-state is the first element of
 ```

 Similarly, using the other Distil* models simply consists in calling the base classes with a different pretrained checkpoint:
+- DistilBERT uncased: `model = DistilBertModel.from_pretrained('distilbert-base-uncased')`
 - DistilGPT2: `model = GPT2Model.from_pretrained('distilgpt2')`
 - DistilRoBERTa: `model = RobertaModel.from_pretrained('distilroberta-base')`
 - DistilmBERT: `model = DistilBertModel.from_pretrained('distilbert-base-multilingual-cased')`
@@ -174,7 +179,7 @@ Happy distillation!

 ## Citation

-If you find the ressource useful, you should cite the following paper:
+If you find the resource useful, you should cite the following paper:

 ```
@inproceedings{sanh2019distilbert,
--- a/examples/distillation/scripts/binarized_data.py
+++ b/examples/distillation/scripts/binarized_data.py
@@ -75,13 +75,17 @@ def main():
        iter += 1
        if iter % interval == 0:
            end = time.time()
-            logger.info(f"{iter} examples processed. - {(end-start)/interval:.2f}s/expl")
+            logger.info(f"{iter} examples processed. - {(end-start):.2f}s/{interval}expl")
            start = time.time()
    logger.info("Finished binarization")
    logger.info(f"{len(data)} examples processed.")

    dp_file = f"{args.dump_file}.{args.tokenizer_name}.pickle"
-    rslt_ = [np.uint16(d) for d in rslt]
+    vocab_size = tokenizer.vocab_size
+    if vocab_size < (1 << 16):
+        rslt_ = [np.uint16(d) for d in rslt]
+    else:
+        rslt_ = [np.int32(d) for d in rslt]
    random.shuffle(rslt_)
    logger.info(f"Dump to {dp_file}")
    with open(dp_file, "wb") as handle:
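Note on the hunk above: the dtype switch matters because `np.uint16` can only represent values up to 65535, so token ids from a larger vocabulary would not fit. The following is a minimal sketch of the same selection rule, assuming NumPy is installed. `pick_token_dtype` and `to_compact_arrays` are hypothetical helper names that do not exist in `binarized_data.py`, and the vocabulary size of 100000 is purely illustrative (28996 is the value from the `distilbert-base-cased` config added in this change).

```python
import numpy as np


def pick_token_dtype(vocab_size: int):
    """Pick the storage dtype for token ids, mirroring the rule in the diff.

    Hypothetical helper: uint16 when vocab_size < 2**16, int32 otherwise.
    """
    return np.uint16 if vocab_size < (1 << 16) else np.int32


def to_compact_arrays(sequences, vocab_size):
    """Convert lists of token ids to arrays of the selected dtype."""
    dtype = pick_token_dtype(vocab_size)
    return [np.array(seq, dtype=dtype) for seq in sequences]


# Small vocabulary (value taken from the distilbert-base-cased config).
small = to_compact_arrays([[101, 7592, 102]], vocab_size=28996)
# Illustrative large vocabulary whose ids exceed the uint16 range.
large = to_compact_arrays([[0, 70000, 2]], vocab_size=100000)

print(small[0].dtype, small[0].nbytes)  # 2 bytes per token id
print(large[0].dtype, large[0].nbytes)  # 4 bytes per token id
```

The trade-off is storage: the narrower dtype halves the size of the pickled dump, which is presumably why the script keeps it whenever the vocabulary allows.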
--- a/examples/distillation/training_configs/distilbert-base-cased.json
+++ b/examples/distillation/training_configs/distilbert-base-cased.json
@@ -0,0 +1,15 @@
+{
+	"activation": "gelu",
+	"attention_dropout": 0.1,
+	"dim": 768,
+	"dropout": 0.1,
+	"hidden_dim": 3072,
+	"initializer_range": 0.02,
+	"max_position_embeddings": 512,
+	"n_heads": 12,
+	"n_layers": 6,
+	"sinusoidal_pos_embds": true,
+	"tie_weights_": true,
+	"vocab_size": 28996
+  }
+  
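Note on the config above: a quick way to sanity-check a training config like this one is to parse it and derive the quantities the architecture depends on. The sketch below uses only the standard library; the JSON values are copied from the added file, while `CONFIG_TEXT` and the derived names are hypothetical and the checks themselves are an assumption about what is worth verifying, not something the distillation scripts do.

```python
import json

# Values copied from the added distilbert-base-cased.json.
CONFIG_TEXT = """
{
    "activation": "gelu",
    "attention_dropout": 0.1,
    "dim": 768,
    "dropout": 0.1,
    "hidden_dim": 3072,
    "initializer_range": 0.02,
    "max_position_embeddings": 512,
    "n_heads": 12,
    "n_layers": 6,
    "sinusoidal_pos_embds": true,
    "tie_weights_": true,
    "vocab_size": 28996
}
"""

config = json.loads(CONFIG_TEXT)

# The hidden size must split evenly across attention heads.
head_dim, remainder = divmod(config["dim"], config["n_heads"])
if remainder != 0:
    raise ValueError("dim must be divisible by n_heads")

# Ratio between the feed-forward width and the hidden size.
ffn_ratio = config["hidden_dim"] // config["dim"]

# The vocabulary is small enough for 16-bit token ids.
fits_uint16 = config["vocab_size"] < (1 << 16)

print(head_dim, ffn_ratio, fits_uint16)
```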
--- a/examples/ner/README.md
+++ b/examples/ner/README.md
@@ -0,0 +1,179 @@
+## Named Entity Recognition
+
+Based on the scripts [`run_ner.py`](https://github.com/huggingface/transformers/blob/master/examples/ner/run_ner.py) for PyTorch and
+[`run_tf_ner.py`](https://github.com/huggingface/transformers/blob/master/examples/ner/run_tf_ner.py) for TensorFlow 2.
+This example fine-tunes multilingual BERT on GermEval 2014 (German NER).
+Details and results for the fine-tuning provided by @stefan-it.
+
+### Data (Download and pre-processing steps)
+
+Data can be obtained from the [GermEval 2014](https://sites.google.com/site/germeval2014ner/data) shared task page.
+
+Here are the commands for downloading and pre-processing the train, dev and test datasets. The original data format has four (tab-separated) columns; in a pre-processing step, only the two relevant columns (token and outer span NER annotation) are extracted:
+
+```bash
+curl -L 'https://sites.google.com/site/germeval2014ner/data/NER-de-train.tsv?attredirects=0&d=1' \
+| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > train.txt.tmp
+curl -L 'https://sites.google.com/site/germeval2014ner/data/NER-de-dev.tsv?attredirects=0&d=1' \
+| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > dev.txt.tmp
+curl -L 'https://sites.google.com/site/germeval2014ner/data/NER-de-test.tsv?attredirects=0&d=1' \
+| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > test.txt.tmp
+```
+
+The GermEval 2014 dataset contains some strange "control character" tokens like `'\x96', '\u200e', '\x95', '\xad' or '\x80'`. One problem with these tokens is that `BertTokenizer` returns an empty token for them, resulting in misaligned `InputExample`s. I wrote a script that a) filters these tokens and b) splits longer sentences into smaller ones (once the max. subtoken length is reached).
+
+```bash
+wget "https://raw.githubusercontent.com/stefan-it/fine-tuned-berts-seq/master/scripts/preprocess.py"
+```
+Let's define some variables that we need for further pre-processing steps and training the model:
+
+```bash
+export MAX_LENGTH=128
+export BERT_MODEL=bert-base-multilingual-cased
+```
+
+Run the pre-processing script on training, dev and test datasets:
+
+```bash
+python3 preprocess.py train.txt.tmp $BERT_MODEL $MAX_LENGTH > train.txt
+python3 preprocess.py dev.txt.tmp $BERT_MODEL $MAX_LENGTH > dev.txt
+python3 preprocess.py test.txt.tmp $BERT_MODEL $MAX_LENGTH > test.txt
+```
+
+The GermEval 2014 dataset has many more labels than the CoNLL-2002/2003 datasets, so its own set of labels must be used:
+
+```bash
+cat train.txt dev.txt test.txt | cut -d " " -f 2 | grep -v "^$"| sort | uniq > labels.txt
+```
+
+### Prepare the run
+
+Additional environment variables must be set:
+
+```bash
+export OUTPUT_DIR=germeval-model
+export BATCH_SIZE=32
+export NUM_EPOCHS=3
+export SAVE_STEPS=750
+export SEED=1
+```
+
+### Run the PyTorch version
+
+To start training, just run:
+
+```bash
+python3 run_ner.py --data_dir ./ \
+--model_type bert \
+--labels ./labels.txt \
+--model_name_or_path $BERT_MODEL \
+--output_dir $OUTPUT_DIR \
+--max_seq_length  $MAX_LENGTH \
+--num_train_epochs $NUM_EPOCHS \
+--per_gpu_train_batch_size $BATCH_SIZE \
+--save_steps $SAVE_STEPS \
+--seed $SEED \
+--do_train \
+--do_eval \
+--do_predict
+```
+
+If your GPU supports half-precision training, just add the `--fp16` flag. After training, the model will be evaluated on both the development and test datasets.
+
+#### Evaluation
+
+Evaluation on the development dataset outputs the following for our example:
+
+```bash
+10/04/2019 00:42:06 - INFO - __main__ -   ***** Eval results  *****
+10/04/2019 00:42:06 - INFO - __main__ -     f1 = 0.8623348017621146
+10/04/2019 00:42:06 - INFO - __main__ -     loss = 0.07183869666975543
+10/04/2019 00:42:06 - INFO - __main__ -     precision = 0.8467916366258111
+10/04/2019 00:42:06 - INFO - __main__ -     recall = 0.8784592370979806
+```
+
+On the test dataset the following results could be achieved:
+
+```bash
+10/04/2019 00:42:42 - INFO - __main__ -   ***** Eval results  *****
+10/04/2019 00:42:42 - INFO - __main__ -     f1 = 0.8614389652384803
+10/04/2019 00:42:42 - INFO - __main__ -     loss = 0.07064602487454782
+10/04/2019 00:42:42 - INFO - __main__ -     precision = 0.8604651162790697
+10/04/2019 00:42:42 - INFO - __main__ -     recall = 0.8624150210424085
+```
+
+#### Comparing BERT (large, cased), RoBERTa (large, cased) and DistilBERT (base, uncased)
+
+Here is a small comparison between BERT (large, cased), RoBERTa (large, cased) and DistilBERT (base, uncased) with the same hyperparameters as specified in the [example documentation](https://huggingface.co/transformers/examples.html#named-entity-recognition) (one run):
+
+| Model | F-Score Dev | F-Score Test
+| --------------------------------- | ------- | --------
+| `bert-large-cased`            | 95.59 | 91.70
+| `roberta-large`                  | 95.96 | 91.87
+| `distilbert-base-uncased` | 94.34 | 90.32
+
+### Run the TensorFlow 2 version
+
+To start training, just run:
+
+```bash
+python3 run_tf_ner.py --data_dir ./ \
+--model_type bert \
+--labels ./labels.txt \
+--model_name_or_path $BERT_MODEL \
+--output_dir $OUTPUT_DIR \
+--max_seq_length  $MAX_LENGTH \
+--num_train_epochs $NUM_EPOCHS \
+--per_device_train_batch_size $BATCH_SIZE \
+--save_steps $SAVE_STEPS \
+--seed $SEED \
+--do_train \
+--do_eval \
+--do_predict
+```
+
+As with the PyTorch version, if your GPU supports half-precision training, just add the `--fp16` flag. After training, the model will be evaluated on both the development and test datasets.
+
+#### Evaluation
+
+Evaluation on the development dataset outputs the following for our example:
+```bash
+           precision    recall  f1-score   support
+
+ LOCderiv     0.7619    0.6154    0.6809        52
+  PERpart     0.8724    0.8997    0.8858      4057
+  OTHpart     0.9360    0.9466    0.9413       711
+  ORGpart     0.7015    0.6989    0.7002       269
+  LOCpart     0.7668    0.8488    0.8057       496
+      LOC     0.8745    0.9191    0.8963       235
+ ORGderiv     0.7723    0.8571    0.8125        91
+ OTHderiv     0.4800    0.6667    0.5581        18
+      OTH     0.5789    0.6875    0.6286        16
+ PERderiv     0.5385    0.3889    0.4516        18
+      PER     0.5000    0.5000    0.5000         2
+      ORG     0.0000    0.0000    0.0000         3
+
+micro avg     0.8574    0.8862    0.8715      5968
+macro avg     0.8575    0.8862    0.8713      5968
+```
+
+On the test dataset the following results could be achieved:
+```bash
+           precision    recall  f1-score   support
+
+  PERpart     0.8847    0.8944    0.8896      9397
+  OTHpart     0.9376    0.9353    0.9365      1639
+  ORGpart     0.7307    0.7044    0.7173       697
+      LOC     0.9133    0.9394    0.9262       561
+  LOCpart     0.8058    0.8157    0.8107      1150
+      ORG     0.0000    0.0000    0.0000         8
+ OTHderiv     0.5882    0.4762    0.5263        42
+ PERderiv     0.6571    0.5227    0.5823        44
+      OTH     0.4906    0.6667    0.5652        39
+ ORGderiv     0.7016    0.7791    0.7383       172
+ LOCderiv     0.8256    0.6514    0.7282       109
+      PER     0.0000    0.0000    0.0000        11
+
+micro avg     0.8722    0.8774    0.8748     13869
+macro avg     0.8712    0.8774    0.8740     13869
+```
--- a/examples/ner/run.sh
+++ b/examples/ner/run.sh
@@ -0,0 +1,32 @@
+curl -L 'https://sites.google.com/site/germeval2014ner/data/NER-de-train.tsv?attredirects=0&d=1' \
+| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > train.txt.tmp
+curl -L 'https://sites.google.com/site/germeval2014ner/data/NER-de-dev.tsv?attredirects=0&d=1' \
+| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > dev.txt.tmp
+curl -L 'https://sites.google.com/site/germeval2014ner/data/NER-de-test.tsv?attredirects=0&d=1' \
+| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > test.txt.tmp
+wget "https://raw.githubusercontent.com/stefan-it/fine-tuned-berts-seq/master/scripts/preprocess.py"
+export MAX_LENGTH=128
+export BERT_MODEL=bert-base-multilingual-cased
+python3 preprocess.py train.txt.tmp $BERT_MODEL $MAX_LENGTH > train.txt
+python3 preprocess.py dev.txt.tmp $BERT_MODEL $MAX_LENGTH > dev.txt
+python3 preprocess.py test.txt.tmp $BERT_MODEL $MAX_LENGTH > test.txt
+cat train.txt dev.txt test.txt | cut -d " " -f 2 | grep -v "^$"| sort | uniq > labels.txt
+export OUTPUT_DIR=germeval-model
+export BATCH_SIZE=32
+export NUM_EPOCHS=3
+export SAVE_STEPS=750
+export SEED=1
+
+python3 run_ner.py --data_dir ./ \
+--model_type bert \
+--labels ./labels.txt \
+--model_name_or_path $BERT_MODEL \
+--output_dir $OUTPUT_DIR \
+--max_seq_length  $MAX_LENGTH \
+--num_train_epochs $NUM_EPOCHS \
+--per_gpu_train_batch_size $BATCH_SIZE \
+--save_steps $SAVE_STEPS \
+--seed $SEED \
+--do_train \
+--do_eval \
+--do_predict
--- a/examples/ner/run_ner.py
+++ b/examples/ner/run_ner.py
@@ -160,7 +160,10 @@ def train(args, train_dataset, model, tokenizer, labels, pad_token_label_id):
    # Check if continuing training from a checkpoint
    if os.path.exists(args.model_name_or_path):
        # set global_step to gobal_step of last saved checkpoint from model path
-        global_step = int(args.model_name_or_path.split("-")[-1].split("/")[0])
+        try:
+            global_step = int(args.model_name_or_path.split("-")[-1].split("/")[0])
+        except ValueError:
+            global_step = 0
        epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps)
        steps_trained_in_current_epoch = global_step % (len(train_dataloader) // args.gradient_accumulation_steps)

@@ -583,6 +586,8 @@ def main():
    config = config_class.from_pretrained(
        args.config_name if args.config_name else args.model_name_or_path,
        num_labels=num_labels,
+        id2label={str(i): label for i, label in enumerate(labels)},
+        label2id={label: i for i, label in enumerate(labels)},
        cache_dir=args.cache_dir if args.cache_dir else None,
    )
    tokenizer = tokenizer_class.from_pretrained(
--- a/examples/ner/run_pl.sh
+++ b/examples/ner/run_pl.sh
@@ -0,0 +1,21 @@
+# Requires pytorch-lightning 0.6
+export MAX_LENGTH=128
+export BERT_MODEL=bert-base-multilingual-cased
+export OUTPUT_DIR=germeval-model
+export BATCH_SIZE=32
+export NUM_EPOCHS=3
+export SAVE_STEPS=750
+export SEED=1
+
+python3 run_pl_ner.py --data_dir ./ \
+--model_type bert \
+--labels ./labels.txt \
+--model_name_or_path $BERT_MODEL \
+--output_dir $OUTPUT_DIR \
+--max_seq_length  $MAX_LENGTH \
+--num_train_epochs $NUM_EPOCHS \
+--train_batch_size $BATCH_SIZE \
+--save_steps $SAVE_STEPS \
+--seed $SEED \
+--do_train \
+--do_predict
--- a/examples/ner/run_pl_ner.py
+++ b/examples/ner/run_pl_ner.py
@@ -0,0 +1,238 @@
+import argparse
+import glob
+import logging
+import os
+
+import numpy as np
+import torch
+from seqeval.metrics import f1_score, precision_score, recall_score
+from torch.nn import CrossEntropyLoss
+from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset
+from torch.utils.data.distributed import DistributedSampler
+
+from transformer_base import BaseTransformer, add_generic_args, generic_train
+from utils_ner import convert_examples_to_features, get_labels, read_examples_from_file
+
+
+logger = logging.getLogger(__name__)
+
+
+class NERTransformer(BaseTransformer):
+    """
+    A training module for NER. See BaseTransformer for the core options.
+    """
+
+    def __init__(self, hparams):
+        self.labels = get_labels(hparams.labels)
+        num_labels = len(self.labels)
+        super(NERTransformer, self).__init__(hparams, num_labels)
+
+    def forward(self, **inputs):
+        return self.model(**inputs)
+
+    def training_step(self, batch, batch_num):
+        "Compute loss"
+        inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": batch[3]}
+        if self.hparams.model_type != "distilbert":
+            inputs["token_type_ids"] = (
+                batch[2] if self.hparams.model_type in ["bert", "xlnet"] else None
+            )  # XLM and RoBERTa don't use segment_ids
+
+        outputs = self.forward(**inputs)
+        loss = outputs[0]
+
+        tensorboard_logs = {"loss": loss, "rate": self.lr_scheduler.get_last_lr()[-1]}
+        return {"loss": loss, "log": tensorboard_logs}
+
+    def load_dataset(self, mode, batch_size):
+        labels = get_labels(self.hparams.labels)
+        self.pad_token_label_id = CrossEntropyLoss().ignore_index
+        dataset = self.load_and_cache_examples(labels, self.pad_token_label_id, mode)
+        if mode == "train":
+            if self.hparams.n_gpu > 1:
+                sampler = DistributedSampler(dataset)
+            else:
+                sampler = RandomSampler(dataset)
+        else:
+            sampler = SequentialSampler(dataset)
+        dataloader = DataLoader(dataset, sampler=sampler, batch_size=batch_size)
+        return dataloader
+
+    def validation_step(self, batch, batch_nb):
+        inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": batch[3]}
+        if self.hparams.model_type != "distilbert":
+            inputs["token_type_ids"] = (
+                batch[2] if self.hparams.model_type in ["bert", "xlnet"] else None
+            )  # XLM and RoBERTa don't use segment_ids
+        outputs = self.forward(**inputs)
+        tmp_eval_loss, logits = outputs[:2]
+        preds = logits.detach().cpu().numpy()
+        out_label_ids = inputs["labels"].detach().cpu().numpy()
+
+        return {"val_loss": tmp_eval_loss, "pred": preds, "target": out_label_ids}
+
+    def _eval_end(self, outputs):
+        "Task specific validation"
+        val_loss_mean = torch.stack([x["val_loss"] for x in outputs]).mean()
+        preds = np.concatenate([x["pred"] for x in outputs], axis=0)
+        preds = np.argmax(preds, axis=2)
+        out_label_ids = np.concatenate([x["target"] for x in outputs], axis=0)
+
+        label_map = {i: label for i, label in enumerate(self.labels)}
+        out_label_list = [[] for _ in range(out_label_ids.shape[0])]
+        preds_list = [[] for _ in range(out_label_ids.shape[0])]
+
+        for i in range(out_label_ids.shape[0]):
+            for j in range(out_label_ids.shape[1]):
+                if out_label_ids[i, j] != self.pad_token_label_id:
+                    out_label_list[i].append(label_map[out_label_ids[i][j]])
+                    preds_list[i].append(label_map[preds[i][j]])
+
+        results = {
+            "val_loss": val_loss_mean,
+            "precision": precision_score(out_label_list, preds_list),
+            "recall": recall_score(out_label_list, preds_list),
+            "f1": f1_score(out_label_list, preds_list),
+        }
+
+        if self.is_logger():
+            logger.info(self.proc_rank)
+            logger.info("***** Eval results *****")
+            for key in sorted(results.keys()):
+                logger.info("  %s = %s", key, str(results[key]))
+
+        tensorboard_logs = results
+        ret = {k: v for k, v in results.items()}
+        ret["log"] = tensorboard_logs
+        return ret, preds_list, out_label_list
+
+    def validation_end(self, outputs):
+        ret, preds, targets = self._eval_end(outputs)
+        return ret
+
+    def test_end(self, outputs):
+        ret, predictions, targets = self._eval_end(outputs)
+
+        if self.is_logger():
+            # Write output to a file:
+            # Save results
+            output_test_results_file = os.path.join(self.hparams.output_dir, "test_results.txt")
+            with open(output_test_results_file, "w") as writer:
+                for key in sorted(ret.keys()):
+                    if key != "log":
+                        writer.write("{} = {}\n".format(key, str(ret[key])))
+            # Save predictions
+            output_test_predictions_file = os.path.join(self.hparams.output_dir, "test_predictions.txt")
+            with open(output_test_predictions_file, "w") as writer:
+                with open(os.path.join(self.hparams.data_dir, "test.txt"), "r") as f:
+                    example_id = 0
+                    for line in f:
+                        if line.startswith("-DOCSTART-") or line == "" or line == "\n":
+                            writer.write(line)
+                            if not predictions[example_id]:
+                                example_id += 1
+                        elif predictions[example_id]:
+                            output_line = line.split()[0] + " " + predictions[example_id].pop(0) + "\n"
+                            writer.write(output_line)
+                        else:
+                            logger.warning(
+                                "Maximum sequence length exceeded: No prediction for '%s'.", line.split()[0]
+                            )
+        return ret
+
+    def load_and_cache_examples(self, labels, pad_token_label_id, mode):
+        args = self.hparams
+        tokenizer = self.tokenizer
+        if self.proc_rank not in [-1, 0] and mode == "train":
+            torch.distributed.barrier()  # Make sure only the first process in distributed training processes the dataset; the others will use the cache
+
+        # Load data features from cache or dataset file
+        cached_features_file = os.path.join(
+            args.data_dir,
+            "cached_{}_{}_{}".format(
+                mode, list(filter(None, args.model_name_or_path.split("/"))).pop(), str(args.max_seq_length)
+            ),
+        )
+        if os.path.exists(cached_features_file) and not args.overwrite_cache:
+            logger.info("Loading features from cached file %s", cached_features_file)
+            features = torch.load(cached_features_file)
+        else:
+            logger.info("Creating features from dataset file at %s", args.data_dir)
+            examples = read_examples_from_file(args.data_dir, mode)
+            features = convert_examples_to_features(
+                examples,
+                labels,
+                args.max_seq_length,
+                tokenizer,
+                cls_token_at_end=bool(args.model_type in ["xlnet"]),
+                cls_token=tokenizer.cls_token,
+                cls_token_segment_id=2 if args.model_type in ["xlnet"] else 0,
+                sep_token=tokenizer.sep_token,
+                sep_token_extra=bool(args.model_type in ["roberta"]),
+                pad_on_left=bool(args.model_type in ["xlnet"]),
+                pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0],
+                pad_token_segment_id=4 if args.model_type in ["xlnet"] else 0,
+                pad_token_label_id=pad_token_label_id,
+            )
+            if self.proc_rank in [-1, 0]:
+                logger.info("Saving features into cached file %s", cached_features_file)
+                torch.save(features, cached_features_file)
+
+        if self.proc_rank == 0 and mode == "train":
+            torch.distributed.barrier()  # Make sure only the first process in distributed training processes the dataset; the others will use the cache
+
+        # Convert to Tensors and build dataset
+        all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
+        all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
+        all_segment_ids = torch.tensor([f.segment_ids for f in features], dtype=torch.long)
+        all_label_ids = torch.tensor([f.label_ids for f in features], dtype=torch.long)
+
+        dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids)
+        return dataset
+
+    @staticmethod
+    def add_model_specific_args(parser, root_dir):
+        # Add NER specific options
+        BaseTransformer.add_model_specific_args(parser, root_dir)
+        parser.add_argument(
+            "--max_seq_length",
+            default=128,
+            type=int,
+            help="The maximum total input sequence length after tokenization. Sequences longer "
+            "than this will be truncated, sequences shorter will be padded.",
+        )
+
+        parser.add_argument(
+            "--labels",
+            default="",
+            type=str,
+            help="Path to a file containing all labels. If not specified, CoNLL-2003 labels are used.",
+        )
+
+        parser.add_argument(
+            "--data_dir",
+            default=None,
+            type=str,
+            required=True,
+            help="The input data dir. Should contain the training files for the CoNLL-2003 NER task.",
+        )
+
+        parser.add_argument(
+            "--overwrite_cache", action="store_true", help="Overwrite the cached training and evaluation sets"
+        )
+
+        return parser
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    add_generic_args(parser, os.getcwd())
+    parser = NERTransformer.add_model_specific_args(parser, os.getcwd())
+    args = parser.parse_args()
+    model = NERTransformer(args)
+    trainer = generic_train(model, args)
+
+    if args.do_predict:
+        checkpoints = list(sorted(glob.glob(args.output_dir + "/checkpoint_*.ckpt", recursive=True)))
+        model = NERTransformer.load_from_checkpoint(checkpoints[-1])
+        trainer.test(model)
--- a/examples/ner/run_tf_ner.py
+++ b/examples/ner/run_tf_ner.py
--- a/examples/ner/transformer_base.py
+++ b/examples/ner/transformer_base.py
@@ -0,0 +1,270 @@
+import os
+import random
+
+import numpy as np
+import pytorch_lightning as pl
+import torch
+
+from transformers import (
+    AdamW,
+    BertConfig,
+    BertForTokenClassification,
+    BertTokenizer,
+    CamembertConfig,
+    CamembertForTokenClassification,
+    CamembertTokenizer,
+    DistilBertConfig,
+    DistilBertForTokenClassification,
+    DistilBertTokenizer,
+    RobertaConfig,
+    RobertaForTokenClassification,
+    RobertaTokenizer,
+    XLMRobertaConfig,
+    XLMRobertaForTokenClassification,
+    XLMRobertaTokenizer,
+    get_linear_schedule_with_warmup,
+)
+
+
+ALL_MODELS = sum(
+    (
+        tuple(conf.pretrained_config_archive_map.keys())
+        for conf in (BertConfig, RobertaConfig, DistilBertConfig, CamembertConfig, XLMRobertaConfig)
+    ),
+    (),
+)
+
+MODEL_CLASSES = {
+    "bert": (BertConfig, BertForTokenClassification, BertTokenizer),
+    "roberta": (RobertaConfig, RobertaForTokenClassification, RobertaTokenizer),
+    "distilbert": (DistilBertConfig, DistilBertForTokenClassification, DistilBertTokenizer),
+    "camembert": (CamembertConfig, CamembertForTokenClassification, CamembertTokenizer),
+    "xlmroberta": (XLMRobertaConfig, XLMRobertaForTokenClassification, XLMRobertaTokenizer),
+}
+
+
+def set_seed(args):
+    random.seed(args.seed)
+    np.random.seed(args.seed)
+    torch.manual_seed(args.seed)
+    if args.n_gpu > 0:
+        torch.cuda.manual_seed_all(args.seed)
+
+
+class BaseTransformer(pl.LightningModule):
+    def __init__(self, hparams, num_labels=None):
+        "Initialize a model."
+
+        super(BaseTransformer, self).__init__()
+        self.hparams = hparams
+        self.hparams.model_type = self.hparams.model_type.lower()
+
+        config_class, model_class, tokenizer_class = MODEL_CLASSES[self.hparams.model_type]
+        config = config_class.from_pretrained(
+            self.hparams.config_name if self.hparams.config_name else self.hparams.model_name_or_path,
+            num_labels=num_labels,
+            cache_dir=self.hparams.cache_dir if self.hparams.cache_dir else None,
+        )
+        tokenizer = tokenizer_class.from_pretrained(
+            self.hparams.tokenizer_name if self.hparams.tokenizer_name else self.hparams.model_name_or_path,
+            do_lower_case=self.hparams.do_lower_case,
+            cache_dir=self.hparams.cache_dir if self.hparams.cache_dir else None,
+        )
+        model = model_class.from_pretrained(
+            self.hparams.model_name_or_path,
+            from_tf=bool(".ckpt" in self.hparams.model_name_or_path),
+            config=config,
+            cache_dir=self.hparams.cache_dir if self.hparams.cache_dir else None,
+        )
+        self.config, self.tokenizer, self.model = config, tokenizer, model
+        self.proc_rank = -1
+
+    def is_logger(self):
+        return self.proc_rank <= 0
+
+    def configure_optimizers(self):
+        "Prepare optimizer and schedule (linear warmup and decay)"
+        model = self.model
+
+        t_total = (
+            len(self.train_dataloader())
+            // self.hparams.gradient_accumulation_steps
+            * float(self.hparams.num_train_epochs)
+        )
+        no_decay = ["bias", "LayerNorm.weight"]
+        optimizer_grouped_parameters = [
+            {
+                "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
+                "weight_decay": self.hparams.weight_decay,
+            },
+            {
+                "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
+                "weight_decay": 0.0,
+            },
+        ]
+        optimizer = AdamW(optimizer_grouped_parameters, lr=self.hparams.learning_rate, eps=self.hparams.adam_epsilon)
+        scheduler = get_linear_schedule_with_warmup(
+            optimizer, num_warmup_steps=self.hparams.warmup_steps, num_training_steps=t_total
+        )
+        self.lr_scheduler = scheduler
+        return [optimizer]
+
+    def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx, second_order_closure=None):
+
+        # Step each time.
+        optimizer.step()
+        self.lr_scheduler.step()
+        optimizer.zero_grad()
+
+    def get_tqdm_dict(self):
+        tqdm_dict = {"loss": "{:.3f}".format(self.trainer.avg_loss), "lr": self.lr_scheduler.get_last_lr()[-1]}
+
+        return tqdm_dict
+
+    def test_step(self, batch, batch_nb):
+        return self.validation_step(batch, batch_nb)
+
+    def test_end(self, outputs):
+        return self.validation_end(outputs)
+
+    @pl.data_loader
+    def train_dataloader(self):
+        return self.load_dataset("train", self.hparams.train_batch_size)
+
+    @pl.data_loader
+    def val_dataloader(self):
+        return self.load_dataset("dev", self.hparams.eval_batch_size)
+
+    @pl.data_loader
+    def test_dataloader(self):
+        return self.load_dataset("test", self.hparams.eval_batch_size)
+
+    def init_ddp_connection(self, proc_rank, world_size):
+        self.proc_rank = proc_rank
+        super(BaseTransformer, self).init_ddp_connection(proc_rank, world_size)
+
+    @staticmethod
+    def add_model_specific_args(parser, root_dir):
+        parser.add_argument(
+            "--model_type",
+            default=None,
+            type=str,
+            required=True,
+            help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()),
+        )
+        parser.add_argument(
+            "--model_name_or_path",
+            default=None,
+            type=str,
+            required=True,
+            help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS),
+        )
+        parser.add_argument(
+            "--config_name", default="", type=str, help="Pretrained config name or path if not the same as model_name"
+        )
+        parser.add_argument(
+            "--tokenizer_name",
+            default="",
+            type=str,
+            help="Pretrained tokenizer name or path if not the same as model_name",
+        )
+        parser.add_argument(
+            "--cache_dir",
+            default="",
+            type=str,
+            help="Where do you want to store the pre-trained models downloaded from s3",
+        )
+        parser.add_argument(
+            "--do_lower_case", action="store_true", help="Set this flag if you are using an uncased model."
+        )
+        parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.")
+        parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.")
+        parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.")
+        parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.")
+        parser.add_argument(
+            "--num_train_epochs", default=3, type=int, help="Total number of training epochs to perform."
+        )
+
+        parser.add_argument("--train_batch_size", default=32, type=int)
+        parser.add_argument("--eval_batch_size", default=32, type=int)
+
+
+def add_generic_args(parser, root_dir):
+    parser.add_argument(
+        "--output_dir",
+        default=None,
+        type=str,
+        required=True,
+        help="The output directory where the model predictions and checkpoints will be written.",
+    )
+
+    parser.add_argument(
+        "--fp16",
+        action="store_true",
+        help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit",
+    )
+
+    parser.add_argument(
+        "--fp16_opt_level",
+        type=str,
+        default="O1",
+        help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']. "
+        "See details at https://nvidia.github.io/apex/amp.html",
+    )
+
+    parser.add_argument("--n_gpu", type=int, default=1)
+    parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.")
+    parser.add_argument("--do_train", action="store_true", help="Whether to run training.")
+    parser.add_argument("--do_predict", action="store_true", help="Whether to run predictions on the test set.")
+    parser.add_argument(
+        "--gradient_accumulation_steps",
+        type=int,
+        default=1,
+        help="Number of update steps to accumulate before performing a backward/update pass.",
+    )
+
+    parser.add_argument("--server_ip", type=str, default="", help="For distant debugging.")
+    parser.add_argument("--server_port", type=str, default="", help="For distant debugging.")
+    parser.add_argument("--seed", type=int, default=42, help="random seed for initialization")
+
+
+def generic_train(model, args):
+    # init model
+    set_seed(args)
+
+    # Setup distant debugging if needed
+    if args.server_ip and args.server_port:
+        # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
+        import ptvsd
+
+        print("Waiting for debugger attach")
+        ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
+        ptvsd.wait_for_attach()
+
+    if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train:
+        raise ValueError("Output directory ({}) already exists and is not empty.".format(args.output_dir))
+
+    checkpoint_callback = pl.callbacks.ModelCheckpoint(
+        filepath=args.output_dir, prefix="checkpoint", monitor="val_loss", mode="min", save_top_k=5
+    )
+
+    train_params = dict(
+        accumulate_grad_batches=args.gradient_accumulation_steps,
+        gpus=args.n_gpu,
+        max_epochs=args.num_train_epochs,
+        gradient_clip_val=args.max_grad_norm,
+        checkpoint_callback=checkpoint_callback,
+    )
+    if args.fp16:
+        train_params["use_amp"] = args.fp16
+        train_params["amp_level"] = args.fp16_opt_level
+
+    if args.n_gpu > 1:
+        train_params["distributed_backend"] = "ddp"
+
+    trainer = pl.Trainer(**train_params)
+
+    if args.do_train:
+        trainer.fit(model)
+
+    return trainer
--- a/examples/ner/utils_ner.py
+++ b/examples/ner/utils_ner.py
@@ -73,7 +73,7 @@ def read_examples_from_file(data_dir, mode):
                    # Examples could have no label for mode = "test"
                    labels.append("O")
        if words:
-            examples.append(InputExample(guid="%s-%d".format(mode, guid_index), words=words, labels=labels))
+            examples.append(InputExample(guid="{}-{}".format(mode, guid_index), words=words, labels=labels))
    return examples


--- a/examples/run_generation.py
+++ b/examples/run_generation.py
@@ -59,7 +59,7 @@ MODEL_CLASSES = {
 # Padding text to help Transformer-XL and XLNet with short prompts as proposed by Aman Rusia
 # in https://github.com/rusiaaman/XLNet-gen#methodology
 # and https://medium.com/@amanrusia/xlnet-speaks-comparison-to-gpt-2-ea1a4e9ba39e
-PADDING_TEXT = """ In 1991, the remains of Russian Tsar Nicholas II and his family
+PADDING_TEXT = """In 1991, the remains of Russian Tsar Nicholas II and his family
 (except for Alexei and Maria) are discovered.
 The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the
 remainder of the story. 1883 Western Siberia,
@@ -106,6 +106,8 @@ def prepare_xlm_input(args, model, tokenizer, prompt_text):
            language = None
            while language not in available_languages:
                language = input("Using XLM. Select language in " + str(list(available_languages)) + " >>> ")
+
+        model.config.lang_id = model.config.lang2id[language]
        # kwargs["language"] = tokenizer.lang2id[language]

    # TODO fix mask_token_id setup when configurations will be synchronized between models and tokenizers
@@ -119,12 +121,12 @@ def prepare_xlm_input(args, model, tokenizer, prompt_text):

 def prepare_xlnet_input(args, _, tokenizer, prompt_text):
    prompt_text = (args.padding_text if args.padding_text else PADDING_TEXT) + prompt_text
-    return prompt_text, {}
+    return prompt_text


 def prepare_transfoxl_input(args, _, tokenizer, prompt_text):
    prompt_text = (args.padding_text if args.padding_text else PADDING_TEXT) + prompt_text
-    return prompt_text, {}
+    return prompt_text


 PREPROCESSING_FUNCTIONS = {
@@ -183,6 +185,7 @@ def main():

    parser.add_argument("--seed", type=int, default=42, help="random seed for initialization")
    parser.add_argument("--no_cuda", action="store_true", help="Avoid using CUDA when available")
+    parser.add_argument("--num_return_sequences", type=int, default=1, help="The number of samples to generate.")
    args = parser.parse_args()

    args.device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
@@ -210,28 +213,50 @@ def main():
    requires_preprocessing = args.model_type in PREPROCESSING_FUNCTIONS.keys()
    if requires_preprocessing:
        prepare_input = PREPROCESSING_FUNCTIONS.get(args.model_type)
-        prompt_text = prepare_input(args, model, tokenizer, prompt_text)
-    encoded_prompt = tokenizer.encode(prompt_text, add_special_tokens=False, return_tensors="pt")
+        preprocessed_prompt_text = prepare_input(args, model, tokenizer, prompt_text)
+        encoded_prompt = tokenizer.encode(
+            preprocessed_prompt_text, add_special_tokens=False, return_tensors="pt", add_space_before_punct_symbol=True
+        )
+    else:
+        encoded_prompt = tokenizer.encode(prompt_text, add_special_tokens=False, return_tensors="pt")
    encoded_prompt = encoded_prompt.to(args.device)

    output_sequences = model.generate(
        input_ids=encoded_prompt,
-        max_length=args.length,
+        max_length=args.length + len(encoded_prompt[0]),
        temperature=args.temperature,
        top_k=args.k,
        top_p=args.p,
        repetition_penalty=args.repetition_penalty,
        do_sample=True,
+        num_return_sequences=args.num_return_sequences,
    )

-    # Batch size == 1. to add more examples please use num_return_sequences > 1
-    generated_sequence = output_sequences[0].tolist()
-    text = tokenizer.decode(generated_sequence, clean_up_tokenization_spaces=True)
-    text = text[: text.find(args.stop_token) if args.stop_token else None]
+    # Remove the batch dimension when returning multiple sequences
+    if len(output_sequences.shape) > 2:
+        output_sequences.squeeze_()

-    print(text)
+    generated_sequences = []

-    return text
+    for generated_sequence_idx, generated_sequence in enumerate(output_sequences):
+        print("=== GENERATED SEQUENCE {} ===".format(generated_sequence_idx + 1))
+        generated_sequence = generated_sequence.tolist()
+
+        # Decode text
+        text = tokenizer.decode(generated_sequence, clean_up_tokenization_spaces=True)
+
+        # Remove all text after the stop token
+        text = text[: text.find(args.stop_token) if args.stop_token else None]
+
+        # Add the prompt at the beginning of the sequence. Remove the excess text that was used for pre-processing
+        total_sequence = (
+            prompt_text + text[len(tokenizer.decode(encoded_prompt[0], clean_up_tokenization_spaces=True)) :]
+        )
+
+        generated_sequences.append(total_sequence)
+        print(total_sequence)
+
+    return generated_sequences


 if __name__ == "__main__":
--- a/examples/run_glue.py
+++ b/examples/run_glue.py
@@ -310,7 +310,7 @@ def evaluate(args, model, tokenizer, prefix=""):
        eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)

        # multi-gpu eval
-        if args.n_gpu > 1:
+        if args.n_gpu > 1 and not isinstance(model, torch.nn.DataParallel):
            model = torch.nn.DataParallel(model)

        # Eval!
--- a/examples/run_language_modeling.py
+++ b/examples/run_language_modeling.py
@@ -86,6 +86,9 @@ MODEL_CLASSES = {
 class TextDataset(Dataset):
    def __init__(self, tokenizer: PreTrainedTokenizer, args, file_path: str, block_size=512):
        assert os.path.isfile(file_path)
+
+        block_size = block_size - (tokenizer.max_len - tokenizer.max_len_single_sentence)
+
        directory, filename = os.path.split(file_path)
        cached_features_file = os.path.join(
            directory, args.model_type + "_cached_lm_" + str(block_size) + "_" + filename
@@ -118,7 +121,7 @@ class TextDataset(Dataset):
        return len(self.examples)

    def __getitem__(self, item):
-        return torch.tensor(self.examples[item])
+        return torch.tensor(self.examples[item], dtype=torch.long)


 class LineByLineTextDataset(Dataset):
@@ -130,15 +133,15 @@ class LineByLineTextDataset(Dataset):
        logger.info("Creating features from dataset file at %s", file_path)

        with open(file_path, encoding="utf-8") as f:
-            lines = [line for line in f.read().splitlines() if len(line) > 0]
+            lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]

-        self.examples = tokenizer.batch_encode_plus(lines, max_length=block_size)["input_ids"]
+        self.examples = tokenizer.batch_encode_plus(lines, add_special_tokens=True, max_length=block_size)["input_ids"]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
-        return torch.tensor(self.examples[i])
+        return torch.tensor(self.examples[i], dtype=torch.long)


 def load_and_cache_examples(args, tokenizer, evaluate=False):
@@ -195,6 +198,12 @@ def _rotate_checkpoints(args, checkpoint_prefix="checkpoint", use_mtime=False) -

 def mask_tokens(inputs: torch.Tensor, tokenizer: PreTrainedTokenizer, args) -> Tuple[torch.Tensor, torch.Tensor]:
    """ Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. """
+
+    if tokenizer.mask_token is None:
+        raise ValueError(
+            "This tokenizer does not have a mask token which is necessary for masked language modeling. Remove the --mlm flag if you want to use this tokenizer."
+        )
+
    labels = inputs.clone()
    # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)
    probability_matrix = torch.full(labels.shape, args.mlm_probability)
@@ -704,10 +713,10 @@ def main():
        )

    if args.block_size <= 0:
-        args.block_size = tokenizer.max_len_single_sentence
+        args.block_size = tokenizer.max_len
        # Our input block size will be the max possible for the model
    else:
-        args.block_size = min(args.block_size, tokenizer.max_len_single_sentence)
+        args.block_size = min(args.block_size, tokenizer.max_len)

    if args.model_name_or_path:
        model = model_class.from_pretrained(
--- a/examples/run_squad.py
+++ b/examples/run_squad.py
@@ -38,6 +38,9 @@ from transformers import (
    BertConfig,
    BertForQuestionAnswering,
    BertTokenizer,
+    CamembertConfig,
+    CamembertForQuestionAnswering,
+    CamembertTokenizer,
    DistilBertConfig,
    DistilBertForQuestionAnswering,
    DistilBertTokenizer,
@@ -70,12 +73,16 @@ except ImportError:
 logger = logging.getLogger(__name__)

 ALL_MODELS = sum(
-    (tuple(conf.pretrained_config_archive_map.keys()) for conf in (BertConfig, RobertaConfig, XLNetConfig, XLMConfig)),
+    (
+        tuple(conf.pretrained_config_archive_map.keys())
+        for conf in (BertConfig, CamembertConfig, RobertaConfig, XLNetConfig, XLMConfig)
+    ),
    (),
 )

 MODEL_CLASSES = {
    "bert": (BertConfig, BertForQuestionAnswering, BertTokenizer),
+    "camembert": (CamembertConfig, CamembertForQuestionAnswering, CamembertTokenizer),
    "roberta": (RobertaConfig, RobertaForQuestionAnswering, RobertaTokenizer),
    "xlnet": (XLNetConfig, XLNetForQuestionAnswering, XLNetTokenizer),
    "xlm": (XLMConfig, XLMForQuestionAnswering, XLMTokenizer),
@@ -212,13 +219,18 @@ def train(args, train_dataset, model, tokenizer):
                "end_positions": batch[4],
            }

-            if args.model_type in ["xlm", "roberta", "distilbert"]:
+            if args.model_type in ["xlm", "roberta", "distilbert", "camembert"]:
                del inputs["token_type_ids"]

            if args.model_type in ["xlnet", "xlm"]:
                inputs.update({"cls_index": batch[5], "p_mask": batch[6]})
                if args.version_2_with_negative:
                    inputs.update({"is_impossible": batch[7]})
+                if hasattr(model, "config") and hasattr(model.config, "lang2id"):
+                    inputs.update(
+                        {"langs": (torch.ones(batch[0].shape, dtype=torch.int64) * args.lang_id).to(args.device)}
+                    )
+
            outputs = model(**inputs)
            # model outputs are always tuple in transformers (see doc)
            loss = outputs[0]
@@ -322,7 +334,7 @@ def evaluate(args, model, tokenizer, prefix=""):
                "token_type_ids": batch[2],
            }

-            if args.model_type in ["xlm", "roberta", "distilbert"]:
+            if args.model_type in ["xlm", "roberta", "distilbert", "camembert"]:
                del inputs["token_type_ids"]

            example_indices = batch[3]
@@ -330,6 +342,11 @@ def evaluate(args, model, tokenizer, prefix=""):
            # XLNet and XLM use more arguments for their predictions
            if args.model_type in ["xlnet", "xlm"]:
                inputs.update({"cls_index": batch[4], "p_mask": batch[5]})
+                # for lang_id-sensitive xlm models
+                if hasattr(model, "config") and hasattr(model.config, "lang2id"):
+                    inputs.update(
+                        {"langs": (torch.ones(batch[0].shape, dtype=torch.int64) * args.lang_id).to(args.device)}
+                    )

            outputs = model(**inputs)

@@ -635,6 +652,12 @@ def main():
        help="If true, all of the warnings related to data processing will be printed. "
        "A number of warnings are expected for a normal SQuAD evaluation.",
    )
+    parser.add_argument(
+        "--lang_id",
+        default=0,
+        type=int,
+        help="language id of input for language-specific xlm models (see tokenization_xlm.PRETRAINED_INIT_CONFIGURATION)",
+    )

    parser.add_argument("--logging_steps", type=int, default=500, help="Log every X updates steps.")
    parser.add_argument("--save_steps", type=int, default=500, help="Save checkpoint every X updates steps.")
--- a/examples/run_xnli.py
+++ b/examples/run_xnli.py
@@ -459,7 +459,7 @@ def main():
        help="Number of updates steps to accumulate before performing a backward/update pass.",
    )
    parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.")
-    parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight deay if we apply some.")
+    parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.")
    parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.")
    parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.")
    parser.add_argument(
--- a/examples/summarization/modeling_bertabs.py
+++ b/examples/summarization/modeling_bertabs.py
@@ -303,7 +303,7 @@ class TransformerDecoderLayer(nn.Module):
        self.layer_norm_2 = nn.LayerNorm(d_model, eps=1e-6)
        self.drop = nn.Dropout(dropout)
        mask = self._get_attn_subsequent_mask(MAX_SIZE)
-        # Register self.mask as a buffer in TransformerDecoderLayer, so
+        # Register self.mask as a saved_state in TransformerDecoderLayer, so
        # it gets TransformerDecoderLayer's cuda behavior automatically.
        self.register_buffer("mask", mask)

--- a/examples/test_examples.py
+++ b/examples/test_examples.py
@@ -97,4 +97,4 @@ class ExamplesTests(unittest.TestCase):
        model_type, model_name = ("--model_type=openai-gpt", "--model_name_or_path=openai-gpt")
        with patch.object(sys, "argv", testargs + [model_type, model_name]):
            result = run_generation.main()
-            self.assertGreaterEqual(len(result), 10)
+            self.assertGreaterEqual(len(result[0]), 10)
--- a/model_cards/KB/albert-base-swedish-cased-alpha/README.md
+++ b/model_cards/KB/albert-base-swedish-cased-alpha/README.md
@@ -0,0 +1,121 @@
+---
+language: swedish
+---
+
+# Swedish BERT Models
+
+The National Library of Sweden / KBLab releases three pretrained language models based on BERT and ALBERT. The models are trained on approximately 15-20GB of text (200M sentences, 3000M tokens) from various sources (books, news, government publications, Swedish Wikipedia and internet forums), aiming to provide a representative BERT model for Swedish text. A more complete description will be published later on.
+
+The following three models are currently available:
+
+- **bert-base-swedish-cased** (*v1*) - A BERT trained with the same hyperparameters as first published by Google.
+- **bert-base-swedish-cased-ner** (*experimental*) - a BERT fine-tuned for NER using SUC 3.0.
+- **albert-base-swedish-cased-alpha** (*alpha*) - A first attempt at an ALBERT for Swedish.
+
+All models are cased and trained with whole word masking.
+
+## Files
+
+| **name**                        | **files** |
+|---------------------------------|-----------|
+| bert-base-swedish-cased         | [config](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased/config.json), [vocab](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased/vocab.txt), [pytorch_model.bin](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased/pytorch_model.bin) |
+| bert-base-swedish-cased-ner     | [config](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased-ner/config.json), [vocab](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased-ner/vocab.txt), [pytorch_model.bin](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased-ner/pytorch_model.bin) |
+| albert-base-swedish-cased-alpha | [config](https://s3.amazonaws.com/models.huggingface.co/bert/KB/albert-base-swedish-cased-alpha/config.json), [sentencepiece model](https://s3.amazonaws.com/models.huggingface.co/bert/KB/albert-base-swedish-cased-alpha/spiece.model), [pytorch_model.bin](https://s3.amazonaws.com/models.huggingface.co/bert/KB/albert-base-swedish-cased-alpha/pytorch_model.bin) |
+
+TensorFlow model weights will be released soon.
+
+## Usage requirements / installation instructions
+
+The examples below require Huggingface Transformers 2.4.1 and Pytorch 1.3.1 or greater. For Transformers<2.4.0 the tokenizer must be instantiated manually and the `do_lower_case` flag parameter set to `False` and `keep_accents` to `True` (for ALBERT).
+
+To create an environment where the examples can be run, run the following in a terminal on your OS of choice.
+
+```
+# git clone https://github.com/Kungbib/swedish-bert-models
+# cd swedish-bert-models
+# python3 -m venv venv
+# source venv/bin/activate
+# pip install --upgrade pip
+# pip install -r requirements.txt
+```
+
+### BERT Base Swedish
+
+A standard BERT base for Swedish trained on a variety of sources. Vocabulary size is ~50k. Using Huggingface Transformers the model can be loaded in Python as follows:
+
+```python
+from transformers import AutoModel,AutoTokenizer
+
+tok = AutoTokenizer.from_pretrained('KB/bert-base-swedish-cased')
+model = AutoModel.from_pretrained('KB/bert-base-swedish-cased')
+```
+
+
+### BERT base fine-tuned for Swedish NER
+
+This model is fine-tuned on the SUC 3.0 dataset. Using the Huggingface pipeline the model can be easily instantiated. For Transformers<2.4.1 it seems the tokenizer must be loaded separately to disable lower-casing of input strings:
+
+```python
+from transformers import pipeline
+
+nlp = pipeline('ner', model='KB/bert-base-swedish-cased-ner', tokenizer='KB/bert-base-swedish-cased-ner')
+
+nlp('Idag släpper KB tre språkmodeller.')
+```
+
+Running the Python code above should produce something like the result below. Entity types used are `TME` for time, `PRS` for personal names, `LOC` for locations, `EVN` for events and `ORG` for organisations. These labels are subject to change.
+
+```python
+[ { 'word': 'Idag', 'score': 0.9998126029968262, 'entity': 'TME' },
+  { 'word': 'KB',   'score': 0.9814832210540771, 'entity': 'ORG' } ]
+```
+
+The BERT tokenizer often splits words into multiple tokens, with the subparts starting with `##`. For example, the string `Engelbert kör Volvo till Herrängens fotbollsklubb` gets tokenized as `Engel ##bert kör Volvo till Herr ##ängens fotbolls ##klubb`. To glue the parts back together one can use something like this:
+
+```python
+text = 'Engelbert tar Volvon till Tele2 Arena för att titta på Djurgården IF ' +\
+       'som spelar fotboll i VM klockan två på kvällen.'
+
+l = []
+for token in nlp(text):
+    if token['word'].startswith('##'):
+        l[-1]['word'] += token['word'][2:]
+    else:
+        l += [ token ]
+
+print(l)
+```
+
+Which should result in the following (though less cleanly formatted):
+
+```python
+[ { 'word': 'Engelbert',     'score': 0.99..., 'entity': 'PRS'},
+  { 'word': 'Volvon',        'score': 0.99..., 'entity': 'OBJ'},
+  { 'word': 'Tele2',         'score': 0.99..., 'entity': 'LOC'},
+  { 'word': 'Arena',         'score': 0.99..., 'entity': 'LOC'},
+  { 'word': 'Djurgården',    'score': 0.99..., 'entity': 'ORG'},
+  { 'word': 'IF',            'score': 0.99..., 'entity': 'ORG'},
+  { 'word': 'VM',            'score': 0.99..., 'entity': 'EVN'},
+  { 'word': 'klockan',       'score': 0.99..., 'entity': 'TME'},
+  { 'word': 'två',           'score': 0.99..., 'entity': 'TME'},
+  { 'word': 'på',            'score': 0.99..., 'entity': 'TME'},
+  { 'word': 'kvällen',       'score': 0.54..., 'entity': 'TME'} ]
+```
+
+### ALBERT base
+
+The easiest way to load the model is, again, using Huggingface Transformers:
+
+```python
+from transformers import AutoModel,AutoTokenizer
+
+tok = AutoTokenizer.from_pretrained('KB/albert-base-swedish-cased-alpha')
+model = AutoModel.from_pretrained('KB/albert-base-swedish-cased-alpha')
+```
+
+## Acknowledgements ❤️
+
+- Resources from Stockholm University, Umeå University and the Swedish Language Bank at Gothenburg University were used when fine-tuning BERT for NER.
+- Model pretraining was done partly in-house at KBLab and partly (for material without active copyright) with the support of Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
+- Models are hosted on S3 by Huggingface 🤗
+
--- a/model_cards/KB/bert-base-swedish-cased-ner/README.md
+++ b/model_cards/KB/bert-base-swedish-cased-ner/README.md
@@ -0,0 +1,121 @@
+---
+language: swedish
+---
+
+# Swedish BERT Models
+
+The National Library of Sweden / KBLab releases three pretrained language models based on BERT and ALBERT. The models are trained on approximately 15-20GB of text (200M sentences, 3000M tokens) from various sources (books, news, government publications, Swedish Wikipedia and internet forums), aiming to provide a representative BERT model for Swedish text. A more complete description will be published later on.
+
+The following three models are currently available:
+
+- **bert-base-swedish-cased** (*v1*) - A BERT trained with the same hyperparameters as first published by Google.
+- **bert-base-swedish-cased-ner** (*experimental*) - a BERT fine-tuned for NER using SUC 3.0.
+- **albert-base-swedish-cased-alpha** (*alpha*) - A first attempt at an ALBERT for Swedish.
+
+All models are cased and trained with whole word masking.
+
+## Files
+
+| **name**                        | **files** |
+|---------------------------------|-----------|
+| bert-base-swedish-cased         | [config](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased/config.json), [vocab](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased/vocab.txt), [pytorch_model.bin](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased/pytorch_model.bin) |
+| bert-base-swedish-cased-ner     | [config](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased-ner/config.json), [vocab](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased-ner/vocab.txt), [pytorch_model.bin](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased-ner/pytorch_model.bin) |
+| albert-base-swedish-cased-alpha | [config](https://s3.amazonaws.com/models.huggingface.co/bert/KB/albert-base-swedish-cased-alpha/config.json), [sentencepiece model](https://s3.amazonaws.com/models.huggingface.co/bert/KB/albert-base-swedish-cased-alpha/spiece.model), [pytorch_model.bin](https://s3.amazonaws.com/models.huggingface.co/bert/KB/albert-base-swedish-cased-alpha/pytorch_model.bin) |
+
+TensorFlow model weights will be released soon.
+
+## Usage requirements / installation instructions
+
+The examples below require Huggingface Transformers 2.4.1 and Pytorch 1.3.1 or greater. For Transformers<2.4.0 the tokenizer must be instantiated manually and the `do_lower_case` flag parameter set to `False` and `keep_accents` to `True` (for ALBERT).
+
+To create an environment where the examples can be run, run the following in a terminal on your OS of choice.
+
+```
+# git clone https://github.com/Kungbib/swedish-bert-models
+# cd swedish-bert-models
+# python3 -m venv venv
+# source venv/bin/activate
+# pip install --upgrade pip
+# pip install -r requirements.txt
+```
+
+### BERT Base Swedish
+
+A standard BERT base for Swedish trained on a variety of sources. Vocabulary size is ~50k. Using Huggingface Transformers the model can be loaded in Python as follows:
+
+```python
+from transformers import AutoModel,AutoTokenizer
+
+tok = AutoTokenizer.from_pretrained('KB/bert-base-swedish-cased')
+model = AutoModel.from_pretrained('KB/bert-base-swedish-cased')
+```
+
+
+### BERT base fine-tuned for Swedish NER
+
+This model is fine-tuned on the SUC 3.0 dataset. Using the Huggingface pipeline the model can be easily instantiated. For Transformers<2.4.1 it seems the tokenizer must be loaded separately to disable lower-casing of input strings:
+
+```python
+from transformers import pipeline
+
+nlp = pipeline('ner', model='KB/bert-base-swedish-cased-ner', tokenizer='KB/bert-base-swedish-cased-ner')
+
+nlp('Idag släpper KB tre språkmodeller.')
+```
+
+Running the Python code above should produce something like the result below. Entity types used are `TME` for time, `PRS` for personal names, `LOC` for locations, `EVN` for events and `ORG` for organisations. These labels are subject to change.
+
+```python
+[ { 'word': 'Idag', 'score': 0.9998126029968262, 'entity': 'TME' },
+  { 'word': 'KB',   'score': 0.9814832210540771, 'entity': 'ORG' } ]
+```
+
+The BERT tokenizer often splits words into multiple tokens, with the subparts starting with `##`. For example, the string `Engelbert kör Volvo till Herrängens fotbollsklubb` gets tokenized as `Engel ##bert kör Volvo till Herr ##ängens fotbolls ##klubb`. To glue the parts back together one can use something like this:
+
+```python
+text = 'Engelbert tar Volvon till Tele2 Arena för att titta på Djurgården IF ' +\
+       'som spelar fotboll i VM klockan två på kvällen.'
+
+l = []
+for token in nlp(text):
+    if token['word'].startswith('##'):
+        l[-1]['word'] += token['word'][2:]
+    else:
+        l += [ token ]
+
+print(l)
+```
+
+Which should result in the following (though less cleanly formatted):
+
+```python
+[ { 'word': 'Engelbert',     'score': 0.99..., 'entity': 'PRS'},
+  { 'word': 'Volvon',        'score': 0.99..., 'entity': 'OBJ'},
+  { 'word': 'Tele2',         'score': 0.99..., 'entity': 'LOC'},
+  { 'word': 'Arena',         'score': 0.99..., 'entity': 'LOC'},
+  { 'word': 'Djurgården',    'score': 0.99..., 'entity': 'ORG'},
+  { 'word': 'IF',            'score': 0.99..., 'entity': 'ORG'},
+  { 'word': 'VM',            'score': 0.99..., 'entity': 'EVN'},
+  { 'word': 'klockan',       'score': 0.99..., 'entity': 'TME'},
+  { 'word': 'två',           'score': 0.99..., 'entity': 'TME'},
+  { 'word': 'på',            'score': 0.99..., 'entity': 'TME'},
+  { 'word': 'kvällen',       'score': 0.54..., 'entity': 'TME'} ]
+```
+
+### ALBERT base
+
+The easiest way to load the model is, again, using Huggingface Transformers:
+
+```python
+from transformers import AutoModel,AutoTokenizer
+
+tok = AutoTokenizer.from_pretrained('KB/albert-base-swedish-cased-alpha')
+model = AutoModel.from_pretrained('KB/albert-base-swedish-cased-alpha')
+```
+
+## Acknowledgements ❤️
+
+- Resources from Stockholm University, Umeå University and the Swedish Language Bank at Gothenburg University were used when fine-tuning BERT for NER.
+- Model pretraining was done partly in-house at KBLab and partly (for material without active copyright) with the support of Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
+- Models are hosted on S3 by Huggingface 🤗
+
--- a/model_cards/KB/bert-base-swedish-cased/README.md
+++ b/model_cards/KB/bert-base-swedish-cased/README.md
@@ -0,0 +1,121 @@
+---
+language: swedish
+---
+
+# Swedish BERT Models
+
+The National Library of Sweden / KBLab releases three pretrained language models based on BERT and ALBERT. The models are trained on approximately 15-20GB of text (200M sentences, 3000M tokens) from various sources (books, news, government publications, Swedish Wikipedia and internet forums), aiming to provide a representative BERT model for Swedish text. A more complete description will be published later on.
+
+The following three models are currently available:
+
+- **bert-base-swedish-cased** (*v1*) - A BERT trained with the same hyperparameters as first published by Google.
+- **bert-base-swedish-cased-ner** (*experimental*) - a BERT fine-tuned for NER using SUC 3.0.
+- **albert-base-swedish-cased-alpha** (*alpha*) - A first attempt at an ALBERT for Swedish.
+
+All models are cased and trained with whole word masking.
+
+## Files
+
+| **name**                        | **files** |
+|---------------------------------|-----------|
+| bert-base-swedish-cased         | [config](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased/config.json), [vocab](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased/vocab.txt), [pytorch_model.bin](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased/pytorch_model.bin) |
+| bert-base-swedish-cased-ner     | [config](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased-ner/config.json), [vocab](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased-ner/vocab.txt), [pytorch_model.bin](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased-ner/pytorch_model.bin) |
+| albert-base-swedish-cased-alpha | [config](https://s3.amazonaws.com/models.huggingface.co/bert/KB/albert-base-swedish-cased-alpha/config.json), [sentencepiece model](https://s3.amazonaws.com/models.huggingface.co/bert/KB/albert-base-swedish-cased-alpha/spiece.model), [pytorch_model.bin](https://s3.amazonaws.com/models.huggingface.co/bert/KB/albert-base-swedish-cased-alpha/pytorch_model.bin) |
+
+TensorFlow model weights will be released soon.
+
+## Usage requirements / installation instructions
+
+The examples below require Huggingface Transformers 2.4.1 and Pytorch 1.3.1 or greater. For Transformers<2.4.0 the tokenizer must be instantiated manually and the `do_lower_case` flag parameter set to `False` and `keep_accents` to `True` (for ALBERT).
+
+To create an environment where the examples can be run, run the following in a terminal on your OS of choice.
+
+```
+# git clone https://github.com/Kungbib/swedish-bert-models
+# cd swedish-bert-models
+# python3 -m venv venv
+# source venv/bin/activate
+# pip install --upgrade pip
+# pip install -r requirements.txt
+```
+
+### BERT Base Swedish
+
+A standard BERT base for Swedish trained on a variety of sources. Vocabulary size is ~50k. Using Huggingface Transformers the model can be loaded in Python as follows:
+
+```python
+from transformers import AutoModel,AutoTokenizer
+
+tok = AutoTokenizer.from_pretrained('KB/bert-base-swedish-cased')
+model = AutoModel.from_pretrained('KB/bert-base-swedish-cased')
+```
+
+
+### BERT base fine-tuned for Swedish NER
+
+This model is fine-tuned on the SUC 3.0 dataset. Using the Huggingface pipeline the model can be easily instantiated. For Transformers<2.4.1 it seems the tokenizer must be loaded separately to disable lower-casing of input strings:
+
+```python
+from transformers import pipeline
+
+nlp = pipeline('ner', model='KB/bert-base-swedish-cased-ner', tokenizer='KB/bert-base-swedish-cased-ner')
+
+nlp('Idag släpper KB tre språkmodeller.')
+```
+
+Running the Python code above should produce something like the result below. Entity types used are `TME` for time, `PRS` for personal names, `LOC` for locations, `EVN` for events and `ORG` for organisations. These labels are subject to change.
+
+```python
+[ { 'word': 'Idag', 'score': 0.9998126029968262, 'entity': 'TME' },
+  { 'word': 'KB',   'score': 0.9814832210540771, 'entity': 'ORG' } ]
+```
+
+The BERT tokenizer often splits words into multiple tokens, with the subparts starting with `##`. For example, the string `Engelbert kör Volvo till Herrängens fotbollsklubb` gets tokenized as `Engel ##bert kör Volvo till Herr ##ängens fotbolls ##klubb`. To glue the parts back together one can use something like this:
+
+```python
+text = 'Engelbert tar Volvon till Tele2 Arena för att titta på Djurgården IF ' +\
+       'som spelar fotboll i VM klockan två på kvällen.'
+
+l = []
+for token in nlp(text):
+    if token['word'].startswith('##'):
+        l[-1]['word'] += token['word'][2:]
+    else:
+        l += [ token ]
+
+print(l)
+```
+
+Which should result in the following (though less cleanly formatted):
+
+```python
+[ { 'word': 'Engelbert',     'score': 0.99..., 'entity': 'PRS'},
+  { 'word': 'Volvon',        'score': 0.99..., 'entity': 'OBJ'},
+  { 'word': 'Tele2',         'score': 0.99..., 'entity': 'LOC'},
+  { 'word': 'Arena',         'score': 0.99..., 'entity': 'LOC'},
+  { 'word': 'Djurgården',    'score': 0.99..., 'entity': 'ORG'},
+  { 'word': 'IF',            'score': 0.99..., 'entity': 'ORG'},
+  { 'word': 'VM',            'score': 0.99..., 'entity': 'EVN'},
+  { 'word': 'klockan',       'score': 0.99..., 'entity': 'TME'},
+  { 'word': 'två',           'score': 0.99..., 'entity': 'TME'},
+  { 'word': 'på',            'score': 0.99..., 'entity': 'TME'},
+  { 'word': 'kvällen',       'score': 0.54..., 'entity': 'TME'} ]
+```
+
+### ALBERT base
+
+The easiest way to load the model is, again, using Huggingface Transformers:
+
+```python
+from transformers import AutoModel,AutoTokenizer
+
+tok = AutoTokenizer.from_pretrained('KB/albert-base-swedish-cased-alpha')
+model = AutoModel.from_pretrained('KB/albert-base-swedish-cased-alpha')
+```
+
+## Acknowledgements ❤️
+
+- Resources from Stockholm University, Umeå University and the Swedish Language Bank at Gothenburg University were used when fine-tuning BERT for NER.
+- Model pretraining was done partly in-house at KBLab and partly (for material without active copyright) with the support of Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
+- Models are hosted on S3 by Huggingface 🤗
+
--- a/model_cards/Musixmatch/umberto-commoncrawl-cased-v1/README.md
+++ b/model_cards/Musixmatch/umberto-commoncrawl-cased-v1/README.md
@@ -0,0 +1,118 @@
+---
+language: italian
+---
+
+# UmBERTo Commoncrawl Cased
+
+[UmBERTo](https://github.com/musixmatchresearch/umberto) is a RoBERTa-based language model trained on large Italian corpora that uses two innovative approaches: SentencePiece and Whole Word Masking. Now available at [github.com/huggingface/transformers](https://huggingface.co/Musixmatch/umberto-commoncrawl-cased-v1)
+
+<p align="center">
+    <img src="https://user-images.githubusercontent.com/7140210/72913702-d55a8480-3d3d-11ea-99fc-f2ef29af4e72.jpg" width="700"> <br>
+    Marco Lodola, Monument to Umberto Eco, Alessandria 2019
+</p>
+
+## Dataset
+UmBERTo-Commoncrawl-Cased uses the Italian subcorpus of [OSCAR](https://traces1.inria.fr/oscar/) as the training set for the language model. We used the deduplicated version of the Italian corpus, which consists of 70 GB of plain text data (210M sentences, 11B words); the sentences have been filtered and shuffled at line level so that they can be used for NLP research.
+
+## Pre-trained model
+
+| Model | WWM | Cased | Tokenizer | Vocab Size  | Train Steps |  Download |
+| ------ | ------ | ------ | ------ | ------ |------ | ------ |
+| `umberto-commoncrawl-cased-v1` | YES | YES | SPM | 32K | 125k | [Link](http://bit.ly/35zO7GH) |
+
+This model was trained with [SentencePiece](https://github.com/google/sentencepiece) and Whole Word Masking.
+
+## Downstream Tasks
+These results refer to the umberto-commoncrawl-cased model. All details are on the official [Umberto](https://github.com/musixmatchresearch/umberto) page.
+
+#### Named Entity Recognition (NER)
+
+| Dataset | F1 | Precision | Recall | Accuracy |
+| ------ | ------ | ------ |  ------ |  ------ |
+| **ICAB-EvalITA07** | **87.565**  | 86.596  | 88.556  | 98.690 | 
+| **WikiNER-ITA** | **92.531**  | 92.509 | 92.553 | 99.136 | 
+
+#### Part of Speech (POS)
+
+| Dataset | F1 | Precision | Recall | Accuracy |
+| ------ | ------ | ------ |  ------ |  ------ |
+| **UD_Italian-ISDT** | 98.870  | 98.861 | 98.879 | **98.977** | 
+| **UD_Italian-ParTUT** | 98.786 | 98.812 |  98.760 | **98.903** | 
+
+
+
+## Usage
+
+##### Load UmBERTo with AutoModel, AutoTokenizer:
+
+```python
+
+import torch
+from transformers import AutoTokenizer, AutoModel
+
+tokenizer = AutoTokenizer.from_pretrained("Musixmatch/umberto-commoncrawl-cased-v1")
+umberto = AutoModel.from_pretrained("Musixmatch/umberto-commoncrawl-cased-v1")
+
+encoded_input = tokenizer.encode("Umberto Eco è stato un grande scrittore")
+input_ids = torch.tensor(encoded_input).unsqueeze(0)  # Batch size 1
+outputs = umberto(input_ids)
+last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output
+```
+
+##### Predict masked token:
+
+```python
+from transformers import pipeline
+
+fill_mask = pipeline(
+	"fill-mask",
+	model="Musixmatch/umberto-commoncrawl-cased-v1",
+	tokenizer="Musixmatch/umberto-commoncrawl-cased-v1"
+)
+
+result = fill_mask("Umberto Eco è <mask> un grande scrittore")
+# {'sequence': '<s> Umberto Eco è considerato un grande scrittore</s>', 'score': 0.18599839508533478, 'token': 5032}
+# {'sequence': '<s> Umberto Eco è stato un grande scrittore</s>', 'score': 0.17816807329654694, 'token': 471}
+# {'sequence': '<s> Umberto Eco è sicuramente un grande scrittore</s>', 'score': 0.16565583646297455, 'token': 2654}
+# {'sequence': '<s> Umberto Eco è indubbiamente un grande scrittore</s>', 'score': 0.0932890921831131, 'token': 17908}
+# {'sequence': '<s> Umberto Eco è certamente un grande scrittore</s>', 'score': 0.054701317101716995, 'token': 5269}
+```
+
+
+## Citation
+All of the original datasets are publicly available or were released with the owners' permission. The datasets are all released under a CC0 or CC BY license.
+
+* UD Italian-ISDT Dataset [Github](https://github.com/UniversalDependencies/UD_Italian-ISDT)
+* UD Italian-ParTUT Dataset [Github](https://github.com/UniversalDependencies/UD_Italian-ParTUT)
+* I-CAB (Italian Content Annotation Bank), EvalITA [Page](http://www.evalita.it/)
+* WIKINER [Page](https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500) , [Paper](https://www.sciencedirect.com/science/article/pii/S0004370212000276?via%3Dihub)
+
+```
+@inproceedings {magnini2006annotazione,
+	title = {Annotazione di contenuti concettuali in un corpus italiano: I - CAB},
+	author = {Magnini,Bernardo and Cappelli,Amedeo and Pianta,Emanuele and Speranza,Manuela and Bartalesi Lenzi,V and Sprugnoli,Rachele and Romano,Lorenza and Girardi,Christian and Negri,Matteo},
+	booktitle = {Proc.of SILFI 2006},
+	year = {2006}
+}
+@inproceedings {magnini2006cab,
+	title = {I - CAB: the Italian Content Annotation Bank.},
+	author = {Magnini,Bernardo and Pianta,Emanuele and Girardi,Christian and Negri,Matteo and Romano,Lorenza and Speranza,Manuela and Lenzi,Valentina Bartalesi and Sprugnoli,Rachele},
+	booktitle = {LREC},
+	pages = {963--968},
+	year = {2006},
+	organization = {Citeseer}
+}
+```
+
+## Authors
+
+**Loreto Parisi**: `loreto at musixmatch dot com`, [loretoparisi](https://github.com/loretoparisi)
+**Simone Francia**: `simone.francia at musixmatch dot com`, [simonefrancia](https://github.com/simonefrancia)
+**Paolo Magnani**: `paul.magnani95 at gmail dot com`, [paulthemagno](https://github.com/paulthemagno)
+
+## About Musixmatch AI
+![Musixmatch AI mac app icon-128](https://user-images.githubusercontent.com/163333/72244273-396aa380-35ee-11ea-894b-4ea48230c02b.png)
+We do Machine Learning and Artificial Intelligence @[musixmatch](https://twitter.com/Musixmatch)
+Follow us on [Twitter](https://twitter.com/musixmatchai) and [Github](https://github.com/musixmatchresearch)
+
+
--- a/model_cards/Musixmatch/umberto-wikipedia-uncased-v1/README.md
+++ b/model_cards/Musixmatch/umberto-wikipedia-uncased-v1/README.md
@@ -0,0 +1,117 @@
+---
+language: italian
+---
+
+# UmBERTo Wikipedia Uncased
+
+[UmBERTo](https://github.com/musixmatchresearch/umberto) is a RoBERTa-based language model trained on large Italian corpora that uses two innovative approaches: SentencePiece and Whole Word Masking. Now available at [github.com/huggingface/transformers](https://huggingface.co/Musixmatch/umberto-commoncrawl-cased-v1)
+
+<p align="center">
+    <img src="https://user-images.githubusercontent.com/7140210/72913702-d55a8480-3d3d-11ea-99fc-f2ef29af4e72.jpg" width="700"> <br>
+    Marco Lodola, Monument to Umberto Eco, Alessandria 2019
+</p>
+
+## Dataset
+UmBERTo-Wikipedia-Uncased is trained on a relatively small corpus (~7GB) extracted from [Wikipedia-ITA](https://linguatools.org/tools/corpora/wikipedia-monolingual-corpora/).
+
+## Pre-trained model
+
+| Model | WWM | Cased | Tokenizer | Vocab Size  | Train Steps |  Download |
+| ------ | ------ | ------ | ------ | ------ |------ | ------ |
+| `umberto-wikipedia-uncased-v1` | YES | NO | SPM | 32K | 100k | [Link](http://bit.ly/35wbSj6) |
+
+This model was trained with [SentencePiece](https://github.com/google/sentencepiece) and Whole Word Masking.
+
+## Downstream Tasks
+These results refer to the umberto-wikipedia-uncased model. All details are on the official [Umberto](https://github.com/musixmatchresearch/umberto) page.
+
+#### Named Entity Recognition (NER)
+
+| Dataset | F1 | Precision | Recall | Accuracy |
+| ------ | ------ | ------ |  ------ |  ----- |
+| **ICAB-EvalITA07** | **86.240** | 85.939 | 86.544 | 98.534 | 
+| **WikiNER-ITA** | **90.483** | 90.328 | 90.638 | 98.661 | 
+
+#### Part of Speech (POS)
+
+| Dataset | F1 | Precision | Recall | Accuracy |
+| ------ | ------ | ------ |  ------ |  ------ |
+| **UD_Italian-ISDT** | 98.563  | 98.508 | 98.618 | **98.717** | 
+| **UD_Italian-ParTUT** | 97.810 | 97.835 |  97.784 | **98.060** | 
+
+
+
+## Usage
+
+##### Load UmBERTo Wikipedia Uncased with AutoModel and AutoTokenizer:
+
+```python
+
+import torch
+from transformers import AutoTokenizer, AutoModel
+
+tokenizer = AutoTokenizer.from_pretrained("Musixmatch/umberto-wikipedia-uncased-v1")
+umberto = AutoModel.from_pretrained("Musixmatch/umberto-wikipedia-uncased-v1")
+
+encoded_input = tokenizer.encode("Umberto Eco è stato un grande scrittore")
+input_ids = torch.tensor(encoded_input).unsqueeze(0)  # Batch size 1
+outputs = umberto(input_ids)
+last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output
+```
+
+##### Predict masked token:
+
+```python
+from transformers import pipeline
+
+fill_mask = pipeline(
+	"fill-mask",
+	model="Musixmatch/umberto-wikipedia-uncased-v1",
+	tokenizer="Musixmatch/umberto-wikipedia-uncased-v1"
+)
+
+result = fill_mask("Umberto Eco è <mask> un grande scrittore")
+# {'sequence': '<s> umberto eco è stato un grande scrittore</s>', 'score': 0.5784581303596497, 'token': 361}
+# {'sequence': '<s> umberto eco è anche un grande scrittore</s>', 'score': 0.33813193440437317, 'token': 269}
+# {'sequence': '<s> umberto eco è considerato un grande scrittore</s>', 'score': 0.027196012437343597, 'token': 3236}
+# {'sequence': '<s> umberto eco è diventato un grande scrittore</s>', 'score': 0.013716378249228, 'token': 5742}
+# {'sequence': '<s> umberto eco è inoltre un grande scrittore</s>', 'score': 0.010662357322871685, 'token': 1030}
+```
+
+
+## Citation
+All of the original datasets are publicly available or were released with the owners' permission. The datasets are all released under a CC0 or CC BY license.
+
+* UD Italian-ISDT Dataset [Github](https://github.com/UniversalDependencies/UD_Italian-ISDT)
+* UD Italian-ParTUT Dataset [Github](https://github.com/UniversalDependencies/UD_Italian-ParTUT)
+* I-CAB (Italian Content Annotation Bank), EvalITA [Page](http://www.evalita.it/)
+* WIKINER [Page](https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500), [Paper](https://www.sciencedirect.com/science/article/pii/S0004370212000276?via%3Dihub)
+
+```
+@inproceedings {magnini2006annotazione,
+	title = {Annotazione di contenuti concettuali in un corpus italiano: I - CAB},
+	author = {Magnini,Bernardo and Cappelli,Amedeo and Pianta,Emanuele and Speranza,Manuela and Bartalesi Lenzi,V and Sprugnoli,Rachele and Romano,Lorenza and Girardi,Christian and Negri,Matteo},
+	booktitle = {Proc.of SILFI 2006},
+	year = {2006}
+}
+@inproceedings {magnini2006cab,
+	title = {I - CAB: the Italian Content Annotation Bank.},
+	author = {Magnini,Bernardo and Pianta,Emanuele and Girardi,Christian and Negri,Matteo and Romano,Lorenza and Speranza,Manuela and Lenzi,Valentina Bartalesi and Sprugnoli,Rachele},
+	booktitle = {LREC},
+	pages = {963--968},
+	year = {2006},
+	organization = {Citeseer}
+}
+```
+
+## Authors
+
+**Loreto Parisi**: `loreto at musixmatch dot com`, [loretoparisi](https://github.com/loretoparisi)
+**Simone Francia**: `simone.francia at musixmatch dot com`, [simonefrancia](https://github.com/simonefrancia)
+**Paolo Magnani**: `paul.magnani95 at gmail dot com`, [paulthemagno](https://github.com/paulthemagno)
+
+## About Musixmatch AI
+![Musixmatch AI mac app icon-128](https://user-images.githubusercontent.com/163333/72244273-396aa380-35ee-11ea-894b-4ea48230c02b.png)
+We do Machine Learning and Artificial Intelligence @[musixmatch](https://twitter.com/Musixmatch)
+Follow us on [Twitter](https://twitter.com/musixmatchai) and [GitHub](https://github.com/musixmatchresearch)
+
--- a/model_cards/ahotrod/albert_xxlargev1_squad2_512/README.md
+++ b/model_cards/ahotrod/albert_xxlargev1_squad2_512/README.md
@@ -0,0 +1,90 @@
+## Albert xxlarge version 1 language model fine-tuned on SQuAD2.0
+
+### with the following results:
+
+```
+exact: 85.65653162637918
+f1: 89.260458954177
+total: 11873
+HasAns_exact: 82.6417004048583
+HasAns_f1: 89.8598902096736
+HasAns_total: 5928
+NoAns_exact: 88.66274179983179
+NoAns_f1: 88.66274179983179
+NoAns_total: 5945
+best_exact: 85.65653162637918
+best_exact_thresh: 0.0
+best_f1: 89.2604589541768
+best_f1_thresh: 0.0
+```
+
+### from script:
+
+```
+python -m torch.distributed.launch --nproc_per_node=2 ${RUN_SQUAD_DIR}/run_squad.py \
+--model_type albert \
+--model_name_or_path albert-xxlarge-v1 \
+--do_train \
+--train_file ${SQUAD_DIR}/train-v2.0.json \
+--predict_file ${SQUAD_DIR}/dev-v2.0.json \
+--version_2_with_negative \
+--num_train_epochs 3 \
+--max_steps 8144 \
+--warmup_steps 814 \
+--do_lower_case \
+--learning_rate 3e-5 \
+--max_seq_length 512 \
+--doc_stride 128 \
+--save_steps 2000 \
+--per_gpu_train_batch_size 1 \
+--gradient_accumulation_steps 24 \
+--output_dir ${MODEL_PATH}
+
+CUDA_VISIBLE_DEVICES=0 python ${RUN_SQUAD_DIR}/run_squad.py \
+--model_type albert \
+--model_name_or_path ${MODEL_PATH} \
+--do_eval \
+--train_file ${SQUAD_DIR}/train-v2.0.json \
+--predict_file ${SQUAD_DIR}/dev-v2.0.json \
+--version_2_with_negative \
+--do_lower_case \
+--max_seq_length 512 \
+--per_gpu_eval_batch_size 48 \
+--output_dir ${MODEL_PATH}
+```
+
+### using the following system & software:
+
+```
+OS/Platform: Linux-4.15.0-76-generic-x86_64-with-debian-buster-sid
+GPU/CPU: 2 x NVIDIA 1080Ti / Intel i7-8700
+Transformers: 2.3.0
+PyTorch: 1.4.0
+TensorFlow: 2.1.0
+Python: 3.7.6
+```
+
+### Inferencing / prediction works with the current Transformers v2.4.1
+
+### Access this albert_xxlargev1_sqd2_512 fine-tuned model with "tried & true" code:
+
+```python
+from transformers import AlbertConfig, AlbertForQuestionAnswering, AlbertTokenizer
+
+config_class, model_class, tokenizer_class = AlbertConfig, AlbertForQuestionAnswering, AlbertTokenizer
+model_name_or_path = "ahotrod/albert_xxlargev1_squad2_512"
+config = config_class.from_pretrained(model_name_or_path)
+tokenizer = tokenizer_class.from_pretrained(model_name_or_path, do_lower_case=True)
+model = model_class.from_pretrained(model_name_or_path, config=config)
+```
+
+### or the AutoModels (AutoConfig, AutoTokenizer & AutoModel) should also work, however I have yet to use them in my app & confirm:
+
+```python
+from transformers import AutoConfig, AutoTokenizer, AutoModel
+
+model_name_or_path = "ahotrod/albert_xxlargev1_squad2_512"
+config = AutoConfig.from_pretrained(model_name_or_path)
+tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, do_lower_case=True)
+model = AutoModel.from_pretrained(model_name_or_path, config=config)
+```
--- a/model_cards/ahotrod/xlnet_large_squad2_512/README.md
+++ b/model_cards/ahotrod/xlnet_large_squad2_512/README.md
@@ -0,0 +1,77 @@
+## XLNet large language model fine-tuned on SQuAD2.0
+
+### with the following results:
+
+```
+  "exact": 82.07698138633876,
+  "f1": 85.898874470488,
+  "total": 11873,
+  "HasAns_exact": 79.60526315789474,
+  "HasAns_f1": 87.26000954590184,
+  "HasAns_total": 5928,
+  "NoAns_exact": 84.54163162321278,
+  "NoAns_f1": 84.54163162321278,
+  "NoAns_total": 5945,
+  "best_exact": 83.22243746315169,
+  "best_exact_thresh": -11.112004280090332,
+  "best_f1": 86.88541353813282,
+  "best_f1_thresh": -11.112004280090332
+```
+### from script:
+```
+python -m torch.distributed.launch --nproc_per_node=2 ${RUN_SQUAD_DIR}/run_squad.py \
+  --model_type xlnet \
+  --model_name_or_path xlnet-large-cased \
+  --do_train \
+  --train_file ${SQUAD_DIR}/train-v2.0.json \
+  --predict_file ${SQUAD_DIR}/dev-v2.0.json \
+  --version_2_with_negative \
+  --num_train_epochs 3 \
+  --learning_rate 3e-5 \
+  --adam_epsilon 1e-6 \
+  --max_seq_length 512 \
+  --doc_stride 128 \
+  --save_steps 2000 \
+  --per_gpu_train_batch_size 1 \
+  --gradient_accumulation_steps 24 \
+  --output_dir ${MODEL_PATH}
+
+CUDA_VISIBLE_DEVICES=0 python ${RUN_SQUAD_DIR}/run_squad_II.py \
+  --model_type xlnet \
+  --model_name_or_path ${MODEL_PATH} \
+  --do_eval \
+  --train_file ${SQUAD_DIR}/train-v2.0.json \
+  --predict_file ${SQUAD_DIR}/dev-v2.0.json \
+  --version_2_with_negative \
+  --max_seq_length 512 \
+  --per_gpu_eval_batch_size 48 \
+  --output_dir ${MODEL_PATH}
+```
+### using the following system & software:
+```
+OS/Platform: Linux-4.15.0-76-generic-x86_64-with-debian-buster-sid
+GPU/CPU: 2 x NVIDIA 1080Ti / Intel i7-8700
+Transformers: 2.1.1
+PyTorch: 1.4.0
+TensorFlow: 2.1.0
+Python: 3.7.6
+```
+### Inferencing / prediction works with Transformers v2.4.1, the latest version tested
+
+### Utilize this xlnet_large_squad2_512 fine-tuned model with:
+```python
+from transformers import XLNetConfig, XLNetForQuestionAnswering, XLNetTokenizer
+config_class, model_class, tokenizer_class = XLNetConfig, XLNetForQuestionAnswering, XLNetTokenizer
+model_name_or_path = "ahotrod/xlnet_large_squad2_512"
+config = config_class.from_pretrained(model_name_or_path)
+tokenizer = tokenizer_class.from_pretrained(model_name_or_path, do_lower_case=True)
+model = model_class.from_pretrained(model_name_or_path, config=config)
+```
+### or the AutoModels (AutoConfig, AutoTokenizer & AutoModel) should also work, however I have yet to use them in my apps & confirm:
+```python
+from transformers import AutoConfig, AutoTokenizer, AutoModel
+model_name_or_path = "ahotrod/xlnet_large_squad2_512"
+config = AutoConfig.from_pretrained(model_name_or_path)
+tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, do_lower_case=True)
+model = AutoModel.from_pretrained(model_name_or_path, config=config)
+```
--- a/model_cards/bert-base-german-cased-README.md
+++ b/model_cards/bert-base-german-cased-README.md
@@ -0,0 +1,71 @@
+---
+language: german
+thumbnail: https://static.tildacdn.com/tild6438-3730-4164-b266-613634323466/german_bert.png
+---
+
+# German BERT
+![bert_image](https://static.tildacdn.com/tild6438-3730-4164-b266-613634323466/german_bert.png)
+## Overview
+**Language model:** bert-base-cased   
+**Language:** German  
+**Training data:** Wiki, OpenLegalData, News (~ 12GB)  
+**Eval data:** Conll03 (NER), GermEval14 (NER), GermEval18 (Classification), GNAD (Classification)  
+**Infrastructure:** 1x TPU v2  
+**Published:** Jun 14th, 2019
+ 
+## Details
+- We trained using Google's TensorFlow code on a single cloud TPU v2 with standard settings.
+- We trained for 810k steps with a batch size of 1024 at sequence length 128 and for 30k steps at sequence length 512. Training took about 9 days.
+- As training data we used the latest German Wikipedia dump (6GB of raw txt files), the OpenLegalData dump (2.4 GB) and news articles (3.6 GB).
+- We cleaned the data dumps with tailored scripts and segmented sentences with spaCy v2.1. To create TensorFlow records we used the recommended SentencePiece library for creating the WordPiece vocabulary and TensorFlow scripts to convert the text to data usable by BERT.
+
+See https://deepset.ai/german-bert for more details
+
+## Hyperparameters
+
+```
+batch_size = 1024
+n_steps = 810_000
+max_seq_len = 128 (and 512 later)
+learning_rate = 1e-4
+lr_schedule = LinearWarmup
+num_warmup_steps = 10_000
+```
+
+## Performance
+
+During training we monitored the loss and evaluated different model checkpoints on the following German datasets:
+
+- GermEval18Fine: Macro F1 score for multiclass sentiment classification
+- GermEval18Coarse: Macro F1 score for binary sentiment classification
+- GermEval14: Seq F1 score for NER (file names deuutf.\*)
+- CoNLL03: Seq F1 score for NER
+- 10kGNAD: Accuracy for document classification
+
+Even without thorough hyperparameter tuning, we observed quite stable learning especially for our German model. Multiple restarts with different seeds produced quite similar results.
+  
+![performancetable](https://thumb.tildacdn.com/tild3162-6462-4566-b663-376630376138/-/format/webp/Screenshot_from_2020.png)  
+
+We further evaluated different points during the 9 days of pre-training and were astonished at how fast the model converges to the maximally reachable performance. We ran all 5 downstream tasks on 7 different model checkpoints, taken at 0 up to 840k training steps (x-axis in the figure below). Most checkpoints are taken from early training, where we expected most performance changes. Surprisingly, even a randomly initialized BERT can be trained only on labeled downstream datasets and reach good performance (blue line, GermEval 2018 Coarse task, 795 kB trainset size).
+
+![checkpointseval](https://thumb.tildacdn.com/tild6335-3531-4137-b533-313365663435/-/format/webp/deepset_checkpoints.png)  
+
+## Authors
+Branden Chan: `branden.chan [at] deepset.ai`
+Timo Möller: `timo.moeller [at] deepset.ai`
+Malte Pietsch: `malte.pietsch [at] deepset.ai`
+Tanay Soni: `tanay.soni [at] deepset.ai`
+
+## About us
+![deepset logo](https://raw.githubusercontent.com/deepset-ai/FARM/master/docs/img/deepset_logo.png)
+
+We bring NLP to the industry via open source!  
+Our focus: Industry specific language models & large scale QA systems.  
+  
+Some of our work: 
+- [German BERT (aka "bert-base-german-cased")](https://deepset.ai/german-bert)
+- [FARM](https://github.com/deepset-ai/FARM)
+- [Haystack](https://github.com/deepset-ai/haystack/)
+
+Get in touch:
+[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Website](https://deepset.ai)  
--- a/model_cards/binwang/xlnet-base-cased/README.md
+++ b/model_cards/binwang/xlnet-base-cased/README.md
@@ -0,0 +1,5 @@
+This model is a pre-trained **XLNet** with 12 layers.
+
+It accompanies the paper: SBERT-WK: A Sentence Embedding Method By Dissecting BERT-based Word Models
+
+Project Page: [SBERT-WK](https://github.com/BinWang28/SBERT-WK-Sentence-Embedding)
--- a/model_cards/canwenxu/BERT-of-Theseus-MNLI/README.md
+++ b/model_cards/canwenxu/BERT-of-Theseus-MNLI/README.md
@@ -0,0 +1,20 @@
+---
+thumbnail: https://raw.githubusercontent.com/JetRunner/BERT-of-Theseus/master/bert-of-theseus.png
+---
+
+# BERT-of-Theseus
+See our paper ["BERT-of-Theseus: Compressing BERT by Progressive Module Replacing"](http://arxiv.org/abs/2002.02925).
+
+BERT-of-Theseus is a new compressed BERT obtained by progressively replacing the components of the original BERT.
+
+![BERT of Theseus](https://github.com/JetRunner/BERT-of-Theseus/blob/master/bert-of-theseus.png?raw=true)
+
+## Load Pretrained Model on MNLI
+
+We provide a 6-layer pretrained model on MNLI as a general-purpose model, which can transfer to other sentence classification tasks, outperforming DistilBERT (with the same 6-layer structure) on six tasks of GLUE (dev set).
+
+| Method          | MNLI | MRPC | QNLI | QQP  | RTE  | SST-2 | STS-B |
+|-----------------|------|------|------|------|------|-------|-------|
+| BERT-base       | 83.5 | 89.5 | 91.2 | 89.8 | 71.1 | 91.5  | 88.9  |
+| DistilBERT      | 79.0 | 87.5 | 85.3 | 84.9 | 59.9 | 90.7  | 81.2  |
+| BERT-of-Theseus | 82.1 | 87.5 | 88.8 | 88.8 | 70.1 | 91.8  | 87.8  |
--- a/model_cards/dbmdz/bert-base-german-cased/README.md
+++ b/model_cards/dbmdz/bert-base-german-cased/README.md
@@ -0,0 +1,70 @@
+---
+language: german
+---
+
+# 🤗 + 📚 dbmdz German BERT models
+
+In this repository the MDZ Digital Library team (dbmdz) at the Bavarian State
+Library open sources additional German BERT models 🎉
+
+# German BERT
+
+## Stats
+
+In addition to the recently released [German BERT](https://deepset.ai/german-bert)
+model by [deepset](https://deepset.ai/) we provide another German-language model.
+
+The source data for the model consists of a recent Wikipedia dump, EU Bookshop corpus,
+Open Subtitles, CommonCrawl, ParaCrawl and News Crawl. This results in a dataset with
+a size of 16GB and 2,350,234,427 tokens.
+
+For sentence splitting, we use [spaCy](https://spacy.io/). Our preprocessing steps
+(SentencePiece model for vocab generation) follow those used for training
+[SciBERT](https://github.com/allenai/scibert). The model was trained with an initial
+sequence length of 512 subwords for 1.5M steps.
+
+This release includes both cased and uncased models.
+
+## Model weights
+
+Currently only PyTorch-[Transformers](https://github.com/huggingface/transformers)
+compatible weights are available. If you need access to TensorFlow checkpoints,
+please raise an issue!
+
+| Model                            | Downloads
+| -------------------------------- | ---------------------------------------------------------------------------------------------------------------
+| `bert-base-german-dbmdz-cased`   | [`config.json`](https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-cased-config.json) • [`pytorch_model.bin`](https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-cased-pytorch_model.bin) • [`vocab.txt`](https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-cased-vocab.txt)
+| `bert-base-german-dbmdz-uncased` | [`config.json`](https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-uncased-config.json) • [`pytorch_model.bin`](https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-uncased-pytorch_model.bin) • [`vocab.txt`](https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-uncased-vocab.txt)
+
+## Usage
+
+With Transformers >= 2.3 our German BERT models can be loaded like:
+
+```python
+from transformers import AutoModel, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-cased")
+model = AutoModel.from_pretrained("dbmdz/bert-base-german-cased")
+```
+
+## Results
+
+For results on downstream tasks like NER or PoS tagging, please refer to
+[this repository](https://github.com/stefan-it/fine-tuned-berts-seq).
+
+# Huggingface model hub
+
+All models are available on the [Huggingface model hub](https://huggingface.co/dbmdz).
+
+# Contact (Bugs, Feedback, Contribution and more)
+
+For questions about our BERT models just open an issue
+[here](https://github.com/dbmdz/berts/issues/new) 🤗
+
+# Acknowledgments
+
+Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
+Thanks for providing access to the TFRC ❤️
+
+Thanks to the generous support from the [Hugging Face](https://huggingface.co/) team,
+it is possible to download both cased and uncased models from their S3 storage 🤗
--- a/model_cards/dbmdz/bert-base-german-europeana-cased/README.md
+++ b/model_cards/dbmdz/bert-base-german-europeana-cased/README.md
@@ -0,0 +1,61 @@
+---
+language: german
+tags:
+  - "historic german"
+---
+
+# 🤗 + 📚 dbmdz BERT models
+
+In this repository the MDZ Digital Library team (dbmdz) at the Bavarian State
+Library open sources German Europeana BERT models 🎉
+
+# German Europeana BERT
+
+We use the open source [Europeana newspapers](http://www.europeana-newspapers.eu/)
+that were provided by *The European Library*. The final
+training corpus has a size of 51GB and consists of 8,035,986,369 tokens.
+
+Detailed information about the data and pretraining steps can be found in
+[this repository](https://github.com/stefan-it/europeana-bert).
+
+## Model weights
+
+Currently only PyTorch-[Transformers](https://github.com/huggingface/transformers)
+compatible weights are available. If you need access to TensorFlow checkpoints,
+please raise an issue!
+
+| Model                                      | Downloads
+| ------------------------------------------ | ---------------------------------------------------------------------------------------------------------------
+| `dbmdz/bert-base-german-europeana-cased`   | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-german-europeana-cased/config.json)   • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-german-europeana-cased/pytorch_model.bin)   • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-german-europeana-cased/vocab.txt)
+
+## Results
+
+For results on Historic NER, please refer to [this repository](https://github.com/stefan-it/europeana-bert).
+
+## Usage
+
+With Transformers >= 2.3 our German Europeana BERT models can be loaded like:
+
+```python
+from transformers import AutoModel, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-europeana-cased")
+model = AutoModel.from_pretrained("dbmdz/bert-base-german-europeana-cased")
+```
+
+# Huggingface model hub
+
+All models are available on the [Huggingface model hub](https://huggingface.co/dbmdz).
+
+# Contact (Bugs, Feedback, Contribution and more)
+
+For questions about our BERT models just open an issue
+[here](https://github.com/dbmdz/berts/issues/new) 🤗
+
+# Acknowledgments
+
+Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
+Thanks for providing access to the TFRC ❤️
+
+Thanks to the generous support from the [Hugging Face](https://huggingface.co/) team,
+it is possible to download both cased and uncased models from their S3 storage 🤗
--- a/model_cards/dbmdz/bert-base-german-europeana-uncased/README.md
+++ b/model_cards/dbmdz/bert-base-german-europeana-uncased/README.md
@@ -0,0 +1,61 @@
+---
+language: german
+tags:
+  - "historic german"
+---
+
+# 🤗 + 📚 dbmdz BERT models
+
+In this repository the MDZ Digital Library team (dbmdz) at the Bavarian State
+Library open sources German Europeana BERT models 🎉
+
+# German Europeana BERT
+
+We use the open source [Europeana newspapers](http://www.europeana-newspapers.eu/)
+that were provided by *The European Library*. The final
+training corpus has a size of 51GB and consists of 8,035,986,369 tokens.
+
+Detailed information about the data and pretraining steps can be found in
+[this repository](https://github.com/stefan-it/europeana-bert).
+
+## Model weights
+
+Currently only PyTorch-[Transformers](https://github.com/huggingface/transformers)
+compatible weights are available. If you need access to TensorFlow checkpoints,
+please raise an issue!
+
+| Model                                      | Downloads
+| ------------------------------------------ | ---------------------------------------------------------------------------------------------------------------
+| `dbmdz/bert-base-german-europeana-uncased` | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-german-europeana-uncased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-german-europeana-uncased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-german-europeana-uncased/vocab.txt)
+
+## Results
+
+For results on Historic NER, please refer to [this repository](https://github.com/stefan-it/europeana-bert).
+
+## Usage
+
+With Transformers >= 2.3 our German Europeana BERT models can be loaded like:
+
+```python
+from transformers import AutoModel, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-europeana-uncased")
+model = AutoModel.from_pretrained("dbmdz/bert-base-german-europeana-uncased")
+```
+
+# Huggingface model hub
+
+All models are available on the [Huggingface model hub](https://huggingface.co/dbmdz).
+
+# Contact (Bugs, Feedback, Contribution and more)
+
+For questions about our BERT models just open an issue
+[here](https://github.com/dbmdz/berts/issues/new) 🤗
+
+# Acknowledgments
+
+Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
+Thanks for providing access to the TFRC ❤️
+
+Thanks to the generous support from the [Hugging Face](https://huggingface.co/) team,
+it is possible to download both cased and uncased models from their S3 storage 🤗
--- a/model_cards/dbmdz/bert-base-german-uncased/README.md
+++ b/model_cards/dbmdz/bert-base-german-uncased/README.md
@@ -0,0 +1,70 @@
+---
+language: german
+---
+
+# 🤗 + 📚 dbmdz German BERT models
+
+In this repository the MDZ Digital Library team (dbmdz) at the Bavarian State
+Library open sources additional German BERT models 🎉
+
+# German BERT
+
+## Stats
+
+In addition to the recently released [German BERT](https://deepset.ai/german-bert)
+model by [deepset](https://deepset.ai/) we provide another German-language model.
+
+The source data for the model consists of a recent Wikipedia dump, EU Bookshop corpus,
+Open Subtitles, CommonCrawl, ParaCrawl and News Crawl. This results in a dataset with
+a size of 16GB and 2,350,234,427 tokens.
+
+For sentence splitting, we use [spaCy](https://spacy.io/). Our preprocessing steps
+(SentencePiece model for vocab generation) follow those used for training
+[SciBERT](https://github.com/allenai/scibert). The model was trained with an initial
+sequence length of 512 subwords for 1.5M steps.
+
+This release includes both cased and uncased models.
+
+## Model weights
+
+Currently only PyTorch-[Transformers](https://github.com/huggingface/transformers)
+compatible weights are available. If you need access to TensorFlow checkpoints,
+please raise an issue!
+
+| Model                            | Downloads
+| -------------------------------- | ---------------------------------------------------------------------------------------------------------------
+| `bert-base-german-dbmdz-cased`   | [`config.json`](https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-cased-config.json) • [`pytorch_model.bin`](https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-cased-pytorch_model.bin) • [`vocab.txt`](https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-cased-vocab.txt)
+| `bert-base-german-dbmdz-uncased` | [`config.json`](https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-uncased-config.json) • [`pytorch_model.bin`](https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-uncased-pytorch_model.bin) • [`vocab.txt`](https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-uncased-vocab.txt)
+
+## Usage
+
+With Transformers >= 2.3 our German BERT models can be loaded like:
+
+```python
+from transformers import AutoModel, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-uncased")
+model = AutoModel.from_pretrained("dbmdz/bert-base-german-uncased")
+```
+
+## Results
+
+For results on downstream tasks like NER or PoS tagging, please refer to
+[this repository](https://github.com/stefan-it/fine-tuned-berts-seq).
+
+# Huggingface model hub
+
+All models are available on the [Huggingface model hub](https://huggingface.co/dbmdz).
+
+# Contact (Bugs, Feedback, Contribution and more)
+
+For questions about our BERT models just open an issue
+[here](https://github.com/dbmdz/berts/issues/new) 🤗
+
+# Acknowledgments
+
+Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
+Thanks for providing access to the TFRC ❤️
+
+Thanks to the generous support from the [Hugging Face](https://huggingface.co/) team,
+it is possible to download both cased and uncased models from their S3 storage 🤗
--- a/model_cards/dbmdz/bert-base-italian-cased/README.md
+++ b/model_cards/dbmdz/bert-base-italian-cased/README.md
@@ -0,0 +1,77 @@
+---
+language: italian
+---
+
+# 🤗 + 📚 dbmdz BERT models
+
+In this repository the MDZ Digital Library team (dbmdz) at the Bavarian State
+Library open sources Italian BERT models 🎉
+
+# Italian BERT
+
+The source data for the Italian BERT model consists of a recent Wikipedia dump and
+various texts from the [OPUS corpora](http://opus.nlpl.eu/) collection. The final
+training corpus has a size of 13GB and 2,050,057,573 tokens.
+
+For sentence splitting, we use NLTK (faster than spaCy).
+Our cased and uncased models are trained with an initial sequence length of 512
+subwords for ~2-3M steps.
+
+For the XXL Italian models, we use the same training data from OPUS and extend
+it with data from the Italian part of the [OSCAR corpus](https://traces1.inria.fr/oscar/).
+Thus, the final training corpus has a size of 81GB and 13,138,379,147 tokens.
+
+## Model weights
+
+Currently only PyTorch-[Transformers](https://github.com/huggingface/transformers)
+compatible weights are available. If you need access to TensorFlow checkpoints,
+please raise an issue!
+
+| Model                                   | Downloads
+| --------------------------------------- | ---------------------------------------------------------------------------------------------------------------
+| `dbmdz/bert-base-italian-cased`         | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/config.json)       • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/pytorch_model.bin)       • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/vocab.txt)
+| `dbmdz/bert-base-italian-uncased`       | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/config.json)     • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/pytorch_model.bin)     • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/vocab.txt)
+| `dbmdz/bert-base-italian-xxl-cased`     | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/config.json)   • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/pytorch_model.bin)   • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/vocab.txt)
+| `dbmdz/bert-base-italian-xxl-uncased`   | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/vocab.txt)
+
+## Results
+
+For results on downstream tasks like NER or PoS tagging, please refer to
+[this repository](https://github.com/stefan-it/fine-tuned-berts-seq).
+
+## Usage
+
+With Transformers >= 2.3 our Italian BERT models can be loaded like:
+
+```python
+from transformers import AutoModel, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-cased")
+model = AutoModel.from_pretrained("dbmdz/bert-base-italian-cased")
+```
+
+To load the (recommended) Italian XXL BERT models, just use:
+
+```python
+from transformers import AutoModel, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
+model = AutoModel.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
+```
+
+# Huggingface model hub
+
+All models are available on the [Huggingface model hub](https://huggingface.co/dbmdz).
+
+# Contact (Bugs, Feedback, Contribution and more)
+
+For questions about our BERT models just open an issue
+[here](https://github.com/dbmdz/berts/issues/new) 🤗
+
+# Acknowledgments
+
+Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
+Thanks for providing access to the TFRC ❤️
+
+Thanks to the generous support from the [Hugging Face](https://huggingface.co/) team,
+it is possible to download both cased and uncased models from their S3 storage 🤗
--- a/model_cards/dbmdz/bert-base-italian-uncased/README.md
+++ b/model_cards/dbmdz/bert-base-italian-uncased/README.md
@@ -0,0 +1,77 @@
+---
+language: italian
+---
+
+# 🤗 + 📚 dbmdz BERT models
+
+In this repository the MDZ Digital Library team (dbmdz) at the Bavarian State
+Library open sources Italian BERT models 🎉
+
+# Italian BERT
+
+The source data for the Italian BERT model consists of a recent Wikipedia dump and
+various texts from the [OPUS corpora](http://opus.nlpl.eu/) collection. The final
+training corpus has a size of 13GB and 2,050,057,573 tokens.
+
+For sentence splitting, we use NLTK (faster than spaCy).
+Our cased and uncased models are trained with an initial sequence length of 512
+subwords for ~2-3M steps.
+
+For the XXL Italian models, we use the same training data from OPUS and extend
+it with data from the Italian part of the [OSCAR corpus](https://traces1.inria.fr/oscar/).
+Thus, the final training corpus has a size of 81GB and 13,138,379,147 tokens.
+
+## Model weights
+
+Currently only PyTorch-[Transformers](https://github.com/huggingface/transformers)
+compatible weights are available. If you need access to TensorFlow checkpoints,
+please raise an issue!
+
+| Model                                   | Downloads
+| --------------------------------------- | ---------------------------------------------------------------------------------------------------------------
+| `dbmdz/bert-base-italian-cased`         | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/config.json)       • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/pytorch_model.bin)       • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/vocab.txt)
+| `dbmdz/bert-base-italian-uncased`       | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/config.json)     • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/pytorch_model.bin)     • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/vocab.txt)
+| `dbmdz/bert-base-italian-xxl-cased`     | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/config.json)   • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/pytorch_model.bin)   • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/vocab.txt)
+| `dbmdz/bert-base-italian-xxl-uncased`   | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/vocab.txt)
+
+## Results
+
+For results on downstream tasks like NER or PoS tagging, please refer to
+[this repository](https://github.com/stefan-it/fine-tuned-berts-seq).
+
+## Usage
+
+With Transformers >= 2.3 our Italian BERT models can be loaded like:
+
+```python
+from transformers import AutoModel, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-cased")
+model = AutoModel.from_pretrained("dbmdz/bert-base-italian-cased")
+```
+
+To load the (recommended) Italian XXL BERT models, just use:
+
+```python
+from transformers import AutoModel, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
+model = AutoModel.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
+```
+
+# Huggingface model hub
+
+All models are available on the [Huggingface model hub](https://huggingface.co/dbmdz).
+
+# Contact (Bugs, Feedback, Contribution and more)
+
+For questions about our BERT models just open an issue
+[here](https://github.com/dbmdz/berts/issues/new) 🤗
+
+# Acknowledgments
+
+Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
+Thanks for providing access to the TFRC ❤️
+
+Thanks to the generous support from the [Hugging Face](https://huggingface.co/) team,
+it is possible to download both cased and uncased models from their S3 storage 🤗
--- a/model_cards/dbmdz/bert-base-italian-xxl-cased/README.md
+++ b/model_cards/dbmdz/bert-base-italian-xxl-cased/README.md
@@ -0,0 +1,77 @@
+---
+language: italian
+---
+
+# 🤗 + 📚 dbmdz BERT models
+
+In this repository the MDZ Digital Library team (dbmdz) at the Bavarian State
+Library open sources Italian BERT models 🎉
+
+# Italian BERT
+
+The source data for the Italian BERT model consists of a recent Wikipedia dump and
+various texts from the [OPUS corpora](http://opus.nlpl.eu/) collection. The final
+training corpus has a size of 13GB and 2,050,057,573 tokens.
+
+For sentence splitting, we use NLTK (faster than spaCy).
+Our cased and uncased models are trained with an initial sequence length of 512
+subwords for ~2-3M steps.
+
+For the XXL Italian models, we use the same training data from OPUS and extend
+it with data from the Italian part of the [OSCAR corpus](https://traces1.inria.fr/oscar/).
+Thus, the final training corpus has a size of 81GB and 13,138,379,147 tokens.
+
+## Model weights
+
+Currently only PyTorch-[Transformers](https://github.com/huggingface/transformers)
+compatible weights are available. If you need access to TensorFlow checkpoints,
+please raise an issue!
+
+| Model                                   | Downloads
+| --------------------------------------- | ---------------------------------------------------------------------------------------------------------------
+| `dbmdz/bert-base-italian-cased`         | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/config.json)       • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/pytorch_model.bin)       • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/vocab.txt)
+| `dbmdz/bert-base-italian-uncased`       | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/config.json)     • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/pytorch_model.bin)     • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/vocab.txt)
+| `dbmdz/bert-base-italian-xxl-cased`     | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/config.json)   • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/pytorch_model.bin)   • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/vocab.txt)
+| `dbmdz/bert-base-italian-xxl-uncased`   | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/vocab.txt)
+
+## Results
+
+For results on downstream tasks like NER or PoS tagging, please refer to
+[this repository](https://github.com/stefan-it/fine-tuned-berts-seq).
+
+## Usage
+
+With Transformers >= 2.3 our Italian BERT models can be loaded like:
+
+```python
+from transformers import AutoModel, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-cased")
+model = AutoModel.from_pretrained("dbmdz/bert-base-italian-cased")
+```
+
+To load the (recommended) Italian XXL BERT models, just use:
+
+```python
+from transformers import AutoModel, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
+model = AutoModel.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
+```
+
+# Huggingface model hub
+
+All models are available on the [Huggingface model hub](https://huggingface.co/dbmdz).
+
+# Contact (Bugs, Feedback, Contribution and more)
+
+For questions about our BERT models just open an issue
+[here](https://github.com/dbmdz/berts/issues/new) 🤗
+
+# Acknowledgments
+
+Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
+Thanks for providing access to the TFRC ❤️
+
+Thanks to the generous support from the [Hugging Face](https://huggingface.co/) team,
+it is possible to download both cased and uncased models from their S3 storage 🤗
--- a/model_cards/dbmdz/bert-base-italian-xxl-uncased/README.md
+++ b/model_cards/dbmdz/bert-base-italian-xxl-uncased/README.md
@@ -0,0 +1,77 @@
+---
+language: italian
+---
+
+# 🤗 + 📚 dbmdz BERT models
+
+In this repository the MDZ Digital Library team (dbmdz) at the Bavarian State
+Library open sources Italian BERT models 🎉
+
+# Italian BERT
+
+The source data for the Italian BERT model consists of a recent Wikipedia dump and
+various texts from the [OPUS corpora](http://opus.nlpl.eu/) collection. The final
+training corpus has a size of 13GB and 2,050,057,573 tokens.
+
+For sentence splitting, we use NLTK (faster than spaCy).
+Our cased and uncased models are trained with an initial sequence length of 512
+subwords for ~2-3M steps.
+
+For the XXL Italian models, we use the same training data from OPUS and extend
+it with data from the Italian part of the [OSCAR corpus](https://traces1.inria.fr/oscar/).
+Thus, the final training corpus has a size of 81GB and 13,138,379,147 tokens.
+
+## Model weights
+
+Currently only PyTorch-[Transformers](https://github.com/huggingface/transformers)
+compatible weights are available. If you need access to TensorFlow checkpoints,
+please raise an issue!
+
+| Model                                   | Downloads
+| --------------------------------------- | ---------------------------------------------------------------------------------------------------------------
+| `dbmdz/bert-base-italian-cased`         | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/config.json)       • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/pytorch_model.bin)       • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-cased/vocab.txt)
+| `dbmdz/bert-base-italian-uncased`       | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/config.json)     • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/pytorch_model.bin)     • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-uncased/vocab.txt)
+| `dbmdz/bert-base-italian-xxl-cased`     | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/config.json)   • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/pytorch_model.bin)   • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-cased/vocab.txt)
+| `dbmdz/bert-base-italian-xxl-uncased`   | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-italian-xxl-uncased/vocab.txt)
+
+## Results
+
+For results on downstream tasks like NER or PoS tagging, please refer to
+[this repository](https://github.com/stefan-it/fine-tuned-berts-seq).
+
+## Usage
+
+With Transformers >= 2.3 our Italian BERT models can be loaded like:
+
+```python
+from transformers import AutoModel, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-cased")
+model = AutoModel.from_pretrained("dbmdz/bert-base-italian-cased")
+```
+
+To load the (recommended) Italian XXL BERT models, just use:
+
+```python
+from transformers import AutoModel, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
+model = AutoModel.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
+```
+
+# Huggingface model hub
+
+All models are available on the [Huggingface model hub](https://huggingface.co/dbmdz).
+
+# Contact (Bugs, Feedback, Contribution and more)
+
+For questions about our BERT models just open an issue
+[here](https://github.com/dbmdz/berts/issues/new) 🤗
+
+# Acknowledgments
+
+Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
+Thanks for providing access to the TFRC ❤️
+
+Thanks to the generous support from the [Hugging Face](https://huggingface.co/) team,
+it is possible to download both cased and uncased models from their S3 storage 🤗
--- a/model_cards/dbmdz/bert-base-turkish-cased/README.md
+++ b/model_cards/dbmdz/bert-base-turkish-cased/README.md
@@ -0,0 +1,74 @@
+---
+language: turkish
+---
+
+# 🤗 + 📚 dbmdz Turkish BERT model
+
+In this repository the MDZ Digital Library team (dbmdz) at the Bavarian State
+Library open sources a cased model for Turkish 🎉
+
+# 🇹🇷 BERTurk
+
+BERTurk is a community-driven cased BERT model for Turkish.
+
+Some datasets used for pretraining and evaluation were contributed by the
+awesome Turkish NLP community, which also chose the model name: BERTurk.
+
+## Stats
+
+The current version of the model is trained on a filtered and sentence
+segmented version of the Turkish [OSCAR corpus](https://traces1.inria.fr/oscar/),
+a recent Wikipedia dump, various [OPUS corpora](http://opus.nlpl.eu/) and a
+special corpus provided by [Kemal Oflazer](http://www.andrew.cmu.edu/user/ko/).
+
+The final training corpus has a size of 35GB and 4,404,976,662 tokens.
+
+Thanks to Google's TensorFlow Research Cloud (TFRC) we could train a cased model
+on a TPU v3-8 for 2M steps.
+
+## Model weights
+
+Currently only PyTorch-[Transformers](https://github.com/huggingface/transformers)
+compatible weights are available. If you need access to TensorFlow checkpoints,
+please raise an issue!
+
+| Model                             | Downloads
+| --------------------------------- | ---------------------------------------------------------------------------------------------------------------
+| `dbmdz/bert-base-turkish-cased`   | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-turkish-cased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-turkish-cased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-turkish-cased/vocab.txt)
+
+## Usage
+
+With Transformers >= 2.3 our BERTurk cased model can be loaded like:
+
+```python
+from transformers import AutoModel, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
+model = AutoModel.from_pretrained("dbmdz/bert-base-turkish-cased")
+```
+
+## Results
+
+For results on PoS tagging or NER tasks, please refer to
+[this repository](https://github.com/stefan-it/turkish-bert).
+
+# Huggingface model hub
+
+All models are available on the [Huggingface model hub](https://huggingface.co/dbmdz).
+
+# Contact (Bugs, Feedback, Contribution and more)
+
+For questions about our BERT models just open an issue
+[here](https://github.com/dbmdz/berts/issues/new) 🤗
+
+# Acknowledgments
+
+Thanks to [Kemal Oflazer](http://www.andrew.cmu.edu/user/ko/) for providing us
+additional large corpora for Turkish. Many thanks to Reyyan Yeniterzi for providing
+us the Turkish NER dataset for evaluation.
+
+Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
+Thanks for providing access to the TFRC ❤️
+
+Thanks to the generous support from the [Hugging Face](https://huggingface.co/) team,
+it is possible to download both cased and uncased models from their S3 storage 🤗
--- a/model_cards/deepset/roberta-base-squad2/README.md
+++ b/model_cards/deepset/roberta-base-squad2/README.md
@@ -0,0 +1,104 @@
+# roberta-base for QA 
+
+## Overview
+**Language model:** roberta-base  
+**Language:** English  
+**Downstream-task:** Extractive QA  
+**Training data:** SQuAD 2.0  
+**Eval data:** SQuAD 2.0  
+**Code:**  See [example](https://github.com/deepset-ai/FARM/blob/master/examples/question_answering.py) in [FARM](https://github.com/deepset-ai/FARM/blob/master/examples/question_answering.py)  
+**Infrastructure:** 4x Tesla V100
+
+## Hyperparameters
+
+```
+batch_size = 50
+n_epochs = 3
+base_LM_model = "roberta-base"
+max_seq_len = 384
+learning_rate = 3e-5
+lr_schedule = LinearWarmup
+warmup_proportion = 0.2
+doc_stride=128
+max_query_length=64
+``` 
+
+## Performance
+Evaluated on the SQuAD 2.0 dev set with the [official eval script](https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/).
+```
+"exact": 78.49743114629833,
+"f1": 81.73092721240889
+```
+
+## Usage
+
+### In Transformers
+```python
+from transformers.pipelines import pipeline
+from transformers.modeling_auto import AutoModelForQuestionAnswering
+from transformers.tokenization_auto import AutoTokenizer
+
+model_name = "deepset/roberta-base-squad2"
+
+# a) Get predictions
+nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)
+QA_input = {
+    'question': 'Why is model conversion important?',
+    'context': 'The option to convert models between FARM and transformers gives freedom to the user and lets people easily switch between frameworks.'
+}
+res = nlp(QA_input)
+
+# b) Load model & tokenizer
+model = AutoModelForQuestionAnswering.from_pretrained(model_name)
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+```
+
+### In FARM
+
+```python
+from farm.modeling.adaptive_model import AdaptiveModel
+from farm.modeling.tokenization import Tokenizer
+from farm.infer import Inferencer
+
+model_name = "deepset/roberta-base-squad2"
+
+# a) Get predictions
+nlp = Inferencer.load(model_name, task_type="question_answering")
+QA_input = [{"questions": ["Why is model conversion important?"],
+             "text": "The option to convert models between FARM and transformers gives freedom to the user and lets people easily switch between frameworks."}]
+res = nlp.inference_from_dicts(dicts=QA_input, rest_api_schema=True)
+
+# b) Load model & tokenizer
+model = AdaptiveModel.convert_from_transformers(model_name, device="cpu", task_type="question_answering")
+tokenizer = Tokenizer.load(model_name)
+```
+
+### In haystack
+To do QA at scale (i.e. many docs instead of a single paragraph), you can also load the model in [haystack](https://github.com/deepset-ai/haystack/):
+```python
+reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")
+# or 
+reader = TransformersReader(model="deepset/roberta-base-squad2",tokenizer="deepset/roberta-base-squad2")
+```
+
+
+## Authors
+Branden Chan: `branden.chan [at] deepset.ai`
+Timo Möller: `timo.moeller [at] deepset.ai`
+Malte Pietsch: `malte.pietsch [at] deepset.ai`
+Tanay Soni: `tanay.soni [at] deepset.ai`
+
+## About us
+![deepset logo](https://raw.githubusercontent.com/deepset-ai/FARM/master/docs/img/deepset_logo.png)
+
+We bring NLP to the industry via open source!  
+Our focus: Industry specific language models & large scale QA systems.  
+  
+Some of our work: 
+- [German BERT (aka "bert-base-german-cased")](https://deepset.ai/german-bert)
+- [FARM](https://github.com/deepset-ai/FARM)
+- [Haystack](https://github.com/deepset-ai/haystack/)
+
+Get in touch:
+[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Website](https://deepset.ai)
+
--- a/model_cards/fmikaelian/flaubert-base-uncased-squad/README.md
+++ b/model_cards/fmikaelian/flaubert-base-uncased-squad/README.md
@@ -0,0 +1,48 @@
+---
+language: french
+---
+
+# flaubert-base-uncased-squad
+
+## Description
+
+A baseline model for question answering in French: the [flaubert](https://github.com/getalp/Flaubert) model fine-tuned on the [French-translated SQuAD 1.1 dataset](https://github.com/Alikabbadj/French-SQuAD).
+
+## Training hyperparameters
+
+```shell
+python3 ./examples/run_squad.py \
+--model_type flaubert \
+--model_name_or_path flaubert-base-uncased \
+--do_train \
+--do_eval \
+--do_lower_case \
+--train_file SQuAD-v1.1-train_fr_ss999_awstart2_net.json \
+--predict_file SQuAD-v1.1-dev_fr_ss999_awstart2_net.json \
+--learning_rate 3e-5 \
+--num_train_epochs 2 \
+--max_seq_length 384 \
+--doc_stride 128 \
+--output_dir output \
+--per_gpu_eval_batch_size=3 \
+--per_gpu_train_batch_size=3
+``` 
+
+## Evaluation results
+
+```json
+{"f1": 68.66174806561969, "exact_match": 49.299692063176714}
+```
+
+## Usage
+
+```python
+from transformers import pipeline
+
+nlp = pipeline('question-answering', model='fmikaelian/flaubert-base-uncased-squad', tokenizer='fmikaelian/flaubert-base-uncased-squad')
+
+nlp({
+    'question': "Qui est Claude Monet?",
+    'context': "Claude Monet, né le 14 novembre 1840 à Paris et mort le 5 décembre 1926 à Giverny, est un peintre français et l’un des fondateurs de l'impressionnisme."
+})
+```
--- a/model_cards/henryk/bert-base-multilingual-cased-finetuned-dutch-squad2/README.md
+++ b/model_cards/henryk/bert-base-multilingual-cased-finetuned-dutch-squad2/README.md
@@ -0,0 +1,50 @@
+---
+language: dutch
+---
+
+# Multilingual + Dutch SQuAD2.0
+
+This model is the multilingual model provided by the Google research team, fine-tuned on a Dutch Q&A downstream task.
+
+## Details of the language model (bert-base-multilingual-cased)
+
+Language model ([**bert-base-multilingual-cased**](https://github.com/google-research/bert/blob/master/multilingual.md)):
+12-layer, 768-hidden, 12-heads, 110M parameters.
+Trained on cased text in the top 104 languages with the largest Wikipedias.
+
+## Details of the downstream task - Dataset
+Using the `mtranslate` Python module, [**SQuAD2.0**](https://rajpurkar.github.io/SQuAD-explorer/) was machine-translated. To find the answer start tokens, the direct translations of the answers were searched for in the corresponding paragraphs. Because the translation of an answer in isolation can differ from its translation in context, the answer could not always be found in the text, and some question-answer examples were lost. This is a potential source of errors in the dataset (but in the end it was a quick and dirty solution that worked well enough for my task).
+
+| Dataset                | # Q&A |
+| ---------------------- | ----- |
+| SQuAD2.0 Train         | 130 K |
+| Dutch SQuAD2.0 Train   | 99  K |
+| SQuAD2.0 Dev           | 12  K |
+| Dutch SQuAD2.0 Dev     | 10  K |
+
+## Model training
+
+The model was trained on a Tesla V100 GPU with the following command:
+
+```shell
+export SQUAD_DIR=path/to/nl_squad
+
+python run_squad.py \
+  --model_type bert \
+  --model_name_or_path bert-base-multilingual-cased \
+  --version_2_with_negative \
+  --do_train \
+  --do_eval \
+  --train_file $SQUAD_DIR/train_nl-v2.0.json \
+  --predict_file $SQUAD_DIR/dev_nl-v2.0.json \
+  --per_gpu_train_batch_size 12 \
+  --learning_rate 3e-5 \
+  --num_train_epochs 2.0 \
+  --max_seq_length 384 \
+  --doc_stride 128 \
+  --output_dir /tmp/output_dir/
+```
+
+**Results**:
+
+{'exact': **67.38**, 'f1': **71.36**} 
--- a/model_cards/jplu/tf-camembert-base/README.md
+++ b/model_cards/jplu/tf-camembert-base/README.md
@@ -0,0 +1,31 @@
+# TensorFlow CamemBERT
+
+In this repository you will find different versions of the CamemBERT model for TensorFlow.
+
+## CamemBERT
+
+[CamemBERT](https://camembert-model.fr/) is a state-of-the-art language model for French based on the RoBERTa architecture pretrained on the French subcorpus of the newly available multilingual corpus OSCAR.
+
+## Model Weights
+
+| Model                            | Downloads
+| -------------------------------- | ---------------------------------------------------------------------------------------------------------------
+| `jplu/tf-camembert-base`   | [`config.json`](https://s3.amazonaws.com/models.huggingface.co/bert/jplu/tf-camembert-base/config.json) • [`tf_model.h5`](https://s3.amazonaws.com/models.huggingface.co/bert/jplu/tf-camembert-base/tf_model.h5)
+
+## Usage
+
+With Transformers >= 2.4 the TensorFlow models of CamemBERT can be loaded like:
+
+```python
+from transformers import TFCamembertModel
+
+model = TFCamembertModel.from_pretrained("jplu/tf-camembert-base")
+```
+
+## Huggingface model hub
+
+All models are available on the [Huggingface model hub](https://huggingface.co/jplu).
+
+## Acknowledgments
+
+Thanks to the whole Hugging Face team for their support and their amazing library!
--- a/model_cards/jplu/tf-xlm-roberta-base/README.md
+++ b/model_cards/jplu/tf-xlm-roberta-base/README.md
@@ -0,0 +1,36 @@
+# TensorFlow XLM-RoBERTa
+
+In this repository you will find different versions of the XLM-RoBERTa model for TensorFlow.
+
+## XLM-RoBERTa
+
+[XLM-RoBERTa](https://ai.facebook.com/blog/-xlm-r-state-of-the-art-cross-lingual-understanding-through-self-supervision/) is a scaled cross-lingual sentence encoder. It is trained on 2.5TB of data across 100 languages, filtered from Common Crawl. XLM-R achieves state-of-the-art results on multiple cross-lingual benchmarks.
+
+## Model Weights
+
+| Model                            | Downloads
+| -------------------------------- | ---------------------------------------------------------------------------------------------------------------
+| `jplu/tf-xlm-roberta-base`   | [`config.json`](https://s3.amazonaws.com/models.huggingface.co/bert/jplu/tf-xlm-roberta-base/config.json) • [`tf_model.h5`](https://s3.amazonaws.com/models.huggingface.co/bert/jplu/tf-xlm-roberta-base/tf_model.h5)
+| `jplu/tf-xlm-roberta-large`   | [`config.json`](https://s3.amazonaws.com/models.huggingface.co/bert/jplu/tf-xlm-roberta-large/config.json) • [`tf_model.h5`](https://s3.amazonaws.com/models.huggingface.co/bert/jplu/tf-xlm-roberta-large/tf_model.h5)
+
+## Usage
+
+With Transformers >= 2.4 the TensorFlow models of XLM-RoBERTa can be loaded like:
+
+```python
+from transformers import TFXLMRobertaModel
+
+model = TFXLMRobertaModel.from_pretrained("jplu/tf-xlm-roberta-base")
+```
+Or:
+```python
+model = TFXLMRobertaModel.from_pretrained("jplu/tf-xlm-roberta-large")
+```
+
+## Huggingface model hub
+
+All models are available on the [Huggingface model hub](https://huggingface.co/jplu).
+
+## Acknowledgments
+
+Thanks to the whole Hugging Face team for their support and their amazing library!
--- a/model_cards/jplu/tf-xlm-roberta-large/README.md
+++ b/model_cards/jplu/tf-xlm-roberta-large/README.md
@@ -0,0 +1,36 @@
+# TensorFlow XLM-RoBERTa
+
+In this repository you will find different versions of the XLM-RoBERTa model for TensorFlow.
+
+## XLM-RoBERTa
+
+[XLM-RoBERTa](https://ai.facebook.com/blog/-xlm-r-state-of-the-art-cross-lingual-understanding-through-self-supervision/) is a scaled cross-lingual sentence encoder. It is trained on 2.5TB of data across 100 languages, filtered from Common Crawl. XLM-R achieves state-of-the-art results on multiple cross-lingual benchmarks.
+
+## Model Weights
+
+| Model                            | Downloads
+| -------------------------------- | ---------------------------------------------------------------------------------------------------------------
+| `jplu/tf-xlm-roberta-base`   | [`config.json`](https://s3.amazonaws.com/models.huggingface.co/bert/jplu/tf-xlm-roberta-base/config.json) • [`tf_model.h5`](https://s3.amazonaws.com/models.huggingface.co/bert/jplu/tf-xlm-roberta-base/tf_model.h5)
+| `jplu/tf-xlm-roberta-large`   | [`config.json`](https://s3.amazonaws.com/models.huggingface.co/bert/jplu/tf-xlm-roberta-large/config.json) • [`tf_model.h5`](https://s3.amazonaws.com/models.huggingface.co/bert/jplu/tf-xlm-roberta-large/tf_model.h5)
+
+## Usage
+
+With Transformers >= 2.4 the TensorFlow models of XLM-RoBERTa can be loaded like:
+
+```python
+from transformers import TFXLMRobertaModel
+
+model = TFXLMRobertaModel.from_pretrained("jplu/tf-xlm-roberta-base")
+```
+Or:
+```python
+model = TFXLMRobertaModel.from_pretrained("jplu/tf-xlm-roberta-large")
+```
+
+## Huggingface model hub
+
+All models are available on the [Huggingface model hub](https://huggingface.co/jplu).
+
+## Acknowledgments
+
+Thanks to the whole Hugging Face team for their support and their amazing library!
--- a/model_cards/julien-c/EsperBERTo-small-pos/README.md
+++ b/model_cards/julien-c/EsperBERTo-small-pos/README.md
@@ -0,0 +1,40 @@
+---
+language: esperanto
+thumbnail: https://huggingface.co/blog/assets/EsperBERTo-thumbnail-v2.png
+---
+
+# EsperBERTo: RoBERTa-like Language model trained on Esperanto
+
+**Companion model to blog post https://huggingface.co/blog/how-to-train** 🔥
+
+## Training Details
+
+- current checkpoint: 566000
+- machine name: `galinette`
+
+
+![](https://huggingface.co/blog/assets/EsperBERTo-thumbnail-v2.png)
+
+## Example pipeline
+
+```python
+from transformers import TokenClassificationPipeline, pipeline
+
+
+MODEL_PATH = "./models/EsperBERTo-small-pos/"
+
+nlp = pipeline(
+    "ner",
+    model=MODEL_PATH,
+    tokenizer=MODEL_PATH,
+)
+# or instantiate a TokenClassificationPipeline directly.
+
+nlp("Mi estas viro kej estas tago varma.")
+
+# {'entity': 'PRON', 'score': 0.9979867339134216, 'word': ' Mi'}
+# {'entity': 'VERB', 'score': 0.9683094620704651, 'word': ' estas'}
+# {'entity': 'VERB', 'score': 0.9797462821006775, 'word': ' estas'}
+# {'entity': 'NOUN', 'score': 0.8509314060211182, 'word': ' tago'}
+# {'entity': 'ADJ', 'score': 0.9996201395988464, 'word': ' varma'}
+```
--- a/model_cards/julien-c/EsperBERTo-small/README.md
+++ b/model_cards/julien-c/EsperBERTo-small/README.md
@@ -0,0 +1,59 @@
+---
+language: esperanto
+thumbnail: https://huggingface.co/blog/assets/EsperBERTo-thumbnail-v2.png
+---
+
+# EsperBERTo: RoBERTa-like Language model trained on Esperanto
+
+**Companion model to blog post https://huggingface.co/blog/how-to-train** 🔥
+
+## Training Details
+
+- current checkpoint: 566000
+- machine name: `galinette`
+
+
+![](https://huggingface.co/blog/assets/EsperBERTo-thumbnail-v2.png)
+
+## Example pipeline
+
+```python
+from transformers import pipeline
+
+fill_mask = pipeline(
+    "fill-mask",
+    model="julien-c/EsperBERTo-small",
+    tokenizer="julien-c/EsperBERTo-small"
+)
+
+fill_mask("Jen la komenco de bela <mask>.")
+
+# This is the beginning of a beautiful <mask>.
+# =>
+
+# {
+#     'score':0.06502299010753632
+#     'sequence':'<s> Jen la komenco de bela vivo.</s>'
+#     'token':1099
+# }
+# {
+#     'score':0.0421181358397007
+#     'sequence':'<s> Jen la komenco de bela vespero.</s>'
+#     'token':5100
+# }
+# {
+#     'score':0.024884626269340515
+#     'sequence':'<s> Jen la komenco de bela laboro.</s>'
+#     'token':1570
+# }
+# {
+#     'score':0.02324388362467289
+#     'sequence':'<s> Jen la komenco de bela tago.</s>'
+#     'token':1688
+# }
+# {
+#     'score':0.020378097891807556
+#     'sequence':'<s> Jen la komenco de bela festo.</s>'
+#     'token':4580
+# }
+```
--- a/model_cards/julien-c/bert-xsmall-dummy/README.md
+++ b/model_cards/julien-c/bert-xsmall-dummy/README.md
@@ -0,0 +1,25 @@
+## How to build a dummy model
+
+
+```python
+from transformers.configuration_bert import BertConfig
+from transformers.modeling_bert import BertForMaskedLM
+from transformers.modeling_tf_bert import TFBertForMaskedLM
+from transformers.tokenization_bert import BertTokenizer
+
+
+SMALL_MODEL_IDENTIFIER = "julien-c/bert-xsmall-dummy"
+DIRNAME = "./bert-xsmall-dummy"
+
+config = BertConfig(10, 20, 1, 1, 40)
+
+model = BertForMaskedLM(config)
+model.save_pretrained(DIRNAME)
+
+tf_model = TFBertForMaskedLM.from_pretrained(DIRNAME, from_pt=True)
+tf_model.save_pretrained(DIRNAME)
+
+# Slightly different for tokenizer.
+# tokenizer = BertTokenizer.from_pretrained(DIRNAME)
+# tokenizer.save_pretrained(DIRNAME)
+```
--- a/model_cards/julien-c/dummy-unknown/README.md
+++ b/model_cards/julien-c/dummy-unknown/README.md
@@ -0,0 +1,59 @@
+---
+tags:
+- ci
+---
+
+## Dummy model used for unit testing and CI
+
+
+```python
+import json
+import os
+from transformers.configuration_roberta import RobertaConfig
+from transformers import RobertaForMaskedLM, TFRobertaForMaskedLM
+
+DIRNAME = "./dummy-unknown"
+
+
+config = RobertaConfig(10, 20, 1, 1, 40)
+
+model = RobertaForMaskedLM(config)
+model.save_pretrained(DIRNAME)
+
+tf_model = TFRobertaForMaskedLM.from_pretrained(DIRNAME, from_pt=True)
+tf_model.save_pretrained(DIRNAME)
+
+# Tokenizer:
+
+vocab = [
+    "l",
+    "o",
+    "w",
+    "e",
+    "r",
+    "s",
+    "t",
+    "i",
+    "d",
+    "n",
+    "\u0120",
+    "\u0120l",
+    "\u0120n",
+    "\u0120lo",
+    "\u0120low",
+    "er",
+    "\u0120lowest",
+    "\u0120newer",
+    "\u0120wider",
+    "<unk>",
+]
+vocab_tokens = dict(zip(vocab, range(len(vocab))))
+merges = ["#version: 0.2", "\u0120 l", "\u0120l o", "\u0120lo w", "e r", ""]
+
+vocab_file = os.path.join(DIRNAME, "vocab.json")
+merges_file = os.path.join(DIRNAME, "merges.txt")
+with open(vocab_file, "w", encoding="utf-8") as fp:
+    fp.write(json.dumps(vocab_tokens) + "\n")
+with open(merges_file, "w", encoding="utf-8") as fp:
+    fp.write("\n".join(merges))
+```
--- a/model_cards/lysandre/arxiv-nlp/README.md
+++ b/model_cards/lysandre/arxiv-nlp/README.md
@@ -0,0 +1,7 @@
+# ArXiv-NLP GPT-2 checkpoint
+
+This is a GPT-2 small checkpoint for PyTorch. It is the official `gpt2-small` fine-tuned on ArXiv papers from the computational linguistics field.
+
+## Training data
+
+This model was trained on a subset of ArXiv papers that were parsed from PDF to txt. The resulting data is made of 80MB of text from the computational linguistics (cs.CL) field.
--- a/model_cards/lysandre/arxiv/README.md
+++ b/model_cards/lysandre/arxiv/README.md
@@ -0,0 +1,7 @@
+# ArXiv GPT-2 checkpoint
+
+This is a GPT-2 small checkpoint for PyTorch. It is the official `gpt2-small` fine-tuned on ArXiv papers from physics fields.
+
+## Training data
+
+This model was trained on a subset of ArXiv papers that were parsed from PDF to txt. The resulting data is made of 130MB of text, mostly from quantum physics (quant-ph) and other physics sub-fields.
--- a/model_cards/mrm8488/bert-base-spanish-wwm-cased-finetuned-spa-squad2-es/README.md
+++ b/model_cards/mrm8488/bert-base-spanish-wwm-cased-finetuned-spa-squad2-es/README.md
@@ -0,0 +1,91 @@
+---
+language: spanish
+thumbnail: https://i.imgur.com/jgBdimh.png
+---
+
+# BETO (Spanish BERT) + Spanish SQuAD2.0
+
+This model is provided by the [BETO team](https://github.com/dccuchile/beto) and fine-tuned on [SQuAD-es-v2.0](https://github.com/ccasimiro88/TranslateAlignRetrieve) for the **Q&A** downstream task.
+
+## Details of the language model('dccuchile/bert-base-spanish-wwm-cased')
+
+Language model ([**'dccuchile/bert-base-spanish-wwm-cased'**](https://github.com/dccuchile/beto/blob/master/README.md)):
+
+BETO is a [BERT model](https://github.com/google-research/bert) trained on a [big Spanish corpus](https://github.com/josecannete/spanish-corpora). BETO is of size similar to a BERT-Base and was trained with the Whole Word Masking technique. Below you find Tensorflow and Pytorch checkpoints for the uncased and cased versions, as well as some results for Spanish benchmarks comparing BETO with [Multilingual BERT](https://github.com/google-research/bert/blob/master/multilingual.md) as well as other (not BERT-based) models.
+
+## Details of the downstream task (Q&A) - Dataset
+[SQuAD-es-v2.0](https://github.com/ccasimiro88/TranslateAlignRetrieve)
+
+| Dataset                | # Q&A |
+| ---------------------- | ----- |
+| SQuAD2.0 Train         | 130 K |
+| SQuAD2.0-es-v2.0       | 111 K |
+| SQuAD2.0 Dev           | 12  K |
+| SQuAD-es-v2.0-small Dev| 69  K |
+
+## Model training
+
+The model was trained on a Tesla P100 GPU and 25GB of RAM with the following command:
+
+```bash
+export SQUAD_DIR=path/to/nl_squad
+python transformers/examples/run_squad.py \
+  --model_type bert \
+  --model_name_or_path dccuchile/bert-base-spanish-wwm-cased \
+  --do_train \
+  --do_eval \
+  --do_lower_case \
+  --train_file $SQUAD_DIR/train_nl-v2.0.json \
+  --predict_file $SQUAD_DIR/dev_nl-v2.0.json \
+  --per_gpu_train_batch_size 12 \
+  --learning_rate 3e-5 \
+  --num_train_epochs 2.0 \
+  --max_seq_length 384 \
+  --doc_stride 128 \
+  --output_dir /content/model_output \
+  --save_steps 5000 \
+  --threads 4 \
+  --version_2_with_negative 
+```
+
+## Results:
+
+
+| Metric    | # Value     |
+| --------- | ----------- |
+| **Exact** | **76.50**50 |
+| **F1**    | **86.07**81 |
+
+```json
+{
+  "exact": 76.50501430594491,
+  "f1": 86.07818773108252,
+  "total": 69202,
+  "HasAns_exact": 67.93020719738277,
+  "HasAns_f1": 82.37912207996466,
+  "HasAns_total": 45850,
+  "NoAns_exact": 93.34104145255225,
+  "NoAns_f1": 93.34104145255225,
+  "NoAns_total": 23352,
+  "best_exact": 76.51223953064941,
+  "best_exact_thresh": 0.0,
+  "best_f1": 86.08541295578848,
+  "best_f1_thresh": 0.0
+}
+```
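The overall `exact` and `f1` figures in the JSON above are consistent with a count-weighted average of the `HasAns` and `NoAns` subsets. A small sketch of that cross-check follows. The helper name `weighted_average` is made up for illustration, and the numbers are copied from the results block.

```python
# Cross-check: overall score = count-weighted mean of the HasAns / NoAns subsets.
# `weighted_average` is an illustrative helper, not part of run_squad.py.

results = {
    "exact": 76.50501430594491,
    "f1": 86.07818773108252,
    "total": 69202,
    "HasAns_exact": 67.93020719738277,
    "HasAns_f1": 82.37912207996466,
    "HasAns_total": 45850,
    "NoAns_exact": 93.34104145255225,
    "NoAns_f1": 93.34104145255225,
    "NoAns_total": 23352,
}


def weighted_average(results, metric):
    has_n = results["HasAns_total"]
    no_n = results["NoAns_total"]
    weighted_sum = results["HasAns_" + metric] * has_n + results["NoAns_" + metric] * no_n
    return weighted_sum / (has_n + no_n)


overall_exact = weighted_average(results, "exact")
overall_f1 = weighted_average(results, "f1")
print(round(overall_exact, 4), round(overall_f1, 4))
```

The gap between the two subsets shows that most of the remaining error comes from questions that do have an answer.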
+
+### Model in action (in a Colab Notebook)
+<details>
+
+1.  Set the context and ask some questions:
+
+![Set context and questions](https://media.giphy.com/media/mCIaBpfN0LQcuzkA2F/giphy.gif)
+
+2.  Run predictions:
+
+![Run the model](https://media.giphy.com/media/WT453aptcbCP7hxWTZ/giphy.gif)
+</details>
+
+> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
+
+> Made with <span style="color: #e25555;">&hearts;</span> in Spain
--- a/model_cards/mrm8488/bert-spanish-cased-finetuned-ner/README.md
+++ b/model_cards/mrm8488/bert-spanish-cased-finetuned-ner/README.md
@@ -0,0 +1,58 @@
+---
+language: spanish
+thumbnail: https://i.imgur.com/jgBdimh.png
+---
+
+# Spanish BERT (BETO) + NER
+
+This model is a fine-tuned version of the Spanish BERT cased [(BETO)](https://github.com/dccuchile/beto) on [NER-C](https://www.kaggle.com/nltkdata/conll-corpora) for the **NER** downstream task.
+
+## Details of the downstream task (NER) - Dataset
+
+- [Dataset:  CONLL Corpora ES](https://www.kaggle.com/nltkdata/conll-corpora) 
+
+I preprocessed the dataset and split it into train / dev (80/20)
+
+| Dataset                | # Examples |
+| ---------------------- | ----- |
+| Train                  | 8.7 K |
+| Dev                    | 2.2 K |
+
+
+- [Fine-tune on NER script](https://github.com/huggingface/transformers/blob/master/examples/run_ner.py)
+
+```bash
+!export NER_DIR='/content/ner_dataset'
+!python /content/transformers/examples/run_ner.py \
+  --model_type bert \
+  --model_name_or_path dccuchile/bert-base-spanish-wwm-cased \
+  --do_train \
+  --do_eval \
+  --data_dir '/content/ner_dataset' \
+  --num_train_epochs 15.0 \
+  --max_seq_length 384 \
+  --output_dir /content/model_output \
+  --save_steps 5000
+
+```
+
+## Comparison:
+
+|                                                      Model                                                       |  # score  |
+| :--------------------------------------------------------------------------------------------------------------: | :-------: |
+|                                        bert-base-spanish-wwm-cased (BETO)                                        |   88.43   |
+| [bert-spanish-cased-finetuned-ner (this one)](https://huggingface.co/mrm8488/bert-spanish-cased-finetuned-ner) | **89.65** |
+|                                              Best Multilingual BERT                                              |   87.38   |
+
+```
+ ***** All metrics on Eval results  *****
+
+f1 = 0.8965040489828165
+loss = 0.11504213575173258
+precision = 0.893679858239811
+recall = 0.8993461462254805
+```
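The `f1` value in the eval metrics above is the harmonic mean of the reported `precision` and `recall`. A minimal sketch of that relationship, using the numbers from the log (the function name `f1_score` is only illustrative here):

```python
# F1 as the harmonic mean of precision and recall, checked against the eval log.

def f1_score(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


precision = 0.893679858239811
recall = 0.8993461462254805
f1 = f1_score(precision, recall)
print(round(f1, 4))
```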
+
+> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
+
+> Made with <span style="color: #e25555;">&hearts;</span> in Spain
--- a/model_cards/mrm8488/bert-spanish-cased-finetuned-pos/README.md
+++ b/model_cards/mrm8488/bert-spanish-cased-finetuned-pos/README.md
@@ -0,0 +1,80 @@
+---
+language: spanish
+thumbnail: https://i.imgur.com/jgBdimh.png
+---
+
+# Spanish BERT (BETO) + POS
+
+This model is a fine-tuned version of the Spanish BERT cased [(BETO)](https://github.com/dccuchile/beto) on [NER-C](https://www.kaggle.com/nltkdata/conll-corpora) for the **POS** (Part of Speech tagging) downstream task.
+
+## Details of the downstream task (POS) - Dataset
+
+- [Dataset:  CONLL Corpora ES](https://www.kaggle.com/nltkdata/conll-corpora) with data augmentation techniques
+
+I preprocessed the dataset and split it into train / dev (80/20)
+
+| Dataset                | # Examples |
+| ---------------------- | ----- |
+| Train                  | 340 K |
+| Dev                    | 50 K |
+
+
+- [Fine-tune on NER script provided by Huggingface](https://github.com/huggingface/transformers/blob/master/examples/run_ner.py)
+
+- Labels covered:
+
+```
+AO, AQ, CC, CS, DA, DD, DE, DI, DN, DP, DT, Faa, Fat, Fc, Fd, Fe, Fg, Fh, Fia, Fit, Fp, Fpa, Fpt, Fs, Ft, Fx, Fz, I, NC, NP, P0, PD, PI, PN, PP, PR, PT, PX, RG, RN, SP, VAI, VAM, VAN, VAP, VAS, VMG, VMI, VMM, VMN, VMP, VMS, VSG, VSI, VSM, VSN, VSP, VSS, Y and Z
+```
+
+
+## Metrics on evaluation set:
+
+|  Metric   |  # score  |
+| :-------: | :-------: |
+| F1        | **90.06** |
+| Precision | **89.46** |
+| Recall    | **90.67** |
+
+## Model in action
+
+Fast usage with **pipelines**:
+
+```python
+from transformers import pipeline
+
+nlp_pos = pipeline(
+    "ner",
+    model="mrm8488/bert-spanish-cased-finetuned-pos",
+    tokenizer=(
+        'mrm8488/bert-spanish-cased-finetuned-pos',  
+        {"use_fast": False}
+))
+
+
+text = 'Mis amigos están pensando en viajar a Londres este verano'
+
+nlp_pos(text)
+
+#Output:
+'''
+[{'entity': 'NC', 'score': 0.7792173624038696, 'word': '[CLS]'},
+ {'entity': 'DP', 'score': 0.9996283650398254, 'word': 'Mis'},
+ {'entity': 'NC', 'score': 0.9999253749847412, 'word': 'amigos'},
+ {'entity': 'VMI', 'score': 0.9998560547828674, 'word': 'están'},
+ {'entity': 'VMG', 'score': 0.9992249011993408, 'word': 'pensando'},
+ {'entity': 'SP', 'score': 0.9999602437019348, 'word': 'en'},
+ {'entity': 'VMN', 'score': 0.9998666048049927, 'word': 'viajar'},
+ {'entity': 'SP', 'score': 0.9999545216560364, 'word': 'a'},
+ {'entity': 'VMN', 'score': 0.8722310662269592, 'word': 'Londres'},
+ {'entity': 'DD', 'score': 0.9995203614234924, 'word': 'este'},
+ {'entity': 'NC', 'score': 0.9999248385429382, 'word': 'verano'},
+ {'entity': 'NC', 'score': 0.8802427649497986, 'word': '[SEP]'}]
+ '''
+```
+![model in action](https://media.giphy.com/media/jVC9m1cNrdIWuAAtjy/giphy.gif)
+
+
+> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
+
+> Made with <span style="color: #e25555;">&hearts;</span> in Spain
--- a/model_cards/mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es/README.md
+++ b/model_cards/mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es/README.md
@@ -0,0 +1,141 @@
+---
+language: spanish
+thumbnail: https://i.imgur.com/jgBdimh.png
+---
+
+# BETO (Spanish BERT) + Spanish SQuAD2.0 + distillation using 'bert-base-multilingual-cased' as teacher
+
+This model is a version of [BETO](https://github.com/dccuchile/beto) fine-tuned on [SQuAD-es-v2.0](https://github.com/ccasimiro88/TranslateAlignRetrieve) and **distilled** for **Q&A**.
+
+Distillation makes the model **smaller, faster, cheaper and lighter** than [bert-base-spanish-wwm-cased-finetuned-spa-squad2-es](https://github.com/huggingface/transformers/blob/master/model_cards/mrm8488/bert-base-spanish-wwm-cased-finetuned-spa-squad2-es/README.md)
+
+This model was fine-tuned on the same dataset but using **distillation** during the process as mentioned above (and one more train epoch).
+
+The **teacher model** for the distillation was `bert-base-multilingual-cased`. It is the same teacher used for `distilbert-base-multilingual-cased` AKA [**DistilmBERT**](https://github.com/huggingface/transformers/tree/master/examples/distillation), which is on average twice as fast as **mBERT-base**.
+
+## Details of the downstream task (Q&A) - Dataset
+
+<details>
+
+[SQuAD-es-v2.0](https://github.com/ccasimiro88/TranslateAlignRetrieve)
+
+| Dataset                 | # Q&A |
+| ----------------------- | ----- |
+| SQuAD2.0 Train          | 130 K |
+| SQuAD2.0-es-v2.0        | 111 K |
+| SQuAD2.0 Dev            | 12 K  |
+| SQuAD-es-v2.0-small Dev | 69 K  |
+
+</details>
+
+## Model training
+
+The model was trained on a Tesla P100 GPU and 25GB of RAM with the following command:
+
+```bash
+!export SQUAD_DIR=/path/to/squad-v2_spanish \
+&& python transformers/examples/distillation/run_squad_w_distillation.py \
+  --model_type bert \
+  --model_name_or_path dccuchile/bert-base-spanish-wwm-cased \
+  --teacher_type bert \
+  --teacher_name_or_path bert-base-multilingual-cased \
+  --do_train \
+  --do_eval \
+  --do_lower_case \
+  --train_file $SQUAD_DIR/train-v2.json \
+  --predict_file $SQUAD_DIR/dev-v2.json \
+  --per_gpu_train_batch_size 12 \
+  --learning_rate 3e-5 \
+  --num_train_epochs 5.0 \
+  --max_seq_length 384 \
+  --doc_stride 128 \
+  --output_dir /content/model_output \
+  --save_steps 5000 \
+  --threads 4 \
+  --version_2_with_negative
+```
+
+## Results:
+
+| Metric    | # Value     |
+| --------- | ----------- |
+| **Exact** | **90.77**48 |
+| **F1**    | **94.94**71 |
+
+```json
+{
+  "exact": 90.77483309730933,
+  "f1": 94.94714391266254,
+  "total": 69202,
+  "HasAns_exact": 86.60850599781898,
+  "HasAns_f1": 92.90582885592328,
+  "HasAns_total": 45850,
+  "NoAns_exact": 98.95512161699212,
+  "NoAns_f1": 98.95512161699212,
+  "NoAns_total": 23352,
+  "best_exact": 90.77483309730933,
+  "best_exact_thresh": 0.0,
+  "best_f1": 94.94714391266305,
+  "best_f1_thresh": 0.0
+}
+```
+
+## Comparison:
+
+|                              Model                              | f1 score  |
+| :-------------------------------------------------------------: | :-------: |
+|       bert-base-spanish-wwm-cased-finetuned-spa-squad2-es       |   86.07   |
+| **distill**-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es | **94.94** |
+
+So, yes, this version is even more accurate.
+
+### Model in action
+
+Fast usage with **pipelines**:
+
+```python
+from transformers import *
+
+# Important: for now, the QA pipeline is not compatible with fast tokenizers (support is in progress), so pass {"use_fast": False} to the tokenizer as in the following example:
+
+nlp = pipeline(
+    'question-answering', 
+    model='mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es',
+    tokenizer=(
+        'mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es',  
+        {"use_fast": False}
+    )
+)
+
+nlp(
+    {
+        'question': '¿Para qué lenguaje está trabajando?',
+        'context': 'Manuel Romero está colaborando activamente con huggingface/transformers ' +
+                    'para traer el poder de las últimas técnicas de procesamiento de lenguaje natural al idioma español'
+    }
+)
+# Output: {'answer': 'español', 'end': 169, 'score': 0.67530957344621, 'start': 163}
+```
+
+Play with this model and ```pipelines``` in a Colab:
+
+<a href="https://colab.research.google.com/github/mrm8488/shared_colab_notebooks/blob/master/Using_Spanish_BERT_fine_tuned_for_Q%26A_pipelines.ipynb" target="_parent"><img src="https://camo.githubusercontent.com/52feade06f2fecbf006889a904d221e6a730c194/68747470733a2f2f636f6c61622e72657365617263682e676f6f676c652e636f6d2f6173736574732f636f6c61622d62616467652e737667" alt="Open In Colab" data-canonical-src="https://colab.research.google.com/assets/colab-badge.svg"></a>
+
+<details>
+
+1.  Set the context and ask some questions:
+
+![Set context and questions](https://media.giphy.com/media/mCIaBpfN0LQcuzkA2F/giphy.gif)
+
+2.  Run predictions:
+
+![Run the model](https://media.giphy.com/media/WT453aptcbCP7hxWTZ/giphy.gif)
+</details>
+
+Want to know more about `Huggingface pipelines`? Check out this Colab:
+
+<a href="https://colab.research.google.com/github/mrm8488/shared_colab_notebooks/blob/master/Huggingface_pipelines_demo.ipynb" target="_parent"><img src="https://camo.githubusercontent.com/52feade06f2fecbf006889a904d221e6a730c194/68747470733a2f2f636f6c61622e72657365617263682e676f6f676c652e636f6d2f6173736574732f636f6c61622d62616467652e737667" alt="Open In Colab" data-canonical-src="https://colab.research.google.com/assets/colab-badge.svg"></a>
+
+> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488)
+
+> Made with <span style="color: #e25555;">&hearts;</span> in Spain
--- a/model_cards/nlpaueb/bert-base-greek-uncased-v1/README.md
+++ b/model_cards/nlpaueb/bert-base-greek-uncased-v1/README.md
@@ -0,0 +1,134 @@
+---
+language: greek
+thumbnail: https://github.com/nlpaueb/GreekBERT/raw/master/greek-bert-logo.png
+---
+
+# GreekBERT
+
+A Greek version of the BERT pre-trained language model.
+
+<img src="https://github.com/nlpaueb/GreekBERT/raw/master/greek-bert-logo.png" width="600"/> 
+
+
+## Pre-training corpora
+
+The pre-training corpora of `bert-base-greek-uncased-v1` include:
+
+* The Greek part of [Wikipedia](https://el.wikipedia.org/wiki/Βικιπαίδεια:Αντίγραφα_της_βάσης_δεδομένων),
+* The Greek part of [European Parliament Proceedings Parallel Corpus](https://www.statmt.org/europarl/), and
+* The Greek part of [OSCAR](https://traces1.inria.fr/oscar/), a cleansed version of [Common Crawl](https://commoncrawl.org).
+
+Future releases will also include:
+
+* The entire corpus of Greek legislation, as published by the [National Publication Office](http://www.et.gr),  
+* The entire corpus of EU legislation (Greek translation), as published in [Eur-Lex](https://eur-lex.europa.eu/homepage.html?locale=en).
+
+## Pre-training details
+
+* We trained BERT using the official code provided in Google BERT's GitHub repository (https://github.com/google-research/bert). We then used [Hugging Face](https://huggingface.co)'s [Transformers](https://github.com/huggingface/transformers) conversion script to convert the TF checkpoint and vocabulary into the desired format, so that the model can be loaded in two lines of code by both PyTorch and TF2 users.
+* We released a model similar to the English `bert-base-uncased` model (12-layer, 768-hidden, 12-heads, 110M parameters).
+* We chose to follow the same training set-up: 1 million training steps with batches of 256 sequences of length 512 and an initial learning rate of 1e-4.
+* We were able to use a single Google Cloud TPU v3-8 provided for free from [TensorFlow Research Cloud (TFRC)](https://www.tensorflow.org/tfrc), while also utilizing [GCP research credits](https://edu.google.com/programs/credits/research). Huge thanks to both Google programs for supporting us!
+
+
+## Requirements
+
+We published `bert-base-greek-uncased-v1` as part of [Hugging Face](https://huggingface.co)'s [Transformers](https://github.com/huggingface/transformers) repository. So, you need to install the transformers library through pip along with PyTorch or TensorFlow 2.
+
+```
+pip install transformers
+pip install (torch|tensorflow)
+```
+
+## Pre-process text (Deaccent - Lower)
+
+In order to use `bert-base-greek-uncased-v1`, you have to pre-process texts to lowercase letters and remove all Greek diacritics.
+
+```python
+
+import unicodedata
+
+def strip_accents_and_lowercase(s):
+   return ''.join(c for c in unicodedata.normalize('NFD', s)
+                  if unicodedata.category(c) != 'Mn').lower()
+
+accented_string = "Αυτή είναι η Ελληνική έκδοση του BERT."
+unaccented_string = strip_accents_and_lowercase(accented_string)
+
+print(unaccented_string) # αυτη ειναι η ελληνικη εκδοση του bert.
+
+```
+
+## Load Pretrained Model 
+
+```python
+from transformers import AutoTokenizer, AutoModel
+
+tokenizer = AutoTokenizer.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")
+model = AutoModel.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")
+```
+
+## Use Pretrained Model as a Language Model
+
+```python
+import torch
+from transformers import *
+
+# Load model and tokenizer
+tokenizer_greek = AutoTokenizer.from_pretrained('nlpaueb/bert-base-greek-uncased-v1')
+lm_model_greek = AutoModelWithLMHead.from_pretrained('nlpaueb/bert-base-greek-uncased-v1')
+
+# ================ EXAMPLE 1 ================
+text_1 = 'O ποιητής έγραψε ένα [MASK] .'
+# EN: 'The poet wrote a [MASK].'
+input_ids = tokenizer_greek.encode(text_1)
+print(tokenizer_greek.convert_ids_to_tokens(input_ids))
+# ['[CLS]', 'o', 'ποιητης', 'εγραψε', 'ενα', '[MASK]', '.', '[SEP]']
+outputs = lm_model_greek(torch.tensor([input_ids]))[0]
+print(tokenizer_greek.convert_ids_to_tokens(outputs[0, 5].max(0)[1].item()))
+# the most plausible prediction for [MASK] is "song"
+
+# ================ EXAMPLE 2 ================
+text_2 = 'Είναι ένας [MASK] άνθρωπος.'
+# EN: 'He is a [MASK] person.'
+input_ids = tokenizer_greek.encode(text_2)
+print(tokenizer_greek.convert_ids_to_tokens(input_ids))
+# ['[CLS]', 'ειναι', 'ενας', '[MASK]', 'ανθρωπος', '.', '[SEP]']
+outputs = lm_model_greek(torch.tensor([input_ids]))[0]
+print(tokenizer_greek.convert_ids_to_tokens(outputs[0, 3].max(0)[1].item()))
+# the most plausible prediction for [MASK] is "good"
+
+# ================ EXAMPLE 3 ================
+text_3 = 'Είναι ένας [MASK] άνθρωπος και κάνει συχνά [MASK].'
+# EN: 'He is a [MASK] person and he frequently does [MASK].'
+input_ids = tokenizer_greek.encode(text_3)
+print(tokenizer_greek.convert_ids_to_tokens(input_ids))
+# ['[CLS]', 'ειναι', 'ενας', '[MASK]', 'ανθρωπος', 'και', 'κανει', 'συχνα', '[MASK]', '.', '[SEP]']
+outputs = lm_model_greek(torch.tensor([input_ids]))[0]
+print(tokenizer_greek.convert_ids_to_tokens(outputs[0, 8].max(0)[1].item()))
+# the most plausible prediction for the second [MASK] is "trips"
+```
+
+## Evaluation on downstream tasks
+
+TBA
+
+## Author
+
+Ilias Chalkidis on behalf of [AUEB's Natural Language Processing Group](http://nlp.cs.aueb.gr)
+
+| Github: [@ilias.chalkidis](https://github.com/seolhokim) | Twitter: [@KiddoThe2B](https://twitter.com/KiddoThe2B) |
+
+## About Us
+
+[AUEB's Natural Language Processing Group](http://nlp.cs.aueb.gr) develops algorithms, models, and systems that allow computers to process and generate natural language texts.
+
+The group's current research interests include:
+* question answering systems for databases, ontologies, document collections, and the Web, especially biomedical question answering,
+* natural language generation from databases and ontologies, especially Semantic Web ontologies,
+* text classification, including filtering spam and abusive content,
+* information extraction and opinion mining, including legal text analytics and sentiment analysis,
+* natural language processing tools for Greek, for example parsers and named-entity recognizers,
+* machine learning in natural language processing, especially deep learning.
+
+The group is part of the Information Processing Laboratory of the Department of Informatics of the Athens University of Economics and Business.
--- a/model_cards/nlptown/bert-base-multilingual-uncased-sentiment/README.md
+++ b/model_cards/nlptown/bert-base-multilingual-uncased-sentiment/README.md
@@ -0,0 +1,49 @@
+---
+language:
+- english
+- dutch
+- german
+- french
+- italian
+- spanish
+---
+
+# bert-base-multilingual-uncased-sentiment
+
+This is a bert-base-multilingual-uncased model finetuned for sentiment analysis on product reviews in six languages: English, Dutch, German, French, Spanish and Italian. It predicts the sentiment of the review as a number of stars (between 1 and 5).
+
+This model is intended for direct use as a sentiment analysis model for product reviews in any of the six languages above, or for further finetuning on related sentiment analysis tasks.
+
+## Training data
+
+Here is the number of product reviews we used for finetuning the model: 
+
+| Language | Number of reviews |
+| -------- | ----------------- |
+| English  | 150k           |
+| Dutch    | 80k            |
+| German   | 137k           |
+| French   | 140k           |
+| Italian  | 72k            |
+| Spanish  | 50k            |
+
+## Accuracy
+
+The finetuned model obtained the following accuracy on 5,000 held-out product reviews in each of the languages:
+
+- Accuracy (exact) is the exact match on the number of stars.
+- Accuracy (off-by-1) is the percentage of reviews where the number of stars the model predicts differs by a maximum of 1 from the number given by the human reviewer. 
+
+
+| Language | Accuracy (exact) | Accuracy (off-by-1) |
+| -------- | ---------------- | ------------------- |
+| English  | 67%              | 95%                 |
+| Dutch    | 57%              | 93%                 |
+| German   | 61%              | 94%                 |
+| French   | 59%              | 94%                 |
+| Italian  | 59%              | 95%                 |
+| Spanish  | 58%              | 95%                 |
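The two accuracy definitions above can be sketched in a few lines. This is only an illustration of the metrics as described in the card: the function name `star_accuracies` and the toy ratings are made up, and the actual evaluation code is not shown in this diff.

```python
# Exact and off-by-1 accuracy for star ratings (1 to 5).
# The data below is invented purely to exercise the two definitions.

def star_accuracies(predicted, actual):
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    n = len(predicted)
    # Exact: predicted stars equal the reviewer's stars.
    exact = sum(p == a for p, a in zip(predicted, actual)) / n
    # Off-by-1: predicted stars differ from the reviewer's by at most 1.
    off_by_1 = sum(abs(p - a) <= 1 for p, a in zip(predicted, actual)) / n
    return exact, off_by_1


predicted = [5, 4, 2, 1, 3]
actual = [5, 5, 4, 1, 2]
exact, off_by_1 = star_accuracies(predicted, actual)
print(exact, off_by_1)
```

Off-by-1 accuracy is never lower than exact accuracy, since every exact match also counts as within one star.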
+
+## Contact 
+
+Contact [NLP Town](https://www.nlp.town) for questions, feedback and/or requests for similar models.
--- a/model_cards/severinsimmler/literary-german-bert/README.md
+++ b/model_cards/severinsimmler/literary-german-bert/README.md
@@ -0,0 +1,51 @@
+---
+language: german
+thumbnail: kfold.png
+---
+
+# German BERT for literary texts
+
+This German BERT is based on `bert-base-german-dbmdz-cased`, and has been adapted to the domain of literary texts by fine-tuning the language modeling task on the [Corpus of German-Language Fiction](https://figshare.com/articles/Corpus_of_German-Language_Fiction_txt_/4524680/1). Afterwards the model was fine-tuned for named entity recognition on the [DROC](https://gitlab2.informatik.uni-wuerzburg.de/kallimachos/DROC-Release) corpus, so you can use it to recognize protagonists in German novels.
+
+
+# Stats
+
+## Language modeling
+
+The [Corpus of German-Language Fiction](https://figshare.com/articles/Corpus_of_German-Language_Fiction_txt_/4524680/1) consists of 3,194 documents with 203,516,988 tokens or 1,520,855 types. The publication year of the texts ranges from the 18th to the 20th century:
+
+![years](prosa-jahre.png)
+
+
+### Results
+
+After one epoch:
+
+| Model            | Perplexity |
+| ---------------- | ---------- |
+| Vanilla BERT     | 6.82       |
+| Fine-tuned BERT  | 4.98       |
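For readers comparing these numbers with a training log: perplexity is commonly computed as the exponential of the mean per-token negative log-likelihood. The sketch below assumes that definition (the card does not state how its figures were computed) and converts the two table values back into the equivalent mean loss. The function name `perplexity` and the toy probabilities are illustrative.

```python
import math

# Perplexity as exp(mean negative log-likelihood), using natural logs.
# Assumption: the table values follow this common definition.

def perplexity(token_nlls):
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))


# Toy example: three tokens predicted with probabilities 1/2, 1/4 and 1/8.
toy_nlls = [-math.log(p) for p in (0.5, 0.25, 0.125)]
toy_ppl = perplexity(toy_nlls)  # geometric mean of 2, 4 and 8

# Mean losses that would correspond to the perplexities in the table.
vanilla_loss = math.log(6.82)
finetuned_loss = math.log(4.98)
print(round(toy_ppl, 6), round(vanilla_loss, 3), round(finetuned_loss, 3))
```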
+
+
+## Named entity recognition
+
+The provided model was also fine-tuned for two epochs on 10,799 sentences for training, validated on 547 and tested on 1,845 with three labels: `B-PER`, `I-PER` and `O`.
+
+
+### Results
+
+| Dataset | Precision | Recall | F1   |
+| ------- | --------- | ------ | ---- |
+| Dev     | 96.4      | 87.3   | 91.6 |
+| Test    | 92.8      | 94.9   | 93.8 |
+
+The model has also been evaluated using 10-fold cross validation and compared with a classic Conditional Random Field baseline described in [Jannidis et al.](https://opus.bibliothek.uni-wuerzburg.de/opus4-wuerzburg/frontdoor/deliver/index/docId/14333/file/Jannidis_Figurenerkennung_Roman.pdf) (2015):
+
+![kfold](kfold.png)
+
+
+# References
+
+Markus Krug, Lukas Weimer, Isabella Reger, Luisa Macharowsky, Stephan Feldhaus, Frank Puppe, Fotis Jannidis, [Description of a Corpus of Character References in German Novels](http://webdoc.sub.gwdg.de/pub/mon/dariah-de/dwp-2018-27.pdf), 2018.
+
+Fotis Jannidis, Isabella Reger, Lukas Weimer, Markus Krug, Martin Toepfer, Frank Puppe, [Automatische Erkennung von Figuren in deutschsprachigen Romanen](https://opus.bibliothek.uni-wuerzburg.de/opus4-wuerzburg/frontdoor/deliver/index/docId/14333/file/Jannidis_Figurenerkennung_Roman.pdf), 2015.
--- a/model_cards/severinsimmler/literary-german-bert/kfold.png
+++ b/model_cards/severinsimmler/literary-german-bert/kfold.png
--- a/model_cards/severinsimmler/literary-german-bert/prosa-jahre.png
+++ b/model_cards/severinsimmler/literary-german-bert/prosa-jahre.png
--- a/setup.cfg
+++ b/setup.cfg
@@ -15,6 +15,7 @@ known_third_party =
    packaging
    PIL
    psutil
+    pytorch_lightning
    seqeval
    sklearn
    tensorboardX
@@ -23,6 +24,7 @@ known_third_party =
    torch
    torchtext
    torchvision
+    torch_xla

 line_length = 119
 lines_after_imports = 2
--- a/setup.py
+++ b/setup.py
@@ -34,6 +34,9 @@ To create the package for pypi.

 7. Copy the release notes from RELEASE.md to the tag in github once everything is looking hunky-dory.

+8. Update the documentation commit in .circleci/deploy.sh for the accurate documentation to be displayed
+
+9. Update README.md to redirect to correct documentation.
 """

 import shutil
@@ -63,6 +66,7 @@ extras = {}
 extras["mecab"] = ["mecab-python3"]
 extras["sklearn"] = ["scikit-learn"]
 extras["tf"] = ["tensorflow"]
+extras["tf-cpu"] = ["tensorflow-cpu"]
 extras["torch"] = ["torch"]

 extras["serving"] = ["pydantic", "uvicorn", "fastapi", "starlette"]
@@ -75,8 +79,8 @@ extras["dev"] = extras["testing"] + extras["quality"] + ["mecab-python3", "sciki

 setup(
    name="transformers",
-    version="2.4.1",
-    author="Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Google AI Language Team Authors, Open AI team Authors, Facebook AI Authors, Carnegie Mellon University Authors",
+    version="2.5.1",
+    author="Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Sam Shleifer, Google AI Language Team Authors, Open AI team Authors, Facebook AI Authors, Carnegie Mellon University Authors",
    author_email="thomas@huggingface.co",
    description="State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch",
    long_description=open("README.md", "r", encoding="utf-8").read(),
@@ -88,7 +92,7 @@ setup(
    packages=find_packages("src"),
    install_requires=[
        "numpy",
-        "tokenizers == 0.0.11",
+        "tokenizers == 0.5.2",
        # accessing files from S3 directly
        "boto3",
        # filesystem locks e.g. to prevent parallel downloads
--- a/src/transformers/__init__.py
+++ b/src/transformers/__init__.py
@@ -2,7 +2,7 @@
 # There's no way to ignore "F401 '...' imported but unused" warnings in this
 # module, but to preserve other warnings. So, don't check this module at all.

-__version__ = "2.4.1"
+__version__ = "2.5.1"

 # Work around to update TensorFlow's absl.logging threshold which alters the
 # default Python logging output behavior when present.
@@ -21,6 +21,7 @@ import logging

 from .configuration_albert import ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, AlbertConfig
 from .configuration_auto import ALL_PRETRAINED_CONFIG_ARCHIVE_MAP, AutoConfig
+from .configuration_bart import BartConfig
 from .configuration_bert import BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, BertConfig
 from .configuration_camembert import CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, CamembertConfig
 from .configuration_ctrl import CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP, CTRLConfig
@@ -101,21 +102,23 @@ from .pipelines import (
    PipelineDataFormat,
    QuestionAnsweringPipeline,
    TextClassificationPipeline,
+    TokenClassificationPipeline,
    pipeline,
 )
 from .tokenization_albert import AlbertTokenizer
 from .tokenization_auto import AutoTokenizer
+from .tokenization_bart import BartTokenizer
 from .tokenization_bert import BasicTokenizer, BertTokenizer, BertTokenizerFast, WordpieceTokenizer
 from .tokenization_bert_japanese import BertJapaneseTokenizer, CharacterTokenizer, MecabTokenizer
 from .tokenization_camembert import CamembertTokenizer
 from .tokenization_ctrl import CTRLTokenizer
-from .tokenization_distilbert import DistilBertTokenizer
+from .tokenization_distilbert import DistilBertTokenizer, DistilBertTokenizerFast
 from .tokenization_flaubert import FlaubertTokenizer
 from .tokenization_gpt2 import GPT2Tokenizer, GPT2TokenizerFast
-from .tokenization_openai import OpenAIGPTTokenizer
-from .tokenization_roberta import RobertaTokenizer
+from .tokenization_openai import OpenAIGPTTokenizer, OpenAIGPTTokenizerFast
+from .tokenization_roberta import RobertaTokenizer, RobertaTokenizerFast
 from .tokenization_t5 import T5Tokenizer
-from .tokenization_transfo_xl import TransfoXLCorpus, TransfoXLTokenizer
+from .tokenization_transfo_xl import TransfoXLCorpus, TransfoXLTokenizer, TransfoXLTokenizerFast

 # Tokenizers
 from .tokenization_utils import PreTrainedTokenizer
@@ -203,6 +206,7 @@ if is_torch_available():
        XLMForQuestionAnsweringSimple,
        XLM_PRETRAINED_MODEL_ARCHIVE_MAP,
    )
+    from .modeling_bart import BartForSequenceClassification, BartModel, BartForMaskedLM
    from .modeling_roberta import (
        RobertaForMaskedLM,
        RobertaModel,
@@ -217,6 +221,7 @@ if is_torch_available():
        CamembertModel,
        CamembertForSequenceClassification,
        CamembertForTokenClassification,
+        CamembertForQuestionAnswering,
        CAMEMBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
    )
    from .modeling_distilbert import (
--- a/src/transformers/activations.py
+++ b/src/transformers/activations.py
@@ -0,0 +1,51 @@
+import math
+
+import torch
+import torch.nn.functional as F
+
+
+def swish(x):
+    return x * torch.sigmoid(x)
+
+
+def _gelu_python(x):
+    """ Original Implementation of the gelu activation function in Google Bert repo when initially created.
+        For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):
+        0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
+        This is now written in C in torch.nn.functional
+        Also see https://arxiv.org/abs/1606.08415
+    """
+    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
+
+
+if torch.__version__ < "1.4.0":
+    gelu = _gelu_python
+else:
+    gelu = F.gelu
+
+
+def gelu_new(x):
+    """ Implementation of the gelu activation function currently in Google Bert repo (identical to OpenAI GPT).
+        Also see https://arxiv.org/abs/1606.08415
+    """
+    return 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
+
+
+ACT2FN = {
+    "relu": F.relu,
+    "swish": swish,
+    "gelu": gelu,
+    "tanh": torch.tanh,
+    "gelu_new": gelu_new,
+}
+
+
+def get_activation(activation_string):
+    if activation_string in ACT2FN:
+        return ACT2FN[activation_string]
+    else:
+        raise KeyError(
+            "function {} not found in ACT2FN mapping {}".format(
+                activation_string, list(ACT2FN.keys())
+            )
+        )
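Note on the version gate in activations.py: `torch.__version__ < "1.4.0"` compares strings, and string comparison is lexicographic, so a version such as "1.10.0" would sort before "1.4.0". The sketch below shows one way to compare the numeric release components instead. It assumes version strings begin with dotted integers; `parse_release` and `is_older_than` are hypothetical helpers written for illustration and are not part of transformers or torch.

```python
import re


def parse_release(version_string):
    """Return the leading numeric release of a version string as a 3-tuple of ints."""
    # Keep only the leading dotted-number part, e.g. "1.4.0a0+git123" -> "1.4.0".
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        raise ValueError("no numeric release found in {!r}".format(version_string))
    parts = [int(part) for part in match.group(0).split(".")]
    # Pad short versions such as "1.4" with zeros so that "1.4" equals "1.4.0".
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def is_older_than(version_string, minimum):
    """True if version_string denotes an older release than minimum."""
    return parse_release(version_string) < parse_release(minimum)
```

With these helpers, `is_older_than("1.10.0", "1.4.0")` is False, whereas the plain string comparison `"1.10.0" < "1.4.0"` is True.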
--- a/src/transformers/commands/env.py
+++ b/src/transformers/commands/env.py
@@ -0,0 +1,58 @@
+import platform
+from argparse import ArgumentParser
+
+from transformers import __version__ as version
+from transformers import is_tf_available, is_torch_available
+from transformers.commands import BaseTransformersCLICommand
+
+
+def info_command_factory(_):
+    return EnvironmentCommand()
+
+
+class EnvironmentCommand(BaseTransformersCLICommand):
+    @staticmethod
+    def register_subcommand(parser: ArgumentParser):
+        download_parser = parser.add_parser("env")
+        download_parser.set_defaults(func=info_command_factory)
+
+    def run(self):
+        pt_version = "not installed"
+        pt_cuda_available = "NA"
+        if is_torch_available():
+            import torch
+
+            pt_version = torch.__version__
+            pt_cuda_available = torch.cuda.is_available()
+
+        tf_version = "not installed"
+        tf_cuda_available = "NA"
+        if is_tf_available():
+            import tensorflow as tf
+
+            tf_version = tf.__version__
+            try:
+                # tf.test.is_gpu_available is deprecated in TensorFlow v2.1
+                tf_cuda_available = tf.test.is_gpu_available()
+            except AttributeError:
+                # returns list of devices, convert to bool
+                tf_cuda_available = bool(tf.config.list_physical_devices("GPU"))
+
+        info = {
+            "`transformers` version": version,
+            "Platform": platform.platform(),
+            "Python version": platform.python_version(),
+            "PyTorch version (GPU?)": "{} ({})".format(pt_version, pt_cuda_available),
+            "Tensorflow version (GPU?)": "{} ({})".format(tf_version, tf_cuda_available),
+            "Using GPU in script?": "<fill in>",
+            "Using distributed or parallel set-up in script?": "<fill in>",
+        }
+
+        print("\nCopy-and-paste the text below in your GitHub issue and FILL OUT the last two points.\n")
+        print(self.format_dict(info))
+
+        return info
+
+    @staticmethod
+    def format_dict(d):
+        return "\n".join(["- {}: {}".format(prop, val) for prop, val in d.items()]) + "\n"
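For reference, the report printed by the `env` command is a plain bullet list built by `format_dict`. The sketch below reproduces that helper as a standalone function and applies it to a small dictionary. The keys and values in `example_info` are made up for illustration and are not real command output.

```python
def format_dict(d):
    """Render a mapping as one '- key: value' bullet per line, with a trailing newline."""
    return "\n".join(["- {}: {}".format(prop, val) for prop, val in d.items()]) + "\n"


# Hypothetical values, chosen only to show the layout of the report.
example_info = {
    "Platform": "ExampleOS",
    "Using GPU in script?": "<fill in>",
}

report = format_dict(example_info)
print(report)
```

The bullets follow the insertion order of the dictionary, so the two `<fill in>` entries end up last in the real command because they are added last.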
--- a/src/transformers/configuration_auto.py
+++ b/src/transformers/configuration_auto.py
@@ -19,6 +19,7 @@ import logging
 from collections import OrderedDict

 from .configuration_albert import ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, AlbertConfig
+from .configuration_bart import BART_PRETRAINED_CONFIG_ARCHIVE_MAP, BartConfig
 from .configuration_bert import BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, BertConfig
 from .configuration_camembert import CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, CamembertConfig
 from .configuration_ctrl import CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP, CTRLConfig
@@ -42,6 +43,7 @@ ALL_PRETRAINED_CONFIG_ARCHIVE_MAP = dict(
    (key, value)
    for pretrained_map in [
        BERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
+        BART_PRETRAINED_CONFIG_ARCHIVE_MAP,
        OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP,
        TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP,
        GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP,
@@ -67,6 +69,7 @@ CONFIG_MAPPING = OrderedDict(
        ("albert", AlbertConfig,),
        ("camembert", CamembertConfig,),
        ("xlm-roberta", XLMRobertaConfig,),
+        ("bart", BartConfig,),
        ("roberta", RobertaConfig,),
        ("flaubert", FlaubertConfig,),
        ("bert", BertConfig,),
--- a/src/transformers/configuration_bart.py
+++ b/src/transformers/configuration_bart.py
@@ -0,0 +1,101 @@
+# coding=utf-8
+# Copyright 2020 The Fairseq Authors and The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" BART configuration """
+
+
+import logging
+
+from .configuration_utils import PretrainedConfig
+
+
+logger = logging.getLogger(__name__)
+
+_bart_large_url = "https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large/config.json"
+BART_PRETRAINED_CONFIG_ARCHIVE_MAP = {
+    "bart-large": _bart_large_url,
+    "bart-large-mnli": _bart_large_url,  # shares the bart-large config
+    "bart-cnn": None,  # not done
+}
+
+
+class BartConfig(PretrainedConfig):
+    r"""
+        Configuration class for Bart. Parameters are renamed from the fairseq implementation
+    """
+    model_type = "bart"
+    pretrained_config_archive_map = BART_PRETRAINED_CONFIG_ARCHIVE_MAP
+
+    def __init__(
+        self,
+        activation_dropout=0.0,
+        vocab_size=50265,
+        pad_token_id=1,
+        eos_token_id=2,
+        d_model=1024,
+        encoder_ffn_dim=4096,
+        encoder_layers=12,
+        encoder_attention_heads=16,
+        decoder_ffn_dim=4096,
+        decoder_layers=12,
+        decoder_attention_heads=16,
+        encoder_layerdrop=0.0,
+        decoder_layerdrop=0.0,
+        attention_dropout=0.0,
+        dropout=0.1,
+        max_position_embeddings=1024,
+        init_std=0.02,
+        classifier_dropout=0.0,
+        output_past=False,
+        num_labels=3,
+        **common_kwargs
+    ):
+        r"""
+            :class:`~transformers.BartConfig` is the configuration class for `BartModel`.
+            Examples:
+                config = BartConfig.from_pretrained('bart-large')
+                model = BartModel(config)
+        """
+        super().__init__(num_labels=num_labels, output_past=output_past, pad_token_id=pad_token_id, **common_kwargs)
+
+        self.vocab_size = vocab_size
+        self.d_model = d_model  # encoder_embed_dim and decoder_embed_dim
+        self.eos_token_id = eos_token_id
+
+        self.encoder_ffn_dim = encoder_ffn_dim
+        self.encoder_layers = self.num_hidden_layers = encoder_layers
+        self.encoder_attention_heads = encoder_attention_heads
+        self.encoder_layerdrop = encoder_layerdrop
+        self.decoder_layerdrop = decoder_layerdrop
+        self.decoder_ffn_dim = decoder_ffn_dim
+        self.decoder_layers = decoder_layers
+        self.decoder_attention_heads = decoder_attention_heads
+        self.max_position_embeddings = max_position_embeddings
+        self.init_std = init_std  # Normal(0, this parameter)
+
+        # 3 Types of Dropout
+        self.attention_dropout = attention_dropout
+        self.activation_dropout = activation_dropout
+        self.dropout = dropout
+
+        # Classifier stuff
+        self.classif_dropout = classifier_dropout
+
+    @property
+    def num_attention_heads(self):
+        return self.encoder_attention_heads
+
+    @property
+    def hidden_size(self):
+        return self.d_model
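`BartConfig` keeps the fairseq-style attribute names (`d_model`, `encoder_attention_heads`) and exposes the names used elsewhere in the library (`hidden_size`, `num_attention_heads`) as read-only properties. A minimal sketch of that aliasing pattern follows. `TinySeq2SeqConfig` is a hypothetical stand-in for illustration, not the real class.

```python
class TinySeq2SeqConfig:
    """Toy config that stores fairseq-style names and aliases them via properties."""

    def __init__(self, d_model=1024, encoder_attention_heads=16):
        self.d_model = d_model
        self.encoder_attention_heads = encoder_attention_heads

    @property
    def hidden_size(self):
        # Alias: always reflects the current value of d_model.
        return self.d_model

    @property
    def num_attention_heads(self):
        # Alias: always reflects the current value of encoder_attention_heads.
        return self.encoder_attention_heads


config = TinySeq2SeqConfig(d_model=512, encoder_attention_heads=8)
config.d_model = 256  # the alias follows the underlying attribute
```

Because the aliases are properties without setters, assigning to `config.hidden_size` raises `AttributeError`, so the underlying attribute remains the single source of truth.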
--- a/src/transformers/configuration_distilbert.py
+++ b/src/transformers/configuration_distilbert.py
@@ -25,6 +25,8 @@ logger = logging.getLogger(__name__)
 DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
    "distilbert-base-uncased": "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-config.json",
    "distilbert-base-uncased-distilled-squad": "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-distilled-squad-config.json",
+    "distilbert-base-cased": "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-cased-config.json",
+    "distilbert-base-cased-distilled-squad": "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-cased-distilled-squad-config.json",
    "distilbert-base-german-cased": "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-german-cased-config.json",
    "distilbert-base-multilingual-cased": "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-multilingual-cased-config.json",
    "distilbert-base-uncased-finetuned-sst-2-english": "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-finetuned-sst-2-english-config.json",
@@ -58,7 +60,7 @@ class DistilBertConfig(PretrainedConfig):
                Number of attention heads for each attention layer in the Transformer encoder.
            dim (:obj:`int`, optional, defaults to 768):
                Dimensionality of the encoder layers and the pooler layer.
-            intermediate_size (:obj:`int`, optional, defaults to 3072):
+            hidden_dim (:obj:`int`, optional, defaults to 3072):
                The size of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
            dropout (:obj:`float`, optional, defaults to 0.1):
                The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
--- a/src/transformers/configuration_utils.py
+++ b/src/transformers/configuration_utils.py
@@ -75,9 +75,9 @@ class PretrainedConfig(object):
        self.top_k = kwargs.pop("top_k", 50)
        self.top_p = kwargs.pop("top_p", 1.0)
        self.repetition_penalty = kwargs.pop("repetition_penalty", 1.0)
-        self.bos_token_id = kwargs.pop("bos_token_id", 0)
-        self.pad_token_id = kwargs.pop("pad_token_id", 0)
-        self.eos_token_ids = kwargs.pop("eos_token_ids", 0)
+        self.bos_token_id = kwargs.pop("bos_token_id", None)
+        self.pad_token_id = kwargs.pop("pad_token_id", None)
+        self.eos_token_ids = kwargs.pop("eos_token_ids", None)
        self.length_penalty = kwargs.pop("length_penalty", 1.0)
        self.num_return_sequences = kwargs.pop("num_return_sequences", 1)

@@ -198,6 +198,7 @@ class PretrainedConfig(object):
        force_download = kwargs.pop("force_download", False)
        resume_download = kwargs.pop("resume_download", False)
        proxies = kwargs.pop("proxies", None)
+        local_files_only = kwargs.pop("local_files_only", False)

        if pretrained_config_archive_map is None:
            pretrained_config_archive_map = cls.pretrained_config_archive_map
@@ -219,6 +220,7 @@ class PretrainedConfig(object):
                force_download=force_download,
                proxies=proxies,
                resume_download=resume_download,
+                local_files_only=local_files_only,
            )
            # Load config dict
            if resolved_config_file is None:
--- a/src/transformers/convert_bart_original_pytorch_checkpoint_to_pytorch.py
+++ b/src/transformers/convert_bart_original_pytorch_checkpoint_to_pytorch.py
@@ -0,0 +1,100 @@
+# coding=utf-8
+# Copyright 2020 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Convert BART checkpoint."""
+
+
+import argparse
+import logging
+from pathlib import Path
+
+import fairseq
+import torch
+from packaging import version
+
+from transformers import BartConfig, BartForSequenceClassification, BartModel, BartTokenizer
+
+
+if version.parse(fairseq.__version__) < version.parse("0.9.0"):
+    raise Exception("requires fairseq >= 0.9.0")
+
+
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
+
+SAMPLE_TEXT = "Hello world! cécé herlolip"
+
+rename_keys = [
+    ("model.classification_heads.mnli.dense.weight", "classification_head.dense.weight"),
+    ("model.classification_heads.mnli.dense.bias", "classification_head.dense.bias"),
+    ("model.classification_heads.mnli.out_proj.weight", "classification_head.out_proj.weight"),
+    ("model.classification_heads.mnli.out_proj.bias", "classification_head.out_proj.bias"),
+]
+IGNORE_KEYS = ["encoder.version", "decoder.version", "model.encoder.version", "model.decoder.version"]
+
+
+def rename_key(dct, old, new):
+    val = dct.pop(old)
+    dct[new] = val
+
+
+def convert_bart_checkpoint(checkpoint_path, pytorch_dump_folder_path):
+    """
+    Copy/paste/tweak model's weights to our BART structure.
+    """
+    b2 = torch.hub.load("pytorch/fairseq", checkpoint_path)
+    b2.eval()  # disable dropout
+    b2.model.upgrade_state_dict(b2.model.state_dict())
+    config = BartConfig()
+    tokens = b2.encode(SAMPLE_TEXT).unsqueeze(0)
+    tokens2 = BartTokenizer.from_pretrained("bart-large").encode(SAMPLE_TEXT).unsqueeze(0)
+    assert torch.eq(tokens, tokens2).all()
+
+    # assert their_output.size() == (1, 11, 1024)
+
+    if checkpoint_path == "bart.large":
+        state_dict = b2.model.state_dict()
+        state_dict["shared.weight"] = state_dict["decoder.embed_tokens.weight"]
+        model = BartModel(config)
+        their_output = b2.extract_features(tokens)
+
+    else:  # MNLI Case
+        state_dict = b2.state_dict()
+        state_dict["model.shared.weight"] = state_dict["model.decoder.embed_tokens.weight"]
+        for src, dest in rename_keys:
+            rename_key(state_dict, src, dest)
+        state_dict.pop("_float_tensor", None)
+        model = BartForSequenceClassification(config)
+        their_output = b2.predict("mnli", tokens, return_logits=True)
+    for k in IGNORE_KEYS:
+        state_dict.pop(k, None)
+    model.load_state_dict(state_dict)
+    model.eval()
+    our_outputs = model.forward(tokens)[0]
+
+    assert their_output.shape == our_outputs.shape
+    assert (their_output == our_outputs).all().item()
+    Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
+    model.save_pretrained(pytorch_dump_folder_path)
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    # Required parameters
+    parser.add_argument("fairseq_path", choices=["bart.large", "bart.large.mnli"], type=str, help="Name of the fairseq checkpoint to load through torch.hub.")
+    parser.add_argument("pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model.")
+    args = parser.parse_args()
+    convert_bart_checkpoint(
+        args.fairseq_path, args.pytorch_dump_folder_path,
+    )
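The MNLI branch of the conversion script renames the classification-head keys and drops the version bookkeeping entries before calling `load_state_dict`. The sketch below shows the same remapping on a plain dictionary, with placeholder strings where the real script has tensors. `remap_state_dict` is a hypothetical helper, and the key lists are shortened copies of the ones in the script.

```python
RENAME_KEYS = [
    ("model.classification_heads.mnli.dense.weight", "classification_head.dense.weight"),
    ("model.classification_heads.mnli.dense.bias", "classification_head.dense.bias"),
]
IGNORE_KEYS = ["model.encoder.version", "model.decoder.version"]


def remap_state_dict(state_dict, rename_keys, ignore_keys):
    """Return a copy of state_dict with keys renamed and ignored keys removed."""
    remapped = dict(state_dict)  # shallow copy, the input is left untouched
    for old, new in rename_keys:
        remapped[new] = remapped.pop(old)
    for key in ignore_keys:
        remapped.pop(key, None)  # tolerate keys that are absent
    return remapped


# Placeholder values stand in for the tensors of a real checkpoint.
toy_state_dict = {
    "model.classification_heads.mnli.dense.weight": "W",
    "model.classification_heads.mnli.dense.bias": "b",
    "model.encoder.version": "v",
}
converted = remap_state_dict(toy_state_dict, RENAME_KEYS, IGNORE_KEYS)
```

Unlike `rename_key` in the script, this variant works on a copy, which keeps the original mapping available for comparison while debugging.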
--- a/src/transformers/convert_pytorch_checkpoint_to_tf2.py
+++ b/src/transformers/convert_pytorch_checkpoint_to_tf2.py
@@ -277,7 +277,7 @@ MODEL_CLASSES = {
        DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
        DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
    ),
-    "distilbert-base-uncased-distilled-squad": (
+    "distilbert-base-distilled-squad": (
        DistilBertConfig,
        TFDistilBertForQuestionAnswering,
        DistilBertForQuestionAnswering,
--- a/src/transformers/data/processors/squad.py
+++ b/src/transformers/data/processors/squad.py
@@ -123,7 +123,7 @@ def squad_convert_example_to_features(example, max_seq_length, doc_stride, max_q
    truncated_query = tokenizer.encode(example.question_text, add_special_tokens=False, max_length=max_query_length)
    sequence_added_tokens = (
        tokenizer.max_len - tokenizer.max_len_single_sentence + 1
-        if "roberta" in str(type(tokenizer))
+        if "roberta" in str(type(tokenizer)) or "camembert" in str(type(tokenizer))
        else tokenizer.max_len - tokenizer.max_len_single_sentence
    )
    sequence_pair_added_tokens = tokenizer.max_len - tokenizer.max_len_sentences_pair
@@ -147,7 +147,14 @@ def squad_convert_example_to_features(example, max_seq_length, doc_stride, max_q
        )

        if tokenizer.pad_token_id in encoded_dict["input_ids"]:
-            non_padded_ids = encoded_dict["input_ids"][: encoded_dict["input_ids"].index(tokenizer.pad_token_id)]
+            if tokenizer.padding_side == "right":
+                non_padded_ids = encoded_dict["input_ids"][: encoded_dict["input_ids"].index(tokenizer.pad_token_id)]
+            else:
+                last_padding_id_position = (
+                    len(encoded_dict["input_ids"]) - 1 - encoded_dict["input_ids"][::-1].index(tokenizer.pad_token_id)
+                )
+                non_padded_ids = encoded_dict["input_ids"][last_padding_id_position + 1 :]
+
        else:
            non_padded_ids = encoded_dict["input_ids"]

@@ -621,7 +628,7 @@ class SquadExample(object):
        self.doc_tokens = doc_tokens
        self.char_to_word_offset = char_to_word_offset

-        # Start end end positions only has a value during evaluation.
+        # Start and end positions only have a value during evaluation.
        if start_position_character is not None and not is_impossible:
            self.start_position = char_to_word_offset[start_position_character]
            self.end_position = char_to_word_offset[
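The padding change in squad.py can be read in isolation: with right padding the real tokens precede the first pad token, and with left padding they follow the last one. A minimal sketch of that logic on plain lists follows. `strip_padding` is a hypothetical helper name, and the token ids in the usage note are made up.

```python
def strip_padding(input_ids, pad_token_id, padding_side="right"):
    """Return input_ids without the padding block on the given side."""
    if pad_token_id not in input_ids:
        return list(input_ids)
    if padding_side == "right":
        # Padding trails the sequence: keep everything before the first pad token.
        return input_ids[: input_ids.index(pad_token_id)]
    # Padding leads the sequence: keep everything after the last pad token.
    last_pad = len(input_ids) - 1 - input_ids[::-1].index(pad_token_id)
    return input_ids[last_pad + 1 :]
```

For example, `strip_padding([0, 0, 5, 6, 7], 0, "left")` and `strip_padding([5, 6, 7, 0, 0], 0, "right")` both return `[5, 6, 7]`.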
--- a/src/transformers/data/processors/utils.py
+++ b/src/transformers/data/processors/utils.py
@@ -32,11 +32,11 @@ class InputExample(object):
    Args:
        guid: Unique id for the example.
        text_a: string. The untokenized text of the first sequence. For single
-        sequence tasks, only this sequence must be specified.
+            sequence tasks, only this sequence must be specified.
        text_b: (Optional) string. The untokenized text of the second sequence.
-        Only must be specified for sequence pair tasks.
+            Only must be specified for sequence pair tasks.
        label: (Optional) string. The label of the example. This should be
-        specified for train and dev examples, but not for test examples.
+            specified for train and dev examples, but not for test examples.
    """

    def __init__(self, guid, text_a, text_b=None, label=None):
--- a/src/transformers/file_utils.py
+++ b/src/transformers/file_utils.py
@@ -8,13 +8,16 @@ import fnmatch
 import json
 import logging
 import os
+import shutil
 import sys
+import tarfile
 import tempfile
 from contextlib import contextmanager
 from functools import partial, wraps
 from hashlib import sha256
 from typing import Optional
 from urllib.parse import urlparse
+from zipfile import ZipFile, is_zipfile

 import boto3
 import requests
@@ -203,7 +206,15 @@ def filename_to_url(filename, cache_dir=None):


 def cached_path(
-    url_or_filename, cache_dir=None, force_download=False, proxies=None, resume_download=False, user_agent=None
+    url_or_filename,
+    cache_dir=None,
+    force_download=False,
+    proxies=None,
+    resume_download=False,
+    user_agent=None,
+    extract_compressed_file=False,
+    force_extract=False,
+    local_files_only=False,
 ) -> Optional[str]:
    """
    Given something that might be a URL (or might be a local path),
@@ -215,6 +226,10 @@ def cached_path(
        force_download: if True, re-dowload the file even if it's already cached in the cache dir.
        resume_download: if True, resume the download if incompletly recieved file is found.
        user_agent: Optional string or dict that will be appended to the user-agent on remote requests.
+        extract_compressed_file: if True and the path points to a zip or tar file, extract the compressed
+            file into a folder alongside the archive.
+        force_extract: if True when extract_compressed_file is True and the archive was already extracted,
+            re-extract the archive and overwrite the folder where it was extracted.

    Return:
        None in case of non-recoverable file (non-existent or inaccessible url + no cache on disk).
@@ -229,17 +244,18 @@ def cached_path(

    if is_remote_url(url_or_filename):
        # URL, so get it from the cache (downloading if necessary)
-        return get_from_cache(
+        output_path = get_from_cache(
            url_or_filename,
            cache_dir=cache_dir,
            force_download=force_download,
            proxies=proxies,
            resume_download=resume_download,
            user_agent=user_agent,
+            local_files_only=local_files_only,
        )
    elif os.path.exists(url_or_filename):
        # File, and it exists.
-        return url_or_filename
+        output_path = url_or_filename
    elif urlparse(url_or_filename).scheme == "":
        # File, but it doesn't exist.
        raise EnvironmentError("file {} not found".format(url_or_filename))
@@ -247,6 +263,39 @@ def cached_path(
        # Something unknown
        raise ValueError("unable to parse {} as a URL or as a local path".format(url_or_filename))

+    if extract_compressed_file:
+        if not is_zipfile(output_path) and not tarfile.is_tarfile(output_path):
+            return output_path
+
+        # Path where we extract compressed archives
+        # We avoid '.' in dir name and add "-extracted" at the end: "./model.zip" => "./model-zip-extracted/"
+        output_dir, output_file = os.path.split(output_path)
+        output_extract_dir_name = output_file.replace(".", "-") + "-extracted"
+        output_path_extracted = os.path.join(output_dir, output_extract_dir_name)
+
+        if os.path.isdir(output_path_extracted) and os.listdir(output_path_extracted) and not force_extract:
+            return output_path_extracted
+
+        # Prevent parallel extractions
+        lock_path = output_path + ".lock"
+        with FileLock(lock_path):
+            shutil.rmtree(output_path_extracted, ignore_errors=True)
+            os.makedirs(output_path_extracted)
+            if is_zipfile(output_path):
+                with ZipFile(output_path, "r") as zip_file:
+                    zip_file.extractall(output_path_extracted)
+                    zip_file.close()
+            elif tarfile.is_tarfile(output_path):
+                tar_file = tarfile.open(output_path)
+                tar_file.extractall(output_path_extracted)
+                tar_file.close()
+            else:
+                raise EnvironmentError("Archive format of {} could not be identified".format(output_path))
+
+        return output_path_extracted
+
+    return output_path
+

 def split_s3_path(url):
    """Split a full s3 path into the bucket name and path."""
@@ -331,7 +380,14 @@ def http_get(url, temp_file, proxies=None, resume_size=0, user_agent=None):


 def get_from_cache(
-    url, cache_dir=None, force_download=False, proxies=None, etag_timeout=10, resume_download=False, user_agent=None
+    url,
+    cache_dir=None,
+    force_download=False,
+    proxies=None,
+    etag_timeout=10,
+    resume_download=False,
+    user_agent=None,
+    local_files_only=False,
 ) -> Optional[str]:
    """
    Given a URL, look for the corresponding file in the local cache.
@@ -348,18 +404,19 @@ def get_from_cache(

    os.makedirs(cache_dir, exist_ok=True)

-    # Get eTag to add to filename, if it exists.
-    if url.startswith("s3://"):
-        etag = s3_etag(url, proxies=proxies)
-    else:
-        try:
-            response = requests.head(url, allow_redirects=True, proxies=proxies, timeout=etag_timeout)
-            if response.status_code != 200:
-                etag = None
-            else:
-                etag = response.headers.get("ETag")
-        except (EnvironmentError, requests.exceptions.Timeout):
-            etag = None
+    etag = None
+    if not local_files_only:
+        # Get eTag to add to filename, if it exists.
+        if url.startswith("s3://"):
+            etag = s3_etag(url, proxies=proxies)
+        else:
+            try:
+                response = requests.head(url, allow_redirects=True, proxies=proxies, timeout=etag_timeout)
+                if response.status_code == 200:
+                    etag = response.headers.get("ETag")
+            except (EnvironmentError, requests.exceptions.Timeout):
+                # etag is already None
+                pass

    filename = url_to_filename(url, etag)

@@ -380,6 +437,15 @@ def get_from_cache(
            if len(matching_files) > 0:
                return os.path.join(cache_dir, matching_files[-1])
            else:
+                # If files cannot be found and local_files_only=True,
+                # the models might've been found if local_files_only=False
+                # Notify the user about that
+                if local_files_only:
+                    raise ValueError(
+                        "Cannot find the requested files in the cached path and outgoing traffic has been"
+                        " disabled. To enable model look-ups and downloads online, set 'local_files_only'"
+                        " to False."
+                    )
                return None

    # From now on, etag is not None.
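The extraction path added to `cached_path` can be tried outside the library with the standard library alone. The sketch below mirrors the naming rule ("model.zip" becomes "model-zip-extracted") and the reuse-unless-forced behaviour. It leaves out the `FileLock` guard used in the diff, so it is not safe for parallel callers, and it should only be pointed at archives you trust. `extraction_dir_for` and `extract_archive` are hypothetical names.

```python
import os
import shutil
import tarfile
import zipfile


def extraction_dir_for(archive_path):
    """Map './model.zip' to './model-zip-extracted', avoiding '.' in the folder name."""
    directory, filename = os.path.split(archive_path)
    return os.path.join(directory, filename.replace(".", "-") + "-extracted")


def extract_archive(archive_path, force_extract=False):
    """Extract a zip or tar archive next to itself and return the folder path.

    Paths that are not archives are returned unchanged.
    """
    is_zip = zipfile.is_zipfile(archive_path)
    is_tar = tarfile.is_tarfile(archive_path)
    if not is_zip and not is_tar:
        return archive_path

    target = extraction_dir_for(archive_path)
    # Reuse a previous, non-empty extraction unless the caller forces a refresh.
    if os.path.isdir(target) and os.listdir(target) and not force_extract:
        return target

    shutil.rmtree(target, ignore_errors=True)
    os.makedirs(target)
    if is_zip:
        with zipfile.ZipFile(archive_path, "r") as zip_file:
            zip_file.extractall(target)
    else:
        with tarfile.open(archive_path) as tar_file:
            tar_file.extractall(target)
    return target
```

Both archive types are opened in a `with` block, so the file handles are closed even if extraction fails partway.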
--- a/src/transformers/modeling_albert.py
+++ b/src/transformers/modeling_albert.py
@@ -117,7 +117,13 @@ def load_tf_weights_in_albert(model, config, tf_checkpoint_path):
        name = name.split("/")

        # Ignore the gradients applied by the LAMB/ADAM optimizers.
-        if "adam_m" in name or "adam_v" in name or "global_step" in name:
+        if (
+            "adam_m" in name
+            or "adam_v" in name
+            or "AdamWeightDecayOptimizer" in name
+            or "AdamWeightDecayOptimizer_1" in name
+            or "global_step" in name
+        ):
            logger.info("Skipping {}".format("/".join(name)))
            continue

@@ -594,7 +600,7 @@ class AlbertMLMHead(nn.Module):
        hidden_states = self.LayerNorm(hidden_states)
        hidden_states = self.decoder(hidden_states)

-        prediction_scores = hidden_states + self.bias
+        prediction_scores = hidden_states

        return prediction_scores

--- a/src/transformers/modeling_auto.py
+++ b/src/transformers/modeling_auto.py
@@ -21,6 +21,7 @@ from collections import OrderedDict
 from .configuration_auto import (
    AlbertConfig,
    AutoConfig,
+    BartConfig,
    BertConfig,
    CamembertConfig,
    CTRLConfig,
@@ -43,6 +44,7 @@ from .modeling_albert import (
    AlbertForSequenceClassification,
    AlbertModel,
 )
+from .modeling_bart import BART_PRETRAINED_MODEL_ARCHIVE_MAP, BartForMaskedLM, BartForSequenceClassification, BartModel
 from .modeling_bert import (
    BERT_PRETRAINED_MODEL_ARCHIVE_MAP,
    BertForMaskedLM,
@@ -118,6 +120,7 @@ ALL_PRETRAINED_MODEL_ARCHIVE_MAP = dict(
    (key, value)
    for pretrained_map in [
        BERT_PRETRAINED_MODEL_ARCHIVE_MAP,
+        BART_PRETRAINED_MODEL_ARCHIVE_MAP,
        OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP,
        TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_MAP,
        GPT2_PRETRAINED_MODEL_ARCHIVE_MAP,
@@ -142,6 +145,7 @@ MODEL_MAPPING = OrderedDict(
        (AlbertConfig, AlbertModel),
        (CamembertConfig, CamembertModel),
        (XLMRobertaConfig, XLMRobertaModel),
+        (BartConfig, BartModel),
        (RobertaConfig, RobertaModel),
        (BertConfig, BertModel),
        (OpenAIGPTConfig, OpenAIGPTModel),
@@ -161,6 +165,7 @@ MODEL_FOR_PRETRAINING_MAPPING = OrderedDict(
        (AlbertConfig, AlbertForMaskedLM),
        (CamembertConfig, CamembertForMaskedLM),
        (XLMRobertaConfig, XLMRobertaForMaskedLM),
+        (BartConfig, BartForMaskedLM),
        (RobertaConfig, RobertaForMaskedLM),
        (BertConfig, BertForPreTraining),
        (OpenAIGPTConfig, OpenAIGPTLMHeadModel),
@@ -180,6 +185,7 @@ MODEL_WITH_LM_HEAD_MAPPING = OrderedDict(
        (AlbertConfig, AlbertForMaskedLM),
        (CamembertConfig, CamembertForMaskedLM),
        (XLMRobertaConfig, XLMRobertaForMaskedLM),
+        (BartConfig, BartForMaskedLM),
        (RobertaConfig, RobertaForMaskedLM),
        (BertConfig, BertForMaskedLM),
        (OpenAIGPTConfig, OpenAIGPTLMHeadModel),
@@ -198,6 +204,7 @@ MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING = OrderedDict(
        (AlbertConfig, AlbertForSequenceClassification),
        (CamembertConfig, CamembertForSequenceClassification),
        (XLMRobertaConfig, XLMRobertaForSequenceClassification),
+        (BartConfig, BartForSequenceClassification),
        (RobertaConfig, RobertaForSequenceClassification),
        (BertConfig, BertForSequenceClassification),
        (XLNetConfig, XLNetForSequenceClassification),
@@ -352,16 +359,12 @@ class AutoModel(object):
                Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.

            kwargs: (`optional`) Remaining dictionary of keyword arguments:
-                Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). Behave differently depending on whether a `config` is provided or automatically loaded:
-
-                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)
-                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~transformers.PretrainedConfig.from_pretrained`). Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.
+                These arguments will be passed to the configuration and the model.

        Examples::

            model = AutoModel.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.
            model = AutoModel.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`
-            model = AutoModel.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading
            assert model.config.output_attention == True
            # Loading from a TF checkpoint file instead of a PyTorch model (slower)
            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')
@@ -496,24 +499,12 @@ class AutoModelForPreTraining(object):
            output_loading_info: (`optional`) boolean:
                Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.
            kwargs: (`optional`) Remaining dictionary of keyword arguments:
-                Can be used to update the configuration object (after it being loaded) and initiate the model.
-                (e.g. ``output_attention=True``). Behave differently depending on whether a `config` is provided or
-                automatically loaded:
-
-                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the
-                  underlying model's ``__init__`` method (we assume all relevant updates to the configuration have
-                  already been done)
-                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class
-                  initialization function (:func:`~transformers.PretrainedConfig.from_pretrained`). Each key of
-                  ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute
-                  with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration
-                  attribute will be passed to the underlying model's ``__init__`` function.
+                These arguments will be passed to the configuration and the model.

        Examples::

            model = AutoModelForPreTraining.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.
            model = AutoModelForPreTraining.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`
-            model = AutoModelForPreTraining.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading
            assert model.config.output_attention == True
            # Loading from a TF checkpoint file instead of a PyTorch model (slower)
            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')
@@ -650,24 +641,12 @@ class AutoModelWithLMHead(object):
            output_loading_info: (`optional`) boolean:
                Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.
            kwargs: (`optional`) Remaining dictionary of keyword arguments:
-                Can be used to update the configuration object (after it being loaded) and initiate the model.
-                (e.g. ``output_attention=True``). Behave differently depending on whether a `config` is provided or
-                automatically loaded:
-
-                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the
-                  underlying model's ``__init__`` method (we assume all relevant updates to the configuration have
-                  already been done)
-                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class
-                  initialization function (:func:`~transformers.PretrainedConfig.from_pretrained`). Each key of
-                  ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute
-                  with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration
-                  attribute will be passed to the underlying model's ``__init__`` function.
+                These arguments will be passed to the configuration and the model.

        Examples::

            model = AutoModelWithLMHead.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.
            model = AutoModelWithLMHead.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`
-            model = AutoModelWithLMHead.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading
            assert model.config.output_attention == True
            # Loading from a TF checkpoint file instead of a PyTorch model (slower)
            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')
@@ -807,16 +786,12 @@ class AutoModelForSequenceClassification(object):
                Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.

            kwargs: (`optional`) Remaining dictionary of keyword arguments:
-                Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). Behave differently depending on whether a `config` is provided or automatically loaded:
-
-                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)
-                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~transformers.PretrainedConfig.from_pretrained`). Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.
+                These arguments will be passed to the configuration and the model.

        Examples::

            model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.
            model = AutoModelForSequenceClassification.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`
-            model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading
            assert model.config.output_attention == True
            # Loading from a TF checkpoint file instead of a PyTorch model (slower)
            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')
@@ -950,16 +925,12 @@ class AutoModelForQuestionAnswering(object):
                Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.

            kwargs: (`optional`) Remaining dictionary of keyword arguments:
-                Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). Behave differently depending on whether a `config` is provided or automatically loaded:
-
-                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)
-                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~transformers.PretrainedConfig.from_pretrained`). Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.
+                These arguments will be passed to the configuration and the model.

        Examples::

            model = AutoModelForQuestionAnswering.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.
            model = AutoModelForQuestionAnswering.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`
-            model = AutoModelForQuestionAnswering.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading
            assert model.config.output_attention == True
            # Loading from a TF checkpoint file instead of a PyTorch model (slower)
            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')
@@ -1094,16 +1065,12 @@ class AutoModelForTokenClassification:
                Set to ``True`` to also return a dictionnary containing missing keys, unexpected keys and error messages.

            kwargs: (`optional`) Remaining dictionary of keyword arguments:
-                Can be used to update the configuration object (after it being loaded) and initiate the model. (e.g. ``output_attention=True``). Behave differently depending on whether a `config` is provided or automatically loaded:
-
-                - If a configuration is provided with ``config``, ``**kwargs`` will be directly passed to the underlying model's ``__init__`` method (we assume all relevant updates to the configuration have already been done)
-                - If a configuration is not provided, ``kwargs`` will be first passed to the configuration class initialization function (:func:`~transformers.PretrainedConfig.from_pretrained`). Each key of ``kwargs`` that corresponds to a configuration attribute will be used to override said attribute with the supplied ``kwargs`` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's ``__init__`` function.
+                These arguments will be passed to the configuration and the model.

        Examples::

            model = AutoModelForTokenClassification.from_pretrained('bert-base-uncased')    # Download model and configuration from S3 and cache.
            model = AutoModelForTokenClassification.from_pretrained('./test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`
-            model = AutoModelForTokenClassification.from_pretrained('bert-base-uncased', output_attention=True)  # Update configuration during loading
            assert model.config.output_attention == True
            # Loading from a TF checkpoint file instead of a PyTorch model (slower)
            config = AutoConfig.from_json_file('./tf_model/bert_tf_model_config.json')
--- a/src/transformers/modeling_bart.py
+++ b/src/transformers/modeling_bart.py
--- a/src/transformers/modeling_bert.py
+++ b/src/transformers/modeling_bert.py
@@ -24,6 +24,7 @@ import torch
 from torch import nn
 from torch.nn import CrossEntropyLoss, MSELoss

+from .activations import gelu, gelu_new, swish
 from .configuration_bert import BertConfig
 from .file_utils import add_start_docstrings, add_start_docstrings_to_callable
 from .modeling_utils import PreTrainedModel, prune_linear_layer
@@ -86,7 +87,10 @@ def load_tf_weights_in_bert(model, config, tf_checkpoint_path):
        name = name.split("/")
        # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculated m and v
        # which are not required for using pretrained model
-        if any(n in ["adam_v", "adam_m", "global_step"] for n in name):
+        if any(
+            n in ["adam_v", "adam_m", "AdamWeightDecayOptimizer", "AdamWeightDecayOptimizer_1", "global_step"]
+            for n in name
+        ):
            logger.info("Skipping {}".format("/".join(name)))
            continue
        pointer = model
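Note on the hunk above: the widened check skips any TensorFlow checkpoint variable whose slash-separated name contains an optimizer or bookkeeping component. A minimal standalone sketch of that filter follows. The helper name `should_skip` and the example variable names in the usage note are hypothetical and do not appear in the repository.

```python
# Path components that mark optimizer state or bookkeeping rather than model weights.
# The list mirrors the names in the hunk above.
SKIPPED_COMPONENTS = {
    "adam_v",
    "adam_m",
    "AdamWeightDecayOptimizer",
    "AdamWeightDecayOptimizer_1",
    "global_step",
}


def should_skip(variable_name):
    """Return True when any '/'-separated component of the name is in the skip set."""
    return any(part in SKIPPED_COMPONENTS for part in variable_name.split("/"))
```

For instance, a hypothetical name such as `bert/encoder/layer_0/output/dense/kernel/adam_m` would be skipped, while the same name without the `adam_m` suffix would be loaded.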
@@ -126,26 +130,6 @@ def load_tf_weights_in_bert(model, config, tf_checkpoint_path):
    return model


-def gelu(x):
-    """ Original Implementation of the gelu activation function in Google Bert repo when initially created.
-        For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):
-        0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
-        Also see https://arxiv.org/abs/1606.08415
-    """
-    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
-
-
-def gelu_new(x):
-    """ Implementation of the gelu activation function currently in Google Bert repo (identical to OpenAI GPT).
-        Also see https://arxiv.org/abs/1606.08415
-    """
-    return 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
-
-
-def swish(x):
-    return x * torch.sigmoid(x)
-
-
 def mish(x):
    return x * torch.tanh(nn.functional.softplus(x))
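Note on the hunk above: the removed `gelu`, `gelu_new` and `swish` definitions are now imported from `.activations`. The scalar sketch below restates the three formulas with the standard `math` module only, to show how close the erf-based and tanh-based GELU variants are. The function names `gelu_erf` and `gelu_tanh` are illustrative, and the library versions operate on tensors rather than floats.

```python
import math


def gelu_erf(x):
    """Exact GELU using the Gaussian error function (the removed `gelu`)."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Tanh approximation of GELU (the removed `gelu_new`, as used by OpenAI GPT)."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def swish(x):
    """Swish: x * sigmoid(x), written for a single float."""
    return x / (1.0 + math.exp(-x))
```

The two GELU variants agree closely for moderate inputs but are not bit-identical, which is why the diff keeps them as separate names.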

@@ -487,7 +471,7 @@ class BertLMPredictionHead(nn.Module):

    def forward(self, hidden_states):
        hidden_states = self.transform(hidden_states)
-        hidden_states = self.decoder(hidden_states) + self.bias
+        hidden_states = self.decoder(hidden_states)
        return hidden_states


@@ -730,8 +714,8 @@ class BertModel(BertPreTrainedModel):
                seq_ids = torch.arange(seq_length, device=device)
                causal_mask = seq_ids[None, None, :].repeat(batch_size, seq_length, 1) <= seq_ids[None, :, None]
                causal_mask = causal_mask.to(
-                    torch.long
-                )  # not converting to long will cause errors with pytorch version < 1.3
+                    attention_mask.dtype
+                )  # causal and attention masks must have same type with pytorch version < 1.3
                extended_attention_mask = causal_mask[:, None, :, :] * attention_mask[:, None, None, :]
            else:
                extended_attention_mask = attention_mask[:, None, None, :]
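Note on the hunk above: the causal mask lets position `i` attend only to positions `j <= i`, and it is then multiplied by the padding mask, so both must share a dtype. The sketch below reproduces that logic for a single sequence with plain Python lists. It is a simplified illustration (no batch or head dimensions), and the helper names are hypothetical.

```python
def causal_mask(seq_length):
    """mask[i][j] is 1 when query position i may attend to key position j (j <= i)."""
    return [[1 if j <= i else 0 for j in range(seq_length)] for i in range(seq_length)]


def extended_mask(attention_mask):
    """Combine the causal mask with a per-token padding mask (1 = real token, 0 = padding)."""
    n = len(attention_mask)
    causal = causal_mask(n)
    # Element-wise product: a key position is visible only if it is both
    # at or before the query position and not padding.
    return [[causal[i][j] * attention_mask[j] for j in range(n)] for i in range(n)]
```

With a padding mask of `[1, 1, 0]`, the last column of the combined mask is zero in every row, so no position attends to the padded token.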
--- a/src/transformers/modeling_camembert.py
+++ b/src/transformers/modeling_camembert.py
@@ -15,7 +15,6 @@
 # limitations under the License.
 """PyTorch CamemBERT model. """

-
 import logging

 from .configuration_camembert import CamembertConfig
@@ -23,6 +22,7 @@ from .file_utils import add_start_docstrings
 from .modeling_roberta import (
    RobertaForMaskedLM,
    RobertaForMultipleChoice,
+    RobertaForQuestionAnswering,
    RobertaForSequenceClassification,
    RobertaForTokenClassification,
    RobertaModel,
@@ -37,7 +37,6 @@ CAMEMBERT_PRETRAINED_MODEL_ARCHIVE_MAP = {
    "umberto-wikipedia-uncased-v1": "https://s3.amazonaws.com/models.huggingface.co/bert/Musixmatch/umberto-wikipedia-uncased-v1/pytorch_model.bin",
 }

-
 CAMEMBERT_START_DOCSTRING = r"""

    This model is a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ sub-class.
@@ -46,7 +45,8 @@ CAMEMBERT_START_DOCSTRING = r"""

    Parameters:
        config (:class:`~transformers.CamembertConfig`): Model configuration class with all the parameters of the
-            model. Initializing with a config file does not load the weights associated with the model, only the configuration.
+            model. Initializing with a config file does not load the weights associated with the model, only the
+            configuration.
            Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model weights.
 """

@@ -121,3 +121,18 @@ class CamembertForTokenClassification(RobertaForTokenClassification):

    config_class = CamembertConfig
    pretrained_model_archive_map = CAMEMBERT_PRETRAINED_MODEL_ARCHIVE_MAP
+
+
+@add_start_docstrings(
+    """CamemBERT Model with a span classification head on top for extractive question-answering tasks like SQuAD
+    (a linear layers on top of the hidden-states output to compute `span start logits` and `span end logits` """,
+    CAMEMBERT_START_DOCSTRING,
+)
+class CamembertForQuestionAnswering(RobertaForQuestionAnswering):
+    """
+    This class overrides :class:`~transformers.RobertaForQuestionAnswering`. Please check the
+    superclass for the appropriate documentation alongside usage examples.
+    """
+
+    config_class = CamembertConfig
+    pretrained_model_archive_map = CAMEMBERT_PRETRAINED_MODEL_ARCHIVE_MAP
--- a/src/transformers/modeling_distilbert.py
+++ b/src/transformers/modeling_distilbert.py
@@ -27,6 +27,7 @@ import torch
 import torch.nn as nn
 from torch.nn import CrossEntropyLoss

+from .activations import gelu
 from .configuration_distilbert import DistilBertConfig
 from .file_utils import add_start_docstrings, add_start_docstrings_to_callable
 from .modeling_utils import PreTrainedModel, prune_linear_layer
@@ -38,6 +39,8 @@ logger = logging.getLogger(__name__)
 DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP = {
    "distilbert-base-uncased": "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-pytorch_model.bin",
    "distilbert-base-uncased-distilled-squad": "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-distilled-squad-pytorch_model.bin",
+    "distilbert-base-cased": "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-cased-pytorch_model.bin",
+    "distilbert-base-cased-distilled-squad": "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-cased-distilled-squad-pytorch_model.bin",
    "distilbert-base-german-cased": "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-german-cased-pytorch_model.bin",
    "distilbert-base-multilingual-cased": "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-multilingual-cased-pytorch_model.bin",
    "distilbert-base-uncased-finetuned-sst-2-english": "https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-finetuned-sst-2-english-pytorch_model.bin",
@@ -45,8 +48,6 @@ DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP = {


 # UTILS AND BUILDING BLOCKS OF THE ARCHITECTURE #
-def gelu(x):
-    return 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0)))


 def create_sinusoidal_embeddings(n_pos, dim, out):
@@ -216,11 +217,6 @@ class TransformerBlock(nn.Module):
    def __init__(self, config):
        super().__init__()

-        self.n_heads = config.n_heads
-        self.dim = config.dim
-        self.hidden_dim = config.hidden_dim
-        self.dropout = nn.Dropout(p=config.dropout)
-        self.activation = config.activation
        self.output_attentions = config.output_attentions

        assert config.dim % config.n_heads == 0
@@ -440,8 +436,8 @@ class DistilBertModel(DistilBertPreTrainedModel):
        from transformers import DistilBertTokenizer, DistilBertModel
        import torch

-        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
-        model = DistilBertModel.from_pretrained('distilbert-base-uncased')
+        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')
+        model = DistilBertModel.from_pretrained('distilbert-base-cased')

        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0)  # Batch size 1
        outputs = model(input_ids)
@@ -544,8 +540,8 @@ class DistilBertForMaskedLM(DistilBertPreTrainedModel):
        from transformers import DistilBertTokenizer, DistilBertForMaskedLM
        import torch

-        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
-        model = DistilBertForMaskedLM.from_pretrained('distilbert-base-uncased')
+        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')
+        model = DistilBertForMaskedLM.from_pretrained('distilbert-base-cased')
        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0)  # Batch size 1
        outputs = model(input_ids, masked_lm_labels=input_ids)
        loss, prediction_scores = outputs[:2]
@@ -619,8 +615,8 @@ class DistilBertForSequenceClassification(DistilBertPreTrainedModel):
        from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
        import torch

-        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
-        model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
+        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')
+        model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-cased')
        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0)  # Batch size 1
        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
        outputs = model(input_ids, labels=labels)
@@ -711,8 +707,8 @@ class DistilBertForQuestionAnswering(DistilBertPreTrainedModel):
        from transformers import DistilBertTokenizer, DistilBertForQuestionAnswering
        import torch

-        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
-        model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-uncased')
+        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')
+        model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-cased')
        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0)  # Batch size 1
        start_positions = torch.tensor([1])
        end_positions = torch.tensor([3])
@@ -798,8 +794,8 @@ class DistilBertForTokenClassification(DistilBertPreTrainedModel):
        from transformers import DistilBertTokenizer, DistilBertForTokenClassification
        import torch

-        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
-        model = DistilBertForTokenClassification.from_pretrained('distilbert-base-uncased')
+        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')
+        model = DistilBertForTokenClassification.from_pretrained('distilbert-base-cased')
        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
        labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1
        outputs = model(input_ids, labels=labels)
--- a/src/transformers/modeling_encoder_decoder.py
+++ b/src/transformers/modeling_encoder_decoder.py
@@ -18,7 +18,6 @@
 import logging
 import os

-import torch
 from torch import nn

 from .modeling_auto import AutoModel, AutoModelWithLMHead
@@ -232,46 +231,10 @@ class PreTrainedEncoderDecoder(nn.Module):
            encoder_outputs = ()

        kwargs_decoder["encoder_hidden_states"] = encoder_hidden_states
-        decoder_outputs = self.decoder(decoder_input_ids, encoder_hidden_states, **kwargs_decoder)
+        decoder_outputs = self.decoder(decoder_input_ids, **kwargs_decoder)

        return decoder_outputs + encoder_outputs

-    @staticmethod
-    def prepare_model_kwargs(**kwargs):
-        """ Prepare the encoder and decoder's keyword arguments.
-
-        Keyword arguments come in 3 flavors:
-        - encoder-specific (prefixed by `encoder_`)
-        - decoder-specific (prefixed by `decoder_`)
-        - those that apply to the model as whole.
-
-        We let the specific kwargs override the common ones in case of
-        conflict.
-        """
-        kwargs_common = {
-            argument: value
-            for argument, value in kwargs.items()
-            if not argument.startswith("encoder_") and not argument.startswith("decoder_")
-        }
-        decoder_kwargs = kwargs_common.copy()
-        encoder_kwargs = kwargs_common.copy()
-        encoder_kwargs.update(
-            {
-                argument[len("encoder_") :]: value
-                for argument, value in kwargs.items()
-                if argument.startswith("encoder_")
-            }
-        )
-        decoder_kwargs.update(
-            {
-                argument[len("decoder_") :]: value
-                for argument, value in kwargs.items()
-                if argument.startswith("decoder_")
-            }
-        )
-        decoder_kwargs["encoder_attention_mask"] = encoder_kwargs.get("attention_mask", None)
-        return encoder_kwargs, decoder_kwargs
-

 class Model2Model(PreTrainedEncoderDecoder):
    r"""
@@ -330,21 +293,3 @@ class Model2Model(PreTrainedEncoderDecoder):
        )

        return model
-
-
-class Model2LSTM(PreTrainedEncoderDecoder):
-    @classmethod
-    def from_pretrained(cls, *args, **kwargs):
-        if kwargs.get("decoder_model", None) is None:
-            # We will create a randomly initilized LSTM model as decoder
-            if "decoder_config" not in kwargs:
-                raise ValueError(
-                    "To load an LSTM in Encoder-Decoder model, please supply either: "
-                    "    - a torch.nn.LSTM model as `decoder_model` parameter (`decoder_model=lstm_model`), or"
-                    "    - a dictionary of configuration parameters that will be used to initialize a"
-                    "      torch.nn.LSTM model as `decoder_config` keyword argument. "
-                    "      E.g. `decoder_config={'input_size': 768, 'hidden_size': 768, 'num_layers': 2}`"
-                )
-            kwargs["decoder_model"] = torch.nn.LSTM(kwargs.pop("decoder_config"))
-        model = super().from_pretrained(*args, **kwargs)
-        return model
--- a/src/transformers/modeling_flaubert.py
+++ b/src/transformers/modeling_flaubert.py
@@ -231,7 +231,7 @@ class FlaubertModel(XLMModel):
            inputs_embeds = self.embeddings(input_ids)

        tensor = inputs_embeds + self.position_embeddings(position_ids).expand_as(inputs_embeds)
-        if langs is not None and self.use_lang_emb:
+        if langs is not None and self.use_lang_emb and self.config.n_langs > 1:
            tensor = tensor + self.lang_embeddings(langs)
        if token_type_ids is not None:
            tensor = tensor + self.embeddings(token_type_ids)
--- a/src/transformers/modeling_gpt2.py
+++ b/src/transformers/modeling_gpt2.py
@@ -24,6 +24,7 @@ import torch
 import torch.nn as nn
 from torch.nn import CrossEntropyLoss

+from .activations import gelu_new
 from .configuration_gpt2 import GPT2Config
 from .file_utils import add_start_docstrings, add_start_docstrings_to_callable
 from .modeling_utils import Conv1D, PreTrainedModel, SequenceSummary, prune_conv1d_layer
@@ -95,10 +96,6 @@ def load_tf_weights_in_gpt2(model, config, gpt2_checkpoint_path):
    return model


-def gelu(x):
-    return 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
-
-
 class Attention(nn.Module):
    def __init__(self, nx, n_ctx, config, scale=False):
        super().__init__()
@@ -206,7 +203,7 @@ class MLP(nn.Module):
        nx = config.n_embd
        self.c_fc = Conv1D(n_state, nx)
        self.c_proj = Conv1D(nx, n_state)
-        self.act = gelu
+        self.act = gelu_new
        self.dropout = nn.Dropout(config.resid_pdrop)

    def forward(self, x):