Release: 2.1.1

Merge pull request #1488 from huggingface/pytorch-tpu
GLUE on TPU
2019-10-11 06:37:49 -04:00 · 2019-10-11 16:33:00 +02:00 · 2019-10-11 16:25:43 +02:00 · 2019-10-11 16:18:16 +02:00 · 2019-10-11 16:09:42 +02:00 · 2019-10-11 15:55:01 +02:00
109 changed files with 4770 additions and 1109 deletions
--- a/.circleci/config.yml
+++ b/.circleci/config.yml
@@ -9,7 +9,7 @@ jobs:
        steps:
            - checkout
            - run: sudo pip install torch
-            - run: sudo pip install tensorflow==2.0.0-rc0
+            - run: sudo pip install tensorflow
            - run: sudo pip install --progress-bar off .
            - run: sudo pip install pytest codecov pytest-cov
            - run: sudo pip install tensorboardX scikit-learn
@@ -38,7 +38,7 @@ jobs:
        parallelism: 1
        steps:
            - checkout
-            - run: sudo pip install tensorflow==2.0.0-rc0
+            - run: sudo pip install tensorflow
            - run: sudo pip install --progress-bar off .
            - run: sudo pip install pytest codecov pytest-cov
            - run: sudo pip install tensorboardX scikit-learn
@@ -65,7 +65,7 @@ jobs:
            - image: circleci/python:2.7
        steps:
            - checkout
-            - run: sudo pip install tensorflow==2.0.0-rc0
+            - run: sudo pip install tensorflow
            - run: sudo pip install --progress-bar off .
            - run: sudo pip install pytest codecov pytest-cov
            - run: python -m pytest -sv ./transformers/tests/ --cov
@@ -81,7 +81,6 @@ jobs:
            - checkout
            - run: sudo pip install --progress-bar off -r docs/requirements.txt
            - run: sudo pip install --progress-bar off -r requirements.txt
-            - run: cd docs/source && ln -s ../../examples/README.md examples.md && cd -
            - run: cd docs && make clean && make html && scp -r -oStrictHostKeyChecking=no _build/html/* $doc:$dir
 workflow_filters: &workflow_filters
    filters:
--- a/.github/ISSUE_TEMPLATE/--new-model-addition.md
+++ b/.github/ISSUE_TEMPLATE/--new-model-addition.md
@@ -0,0 +1,23 @@
+---
+name: "\U0001F31FNew model addition"
+about: Submit a proposal/request to implement a new Transformer-based model
+title: ''
+labels: ''
+assignees: ''
+
+---
+
+# 🌟New model addition
+
+## Model description
+
+<!-- Important information -->
+
+## Open Source status
+
+* [ ] the model implementation is available: (give details)
+* [ ] the model weights are available: (give details)
+
+## Additional context
+
+<!-- Add any other context about the problem here. -->
--- a/.github/ISSUE_TEMPLATE/bug-report.md
+++ b/.github/ISSUE_TEMPLATE/bug-report.md
@@ -1,6 +1,10 @@
 ---
 name: "\U0001F41B Bug Report"
 about: Submit a bug report to help us improve PyTorch Transformers
+title: ''
+labels: ''
+assignees: ''
+
 ---

 ## 🐛 Bug
@@ -45,4 +49,4 @@ Steps to reproduce the behavior:

 ## Additional context

-<!-- Add any other context about the problem here. -->
+<!-- Add any other context about the problem here. -->
--- a/.github/ISSUE_TEMPLATE/feature-request.md
+++ b/.github/ISSUE_TEMPLATE/feature-request.md
@@ -1,6 +1,10 @@
 ---
 name: "\U0001F680 Feature Request"
 about: Submit a proposal/request for a new PyTorch Transformers feature
+title: ''
+labels: ''
+assignees: ''
+
 ---

 ## 🚀 Feature
@@ -13,4 +17,4 @@ about: Submit a proposal/request for a new PyTorch Transformers feature

 ## Additional context

-<!-- Add any other context or screenshots about the feature request here. -->
+<!-- Add any other context or screenshots about the feature request here. -->
--- a/.github/ISSUE_TEMPLATE/migration.md
+++ b/.github/ISSUE_TEMPLATE/migration.md
@@ -1,6 +1,10 @@
 ---
 name: "\U0001F4DA Migration from PyTorch-pretrained-Bert"
 about: Report a problem when migrating from PyTorch-pretrained-Bert to Transformers
+title: ''
+labels: ''
+assignees: ''
+
 ---

 ## 📚 Migration
@@ -40,4 +44,4 @@ Details of the issue:

 ## Additional context

-<!-- Add any other context about the problem here. -->
+<!-- Add any other context about the problem here. -->
--- a/.github/ISSUE_TEMPLATE/question-help.md
+++ b/.github/ISSUE_TEMPLATE/question-help.md
@@ -1,8 +1,12 @@
 ---
 name: "❓Questions & Help"
 about: Start a general discussion related to PyTorch Transformers
+title: ''
+labels: ''
+assignees: ''
+
 ---

 ## ❓ Questions & Help

-<!-- A clear and concise description of the question. -->
+<!-- A clear and concise description of the question. -->
--- a/.gitignore
+++ b/.gitignore
@@ -118,6 +118,9 @@ dmypy.json
 # vscode
 .vscode

+# Pycharm
+.idea
+
 # TF code
 tensorflow_code

@@ -131,4 +134,7 @@ examples/runs

 # data
 /data
-serialization_dir
+serialization_dir
+
+# emacs
+*.*~
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -0,0 +1,175 @@
+# How to contribute to transformers?
+
+Everyone is welcome to contribute, and we value everybody's contribution. Code
+is thus not the only way to help the community. Answering questions, helping
+others, reaching out and improving the documentation are immensely valuable to
+the community.
+
+It also helps us if you spread the word: reference the library from blog posts
+on the awesome projects it made possible, shout out on Twitter every time it has
+helped you, or simply star the repo to say "thank you".
+
+## You can contribute in so many ways!
+
+There are 4 ways you can contribute to transformers:
+* Fixing outstanding issues with the existing code;
+* Implementing new models;
+* Contributing to the examples or to the documentation;
+* Submitting issues related to bugs or desired new features.
+
+*All are equally valuable to the community.*
+
+## Submitting a new issue or feature request
+
+Do your best to follow these guidelines when submitting an issue or a feature
+request. It will make it easier for us to come back to you quickly and with good
+feedback.
+
+### Did you find a bug?
+
+The `transformers` library is robust and reliable thanks to the users who notify us of
+the problems they encounter. So thank you for reporting an issue.
+
+First, we would really appreciate it if you could **make sure the bug was not
+already reported** (use the search bar on GitHub under Issues).
+
+Did not find it? :( So we can act quickly on it, please follow these steps:
+
+* Include your **OS type and version**, the versions of **Python**, **PyTorch** and
+  **Tensorflow** when applicable;
+* A short, self-contained, code snippet that allows us to reproduce the bug in
+  less than 30s;
+* Provide the *full* traceback if an exception is raised.
+
+To get the OS and software versions, execute the following code and copy-paste
+the output:
+
+```python
+import platform; print("Platform", platform.platform())
+import sys; print("Python", sys.version)
+import torch; print("PyTorch", torch.__version__)
+import tensorflow; print("Tensorflow", tensorflow.__version__)
+```
+
+### Do you want to implement a new model?
+
+Awesome! Please provide the following information:
+
+* Short description of the model and link to the paper;
+* Link to the implementation if it is open-source;
+* Link to the model weights if they are available.
+
+If you are willing to contribute the model yourself, let us know so we can best
+guide you.
+
+### Do you want a new feature (that is not a model)?
+
+A world-class feature request addresses the following points:
+
+1. Motivation first:
+  * Is it related to a problem/frustration with the library? If so, please explain
+    why. Providing a code snippet that demonstrates the problem is best.
+  * Is it related to something you would need for a project? We'd love to hear
+    about it!
+  * Is it something you worked on and think could benefit the community?
+    Awesome! Tell us what problem it solved for you.
+2. Write a *full paragraph* describing the feature;
+3. Provide a **code snippet** that demonstrates its future use;
+4. In case this is related to a paper, please attach a link;
+5. Attach any additional information (drawings, screenshots, etc.) you think may help.
+
+If your issue is well written we're already 80% of the way there by the time you
+post it.
+
+## Start contributing! (Pull Requests)
+
+Before writing code, we strongly advise you to search through the existing PRs or
+issues to make sure that nobody is already working on the same thing. If you are
+unsure, it is always a good idea to open an issue to get some feedback.
+
+You will need basic `git` proficiency to be able to contribute to
+`transformers`. `git` is not the easiest tool to use but it has the greatest
+manual. Type `git --help` in a shell and enjoy. If you prefer books, [Pro
+Git](https://git-scm.com/book/en/v2) is a very good reference.
+
+Follow these steps to start contributing:
+
+1. Fork the [repository](https://github.com/huggingface/transformers) by
+   clicking on the 'Fork' button on the repository's page. This creates a copy of the code
+   under your GitHub user account.
+2. Clone your fork to your local disk, and add the base repository as a remote:
+   
+   ```bash
+   $ git clone git@github.com:<your Github handle>/transformers.git
+   $ cd transformers
+   $ git remote add upstream git@github.com:huggingface/transformers.git
+   ```
+
+3. Create a new branch to hold your development changes:
+
+   ```bash
+   $ git checkout -b a-descriptive-name-for-my-changes
+   ```
+   
+   **Do not** work on the `master` branch.
+   
+4. Set up a development environment by running the following command in a virtual environment:
+
+   ```bash
+   $ pip install -r requirements-dev.txt
+   ```
+
+5. Develop the features on your branch. Add changed files using `git add` and
+   then `git commit` to record your changes locally:
+   
+   ```bash
+   $ git add modified_file.py
+   $ git commit
+   ```
+   
+   Please write [good commit
+   messages](https://chris.beams.io/posts/git-commit/). It
+   is a good idea to sync your copy of the code with the original repository
+   regularly. This way you can quickly account for changes:
+   
+   ```bash
+   $ git fetch upstream
+   $ git rebase upstream/master
+   ```
+   
+   Push the changes to your account using:
+   
+   ```bash
+   $ git push -u origin a-descriptive-name-for-my-changes
+   ```
+   
+6. Once you are satisfied (**and the checklist below is happy too**), go to the
+   webpage of your fork on GitHub. Click on 'Pull request' to send your changes
+   to the project maintainers for review.
+   
+7. It's ok if maintainers ask you for changes. It happens to core contributors
+   too! So everyone can see the changes in the Pull request, work in your local
+   branch and push the changes to your fork. They will automatically appear in
+   the pull request.
+
+
+### Checklist
+
+1. The title of your pull request should be a summary of its contribution;
+2. If your pull request addresses an issue, please mention the issue number in
+   the pull request description to make sure they are linked (and people
+   consulting the issue know you are working on it);
+3. To indicate a work in progress please prefix the title with `[WIP]`. These
+   are useful to avoid duplicated work, and to differentiate it from PRs ready
+   to be merged;
+4. Make sure pre-existing tests still pass;
+5. Add high-coverage tests. No quality test, no merge;
+6. All public methods must have informative docstrings.
+
+
+### Style guide
+
+For documentation strings, `transformers` follows the [Google
+style](https://google.github.io/styleguide/pyguide.html).
+
+#### This guide was heavily inspired by the awesome [scikit-learn guide to contributing](https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md)
--- a/README.md
+++ b/README.md
@@ -4,7 +4,7 @@
    <br>
 <p>
 <p align="center">
-    <a href="https://github.com/huggingface/transformers/blob/master/LICENSE">
+    <a href="https://circleci.com/gh/huggingface/transformers">
        <img alt="Build" src="https://img.shields.io/circleci/build/github/huggingface/transformers/master">
    </a>
    <a href="https://github.com/huggingface/transformers/blob/master/LICENSE">
@@ -22,7 +22,7 @@
 <p>State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch
 </h3>

-🤗 Transformers (formerly known as `pytorch-transformers` and `pytorch-pretrained-bert`) provides state-of-the-art general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet...) for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch.
+🤗 Transformers (formerly known as `pytorch-transformers` and `pytorch-pretrained-bert`) provides state-of-the-art general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet, CTRL...) for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with 32+ pretrained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch.

 ### Features

@@ -54,19 +54,22 @@ Choose the right framework for every part of a model's lifetime
 | [Model architectures](#model-architectures) | Architectures (with pretrained weights) |
 | [Online demo](#online-demo) | Experimenting with this repo’s text generation capabilities |
 | [Quick tour: Usage](#quick-tour) | Tokenizers & models usage: Bert and GPT-2 |
-| [Quick tour: TF 2.0 and PyTorch ](#Quick-tour-TF-2.0-training-and-PyTorch-interoperability) | Train a TF 2.0 model in 10 lines of code, load it in PyTorch |
+| [Quick tour: TF 2.0 and PyTorch ](#Quick-tour-TF-20-training-and-PyTorch-interoperability) | Train a TF 2.0 model in 10 lines of code, load it in PyTorch |
 | [Quick tour: Fine-tuning/usage scripts](#quick-tour-of-the-fine-tuningusage-scripts) | Using provided scripts: GLUE, SQuAD and Text generation |
-| [Migrating from pytorch-transformers to transformers](#Migrating-from-pytorch-transformers-to-transformers) | Migrating your code from pytorch-pretrained-bert to transformers |
+| [Migrating from pytorch-transformers to transformers](#Migrating-from-pytorch-transformers-to-transformers) | Migrating your code from pytorch-transformers to transformers |
 | [Migrating from pytorch-pretrained-bert to pytorch-transformers](#Migrating-from-pytorch-pretrained-bert-to-transformers) | Migrating your code from pytorch-pretrained-bert to transformers |
 | [Documentation](https://huggingface.co/transformers/) | Full API documentation and more |

 ## Installation

-This repo is tested on Python 2.7 and 3.5+ (examples are tested only on python 3.5+) and PyTorch 1.0.0+
+This repo is tested on Python 2.7 and 3.5+ (examples are tested only on python 3.5+), PyTorch 1.0.0+ and TensorFlow 2.0.0-rc1

 ### With pip

-Transformers can be installed by pip as follows:
+First you need to install one of, or both, TensorFlow 2.0 and PyTorch.
+Please refer to the [TensorFlow installation page](https://www.tensorflow.org/install/pip#tensorflow-2.0-rc-is-available) and/or the [PyTorch installation page](https://pytorch.org/get-started/locally/#start-locally) for the specific install command for your platform.
+
+When TensorFlow 2.0 and/or PyTorch has been installed, 🤗 Transformers can be installed using pip as follows:

 ```bash
 pip install transformers
@@ -74,7 +77,10 @@ pip install transformers

 ### From source

-Clone the repository and run:
+Here also, you first need to install one of, or both, TensorFlow 2.0 and PyTorch.
+Please refer to the [TensorFlow installation page](https://www.tensorflow.org/install/pip#tensorflow-2.0-rc-is-available) and/or the [PyTorch installation page](https://pytorch.org/get-started/locally/#start-locally) for the specific install command for your platform.
+
+When TensorFlow 2.0 and/or PyTorch has been installed, you can install from source by cloning the repository and running:

 ```bash
 pip install [--editable] .
@@ -82,10 +88,12 @@ pip install [--editable] .

 ### Tests

-A series of tests is included for the library and the example scripts. Library tests can be found in the [tests folder](https://github.com/huggingface/transformers/tree/master/transformers/tests) and examples tests in the [examples folder](https://github.com/huggingface/transformers/tree/master/examples).
+A series of tests are included for the library and the example scripts. Library tests can be found in the [tests folder](https://github.com/huggingface/transformers/tree/master/transformers/tests) and examples tests in the [examples folder](https://github.com/huggingface/transformers/tree/master/examples).

 These tests can be run using `pytest` (install pytest if needed with `pip install pytest`).

+Depending on which framework is installed (TensorFlow 2.0 and/or PyTorch), the irrelevant tests will be skipped. Ensure that both frameworks are installed if you want to execute all tests.
+
 You can run the tests from the root of the cloned repository with the commands:

 ```bash
@@ -97,10 +105,9 @@ python -m pytest -sv ./examples/

 You should check out our [`swift-coreml-transformers`](https://github.com/huggingface/swift-coreml-transformers) repo.

-It contains an example of a conversion script from a Pytorch trained Transformer model (here, `GPT-2`) to a CoreML model that runs on iOS devices.
+It contains a set of tools to convert PyTorch or TensorFlow 2.0 trained Transformer models (currently contains `GPT-2`, `DistilGPT-2`, `BERT`, and `DistilBERT`) to CoreML models that run on iOS devices.

-At some point in the future, you'll be able to seamlessly move from pre-training or fine-tuning models in PyTorch to productizing them in CoreML,
-or prototype a model or an app in CoreML then research its hyperparameters or architecture from PyTorch. Super exciting!
+At some point in the future, you'll be able to seamlessly move from pre-training or fine-tuning models to productizing them in CoreML, or prototype a model or an app in CoreML then research its hyperparameters or architecture from TensorFlow 2.0 and/or PyTorch. Super exciting!

 ## Model architectures

@@ -113,8 +120,8 @@ or prototype a model or an app in CoreML then research its hyperparameters or ar
 5. **[XLNet](https://github.com/zihangdai/xlnet/)** (from Google/CMU) released with the paper [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
 6. **[XLM](https://github.com/facebookresearch/XLM/)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
 7. **[RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta)** (from Facebook), released together with the paper a [Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
-8. **[DistilBERT](https://github.com/huggingface/transformers/tree/master/examples/distillation)** (from HuggingFace), released together with the blogpost [Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT](https://medium.com/huggingface/distilbert-8cf3380435b5
-) by Victor Sanh, Lysandre Debut and Thomas Wolf.
+8. **[DistilBERT](https://github.com/huggingface/transformers/tree/master/examples/distillation)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/master/examples/distillation).
+9. **[CTRL](https://github.com/salesforce/ctrl/)** (from Salesforce) released with the paper [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher.

 These implementations have been tested on several datasets (see the example scripts) and should match the performances of the original implementations (e.g. ~93 F1 on SQuAD for BERT Whole-Word-Masking, ~88 F1 on RocStories for OpenAI GPT, ~18.3 perplexity on WikiText 103 for Transformer-XL, ~0.916 Peason R coefficient on STS-B for XLNet). You can find more details on the performances in the Examples section of the [documentation](https://huggingface.co/transformers/examples.html).

@@ -141,6 +148,7 @@ from transformers import *
 MODELS = [(BertModel,       BertTokenizer,       'bert-base-uncased'),
          (OpenAIGPTModel,  OpenAIGPTTokenizer,  'openai-gpt'),
          (GPT2Model,       GPT2Tokenizer,       'gpt2'),
+          (CTRLModel,       CTRLTokenizer,       'ctrl'),
          (TransfoXLModel,  TransfoXLTokenizer,  'transfo-xl-wt103'),
          (XLNetModel,      XLNetTokenizer,      'xlnet-base-cased'),
          (XLMModel,        XLMTokenizer,        'xlm-mlm-enfr-1024'),
@@ -173,24 +181,24 @@ for model_class in BERT_MODEL_CLASSES:
    # Load pretrained model/tokenizer
    model = model_class.from_pretrained('bert-base-uncased')

-# Models can return full list of hidden-states & attentions weights at each layer
-model = model_class.from_pretrained(pretrained_weights,
-                                    output_hidden_states=True,
-                                    output_attentions=True)
-input_ids = torch.tensor([tokenizer.encode("Let's see all hidden-states and attentions on this text")])
-all_hidden_states, all_attentions = model(input_ids)[-2:]
+    # Models can return full list of hidden-states & attentions weights at each layer
+    model = model_class.from_pretrained(pretrained_weights,
+                                        output_hidden_states=True,
+                                        output_attentions=True)
+    input_ids = torch.tensor([tokenizer.encode("Let's see all hidden-states and attentions on this text")])
+    all_hidden_states, all_attentions = model(input_ids)[-2:]

-# Models are compatible with Torchscript
-model = model_class.from_pretrained(pretrained_weights, torchscript=True)
-traced_model = torch.jit.trace(model, (input_ids,))
+    # Models are compatible with Torchscript
+    model = model_class.from_pretrained(pretrained_weights, torchscript=True)
+    traced_model = torch.jit.trace(model, (input_ids,))

-# Simple serialization for models and tokenizers
-model.save_pretrained('./directory/to/save/')  # save
-model = model_class.from_pretrained('./directory/to/save/')  # re-load
-tokenizer.save_pretrained('./directory/to/save/')  # save
-tokenizer = tokenizer_class.from_pretrained('./directory/to/save/')  # re-load
+    # Simple serialization for models and tokenizers
+    model.save_pretrained('./directory/to/save/')  # save
+    model = model_class.from_pretrained('./directory/to/save/')  # re-load
+    tokenizer.save_pretrained('./directory/to/save/')  # save
+    tokenizer = BertTokenizer.from_pretrained('./directory/to/save/')  # re-load

-# SOTA examples for GLUE, SQUAD, text generation...
+    # SOTA examples for GLUE, SQUAD, text generation...
 ```

 ## Quick tour TF 2.0 training and PyTorch interoperability
@@ -200,7 +208,7 @@ Let's do a quick example of how a TensorFlow 2.0 model can be trained in 12 line
 ```python
 import tensorflow as tf
 import tensorflow_datasets
-from pytorch_transformers import *
+from transformers import *

 # Load dataset, tokenizer, model from pretrained model/vocabulary
 tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
@@ -208,8 +216,8 @@ model = TFBertForSequenceClassification.from_pretrained('bert-base-cased')
 data = tensorflow_datasets.load('glue/mrpc')

 # Prepare dataset for GLUE as a tf.data.Dataset instance
-train_dataset = glue_convert_examples_to_features(data['train'], tokenizer, 128, 'mrpc')
-valid_dataset = glue_convert_examples_to_features(data['validation'], tokenizer, 128, 'mrpc')
+train_dataset = glue_convert_examples_to_features(data['train'], tokenizer, max_length=128, task='mrpc')
+valid_dataset = glue_convert_examples_to_features(data['validation'], tokenizer, max_length=128, task='mrpc')
 train_dataset = train_dataset.shuffle(100).batch(32).repeat(2)
 valid_dataset = valid_dataset.batch(64)

@@ -246,7 +254,7 @@ The library comprises several example scripts with SOTA performances for NLU and

 - `run_glue.py`: an example fine-tuning Bert, XLNet and XLM on nine different GLUE tasks (*sequence-level classification*)
 - `run_squad.py`: an example fine-tuning Bert, XLNet and XLM on the question answering dataset SQuAD 2.0 (*token-level classification*)
-- `run_generation.py`: an example using GPT, GPT-2, Transformer-XL and XLNet for conditional language generation
+- `run_generation.py`: an example using GPT, GPT-2, CTRL, Transformer-XL and XLNet for conditional language generation
 - other model-specific examples (see the documentation).

 Here are three quick usage examples for these scripts:
@@ -384,10 +392,10 @@ python $SQUAD_DIR/evaluate-v1.1.py $SQUAD_DIR/dev-v1.1.json ../models/wwm_uncase

 This is the model provided as `bert-large-uncased-whole-word-masking-finetuned-squad`.

-### `run_generation.py`: Text generation with GPT, GPT-2, Transformer-XL and XLNet
+### `run_generation.py`: Text generation with GPT, GPT-2, CTRL, Transformer-XL and XLNet

 A conditional generation script is also included to generate text from a prompt.
-The generation script includes the [tricks](https://github.com/rusiaaman/XLNet-gen#methodology) proposed by Aman Rusia to get high quality generation with memory models like Transformer-XL and XLNet (include a predefined text to make short inputs longer).
+The generation script includes the [tricks](https://github.com/rusiaaman/XLNet-gen#methodology) proposed by Aman Rusia to get high-quality generation with memory models like Transformer-XL and XLNet (include a predefined text to make short inputs longer).

 Here is how to run the script with the small version of OpenAI GPT-2 model:

@@ -398,6 +406,16 @@ python ./examples/run_generation.py \
    --model_name_or_path=gpt2 \
 ```

+and from the Salesforce CTRL model: 
+```shell
+python ./examples/run_generation.py \
+    --model_type=ctrl \
+    --length=20 \
+    --model_name_or_path=ctrl \
+    --temperature=0 \
+    --repetition_penalty=1.2
+```
+
 ## Migrating from pytorch-transformers to transformers

 Here is a quick summary of what you should take care of when migrating from `pytorch-transformers` to `transformers`.
@@ -417,9 +435,9 @@ Here is a quick summary of what you should take care of when migrating from `pyt

 ### Models always output `tuples`

-The main breaking change when migrating from `pytorch-pretrained-bert` to `transformers` is that the models forward method always outputs a `tuple` with various elements depending on the model and the configuration parameters.
+The main breaking change when migrating from `pytorch-pretrained-bert` to `transformers` is that every model's forward method always outputs a `tuple` with various elements depending on the model and the configuration parameters.

-The exact content of the tuples for each model are detailed in the models' docstrings and the [documentation](https://huggingface.co/transformers/).
+The exact content of the tuples for each model is detailed in the models' docstrings and the [documentation](https://huggingface.co/transformers/).

 In pretty much every case, you will be fine by taking the first element of the output as the output you previously used in `pytorch-pretrained-bert`.

@@ -445,13 +463,17 @@ outputs = model(input_ids, labels=labels)
 loss, logits, attentions = outputs
 ```

+### Using hidden states
+
+By enabling the configuration option `output_hidden_states`, it was possible to retrieve the last hidden states of the encoder. In `pytorch-transformers` as well as `transformers`, the return value has changed slightly: `all_hidden_states` now also includes the hidden state of the embeddings in addition to those of the encoding layers. This allows users to easily access the embeddings' final state.
+
 ### Serialization

-Breaking change in the `from_pretrained()`method:
+Breaking change in the `from_pretrained()` method:

-1. Models are now set in evaluation mode by default when instantiated with the `from_pretrained()` method. To train them don't forget to set them back in training mode (`model.train()`) to activate the dropout modules.
+1. Models are now set in evaluation mode by default when instantiated with the `from_pretrained()` method. To train them, don't forget to set them back in training mode (`model.train()`) to activate the dropout modules.

-2. The additional `*input` and `**kwargs` arguments supplied to the `from_pretrained()` method used to be directly passed to the underlying model's class `__init__()` method. They are now used to update the model configuration attribute instead which can break derived model classes build based on the previous `BertForSequenceClassification` examples. We are working on a way to mitigate this breaking change in [#866](https://github.com/huggingface/transformers/pull/866) by forwarding the the model `__init__()` method (i) the provided positional arguments and (ii) the keyword arguments which do not match any configuration class attributes.
+2. The additional `*input` and `**kwargs` arguments supplied to the `from_pretrained()` method used to be directly passed to the underlying model's class `__init__()` method. They are now used to update the model configuration attribute instead, which can break derived model classes built based on the previous `BertForSequenceClassification` examples. We are working on a way to mitigate this breaking change in [#866](https://github.com/huggingface/transformers/pull/866) by forwarding to the model's `__init__()` method (i) the provided positional arguments and (ii) the keyword arguments which do not match any configuration class attributes.

 Also, while not a breaking change, the serialization methods have been standardized and you probably should switch to the new method `save_pretrained(save_directory)` if you were using any other serialization method before.

@@ -523,4 +545,14 @@ for batch in train_data:

 ## Citation

-At the moment, there is no paper associated to Transformers but we are working on preparing one. In the meantime, please include a mention of the library and a link to the present repository if you use this work in a published or open-source project.
+We now have a paper you can cite for the 🤗 Transformers library:
+```
+@misc{wolf2019transformers,
+    title={Transformers: State-of-the-art Natural Language Processing},
+    author={Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and Rémi Louf and Morgan Funtowicz and Jamie Brew},
+    year={2019},
+    eprint={1910.03771},
+    archivePrefix={arXiv},
+    primaryClass={cs.CL}
+}
+```
--- a/docs/README.md
+++ b/docs/README.md
@@ -34,11 +34,11 @@ pip install recommonmark

 ## Building the documentation

-Make sure that there is a symlink from the `example` file (in /examples) inside the source folder. Run the followig 
+Make sure that there is a symlink from the `example` file (in /examples) inside the source folder. Run the following 
 command to generate it:

 ```bash
-ln -s ../../examples/README.md source/examples.md
+ln -s ../../examples/README.md examples.md
 ```

 Once you have setup `sphinx`, you can build the documentation by running the following command in the `/docs` folder:
@@ -50,7 +50,7 @@ make html
 ---
 **NOTE**

-If you are adding/removing elements from the toc-tree or from any strutural item, it is recommended to clean the build
+If you are adding/removing elements from the toc-tree or from any structural item, it is recommended to clean the build
 directory before rebuilding. Run the following command to clean and build:

 ```bash
--- a/docs/requirements.txt
+++ b/docs/requirements.txt
@@ -26,4 +26,7 @@ sphinxcontrib-jsmath==1.0.1
 sphinxcontrib-qthelp==1.0.2
 sphinxcontrib-serializinghtml==1.1.3
 urllib3==1.25.3
-sphinx-markdown-tables==0.0.9
+sphinx-markdown-tables==0.0.9
+numpy==1.17.2
+tensorflow==2.0.0rc2
+torch==1.2.0
--- a/docs/source/_static/css/huggingface.css
+++ b/docs/source/_static/css/huggingface.css
@@ -1,5 +1,3 @@
-huggingface.css
-
 /* The literal code blocks */
 .rst-content tt.literal, .rst-content tt.literal, .rst-content code.literal {
    color: #6670FF;
@@ -44,11 +42,11 @@ huggingface.css
 /* The text items on the toc tree */
 .wy-menu-vertical a {
    color: #FFFFDD;
-    font-family: Calibre-Light;
+    font-family: Calibre-Light, sans-serif;
 }
 .wy-menu-vertical header, .wy-menu-vertical p.caption{
    color: white;
-    font-family: Calibre-Light;
+    font-family: Calibre-Light, sans-serif;
 }

 /* The color inside the selected toc tree block */
@@ -85,7 +83,7 @@ a {
    border-right: solid 2px #FB8D68;
    border-left: solid 2px #FB8D68;
    color: #FB8D68;
-    font-family: Calibre-Light;
+    font-family: Calibre-Light, sans-serif;
    border-top: none;
    font-style: normal !important;
 }
@@ -136,14 +134,14 @@ a {

 /* class and method names in doc */
 .rst-content dl:not(.docutils) tt.descname, .rst-content dl:not(.docutils) tt.descclassname, .rst-content dl:not(.docutils) tt.descname, .rst-content dl:not(.docutils) code.descname, .rst-content dl:not(.docutils) tt.descclassname, .rst-content dl:not(.docutils) code.descclassname{
-    font-family: Calibre;
+    font-family: Calibre, sans-serif;
    font-size: 20px !important;
 }

 /* class name in doc*/
 .rst-content dl:not(.docutils) tt.descname, .rst-content dl:not(.docutils) tt.descname, .rst-content dl:not(.docutils) code.descname{
    margin-right: 10px;
-    font-family: Calibre-Medium;
+    font-family: Calibre-Medium, sans-serif;
 }

 /* Method and class parameters */
@@ -160,17 +158,17 @@ a {

 /* FONTS */
 body{
-    font-family: Calibre;
+    font-family: Calibre, sans-serif;
    font-size: 16px;
 }

 h1 {
-    font-family: Calibre-Thin;
+    font-family: Calibre-Thin, sans-serif;
    font-size: 70px;
 }

 h2, .rst-content .toctree-wrapper p.caption, h3, h4, h5, h6, legend{
-    font-family: Calibre-Medium;
+    font-family: Calibre-Medium, sans-serif;
 }

@font-face {
@@ -196,4 +194,3 @@ h2, .rst-content .toctree-wrapper p.caption, h3, h4, h5, h6, legend{
    src: url(./Calibre-Thin.otf);
    font-weight:400;
 }
-
--- a/docs/source/_static/js/custom.js
+++ b/docs/source/_static/js/custom.js
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -26,7 +26,7 @@ author = u'huggingface'
 # The short X.Y version
 version = u''
 # The full version, including alpha/beta/rc tags
-release = u'1.2.0'
+release = u'2.1.1'


 # -- General configuration ---------------------------------------------------
--- a/docs/source/examples.md
+++ b/docs/source/examples.md
@@ -0,0 +1 @@
+../../examples/README.md
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -5,6 +5,8 @@ Transformers
 (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet...) for Natural Language Understanding (NLU) and Natural Language Generation
 (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch.

+This is the documentation of our repository `transformers <https://github.com/huggingface/transformers>`__.
+
 Features
 ---------------------------------------------------

@@ -13,17 +15,20 @@ Features
 - High performance on NLU and NLG tasks
 - Low barrier to entry for educators and practitioners

-State-of-the-art NLP for everyone
+State-of-the-art NLP for everyone:
+
 - Deep learning researchers
 - Hands-on practitioners
 - AI/ML/NLP teachers and educators

-Lower compute costs, smaller carbon footprint
+Lower compute costs, smaller carbon footprint:
+
 - Researchers can share trained models instead of always retraining
 - Practitioners can reduce compute time and production costs
 - 8 architectures with over 30 pretrained models, some in more than 100 languages

-Choose the right framework for every part of a model's lifetime
+Choose the right framework for every part of a model's lifetime:
+
 - Train state-of-the-art models in 3 lines of code
 - Deep interoperability between TensorFlow 2.0 and PyTorch models
 - Move a single model between TF2.0/PyTorch frameworks at will
@@ -41,8 +46,7 @@ The library currently contains PyTorch and Tensorflow implementations, pre-train
 5. `XLNet <https://github.com/zihangdai/xlnet>`_ (from Google/CMU) released with the paper `XLNet: Generalized Autoregressive Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`_ by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
 6. `XLM <https://github.com/facebookresearch/XLM>`_ (from Facebook) released together with the paper `Cross-lingual Language Model Pretraining <https://arxiv.org/abs/1901.07291>`_ by Guillaume Lample and Alexis Conneau.
 7. `RoBERTa <https://github.com/pytorch/fairseq/tree/master/examples/roberta>`_ (from Facebook), released together with the paper a `Robustly Optimized BERT Pretraining Approach <https://arxiv.org/abs/1907.11692>`_ by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
-8. `DistilBERT <https://huggingface.co/transformers/model_doc/distilbert.html>`_ (from HuggingFace) released together with the blog post `Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT <https://medium.com/huggingface/distilbert-8cf3380435b5>`_ by Victor Sanh, Lysandre Debut and Thomas Wolf.
-
+8. `DistilBERT <https://huggingface.co/transformers/model_doc/distilbert.html>`_ (from HuggingFace) released together with the paper `DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter <https://arxiv.org/abs/1910.01108>`_ by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into `DistilGPT2 <https://github.com/huggingface/transformers/tree/master/examples/distillation>`_.

 .. toctree::
    :maxdepth: 2
@@ -58,6 +62,7 @@ The library currently contains PyTorch and Tensorflow implementations, pre-train
    migration
    bertology
    torchscript
+    multilingual

 .. toctree::
    :maxdepth: 2
@@ -82,3 +87,4 @@ The library currently contains PyTorch and Tensorflow implementations, pre-train
    model_doc/xlnet
    model_doc/roberta
    model_doc/distilbert
+    model_doc/ctrl
--- a/docs/source/installation.md
+++ b/docs/source/installation.md
@@ -0,0 +1,58 @@
+# Installation
+
+Transformers is tested on Python 2.7 and 3.5+ (examples are tested only on Python 3.5+) and PyTorch 1.1.0.
+
+## With pip
+
+Transformers can be installed using pip as follows:
+
+``` bash
+pip install transformers
+```
+
+## From source
+
+To install from source, clone the repository and install with:
+
+``` bash
+git clone https://github.com/huggingface/transformers.git
+cd transformers
+pip install [--editable] .
+```
+
+## Tests
+
+An extensive test suite is included to test the library behavior and several examples. Library tests can be found in the [tests folder](https://github.com/huggingface/transformers/tree/master/transformers/tests) and examples tests in the [examples folder](https://github.com/huggingface/transformers/tree/master/examples).
+
+Tests can be run using `pytest` (install pytest if needed with `pip install pytest`).
+
+Run all the tests from the root of the cloned repository with the commands:
+
+``` bash
+python -m pytest -sv ./transformers/tests/
+python -m pytest -sv ./examples/
+```
+
+## OpenAI GPT original tokenization workflow
+
+If you want to reproduce the original tokenization process of the `OpenAI GPT` paper, you will need to install `ftfy` (use version 4.4.3 if you are using Python 2) and `SpaCy`:
+
+``` bash
+pip install spacy ftfy==4.4.3
+python -m spacy download en
+```
+
+If you don't install `ftfy` and `SpaCy`, the `OpenAI GPT` tokenizer will default to tokenize using BERT's `BasicTokenizer` followed by Byte-Pair Encoding (which should be fine for most usage, don't worry).
+
+## Note on model downloads (Continuous Integration or large-scale deployments)
+
+If you expect to be downloading large volumes of models (more than 1,000) from our hosted bucket (for instance through your CI setup, or a large-scale production deployment), please cache the model files on your end. It will be way faster, and cheaper. Feel free to contact us privately if you need any help.
+
+## Do you want to run a Transformer model on a mobile device?
+
+You should check out our [swift-coreml-transformers](https://github.com/huggingface/swift-coreml-transformers) repo.
+
+It contains a set of tools to convert PyTorch or TensorFlow 2.0 trained Transformer models (currently contains `GPT-2`, `DistilGPT-2`, `BERT`, and `DistilBERT`) to CoreML models that run on iOS devices.
+
+At some point in the future, you'll be able to seamlessly move from pre-training or fine-tuning models in PyTorch to productizing them in CoreML,
+or prototype a model or an app in CoreML then research its hyperparameters or architecture from PyTorch. Super exciting!
--- a/docs/source/installation.rst
+++ b/docs/source/installation.rst
@@ -1,71 +0,0 @@
-Installation
-================================================
-
-Transformers is tested on Python 2.7 and 3.5+ (examples are tested only on python 3.5+) and PyTorch 1.1.0
-
-With pip
-^^^^^^^^
-
-PyTorch Transformers can be installed using pip as follows:
-
-.. code-block:: bash
-
-   pip install transformers
-
-From source
-^^^^^^^^^^^
-
-To install from source, clone the repository and install with:
-
-.. code-block:: bash
-
-    git clone https://github.com/huggingface/transformers.git
-    cd transformers
-    pip install [--editable] .
-
-
-Tests
-^^^^^
-
-An extensive test suite is included to test the library behavior and several examples. Library tests can be found in the `tests folder <https://github.com/huggingface/transformers/tree/master/transformers/tests>`_ and examples tests in the `examples folder <https://github.com/huggingface/transformers/tree/master/examples>`_.
-
-Tests can be run using `pytest` (install pytest if needed with `pip install pytest`).
-
-Run all the tests from the root of the cloned repository with the commands:
-
-.. code-block:: bash
-
-    python -m pytest -sv ./transformers/tests/
-    python -m pytest -sv ./examples/
-
-
-OpenAI GPT original tokenization workflow
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-If you want to reproduce the original tokenization process of the ``OpenAI GPT`` paper, you will need to install ``ftfy`` (use version 4.4.3 if you are using Python 2) and ``SpaCy`` :
-
-.. code-block:: bash
-
-   pip install spacy ftfy==4.4.3
-   python -m spacy download en
-
-If you don't install ``ftfy`` and ``SpaCy``\ , the ``OpenAI GPT`` tokenizer will default to tokenize using BERT's ``BasicTokenizer`` followed by Byte-Pair Encoding (which should be fine for most usage, don't worry).
-
-
-Note on model downloads (Continuous Integration or large-scale deployments)
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-If you expect to be downloading large volumes of models (more than 1,000) from our hosted bucket (for instance through your CI setup, or a large-scale production deployment), please cache the model files on your end. It will be way faster, and cheaper. Feel free to contact us privately if you need any help.
-
-
-Do you want to run a Transformer model on a mobile device?
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-You should check out our `swift-coreml-transformers <https://github.com/huggingface/swift-coreml-transformers>`_ repo.
-
-It contains an example of a conversion script from a Pytorch trained Transformer model (here, ``GPT-2``) to a CoreML model that runs on iOS devices.
-
-It also contains an implementation of BERT for Question answering.
-
-At some point in the future, you'll be able to seamlessly move from pre-training or fine-tuning models in PyTorch to productizing them in CoreML,
-or prototype a model or an app in CoreML then research its hyperparameters or architecture from PyTorch. Super exciting!
--- a/docs/source/main_classes/model.rst
+++ b/docs/source/main_classes/model.rst
@@ -17,5 +17,5 @@ The base class ``PreTrainedModel`` implements the common methods for loading/sav
 ``TFPreTrainedModel``
 ~~~~~~~~~~~~~~~~~~~~~

-.. autoclass:: pytorch_transformers.TFPreTrainedModel
+.. autoclass:: transformers.TFPreTrainedModel
    :members:
--- a/docs/source/main_classes/processors.rst
+++ b/docs/source/main_classes/processors.rst
@@ -8,20 +8,20 @@ Processors
 ~~~~~~~~~~~~~~~~~~~~~

 All processors follow the same architecture which is that of the
-:class:`~pytorch_transformers.data.processors.utils.DataProcessor`. The processor returns a list
-of :class:`~pytorch_transformers.data.processors.utils.InputExample`. These
-:class:`~pytorch_transformers.data.processors.utils.InputExample` can be converted to
-:class:`~pytorch_transformers.data.processors.utils.InputFeatures` in order to be fed to the model.
+:class:`~transformers.data.processors.utils.DataProcessor`. The processor returns a list
+of :class:`~transformers.data.processors.utils.InputExample`. These
+:class:`~transformers.data.processors.utils.InputExample` can be converted to
+:class:`~transformers.data.processors.utils.InputFeatures` in order to be fed to the model.

-.. autoclass:: pytorch_transformers.data.processors.utils.DataProcessor
+.. autoclass:: transformers.data.processors.utils.DataProcessor
    :members:


-.. autoclass:: pytorch_transformers.data.processors.utils.InputExample
+.. autoclass:: transformers.data.processors.utils.InputExample
    :members:


-.. autoclass:: pytorch_transformers.data.processors.utils.InputFeatures
+.. autoclass:: transformers.data.processors.utils.InputFeatures
    :members:


@@ -36,20 +36,20 @@ This library hosts a total of 10 processors for the following tasks: MRPC, MNLI,
 CoLA, SST2, STSB, QQP, QNLI, RTE and WNLI.

 Those processors are:
-    - :class:`~pytorch_transformers.data.processors.utils.MrpcProcessor`
-    - :class:`~pytorch_transformers.data.processors.utils.MnliProcessor`
-    - :class:`~pytorch_transformers.data.processors.utils.MnliMismatchedProcessor`
-    - :class:`~pytorch_transformers.data.processors.utils.Sst2Processor`
-    - :class:`~pytorch_transformers.data.processors.utils.StsbProcessor`
-    - :class:`~pytorch_transformers.data.processors.utils.QqpProcessor`
-    - :class:`~pytorch_transformers.data.processors.utils.QnliProcessor`
-    - :class:`~pytorch_transformers.data.processors.utils.RteProcessor`
-    - :class:`~pytorch_transformers.data.processors.utils.WnliProcessor`
+    - :class:`~transformers.data.processors.utils.MrpcProcessor`
+    - :class:`~transformers.data.processors.utils.MnliProcessor`
+    - :class:`~transformers.data.processors.utils.MnliMismatchedProcessor`
+    - :class:`~transformers.data.processors.utils.Sst2Processor`
+    - :class:`~transformers.data.processors.utils.StsbProcessor`
+    - :class:`~transformers.data.processors.utils.QqpProcessor`
+    - :class:`~transformers.data.processors.utils.QnliProcessor`
+    - :class:`~transformers.data.processors.utils.RteProcessor`
+    - :class:`~transformers.data.processors.utils.WnliProcessor`

 Additionally, the following method  can be used to load values from a data file and convert them to a list of
-:class:`~pytorch_transformers.data.processors.utils.InputExample`.
+:class:`~transformers.data.processors.utils.InputExample`.

-.. automethod:: pytorch_transformers.data.processors.glue.glue_convert_examples_to_features
+.. automethod:: transformers.data.processors.glue.glue_convert_examples_to_features

 Example usage
 ^^^^^^^^^^^^^^^^^^^^^^^^^
--- a/docs/source/model_doc/bert.rst
+++ b/docs/source/model_doc/bert.rst
@@ -74,55 +74,55 @@ BERT
 ``TFBertModel``
 ~~~~~~~~~~~~~~~~~~~~

-.. autoclass:: pytorch_transformers.TFBertModel
+.. autoclass:: transformers.TFBertModel
    :members:


 ``TFBertForPreTraining``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-.. autoclass:: pytorch_transformers.TFBertForPreTraining
+.. autoclass:: transformers.TFBertForPreTraining
    :members:


 ``TFBertForMaskedLM``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~

-.. autoclass:: pytorch_transformers.TFBertForMaskedLM
+.. autoclass:: transformers.TFBertForMaskedLM
    :members:


 ``TFBertForNextSentencePrediction``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-.. autoclass:: pytorch_transformers.TFBertForNextSentencePrediction
+.. autoclass:: transformers.TFBertForNextSentencePrediction
    :members:


 ``TFBertForSequenceClassification``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-.. autoclass:: pytorch_transformers.TFBertForSequenceClassification
+.. autoclass:: transformers.TFBertForSequenceClassification
    :members:


 ``TFBertForMultipleChoice``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-.. autoclass:: pytorch_transformers.TFBertForMultipleChoice
+.. autoclass:: transformers.TFBertForMultipleChoice
    :members:


 ``TFBertForTokenClassification``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-.. autoclass:: pytorch_transformers.TFBertForTokenClassification
+.. autoclass:: transformers.TFBertForTokenClassification
    :members:


 ``TFBertForQuestionAnswering``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-.. autoclass:: pytorch_transformers.TFBertForQuestionAnswering
+.. autoclass:: transformers.TFBertForQuestionAnswering
    :members:

--- a/docs/source/model_doc/ctrl.rst
+++ b/docs/source/model_doc/ctrl.rst
@@ -0,0 +1,44 @@
+CTRL
+----------------------------------------------------
+
+``CTRLConfig``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.CTRLConfig
+    :members:
+
+
+``CTRLTokenizer``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.CTRLTokenizer
+    :members:
+
+
+``CTRLModel``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.CTRLModel
+    :members:
+
+
+``CTRLLMHeadModel``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.CTRLLMHeadModel
+    :members:
+
+
+``TFCTRLModel``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.TFCTRLModel
+    :members:
+
+
+``TFCTRLLMHeadModel``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.TFCTRLLMHeadModel
+    :members:
+
--- a/docs/source/model_doc/distilbert.rst
+++ b/docs/source/model_doc/distilbert.rst
@@ -45,26 +45,26 @@ DistilBERT
 ``TFDistilBertModel``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-.. autoclass:: pytorch_transformers.TFDistilBertModel
+.. autoclass:: transformers.TFDistilBertModel
    :members:


 ``TFDistilBertForMaskedLM``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-.. autoclass:: pytorch_transformers.TFDistilBertForMaskedLM
+.. autoclass:: transformers.TFDistilBertForMaskedLM
    :members:


 ``TFDistilBertForSequenceClassification``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-.. autoclass:: pytorch_transformers.TFDistilBertForSequenceClassification
+.. autoclass:: transformers.TFDistilBertForSequenceClassification
    :members:


 ``TFDistilBertForQuestionAnswering``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-.. autoclass:: pytorch_transformers.TFDistilBertForQuestionAnswering
+.. autoclass:: transformers.TFDistilBertForQuestionAnswering
    :members:
--- a/docs/source/model_doc/gpt.rst
+++ b/docs/source/model_doc/gpt.rst
@@ -39,19 +39,19 @@ OpenAI GPT
 ``TFOpenAIGPTModel``
 ~~~~~~~~~~~~~~~~~~~~~~~~~

-.. autoclass:: pytorch_transformers.TFOpenAIGPTModel
+.. autoclass:: transformers.TFOpenAIGPTModel
    :members:


 ``TFOpenAIGPTLMHeadModel``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-.. autoclass:: pytorch_transformers.TFOpenAIGPTLMHeadModel
+.. autoclass:: transformers.TFOpenAIGPTLMHeadModel
    :members:


 ``TFOpenAIGPTDoubleHeadsModel``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-.. autoclass:: pytorch_transformers.TFOpenAIGPTDoubleHeadsModel
+.. autoclass:: transformers.TFOpenAIGPTDoubleHeadsModel
    :members:
--- a/docs/source/model_doc/gpt2.rst
+++ b/docs/source/model_doc/gpt2.rst
@@ -39,19 +39,19 @@ OpenAI GPT2
 ``TFGPT2Model``
 ~~~~~~~~~~~~~~~~~~~~~

-.. autoclass:: pytorch_transformers.TFGPT2Model
+.. autoclass:: transformers.TFGPT2Model
    :members:


 ``TFGPT2LMHeadModel``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~

-.. autoclass:: pytorch_transformers.TFGPT2LMHeadModel
+.. autoclass:: transformers.TFGPT2LMHeadModel
    :members:


 ``TFGPT2DoubleHeadsModel``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-.. autoclass:: pytorch_transformers.TFGPT2DoubleHeadsModel
+.. autoclass:: transformers.TFGPT2DoubleHeadsModel
    :members:
--- a/docs/source/model_doc/roberta.rst
+++ b/docs/source/model_doc/roberta.rst
@@ -39,19 +39,19 @@ RoBERTa
 ``TFRobertaModel``
 ~~~~~~~~~~~~~~~~~~~~

-.. autoclass:: pytorch_transformers.TFRobertaModel
+.. autoclass:: transformers.TFRobertaModel
    :members:


 ``TFRobertaForMaskedLM``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~

-.. autoclass:: pytorch_transformers.TFRobertaForMaskedLM
+.. autoclass:: transformers.TFRobertaForMaskedLM
    :members:


 ``TFRobertaForSequenceClassification``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-.. autoclass:: pytorch_transformers.TFRobertaForSequenceClassification
+.. autoclass:: transformers.TFRobertaForSequenceClassification
    :members:
--- a/docs/source/model_doc/transformerxl.rst
+++ b/docs/source/model_doc/transformerxl.rst
@@ -33,12 +33,12 @@ Transformer XL
 ``TFTransfoXLModel``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~

-.. autoclass:: pytorch_transformers.TFTransfoXLModel
+.. autoclass:: transformers.TFTransfoXLModel
    :members:


 ``TFTransfoXLLMHeadModel``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-.. autoclass:: pytorch_transformers.TFTransfoXLLMHeadModel
+.. autoclass:: transformers.TFTransfoXLLMHeadModel
    :members:
--- a/docs/source/model_doc/xlm.rst
+++ b/docs/source/model_doc/xlm.rst
@@ -44,26 +44,26 @@ XLM
 ``TFXLMModel``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-.. autoclass:: pytorch_transformers.TFXLMModel
+.. autoclass:: transformers.TFXLMModel
    :members:


 ``TFXLMWithLMHeadModel``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-.. autoclass:: pytorch_transformers.TFXLMWithLMHeadModel
+.. autoclass:: transformers.TFXLMWithLMHeadModel
    :members:


 ``TFXLMForSequenceClassification``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-.. autoclass:: pytorch_transformers.TFXLMForSequenceClassification
+.. autoclass:: transformers.TFXLMForSequenceClassification
    :members:


 ``TFXLMForQuestionAnsweringSimple``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-.. autoclass:: pytorch_transformers.TFXLMForQuestionAnsweringSimple
+.. autoclass:: transformers.TFXLMForQuestionAnsweringSimple
    :members:
--- a/docs/source/model_doc/xlnet.rst
+++ b/docs/source/model_doc/xlnet.rst
@@ -46,26 +46,26 @@ XLNet
 ``TFXLNetModel``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-.. autoclass:: pytorch_transformers.TFXLNetModel
+.. autoclass:: transformers.TFXLNetModel
    :members:


 ``TFXLNetLMHeadModel``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-.. autoclass:: pytorch_transformers.TFXLNetLMHeadModel
+.. autoclass:: transformers.TFXLNetLMHeadModel
    :members:


 ``TFXLNetForSequenceClassification``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-.. autoclass:: pytorch_transformers.TFXLNetForSequenceClassification
+.. autoclass:: transformers.TFXLNetForSequenceClassification
    :members:


 ``TFXLNetForQuestionAnsweringSimple``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-.. autoclass:: pytorch_transformers.TFXLNetForQuestionAnsweringSimple
+.. autoclass:: transformers.TFXLNetForQuestionAnsweringSimple
    :members:
--- a/docs/source/multilingual.rst
+++ b/docs/source/multilingual.rst
@@ -0,0 +1,103 @@
+Multi-lingual models
+================================================
+
+Most of the models available in this library are mono-lingual models (English, Chinese and German). A few
+multi-lingual models are available and have different mechanisms than mono-lingual models.
+This page details the usage of these models.
+
+The two models that currently support multiple languages are BERT and XLM.
+
+XLM
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+XLM has a total of 10 different checkpoints, only one of which is mono-lingual. The 9 remaining model checkpoints can
+be split into two categories: the checkpoints that make use of language embeddings, and those that don't.
+
+XLM & Language Embeddings
+------------------------------------------------
+
+This section concerns the following checkpoints:
+
+- ``xlm-mlm-ende-1024`` (Masked language modeling, English-German)
+- ``xlm-mlm-enfr-1024`` (Masked language modeling, English-French)
+- ``xlm-mlm-enro-1024`` (Masked language modeling, English-Romanian)
+- ``xlm-mlm-xnli15-1024`` (Masked language modeling, XNLI languages)
+- ``xlm-mlm-tlm-xnli15-1024`` (Masked language modeling + Translation, XNLI languages)
+- ``xlm-clm-enfr-1024`` (Causal language modeling, English-French)
+- ``xlm-clm-ende-1024`` (Causal language modeling, English-German)
+
+These checkpoints require language embeddings that will specify the language used at inference time. These language
+embeddings are represented as a tensor that is of the same shape as the input ids passed to the model. The values in
+these tensors depend on the language used and are identifiable using the ``lang2id`` and ``id2lang`` attributes
+from the tokenizer.
+
+Here is an example using the ``xlm-clm-enfr-1024`` checkpoint (Causal language modeling, English-French):
+
+
+.. code-block::
+
+    import torch
+    from transformers import XLMTokenizer, XLMWithLMHeadModel
+
+    tokenizer = XLMTokenizer.from_pretrained("xlm-clm-enfr-1024")
+    model = XLMWithLMHeadModel.from_pretrained("xlm-clm-enfr-1024")
+
+The different languages this model/tokenizer handles, as well as the ids of these languages are visible using the
+``lang2id`` attribute:
+
+.. code-block::
+
+    print(tokenizer.lang2id)  # {'en': 0, 'fr': 1}
+
+
+These ids should be used when passing a language parameter during a model pass. Let's define our inputs:
+
+.. code-block::
+
+    input_ids = torch.tensor([tokenizer.encode("Wikipedia was used to")]) # batch size of 1
+
+
+We should now define the language embedding by using the previously defined language id. We want to create a tensor
+filled with the appropriate language ids, of the same size as ``input_ids``. For English, the id is 0:
+
+.. code-block::
+
+    language_id = tokenizer.lang2id['en']  # 0
+    langs = torch.tensor([language_id] * input_ids.shape[1])  # torch.tensor([0, 0, 0, ..., 0])
+
+    # We reshape it to be of size (batch_size, sequence_length)
+    langs = langs.view(1, -1) # is now of shape [1, sequence_length] (we have a batch size of 1)
+
+
+You can then feed it all as input to your model:
+
+.. code-block::
+
+    outputs = model(input_ids, langs=langs)
+
+
+The example `run_generation.py <https://github.com/huggingface/transformers/blob/master/examples/run_generation.py>`__
+can generate text using the CLM checkpoints from XLM, using the language embeddings.
+
+XLM without Language Embeddings
+------------------------------------------------
+
+This section concerns the following checkpoints:
+
+- ``xlm-mlm-17-1280`` (Masked language modeling, 17 languages)
+- ``xlm-mlm-100-1280`` (Masked language modeling, 100 languages)
+
+These checkpoints do not require language embeddings at inference time. These models are used to obtain generic
+sentence representations, unlike the previously mentioned XLM checkpoints.
+
+
+BERT
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+BERT has two checkpoints that can be used for multi-lingual tasks:
+
+- ``bert-base-multilingual-uncased`` (Masked language modeling + Next sentence prediction, 102 languages)
+- ``bert-base-multilingual-cased`` (Masked language modeling + Next sentence prediction, 104 languages)
+
+These checkpoints do not require language embeddings at inference time. They should identify the language
+used in the context and infer accordingly.
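The language-embedding step from the XLM section above can be sketched without downloading a checkpoint. This is a minimal illustration under stated assumptions: `build_langs` is a hypothetical helper, the `lang2id` mapping is copied from the printed example, and the token ids are made up. With PyTorch, the nested list would be wrapped in `torch.tensor(...)` before being passed to the model as `langs`.

```python
def build_langs(batch_input_ids, lang2id, lang):
    """Return language ids shaped like the inputs: (batch_size, sequence_length)."""
    language_id = lang2id[lang]
    # One language id per token, for every sequence in the batch.
    return [[language_id] * len(sequence) for sequence in batch_input_ids]


# Mapping as shown by tokenizer.lang2id in the example above.
lang2id = {"en": 0, "fr": 1}
# Made-up token ids standing in for tokenizer.encode(...), batch size of 1.
batch_input_ids = [[0, 11, 22, 33, 1]]

langs_en = build_langs(batch_input_ids, lang2id, "en")
langs_fr = build_langs(batch_input_ids, lang2id, "fr")
print(langs_en)
print(langs_fr)
```

Building the ids per sequence keeps the result aligned with the inputs even when a batch holds sequences of different lengths, although such a batch would still need padding before being turned into a tensor.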
--- a/docs/source/pretrained_models.rst
+++ b/docs/source/pretrained_models.rst
@@ -53,6 +53,14 @@ Here is the full list of the currently provided pretrained models together with
 |                   | ``bert-base-cased-finetuned-mrpc``                         | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
 |                   |                                                            | | The ``bert-base-cased`` model fine-tuned on MRPC                                                                                    |
 |                   |                                                            | (see `details of fine-tuning in the example section <https://huggingface.co/transformers/examples.html>`__)                           |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``bert-base-german-dbmdz-cased``                           | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
+|                   |                                                            | | Trained on cased German text by DBMDZ                                                                                               |
+|                   |                                                            | (see `details on dbmdz repository <https://github.com/dbmdz/german-bert>`__).                                                         |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``bert-base-german-dbmdz-uncased``                         | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
+|                   |                                                            | | Trained on uncased German text by DBMDZ                                                                                             |
+|                   |                                                            | (see `details on dbmdz repository <https://github.com/dbmdz/german-bert>`__).                                                         |
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
 | GPT               | ``openai-gpt``                                             | | 12-layer, 768-hidden, 12-heads, 110M parameters.                                                                                    |
 |                   |                                                            | | OpenAI GPT English model                                                                                                            |
@@ -98,6 +106,12 @@ Here is the full list of the currently provided pretrained models together with
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
 |                   | ``xlm-clm-ende-1024``                                      | | 6-layer, 1024-hidden, 8-heads                                                                                                       |
 |                   |                                                            | | XLM English-German model trained with CLM (Causal Language Modeling) on the concatenation of English and German wikipedia           |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``xlm-mlm-17-1280``                                        | | 16-layer, 1280-hidden, 16-heads                                                                                                     |
+|                   |                                                            | | XLM model trained with MLM (Masked Language Modeling) on 17 languages.                                                              |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``xlm-mlm-100-1280``                                       | | 16-layer, 1280-hidden, 16-heads                                                                                                     |
+|                   |                                                            | | XLM model trained with MLM (Masked Language Modeling) on 100 languages.                                                             |
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
 | RoBERTa           | ``roberta-base``                                           | | 12-layer, 768-hidden, 12-heads, 125M parameters                                                                                     |
 |                   |                                                            | | RoBERTa using the BERT-base architecture                                                                                            |
@@ -113,11 +127,18 @@ Here is the full list of the currently provided pretrained models together with
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
 | DistilBERT        | ``distilbert-base-uncased``                                | | 6-layer, 768-hidden, 12-heads, 66M parameters                                                                                       |
 |                   |                                                            | | The DistilBERT model distilled from the BERT model `bert-base-uncased` checkpoint                                                   |
-|                   |                                                            | (see `details <https://medium.com/huggingface/distilbert-8cf3380435b5>`__)                                                            |
+|                   |                                                            | (see `details <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__)                                     |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
 |                   | ``distilbert-base-uncased-distilled-squad``                | | 6-layer, 768-hidden, 12-heads, 66M parameters                                                                                       |
 |                   |                                                            | | The DistilBERT model distilled from the BERT model `bert-base-uncased` checkpoint, with an additional linear layer.                 |
-|                   |                                                            | (see `details <https://medium.com/huggingface/distilbert-8cf3380435b5>`__)                                                            |
+|                   |                                                            | (see `details <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__)                                     |
+|                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+|                   | ``distilgpt2``                                             | | 6-layer, 768-hidden, 12-heads, 82M parameters                                                                                       |
+|                   |                                                            | | The DistilGPT2 model distilled from the GPT2 model `gpt2` checkpoint.                                                               |
+|                   |                                                            | (see `details <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__)                                     |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+| CTRL              | ``ctrl``                                                   | | 48-layer, 1280-hidden, 16-heads, 1.6B parameters                                                                                    |
+|                   |                                                            | | Salesforce's Large-sized CTRL English model                                                                                         |
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+

 .. <https://huggingface.co/transformers/examples.html>`__
--- a/docs/source/quickstart.md
+++ b/docs/source/quickstart.md
@@ -19,12 +19,12 @@ The library was designed with two strong goals in mind:

 A few other goals:

- expose the models internals as consistently as possible:
+- expose the models' internals as consistently as possible:

  - we give access, using a single API to the full hidden-states and attention weights,
  - tokenizer and base model's API are standardized to easily switch between models.

- incorporate a subjective selection of promising tools for fine-tuning/investiguating these models:
+- incorporate a subjective selection of promising tools for fine-tuning/investigating these models:

  - a simple/consistent way to add new tokens to the vocabulary and embeddings for fine-tuning,
  - simple ways to mask and prune transformer heads.
@@ -33,7 +33,7 @@ A few other goals:

 The library is build around three type of classes for each models:

- **model classes** which are PyTorch models (`torch.nn.Modules`) of the 6 models architectures currently provided in the library, e.g. `BertModel`
+- **model classes** which are PyTorch models (`torch.nn.Module`) of the 8 model architectures currently provided in the library, e.g. `BertModel`
 - **configuration classes** which store all the parameters required to build a model, e.g. `BertConfig`. You don't always need to instantiate these your-self, in particular if you are using a pretrained model without any modification, creating the model will automatically take care of instantiating the configuration (which is part of the model)
 - **tokenizer classes** which store the vocabulary for each model and provide methods for encoding/decoding strings in list of token embeddings indices to be fed to a model, e.g. `BertTokenizer`

@@ -51,7 +51,7 @@ We'll finish this quickstart tour by going through a few simple quick-start exam

 Here are two examples showcasing a few `Bert` and `GPT2` classes and pre-trained models.

-See full API reference for examples for each model classe.
+See full API reference for examples for each model class.

 ### BERT example

@@ -93,8 +93,8 @@ Let's see how we can use `BertModel` to encode our inputs in hidden-states:
 # Load pre-trained model (weights)
 model = BertModel.from_pretrained('bert-base-uncased')

-# Set the model in evaluation mode to desactivate the DropOut modules
-# This is IMPORTANT to have reproductible results during evaluation!
+# Set the model in evaluation mode to deactivate the DropOut modules
+# This is IMPORTANT to have reproducible results during evaluation!
 model.eval()

 # If you have a GPU, put everything on cuda
@@ -168,8 +168,8 @@ Let's see how to use `GPT2LMHeadModel` to generate the next token following our
 # Load pre-trained model (weights)
 model = GPT2LMHeadModel.from_pretrained('gpt2')

-# Set the model in evaluation mode to desactivate the DropOut modules
-# This is IMPORTANT to have reproductible results during evaluation!
+# Set the model in evaluation mode to deactivate the DropOut modules
+# This is IMPORTANT to have reproducible results during evaluation!
 model.eval()

 # If you have a GPU, put everything on cuda
--- a/docs/source/serialization.rst
+++ b/docs/source/serialization.rst
@@ -33,6 +33,8 @@ where
    * ``bert-large-uncased-whole-word-masking``: 24-layer, 1024-hidden, 16-heads, 340M parameters - Trained with Whole Word Masking (mask all of the the tokens corresponding to a word at once)
    * ``bert-large-cased-whole-word-masking``: 24-layer, 1024-hidden, 16-heads, 340M parameters - Trained with Whole Word Masking (mask all of the the tokens corresponding to a word at once)
    * ``bert-large-uncased-whole-word-masking-finetuned-squad``: The ``bert-large-uncased-whole-word-masking`` model finetuned on SQuAD (using the ``run_bert_squad.py`` examples). Results: *exact_match: 86.91579943235573, f1: 93.1532499015869*
+    * ``bert-base-german-dbmdz-cased``: Trained on German data only, 12-layer, 768-hidden, 12-heads, 110M parameters (see `Performance Evaluation <https://github.com/dbmdz/german-bert>`__)
+    * ``bert-base-german-dbmdz-uncased``: Trained on (uncased) German data only, 12-layer, 768-hidden, 12-heads, 110M parameters (see `Performance Evaluation <https://github.com/dbmdz/german-bert>`__)
    * ``openai-gpt``: OpenAI GPT English model, 12-layer, 768-hidden, 12-heads, 110M parameters
    * ``gpt2``: OpenAI GPT-2 English model, 12-layer, 768-hidden, 12-heads, 117M parameters
    * ``gpt2-medium``: OpenAI GPT-2 English model, 24-layer, 1024-hidden, 16-heads, 345M parameters
--- a/examples/README.md
+++ b/examples/README.md
@@ -9,7 +9,7 @@ similar API between the different models.
 | [Language Generation](#language-generation) | Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL and XLNet.                                         |
 | [GLUE](#glue) | Examples running BERT/XLM/XLNet/RoBERTa on the 9 GLUE tasks. Examples feature distributed training as well as half-precision.                              |
 | [SQuAD](#squad) | Using BERT for question answering, examples with distributed training.                                                                                  |
-| [Multiple Choice](#multiple choice) | Examples running BERT/XLNet/RoBERTa on the SWAG/RACE/ARC tasks. 
+| [Multiple Choice](#multiple-choice) | Examples running BERT/XLNet/RoBERTa on the SWAG/RACE/ARC tasks. |

 ## Language model fine-tuning

@@ -283,17 +283,17 @@ The results  are the following:
  loss = 0.04755385363816904
 ```

-##Multiple Choice
+## Multiple Choice

 Based on the script [`run_multiple_choice.py`]().

 #### Fine-tuning on SWAG
 Download [swag](https://github.com/rowanz/swagaf/tree/master/data) data

-```
+```bash
 #training on 4 tesla V100(16GB) GPUS
 export SWAG_DIR=/path/to/swag_data_dir
-python ./examples/single_model_scripts/run_multiple_choice.py \
+python ./examples/run_multiple_choice.py \
 --model_type roberta \
 --task_name swag \
 --model_name_or_path roberta-base \
--- a/examples/contrib/run_swag.py
+++ b/examples/contrib/run_swag.py
@@ -31,9 +31,13 @@ import torch
 from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
                              TensorDataset)
 from torch.utils.data.distributed import DistributedSampler
-from tqdm import tqdm, trange

-from tensorboardX import SummaryWriter
+try:
+    from torch.utils.tensorboard import SummaryWriter
+except ImportError:
+    from tensorboardX import SummaryWriter
+
+from tqdm import tqdm, trange

 from transformers import (WEIGHTS_NAME, BertConfig,
                                  BertForMultipleChoice, BertTokenizer)
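
The import fallback added to `run_swag.py` in this hunk (prefer `torch.utils.tensorboard`, fall back to `tensorboardX`) could also be written as a small reusable helper. The following is a minimal sketch, not code from this pull request: the helper name `import_first_available` is hypothetical, and it assumes that only a missing module or a missing attribute should trigger the fallback.

```python
import importlib


def import_first_available(candidates):
    """Return the first importable attribute from (module_name, attribute_name) pairs.

    Hypothetical helper for illustration; it is not part of transformers.
    """
    errors = []
    for module_name, attribute_name in candidates:
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attribute_name)
        except (ImportError, AttributeError) as exc:
            # Remember why this candidate failed, then try the next one.
            errors.append("{}.{}: {}".format(module_name, attribute_name, exc))
    raise ImportError("None of the candidates could be imported: " + "; ".join(errors))


# Usage mirroring the hunk (left commented out so the sketch runs without torch):
# SummaryWriter = import_first_available([
#     ("torch.utils.tensorboard", "SummaryWriter"),
#     ("tensorboardX", "SummaryWriter"),
# ])
```

Catching only `ImportError` and `AttributeError` keeps unrelated failures (for example a `KeyboardInterrupt` during import) from being silently swallowed.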
--- a/examples/distillation/README.md
+++ b/examples/distillation/README.md
@@ -1,22 +1,25 @@
-# DistilBERT
+# Distil*

-This folder contains the original code used to train DistilBERT as well as examples showcasing how to use DistilBERT.
+This folder contains the original code used to train Distil* as well as examples showcasing how to use DistilBERT and DistilGPT2.
+
+**2019, October 3rd - Update:** We have released our [NeurIPS workshop paper](https://arxiv.org/abs/1910.01108) explaining our approach to **DistilBERT**. It includes updated results and further experiments. We applied the same method to GPT2 and released the weights of **DistilGPT2**. DistilGPT2 is two times faster and 33% smaller than GPT2.

 **2019, September 19th - Update:** We fixed bugs in the code and released an upadted version of the weights trained with a modification of the distillation loss. DistilBERT now reaches 97% of `BERT-base`'s performance on GLUE, and 86.9 F1 score on SQuAD v1.1 dev set (compared to 88.5 for `BERT-base`). We will publish a formal write-up of our approach in the near future!

-## What is DistilBERT
+## What is Distil*

-DistilBERT stands for Distillated-BERT. DistilBERT is a small, fast, cheap and light Transformer model based on Bert architecture. It has 40% less parameters than `bert-base-uncased`, runs 60% faster while preserving 97% of BERT's performances as measured on the GLUE language understanding benchmark. DistilBERT is trained using knowledge distillation, a technique to compress a large model called the teacher into a smaller model called the student. By distillating Bert, we obtain a smaller Transformer model that bears a lot of similarities with the original BERT model while being lighter, smaller and faster to run. DistilBERT is thus an interesting option to put large-scaled trained Transformer model into production.
+Distil* is a class of compressed models that started with DistilBERT. DistilBERT stands for Distilled-BERT. DistilBERT is a small, fast, cheap and light Transformer model based on the BERT architecture. It has 40% fewer parameters than `bert-base-uncased` and runs 60% faster while preserving 97% of BERT's performance as measured on the GLUE language understanding benchmark. DistilBERT is trained using knowledge distillation, a technique to compress a large model called the teacher into a smaller model called the student. By distilling BERT, we obtain a smaller Transformer model that bears a lot of similarities with the original BERT model while being lighter, smaller and faster to run. DistilBERT is thus an interesting option for putting large-scale trained Transformer models into production.

-For more information on DistilBERT, please refer to our [detailed blog post](https://medium.com/huggingface/smaller-faster-cheaper-lighter-introducing-distilbert-a-distilled-version-of-bert-8cf3380435b5
-). *Please note that we will publish a formal write-up with updated and more complete results in the near future (September 19th).*
+We have applied the same method to GPT2 and released the weights of the compressed model. On the [WikiText-103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) benchmark, GPT2 reaches a perplexity of 15.0 on the test set, compared to 18.5 for DistilGPT2 (after fine-tuning on the train set).

-Here's the updated results on the dev sets of GLUE:
+For more information on DistilBERT, please refer to our [NeurIPS workshop paper](https://arxiv.org/abs/1910.01108). The paper supersedes our [previous blogpost](https://medium.com/huggingface/distilbert-8cf3380435b5) with a different distillation loss and better performance.

-| Model      | Macro-score | CoLA | MNLI | MRPC | QNLI | QQP | RTE | SST-2 | STS-B | WNLI |
+Here are the results on the dev sets of GLUE:
+
+| Model      | Macro-score | CoLA | MNLI | MRPC | QNLI | QQP  | RTE  | SST-2| STS-B| WNLI |
 | :---:      |    :---:    | :---:| :---:| :---:| :---:| :---:| :---:| :---:| :---:| :---:|
 | BERT-base  |  **77.6**   | 48.9 | 84.3 | 88.6 | 89.3 | 89.5 | 71.3 | 91.7 | 91.2 | 43.7 |
-| DistilBERT |  **75.2**   | 49.1 | 81.8 | 90.2 | 87.0 | 89.2 | 62.9 | 92.7 | 90.7 | 44.4 |
+| DistilBERT |  **76.8**   | 49.1 | 81.8 | 90.2 | 90.2 | 89.2 | 62.9 | 92.7 | 90.7 | 44.4 |

 ## Setup

@@ -26,10 +29,12 @@ This part of the library has only be tested with Python3.6+. There are few speci

 ## How to use DistilBERT

-Transformers includes two pre-trained DistilBERT models, currently only provided for English (we are investigating the possibility to train and release a multilingual version of DistilBERT):
+Transformers includes three pre-trained Distil* models, currently only provided for English (we are investigating the possibility of training and releasing a multilingual version of DistilBERT):

 - `distilbert-base-uncased`: DistilBERT English language model pretrained on the same data used to pretrain Bert (concatenation of the Toronto Book Corpus and full English Wikipedia) using distillation with the supervision of the `bert-base-uncased` version of Bert. The model has 6 layers, 768 dimension and 12 heads, totalizing 66M parameters.
 - `distilbert-base-uncased-distilled-squad`: A finetuned version of `distilbert-base-uncased` finetuned using (a second step of) knwoledge distillation on SQuAD 1.0. This model reaches a F1 score of 86.9 on the dev set (for comparison, Bert `bert-base-uncased` version reaches a 88.5 F1 score).
+- `distilgpt2`: DistilGPT2 English language model pretrained with the supervision of `gpt2` (the smallest version of GPT2) on [OpenWebTextCorpus](https://skylion007.github.io/OpenWebTextCorpus/), a reproduction of OpenAI's WebText dataset. The model has 6 layers, 768 dimensions and 12 heads, totaling 82M parameters (compared to 124M parameters for GPT2). On average, DistilGPT2 is two times faster than GPT2.
+- and more to come! 🤗🤗🤗

 Using DistilBERT is very similar to using BERT. DistilBERT share the same tokenizer as BERT's `bert-base-uncased` even though we provide a link to this tokenizer under the `DistilBertTokenizer` name to have a consistent naming between the library models.

@@ -42,9 +47,11 @@ outputs = model(input_ids)
 last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
 ```

-## How to train DistilBERT
+Similarly, using DistilGPT2 simply consists of calling the GPT2 classes with a different pretrained checkpoint: `model = GPT2Model.from_pretrained('distilgpt2')`.

-In the following, we will explain how you can train your own compressed model.
+## How to train Distil*
+
+In the following, we will explain how you can train DistilBERT.

 ### A. Preparing the data

@@ -57,7 +64,8 @@ First, we will binarize the data, i.e. tokenize the data and convert each token
 ```bash
 python scripts/binarized_data.py \
    --file_path data/dump.txt \
-    --bert_tokenizer bert-base-uncased \
+    --tokenizer_type bert \
+    --tokenizer_name bert-base-uncased \
    --dump_file data/binarized_text
 ```

@@ -66,7 +74,8 @@ Our implementation of masked language modeling loss follows [XLM](https://github
 ```bash
 python scripts/token_counts.py \
    --data_file data/binarized_text.bert-base-uncased.pickle \
-    --token_counts_dump data/token_counts.bert-base-uncased.pickle
+    --token_counts_dump data/token_counts.bert-base-uncased.pickle \
+    --vocab_size 30522
 ```

 ### B. Training
@@ -75,6 +84,12 @@ Training with distillation is really simple once you have pre-processed the data

 ```bash
 python train.py \
+    --student_type distilbert \
+    --student_config training_configs/distilbert-base-uncased.json \
+    --teacher_type bert \
+    --teacher_name bert-base-uncased \
+    --alpha_ce 5.0 --alpha_mlm 2.0 --alpha_cos 1.0 --mlm \
+    --freeze_pos_embs \
    --dump_path serialization_dir/my_first_training \
    --data_file data/binarized_text.bert-base-uncased.pickle \
    --token_counts data/token_counts.bert-base-uncased.pickle \
@@ -83,7 +98,7 @@ python train.py \

 By default, this will launch a training on a single GPU (even if more are available on the cluster). Other parameters are available in the command line, please look in `train.py` or run `python train.py --help` to list them.

-We highly encourage you to use distributed training for training DistilBert as the training corpus is quite large. Here's an example that runs a distributed training on a single node having 4 GPUs:
+We highly encourage you to use distributed training for training DistilBERT as the training corpus is quite large. Here's an example that runs a distributed training on a single node having 4 GPUs:

 ```bash
 export NODE_RANK=0
@@ -105,11 +120,17 @@ python -m torch.distributed.launch \
    train.py \
        --force \
        --n_gpu $WORLD_SIZE \
+        --student_type distilbert \
+        --student_config training_configs/distilbert-base-uncased.json \
+        --teacher_type bert \
+        --teacher_name bert-base-uncased \
+        --alpha_ce 0.33 --alpha_mlm 0.33 --alpha_cos 0.33 --mlm \
+        --freeze_pos_embs \
+        --dump_path serialization_dir/my_first_training \
        --data_file data/binarized_text.bert-base-uncased.pickle \
-        --token_counts data/token_counts.bert-base-uncased.pickle \
-        --dump_path serialization_dir/my_first_distillation
+        --token_counts data/token_counts.bert-base-uncased.pickle
 ```

-**Tips:** Starting distillated training with good initialization of the model weights is crucial to reach decent performance. In our experiments, we initialized our model from a few layers of the teacher (Bert) itself! Please refer to `scripts/extract_for_distil.py` to create a valid initialization checkpoint and use `--from_pretrained_weights` and `--from_pretrained_config` arguments to use this initialization for the distilled training!
+**Tips:** Starting distillation training with a good initialization of the model weights is crucial to reach decent performance. In our experiments, we initialized our model from a few layers of the teacher (BERT) itself! Please refer to `scripts/extract.py` and `scripts/extract_distilbert.py` to create a valid initialization checkpoint, and use the `--student_pretrained_weights` argument to use this initialization for the distilled training!

 Happy distillation!
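
To make the role of the `--alpha_ce` and `--alpha_mlm` weights and of the temperature more concrete, here is a dependency-free sketch of a weighted distillation objective for a single token. It is an illustration under stated assumptions, not the code in `distiller.py`: the function names are hypothetical, the default `temperature=2.0` is an assumed value, the `temperature ** 2` rescaling follows the usual knowledge-distillation convention, and the cosine term (`--alpha_cos`) is omitted for brevity.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, with temperature scaling."""
    scaled = [value / temperature for value in logits]
    largest = max(scaled)
    exps = [math.exp(value - largest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def kl_divergence(teacher_probs, student_probs):
    """KL(teacher || student) for two discrete distributions."""
    return sum(t * math.log(t / s)
               for t, s in zip(teacher_probs, student_probs) if t > 0.0)


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha_ce=5.0, alpha_mlm=2.0):
    """Weighted sum of a soft-target term and a hard-label term (illustrative only)."""
    # Soft-target term: match the teacher's temperature-softened distribution.
    soft_student = softmax(student_logits, temperature)
    soft_teacher = softmax(teacher_logits, temperature)
    loss_ce = kl_divergence(soft_teacher, soft_student) * temperature ** 2

    # Hard-label term: ordinary cross-entropy against the true token id.
    loss_mlm = -math.log(softmax(student_logits)[label])

    return alpha_ce * loss_ce + alpha_mlm * loss_mlm
```

When the student reproduces the teacher's logits exactly, the soft-target term vanishes and only the weighted cross-entropy remains, which is a quick sanity check on the weighting.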
--- a/examples/distillation/distiller.py
+++ b/examples/distillation/distiller.py
@@ -12,14 +12,13 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-""" The distiller to distil DistilBERT
-    adapted in part from Facebook, Inc XLM model (https://github.com/facebookresearch/XLM)
+""" The distiller to distil the student.
+    Adapted in part from Facebook, Inc XLM model (https://github.com/facebookresearch/XLM)
 """
 import os
 import math
 import psutil
 import time
-from tensorboardX import SummaryWriter
 from tqdm import trange, tqdm
 import numpy as np
 import psutil
@@ -28,16 +27,24 @@ import torch
 import torch.nn as nn
 import torch.nn.functional as F
 from torch.optim import AdamW
+from torch.utils.data.distributed import DistributedSampler
+from torch.utils.data import RandomSampler, BatchSampler, DataLoader
+
+try:
+    from torch.utils.tensorboard import SummaryWriter
+except ImportError:
+    from tensorboardX import SummaryWriter

 from transformers import WarmupLinearSchedule

 from utils import logger
-from dataset import Dataset
+from lm_seqs_dataset import LmSeqsDataset
+from grouped_batch_sampler import GroupedBatchSampler, create_lengths_groups

 class Distiller:
    def __init__(self,
                 params: dict,
-                 dataloader: Dataset,
+                 dataset: LmSeqsDataset,
                 token_probs: torch.tensor,
                 student: nn.Module,
                 teacher: nn.Module):
@@ -50,33 +57,47 @@ class Distiller:
        self.student = student
        self.teacher = teacher

-        self.dataloader = dataloader
-        if self.params.n_gpu > 1:
-            self.dataloader.split()
-        self.get_iterator(seed=params.seed)
+        self.student_config = student.config
+        self.vocab_size = student.config.vocab_size
+
+        if params.n_gpu <= 1:
+            sampler = RandomSampler(dataset)
+        else:
+            sampler = DistributedSampler(dataset)
+
+        if params.group_by_size:
+            groups = create_lengths_groups(lengths=dataset.lengths, k=params.max_model_input_size)
+            sampler = GroupedBatchSampler(sampler=sampler, group_ids=groups, batch_size=params.batch_size)
+        else:
+            sampler = BatchSampler(sampler=sampler, batch_size=params.batch_size, drop_last=False)
+
+        self.dataloader = DataLoader(dataset=dataset,
+                                     batch_sampler=sampler,
+                                     collate_fn=dataset.batch_sequences)

        self.temperature = params.temperature
        assert self.temperature > 0.

        self.alpha_ce = params.alpha_ce
        self.alpha_mlm = params.alpha_mlm
+        self.alpha_clm = params.alpha_clm
        self.alpha_mse = params.alpha_mse
        self.alpha_cos = params.alpha_cos
-        assert self.alpha_ce >= 0.
-        assert self.alpha_mlm >= 0.
-        assert self.alpha_mse >= 0.
-        assert self.alpha_cos >= 0.
-        assert self.alpha_ce + self.alpha_mlm + self.alpha_mse + self.alpha_cos > 0.

-        self.mlm_mask_prop = params.mlm_mask_prop
-        assert 0.0 <= self.mlm_mask_prop <= 1.0
-        assert params.word_mask + params.word_keep + params.word_rand == 1.0
-        self.pred_probs = torch.FloatTensor([params.word_mask, params.word_keep, params.word_rand])
-        self.pred_probs = self.pred_probs.to(f'cuda:{params.local_rank}') if params.n_gpu > 0 else self.pred_probs
-        self.token_probs = token_probs.to(f'cuda:{params.local_rank}') if params.n_gpu > 0 else token_probs
-        if self.fp16:
-            self.pred_probs = self.pred_probs.half()
-            self.token_probs = self.token_probs.half()
+        self.mlm = params.mlm
+        if self.mlm:
+            logger.info(f'Using MLM loss for LM step.')
+            self.mlm_mask_prop = params.mlm_mask_prop
+            assert 0.0 <= self.mlm_mask_prop <= 1.0
+            assert params.word_mask + params.word_keep + params.word_rand == 1.0
+            self.pred_probs = torch.FloatTensor([params.word_mask, params.word_keep, params.word_rand])
+            self.pred_probs = self.pred_probs.to(f'cuda:{params.local_rank}') if params.n_gpu > 0 else self.pred_probs
+            self.token_probs = token_probs.to(f'cuda:{params.local_rank}') if params.n_gpu > 0 else token_probs
+            if self.fp16:
+                self.pred_probs = self.pred_probs.half()
+                self.token_probs = self.token_probs.half()
+        else:
+            logger.info(f'Using CLM loss for LM step.')

        self.epoch = 0
        self.n_iter = 0
@@ -86,12 +107,13 @@ class Distiller:
        self.last_loss = 0
        self.last_loss_ce = 0
        self.last_loss_mlm = 0
+        self.last_loss_clm = 0
        if self.alpha_mse > 0.: self.last_loss_mse = 0
        if self.alpha_cos > 0.: self.last_loss_cos = 0
        self.last_log = 0

        self.ce_loss_fct = nn.KLDivLoss(reduction='batchmean')
-        self.mlm_loss_fct = nn.CrossEntropyLoss(ignore_index=-1)
+        self.lm_loss_fct = nn.CrossEntropyLoss(ignore_index=-1)
        if self.alpha_mse > 0.:
            self.mse_loss_fct = nn.MSELoss(reduction='sum')
        if self.alpha_cos > 0.:
@@ -99,7 +121,7 @@ class Distiller:

        logger.info('--- Initializing model optimizer')
        assert params.gradient_accumulation_steps >= 1
-        self.num_steps_epoch = int(len(self.dataloader) / params.batch_size) + 1
+        self.num_steps_epoch = len(self.dataloader)
        num_train_optimization_steps = int(self.num_steps_epoch / params.gradient_accumulation_steps * params.n_epoch) + 1

        no_decay = ['bias', 'LayerNorm.weight']
@@ -140,43 +162,18 @@ class Distiller:
                logger.info("Using nn.parallel.DistributedDataParallel for distributed training.")
                self.student = DistributedDataParallel(self.student,
                                                       device_ids=[params.local_rank],
-                                                       output_device=params.local_rank)
+                                                       output_device=params.local_rank,
+                                                       find_unused_parameters=True)

        self.is_master = params.is_master
        if self.is_master:
            logger.info('--- Initializing Tensorboard')
            self.tensorboard = SummaryWriter(log_dir=os.path.join(self.dump_path, 'log', 'train'))
-            self.tensorboard.add_text(tag='config', text_string=str(self.params), global_step=0)
+            self.tensorboard.add_text(tag='config/training', text_string=str(self.params), global_step=0)
+            self.tensorboard.add_text(tag='config/student', text_string=str(self.student_config), global_step=0)

-    def get_iterator(self,
-                     seed: int = None):
-        """
-        Initialize the data iterator.
-        Each process has its own data iterator (iterating on his own random portion of the dataset).
-
-        Input:
-        ------
-            seed: `int` - The random seed.
-        """
-        logger.info('--- Initializing Data Iterator')
-        self.data_iterator = self.dataloader.get_iterator(seed=seed)
-
-    def get_batch(self):
-        """
-        Call the data iterator to output a new batch.
-        If the data iterator went through the whole dataset, create a new iterator.
-        """
-        assert hasattr(self, 'data_iterator')
-        try:
-            x = next(self.data_iterator)
-        except StopIteration:
-            logger.warning('--- Went through the whole dataset. Creating new data iterator.')
-            self.data_iterator = self.dataloader.get_iterator()
-            x = next(self.data_iterator)
-        return x
-
-    def prepare_batch(self,
-                      batch):
+    def prepare_batch_mlm(self,
+                          batch):
        """
        Prepare the batch: from the token_ids and the lenghts, compute the attention mask and the masked label for MLM.

@@ -222,7 +219,7 @@ class Distiller:
                assert pred_mask.sum().item() % 8 == 0, pred_mask.sum().item()

        _token_ids_real = token_ids[pred_mask]
-        _token_ids_rand = _token_ids_real.clone().random_(self.params.vocab_size)
+        _token_ids_rand = _token_ids_real.clone().random_(self.vocab_size)
        _token_ids_mask = _token_ids_real.clone().fill_(self.params.special_tok_ids['mask_token'])
        probs = torch.multinomial(self.pred_probs, len(_token_ids_real), replacement=True)
        _token_ids = _token_ids_mask * (probs == 0).long() + _token_ids_real * (probs == 1).long() + _token_ids_rand * (probs == 2).long()
@@ -230,8 +227,41 @@ class Distiller:

        mlm_labels[~pred_mask] = -1 # previously `mlm_labels[1-pred_mask] = -1`, cf pytorch 1.2.0 compatibility

+        # sanity checks
+        assert 0 <= token_ids.min() <= token_ids.max() < self.vocab_size
+
        return token_ids, attn_mask, mlm_labels

+    def prepare_batch_clm(self,
+                          batch):
+        """
+        Prepare the batch: from the token_ids and the lengths, compute the attention mask and the labels for CLM.
+
+        Input:
+        ------
+            batch: `Tuple`
+                token_ids: `torch.tensor(bs, seq_length)` - The token ids for each of the sequences. It is padded.
+                lengths: `torch.tensor(bs)` - The lengths of each of the sequences in the batch.
+
+        Output:
+        -------
+            token_ids: `torch.tensor(bs, seq_length)` - The token ids after the modifications for CLM.
+            attn_mask: `torch.tensor(bs, seq_length)` - The attention mask for the self-attention.
+            clm_labels: `torch.tensor(bs, seq_length)` - The causal language modeling labels. There is a -1 where there is nothing to predict.
+        """
+        token_ids, lengths = batch
+        token_ids, lengths = self.round_batch(x=token_ids, lengths=lengths)
+        assert token_ids.size(0) == lengths.size(0)
+
+        attn_mask = (torch.arange(token_ids.size(1), dtype=torch.long, device=lengths.device) < lengths[:, None])
+        clm_labels = token_ids.new(token_ids.size()).copy_(token_ids)
+        clm_labels[~attn_mask] = -1 # previously `clm_labels[1-attn_mask] = -1`, cf pytorch 1.2.0 compatibility
+
+        # sanity checks
+        assert 0 <= token_ids.min() <= token_ids.max() < self.vocab_size
+
+        return token_ids, attn_mask, clm_labels
+
    def round_batch(self,
                    x: torch.tensor,
                    lengths: torch.tensor):
@@ -269,7 +299,10 @@ class Distiller:
        if ml1 % 8 != 0:
            pad = 8 - (ml1 % 8)
            ml2 = ml1 + pad
-            pad_id = self.params.special_tok_ids['pad_token']
+            if self.mlm:
+                pad_id = self.params.special_tok_ids['pad_token']
+            else:
+                pad_id = self.params.special_tok_ids['unk_token']
            padding_tensor = torch.zeros(bs2, pad, dtype=torch.long, device=x.device).fill_(pad_id)
            x = torch.cat([x, padding_tensor], 1)
            assert x.size() == (bs2, ml2)
@@ -292,14 +325,16 @@ class Distiller:
            if self.multi_gpu:
                torch.distributed.barrier()

-            iter_bar = trange(self.num_steps_epoch, desc="-Iter", disable=self.params.local_rank not in [-1, 0])
-            for __ in range(self.num_steps_epoch):
-                batch = self.get_batch()
+            iter_bar = tqdm(self.dataloader, desc="-Iter", disable=self.params.local_rank not in [-1, 0])
+            for batch in iter_bar:
                if self.params.n_gpu > 0:
                    batch = tuple(t.to(f'cuda:{self.params.local_rank}') for t in batch)
-                token_ids, attn_mask, mlm_labels = self.prepare_batch(batch=batch)

-                self.step(input_ids=token_ids, attention_mask=attn_mask, mlm_labels=mlm_labels)
+                if self.mlm:
+                    token_ids, attn_mask, lm_labels = self.prepare_batch_mlm(batch=batch)
+                else:
+                    token_ids, attn_mask, lm_labels = self.prepare_batch_clm(batch=batch)
+                self.step(input_ids=token_ids, attention_mask=attn_mask, lm_labels=lm_labels)

                iter_bar.update()
                iter_bar.set_postfix({'Last_loss': f'{self.last_loss:.2f}',
@@ -317,7 +352,7 @@ class Distiller:
    def step(self,
             input_ids: torch.tensor,
             attention_mask: torch.tensor,
-             mlm_labels: torch.tensor):
+             lm_labels: torch.tensor):
        """
        One optimization step: forward of student AND teacher, backward on the loss (for gradient accumulation),
        and possibly a parameter update (depending on the gradient accumulation).
@@ -326,17 +361,22 @@ class Distiller:
        ------
        input_ids: `torch.tensor(bs, seq_length)` - The token ids.
        attention_mask: `torch.tensor(bs, seq_length)` - The attention mask for self attention.
-        mlm_labels: `torch.tensor(bs, seq_length)` - The masked language modeling labels.
+        lm_labels: `torch.tensor(bs, seq_length)` - The language modeling labels (mlm labels for MLM and clm labels for CLM).
        """
-        s_logits, s_hidden_states = self.student(input_ids=input_ids, attention_mask=attention_mask)     # (bs, seq_length, voc_size)
-        with torch.no_grad():
-            t_logits, t_hidden_states = self.teacher(input_ids=input_ids, attention_mask=attention_mask) # (bs, seq_length, voc_size)
+        if self.mlm:
+            s_logits, s_hidden_states = self.student(input_ids=input_ids, attention_mask=attention_mask)     # (bs, seq_length, voc_size)
+            with torch.no_grad():
+                t_logits, t_hidden_states = self.teacher(input_ids=input_ids, attention_mask=attention_mask) # (bs, seq_length, voc_size)
+        else:
+            s_logits, _, s_hidden_states = self.student(input_ids=input_ids, attention_mask=None)            # (bs, seq_length, voc_size)
+            with torch.no_grad():
+                t_logits, _, t_hidden_states = self.teacher(input_ids=input_ids, attention_mask=None)           # (bs, seq_length, voc_size)
        assert s_logits.size() == t_logits.size()

        #https://github.com/peterliht/knowledge-distillation-pytorch/blob/master/model/net.py#L100
        #https://github.com/peterliht/knowledge-distillation-pytorch/issues/2
        if self.params.restrict_ce_to_mask:
-            mask = (mlm_labels>-1).unsqueeze(-1).expand_as(s_logits)   # (bs, seq_lenth, voc_size)
+            mask = (lm_labels>-1).unsqueeze(-1).expand_as(s_logits)    # (bs, seq_length, voc_size)
        else:
            mask = attention_mask.unsqueeze(-1).expand_as(s_logits)    # (bs, seq_lenth, voc_size)
        s_logits_slct = torch.masked_select(s_logits, mask)            # (bs * seq_length * voc_size) modulo the 1s in mask
@@ -348,13 +388,20 @@ class Distiller:
        loss_ce = self.ce_loss_fct(F.log_softmax(s_logits_slct/self.temperature, dim=-1),
                                   F.softmax(t_logits_slct/self.temperature, dim=-1)) * (self.temperature)**2
        loss = self.alpha_ce*loss_ce
+
        if self.alpha_mlm > 0.:
-            loss_mlm = self.mlm_loss_fct(s_logits.view(-1, s_logits.size(-1)), mlm_labels.view(-1))
+            loss_mlm = self.lm_loss_fct(s_logits.view(-1, s_logits.size(-1)), lm_labels.view(-1))
            loss += self.alpha_mlm * loss_mlm
+        if self.alpha_clm > 0.:
+            shift_logits = s_logits[..., :-1, :].contiguous()
+            shift_labels = lm_labels[..., 1:].contiguous()
+            loss_clm = self.lm_loss_fct(shift_logits.view(-1, shift_logits.size(-1)),
+                                        shift_labels.view(-1))
+            loss += self.alpha_clm * loss_clm
+
        if self.alpha_mse > 0.:
            loss_mse = self.mse_loss_fct(s_logits_slct, t_logits_slct)/s_logits_slct.size(0) # Reproducing batchmean reduction
            loss += self.alpha_mse * loss_mse
-        
        if self.alpha_cos > 0.:
            s_hidden_states = s_hidden_states[-1]                              # (bs, seq_length, dim)
            t_hidden_states = t_hidden_states[-1]                              # (bs, seq_length, dim)
@@ -376,6 +423,8 @@ class Distiller:
        self.last_loss_ce = loss_ce.item()
        if self.alpha_mlm > 0.:
            self.last_loss_mlm = loss_mlm.item()
+        if self.alpha_clm > 0.:
+            self.last_loss_clm = loss_clm.item()
        if self.alpha_mse > 0.:
            self.last_loss_mse = loss_mse.item()
        if self.alpha_cos > 0.:
@@ -452,6 +501,8 @@ class Distiller:
        self.tensorboard.add_scalar(tag="losses/loss_ce", scalar_value=self.last_loss_ce, global_step=self.n_total_iter)
        if self.alpha_mlm > 0.:
            self.tensorboard.add_scalar(tag="losses/loss_mlm", scalar_value=self.last_loss_mlm, global_step=self.n_total_iter)
+        if self.alpha_clm > 0.:
+            self.tensorboard.add_scalar(tag="losses/loss_clm", scalar_value=self.last_loss_clm, global_step=self.n_total_iter)
        if self.alpha_mse > 0.:
            self.tensorboard.add_scalar(tag="losses/loss_mse", scalar_value=self.last_loss_mse, global_step=self.n_total_iter)
        if self.alpha_cos > 0.:
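
Note on the `alpha_clm` branch of `Distiller.step` in the diff above: the causal LM loss drops the last logit and the first label, so the logits at position t are scored against the token at position t+1, and labels equal to -1 are skipped. The following is a minimal pure-Python sketch of that alignment, not the code from the PR. It assumes that `lm_loss_fct` is a mean cross-entropy that ignores the label -1 (its construction is not shown in this hunk), and the helper names `log_softmax` and `shifted_clm_loss` are hypothetical.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]

def shifted_clm_loss(logits, labels, ignore_index=-1):
    """Mean negative log-likelihood of token t+1 given the logits at position t.

    logits: list of seq_length rows, each a list of voc_size floats.
    labels: list of seq_length ints, with ignore_index where nothing is predicted.
    """
    shift_logits = logits[:-1]  # the last position has no next token to predict
    shift_labels = labels[1:]   # the first token is never a prediction target
    losses = [-log_softmax(row)[target]
              for row, target in zip(shift_logits, shift_labels)
              if target != ignore_index]
    return sum(losses) / len(losses)

# Toy example (assumed shapes): 4 positions, vocabulary of 3, last position padded.
logits = [[0.0, 0.0, 0.0] for _ in range(4)]
labels = [2, 0, 1, -1]
loss = shifted_clm_loss(logits, labels)
```

With uniform logits every valid target costs log(voc_size), so the toy example is easy to check by hand. The padded position contributes nothing because its label is -1.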
--- a/examples/distillation/grouped_batch_sampler.py
+++ b/examples/distillation/grouped_batch_sampler.py
@@ -0,0 +1,105 @@
+# coding=utf-8
+# Copyright 2019-present, the HuggingFace Inc. team and Facebook, Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" Adapted from PyTorch Vision (https://github.com/pytorch/vision/blob/master/references/detection/group_by_aspect_ratio.py)
+"""
+import bisect
+import copy
+from collections import defaultdict
+import numpy as np
+
+from torch.utils.data.sampler import BatchSampler, Sampler
+
+from utils import logger
+
+def _quantize(x, bins):
+    bins = copy.deepcopy(bins)
+    bins = sorted(bins)
+    quantized = list(map(lambda y: bisect.bisect_right(bins, y), x))
+    return quantized
+
+def create_lengths_groups(lengths, k=0):
+    bins = np.arange(start=3, stop=k, step=4).tolist() if k > 0 else [10]
+    groups = _quantize(lengths, bins)
+    # count number of elements per group
+    counts = np.unique(groups, return_counts=True)[1]
+    fbins = [0] + bins + [np.inf]
+    logger.info("Using {} as bins for lengths quantization".format(fbins))
+    logger.info("Count of instances per bin: {}".format(counts))
+    return groups
+
+class GroupedBatchSampler(BatchSampler):
+    """
+    Wraps another sampler to yield a mini-batch of indices.
+    It enforces that the batch only contains elements from the same group.
+    It also tries to provide mini-batches that follow an ordering which is
+    as close as possible to the ordering from the original sampler.
+    Arguments:
+        sampler (Sampler): Base sampler.
+        group_ids (list[int]): If the sampler produces indices in range [0, N),
+            `group_ids` must be a list of `N` ints which contains the group id of each sample.
+            The group ids must be a continuous set of integers starting from
+            0, i.e. they must be in the range [0, num_groups).
+        batch_size (int): Size of mini-batch.
+    """
+    def __init__(self, sampler, group_ids, batch_size):
+        if not isinstance(sampler, Sampler):
+            raise ValueError(
+                "sampler should be an instance of "
+                "torch.utils.data.Sampler, but got sampler={}".format(sampler)
+            )
+        self.sampler = sampler
+        self.group_ids = group_ids
+        self.batch_size = batch_size
+
+    def __iter__(self):
+        buffer_per_group = defaultdict(list)
+        samples_per_group = defaultdict(list)
+
+        num_batches = 0
+        for idx in self.sampler:
+            group_id = self.group_ids[idx]
+            buffer_per_group[group_id].append(idx)
+            samples_per_group[group_id].append(idx)
+            if len(buffer_per_group[group_id]) == self.batch_size:
+                yield buffer_per_group[group_id] #TODO
+                num_batches += 1
+                del buffer_per_group[group_id]
+            assert len(buffer_per_group[group_id]) < self.batch_size
+
+        # now we have run out of elements that satisfy
+        # the group criteria, let's return the remaining
+        # elements so that the size of the sampler is
+        # deterministic
+        expected_num_batches = len(self)
+        num_remaining = expected_num_batches - num_batches
+        if num_remaining > 0:
+            # for the remaining batches, group the batches by similar lengths
+            batch_idx = []
+            for group_id, idxs in sorted(buffer_per_group.items(), key=lambda x: x[0]):
+                batch_idx.extend(idxs)
+                if len(batch_idx) >= self.batch_size:
+                    yield batch_idx[:self.batch_size]
+                    batch_idx = batch_idx[self.batch_size:]
+                    num_remaining -= 1
+            if len(batch_idx) > 0:
+                yield batch_idx
+                num_remaining -= 1
+        assert num_remaining == 0
+
+    def __len__(self):
+        """
+        Return the number of mini-batches rather than the number of samples.
+        """
+        return (len(self.sampler) + self.batch_size - 1) // self.batch_size
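
Note on `grouped_batch_sampler.py` above: lengths are first mapped to bin indices with `bisect`, then indices are buffered per bin and emitted once a buffer reaches `batch_size`. Leftovers are merged in order of increasing bin, so the batch count stays at ceil(N / batch_size). The sketch below reproduces that logic with the standard library only. It is a simplified illustration rather than the class from the PR: `quantize` and `grouped_batches` are hypothetical names, and the bins and lengths are made up.

```python
import bisect
from collections import defaultdict

def quantize(lengths, bins):
    """Map each length to the index of the bin it falls into."""
    bins = sorted(bins)
    return [bisect.bisect_right(bins, n) for n in lengths]

def grouped_batches(indices, group_ids, batch_size):
    """Yield batches of indices sharing a group id, then flush the leftovers."""
    buffers = defaultdict(list)
    for idx in indices:
        gid = group_ids[idx]
        buffers[gid].append(idx)
        if len(buffers[gid]) == batch_size:
            yield buffers.pop(gid)
    # Leftovers: concatenate by increasing group id so that neighbouring
    # indices have similar lengths, then cut into batches.
    rest = [i for gid in sorted(buffers) for i in buffers[gid]]
    for start in range(0, len(rest), batch_size):
        yield rest[start:start + batch_size]

# Toy data (assumed): seven sequence lengths, bins at 10 and 20 tokens.
lengths = [5, 12, 7, 30, 11, 6, 28]
group_ids = quantize(lengths, bins=[10, 20])
batches = list(grouped_batches(range(len(lengths)), group_ids, batch_size=2))
```

Here `group_ids` is `[0, 1, 0, 2, 1, 0, 2]` and the batches are `[[0, 2], [1, 4], [3, 6], [5]]`: three homogeneous batches plus one leftover batch, i.e. ceil(7 / 2) = 4 batches in total.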
--- a/examples/distillation/lm_seqs_dataset.py
+++ b/examples/distillation/lm_seqs_dataset.py
@@ -12,30 +12,33 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-""" Dataloaders to train DistilBERT
+""" Dataset to distill models
    adapted in part from Facebook, Inc XLM model (https://github.com/facebookresearch/XLM)
 """
-from typing import List
-import math
-from itertools import chain
-from collections import Counter
-import numpy as np
 import torch
+from torch.utils.data import Dataset

+import numpy as np
 from utils import logger

-class Dataset:
+class LmSeqsDataset(Dataset):
+    """Custom Dataset wrapping language modeling sequences.
+
+    Each sample will be retrieved by indexing the list of token_ids and their corresponding lengths.
+
+    Input:
+    ------
+        params: `NameSpace` parameters
+        data: `List[np.array[int]]`
+    """
+
    def __init__(self,
                 params,
                 data):
        self.params = params
-        self.tokens_per_batch = params.tokens_per_batch
-        self.batch_size = params.batch_size
-        self.shuffle = params.shuffle
-        self.group_by_size = params.group_by_size

        self.token_ids = np.array(data)
-        self.lengths = np.uint16([len(t) for t in data])
+        self.lengths = np.array([len(t) for t in data])

        self.check()
        self.remove_long_sequences()
@@ -43,6 +46,9 @@ class Dataset:
        self.check()
        self.print_statistics()

+    def __getitem__(self, index):
+        return (self.token_ids[index], self.lengths[index])
+
    def __len__(self):
        return len(self.lengths)

@@ -51,12 +57,14 @@ class Dataset:
        Some sanity checks
        """
        assert len(self.token_ids) == len(self.lengths)
+        assert all(self.lengths[i] == len(self.token_ids[i]) for i in range(len(self.lengths)))

    def remove_long_sequences(self):
        """
-        Sequences that are too long are splitted by chunk of max_position_embeddings.
+        Sequences that are too long are split into chunks of max_model_input_size.
        """
-        indices = self.lengths >= self.params.max_position_embeddings
+        max_len = self.params.max_model_input_size
+        indices = self.lengths > max_len
        logger.info(f'Splitting {sum(indices)} too long sequences.')

        def divide_chunks(l, n):
@@ -64,10 +72,13 @@ class Dataset:

        new_tok_ids = []
        new_lengths = []
-        cls_id, sep_id = self.params.special_tok_ids['cls_token'], self.params.special_tok_ids['sep_token']
-        max_len = self.params.max_position_embeddings
+        if self.params.mlm:
+            cls_id, sep_id = self.params.special_tok_ids['cls_token'], self.params.special_tok_ids['sep_token']
+        else:
+            cls_id, sep_id = self.params.special_tok_ids['bos_token'], self.params.special_tok_ids['eos_token']

        for seq_, len_ in zip(self.token_ids, self.lengths):
+            assert (seq_[0] == cls_id) and (seq_[-1] == sep_id), seq_
            if len_ <= max_len:
                new_tok_ids.append(seq_)
                new_lengths.append(len_)
@@ -79,6 +90,7 @@ class Dataset:
                    if sub_s[-1] != sep_id:
                        sub_s = np.insert(sub_s, len(sub_s), sep_id)
                    assert len(sub_s) <= max_len
+                    assert (sub_s[0] == cls_id) and (sub_s[-1] == sep_id), sub_s
                    sub_seqs.append(sub_s)

                new_tok_ids.extend(sub_seqs)
@@ -113,89 +125,27 @@ class Dataset:
        # nb_unkown = sum([(t==unk_idx).sum() for t in self.token_ids])
        # logger.info(f'{nb_unkown} unknown tokens (covering {100*nb_unkown/data_len:.2f}% of the data)')

-    def select_data(self, a: int, b: int):
-        """
-        Select a subportion of the data.
-        """
-        n_sequences = len(self)
-        assert 0 <= a < b <= n_sequences, ValueError(f'`0 <= a < b <= n_sequences` is not met with a={a} and b={b}')
-
-        logger.info(f'Selecting sequences from {a} to {b} (excluded).')
-        self.token_ids = self.token_ids[a:b]
-        self.lengths = self.lengths[a:b]
-
-        self.check()
-
-    def split(self):
-        """
-        Distributed training: split the data accross the processes.
-        """
-        assert self.params.n_gpu > 1
-        logger.info('Splitting the data accross the processuses.')
-        n_seq = len(self)
-        n_seq_per_procesus = n_seq // self.params.world_size
-        a = n_seq_per_procesus * self.params.global_rank
-        b = a + n_seq_per_procesus
-        self.select_data(a=a, b=b)
-
    def batch_sequences(self,
-                        token_ids: List[List[int]],
-                        lengths: List[int]):
+                        batch):
        """
        Do the padding and transform into torch.tensor.
        """
+        token_ids = [t[0] for t in batch]
+        lengths = [t[1] for t in batch]
        assert len(token_ids) == len(lengths)

        # Max for paddings
        max_seq_len_ = max(lengths)

        # Pad token ids
-        pad_idx = self.params.special_tok_ids['pad_token']
+        if self.params.mlm:
+            pad_idx = self.params.special_tok_ids['pad_token']
+        else:
+            pad_idx = self.params.special_tok_ids['unk_token']
        tk_ = [list(t.astype(int)) + [pad_idx]*(max_seq_len_-len(t)) for t in token_ids]
        assert len(tk_) == len(token_ids)
        assert all(len(t) == max_seq_len_ for t in tk_)

-        tk_t = torch.tensor(tk_)                  # (bs, max_seq_len_)
-        lg_t = torch.tensor(lengths.astype(int))  # (bs)
+        tk_t = torch.tensor(tk_)      # (bs, max_seq_len_)
+        lg_t = torch.tensor(lengths)  # (bs)
        return tk_t, lg_t
-
-    def get_batches_iterator(self,
-                             batches):
-        """
-        Return an iterator over batches.
-        """
-        for sequences_ids in batches:
-            token_ids, lengths = self.batch_sequences(self.token_ids[sequences_ids],
-                                                    self.lengths[sequences_ids])
-            yield (token_ids, lengths)
-
-    def get_iterator(self,
-                     seed: int = None):
-        """
-        Return a data iterator.
-        """
-        rng = np.random.RandomState(seed)
-
-        n_sequences = len(self)
-        indices = np.arange(n_sequences)
-
-        if self.group_by_size:
-            indices = indices[np.argsort(self.lengths[indices], kind='mergesort')]
-
-        if self.tokens_per_batch == -1:
-            batches = np.array_split(indices, math.ceil(len(indices) * 1. / self.batch_size))
-        else:
-            assert self.tokens_per_batch > 0
-            batch_ids = np.cumsum(self.lengths[indices]) // self.tokens_per_batch
-            _, bounds = np.unique(batch_ids, return_index=True)
-            batches = [indices[bounds[i]:bounds[i + 1]] for i in range(len(bounds) - 1)]
-            if bounds[-1] < len(indices):
-                batches.append(indices[bounds[-1]:])
-
-        if self.shuffle:
-            rng.shuffle(batches)
-
-        assert n_sequences == sum([len(x) for x in batches])
-        assert self.lengths[indices].sum() == sum([self.lengths[x].sum() for x in batches])
-
-        return self.get_batches_iterator(batches=batches)
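
Note on the new `batch_sequences(batch)` signature above: it now receives a list of `(token_ids, length)` pairs, which is the shape `__getitem__` returns and the shape a `DataLoader` hands to a `collate_fn`. Whether it is wired up as a `collate_fn` is not visible in this hunk, so treat that as an assumption. The sketch below shows the padding step and the length-based attention mask (as computed in `prepare_batch_clm`) on plain Python lists. `pad_batch` and `attention_mask` are hypothetical names and the token ids are made up.

```python
def pad_batch(batch, pad_idx):
    """Pad every sequence in a list of (token_ids, length) pairs to the longest one."""
    token_ids = [list(t[0]) for t in batch]
    lengths = [t[1] for t in batch]
    assert len(token_ids) == len(lengths)
    max_len = max(lengths)
    padded = [t + [pad_idx] * (max_len - len(t)) for t in token_ids]
    assert all(len(t) == max_len for t in padded)
    return padded, lengths

def attention_mask(lengths, max_len):
    """True for real tokens and False for padding, one row per sequence."""
    return [[pos < n for pos in range(max_len)] for n in lengths]

# Toy batch (assumed ids): two sequences of length 4 and 3, padded with id 0.
batch = [([101, 7, 8, 102], 4), ([101, 9, 102], 3)]
padded, lengths = pad_batch(batch, pad_idx=0)
mask = attention_mask(lengths, max_len=len(padded[0]))
```

The second sequence gets one trailing pad id and a matching `False` in its mask row. In the diff, the pad id is `pad_token` for MLM and `unk_token` for CLM.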
--- a/examples/distillation/requirements.txt
+++ b/examples/distillation/requirements.txt
@@ -3,4 +3,4 @@ tensorboard>=1.14.0
 tensorboardX==1.8
 psutil==5.6.3
 scipy==1.3.1
-pytorch_transformers==1.2.0
+transformers==2.0.0
--- a/examples/distillation/run_squad_w_distillation.py
+++ b/examples/distillation/run_squad_w_distillation.py
@@ -0,0 +1,589 @@
+# coding=utf-8
+# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
+# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" This is the exact same script as `examples/run_squad.py` (as of October 4th, 2019) with an additional, optional distillation step."""
+
+from __future__ import absolute_import, division, print_function
+
+import argparse
+import logging
+import os
+import random
+import glob
+
+import numpy as np
+import torch
+from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
+                              TensorDataset)
+from torch.utils.data.distributed import DistributedSampler
+import torch.nn.functional as F
+import torch.nn as nn
+
+try:
+    from torch.utils.tensorboard import SummaryWriter
+except ImportError:
+    from tensorboardX import SummaryWriter
+
+from tqdm import tqdm, trange
+
+from transformers import (WEIGHTS_NAME, BertConfig,
+                                  BertForQuestionAnswering, BertTokenizer,
+                                  XLMConfig, XLMForQuestionAnswering,
+                                  XLMTokenizer, XLNetConfig,
+                                  XLNetForQuestionAnswering,
+                                  XLNetTokenizer,
+                                  DistilBertConfig, DistilBertForQuestionAnswering, DistilBertTokenizer)
+
+from transformers import AdamW, WarmupLinearSchedule
+
+from ..utils_squad import (read_squad_examples, convert_examples_to_features,
+                         RawResult, write_predictions,
+                         RawResultExtended, write_predictions_extended)
+
+# The following import is the official SQuAD evaluation script (2.0).
+# You can remove it from the dependencies if you are using this script outside of the library
+# We've added it here for automated tests (see examples/test_examples.py file)
+from ..utils_squad_evaluate import EVAL_OPTS, main as evaluate_on_squad
+
+logger = logging.getLogger(__name__)
+
+ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) \
+                  for conf in (BertConfig, XLNetConfig, XLMConfig)), ())
+
+MODEL_CLASSES = {
+    'bert': (BertConfig, BertForQuestionAnswering, BertTokenizer),
+    'xlnet': (XLNetConfig, XLNetForQuestionAnswering, XLNetTokenizer),
+    'xlm': (XLMConfig, XLMForQuestionAnswering, XLMTokenizer),
+    'distilbert': (DistilBertConfig, DistilBertForQuestionAnswering, DistilBertTokenizer)
+}
+
+def set_seed(args):
+    random.seed(args.seed)
+    np.random.seed(args.seed)
+    torch.manual_seed(args.seed)
+    if args.n_gpu > 0:
+        torch.cuda.manual_seed_all(args.seed)
+
+def to_list(tensor):
+    return tensor.detach().cpu().tolist()
+
+def train(args, train_dataset, model, tokenizer, teacher=None):
+    """ Train the model """
+    if args.local_rank in [-1, 0]:
+        tb_writer = SummaryWriter()
+
+    args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
+    train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
+    train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)
+
+    if args.max_steps > 0:
+        t_total = args.max_steps
+        args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
+    else:
+        t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
+
+    # Prepare optimizer and schedule (linear warmup and decay)
+    no_decay = ['bias', 'LayerNorm.weight']
+    optimizer_grouped_parameters = [
+        {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay},
+        {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
+        ]
+    optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
+    scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total)
+    if args.fp16:
+        try:
+            from apex import amp
+        except ImportError:
+            raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
+        model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
+
+    # multi-gpu training (should be after apex fp16 initialization)
+    if args.n_gpu > 1:
+        model = torch.nn.DataParallel(model)
+
+    # Distributed training (should be after apex fp16 initialization)
+    if args.local_rank != -1:
+        model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
+                                                          output_device=args.local_rank,
+                                                          find_unused_parameters=True)
+
+    # Train!
+    logger.info("***** Running training *****")
+    logger.info("  Num examples = %d", len(train_dataset))
+    logger.info("  Num Epochs = %d", args.num_train_epochs)
+    logger.info("  Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
+    logger.info("  Total train batch size (w. parallel, distributed & accumulation) = %d",
+                   args.train_batch_size * args.gradient_accumulation_steps * (torch.distributed.get_world_size() if args.local_rank != -1 else 1))
+    logger.info("  Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
+    logger.info("  Total optimization steps = %d", t_total)
+
+    global_step = 0
+    tr_loss, logging_loss = 0.0, 0.0
+    model.zero_grad()
+    train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0])
+    set_seed(args)  # Added here for reproducibility (even between python 2 and 3)
+    for _ in train_iterator:
+        epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
+        for step, batch in enumerate(epoch_iterator):
+            model.train()
+            if teacher is not None:
+                teacher.eval()
+            batch = tuple(t.to(args.device) for t in batch)
+            inputs = {'input_ids':       batch[0],
+                      'attention_mask':  batch[1],
+                      'start_positions': batch[3],
+                      'end_positions':   batch[4]}
+            if args.model_type != 'distilbert':
+                inputs['token_type_ids'] = None if args.model_type == 'xlm' else batch[2]
+            if args.model_type in ['xlnet', 'xlm']:
+                inputs.update({'cls_index': batch[5],
+                               'p_mask':       batch[6]})
+            outputs = model(**inputs)
+            loss, start_logits_stu, end_logits_stu = outputs
+
+            # Distillation loss
+            if teacher is not None:
+                if 'token_type_ids' not in inputs:
+                    inputs['token_type_ids'] = None if args.teacher_type == 'xlm' else batch[2]
+                with torch.no_grad():
+                    start_logits_tea, end_logits_tea = teacher(input_ids=inputs['input_ids'],
+                                                               token_type_ids=inputs['token_type_ids'],
+                                                               attention_mask=inputs['attention_mask'])
+                assert start_logits_tea.size() == start_logits_stu.size()
+                assert end_logits_tea.size() == end_logits_stu.size()
+                
+                loss_fct = nn.KLDivLoss(reduction='batchmean')
+                loss_start = loss_fct(F.log_softmax(start_logits_stu/args.temperature, dim=-1),
+                                      F.softmax(start_logits_tea/args.temperature, dim=-1)) * (args.temperature**2)
+                loss_end = loss_fct(F.log_softmax(end_logits_stu/args.temperature, dim=-1),
+                                    F.softmax(end_logits_tea/args.temperature, dim=-1)) * (args.temperature**2)
+                loss_ce = (loss_start + loss_end)/2.
+
+                loss = args.alpha_ce*loss_ce + args.alpha_squad*loss
+
+            if args.n_gpu > 1:
+                loss = loss.mean() # mean() to average on multi-gpu parallel (not distributed) training
+            if args.gradient_accumulation_steps > 1:
+                loss = loss / args.gradient_accumulation_steps
+
+            if args.fp16:
+                with amp.scale_loss(loss, optimizer) as scaled_loss:
+                    scaled_loss.backward()
+                torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
+            else:
+                loss.backward()
+                torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
+
+            tr_loss += loss.item()
+            if (step + 1) % args.gradient_accumulation_steps == 0:
+                optimizer.step()
+                scheduler.step()  # Update learning rate schedule
+                model.zero_grad()
+                global_step += 1
+
+                if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
+                    # Log metrics
+                    if args.local_rank == -1 and args.evaluate_during_training:  # Only evaluate when single GPU otherwise metrics may not average well
+                        results = evaluate(args, model, tokenizer)
+                        for key, value in results.items():
+                            tb_writer.add_scalar('eval_{}'.format(key), value, global_step)
+                    tb_writer.add_scalar('lr', scheduler.get_lr()[0], global_step)
+                    tb_writer.add_scalar('loss', (tr_loss - logging_loss)/args.logging_steps, global_step)
+                    logging_loss = tr_loss
+
+                if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
+                    # Save model checkpoint
+                    output_dir = os.path.join(args.output_dir, 'checkpoint-{}'.format(global_step))
+                    if not os.path.exists(output_dir):
+                        os.makedirs(output_dir)
+                    model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
+                    model_to_save.save_pretrained(output_dir)
+                    torch.save(args, os.path.join(output_dir, 'training_args.bin'))
+                    logger.info("Saving model checkpoint to %s", output_dir)
+
+            if args.max_steps > 0 and global_step > args.max_steps:
+                epoch_iterator.close()
+                break
+        if args.max_steps > 0 and global_step > args.max_steps:
+            train_iterator.close()
+            break
+
+    if args.local_rank in [-1, 0]:
+        tb_writer.close()
+
+    return global_step, tr_loss / global_step
+
+
+def evaluate(args, model, tokenizer, prefix=""):
+    dataset, examples, features = load_and_cache_examples(args, tokenizer, evaluate=True, output_examples=True)
+
+    if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]:
+        os.makedirs(args.output_dir)
+
+    args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
+    # Note that DistributedSampler samples randomly
+    eval_sampler = SequentialSampler(dataset) if args.local_rank == -1 else DistributedSampler(dataset)
+    eval_dataloader = DataLoader(dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)
+
+    # Eval!
+    logger.info("***** Running evaluation {} *****".format(prefix))
+    logger.info("  Num examples = %d", len(dataset))
+    logger.info("  Batch size = %d", args.eval_batch_size)
+    all_results = []
+    for batch in tqdm(eval_dataloader, desc="Evaluating"):
+        model.eval()
+        batch = tuple(t.to(args.device) for t in batch)
+        with torch.no_grad():
+            inputs = {'input_ids':      batch[0],
+                      'attention_mask': batch[1]
+                      }
+            if args.model_type != 'distilbert':
+                inputs['token_type_ids'] = None if args.model_type == 'xlm' else batch[2]  # XLM doesn't use segment_ids
+            example_indices = batch[3]
+            if args.model_type in ['xlnet', 'xlm']:
+                inputs.update({'cls_index': batch[4],
+                               'p_mask':    batch[5]})
+            outputs = model(**inputs)
+
+        for i, example_index in enumerate(example_indices):
+            eval_feature = features[example_index.item()]
+            unique_id = int(eval_feature.unique_id)
+            if args.model_type in ['xlnet', 'xlm']:
+                # XLNet uses a more complex post-processing procedure
+                result = RawResultExtended(unique_id            = unique_id,
+                                           start_top_log_probs  = to_list(outputs[0][i]),
+                                           start_top_index      = to_list(outputs[1][i]),
+                                           end_top_log_probs    = to_list(outputs[2][i]),
+                                           end_top_index        = to_list(outputs[3][i]),
+                                           cls_logits           = to_list(outputs[4][i]))
+            else:
+                result = RawResult(unique_id    = unique_id,
+                                   start_logits = to_list(outputs[0][i]),
+                                   end_logits   = to_list(outputs[1][i]))
+            all_results.append(result)
+
+    # Compute predictions
+    output_prediction_file = os.path.join(args.output_dir, "predictions_{}.json".format(prefix))
+    output_nbest_file = os.path.join(args.output_dir, "nbest_predictions_{}.json".format(prefix))
+    if args.version_2_with_negative:
+        output_null_log_odds_file = os.path.join(args.output_dir, "null_odds_{}.json".format(prefix))
+    else:
+        output_null_log_odds_file = None
+
+    if args.model_type in ['xlnet', 'xlm']:
+        # XLNet uses a more complex post-processing procedure
+        write_predictions_extended(examples, features, all_results, args.n_best_size,
+                        args.max_answer_length, output_prediction_file,
+                        output_nbest_file, output_null_log_odds_file, args.predict_file,
+                        model.config.start_n_top, model.config.end_n_top,
+                        args.version_2_with_negative, tokenizer, args.verbose_logging)
+    else:
+        write_predictions(examples, features, all_results, args.n_best_size,
+                        args.max_answer_length, args.do_lower_case, output_prediction_file,
+                        output_nbest_file, output_null_log_odds_file, args.verbose_logging,
+                        args.version_2_with_negative, args.null_score_diff_threshold)
+
+    # Evaluate with the official SQuAD script
+    evaluate_options = EVAL_OPTS(data_file=args.predict_file,
+                                 pred_file=output_prediction_file,
+                                 na_prob_file=output_null_log_odds_file)
+    results = evaluate_on_squad(evaluate_options)
+    return results
+
+
+def load_and_cache_examples(args, tokenizer, evaluate=False, output_examples=False):
+    if args.local_rank not in [-1, 0] and not evaluate:
+        torch.distributed.barrier()  # Make sure only the first process in distributed training processes the dataset, and the others will use the cache
+
+    # Load data features from cache or dataset file
+    input_file = args.predict_file if evaluate else args.train_file
+    cached_features_file = os.path.join(os.path.dirname(input_file), 'cached_{}_{}_{}'.format(
+        'dev' if evaluate else 'train',
+        list(filter(None, args.model_name_or_path.split('/'))).pop(),
+        str(args.max_seq_length)))
+    if os.path.exists(cached_features_file) and not args.overwrite_cache and not output_examples:
+        logger.info("Loading features from cached file %s", cached_features_file)
+        features = torch.load(cached_features_file)
+    else:
+        logger.info("Creating features from dataset file at %s", input_file)
+        examples = read_squad_examples(input_file=input_file,
+                                                is_training=not evaluate,
+                                                version_2_with_negative=args.version_2_with_negative)
+        features = convert_examples_to_features(examples=examples,
+                                                tokenizer=tokenizer,
+                                                max_seq_length=args.max_seq_length,
+                                                doc_stride=args.doc_stride,
+                                                max_query_length=args.max_query_length,
+                                                is_training=not evaluate)
+        if args.local_rank in [-1, 0]:
+            logger.info("Saving features into cached file %s", cached_features_file)
+            torch.save(features, cached_features_file)
+
+    if args.local_rank == 0 and not evaluate:
+        torch.distributed.barrier()  # Make sure only the first process in distributed training processes the dataset, and the others will use the cache
+
+    # Convert to Tensors and build dataset
+    all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
+    all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
+    all_segment_ids = torch.tensor([f.segment_ids for f in features], dtype=torch.long)
+    all_cls_index = torch.tensor([f.cls_index for f in features], dtype=torch.long)
+    all_p_mask = torch.tensor([f.p_mask for f in features], dtype=torch.float)
+    if evaluate:
+        all_example_index = torch.arange(all_input_ids.size(0), dtype=torch.long)
+        dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids,
+                                all_example_index, all_cls_index, all_p_mask)
+    else:
+        all_start_positions = torch.tensor([f.start_position for f in features], dtype=torch.long)
+        all_end_positions = torch.tensor([f.end_position for f in features], dtype=torch.long)
+        dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids,
+                                all_start_positions, all_end_positions,
+                                all_cls_index, all_p_mask)
+
+    if output_examples:
+        return dataset, examples, features
+    return dataset
+
+
+def main():
+    parser = argparse.ArgumentParser()
+
+    ## Required parameters
+    parser.add_argument("--train_file", default=None, type=str, required=True,
+                        help="SQuAD json for training. E.g., train-v1.1.json")
+    parser.add_argument("--predict_file", default=None, type=str, required=True,
+                        help="SQuAD json for predictions. E.g., dev-v1.1.json or test-v1.1.json")
+    parser.add_argument("--model_type", default=None, type=str, required=True,
+                        help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()))
+    parser.add_argument("--model_name_or_path", default=None, type=str, required=True,
+                        help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS))
+    parser.add_argument("--output_dir", default=None, type=str, required=True,
+                        help="The output directory where the model checkpoints and predictions will be written.")
+
+    # Distillation parameters (optional)
+    parser.add_argument('--teacher_type', default=None, type=str,
+                        help="Teacher type. Teacher tokenizer and student (model) tokenizer must output the same tokenization. Only for distillation.")
+    parser.add_argument('--teacher_name_or_path', default=None, type=str,
+                        help="Path to the already SQuAD fine-tuned teacher model. Only for distillation.")
+    parser.add_argument('--alpha_ce', default=0.5, type=float,
+                        help="Distillation loss linear weight. Only for distillation.")
+    parser.add_argument('--alpha_squad', default=0.5, type=float,
+                        help="True SQuAD loss linear weight. Only for distillation.")
+    parser.add_argument('--temperature', default=2.0, type=float,
+                        help="Distillation temperature. Only for distillation.")
+
+    ## Other parameters
+    parser.add_argument("--config_name", default="", type=str,
+                        help="Pretrained config name or path if not the same as model_name")
+    parser.add_argument("--tokenizer_name", default="", type=str,
+                        help="Pretrained tokenizer name or path if not the same as model_name")
+    parser.add_argument("--cache_dir", default="", type=str,
+                        help="Where do you want to store the pre-trained models downloaded from s3")
+
+    parser.add_argument('--version_2_with_negative', action='store_true',
+                        help='If true, the SQuAD examples contain some that do not have an answer.')
+    parser.add_argument('--null_score_diff_threshold', type=float, default=0.0,
+                        help="If null_score - best_non_null is greater than the threshold predict null.")
+
+    parser.add_argument("--max_seq_length", default=384, type=int,
+                        help="The maximum total input sequence length after WordPiece tokenization. Sequences "
+                             "longer than this will be truncated, and sequences shorter than this will be padded.")
+    parser.add_argument("--doc_stride", default=128, type=int,
+                        help="When splitting up a long document into chunks, how much stride to take between chunks.")
+    parser.add_argument("--max_query_length", default=64, type=int,
+                        help="The maximum number of tokens for the question. Questions longer than this will "
+                             "be truncated to this length.")
+    parser.add_argument("--do_train", action='store_true',
+                        help="Whether to run training.")
+    parser.add_argument("--do_eval", action='store_true',
+                        help="Whether to run eval on the dev set.")
+    parser.add_argument("--evaluate_during_training", action='store_true',
+                        help="Run evaluation during training at each logging step.")
+    parser.add_argument("--do_lower_case", action='store_true',
+                        help="Set this flag if you are using an uncased model.")
+
+    parser.add_argument("--per_gpu_train_batch_size", default=8, type=int,
+                        help="Batch size per GPU/CPU for training.")
+    parser.add_argument("--per_gpu_eval_batch_size", default=8, type=int,
+                        help="Batch size per GPU/CPU for evaluation.")
+    parser.add_argument("--learning_rate", default=5e-5, type=float,
+                        help="The initial learning rate for Adam.")
+    parser.add_argument('--gradient_accumulation_steps', type=int, default=1,
+                        help="Number of update steps to accumulate before performing a backward/update pass.")
+    parser.add_argument("--weight_decay", default=0.0, type=float,
+                        help="Weight decay if we apply some.")
+    parser.add_argument("--adam_epsilon", default=1e-8, type=float,
+                        help="Epsilon for Adam optimizer.")
+    parser.add_argument("--max_grad_norm", default=1.0, type=float,
+                        help="Max gradient norm.")
+    parser.add_argument("--num_train_epochs", default=3.0, type=float,
+                        help="Total number of training epochs to perform.")
+    parser.add_argument("--max_steps", default=-1, type=int,
+                        help="If > 0: set total number of training steps to perform. Override num_train_epochs.")
+    parser.add_argument("--warmup_steps", default=0, type=int,
+                        help="Linear warmup over warmup_steps.")
+    parser.add_argument("--n_best_size", default=20, type=int,
+                        help="The total number of n-best predictions to generate in the nbest_predictions.json output file.")
+    parser.add_argument("--max_answer_length", default=30, type=int,
+                        help="The maximum length of an answer that can be generated. This is needed because the start "
+                             "and end predictions are not conditioned on one another.")
+    parser.add_argument("--verbose_logging", action='store_true',
+                        help="If true, all of the warnings related to data processing will be printed. "
+                             "A number of warnings are expected for a normal SQuAD evaluation.")
+
+    parser.add_argument('--logging_steps', type=int, default=50,
+                        help="Log every X update steps.")
+    parser.add_argument('--save_steps', type=int, default=50,
+                        help="Save checkpoint every X update steps.")
+    parser.add_argument("--eval_all_checkpoints", action='store_true',
+                        help="Evaluate all checkpoints starting with the same prefix as model_name and ending with the step number")
+    parser.add_argument("--no_cuda", action='store_true',
+                        help="Do not use CUDA even when it is available")
+    parser.add_argument('--overwrite_output_dir', action='store_true',
+                        help="Overwrite the content of the output directory")
+    parser.add_argument('--overwrite_cache', action='store_true',
+                        help="Overwrite the cached training and evaluation sets")
+    parser.add_argument('--seed', type=int, default=42,
+                        help="random seed for initialization")
+
+    parser.add_argument("--local_rank", type=int, default=-1,
+                        help="local_rank for distributed training on gpus")
+    parser.add_argument('--fp16', action='store_true',
+                        help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit")
+    parser.add_argument('--fp16_opt_level', type=str, default='O1',
+                        help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']. "
+                             "See details at https://nvidia.github.io/apex/amp.html")
+    parser.add_argument('--server_ip', type=str, default='', help="Can be used for distant debugging.")
+    parser.add_argument('--server_port', type=str, default='', help="Can be used for distant debugging.")
+    args = parser.parse_args()
+
+    if os.path.exists(args.output_dir) and os.listdir(args.output_dir) and args.do_train and not args.overwrite_output_dir:
+        raise ValueError("Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(args.output_dir))
+
+    # Setup distant debugging if needed
+    if args.server_ip and args.server_port:
+        # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
+        import ptvsd
+        print("Waiting for debugger attach")
+        ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
+        ptvsd.wait_for_attach()
+
+    # Setup CUDA, GPU & distributed training
+    if args.local_rank == -1 or args.no_cuda:
+        device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
+        args.n_gpu = torch.cuda.device_count()
+    else:  # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
+        torch.cuda.set_device(args.local_rank)
+        device = torch.device("cuda", args.local_rank)
+        torch.distributed.init_process_group(backend='nccl')
+        args.n_gpu = 1
+    args.device = device
+
+    # Setup logging
+    logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s -   %(message)s',
+                        datefmt = '%m/%d/%Y %H:%M:%S',
+                        level = logging.INFO if args.local_rank in [-1, 0] else logging.WARN)
+    logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
+                    args.local_rank, device, args.n_gpu, bool(args.local_rank != -1), args.fp16)
+
+    # Set seed
+    set_seed(args)
+
+    # Load pretrained model and tokenizer
+    if args.local_rank not in [-1, 0]:
+        torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab
+
+    args.model_type = args.model_type.lower()
+    config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
+    config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path)
+    tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path, do_lower_case=args.do_lower_case)
+    model = model_class.from_pretrained(args.model_name_or_path, from_tf=bool('.ckpt' in args.model_name_or_path), config=config)
+
+    if args.teacher_type is not None:
+        assert args.teacher_name_or_path is not None
+        assert args.alpha_ce > 0.
+        assert args.alpha_ce + args.alpha_squad > 0.
+        assert args.teacher_type != 'distilbert', "We constrain teachers not to be of type DistilBERT."
+        teacher_config_class, teacher_model_class, _ = MODEL_CLASSES[args.teacher_type]
+        teacher_config = teacher_config_class.from_pretrained(args.teacher_name_or_path)
+        teacher = teacher_model_class.from_pretrained(args.teacher_name_or_path, config=teacher_config)
+        teacher.to(args.device)
+    else:
+        teacher = None
+
+    if args.local_rank == 0:
+        torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab
+
+    model.to(args.device)
+
+    logger.info("Training/evaluation parameters %s", args)
+
+    # Training
+    if args.do_train:
+        train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False, output_examples=False)
+        global_step, tr_loss = train(args, train_dataset, model, tokenizer, teacher=teacher)
+        logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
+
+
+    # Save the trained model and the tokenizer
+    if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
+        # Create output directory if needed
+        if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]:
+            os.makedirs(args.output_dir)
+
+        logger.info("Saving model checkpoint to %s", args.output_dir)
+        # Save a trained model, configuration and tokenizer using `save_pretrained()`.
+        # They can then be reloaded using `from_pretrained()`
+        model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
+        model_to_save.save_pretrained(args.output_dir)
+        tokenizer.save_pretrained(args.output_dir)
+
+        # Good practice: save your training arguments together with the trained model
+        torch.save(args, os.path.join(args.output_dir, 'training_args.bin'))
+
+        # Load a trained model and vocabulary that you have fine-tuned
+        model = model_class.from_pretrained(args.output_dir)
+        tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
+        model.to(args.device)
+
+
+    # Evaluation - we can ask to evaluate all the checkpoints (sub-directories) in a directory
+    results = {}
+    if args.do_eval and args.local_rank in [-1, 0]:
+        checkpoints = [args.output_dir]
+        if args.eval_all_checkpoints:
+            checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + '/**/' + WEIGHTS_NAME, recursive=True)))
+            logging.getLogger("transformers.modeling_utils").setLevel(logging.WARN)  # Reduce model loading logs
+
+        logger.info("Evaluate the following checkpoints: %s", checkpoints)
+
+        for checkpoint in checkpoints:
+            # Reload the model
+            global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
+            model = model_class.from_pretrained(checkpoint)
+            model.to(args.device)
+
+            # Evaluate
+            result = evaluate(args, model, tokenizer, prefix=global_step)
+
+            result = dict((k + ('_{}'.format(global_step) if global_step else ''), v) for k, v in result.items())
+            results.update(result)
+
+    logger.info("Results: {}".format(results))
+
+    return results
+
+
+if __name__ == "__main__":
+    main()
--- a/examples/distillation/scripts/binarized_data.py
+++ b/examples/distillation/scripts/binarized_data.py
@@ -13,14 +13,14 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """
-Preprocessing script before training DistilBERT.
+Preprocessing script before distillation.
 """
 import argparse
 import pickle
 import random
 import time
 import numpy as np
-from transformers import BertTokenizer, RobertaTokenizer
+from transformers import BertTokenizer, RobertaTokenizer, GPT2Tokenizer
 import logging

 logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s -   %(message)s',
@@ -32,7 +32,7 @@ def main():
    parser = argparse.ArgumentParser(description="Preprocess the data to avoid re-doing it several times by (tokenization + token_to_ids).")
    parser.add_argument('--file_path', type=str, default='data/dump.txt',
                        help='The path to the data.')
-    parser.add_argument('--tokenizer_type', type=str, default='bert', choices=['bert', 'roberta'])
+    parser.add_argument('--tokenizer_type', type=str, default='bert', choices=['bert', 'roberta', 'gpt2'])
    parser.add_argument('--tokenizer_name', type=str, default='bert-base-uncased',
                        help="The tokenizer to use.")
    parser.add_argument('--dump_file', type=str, default='data/dump',
@@ -43,10 +43,16 @@ def main():
    logger.info(f'Loading Tokenizer ({args.tokenizer_name})')
    if args.tokenizer_type == 'bert':
        tokenizer = BertTokenizer.from_pretrained(args.tokenizer_name)
+        bos = tokenizer.special_tokens_map['cls_token'] # `[CLS]`
+        sep = tokenizer.special_tokens_map['sep_token'] # `[SEP]`
    elif args.tokenizer_type == 'roberta':
        tokenizer = RobertaTokenizer.from_pretrained(args.tokenizer_name)
-    bos = tokenizer.special_tokens_map['bos_token'] # `[CLS]` for bert, `<s>` for roberta
-    sep = tokenizer.special_tokens_map['sep_token'] # `[SEP]` for bert, `</s>` for roberta
+        bos = tokenizer.special_tokens_map['cls_token'] # `<s>`
+        sep = tokenizer.special_tokens_map['sep_token'] # `</s>`
+    elif args.tokenizer_type == 'gpt2':
+        tokenizer = GPT2Tokenizer.from_pretrained(args.tokenizer_name)
+        bos = tokenizer.special_tokens_map['bos_token'] # `<|endoftext|>`
+        sep = tokenizer.special_tokens_map['eos_token'] # `<|endoftext|>`

    logger.info(f'Loading text from {args.file_path}')
    with open(args.file_path, 'r', encoding='utf8') as fp:
--- a/examples/distillation/scripts/extract.py
+++ b/examples/distillation/scripts/extract.py
@@ -0,0 +1,89 @@
+# coding=utf-8
+# Copyright 2019-present, the HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Preprocessing script before training the distilled model.
+Specific to RoBERTa -> DistilRoBERTa and GPT2 -> DistilGPT2.
+"""
+from transformers import BertForMaskedLM, RobertaForMaskedLM, GPT2LMHeadModel
+import torch
+import argparse
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser(description="Extract some layers of the full RobertaForMaskedLM or GPT2LMHeadModel for Transfer Learned Distillation")
+    parser.add_argument("--model_type", default="roberta", choices=["roberta", "gpt2"])
+    parser.add_argument("--model_name", default='roberta-large', type=str)
+    parser.add_argument("--dump_checkpoint", default='serialization_dir/tf_roberta_048131723.pth', type=str)
+    parser.add_argument("--vocab_transform", action='store_true')
+    args = parser.parse_args()
+
+
+    if args.model_type == 'roberta':
+        model = RobertaForMaskedLM.from_pretrained(args.model_name)
+        prefix = 'roberta'
+    elif args.model_type == 'gpt2':
+        model = GPT2LMHeadModel.from_pretrained(args.model_name)
+        prefix = 'transformer'
+
+    state_dict = model.state_dict()
+    compressed_sd = {}
+
+    ### Embeddings ###
+    if args.model_type == 'gpt2':
+        for param_name in ['wte.weight', 'wpe.weight']:
+            compressed_sd[f'{prefix}.{param_name}'] = state_dict[f'{prefix}.{param_name}']
+    else:
+        for w in ['word_embeddings', 'position_embeddings', 'token_type_embeddings']:
+            param_name = f'{prefix}.embeddings.{w}.weight'
+            compressed_sd[param_name] = state_dict[param_name]
+        for w in ['weight', 'bias']:
+            param_name = f'{prefix}.embeddings.LayerNorm.{w}'
+            compressed_sd[param_name] = state_dict[param_name]
+
+    ### Transformer Blocks ###
+    std_idx = 0
+    for teacher_idx in [0, 2, 4, 7, 9, 11]:
+        if args.model_type == 'gpt2':
+            for layer in ['ln_1', 'attn.c_attn', 'attn.c_proj', 'ln_2', 'mlp.c_fc', 'mlp.c_proj']:
+                for w in ['weight', 'bias']:
+                    compressed_sd[f'{prefix}.h.{std_idx}.{layer}.{w}'] = \
+                        state_dict[f'{prefix}.h.{teacher_idx}.{layer}.{w}']
+            compressed_sd[f'{prefix}.h.{std_idx}.attn.bias'] = state_dict[f'{prefix}.h.{teacher_idx}.attn.bias']
+        else:
+            for layer in ['attention.self.query', 'attention.self.key', 'attention.self.value',
+                        'attention.output.dense', 'attention.output.LayerNorm',
+                        'intermediate.dense', 'output.dense', 'output.LayerNorm']:
+                for w in ['weight', 'bias']:
+                    compressed_sd[f'{prefix}.encoder.layer.{std_idx}.{layer}.{w}'] = \
+                        state_dict[f'{prefix}.encoder.layer.{teacher_idx}.{layer}.{w}']
+        std_idx += 1
+
+    ### Language Modeling Head ###
+    if args.model_type == 'roberta':
+        for layer in ['lm_head.decoder.weight', 'lm_head.bias']:
+            compressed_sd[f'{layer}'] = state_dict[f'{layer}']
+        if args.vocab_transform:
+            for w in ['weight', 'bias']:
+                compressed_sd[f'lm_head.dense.{w}'] = state_dict[f'lm_head.dense.{w}']
+                compressed_sd[f'lm_head.layer_norm.{w}'] = state_dict[f'lm_head.layer_norm.{w}']
+    elif args.model_type == 'gpt2':
+        for w in ['weight', 'bias']:
+            compressed_sd[f'{prefix}.ln_f.{w}'] = state_dict[f'{prefix}.ln_f.{w}']
+        compressed_sd[f'lm_head.weight'] = state_dict[f'lm_head.weight']
+
+    print(f'N layers selected for distillation: {std_idx}')
+    print(f'Number of params transferred for distillation: {len(compressed_sd.keys())}')
+
+    print(f'Save transferred checkpoint to {args.dump_checkpoint}.')
+    torch.save(compressed_sd, args.dump_checkpoint)
--- a/examples/distillation/scripts/extract_distilbert.py
+++ b/examples/distillation/scripts/extract_distilbert.py
@@ -14,6 +14,7 @@
 # limitations under the License.
 """
 Preprocessing script before training DistilBERT.
+Specific to BERT -> DistilBERT.
 """
 from transformers import BertForMaskedLM, RobertaForMaskedLM
 import torch
@@ -21,7 +22,7 @@ import argparse

 if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="Extraction some layers of the full BertForMaskedLM or RObertaForMaskedLM for Transfer Learned Distillation")
-    parser.add_argument("--model_type", default="bert", choices=["bert", "roberta"])
+    parser.add_argument("--model_type", default="bert", choices=["bert"])
    parser.add_argument("--model_name", default='bert-base-uncased', type=str)
    parser.add_argument("--dump_checkpoint", default='serialization_dir/tf_bert-base-uncased_0247911.pth', type=str)
    parser.add_argument("--vocab_transform", action='store_true')
@@ -31,9 +32,8 @@ if __name__ == '__main__':
    if args.model_type == 'bert':
        model = BertForMaskedLM.from_pretrained(args.model_name)
        prefix = 'bert'
-    elif args.model_type == 'roberta':
-        model = RobertaForMaskedLM.from_pretrained(args.model_name)
-        prefix = 'roberta'
+    else:
+        raise ValueError(f'args.model_type should be "bert".')

    state_dict = model.state_dict()
    compressed_sd = {}
@@ -68,20 +68,12 @@ if __name__ == '__main__':
                state_dict[f'{prefix}.encoder.layer.{teacher_idx}.output.LayerNorm.{w}']
        std_idx += 1

-    if args.model_type == 'bert':
-        compressed_sd[f'vocab_projector.weight'] = state_dict[f'cls.predictions.decoder.weight']
-        compressed_sd[f'vocab_projector.bias'] = state_dict[f'cls.predictions.bias']
-        if args.vocab_transform:
-            for w in ['weight', 'bias']:
-                compressed_sd[f'vocab_transform.{w}'] = state_dict[f'cls.predictions.transform.dense.{w}']
-                compressed_sd[f'vocab_layer_norm.{w}'] = state_dict[f'cls.predictions.transform.LayerNorm.{w}']
-    elif args.model_type == 'roberta':
-        compressed_sd[f'vocab_projector.weight'] = state_dict[f'lm_head.decoder.weight']
-        compressed_sd[f'vocab_projector.bias'] = state_dict[f'lm_head.bias']
-        if args.vocab_transform:
-            for w in ['weight', 'bias']:
-                compressed_sd[f'vocab_transform.{w}'] = state_dict[f'lm_head.dense.{w}']
-                compressed_sd[f'vocab_layer_norm.{w}'] = state_dict[f'lm_head.layer_norm.{w}']
+    compressed_sd[f'vocab_projector.weight'] = state_dict[f'cls.predictions.decoder.weight']
+    compressed_sd[f'vocab_projector.bias'] = state_dict[f'cls.predictions.bias']
+    if args.vocab_transform:
+        for w in ['weight', 'bias']:
+            compressed_sd[f'vocab_transform.{w}'] = state_dict[f'cls.predictions.transform.dense.{w}']
+            compressed_sd[f'vocab_layer_norm.{w}'] = state_dict[f'cls.predictions.transform.LayerNorm.{w}']

    print(f'N layers selected for distillation: {std_idx}')
    print(f'Number of params transfered for distillation: {len(compressed_sd.keys())}')
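Note on the hunk above: with the RoBERTa branch removed, the script copies the LM head unconditionally from the BERT-style `cls.predictions.*` keys. The sketch below shows that key remapping in isolation. It is a minimal illustration, not the script itself: `copy_lm_head` is a hypothetical helper name, and the dictionary values are placeholder strings standing in for real tensors.

```python
# Hypothetical stand-in for a teacher state dict. The key names mirror the
# BERT keys used in the diff; the values are placeholders, not tensors.
teacher_sd = {
    'cls.predictions.decoder.weight': 'W_dec',
    'cls.predictions.bias': 'b_dec',
    'cls.predictions.transform.dense.weight': 'W_t',
    'cls.predictions.transform.dense.bias': 'b_t',
    'cls.predictions.transform.LayerNorm.weight': 'ln_w',
    'cls.predictions.transform.LayerNorm.bias': 'ln_b',
}


def copy_lm_head(state_dict, vocab_transform=False):
    """Map the teacher's LM-head keys onto the student's key names."""
    compressed = {
        'vocab_projector.weight': state_dict['cls.predictions.decoder.weight'],
        'vocab_projector.bias': state_dict['cls.predictions.bias'],
    }
    if vocab_transform:
        # Optionally carry over the dense transform and its LayerNorm too.
        for w in ('weight', 'bias'):
            compressed[f'vocab_transform.{w}'] = state_dict[f'cls.predictions.transform.dense.{w}']
            compressed[f'vocab_layer_norm.{w}'] = state_dict[f'cls.predictions.transform.LayerNorm.{w}']
    return compressed


head_only = copy_lm_head(teacher_sd)
with_transform = copy_lm_head(teacher_sd, vocab_transform=True)
```

Without `vocab_transform`, only the two projector entries are copied; with it, the transform and LayerNorm entries are added as well.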
--- a/examples/distillation/scripts/token_counts.py
+++ b/examples/distillation/scripts/token_counts.py
@@ -13,7 +13,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """
-Preprocessing script before training DistilBERT.
+Preprocessing script before training the distilled model.
 """
 from collections import Counter
 import argparse
--- a/examples/distillation/train.py
+++ b/examples/distillation/train.py
@@ -13,7 +13,8 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """
-Training DistilBERT.
+Training the distilled model.
+Supported architectures include: BERT -> DistilBERT, RoBERTa -> DistilRoBERTa, GPT2 -> DistilGPT2.
 """
 import os
 import argparse
@@ -23,68 +24,96 @@ import shutil
 import numpy as np
 import torch

-from transformers import BertTokenizer, BertForMaskedLM, RobertaTokenizer, RobertaForMaskedLM
-from transformers import DistilBertForMaskedLM, DistilBertConfig
+from transformers import BertConfig, BertForMaskedLM, BertTokenizer
+from transformers import RobertaConfig, RobertaForMaskedLM, RobertaTokenizer
+from transformers import DistilBertConfig, DistilBertForMaskedLM, DistilBertTokenizer
+from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer

 from distiller import Distiller
 from utils import git_log, logger, init_gpu_params, set_seed
-from dataset import Dataset
+from lm_seqs_dataset import LmSeqsDataset


+MODEL_CLASSES = {
+    'distilbert': (DistilBertConfig, DistilBertForMaskedLM, DistilBertTokenizer),
+    'roberta': (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer),
+    'bert': (BertConfig, BertForMaskedLM, BertTokenizer),
+    'gpt2': (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer)
+}
+
+def sanity_checks(args):
+    """
+    A bunch of args sanity checks to perform before even starting...
+    """
+    assert (args.mlm and args.alpha_mlm > 0.) or (not args.mlm and args.alpha_mlm == 0.)
+    assert (args.alpha_mlm > 0. and args.alpha_clm == 0.) or (args.alpha_mlm == 0. and args.alpha_clm > 0.)
+    if args.mlm:
+        assert os.path.isfile(args.token_counts)
+        assert (args.student_type in ['roberta', 'distilbert']) and (args.teacher_type in ['roberta', 'bert'])
+    else:
+        assert (args.student_type in ['gpt2']) and (args.teacher_type in ['gpt2'])
+
+    assert args.teacher_type == args.student_type or (args.student_type=='distilbert' and args.teacher_type=='bert')
+    assert os.path.isfile(args.student_config)
+    if args.student_pretrained_weights is not None:
+        assert os.path.isfile(args.student_pretrained_weights)
+
+    if args.freeze_token_type_embds: assert args.student_type in ['roberta']
+
+    assert args.alpha_ce >= 0.
+    assert args.alpha_mlm >= 0.
+    assert args.alpha_clm >= 0.
+    assert args.alpha_mse >= 0.
+    assert args.alpha_cos >= 0.
+    assert args.alpha_ce + args.alpha_mlm + args.alpha_clm + args.alpha_mse + args.alpha_cos > 0.
+
+def freeze_pos_embeddings(student, args):
+    if args.student_type == 'roberta':
+        student.roberta.embeddings.position_embeddings.weight.requires_grad = False
+    elif args.student_type == 'gpt2':
+        student.transformer.wpe.weight.requires_grad = False
+
+def freeze_token_type_embeddings(student, args):
+    if args.student_type == 'roberta':
+        student.roberta.embeddings.token_type_embeddings.weight.requires_grad = False
+
 def main():
    parser = argparse.ArgumentParser(description="Training")
+    parser.add_argument("--force", action='store_true',
+                        help="Overwrite dump_path if it already exists.")

    parser.add_argument("--dump_path", type=str, required=True,
                        help="The output directory (log, checkpoints, parameters, etc.)")
    parser.add_argument("--data_file", type=str, required=True,
                        help="The binarized file (tokenized + tokens_to_ids) and grouped by sequence.")
-    parser.add_argument("--token_counts", type=str, required=True,
-                        help="The token counts in the data_file for MLM.")
-    parser.add_argument("--force", action='store_true',
-                        help="Overwrite dump_path if it already exists.")

-    parser.add_argument("--vocab_size", default=30522, type=int,
-                        help="The vocabulary size.")
-    parser.add_argument("--max_position_embeddings", default=512, type=int,
-                        help="Maximum sequence length we can model (including [CLS] and [SEP]).")
-    parser.add_argument("--sinusoidal_pos_embds", action='store_false',
-                        help="If true, the position embeddings are simply fixed with sinusoidal embeddings.")
-    parser.add_argument("--n_layers", default=6, type=int,
-                        help="Number of Transformer blocks.")
-    parser.add_argument("--n_heads", default=12, type=int,
-                        help="Number of heads in the self-attention module.")
-    parser.add_argument("--dim", default=768, type=int,
-                        help="Dimension through the network. Must be divisible by n_heads")
-    parser.add_argument("--hidden_dim", default=3072, type=int,
-                        help="Intermediate dimension in the FFN.")
-    parser.add_argument("--dropout", default=0.1, type=float,
-                        help="Dropout.")
-    parser.add_argument("--attention_dropout", default=0.1, type=float,
-                        help="Dropout in self-attention.")
-    parser.add_argument("--activation", default='gelu', type=str,
-                        help="Activation to use in self-attention")
-    parser.add_argument("--tie_weights_", action='store_false',
-                        help="If true, we tie the embeddings matrix with the projection over the vocabulary matrix. Default is true.")
-
-    parser.add_argument("--from_pretrained_weights", default=None, type=str,
+    parser.add_argument("--student_type", type=str, choices=["distilbert", "roberta", "gpt2"], required=True,
+                        help="The student type (DistilBERT, RoBERTa, GPT2).")
+    parser.add_argument("--student_config", type=str, required=True,
+                        help="Path to the student configuration.")
+    parser.add_argument("--student_pretrained_weights", default=None, type=str,
                        help="Load student initialization checkpoint.")
-    parser.add_argument("--from_pretrained_config", default=None, type=str,
-                        help="Load student initialization architecture config.")
-    parser.add_argument("--teacher_type", default="bert", choices=["bert", "roberta"],
+
+    parser.add_argument("--teacher_type", choices=["bert", "roberta", "gpt2"], required=True,
                        help="Teacher type (BERT, RoBERTa).")
-    parser.add_argument("--teacher_name", default="bert-base-uncased", type=str,
+    parser.add_argument("--teacher_name", type=str, required=True,
                        help="The teacher model.")

    parser.add_argument("--temperature", default=2., type=float,
                        help="Temperature for the softmax temperature.")
    parser.add_argument("--alpha_ce", default=0.5, type=float,
                        help="Linear weight for the distillation loss. Must be >=0.")
-    parser.add_argument("--alpha_mlm", default=0.5, type=float,
-                        help="Linear weight for the MLM loss. Must be >=0.")
+    parser.add_argument("--alpha_mlm", default=0.0, type=float,
+                        help="Linear weight for the MLM loss. Must be >=0. Should be used in conjunction with `mlm` flag.")
+    parser.add_argument("--alpha_clm", default=0.5, type=float,
+                        help="Linear weight for the CLM loss. Must be >=0.")
    parser.add_argument("--alpha_mse", default=0.0, type=float,
                        help="Linear weight of the MSE loss. Must be >=0.")
    parser.add_argument("--alpha_cos", default=0.0, type=float,
                        help="Linear weight of the cosine embedding loss. Must be >=0.")
+
+    parser.add_argument("--mlm", action="store_true",
+                        help="The LM step: MLM or CLM. If `mlm` is True, the MLM is used over CLM.")
    parser.add_argument("--mlm_mask_prop", default=0.15, type=float,
                        help="Proportion of tokens for which we need to make a prediction.")
    parser.add_argument("--word_mask", default=0.8, type=float,
@@ -95,17 +124,20 @@ def main():
                        help="Proportion of tokens to randomly replace.")
    parser.add_argument("--mlm_smoothing", default=0.7, type=float,
                        help="Smoothing parameter to emphasize more rare tokens (see XLM, similar to word2vec).")
+    parser.add_argument("--token_counts", type=str,
+                        help="The token counts in the data_file for MLM.")
+
    parser.add_argument("--restrict_ce_to_mask", action='store_true',
                        help="If true, compute the distilation loss only the [MLM] prediction distribution.")
+    parser.add_argument("--freeze_pos_embs", action="store_true",
+                        help="Freeze positional embeddings during distillation. For student_type in ['roberta', 'gpt2'] only.")
+    parser.add_argument("--freeze_token_type_embds", action="store_true",
+                        help="Freeze token type embeddings during distillation if existent. For student_type in ['roberta'] only.")

    parser.add_argument("--n_epoch", type=int, default=3,
                        help="Number of pass on the whole dataset.")
    parser.add_argument("--batch_size", type=int, default=5,
                        help="Batch size (for each process).")
-    parser.add_argument("--tokens_per_batch", type=int, default=-1,
-                        help="If specified, modify the batches so that they have approximately this number of tokens.")
-    parser.add_argument("--shuffle", action='store_false',
-                        help="If true, shuffle the sequence order. Default is true.")
    parser.add_argument("--group_by_size", action='store_false',
                        help="If true, group sequences that have similar length into the same batch. Default is true.")

@@ -141,6 +173,7 @@ def main():
    parser.add_argument("--checkpoint_interval", type=int, default=4000,
                        help="Checkpoint interval.")
    args = parser.parse_args()
+    sanity_checks(args)


    ## ARGS ##
@@ -164,21 +197,19 @@ def main():
        with open(os.path.join(args.dump_path, 'parameters.json'), 'w') as f:
            json.dump(vars(args), f, indent=4)
        git_log(args.dump_path)
-    assert (args.from_pretrained_weights is None and args.from_pretrained_config is None) or \
-           (args.from_pretrained_weights is not None and args.from_pretrained_config is not None)

+    student_config_class, student_model_class, _ = MODEL_CLASSES[args.student_type]
+    teacher_config_class, teacher_model_class, teacher_tokenizer_class = MODEL_CLASSES[args.teacher_type]

    ### TOKENIZER ###
-    if args.teacher_type == 'bert':
-        tokenizer = BertTokenizer.from_pretrained(args.teacher_name)
-    elif args.teacher_type == 'roberta':
-        tokenizer = RobertaTokenizer.from_pretrained(args.teacher_name)
+    tokenizer = teacher_tokenizer_class.from_pretrained(args.teacher_name)
    special_tok_ids = {}
    for tok_name, tok_symbol in tokenizer.special_tokens_map.items():
        idx = tokenizer.all_special_tokens.index(tok_symbol)
        special_tok_ids[tok_name] = tokenizer.all_special_ids[idx]
    logger.info(f'Special tokens {special_tok_ids}')
    args.special_tok_ids = special_tok_ids
+    args.max_model_input_size = tokenizer.max_model_input_sizes[args.teacher_name]


    ## DATA LOADER ##
@@ -187,35 +218,34 @@ def main():
        data = pickle.load(fp)


-    assert os.path.isfile(args.token_counts)
-    logger.info(f'Loading token counts from {args.token_counts} (already pre-computed)')
-    with open(args.token_counts, 'rb') as fp:
-        counts = pickle.load(fp)
-        assert len(counts) == args.vocab_size
-    token_probs = np.maximum(counts, 1) ** -args.mlm_smoothing
-    for idx in special_tok_ids.values():
-        token_probs[idx] = 0.  # do not predict special tokens
-    token_probs = torch.from_numpy(token_probs)
+    if args.mlm:
+        logger.info(f'Loading token counts from {args.token_counts} (already pre-computed)')
+        with open(args.token_counts, 'rb') as fp:
+            counts = pickle.load(fp)
+        
+        token_probs = np.maximum(counts, 1) ** -args.mlm_smoothing
+        for idx in special_tok_ids.values():
+            token_probs[idx] = 0.  # do not predict special tokens
+        token_probs = torch.from_numpy(token_probs)
+    else:
+        token_probs = None


-    train_dataloader = Dataset(params=args, data=data)
+    train_lm_seq_dataset = LmSeqsDataset(params=args, data=data)
    logger.info(f'Data loader created.')


    ## STUDENT ##
-    if args.from_pretrained_weights is not None:
-        assert os.path.isfile(args.from_pretrained_weights)
-        assert os.path.isfile(args.from_pretrained_config)
-        logger.info(f'Loading pretrained weights from {args.from_pretrained_weights}')
-        logger.info(f'Loading pretrained config from {args.from_pretrained_config}')
-        stu_architecture_config = DistilBertConfig.from_json_file(args.from_pretrained_config)
-        stu_architecture_config.output_hidden_states = True
-        student = DistilBertForMaskedLM.from_pretrained(args.from_pretrained_weights,
-                                                        config=stu_architecture_config)
+    logger.info(f'Loading student config from {args.student_config}')
+    stu_architecture_config = student_config_class.from_pretrained(args.student_config)
+    stu_architecture_config.output_hidden_states = True
+
+    if args.student_pretrained_weights is not None:
+        logger.info(f'Loading pretrained weights from {args.student_pretrained_weights}')
+        student = student_model_class.from_pretrained(args.student_pretrained_weights,
+                                                      config=stu_architecture_config)
    else:
-        args.vocab_size_or_config_json_file = args.vocab_size
-        stu_architecture_config = DistilBertConfig(**vars(args), output_hidden_states=True)
-        student = DistilBertForMaskedLM(stu_architecture_config)
+        student = student_model_class(stu_architecture_config)


    if args.n_gpu > 0:
@@ -224,18 +254,31 @@ def main():


    ## TEACHER ##
-    if args.teacher_type == 'bert':
-        teacher = BertForMaskedLM.from_pretrained(args.teacher_name, output_hidden_states=True)
-    elif args.teacher_type == 'roberta':
-        teacher = RobertaForMaskedLM.from_pretrained(args.teacher_name, output_hidden_states=True)
+    teacher = teacher_model_class.from_pretrained(args.teacher_name, output_hidden_states=True)
    if args.n_gpu > 0:
        teacher.to(f'cuda:{args.local_rank}')
    logger.info(f'Teacher loaded from {args.teacher_name}.')

+
+    ## FREEZING ##
+    if args.freeze_pos_embs:
+        freeze_pos_embeddings(student, args)
+    if args.freeze_token_type_embds:
+        freeze_token_type_embeddings(student, args)
+
+
+    ## SANITY CHECKS ##
+    assert student.config.vocab_size == teacher.config.vocab_size
+    assert student.config.hidden_size == teacher.config.hidden_size
+    assert student.config.max_position_embeddings == teacher.config.max_position_embeddings
+    if args.mlm:
+        assert token_probs.size(0) == stu_architecture_config.vocab_size
+
+
    ## DISTILLER ##
    torch.cuda.empty_cache()
    distiller = Distiller(params=args,
-                          dataloader=train_dataloader,
+                          dataset=train_lm_seq_dataset,
                          token_probs=token_probs,
                          student=student,
                          teacher=teacher)
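Note on the MLM branch of train.py above: the masking probabilities come from `np.maximum(counts, 1) ** -args.mlm_smoothing`, with the entries for special tokens set to zero. The sketch below reproduces that arithmetic in plain Python so the effect is easy to inspect. The function name `smoothed_token_weights` and the toy counts are assumptions made for illustration; the script itself works on NumPy arrays and then converts to a torch tensor.

```python
def smoothed_token_weights(counts, special_ids, smoothing=0.7):
    """Return unnormalized masking weights: max(count, 1) ** -smoothing.

    Rare tokens get larger weights than frequent ones; special tokens get 0
    so they are never selected for prediction.
    """
    weights = [max(c, 1) ** -smoothing for c in counts]
    for idx in special_ids:
        weights[idx] = 0.0
    return weights


# Hypothetical counts for a 4-token vocabulary; id 0 plays the special token.
counts = [1000, 0, 10, 1]
weights = smoothed_token_weights(counts, special_ids=[0])
```

A token that never occurs is treated as if it occurred once, so it gets weight 1.0 instead of causing a division by zero.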
--- a/examples/distillation/training_configs/distilbert-base-uncased.json
+++ b/examples/distillation/training_configs/distilbert-base-uncased.json
@@ -0,0 +1,15 @@
+{
+	"activation": "gelu",
+	"attention_dropout": 0.1,
+	"dim": 768,
+	"dropout": 0.1,
+	"hidden_dim": 3072,
+	"initializer_range": 0.02,
+	"max_position_embeddings": 512,
+	"n_heads": 12,
+	"n_layers": 6,
+	"sinusoidal_pos_embds": true,
+	"tie_weights_": true,
+	"vocab_size": 30522
+  }
+  
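Note on the config file above: the architecture options that used to be command-line flags now live in this JSON file. The removed `--dim` help text stated that `dim` must be divisible by `n_heads`; the sketch below checks that constraint on the values from the file. It parses the JSON with the standard library only, and `head_size` is a hypothetical helper, not part of the library.

```python
import json

# Values copied from the distilbert-base-uncased.json added in this diff.
STUDENT_CONFIG_JSON = """
{
    "activation": "gelu",
    "attention_dropout": 0.1,
    "dim": 768,
    "dropout": 0.1,
    "hidden_dim": 3072,
    "initializer_range": 0.02,
    "max_position_embeddings": 512,
    "n_heads": 12,
    "n_layers": 6,
    "sinusoidal_pos_embds": true,
    "tie_weights_": true,
    "vocab_size": 30522
}
"""


def head_size(config):
    """Return the per-head dimension, or raise if dim is not divisible by n_heads."""
    dim, n_heads = config["dim"], config["n_heads"]
    if dim % n_heads != 0:
        raise ValueError(f"dim={dim} is not divisible by n_heads={n_heads}")
    return dim // n_heads


config = json.loads(STUDENT_CONFIG_JSON)
per_head = head_size(config)
```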
--- a/examples/distillation/training_configs/distilgpt2.json
+++ b/examples/distillation/training_configs/distilgpt2.json
@@ -0,0 +1,10 @@
+{
+	"initializer_range": 0.02,
+	"layer_norm_epsilon": 0.00001,
+	"n_ctx": 1024,
+	"n_embd": 768,
+	"n_head": 12,
+	"n_layer": 6,
+	"n_positions": 1024,
+	"vocab_size": 50257
+}
--- a/examples/run_generation.py
+++ b/examples/run_generation.py
@@ -14,7 +14,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-""" Conditional text generation with the auto-regressive models of the library (GPT/GPT-2/Transformer-XL/XLNet)
+""" Conditional text generation with the auto-regressive models of the library (GPT/GPT-2/CTRL/Transformer-XL/XLNet)
 """
 from __future__ import absolute_import, division, print_function, unicode_literals

@@ -26,12 +26,14 @@ import torch
 import torch.nn.functional as F
 import numpy as np

-from transformers import GPT2Config, OpenAIGPTConfig, XLNetConfig, TransfoXLConfig
+from transformers import GPT2Config, OpenAIGPTConfig, XLNetConfig, TransfoXLConfig, XLMConfig, CTRLConfig

 from transformers import GPT2LMHeadModel, GPT2Tokenizer
 from transformers import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer
 from transformers import XLNetLMHeadModel, XLNetTokenizer
 from transformers import TransfoXLLMHeadModel, TransfoXLTokenizer
+from transformers import CTRLLMHeadModel, CTRLTokenizer
+from transformers import XLMWithLMHeadModel, XLMTokenizer


 logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s -   %(message)s',
@@ -41,13 +43,15 @@ logger = logging.getLogger(__name__)

 MAX_LENGTH = int(10000)  # Hardcoded max length to avoid infinite loop

-ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in (GPT2Config, OpenAIGPTConfig, XLNetConfig, TransfoXLConfig)), ())
+ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in (GPT2Config, OpenAIGPTConfig, XLNetConfig, TransfoXLConfig, XLMConfig, CTRLConfig)), ())

 MODEL_CLASSES = {
    'gpt2': (GPT2LMHeadModel, GPT2Tokenizer),
+    'ctrl': (CTRLLMHeadModel, CTRLTokenizer),
    'openai-gpt': (OpenAIGPTLMHeadModel, OpenAIGPTTokenizer),
    'xlnet': (XLNetLMHeadModel, XLNetTokenizer),
    'transfo-xl': (TransfoXLLMHeadModel, TransfoXLTokenizer),
+    'xlm': (XLMWithLMHeadModel, XLMTokenizer),
 }

 # Padding text to help Transformer-XL and XLNet with short prompts as proposed by Aman Rusia
@@ -103,7 +107,8 @@ def top_k_top_p_filtering(logits, top_k=0, top_p=0.0, filter_value=-float('Inf')
    return logits


-def sample_sequence(model, length, context, num_samples=1, temperature=1, top_k=0, top_p=0.0, is_xlnet=False, device='cpu'):
+def sample_sequence(model, length, context, num_samples=1, temperature=1, top_k=0, top_p=0.0, repetition_penalty=1.0,
+                    is_xlnet=False, is_xlm_mlm=False, xlm_mask_token=None, xlm_lang=None, device='cpu'):
    context = torch.tensor(context, dtype=torch.long, device=device)
    context = context.unsqueeze(0).repeat(num_samples, 1)
    generated = context
@@ -121,10 +126,27 @@ def sample_sequence(model, length, context, num_samples=1, temperature=1, top_k=
                target_mapping[0, 0, -1] = 1.0  # predict last token
                inputs = {'input_ids': input_ids, 'perm_mask': perm_mask, 'target_mapping': target_mapping}

-            outputs = model(**inputs)  # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet (cached hidden-states)
-            next_token_logits = outputs[0][0, -1, :] / temperature
+            if is_xlm_mlm and xlm_mask_token:
+                # XLM MLM models are direct models (predict same token, not next token)
+                # => need one additional dummy token in the input (will be masked and guessed)
+                input_ids = torch.cat((generated, torch.full((1, 1), xlm_mask_token, dtype=torch.long, device=device)), dim=1)
+                inputs = {'input_ids': input_ids}
+
+            if xlm_lang is not None:
+                inputs["langs"] = torch.tensor([xlm_lang] * inputs["input_ids"].shape[1], device=device).view(1, -1)
+
+            outputs = model(**inputs)  # Note: we could also use 'past' with GPT-2/Transfo-XL/XLNet/CTRL (cached hidden-states)
+            next_token_logits = outputs[0][0, -1, :] / (temperature if temperature > 0 else 1.)
+
+            # repetition penalty from CTRL (https://arxiv.org/abs/1909.05858)
+            for token_id in set(generated.view(-1).tolist()):
+                next_token_logits[token_id] /= repetition_penalty
+
            filtered_logits = top_k_top_p_filtering(next_token_logits, top_k=top_k, top_p=top_p)
-            next_token = torch.multinomial(F.softmax(filtered_logits, dim=-1), num_samples=1)
+            if temperature == 0:  # greedy sampling
+                next_token = torch.argmax(filtered_logits).unsqueeze(0)
+            else:
+                next_token = torch.multinomial(F.softmax(filtered_logits, dim=-1), num_samples=1)
            generated = torch.cat((generated, next_token.unsqueeze(0)), dim=1)
    return generated

@@ -137,14 +159,20 @@ def main():
                        help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS))
    parser.add_argument("--prompt", type=str, default="")
    parser.add_argument("--padding_text", type=str, default="")
+    parser.add_argument("--xlm_lang", type=str, default="", help="Optional language when used with the XLM model.")
    parser.add_argument("--length", type=int, default=20)
-    parser.add_argument("--temperature", type=float, default=1.0)
+    parser.add_argument("--temperature", type=float, default=1.0,
+                        help="temperature of 0 implies greedy sampling")
+    parser.add_argument("--repetition_penalty", type=float, default=1.0,
+                        help="primarily useful for CTRL model; in that case, use 1.2")
    parser.add_argument("--top_k", type=int, default=0)
    parser.add_argument("--top_p", type=float, default=0.9)
    parser.add_argument("--no_cuda", action='store_true',
                        help="Avoid using CUDA when available")
    parser.add_argument('--seed', type=int, default=42,
                        help="random seed for initialization")
+    parser.add_argument('--stop_token', type=str, default=None,
+                        help="Token at which text generation is stopped")
    args = parser.parse_args()

    args.device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
@@ -166,8 +194,31 @@ def main():
    elif args.length < 0:
        args.length = MAX_LENGTH  # avoid infinite loop

-    print(args)
+    logger.info(args)
+    if args.model_type in ["ctrl"]:
+        if args.temperature > 0.7:
+            logger.info('CTRL typically works better with lower temperatures (and lower top_k).')
+
    while True:
+        xlm_lang = None
+        # XLM language usage is detailed in issue #1414
+        if args.model_type in ["xlm"] and hasattr(tokenizer, 'lang2id') and hasattr(model.config, 'use_lang_emb') \
+                and model.config.use_lang_emb:
+            if args.xlm_lang:
+                language = args.xlm_lang
+            else:
+                language = None
+                while language not in tokenizer.lang2id.keys():
+                    language = input("Using XLM. Select language in " + str(list(tokenizer.lang2id.keys())) + " >>> ")
+            xlm_lang = tokenizer.lang2id[language]
+
+        # XLM masked-language modeling (MLM) models need masked token (see details in sample_sequence)
+        is_xlm_mlm = args.model_type in ["xlm"] and 'mlm' in args.model_name_or_path
+        if is_xlm_mlm:
+            xlm_mask_token = tokenizer.mask_token_id
+        else:
+            xlm_mask_token = None
+
        raw_text = args.prompt if args.prompt else input("Model prompt >>> ")
        if args.model_type in ["transfo-xl", "xlnet"]:
            # Models with memory likes to have a long prompt for short inputs.
@@ -180,11 +231,18 @@ def main():
            temperature=args.temperature,
            top_k=args.top_k,
            top_p=args.top_p,
-            device=args.device,
+            repetition_penalty=args.repetition_penalty,
            is_xlnet=bool(args.model_type == "xlnet"),
+            is_xlm_mlm=is_xlm_mlm,
+            xlm_mask_token=xlm_mask_token,
+            xlm_lang=xlm_lang,
+            device=args.device,
        )
        out = out[0, len(context_tokens):].tolist()
-        text = tokenizer.decode(out, clean_up_tokenization_spaces=True)
+
+        text = tokenizer.decode(out, clean_up_tokenization_spaces=True, skip_special_tokens=True)
+        text = text[: text.find(args.stop_token) if args.stop_token else None]
+
        print(text)
        if args.prompt:
            break
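Note on the `--stop_token` handling above: `str.find` returns -1 when the substring is absent, so `text[: text.find(args.stop_token)]` would drop the last character of the text whenever the stop token never appears. The sketch below shows one possible guard. `truncate_at_stop_token` is a hypothetical helper name and the sample strings are made up for illustration.

```python
def truncate_at_stop_token(text, stop_token=None):
    """Cut text at the first occurrence of stop_token, if there is one."""
    if not stop_token:
        return text
    cut = text.find(stop_token)
    # find() returns -1 when the token is missing; keep the full text then.
    return text if cut == -1 else text[:cut]


sample = "no stop here"
# The one-line slice from the diff, applied to a text without the token:
one_liner = sample[: sample.find("<eos>")]
guarded = truncate_at_stop_token(sample, "<eos>")
```

Here `one_liner` loses its final character while `guarded` keeps the text intact.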
--- a/examples/run_glue.py
+++ b/examples/run_glue.py
@@ -28,7 +28,12 @@ import torch
 from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
                              TensorDataset)
 from torch.utils.data.distributed import DistributedSampler
-from tensorboardX import SummaryWriter
+
+try:
+    from torch.utils.tensorboard import SummaryWriter
+except ImportError:
+    from tensorboardX import SummaryWriter
+
 from tqdm import tqdm, trange

 from transformers import (WEIGHTS_NAME, BertConfig,
@@ -53,7 +58,8 @@ from transformers import glue_convert_examples_to_features as convert_examples_t

 logger = logging.getLogger(__name__)

-ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in (BertConfig, XLNetConfig, XLMConfig, RobertaConfig)), ())
+ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in (BertConfig, XLNetConfig, XLMConfig, 
+                                                                                RobertaConfig, DistilBertConfig)), ())

 MODEL_CLASSES = {
    'bert': (BertConfig, BertForSequenceClassification, BertTokenizer),
@@ -134,8 +140,9 @@ def train(args, train_dataset, model, tokenizer):
            batch = tuple(t.to(args.device) for t in batch)
            inputs = {'input_ids':      batch[0],
                      'attention_mask': batch[1],
-                      'token_type_ids': batch[2] if args.model_type in ['bert', 'xlnet'] else None,  # XLM, DistilBERT and RoBERTa don't use segment_ids
                      'labels':         batch[3]}
+            if args.model_type != 'distilbert':
+                inputs['token_type_ids'] = batch[2] if args.model_type in ['bert', 'xlnet'] else None  # XLM, DistilBERT and RoBERTa don't use segment_ids
            outputs = model(**inputs)
            loss = outputs[0]  # model outputs are always tuple in transformers (see doc)

@@ -153,7 +160,7 @@ def train(args, train_dataset, model, tokenizer):
                torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)

            tr_loss += loss.item()
-            if (step + 1) % args.gradient_accumulation_steps == 0:
+            if (step + 1) % args.gradient_accumulation_steps == 0 and not args.tpu:
                optimizer.step()
                scheduler.step()  # Update learning rate schedule
                model.zero_grad()
@@ -179,6 +186,11 @@ def train(args, train_dataset, model, tokenizer):
                    torch.save(args, os.path.join(output_dir, 'training_args.bin'))
                    logger.info("Saving model checkpoint to %s", output_dir)

+            if args.tpu:
+                args.xla_model.optimizer_step(optimizer, barrier=True)
+                model.zero_grad()
+                global_step += 1
+
            if args.max_steps > 0 and global_step > args.max_steps:
                epoch_iterator.close()
                break
@@ -224,8 +236,9 @@ def evaluate(args, model, tokenizer, prefix=""):
            with torch.no_grad():
                inputs = {'input_ids':      batch[0],
                          'attention_mask': batch[1],
-                          'token_type_ids': batch[2] if args.model_type in ['bert', 'xlnet'] else None,  # XLM, DistilBERT and RoBERTa don't use segment_ids
                          'labels':         batch[3]}
+                if args.model_type != 'distilbert':
+                    inputs['token_type_ids'] = batch[2] if args.model_type in ['bert', 'xlnet'] else None  # XLM, DistilBERT and RoBERTa don't use segment_ids
                outputs = model(**inputs)
                tmp_eval_loss, logits = outputs[:2]

@@ -246,7 +259,7 @@ def evaluate(args, model, tokenizer, prefix=""):
        result = compute_metrics(eval_task, preds, out_label_ids)
        results.update(result)

-        output_eval_file = os.path.join(eval_output_dir, "eval_results.txt")
+        output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
        with open(output_eval_file, "w") as writer:
            logger.info("***** Eval results {} *****".format(prefix))
            for key in sorted(result.keys()):
@@ -268,7 +281,7 @@ def load_and_cache_examples(args, task, tokenizer, evaluate=False):
        list(filter(None, args.model_name_or_path.split('/'))).pop(),
        str(args.max_seq_length),
        str(task)))
-    if os.path.exists(cached_features_file):
+    if os.path.exists(cached_features_file) and not args.overwrite_cache:
        logger.info("Loading features from cached file %s", cached_features_file)
        features = torch.load(cached_features_file)
    else:
@@ -377,6 +390,15 @@ def main():
    parser.add_argument('--seed', type=int, default=42,
                        help="random seed for initialization")

+    parser.add_argument('--tpu', action='store_true',
+                        help="Whether to run on the TPU defined in the environment variables")
+    parser.add_argument('--tpu_ip_address', type=str, default='',
+                        help="TPU IP address if none are set in the environment variables")
+    parser.add_argument('--tpu_name', type=str, default='',
+                        help="TPU name if none are set in the environment variables")
+    parser.add_argument('--xrt_tpu_config', type=str, default='',
+                        help="XRT TPU config if none are set in the environment variables")
+
    parser.add_argument('--fp16', action='store_true',
                        help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit")
    parser.add_argument('--fp16_opt_level', type=str, default='O1',
@@ -410,6 +432,23 @@ def main():
        args.n_gpu = 1
    args.device = device

+    if args.tpu:
+        if args.tpu_ip_address:
+            os.environ["TPU_IP_ADDRESS"] = args.tpu_ip_address
+        if args.tpu_name:
+            os.environ["TPU_NAME"] = args.tpu_name
+        if args.xrt_tpu_config:
+            os.environ["XRT_TPU_CONFIG"] = args.xrt_tpu_config
+
+        assert "TPU_IP_ADDRESS" in os.environ
+        assert "TPU_NAME" in os.environ
+        assert "XRT_TPU_CONFIG" in os.environ
+
+        import torch_xla
+        import torch_xla.core.xla_model as xm
+        args.device = xm.xla_device()
+        args.xla_model = xm
+
    # Setup logging
    logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s -   %(message)s',
                        datefmt = '%m/%d/%Y %H:%M:%S',
@@ -455,7 +494,7 @@ def main():


    # Saving best-practices: if you use default names for the model, you can reload it using from_pretrained()
-    if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
+    if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0) and not args.tpu:
        # Create output directory if needed
        if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]:
            os.makedirs(args.output_dir)
@@ -487,9 +526,11 @@ def main():
        logger.info("Evaluate the following checkpoints: %s", checkpoints)
        for checkpoint in checkpoints:
            global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
+            prefix = checkpoint.split('/')[-1] if checkpoint.find('checkpoint') != -1 else ""
+            
            model = model_class.from_pretrained(checkpoint)
            model.to(args.device)
-            result = evaluate(args, model, tokenizer, prefix=global_step)
+            result = evaluate(args, model, tokenizer, prefix=prefix)
            result = dict((k + '_{}'.format(global_step), v) for k, v in result.items())
            results.update(result)

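The `--tpu` hunk in `main()` above copies any command-line overrides into the environment and then asserts that all three TPU variables are set. A minimal sketch of that resolution step follows, written as a pure function so it can be tried without `torch_xla`. The name `resolve_tpu_env` and the use of `ValueError` instead of `assert` are assumptions made for illustration; they are not part of the patch.

```python
def resolve_tpu_env(environ, tpu_ip_address="", tpu_name="", xrt_tpu_config=""):
    """Return a copy of `environ` with TPU overrides applied.

    Hypothetical helper: it mirrors the override-then-check order of the
    patch, but raises ValueError instead of relying on bare asserts.
    """
    resolved = dict(environ)
    overrides = {
        "TPU_IP_ADDRESS": tpu_ip_address,
        "TPU_NAME": tpu_name,
        "XRT_TPU_CONFIG": xrt_tpu_config,
    }
    for key, value in overrides.items():
        if value:  # an empty string means "not given on the command line"
            resolved[key] = value
    missing = [key for key in overrides if key not in resolved]
    if missing:
        raise ValueError("Missing TPU settings: " + ", ".join(missing))
    return resolved
```

Raising an exception rather than asserting keeps the check active when Python runs with `-O`, which strips `assert` statements.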
--- a/examples/run_lm_finetuning.py
+++ b/examples/run_lm_finetuning.py
@@ -27,12 +27,19 @@ import logging
 import os
 import pickle
 import random
+import re
+import shutil

 import numpy as np
 import torch
 from torch.utils.data import DataLoader, Dataset, SequentialSampler, RandomSampler
 from torch.utils.data.distributed import DistributedSampler
-from tensorboardX import SummaryWriter
+
+try:
+    from torch.utils.tensorboard import SummaryWriter
+except ImportError:
+    from tensorboardX import SummaryWriter
+
 from tqdm import tqdm, trange

 from transformers import (WEIGHTS_NAME, AdamW, WarmupLinearSchedule,
@@ -59,7 +66,7 @@ class TextDataset(Dataset):
    def __init__(self, tokenizer, file_path='train', block_size=512):
        assert os.path.isfile(file_path)
        directory, filename = os.path.split(file_path)
-        cached_features_file = os.path.join(directory, f'cached_lm_{block_size}_{filename}')
+        cached_features_file = os.path.join(directory, 'cached_lm_' + str(block_size) + '_' + filename)

        if os.path.exists(cached_features_file):
            logger.info("Loading features from cached file %s", cached_features_file)
@@ -74,9 +81,8 @@ class TextDataset(Dataset):

            tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))

-            while len(tokenized_text) >= block_size:  # Truncate in block of block_size
-                self.examples.append(tokenizer.add_special_tokens_single_sequence(tokenized_text[:block_size]))
-                tokenized_text = tokenized_text[block_size:]
+            for i in range(0, len(tokenized_text)-block_size+1, block_size): # Truncate in block of block_size
+                self.examples.append(tokenizer.build_inputs_with_special_tokens(tokenized_text[i:i+block_size]))
            # Note that we are losing the last truncated example here for the sake of simplicity (no padding)
            # If your dataset is small, first you should look for a bigger one :-) and second you
            # can change this behavior by adding (model specific) padding.
@@ -105,11 +111,43 @@ def set_seed(args):
        torch.cuda.manual_seed_all(args.seed)


+def _rotate_checkpoints(args, checkpoint_prefix, use_mtime=False):
+    if not args.save_total_limit:
+        return
+    if args.save_total_limit <= 0:
+        return
+
+    # Check if we should delete older checkpoint(s)
+    glob_checkpoints = glob.glob(os.path.join(args.output_dir, '{}-*'.format(checkpoint_prefix)))
+    if len(glob_checkpoints) <= args.save_total_limit:
+        return
+
+    ordering_and_checkpoint_path = []
+    for path in glob_checkpoints:
+        if use_mtime:
+            ordering_and_checkpoint_path.append((os.path.getmtime(path), path))
+        else:
+            regex_match = re.match('.*{}-([0-9]+)'.format(checkpoint_prefix), path)
+            if regex_match and regex_match.groups():
+                ordering_and_checkpoint_path.append((int(regex_match.groups()[0]), path))
+
+    checkpoints_sorted = sorted(ordering_and_checkpoint_path)
+    checkpoints_sorted = [checkpoint[1] for checkpoint in checkpoints_sorted]
+    number_of_checkpoints_to_delete = max(0, len(checkpoints_sorted) - args.save_total_limit)
+    checkpoints_to_be_deleted = checkpoints_sorted[:number_of_checkpoints_to_delete]
+    for checkpoint in checkpoints_to_be_deleted:
+        logger.info("Deleting older checkpoint [{}] due to args.save_total_limit".format(checkpoint))
+        shutil.rmtree(checkpoint)
+
+
 def mask_tokens(inputs, tokenizer, args):
    """ Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. """
    labels = inputs.clone()
    # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)
-    masked_indices = torch.bernoulli(torch.full(labels.shape, args.mlm_probability)).bool()
+    probability_matrix = torch.full(labels.shape, args.mlm_probability)
+    special_tokens_mask = [tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()]
+    probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)
+    masked_indices = torch.bernoulli(probability_matrix).bool()
    labels[~masked_indices] = -1  # We only compute loss on masked tokens

    # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])
@@ -223,8 +261,9 @@ def train(args, train_dataset, model, tokenizer):
                    logging_loss = tr_loss

                if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
+                    checkpoint_prefix = 'checkpoint'
                    # Save model checkpoint
-                    output_dir = os.path.join(args.output_dir, 'checkpoint-{}'.format(global_step))
+                    output_dir = os.path.join(args.output_dir, '{}-{}'.format(checkpoint_prefix, global_step))
                    if not os.path.exists(output_dir):
                        os.makedirs(output_dir)
                    model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
@@ -232,6 +271,8 @@ def train(args, train_dataset, model, tokenizer):
                    torch.save(args, os.path.join(output_dir, 'training_args.bin'))
                    logger.info("Saving model checkpoint to %s", output_dir)

+                    _rotate_checkpoints(args, checkpoint_prefix)
+
            if args.max_steps > 0 and global_step > args.max_steps:
                epoch_iterator.close()
                break
@@ -283,7 +324,7 @@ def evaluate(args, model, tokenizer, prefix=""):
        "perplexity": perplexity
    }

-    output_eval_file = os.path.join(eval_output_dir, "eval_results.txt")
+    output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
    with open(output_eval_file, "w") as writer:
        logger.info("***** Eval results {} *****".format(prefix))
        for key in sorted(result.keys()):
@@ -360,6 +401,8 @@ def main():
                        help="Log every X updates steps.")
    parser.add_argument('--save_steps', type=int, default=50,
                        help="Save checkpoint every X updates steps.")
+    parser.add_argument('--save_total_limit', type=int, default=None,
+                        help='Limit the total number of checkpoints: older checkpoints in output_dir are deleted. By default, no checkpoints are deleted.')
    parser.add_argument("--eval_all_checkpoints", action='store_true',
                        help="Evaluate all checkpoints starting with the same prefix as model_name_or_path and ending with step number")
    parser.add_argument("--no_cuda", action='store_true',
@@ -485,9 +528,11 @@ def main():
        logger.info("Evaluate the following checkpoints: %s", checkpoints)
        for checkpoint in checkpoints:
            global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
+            prefix = checkpoint.split('/')[-1] if checkpoint.find('checkpoint') != -1 else ""
+            
            model = model_class.from_pretrained(checkpoint)
            model.to(args.device)
-            result = evaluate(args, model, tokenizer, prefix=global_step)
+            result = evaluate(args, model, tokenizer, prefix=prefix)
            result = dict((k + '_{}'.format(global_step), v) for k, v in result.items())
            results.update(result)

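The `_rotate_checkpoints` function added in this file keeps at most `save_total_limit` checkpoint directories and deletes the oldest ones. The sketch below reproduces the step-number ordering on a temporary directory. It is a simplified illustration, not the patch itself: the `use_mtime` branch is omitted, the regular expression is anchored with `$`, and the names `rotate_checkpoints` and `demo` are hypothetical.

```python
import glob
import os
import re
import shutil
import tempfile


def rotate_checkpoints(output_dir, save_total_limit, checkpoint_prefix="checkpoint"):
    """Delete the oldest `<prefix>-<step>` directories beyond `save_total_limit`.

    Returns the list of deleted paths. Ordering is by step number only.
    """
    if not save_total_limit or save_total_limit <= 0:
        return []
    pattern = re.compile(r".*{}-([0-9]+)$".format(re.escape(checkpoint_prefix)))
    numbered = []
    for path in glob.glob(os.path.join(output_dir, "{}-*".format(checkpoint_prefix))):
        match = pattern.match(path)
        if match:
            numbered.append((int(match.group(1)), path))
    numbered.sort()  # ascending step number, oldest first
    excess = max(0, len(numbered) - save_total_limit)
    deleted = [path for _, path in numbered[:excess]]
    for path in deleted:
        shutil.rmtree(path)
    return deleted


def demo():
    """Create five fake checkpoint directories and keep the newest two."""
    root = tempfile.mkdtemp()
    try:
        for step in (50, 100, 150, 200, 1000):
            os.makedirs(os.path.join(root, "checkpoint-{}".format(step)))
        rotate_checkpoints(root, save_total_limit=2)
        return sorted(os.listdir(root))
    finally:
        shutil.rmtree(root)
```

Converting the captured step to `int` before sorting matters: a plain string sort would place `checkpoint-1000` before `checkpoint-200` and delete the wrong directories.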
--- a/examples/run_multiple_choice.py
+++ b/examples/run_multiple_choice.py
@@ -29,7 +29,12 @@ import torch
 from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
                              TensorDataset)
 from torch.utils.data.distributed import DistributedSampler
-from tensorboardX import SummaryWriter
+
+try:
+    from torch.utils.tensorboard import SummaryWriter
+except ImportError:
+    from tensorboardX import SummaryWriter
+
 from tqdm import tqdm, trange

 from transformers import (WEIGHTS_NAME, BertConfig,
@@ -293,7 +298,7 @@ def load_and_cache_examples(args, task, tokenizer, evaluate=False, test=False):
        list(filter(None, args.model_name_or_path.split('/'))).pop(),
        str(args.max_seq_length),
        str(task)))
-    if os.path.exists(cached_features_file):
+    if os.path.exists(cached_features_file) and not args.overwrite_cache:
        logger.info("Loading features from cached file %s", cached_features_file)
        features = torch.load(cached_features_file)
    else:
@@ -306,14 +311,14 @@ def load_and_cache_examples(args, task, tokenizer, evaluate=False, test=False):
        else:
            examples = processor.get_train_examples(args.data_dir)
        logger.info("Training number: %s", str(len(examples)))
-        features = convert_examples_to_features(examples, label_list, args.max_seq_length, tokenizer,
-            cls_token_at_end=bool(args.model_type in ['xlnet']),            # xlnet has a cls token at the end
-            cls_token=tokenizer.cls_token,
-            sep_token=tokenizer.sep_token,
-            sep_token_extra=bool(args.model_type in ['roberta']),
-            cls_token_segment_id=2 if args.model_type in ['xlnet'] else 0,
+        features = convert_examples_to_features(
+            examples,
+            label_list,
+            args.max_seq_length,
+            tokenizer,
            pad_on_left=bool(args.model_type in ['xlnet']),                 # pad on the left for xlnet
-            pad_token_segment_id=4 if args.model_type in ['xlnet'] else 0)
+            pad_token_segment_id=4 if args.model_type in ['xlnet'] else 0
+        )
        if args.local_rank in [-1, 0]:
            logger.info("Saving features into cached file %s", cached_features_file)
            torch.save(features, cached_features_file)
@@ -362,7 +367,7 @@ def main():
                        help="Whether to run eval on the dev set.")
    parser.add_argument("--do_test", action='store_true', help='Whether to run test on the test set')
    parser.add_argument("--evaluate_during_training", action='store_true',
-                        help="Rul evaluation during training at each logging step.")
+                        help="Run evaluation during training at each logging step.")
    parser.add_argument("--do_lower_case", action='store_true',
                        help="Set this flag if you are using an uncased model.")

@@ -512,9 +517,11 @@ def main():
        logger.info("Evaluate the following checkpoints: %s", checkpoints)
        for checkpoint in checkpoints:
            global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
+            prefix = checkpoint.split('/')[-1] if checkpoint.find('checkpoint') != -1 else ""
+            
            model = model_class.from_pretrained(checkpoint)
            model.to(args.device)
-            result = evaluate(args, model, tokenizer, prefix=global_step)
+            result = evaluate(args, model, tokenizer, prefix=prefix)
            result = dict((k + '_{}'.format(global_step), v) for k, v in result.items())
            results.update(result)

@@ -528,9 +535,11 @@ def main():
        logger.info("Evaluate the following checkpoints: %s", checkpoints)
        for checkpoint in checkpoints:
            global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
+            prefix = checkpoint.split('/')[-1] if checkpoint.find('checkpoint') != -1 else ""
+            
            model = model_class.from_pretrained(checkpoint)
            model.to(args.device)
-            result = evaluate(args, model, tokenizer, prefix=global_step, test=True)
+            result = evaluate(args, model, tokenizer, prefix=prefix, test=True)
            result = dict((k + '_{}'.format(global_step), v) for k, v in result.items())
            results.update(result)
    if best_steps:
--- a/examples/run_squad.py
+++ b/examples/run_squad.py
@@ -13,7 +13,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-""" Finetuning the library models for question-answering on SQuAD (Bert, XLM, XLNet)."""
+""" Finetuning the library models for question-answering on SQuAD (DistilBERT, Bert, XLM, XLNet)."""

 from __future__ import absolute_import, division, print_function

@@ -28,9 +28,13 @@ import torch
 from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
                              TensorDataset)
 from torch.utils.data.distributed import DistributedSampler
-from tqdm import tqdm, trange

-from tensorboardX import SummaryWriter
+try:
+    from torch.utils.tensorboard import SummaryWriter
+except ImportError:
+    from tensorboardX import SummaryWriter
+
+from tqdm import tqdm, trange

 from transformers import (WEIGHTS_NAME, BertConfig,
                                  BertForQuestionAnswering, BertTokenizer,
@@ -135,9 +139,10 @@ def train(args, train_dataset, model, tokenizer):
            batch = tuple(t.to(args.device) for t in batch)
            inputs = {'input_ids':       batch[0],
                      'attention_mask':  batch[1], 
-                      'token_type_ids':  None if args.model_type == 'xlm' else batch[2],  
                      'start_positions': batch[3], 
                      'end_positions':   batch[4]}
+            if args.model_type != 'distilbert':
+                inputs['token_type_ids'] = None if args.model_type == 'xlm' else batch[2]
            if args.model_type in ['xlnet', 'xlm']:
                inputs.update({'cls_index': batch[5],
                               'p_mask':       batch[6]})
@@ -218,9 +223,10 @@ def evaluate(args, model, tokenizer, prefix=""):
        batch = tuple(t.to(args.device) for t in batch)
        with torch.no_grad():
            inputs = {'input_ids':      batch[0],
-                      'attention_mask': batch[1],
-                      'token_type_ids': None if args.model_type == 'xlm' else batch[2]  # XLM don't use segment_ids
+                      'attention_mask': batch[1]
                      }
+            if args.model_type != 'distilbert':
+                inputs['token_type_ids'] = None if args.model_type == 'xlm' else batch[2]  # XLM doesn't use segment_ids
            example_indices = batch[3]
            if args.model_type in ['xlnet', 'xlm']:
                inputs.update({'cls_index': batch[4],
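The two `run_squad.py` hunks above stop passing `token_type_ids` when the model type is `distilbert`, while XLM still receives an explicit `None`. A minimal sketch of that branching, using plain Python values in place of tensors, is shown here. The function name `build_eval_inputs` is hypothetical and the batch layout is assumed to match the script (`input_ids`, `attention_mask`, `token_type_ids`, ...).

```python
def build_eval_inputs(model_type, batch):
    """Assemble the keyword arguments for one evaluation batch.

    `batch` is any indexable sequence laid out as in the script:
    [input_ids, attention_mask, token_type_ids, ...].
    """
    inputs = {"input_ids": batch[0], "attention_mask": batch[1]}
    if model_type != "distilbert":
        # XLM receives an explicit None; DistilBERT gets no key at all.
        inputs["token_type_ids"] = None if model_type == "xlm" else batch[2]
    return inputs
```

Leaving the key out entirely, rather than setting it to `None`, is what allows the dictionary to be unpacked with `**inputs` into a model whose `forward` has no such parameter.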
--- a/examples/run_tf_glue.py
+++ b/examples/run_tf_glue.py
@@ -1,6 +1,6 @@
 import tensorflow as tf
 import tensorflow_datasets
-from transformers import *
+from transformers import BertTokenizer, TFBertForSequenceClassification, glue_convert_examples_to_features, BertForSequenceClassification

 # Load dataset, tokenizer, model from pretrained model/vocabulary
 tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
@@ -23,12 +23,6 @@ model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
 history = model.fit(train_dataset, epochs=2, steps_per_epoch=115,
                    validation_data=valid_dataset, validation_steps=7)

->>> Train for 115 steps, validate for 7 steps
->>> Epoch 1/2
->>> 115/115 [==============================] - 53s 459ms/step - loss: 0.6033 - accuracy: 0.6712 - val_loss: 0.4964 - val_accuracy: 0.7647
->>> Epoch 2/2
->>> 115/115 [==============================] - 33s 289ms/step - loss: 0.4141 - accuracy: 0.8160 - val_loss: 0.3914 - val_accuracy: 0.8382
-
 # Load the TensorFlow model in PyTorch for inspection
 model.save_pretrained('./save/')
 pytorch_model = BertForSequenceClassification.from_pretrained('./save/', from_tf=True)
@@ -44,5 +38,3 @@ pred_1 = pytorch_model(**inputs_1)[0].argmax().item()
 pred_2 = pytorch_model(**inputs_2)[0].argmax().item()
 print("sentence_1 is", "a paraphrase" if pred_1 else "not a paraphrase", "of sentence_0")
 print("sentence_2 is", "a paraphrase" if pred_2 else "not a paraphrase", "of sentence_0")
->>> sentence_1 is a paraphrase of sentence_0
->>> sentence_2 is not a paraphrase of sentence_0
--- a/examples/utils_multiple_choice.py
+++ b/examples/utils_multiple_choice.py
@@ -13,7 +13,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-""" BERT multiple choice fine-tuning: utilities to work with multiple choice tasks of reading comprehension  """
+""" Multiple choice fine-tuning: utilities to work with multiple choice tasks of reading comprehension  """

 from __future__ import absolute_import, division, print_function

@@ -26,6 +26,8 @@ import json
 import csv
 import glob
 import tqdm
+from typing import List
+from transformers import PreTrainedTokenizer


 logger = logging.getLogger(__name__)
@@ -34,13 +36,13 @@ logger = logging.getLogger(__name__)
 class InputExample(object):
    """A single training/test example for multiple choice"""

-    def __init__(self, example_id, question,  contexts, endings, label=None):
+    def __init__(self, example_id, question, contexts, endings, label=None):
        """Constructs a InputExample.

        Args:
            example_id: Unique id for the example.
            contexts: list of str. The untokenized text of the first sequence (context of corresponding question).
-            question: string. The untokenized text of the second sequence (qustion).
+            question: string. The untokenized text of the second sequence (question).
            endings: list of str. multiple choice's options. Its length must be equal to contexts' length.
            label: (Optional) string. The label of the example. This should be
            specified for train and dev examples, but not for test examples.
@@ -66,7 +68,7 @@ class InputFeatures(object):
                'input_mask': input_mask,
                'segment_ids': segment_ids
            }
-            for _, input_ids, input_mask, segment_ids in choices_features
+            for input_ids, input_mask, segment_ids in choices_features
        ]
        self.label = label

@@ -192,7 +194,7 @@ class SwagProcessor(DataProcessor):
            return lines


-    def _create_examples(self, lines, type):
+    def _create_examples(self, lines: List[List[str]], type: str):
        """Creates examples for the training and dev sets."""
        if type == "train" and lines[0][-1] != 'label':
            raise ValueError(
@@ -300,24 +302,18 @@ class ArcProcessor(DataProcessor):
        return examples


-def convert_examples_to_features(examples, label_list, max_seq_length,
-                                 tokenizer,
-                                 cls_token_at_end=False,
-                                 cls_token='[CLS]',
-                                 cls_token_segment_id=1,
-                                 sep_token='[SEP]',
-                                 sequence_a_segment_id=0,
-                                 sequence_b_segment_id=1,
-                                 sep_token_extra=False,
-                                 pad_token_segment_id=0,
-                                 pad_on_left=False,
-                                 pad_token=0,
-                                 mask_padding_with_zero=True):
-    """ Loads a data file into a list of `InputBatch`s
-        `cls_token_at_end` define the location of the CLS token:
-            - False (Default, BERT/XLM pattern): [CLS] + A + [SEP] + B + [SEP]
-            - True (XLNet/GPT pattern): A + [SEP] + B + [SEP] + [CLS]
-        `cls_token_segment_id` define the segment id associated to the CLS token (0 for BERT, 2 for XLNet)
+def convert_examples_to_features(
+    examples: List[InputExample],
+    label_list: List[str],
+    max_length: int,
+    tokenizer: PreTrainedTokenizer,
+    pad_token_segment_id=0,
+    pad_on_left=False,
+    pad_token=0,
+    mask_padding_with_zero=True,
+) -> List[InputFeatures]:
+    """
+    Loads a data file into a list of `InputFeatures`
    """

    label_map = {label : i for i, label in enumerate(label_list)}
@@ -328,125 +324,70 @@ def convert_examples_to_features(examples, label_list, max_seq_length,
            logger.info("Writing example %d of %d" % (ex_index, len(examples)))
        choices_features = []
        for ending_idx, (context, ending) in enumerate(zip(example.contexts, example.endings)):
-            tokens_a = tokenizer.tokenize(context)
-            tokens_b = None
+            text_a = context
            if example.question.find("_") != -1:
-                #this is for cloze question
-                tokens_b = tokenizer.tokenize(example.question.replace("_", ending))
+                # this is for cloze question
+                text_b = example.question.replace("_", ending)
            else:
-                tokens_b = tokenizer.tokenize(example.question + " " + ending)
-                # you can add seq token between quesiotn and ending. This does not make too much difference.
-                # tokens_b = tokenizer.tokenize(example.question)
-                # tokens_b += [sep_token]
-                # if sep_token_extra:
-                #     tokens_b += [sep_token]
-                # tokens_b += tokenizer.tokenize(ending)
+                text_b = example.question + " " + ending

-            special_tokens_count = 4 if sep_token_extra else 3
-            _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - special_tokens_count)
+            inputs = tokenizer.encode_plus(
+                text_a,
+                text_b,
+                add_special_tokens=True,
+                max_length=max_length,
+            )
+            if 'num_truncated_tokens' in inputs and inputs['num_truncated_tokens'] > 0:
+                logger.info('Attention! You are cropping tokens (swag task is ok). '
+                        'If you are training ARC and RACE and you are popping question + options, '
+                        'you need to try to use a bigger max seq length!')

-            # The convention in BERT is:
-            # (a) For sequence pairs:
-            #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
-            #  type_ids:   0   0  0    0    0     0       0   0   1  1  1  1   1   1
-            # (b) For single sequences:
-            #  tokens:   [CLS] the dog is hairy . [SEP]
-            #  type_ids:   0   0   0   0  0     0   0
-            #
-            # Where "type_ids" are used to indicate whether this is the first
-            # sequence or the second sequence. The embedding vectors for `type=0` and
-            # `type=1` were learned during pre-training and are added to the wordpiece
-            # embedding vector (and position vector). This is not *strictly* necessary
-            # since the [SEP] token unambiguously separates the sequences, but it makes
-            # it easier for the model to learn the concept of sequences.
-            #
-            # For classification tasks, the first vector (corresponding to [CLS]) is
-            # used as as the "sentence vector". Note that this only makes sense because
-            # the entire model is fine-tuned.
-            tokens = tokens_a + [sep_token]
-            if sep_token_extra:
-                # roberta uses an extra separator b/w pairs of sentences
-                tokens += [sep_token]
-
-            segment_ids = [sequence_a_segment_id] * len(tokens)
-
-            if tokens_b:
-                tokens += tokens_b + [sep_token]
-                segment_ids += [sequence_b_segment_id] * (len(tokens_b) + 1)
-
-            if cls_token_at_end:
-                tokens = tokens + [cls_token]
-                segment_ids = segment_ids + [cls_token_segment_id]
-            else:
-                tokens = [cls_token] + tokens
-                segment_ids = [cls_token_segment_id] + segment_ids
-
-            input_ids = tokenizer.convert_tokens_to_ids(tokens)
+            input_ids, token_type_ids = inputs["input_ids"], inputs["token_type_ids"]

            # The mask has 1 for real tokens and 0 for padding tokens. Only real
            # tokens are attended to.
-            input_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)
+            attention_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)

            # Zero-pad up to the sequence length.
-            padding_length = max_seq_length - len(input_ids)
+            padding_length = max_length - len(input_ids)
            if pad_on_left:
                input_ids = ([pad_token] * padding_length) + input_ids
-                input_mask = ([0 if mask_padding_with_zero else 1] * padding_length) + input_mask
-                segment_ids = ([pad_token_segment_id] * padding_length) + segment_ids
+                attention_mask = ([0 if mask_padding_with_zero else 1] * padding_length) + attention_mask
+                token_type_ids = ([pad_token_segment_id] * padding_length) + token_type_ids
            else:
                input_ids = input_ids + ([pad_token] * padding_length)
-                input_mask = input_mask + ([0 if mask_padding_with_zero else 1] * padding_length)
-                segment_ids = segment_ids + ([pad_token_segment_id] * padding_length)
+                attention_mask = attention_mask + ([0 if mask_padding_with_zero else 1] * padding_length)
+                token_type_ids = token_type_ids + ([pad_token_segment_id] * padding_length)
+
+            assert len(input_ids) == max_length
+            assert len(attention_mask) == max_length
+            assert len(token_type_ids) == max_length
+            choices_features.append((input_ids, attention_mask, token_type_ids))
+

-            assert len(input_ids) == max_seq_length
-            assert len(input_mask) == max_seq_length
-            assert len(segment_ids) == max_seq_length
-            choices_features.append((tokens, input_ids, input_mask, segment_ids))
        label = label_map[example.label]

        if ex_index < 2:
            logger.info("*** Example ***")
            logger.info("race_id: {}".format(example.example_id))
-            for choice_idx, (tokens, input_ids, input_mask, segment_ids) in enumerate(choices_features):
+            for choice_idx, (input_ids, attention_mask, token_type_ids) in enumerate(choices_features):
                logger.info("choice: {}".format(choice_idx))
-                logger.info("tokens: {}".format(' '.join(tokens)))
                logger.info("input_ids: {}".format(' '.join(map(str, input_ids))))
-                logger.info("input_mask: {}".format(' '.join(map(str, input_mask))))
-                logger.info("segment_ids: {}".format(' '.join(map(str, segment_ids))))
+                logger.info("attention_mask: {}".format(' '.join(map(str, attention_mask))))
+                logger.info("token_type_ids: {}".format(' '.join(map(str, token_type_ids))))
                logger.info("label: {}".format(label))

        features.append(
            InputFeatures(
-                example_id = example.example_id,
-                choices_features = choices_features,
-                label = label
+                example_id=example.example_id,
+                choices_features=choices_features,
+                label=label,
            )
        )

    return features


-def _truncate_seq_pair(tokens_a, tokens_b, max_length):
-    """Truncates a sequence pair in place to the maximum length."""
-
-    # This is a simple heuristic which will always truncate the longer sequence
-    # one token at a time. This makes more sense than truncating an equal percent
-    # of tokens from each, since if one sequence is very short then each token
-    # that's truncated likely contains more information than a longer sequence.
-
-    # However, since we'd better not to remove tokens of options and questions, you can choose to use a bigger
-    # length or only pop from context
-    while True:
-        total_length = len(tokens_a) + len(tokens_b)
-        if total_length <= max_length:
-            break
-        if len(tokens_a) > len(tokens_b):
-            tokens_a.pop()
-        else:
-            logger.info('Attention! you are removing from token_b (swag task is ok). '
-                        'If you are training ARC and RACE (you are poping question + options), '
-                        'you need to try to use a bigger max seq length!')
-            tokens_b.pop()


 processors = {
@@ -456,7 +397,7 @@ processors = {
 }


-GLUE_TASKS_NUM_LABELS = {
+MULTIPLE_CHOICE_TASKS_NUM_LABELS = {
    "race": 4,
    "swag": 4,
    "arc": 4
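After this change, `convert_examples_to_features` in `utils_multiple_choice.py` gets `input_ids` and `token_type_ids` from `tokenizer.encode_plus` and only has to build the attention mask and pad to `max_length`. The padding step can be sketched without a tokenizer as follows. The helper name `pad_features` is hypothetical, and the sketch fixes `mask_padding_with_zero=True` (1 for real tokens, 0 for padding) rather than exposing it as a parameter.

```python
def pad_features(input_ids, token_type_ids, max_length,
                 pad_token=0, pad_token_segment_id=0, pad_on_left=False):
    """Pad one encoded choice to `max_length` and build its attention mask."""
    attention_mask = [1] * len(input_ids)  # 1 marks real tokens
    padding_length = max_length - len(input_ids)
    if padding_length < 0:
        raise ValueError("input is longer than max_length")
    ids_pad = [pad_token] * padding_length
    mask_pad = [0] * padding_length  # 0 marks padding positions
    type_pad = [pad_token_segment_id] * padding_length
    if pad_on_left:
        return ids_pad + input_ids, mask_pad + attention_mask, type_pad + token_type_ids
    return input_ids + ids_pad, attention_mask + mask_pad, token_type_ids + type_pad
```

With `pad_on_left=True` and `pad_token_segment_id=4`, this matches the XLNet settings that `run_multiple_choice.py` passes in; all three returned lists have length `max_length`.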
--- a/requirements-dev.txt
+++ b/requirements-dev.txt
@@ -0,0 +1,48 @@
+absl-py==0.8.0
+astor==0.8.0
+atomicwrites==1.3.0
+attrs==19.2.0
+boto3==1.9.243
+botocore==1.12.243
+certifi==2019.9.11
+chardet==3.0.4
+Click==7.0
+docutils==0.15.2
+gast==0.2.2
+google-pasta==0.1.7
+grpcio==1.24.1
+h5py==2.10.0
+idna==2.8
+importlib-metadata==0.23
+jmespath==0.9.4
+joblib==0.14.0
+Keras-Applications==1.0.8
+Keras-Preprocessing==1.1.0
+Markdown==3.1.1
+more-itertools==7.2.0
+numpy==1.17.2
+opt-einsum==3.1.0
+packaging==19.2
+pluggy==0.13.0
+protobuf==3.10.0
+py==1.8.0
+pyparsing==2.4.2
+pytest==5.2.1
+python-dateutil==2.8.0
+regex==2019.8.19
+requests==2.22.0
+s3transfer==0.2.1
+sacremoses==0.0.35
+sentencepiece==0.1.83
+six==1.12.0
+tensorboard==2.0.0
+tensorflow==2.0.0
+tensorflow-estimator==2.0.0
+termcolor==1.1.0
+torch==1.2.0
+tqdm==4.36.1
+urllib3==1.25.6
+wcwidth==0.1.7
+Werkzeug==0.16.0
+wrapt==1.11.2
+zipp==0.6.0
--- a/setup.py
+++ b/setup.py
@@ -3,7 +3,7 @@ Simple check list from AllenNLP repo: https://github.com/allenai/allennlp/blob/m

 To create the package for pypi.

-1. Change the version in __init__.py and setup.py.
+1. Change the version in __init__.py, setup.py as well as docs/source/conf.py.

 2. Commit these changes with the message: "Release: VERSION"

@@ -38,13 +38,13 @@ from setuptools import find_packages, setup

 setup(
    name="transformers",
-    version="2.0.0",
-    author="Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Google AI Language Team Authors, Open AI team Authors",
+    version="2.1.1",
+    author="Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Google AI Language Team Authors, Open AI team Authors, Facebook AI Authors, Carnegie Mellon University Authors",
    author_email="thomas@huggingface.co",
-    description="Repository of pre-trained NLP Transformer models: BERT & RoBERTa, GPT & GPT-2, Transformer-XL, XLNet and XLM",
+    description="State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch",
    long_description=open("README.md", "r", encoding='utf-8').read(),
    long_description_content_type="text/markdown",
-    keywords='NLP deep learning transformer pytorch BERT GPT GPT-2 google openai CMU',
+    keywords='NLP deep learning transformer pytorch tensorflow BERT GPT GPT-2 google openai CMU',
    license='Apache',
    url="https://github.com/huggingface/transformers",
    packages=find_packages(exclude=["*.tests", "*.tests.*",
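
The updated checklist asks for the version to be bumped in three files. A minimal sketch of a consistency check is shown below. The helper names and regular expressions are hypothetical, and the assumption that `docs/source/conf.py` stores the version in a `release = ...` assignment is not confirmed by this diff.

```python
import re

# Hypothetical per-file patterns; the exact assignment syntax in each
# file is an assumption made for this sketch.
VERSION_PATTERNS = {
    "setup.py": r'\bversion\s*=\s*["\']([^"\']+)["\']',
    "__init__.py": r'__version__\s*=\s*["\']([^"\']+)["\']',
    "conf.py": r'\brelease\s*=\s*u?["\']([^"\']+)["\']',
}


def extract_versions(sources):
    """Map each file name to the version string found in its text, or None."""
    found = {}
    for name, text in sources.items():
        match = re.search(VERSION_PATTERNS[name], text)
        found[name] = match.group(1) if match else None
    return found


def versions_agree(sources):
    """Return True only if every file yields the same, non-missing version."""
    versions = set(extract_versions(sources).values())
    return None not in versions and len(versions) == 1


# Sample file contents standing in for the real files.
sample_sources = {
    "setup.py": '    version="2.1.1",',
    "__init__.py": '__version__ = "2.1.1"',
    "conf.py": "release = u'2.1.1'",
}
consistent = versions_agree(sample_sources)
```

Such a check could run before the "Release: VERSION" commit so that a forgotten file is caught early.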
--- a/transformers/__init__.py
+++ b/transformers/__init__.py
@@ -1,4 +1,4 @@
-__version__ = "2.0.0"
+__version__ = "2.1.1"

 # Work around to update TensorFlow's absl.logging threshold which alters the
 # default Python logging output behavior when present.
@@ -37,6 +37,7 @@ from .tokenization_bert import BertTokenizer, BasicTokenizer, WordpieceTokenizer
 from .tokenization_openai import OpenAIGPTTokenizer
 from .tokenization_transfo_xl import (TransfoXLTokenizer, TransfoXLCorpus)
 from .tokenization_gpt2 import GPT2Tokenizer
+from .tokenization_ctrl import CTRLTokenizer
 from .tokenization_xlnet import XLNetTokenizer, SPIECE_UNDERLINE
 from .tokenization_xlm import XLMTokenizer
 from .tokenization_roberta import RobertaTokenizer
@@ -49,7 +50,9 @@ from .configuration_bert import BertConfig, BERT_PRETRAINED_CONFIG_ARCHIVE_MAP
 from .configuration_openai import OpenAIGPTConfig, OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP
 from .configuration_transfo_xl import TransfoXLConfig, TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP
 from .configuration_gpt2 import GPT2Config, GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP
+from .configuration_ctrl import CTRLConfig, CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP
 from .configuration_xlnet import XLNetConfig, XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP
+from .configuration_ctrl import CTRLConfig, CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP
 from .configuration_xlm import XLMConfig, XLM_PRETRAINED_CONFIG_ARCHIVE_MAP
 from .configuration_roberta import RobertaConfig, ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP
 from .configuration_distilbert import DistilBertConfig, DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP
@@ -73,15 +76,19 @@ if is_torch_available():
    from .modeling_gpt2 import (GPT2PreTrainedModel, GPT2Model,
                                GPT2LMHeadModel, GPT2DoubleHeadsModel,
                                load_tf_weights_in_gpt2, GPT2_PRETRAINED_MODEL_ARCHIVE_MAP)
+    from .modeling_ctrl import (CTRLPreTrainedModel, CTRLModel,
+                                CTRLLMHeadModel,
+                                CTRL_PRETRAINED_MODEL_ARCHIVE_MAP)
    from .modeling_xlnet import (XLNetPreTrainedModel, XLNetModel, XLNetLMHeadModel,
-                                XLNetForSequenceClassification, XLNetForQuestionAnsweringSimple,
-                                XLNetForQuestionAnswering,
+                                XLNetForSequenceClassification, XLNetForMultipleChoice,
+                                XLNetForQuestionAnsweringSimple, XLNetForQuestionAnswering,
                                load_tf_weights_in_xlnet, XLNET_PRETRAINED_MODEL_ARCHIVE_MAP)
    from .modeling_xlm import (XLMPreTrainedModel , XLMModel,
                            XLMWithLMHeadModel, XLMForSequenceClassification,
                            XLMForQuestionAnswering, XLMForQuestionAnsweringSimple,
                            XLM_PRETRAINED_MODEL_ARCHIVE_MAP)
-    from .modeling_roberta import (RobertaForMaskedLM, RobertaModel, RobertaForSequenceClassification,
+    from .modeling_roberta import (RobertaForMaskedLM, RobertaModel,
+                                RobertaForSequenceClassification, RobertaForMultipleChoice,
                                ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP)
    from .modeling_distilbert import (DistilBertForMaskedLM, DistilBertModel,
                                DistilBertForSequenceClassification, DistilBertForQuestionAnswering,
@@ -103,60 +110,55 @@ if is_tf_available():
                                   TFBertForMaskedLM, TFBertForNextSentencePrediction,
                                   TFBertForSequenceClassification, TFBertForMultipleChoice,
                                   TFBertForTokenClassification, TFBertForQuestionAnswering,
-                                   load_bert_pt_weights_in_tf2,
                                   TF_BERT_PRETRAINED_MODEL_ARCHIVE_MAP)

    from .modeling_tf_gpt2 import (TFGPT2PreTrainedModel, TFGPT2MainLayer,
                                   TFGPT2Model, TFGPT2LMHeadModel, TFGPT2DoubleHeadsModel,
-                                   load_gpt2_pt_weights_in_tf2,
                                   TF_GPT2_PRETRAINED_MODEL_ARCHIVE_MAP)

    from .modeling_tf_openai import (TFOpenAIGPTPreTrainedModel, TFOpenAIGPTMainLayer,
                                     TFOpenAIGPTModel, TFOpenAIGPTLMHeadModel, TFOpenAIGPTDoubleHeadsModel,
-                                     load_openai_gpt_pt_weights_in_tf2,
                                     TF_OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP)

    from .modeling_tf_transfo_xl import (TFTransfoXLPreTrainedModel, TFTransfoXLMainLayer,
                                         TFTransfoXLModel, TFTransfoXLLMHeadModel,
-                                         load_transfo_xl_pt_weights_in_tf2,
                                         TF_TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_MAP)

    from .modeling_tf_xlnet import (TFXLNetPreTrainedModel, TFXLNetMainLayer,
                                    TFXLNetModel, TFXLNetLMHeadModel,
                                    TFXLNetForSequenceClassification,
                                    TFXLNetForQuestionAnsweringSimple,
-                                    load_xlnet_pt_weights_in_tf2,
                                    TF_XLNET_PRETRAINED_MODEL_ARCHIVE_MAP)

    from .modeling_tf_xlm import (TFXLMPreTrainedModel, TFXLMMainLayer,
                                  TFXLMModel, TFXLMWithLMHeadModel,
                                  TFXLMForSequenceClassification,
                                  TFXLMForQuestionAnsweringSimple,
-                                  load_xlm_pt_weights_in_tf2,
                                  TF_XLM_PRETRAINED_MODEL_ARCHIVE_MAP)

    from .modeling_tf_roberta import (TFRobertaPreTrainedModel, TFRobertaMainLayer,
                                      TFRobertaModel, TFRobertaForMaskedLM,
                                      TFRobertaForSequenceClassification,
-                                      load_roberta_pt_weights_in_tf2,
                                      TF_ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP)

    from .modeling_tf_distilbert import (TFDistilBertPreTrainedModel, TFDistilBertMainLayer,
                                         TFDistilBertModel, TFDistilBertForMaskedLM,
                                         TFDistilBertForSequenceClassification,
                                         TFDistilBertForQuestionAnswering,
-                                         load_distilbert_pt_weights_in_tf2,
                                         TF_DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP)

+    from .modeling_tf_ctrl import (TFCTRLPreTrainedModel, TFCTRLModel,
+                                    TFCTRLLMHeadModel,
+                                    TF_CTRL_PRETRAINED_MODEL_ARCHIVE_MAP)
+
 # TF 2.0 <=> PyTorch conversion utilities
-if is_tf_available() and is_torch_available():
-    from .modeling_tf_pytorch_utils import (convert_tf_weight_name_to_pt_weight_name,
-                                            load_pytorch_checkpoint_in_tf2_model,
-                                            load_pytorch_weights_in_tf2_model,
-                                            load_pytorch_model_in_tf2_model,
-                                            load_tf2_checkpoint_in_pytorch_model,
-                                            load_tf2_weights_in_pytorch_model,
-                                            load_tf2_model_in_pytorch_model)
+from .modeling_tf_pytorch_utils import (convert_tf_weight_name_to_pt_weight_name,
+                                        load_pytorch_checkpoint_in_tf2_model,
+                                        load_pytorch_weights_in_tf2_model,
+                                        load_pytorch_model_in_tf2_model,
+                                        load_tf2_checkpoint_in_pytorch_model,
+                                        load_tf2_weights_in_pytorch_model,
+                                        load_tf2_model_in_pytorch_model)

 if not is_tf_available() and not is_torch_available():
    logger.warning("Neither PyTorch nor TensorFlow >= 2.0 have been found."
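
The hunk above imports the TF 2.0 <=> PyTorch conversion utilities unconditionally instead of guarding them with both availability checks. One way such a module can stay importable when a framework is missing is to defer the framework import to call time. The sketch below illustrates that pattern only; `is_module_available` and `require` are hypothetical helpers, and whether `modeling_tf_pytorch_utils` works this way is an assumption, since that file is not shown here.

```python
import importlib
import importlib.util


def is_module_available(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


def require(module_name):
    """Import a module lazily, raising a readable ImportError if it is absent."""
    if not is_module_available(module_name):
        raise ImportError(
            "This function needs '{}' to be installed.".format(module_name))
    return importlib.import_module(module_name)


def dumps_with_optional_backend(payload, backend="json"):
    """Hypothetical utility: the backend is only imported when it is called."""
    module = require(backend)
    return module.dumps(payload)
```

With this layout, importing the utility module never fails; only calling a function whose backend is missing does.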
--- a/transformers/configuration_auto.py
+++ b/transformers/configuration_auto.py
@@ -26,6 +26,7 @@ from .configuration_xlnet import XLNetConfig
 from .configuration_xlm import XLMConfig
 from .configuration_roberta import RobertaConfig
 from .configuration_distilbert import DistilBertConfig
+from .configuration_ctrl import CTRLConfig

 logger = logging.getLogger(__name__)

@@ -49,7 +50,7 @@ class AutoConfig(object):
            - contains `xlnet`: XLNetConfig (XLNet model)
            - contains `xlm`: XLMConfig (XLM model)
            - contains `roberta`: RobertaConfig (RoBERTa model)
-
+            - contains `ctrl`: CTRLConfig (CTRL model)
        This class cannot be instantiated using `__init__()` (throw an error).
    """
    def __init__(self):
@@ -71,7 +72,7 @@ class AutoConfig(object):
            - contains `xlnet`: XLNetConfig (XLNet model)
            - contains `xlm`: XLMConfig (XLM model)
            - contains `roberta`: RobertaConfig (RoBERTa model)
-
+            - contains `ctrl`: CTRLConfig (CTRL model)
        Params:
            pretrained_model_name_or_path: either:

@@ -129,7 +130,8 @@ class AutoConfig(object):
            return XLNetConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
        elif 'xlm' in pretrained_model_name_or_path:
            return XLMConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
-
+        elif 'ctrl' in pretrained_model_name_or_path:
+            return CTRLConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
        raise ValueError("Unrecognized model identifier in {}. Should contains one of "
                         "'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', "
-                         "'xlm', 'roberta'".format(pretrained_model_name_or_path))
+                         "'xlm', 'roberta', 'ctrl'".format(pretrained_model_name_or_path))
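
`AutoConfig` picks a configuration class by looking for a substring in the model name. A minimal sketch of that dispatch follows, using class names as plain strings. The ordering of the patterns is an assumption (only the `xlnet`, `xlm` and `ctrl` branches are visible in this hunk); it matters because names such as `roberta-base` also contain `bert`.

```python
# Stand-in table: pattern -> configuration class name (strings, not real classes).
# More specific patterns come first so 'roberta' is not captured by 'bert'.
CONFIG_PATTERNS = [
    ("distilbert", "DistilBertConfig"),
    ("roberta", "RobertaConfig"),
    ("bert", "BertConfig"),
    ("openai-gpt", "OpenAIGPTConfig"),
    ("gpt2", "GPT2Config"),
    ("transfo-xl", "TransfoXLConfig"),
    ("xlnet", "XLNetConfig"),
    ("xlm", "XLMConfig"),
    ("ctrl", "CTRLConfig"),
]


def resolve_config_name(pretrained_model_name_or_path):
    """Return the first configuration name whose pattern occurs in the identifier."""
    for pattern, config_name in CONFIG_PATTERNS:
        if pattern in pretrained_model_name_or_path:
            return config_name
    raise ValueError(
        "Unrecognized model identifier in {}. Should contain one of {}".format(
            pretrained_model_name_or_path,
            ", ".join("'{}'".format(p) for p, _ in CONFIG_PATTERNS)))
```

Because matching is by substring, a local path such as `./my-ctrl-run/` would also resolve to the CTRL configuration in this sketch.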
--- a/transformers/configuration_bert.py
+++ b/transformers/configuration_bert.py
@@ -40,6 +40,8 @@ BERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
    'bert-large-uncased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-finetuned-squad-config.json",
    'bert-large-cased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-finetuned-squad-config.json",
    'bert-base-cased-finetuned-mrpc': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-finetuned-mrpc-config.json",
+    'bert-base-german-dbmdz-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-cased-config.json",
+    'bert-base-german-dbmdz-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-uncased-config.json",
 }


--- a/transformers/configuration_ctrl.py
+++ b/transformers/configuration_ctrl.py
@@ -0,0 +1,143 @@
+# coding=utf-8
+# Copyright 2018 Salesforce and HuggingFace Inc. team.
+# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" Salesforce CTRL configuration """
+
+from __future__ import absolute_import, division, print_function, unicode_literals
+
+import json
+import logging
+import sys
+from io import open
+
+from .configuration_utils import PretrainedConfig
+
+logger = logging.getLogger(__name__)
+
+CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP = {"ctrl": "https://storage.googleapis.com/sf-ctrl/pytorch/ctrl-config.json"}
+
+class CTRLConfig(PretrainedConfig):
+    """Configuration class to store the configuration of a `CTRLModel`.
+
+    Args:
+        vocab_size_or_config_json_file: Vocabulary size of `input_ids` in `CTRLModel` or a configuration json file.
+        n_positions: Number of positional embeddings.
+        n_ctx: Size of the causal mask (usually same as n_positions).
+        dff: Size of the inner dimension of the FFN.
+        n_embd: Dimensionality of the embeddings and hidden states.
+        n_layer: Number of hidden layers in the Transformer encoder.
+        n_head: Number of attention heads for each attention layer in
+            the Transformer encoder.
+        layer_norm_epsilon: epsilon to use in the layer norm layers
+        resid_pdrop: The dropout probability for all fully connected
+            layers in the embeddings, encoder, and pooler.
+        attn_pdrop: The dropout ratio for the attention
+            probabilities.
+        embd_pdrop: The dropout ratio for the embeddings.
+        initializer_range: The stddev of the truncated_normal_initializer for
+            initializing all weight matrices.
+    """
+    pretrained_config_archive_map = CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP
+
+    def __init__(
+        self,
+        vocab_size_or_config_json_file=246534,
+        n_positions=256,
+        n_ctx=256,
+        n_embd=1280,
+        dff=8192,
+        n_layer=48,
+        n_head=16,
+        resid_pdrop=0.1,
+        embd_pdrop=0.1,
+        attn_pdrop=0.1,
+        layer_norm_epsilon=1e-6,
+        initializer_range=0.02,
+
+        num_labels=1,
+        summary_type='cls_index',
+        summary_use_proj=True,
+        summary_activation=None,
+        summary_proj_to_labels=True,
+        summary_first_dropout=0.1,
+        **kwargs
+    ):
+        """Constructs CTRLConfig.
+
+        Args:
+            vocab_size_or_config_json_file: Vocabulary size of `input_ids` in `CTRLModel` or a configuration json file.
+            n_positions: Number of positional embeddings.
+            n_ctx: Size of the causal mask (usually same as n_positions).
+            dff: Size of the inner dimension of the FFN.
+            n_embd: Dimensionality of the embeddings and hidden states.
+            n_layer: Number of hidden layers in the Transformer encoder.
+            n_head: Number of attention heads for each attention layer in
+                the Transformer encoder.
+            layer_norm_epsilon: epsilon to use in the layer norm layers
+            resid_pdrop: The dropout probability for all fully connected
+                layers in the embeddings, encoder, and pooler.
+            attn_pdrop: The dropout ratio for the attention
+                probabilities.
+            embd_pdrop: The dropout ratio for the embeddings.
+            initializer_range: The stddev of the truncated_normal_initializer for
+                initializing all weight matrices.
+        """
+        super(CTRLConfig, self).__init__(**kwargs)
+
+        self.vocab_size = vocab_size_or_config_json_file if isinstance(vocab_size_or_config_json_file, int) else -1
+        self.n_ctx = n_ctx
+        self.n_positions = n_positions
+        self.n_embd = n_embd
+        self.n_layer = n_layer
+        self.n_head = n_head
+        self.dff = dff
+        self.resid_pdrop = resid_pdrop
+        self.embd_pdrop = embd_pdrop
+        self.attn_pdrop = attn_pdrop
+        self.layer_norm_epsilon = layer_norm_epsilon
+        self.initializer_range = initializer_range
+
+        self.num_labels = num_labels
+        self.summary_type = summary_type
+        self.summary_use_proj = summary_use_proj
+        self.summary_activation = summary_activation
+        self.summary_first_dropout = summary_first_dropout
+        self.summary_proj_to_labels = summary_proj_to_labels
+        if isinstance(vocab_size_or_config_json_file, str) or (sys.version_info[0] == 2
+                        and isinstance(vocab_size_or_config_json_file, unicode)):
+            with open(vocab_size_or_config_json_file, "r", encoding="utf-8") as reader:
+                json_config = json.loads(reader.read())
+            for key, value in json_config.items():
+                self.__dict__[key] = value
+        elif not isinstance(vocab_size_or_config_json_file, int):
+            raise ValueError(
+                "First argument must be either a vocabulary size (int) "
+                "or the path to a pretrained model config file (str)"
+            )
+
+    @property
+    def max_position_embeddings(self):
+        return self.n_positions
+
+    @property
+    def hidden_size(self):
+        return self.n_embd
+
+    @property
+    def num_attention_heads(self):
+        return self.n_head
+
+    @property
+    def num_hidden_layers(self):
+        return self.n_layer
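
The new `CTRLConfig` accepts either a vocabulary size or a path to a JSON file as its first argument, and exposes GPT-2 style attribute names under generic aliases through read-only properties. The sketch below reproduces that shape with a reduced, hypothetical `SketchConfig` class (Python 3 only, no `PretrainedConfig` base), so it is an illustration of the pattern and not the class added in this diff.

```python
import json
import os
import tempfile


class SketchConfig(object):
    """Reduced stand-in for a config that takes an int or a JSON path."""

    def __init__(self, vocab_size_or_config_json_file=246534,
                 n_positions=256, n_embd=1280, n_layer=48, n_head=16):
        is_int = isinstance(vocab_size_or_config_json_file, int)
        self.vocab_size = vocab_size_or_config_json_file if is_int else -1
        self.n_positions = n_positions
        self.n_embd = n_embd
        self.n_layer = n_layer
        self.n_head = n_head
        if isinstance(vocab_size_or_config_json_file, str):
            # Values from the file override the keyword defaults.
            with open(vocab_size_or_config_json_file, "r", encoding="utf-8") as reader:
                self.__dict__.update(json.load(reader))
        elif not is_int:
            raise ValueError(
                "First argument must be either a vocabulary size (int) "
                "or the path to a pretrained model config file (str)")

    @property
    def hidden_size(self):
        # Generic alias for the model-specific name.
        return self.n_embd

    @property
    def num_hidden_layers(self):
        return self.n_layer


# Usage: round-trip a small config through a temporary JSON file.
defaults = SketchConfig()
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as handle:
    json.dump({"vocab_size": 10, "n_embd": 8}, handle)
loaded = SketchConfig(handle.name)
os.remove(handle.name)
```

The aliases let shared code read `hidden_size` regardless of which model family produced the configuration.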
--- a/transformers/configuration_gpt2.py
+++ b/transformers/configuration_gpt2.py
@@ -28,7 +28,8 @@ logger = logging.getLogger(__name__)

 GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP = {"gpt2": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config.json",
                                      "gpt2-medium": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-config.json",
-                                      "gpt2-large": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-config.json"}
+                                      "gpt2-large": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-config.json",
+                                      "distilgpt2": "https://s3.amazonaws.com/models.huggingface.co/bert/distilgpt2-config.json",}

 class GPT2Config(PretrainedConfig):
    """Configuration class to store the configuration of a `GPT2Model`.
--- a/transformers/configuration_utils.py
+++ b/transformers/configuration_utils.py
@@ -53,7 +53,8 @@ class PretrainedConfig(object):
        self.num_labels = kwargs.pop('num_labels', 2)
        self.output_attentions = kwargs.pop('output_attentions', False)
        self.output_hidden_states = kwargs.pop('output_hidden_states', False)
-        self.torchscript = kwargs.pop('torchscript', False)
+        self.output_past = kwargs.pop('output_past', True)  # Not used by all models
+        self.torchscript = kwargs.pop('torchscript', False)  # Only used by PyTorch models
        self.use_bfloat16 = kwargs.pop('use_bfloat16', False)
        self.pruned_heads = kwargs.pop('pruned_heads', {})

@@ -130,20 +131,19 @@ class PretrainedConfig(object):
        # redirect to the cache, if necessary
        try:
            resolved_config_file = cached_path(config_file, cache_dir=cache_dir, force_download=force_download, proxies=proxies)
-        except EnvironmentError as e:
+        except EnvironmentError:
            if pretrained_model_name_or_path in cls.pretrained_config_archive_map:
-                logger.error(
-                    "Couldn't reach server at '{}' to download pretrained model configuration file.".format(
-                        config_file))
+                msg = "Couldn't reach server at '{}' to download pretrained model configuration file.".format(
+                        config_file)
            else:
-                logger.error(
-                    "Model name '{}' was not found in model name list ({}). "
-                    "We assumed '{}' was a path or url but couldn't find any file "
-                    "associated to this path or url.".format(
+                msg = "Model name '{}' was not found in model name list ({}). " \
+                      "We assumed '{}' was a path or url to a configuration file named {} or " \
+                      "a directory containing such a file but couldn't find any such file at this path or url.".format(
                        pretrained_model_name_or_path,
                        ', '.join(cls.pretrained_config_archive_map.keys()),
-                        config_file))
-            raise e
+                        config_file, CONFIG_NAME)
+            raise EnvironmentError(msg)
+
        if resolved_config_file == config_file:
            logger.info("loading configuration file {}".format(config_file))
        else:
@@ -154,7 +154,7 @@ class PretrainedConfig(object):
        config = cls.from_json_file(resolved_config_file)

        if hasattr(config, 'pruned_heads'):
-            config.pruned_heads = dict((int(key), set(value)) for key, value in config.pruned_heads.items())
+            config.pruned_heads = dict((int(key), value) for key, value in config.pruned_heads.items())

        # Update config with kwargs if needed
        to_remove = []
@@ -165,7 +165,7 @@ class PretrainedConfig(object):
        for key in to_remove:
            kwargs.pop(key, None)

-        logger.info("Model config %s", config)
+        logger.info("Model config %s", str(config))
        if return_unused_kwargs:
            return config, kwargs
        else:
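
Two behaviours change in `from_pretrained` above: a download failure now raises a new `EnvironmentError` carrying a descriptive message instead of logging and re-raising, and `pruned_heads` keys are converted back to `int` after the JSON round trip without wrapping the values in `set`. The sketch below mirrors both ideas. `resolve_config`, `normalize_pruned_heads` and the `fetch` callable are hypothetical names, and the message texts are shortened.

```python
def resolve_config(name, archive_map, fetch):
    """Call fetch(location); on failure raise EnvironmentError with context."""
    location = archive_map.get(name, name)
    try:
        return fetch(location)
    except EnvironmentError:
        if name in archive_map:
            msg = "Couldn't reach server at '{}' to download the configuration file.".format(
                location)
        else:
            msg = "Model name '{}' was not found in model name list ({}).".format(
                name, ", ".join(archive_map.keys()))
        raise EnvironmentError(msg)


def normalize_pruned_heads(pruned_heads):
    """JSON object keys are strings; convert them back to integer layer indices."""
    return dict((int(key), value) for key, value in pruned_heads.items())


def always_fails(location):
    # Stand-in for a download helper that cannot reach the file.
    raise EnvironmentError("low-level failure for " + location)
```

Raising with the message attached means callers that never configure logging still see why the lookup failed.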
--- a/transformers/convert_pytorch_checkpoint_to_tf2.py
+++ b/transformers/convert_pytorch_checkpoint_to_tf2.py
@@ -24,14 +24,16 @@ import tensorflow as tf

 from transformers import is_torch_available, cached_path

-from transformers import (BertConfig, TFBertForPreTraining, TFBertForQuestionAnswering, TFBertForSequenceClassification, load_bert_pt_weights_in_tf2, BERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
-                                  GPT2Config, TFGPT2LMHeadModel, load_gpt2_pt_weights_in_tf2, GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP,
-                                  XLNetConfig, TFXLNetLMHeadModel, load_xlnet_pt_weights_in_tf2, XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP,
-                                  XLMConfig, TFXLMWithLMHeadModel, load_xlm_pt_weights_in_tf2, XLM_PRETRAINED_CONFIG_ARCHIVE_MAP,
-                                  TransfoXLConfig, TFTransfoXLLMHeadModel, load_transfo_xl_pt_weights_in_tf2, TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP,
-                                  OpenAIGPTConfig, TFOpenAIGPTLMHeadModel, load_openai_gpt_pt_weights_in_tf2, OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP,
-                                  RobertaConfig, TFRobertaForMaskedLM, TFRobertaForSequenceClassification, load_roberta_pt_weights_in_tf2, ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,
-                                  DistilBertConfig, TFDistilBertForMaskedLM, TFDistilBertForQuestionAnswering, load_distilbert_pt_weights_in_tf2, DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP)
+from transformers import (load_pytorch_checkpoint_in_tf2_model,
+    BertConfig, TFBertForPreTraining, TFBertForQuestionAnswering, TFBertForSequenceClassification, BERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
+                                  GPT2Config, TFGPT2LMHeadModel, GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP,
+                                  XLNetConfig, TFXLNetLMHeadModel, XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP,
+                                  XLMConfig, TFXLMWithLMHeadModel, XLM_PRETRAINED_CONFIG_ARCHIVE_MAP,
+                                  TransfoXLConfig, TFTransfoXLLMHeadModel, TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP,
+                                  OpenAIGPTConfig, TFOpenAIGPTLMHeadModel, OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP,
+                                  RobertaConfig, TFRobertaForMaskedLM, TFRobertaForSequenceClassification, ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,
+                                  DistilBertConfig, TFDistilBertForMaskedLM, TFDistilBertForQuestionAnswering, DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
+                                  CTRLConfig, TFCTRLLMHeadModel, CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP)

 if is_torch_available():
    import torch
@@ -43,7 +45,8 @@ if is_torch_available():
                                      TransfoXLLMHeadModel, TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_MAP,
                                      OpenAIGPTLMHeadModel, OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP,
                                      RobertaForMaskedLM, RobertaForSequenceClassification, ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP,
-                                      DistilBertForMaskedLM, DistilBertForQuestionAnswering, DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP)
+                                      DistilBertForMaskedLM, DistilBertForQuestionAnswering, DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
+                                      CTRLLMHeadModel, CTRL_PRETRAINED_MODEL_ARCHIVE_MAP)
 else:
    (BertForPreTraining, BertForQuestionAnswering, BertForSequenceClassification, BERT_PRETRAINED_MODEL_ARCHIVE_MAP,
    GPT2LMHeadModel, GPT2_PRETRAINED_MODEL_ARCHIVE_MAP,
@@ -52,7 +55,8 @@ else:
    TransfoXLLMHeadModel, TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_MAP,
    OpenAIGPTLMHeadModel, OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP,
    RobertaForMaskedLM, RobertaForSequenceClassification, ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP,
-    DistilBertForMaskedLM, DistilBertForQuestionAnswering, DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP,) = (
+    DistilBertForMaskedLM, DistilBertForQuestionAnswering, DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP,
+    CTRLLMHeadModel, CTRL_PRETRAINED_MODEL_ARCHIVE_MAP) = (
        None, None, None, None,
        None, None,
        None, None,
@@ -60,33 +64,35 @@ else:
        None, None,
        None, None,
        None, None, None,
-        None, None, None,)
+        None, None, None,
+        None, None)


 import logging
 logging.basicConfig(level=logging.INFO)

 MODEL_CLASSES = {
-    'bert': (BertConfig, TFBertForPreTraining, load_bert_pt_weights_in_tf2, BertForPreTraining, BERT_PRETRAINED_MODEL_ARCHIVE_MAP, BERT_PRETRAINED_CONFIG_ARCHIVE_MAP),
-    'bert-large-uncased-whole-word-masking-finetuned-squad': (BertConfig, TFBertForQuestionAnswering, load_bert_pt_weights_in_tf2, BertForQuestionAnswering, BERT_PRETRAINED_MODEL_ARCHIVE_MAP, BERT_PRETRAINED_CONFIG_ARCHIVE_MAP),
-    'bert-large-cased-whole-word-masking-finetuned-squad': (BertConfig, TFBertForQuestionAnswering, load_bert_pt_weights_in_tf2, BertForQuestionAnswering, BERT_PRETRAINED_MODEL_ARCHIVE_MAP, BERT_PRETRAINED_CONFIG_ARCHIVE_MAP),
-    'bert-base-cased-finetuned-mrpc': (BertConfig, TFBertForSequenceClassification, load_bert_pt_weights_in_tf2, BertForSequenceClassification, BERT_PRETRAINED_MODEL_ARCHIVE_MAP, BERT_PRETRAINED_CONFIG_ARCHIVE_MAP),
-    'gpt2': (GPT2Config, TFGPT2LMHeadModel, load_gpt2_pt_weights_in_tf2, GPT2LMHeadModel, GPT2_PRETRAINED_MODEL_ARCHIVE_MAP, GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP),
-    'xlnet': (XLNetConfig, TFXLNetLMHeadModel, load_xlnet_pt_weights_in_tf2, XLNetLMHeadModel, XLNET_PRETRAINED_MODEL_ARCHIVE_MAP, XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP),
-    'xlm': (XLMConfig, TFXLMWithLMHeadModel, load_xlm_pt_weights_in_tf2, XLMWithLMHeadModel, XLM_PRETRAINED_MODEL_ARCHIVE_MAP, XLM_PRETRAINED_CONFIG_ARCHIVE_MAP),
-    'transfo-xl': (TransfoXLConfig, TFTransfoXLLMHeadModel, load_transfo_xl_pt_weights_in_tf2, TransfoXLLMHeadModel, TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_MAP, TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP),
-    'openai-gpt': (OpenAIGPTConfig, TFOpenAIGPTLMHeadModel, load_openai_gpt_pt_weights_in_tf2, OpenAIGPTLMHeadModel, OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP, OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP),
-    'roberta': (RobertaConfig, TFRobertaForMaskedLM, load_roberta_pt_weights_in_tf2, RobertaForMaskedLM, ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP, ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP),
-    'roberta-large-mnli': (RobertaConfig, TFRobertaForSequenceClassification, load_roberta_pt_weights_in_tf2, RobertaForSequenceClassification, ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP, ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP),
-    'distilbert': (DistilBertConfig, TFDistilBertForMaskedLM, load_distilbert_pt_weights_in_tf2, DistilBertForMaskedLM, DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP, DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP),
-    'distilbert-base-uncased-distilled-squad': (DistilBertConfig, TFDistilBertForQuestionAnswering, load_distilbert_pt_weights_in_tf2, DistilBertForQuestionAnswering, DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP, DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP),
+    'bert': (BertConfig, TFBertForPreTraining, BertForPreTraining, BERT_PRETRAINED_MODEL_ARCHIVE_MAP, BERT_PRETRAINED_CONFIG_ARCHIVE_MAP),
+    'bert-large-uncased-whole-word-masking-finetuned-squad': (BertConfig, TFBertForQuestionAnswering, BertForQuestionAnswering, BERT_PRETRAINED_MODEL_ARCHIVE_MAP, BERT_PRETRAINED_CONFIG_ARCHIVE_MAP),
+    'bert-large-cased-whole-word-masking-finetuned-squad': (BertConfig, TFBertForQuestionAnswering, BertForQuestionAnswering, BERT_PRETRAINED_MODEL_ARCHIVE_MAP, BERT_PRETRAINED_CONFIG_ARCHIVE_MAP),
+    'bert-base-cased-finetuned-mrpc': (BertConfig, TFBertForSequenceClassification, BertForSequenceClassification, BERT_PRETRAINED_MODEL_ARCHIVE_MAP, BERT_PRETRAINED_CONFIG_ARCHIVE_MAP),
+    'gpt2': (GPT2Config, TFGPT2LMHeadModel, GPT2LMHeadModel, GPT2_PRETRAINED_MODEL_ARCHIVE_MAP, GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP),
+    'xlnet': (XLNetConfig, TFXLNetLMHeadModel, XLNetLMHeadModel, XLNET_PRETRAINED_MODEL_ARCHIVE_MAP, XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP),
+    'xlm': (XLMConfig, TFXLMWithLMHeadModel, XLMWithLMHeadModel, XLM_PRETRAINED_MODEL_ARCHIVE_MAP, XLM_PRETRAINED_CONFIG_ARCHIVE_MAP),
+    'transfo-xl': (TransfoXLConfig, TFTransfoXLLMHeadModel, TransfoXLLMHeadModel, TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_MAP, TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP),
+    'openai-gpt': (OpenAIGPTConfig, TFOpenAIGPTLMHeadModel, OpenAIGPTLMHeadModel, OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP, OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP),
+    'roberta': (RobertaConfig, TFRobertaForMaskedLM, RobertaForMaskedLM, ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP, ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP),
+    'roberta-large-mnli': (RobertaConfig, TFRobertaForSequenceClassification, RobertaForSequenceClassification, ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP, ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP),
+    'distilbert': (DistilBertConfig, TFDistilBertForMaskedLM, DistilBertForMaskedLM, DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP, DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP),
+    'distilbert-base-uncased-distilled-squad': (DistilBertConfig, TFDistilBertForQuestionAnswering, DistilBertForQuestionAnswering, DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP, DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP),
+    'ctrl': (CTRLConfig, TFCTRLLMHeadModel, CTRLLMHeadModel, CTRL_PRETRAINED_MODEL_ARCHIVE_MAP, CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP)
 }

 def convert_pt_checkpoint_to_tf(model_type, pytorch_checkpoint_path, config_file, tf_dump_path, compare_with_pt_model=False, use_cached_models=True):
    if model_type not in MODEL_CLASSES:
        raise ValueError("Unrecognized model type, should be one of {}.".format(list(MODEL_CLASSES.keys())))

-    config_class, model_class, loading_fct, pt_model_class, aws_model_maps, aws_config_map = MODEL_CLASSES[model_type]
+    config_class, model_class, pt_model_class, aws_model_maps, aws_config_map = MODEL_CLASSES[model_type]

    # Initialise TF model
    if config_file in aws_config_map:
@@ -100,7 +106,8 @@ def convert_pt_checkpoint_to_tf(model_type, pytorch_checkpoint_path, config_file
    # Load weights from tf checkpoint
    if pytorch_checkpoint_path in aws_model_maps:
        pytorch_checkpoint_path = cached_path(aws_model_maps[pytorch_checkpoint_path], force_download=not use_cached_models)
-    tf_model = loading_fct(tf_model, pytorch_checkpoint_path)
+    # Load PyTorch checkpoint in tf2 model:
+    tf_model = load_pytorch_checkpoint_in_tf2_model(tf_model, pytorch_checkpoint_path)

    if compare_with_pt_model:
        inputs_list = [[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]]
@@ -142,7 +149,7 @@ def convert_all_pt_checkpoints_to_tf(args_model_type, tf_dump_path, model_shortc
        if model_type not in MODEL_CLASSES:
            raise ValueError("Unrecognized model type {}, should be one of {}.".format(model_type, list(MODEL_CLASSES.keys())))

-        config_class, model_class, loading_fct, pt_model_class, aws_model_maps, aws_config_map = MODEL_CLASSES[model_type]
+        config_class, model_class, pt_model_class, aws_model_maps, aws_config_map = MODEL_CLASSES[model_type]

        if model_shortcut_names_or_path is None:
            model_shortcut_names_or_path = list(aws_model_maps.keys())
@@ -173,10 +180,12 @@ def convert_all_pt_checkpoints_to_tf(args_model_type, tf_dump_path, model_shortc
            else:
                model_file = cached_path(model_shortcut_name, force_download=not use_cached_models)

-            convert_pt_checkpoint_to_tf(model_type,
-                                        model_file,
-                                        config_file,
-                                        os.path.join(tf_dump_path, model_shortcut_name + '-tf_model.h5'),
+            if os.path.isfile(model_shortcut_name):
+                model_shortcut_name = 'converted_model'
+            convert_pt_checkpoint_to_tf(model_type=model_type,
+                                        pytorch_checkpoint_path=model_file,
+                                        config_file=config_file,
+                                        tf_dump_path=os.path.join(tf_dump_path, model_shortcut_name + '-tf_model.h5'),
                                        compare_with_pt_model=compare_with_pt_model)
            os.remove(config_file)
            os.remove(model_file)
@@ -228,6 +237,7 @@ if __name__ == "__main__":
    convert_all_pt_checkpoints_to_tf(args.model_type.lower() if args.model_type is not None else None,
                                        args.tf_dump_path,
                                        model_shortcut_names_or_path=[args.pytorch_checkpoint_path] if args.pytorch_checkpoint_path is not None else None,
+                                        config_shortcut_names_or_path=[args.config_file] if args.config_file is not None else None,
                                        compare_with_pt_model=args.compare_with_pt_model,
                                        use_cached_models=args.use_cached_models,
                                        only_convert_finetuned_models=args.only_convert_finetuned_models)
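Review note: the hunks above drop the per-model `loading_fct` from each `MODEL_CLASSES` entry, so every entry becomes a five-element tuple and all models share one loader. A minimal sketch of that lookup-and-unpack pattern follows. The string values and the helper name `lookup_model_classes` are hypothetical stand-ins, not the script's real classes or functions.

```python
# Hypothetical registry: strings stand in for the real config/model classes
# and archive maps so the sketch runs without transformers installed.
MODEL_CLASSES = {
    "distilbert": ("DistilBertConfig", "TFDistilBertForMaskedLM", "DistilBertForMaskedLM",
                   {"distilbert-base-uncased": "<weights url>"},
                   {"distilbert-base-uncased": "<config url>"}),
    "ctrl": ("CTRLConfig", "TFCTRLLMHeadModel", "CTRLLMHeadModel",
             {"ctrl": "<weights url>"}, {"ctrl": "<config url>"}),
}


def lookup_model_classes(model_type):
    """Return the five-element entry for model_type, or raise ValueError."""
    if model_type not in MODEL_CLASSES:
        raise ValueError("Unrecognized model type, should be one of {}.".format(list(MODEL_CLASSES.keys())))
    # Five names on the left must match five tuple elements on the right;
    # a stale six-element entry would fail here with "too many values to unpack".
    config_class, model_class, pt_model_class, aws_model_maps, aws_config_map = MODEL_CLASSES[model_type]
    return config_class, model_class, pt_model_class, aws_model_maps, aws_config_map
```

Because the unpacking is positional, every entry in the dict has to be updated in the same commit as the unpacking sites, which is what this diff does.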
--- a/transformers/data/processors/glue.py
+++ b/transformers/data/processors/glue.py
@@ -79,17 +79,13 @@ def glue_convert_examples_to_features(examples, tokenizer,
        if ex_index % 10000 == 0:
            logger.info("Writing example %d" % (ex_index))
        if is_tf_dataset:
-            example = InputExample(example['idx'].numpy(),
-                                   example['sentence1'].numpy().decode('utf-8'),
-                                   example['sentence2'].numpy().decode('utf-8'),
-                                   str(example['label'].numpy()))
+            example = processor.get_example_from_tensor_dict(example)

        inputs = tokenizer.encode_plus(
            example.text_a,
            example.text_b,
            add_special_tokens=True,
            max_length=max_length,
-            truncate_first_sequence=True  # We're truncating the first sequence in priority
        )
        input_ids, token_type_ids = inputs["input_ids"], inputs["token_type_ids"]

@@ -157,6 +153,13 @@ def glue_convert_examples_to_features(examples, tokenizer,
 class MrpcProcessor(DataProcessor):
    """Processor for the MRPC data set (GLUE version)."""

+    def get_example_from_tensor_dict(self, tensor_dict):
+        """See base class."""
+        return InputExample(tensor_dict['idx'].numpy(),
+                            tensor_dict['sentence1'].numpy().decode('utf-8'),
+                            tensor_dict['sentence2'].numpy().decode('utf-8'),
+                            str(tensor_dict['label'].numpy()))
+
    def get_train_examples(self, data_dir):
        """See base class."""
        logger.info("LOOKING AT {}".format(os.path.join(data_dir, "train.tsv")))
@@ -190,6 +193,13 @@ class MrpcProcessor(DataProcessor):
 class MnliProcessor(DataProcessor):
    """Processor for the MultiNLI data set (GLUE version)."""

+    def get_example_from_tensor_dict(self, tensor_dict):
+        """See base class."""
+        return InputExample(tensor_dict['idx'].numpy(),
+                            tensor_dict['premise'].numpy().decode('utf-8'),
+                            tensor_dict['hypothesis'].numpy().decode('utf-8'),
+                            str(tensor_dict['label'].numpy()))
+
    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
@@ -233,6 +243,13 @@ class MnliMismatchedProcessor(MnliProcessor):
 class ColaProcessor(DataProcessor):
    """Processor for the CoLA data set (GLUE version)."""

+    def get_example_from_tensor_dict(self, tensor_dict):
+        """See base class."""
+        return InputExample(tensor_dict['idx'].numpy(),
+                            tensor_dict['sentence'].numpy().decode('utf-8'),
+                            None,
+                            str(tensor_dict['label'].numpy()))
+
    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
@@ -262,6 +279,13 @@ class ColaProcessor(DataProcessor):
 class Sst2Processor(DataProcessor):
    """Processor for the SST-2 data set (GLUE version)."""

+    def get_example_from_tensor_dict(self, tensor_dict):
+        """See base class."""
+        return InputExample(tensor_dict['idx'].numpy(),
+                            tensor_dict['sentence'].numpy().decode('utf-8'),
+                            None,
+                            str(tensor_dict['label'].numpy()))
+
    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
@@ -293,6 +317,13 @@ class Sst2Processor(DataProcessor):
 class StsbProcessor(DataProcessor):
    """Processor for the STS-B data set (GLUE version)."""

+    def get_example_from_tensor_dict(self, tensor_dict):
+        """See base class."""
+        return InputExample(tensor_dict['idx'].numpy(),
+                            tensor_dict['sentence1'].numpy().decode('utf-8'),
+                            tensor_dict['sentence2'].numpy().decode('utf-8'),
+                            str(tensor_dict['label'].numpy()))
+
    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
@@ -325,6 +356,13 @@ class StsbProcessor(DataProcessor):
 class QqpProcessor(DataProcessor):
    """Processor for the QQP data set (GLUE version)."""

+    def get_example_from_tensor_dict(self, tensor_dict):
+        """See base class."""
+        return InputExample(tensor_dict['idx'].numpy(),
+                            tensor_dict['question1'].numpy().decode('utf-8'),
+                            tensor_dict['question2'].numpy().decode('utf-8'),
+                            str(tensor_dict['label'].numpy()))
+
    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
@@ -360,6 +398,13 @@ class QqpProcessor(DataProcessor):
 class QnliProcessor(DataProcessor):
    """Processor for the QNLI data set (GLUE version)."""

+    def get_example_from_tensor_dict(self, tensor_dict):
+        """See base class."""
+        return InputExample(tensor_dict['idx'].numpy(),
+                            tensor_dict['question'].numpy().decode('utf-8'),
+                            tensor_dict['sentence'].numpy().decode('utf-8'),
+                            str(tensor_dict['label'].numpy()))
+
    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
@@ -393,6 +438,13 @@ class QnliProcessor(DataProcessor):
 class RteProcessor(DataProcessor):
    """Processor for the RTE data set (GLUE version)."""

+    def get_example_from_tensor_dict(self, tensor_dict):
+        """See base class."""
+        return InputExample(tensor_dict['idx'].numpy(),
+                            tensor_dict['sentence1'].numpy().decode('utf-8'),
+                            tensor_dict['sentence2'].numpy().decode('utf-8'),
+                            str(tensor_dict['label'].numpy()))
+
    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
@@ -425,6 +477,13 @@ class RteProcessor(DataProcessor):
 class WnliProcessor(DataProcessor):
    """Processor for the WNLI data set (GLUE version)."""

+    def get_example_from_tensor_dict(self, tensor_dict):
+        """See base class."""
+        return InputExample(tensor_dict['idx'].numpy(),
+                            tensor_dict['sentence1'].numpy().decode('utf-8'),
+                            tensor_dict['sentence2'].numpy().decode('utf-8'),
+                            str(tensor_dict['label'].numpy()))
+
    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
--- a/transformers/data/processors/utils.py
+++ b/transformers/data/processors/utils.py
@@ -86,6 +86,15 @@ class InputFeatures(object):
 class DataProcessor(object):
    """Base class for data converters for sequence classification data sets."""

+    def get_example_from_tensor_dict(self, tensor_dict):
+        """Gets an example from a dict of TensorFlow tensors.
+
+        Args:
+            tensor_dict: Keys and values should match the corresponding GLUE
+                tensorflow_datasets examples.
+        """
+        raise NotImplementedError()
+
    def get_train_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the train set."""
        raise NotImplementedError()
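Review note: the `glue.py` and `utils.py` hunks move the tensor-to-`InputExample` conversion out of `glue_convert_examples_to_features` and into a per-task `get_example_from_tensor_dict` override. A minimal sketch of that pattern for a single-sentence task follows. `FakeTensor` and `SingleSentenceProcessor` are hypothetical names that only mimic the `.numpy()` call, so the sketch runs without TensorFlow; `InputExample` is reduced to its four fields.

```python
class FakeTensor(object):
    """Hypothetical stand-in for an eager tensor: only .numpy() is mimicked."""

    def __init__(self, value):
        self._value = value

    def numpy(self):
        return self._value


class InputExample(object):
    """Reduced version of the example container used by the processors."""

    def __init__(self, guid, text_a, text_b=None, label=None):
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label


class DataProcessor(object):
    def get_example_from_tensor_dict(self, tensor_dict):
        # The base class only defines the interface.
        raise NotImplementedError()


class SingleSentenceProcessor(DataProcessor):
    """Hypothetical processor following the CoLA/SST-2 overrides in the diff."""

    def get_example_from_tensor_dict(self, tensor_dict):
        return InputExample(tensor_dict['idx'].numpy(),
                            tensor_dict['sentence'].numpy().decode('utf-8'),
                            None,  # single-sentence tasks have no second text
                            str(tensor_dict['label'].numpy()))


example = SingleSentenceProcessor().get_example_from_tensor_dict({
    'idx': FakeTensor(0),
    'sentence': FakeTensor(b"The book is on the table."),
    'label': FakeTensor(1),
})
```

Each task names its text fields differently (`premise`/`hypothesis`, `question1`/`question2`, and so on), which is why the conversion belongs to the processor rather than to the shared feature-conversion function.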
--- a/transformers/file_utils.py
+++ b/transformers/file_utils.py
@@ -27,7 +27,7 @@ logger = logging.getLogger(__name__)  # pylint: disable=invalid-name

 try:
    import tensorflow as tf
-    assert int(tf.__version__[0]) >= 2
+    assert hasattr(tf, '__version__') and int(tf.__version__[0]) >= 2
    _tf_available = True  # pylint: disable=invalid-name
    logger.info("TensorFlow version {} available.".format(tf.__version__))
 except (ImportError, AssertionError):
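Review note: the check above reads only the first character of `tf.__version__`, which is sufficient for the 1.x and 2.x releases it is meant to separate. The sketch below contrasts it with a split-based parse, purely as an illustration; both helper names are hypothetical and the patch itself does not use them.

```python
def first_char_major(version):
    """Mirror of the check in the diff: the major version is the first character."""
    return int(version[0])


def split_major(version):
    """Alternative parse: everything before the first dot."""
    return int(version.split(".")[0])


# The two agree for single-digit major versions...
same_for_2x = first_char_major("2.0.0") == split_major("2.0.0")
# ...and differ only once the major version has two digits.
differs_for_10x = first_char_major("10.1.0") != split_major("10.1.0")
```

The added `hasattr(tf, '__version__')` guard covers a separate case: an importable `tensorflow` module that does not expose a version string at all, which would otherwise raise an `AttributeError` that the `except` clause does not catch.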
--- a/transformers/modeling_auto.py
+++ b/transformers/modeling_auto.py
@@ -21,6 +21,7 @@ import logging
 from .modeling_bert import BertModel, BertForMaskedLM, BertForSequenceClassification, BertForQuestionAnswering
 from .modeling_openai import OpenAIGPTModel, OpenAIGPTLMHeadModel
 from .modeling_gpt2 import GPT2Model, GPT2LMHeadModel
+from .modeling_ctrl import CTRLModel, CTRLLMHeadModel
 from .modeling_transfo_xl import TransfoXLModel, TransfoXLLMHeadModel
 from .modeling_xlnet import XLNetModel, XLNetLMHeadModel, XLNetForSequenceClassification, XLNetForQuestionAnswering
 from .modeling_xlm import XLMModel, XLMWithLMHeadModel, XLMForSequenceClassification, XLMForQuestionAnswering
@@ -51,6 +52,7 @@ class AutoModel(object):
            - contains `bert`: BertModel (Bert model)
            - contains `openai-gpt`: OpenAIGPTModel (OpenAI GPT model)
            - contains `gpt2`: GPT2Model (OpenAI GPT-2 model)
+            - contains `ctrl`: CTRLModel (Salesforce CTRL model)
            - contains `transfo-xl`: TransfoXLModel (Transformer-XL model)
            - contains `xlnet`: XLNetModel (XLNet model)
            - contains `xlm`: XLMModel (XLM model)
@@ -73,6 +75,7 @@ class AutoModel(object):
            - contains `bert`: BertModel (Bert model)
            - contains `openai-gpt`: OpenAIGPTModel (OpenAI GPT model)
            - contains `gpt2`: GPT2Model (OpenAI GPT-2 model)
+            - contains `ctrl`: CTRLModel (Salesforce CTRL model)
            - contains `transfo-xl`: TransfoXLModel (Transformer-XL model)
            - contains `xlnet`: XLNetModel (XLNet model)
            - contains `xlm`: XLMModel (XLM model)
@@ -149,10 +152,11 @@ class AutoModel(object):
            return XLNetModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
        elif 'xlm' in pretrained_model_name_or_path:
            return XLMModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
-
+        elif 'ctrl' in pretrained_model_name_or_path:
+            return CTRLModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
        raise ValueError("Unrecognized model identifier in {}. Should contains one of "
                         "'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', "
-                         "'xlm', 'roberta'".format(pretrained_model_name_or_path))
+                         "'xlm', 'roberta', 'ctrl'".format(pretrained_model_name_or_path))


 class AutoModelWithLMHead(object):
@@ -172,6 +176,7 @@ class AutoModelWithLMHead(object):
            - contains `bert`: BertForMaskedLM (Bert model)
            - contains `openai-gpt`: OpenAIGPTLMHeadModel (OpenAI GPT model)
            - contains `gpt2`: GPT2LMHeadModel (OpenAI GPT-2 model)
+            - contains `ctrl`: CTRLLMHeadModel (Salesforce CTRL model)
            - contains `transfo-xl`: TransfoXLLMHeadModel (Transformer-XL model)
            - contains `xlnet`: XLNetLMHeadModel (XLNet model)
            - contains `xlm`: XLMWithLMHeadModel (XLM model)
@@ -273,10 +278,11 @@ class AutoModelWithLMHead(object):
            return XLNetLMHeadModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
        elif 'xlm' in pretrained_model_name_or_path:
            return XLMWithLMHeadModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
-
+        elif 'ctrl' in pretrained_model_name_or_path:
+            return CTRLLMHeadModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
        raise ValueError("Unrecognized model identifier in {}. Should contains one of "
                         "'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', "
-                         "'xlm', 'roberta'".format(pretrained_model_name_or_path))
+                         "'xlm', 'roberta', 'ctrl'".format(pretrained_model_name_or_path))


 class AutoModelForSequenceClassification(object):
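Review note: the `Auto*` classes choose a concrete class by testing substrings of the model name in a fixed order, and this diff appends a `'ctrl'` branch at the end of that chain. A minimal sketch of the dispatch follows. The `DISPATCH_ORDER` list and `resolve_model_class` are hypothetical, class names are plain strings, and the ordering of the earlier entries is an assumption, since only the last branches are visible in these hunks.

```python
# Hypothetical dispatch table: (substring, class name), checked in order.
DISPATCH_ORDER = [
    ("roberta", "RobertaModel"),   # must come before "bert": "roberta" contains "bert"
    ("bert", "BertModel"),
    ("openai-gpt", "OpenAIGPTModel"),
    ("gpt2", "GPT2Model"),
    ("transfo-xl", "TransfoXLModel"),
    ("xlnet", "XLNetModel"),
    ("xlm", "XLMModel"),
    ("ctrl", "CTRLModel"),         # the branch added by this diff
]


def resolve_model_class(pretrained_model_name_or_path):
    """Return the first class whose key is a substring of the name."""
    for key, class_name in DISPATCH_ORDER:
        if key in pretrained_model_name_or_path:
            return class_name
    raise ValueError("Unrecognized model identifier in {}.".format(pretrained_model_name_or_path))
```

Since `'ctrl'` is not a substring of any other key, adding it last is safe; keys that overlap, such as `'roberta'` and `'bert'`, depend on the order of the checks.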
--- a/transformers/modeling_bert.py
+++ b/transformers/modeling_bert.py
@@ -48,6 +48,8 @@ BERT_PRETRAINED_MODEL_ARCHIVE_MAP = {
    'bert-large-uncased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-finetuned-squad-pytorch_model.bin",
    'bert-large-cased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-finetuned-squad-pytorch_model.bin",
    'bert-base-cased-finetuned-mrpc': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-finetuned-mrpc-pytorch_model.bin",
+    'bert-base-german-dbmdz-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-cased-pytorch_model.bin",
+    'bert-base-german-dbmdz-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-uncased-pytorch_model.bin",
 }

 def load_tf_weights_in_bert(model, config, tf_checkpoint_path):
@@ -118,7 +120,7 @@ def load_tf_weights_in_bert(model, config, tf_checkpoint_path):


 def gelu(x):
-    """ Original Implementation of the gelu activation function in Google Bert repo when initialy created.
+    """ Original Implementation of the gelu activation function in Google Bert repo when initially created.
        For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):
        0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
        Also see https://arxiv.org/abs/1606.08415
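Review note: the docstring above contrasts the erf-based gelu with the tanh approximation it attributes to OpenAI GPT. A minimal scalar sketch of both forms follows, using only the standard `math` module rather than `torch`. The function names are hypothetical, and the erf form is written here as an assumption about the function body, which this hunk does not show.

```python
import math


def gelu_erf(x):
    """Exact form: x * Phi(x), with Phi the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Tanh approximation quoted in the docstring."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


# Largest gap between the two forms on a few sample points.
sample_points = [-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0]
max_gap = max(abs(gelu_erf(x) - gelu_tanh(x)) for x in sample_points)
```

The two forms agree closely but not exactly, which is why the docstring warns of slightly different results when swapping one for the other.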
--- a/transformers/modeling_ctrl.py
+++ b/transformers/modeling_ctrl.py
@@ -0,0 +1,485 @@
+# coding=utf-8
+# Copyright 2018 Salesforce and HuggingFace Inc. team.
+# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" PyTorch CTRL model."""
+
+from __future__ import absolute_import, division, print_function, unicode_literals
+
+import collections
+import json
+import logging
+import math
+import os
+import sys
+from io import open
+import numpy as np
+import torch
+import torch.nn as nn
+from torch.nn import CrossEntropyLoss
+from torch.nn.parameter import Parameter
+
+from .modeling_utils import PreTrainedModel, Conv1D, prune_conv1d_layer, SequenceSummary
+from .configuration_ctrl import CTRLConfig
+from .file_utils import add_start_docstrings
+
+logger = logging.getLogger(__name__)
+
+CTRL_PRETRAINED_MODEL_ARCHIVE_MAP = {"ctrl": "https://storage.googleapis.com/sf-ctrl/pytorch/seqlen256_v1.bin"}
+
+
+def angle_defn(pos, i, d_model_size):
+    angle_rates = 1 / torch.pow(10000, (2 * (i//2)) / d_model_size)
+    return pos * angle_rates
+
+def positional_encoding(position, d_model_size, dtype):
+    # create the sinusoidal pattern for the positional encoding
+    angle_rads = (angle_defn(torch.arange(position, dtype=dtype).unsqueeze(1),
+                  torch.arange(d_model_size, dtype=dtype).unsqueeze(0),
+                  d_model_size))
+
+    sines = torch.sin(angle_rads[:, 0::2])
+    cosines = torch.cos(angle_rads[:, 1::2])
+
+    pos_encoding = torch.cat([sines, cosines], dim=-1)
+    return pos_encoding
+
+def scaled_dot_product_attention(q, k, v, mask, attention_mask=None, head_mask=None):
+    # calculate attention
+    matmul_qk = torch.matmul(q, k.permute(0,1,3,2))
+
+    dk = k.shape[-1]
+    scaled_attention_logits = matmul_qk / np.sqrt(dk)
+
+    if mask is not None:
+        scaled_attention_logits += (mask * -1e4)
+
+    if attention_mask is not None:
+        # Apply the attention mask
+        scaled_attention_logits = scaled_attention_logits + attention_mask
+
+    attention_weights = torch.softmax(scaled_attention_logits, dim=-1) 
+
+    # Mask heads if we want to
+    if head_mask is not None:
+        attention_weights = attention_weights * head_mask
+
+    output = torch.matmul(attention_weights, v)
+
+    return output, attention_weights
+
+
+class MultiHeadAttention(torch.nn.Module):
+    def __init__(self, d_model_size, num_heads, output_attentions=False):
+        super(MultiHeadAttention, self).__init__()
+        self.output_attentions = output_attentions
+        self.num_heads = num_heads
+        self.d_model_size = d_model_size
+
+        self.depth = int(d_model_size / self.num_heads)
+
+        self.Wq = torch.nn.Linear(d_model_size, d_model_size)
+        self.Wk = torch.nn.Linear(d_model_size, d_model_size)
+        self.Wv = torch.nn.Linear(d_model_size, d_model_size)
+
+        self.dense = torch.nn.Linear(d_model_size, d_model_size)
+
+    def split_into_heads(self, x, batch_size):
+        x = x.reshape(batch_size, -1, self.num_heads, self.depth)
+        return x.permute([0, 2, 1, 3])
+
+    def forward(self, v, k, q, mask, layer_past=None, attention_mask=None, head_mask=None):
+        batch_size = q.shape[0]
+
+        q = self.Wq(q)
+        k = self.Wk(k)
+        v = self.Wv(v)
+
+        q = self.split_into_heads(q, batch_size)
+        k = self.split_into_heads(k, batch_size)
+        v = self.split_into_heads(v, batch_size)
+        if layer_past is not None:
+            past_key, past_value = layer_past[0], layer_past[1]
+            k = torch.cat((past_key, k), dim=-2)
+            v = torch.cat((past_value, v), dim=-2)
+        present = torch.stack((k, v))
+
+        output = scaled_dot_product_attention(q, k, v, mask, attention_mask, head_mask)
+        scaled_attention = output[0].permute([0, 2, 1, 3])
+        attn = output[1]
+        original_size_attention = scaled_attention.reshape(batch_size, -1, self.d_model_size)
+        output = self.dense(original_size_attention)
+
+        outputs = (output, present)
+        if self.output_attentions:
+            outputs = outputs + (attn,)
+        return outputs
+
+
+
+def point_wise_feed_forward_network(d_model_size, dff):
+    return torch.nn.Sequential(torch.nn.Linear(d_model_size, dff),
+                               torch.nn.ReLU(),
+                               torch.nn.Linear(dff, d_model_size))
+
+
+class EncoderLayer(torch.nn.Module):
+    def __init__(self, d_model_size, num_heads, dff, rate=0.1, output_attentions=False):
+        super(EncoderLayer, self).__init__()
+
+        self.multi_head_attention = MultiHeadAttention(d_model_size, num_heads, output_attentions)
+        self.ffn = point_wise_feed_forward_network(d_model_size, dff)
+
+        self.layernorm1 = torch.nn.LayerNorm(d_model_size, eps=1e-6)
+        self.layernorm2 = torch.nn.LayerNorm(d_model_size, eps=1e-6)
+
+        self.dropout1 = torch.nn.Dropout(rate)
+        self.dropout2 = torch.nn.Dropout(rate)
+
+    def forward(self, x, mask, layer_past=None, attention_mask=None, head_mask=None):
+        normed = self.layernorm1(x)
+        attn_outputs = self.multi_head_attention(normed, normed, normed, mask,
+                                                      layer_past=layer_past,
+                                                      attention_mask=attention_mask,
+                                                      head_mask=head_mask)
+        attn_output = attn_outputs[0]
+        attn_output = self.dropout1(attn_output)
+        out1 = x + attn_output
+
+        out2 = self.layernorm2(out1)
+        ffn_output = self.ffn(out2)
+        ffn_output = self.dropout2(ffn_output)
+        out2 = out1 + ffn_output
+
+        outputs = (out2,) + attn_outputs[1:]
+        return outputs
+
+
+class CTRLPreTrainedModel(PreTrainedModel):
+    """ An abstract class to handle weights initialization and
+        a simple interface for downloading and loading pretrained models.
+    """
+    config_class = CTRLConfig
+    pretrained_model_archive_map = CTRL_PRETRAINED_MODEL_ARCHIVE_MAP
+    base_model_prefix = "transformer"
+
+    def _init_weights(self, module):
+        """ Initialize the weights.
+        """
+        if isinstance(module, (nn.Linear, nn.Embedding, Conv1D)):
+            # Slightly different from the TF version which uses truncated_normal for initialization
+            # cf https://github.com/pytorch/pytorch/pull/5617
+            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
+            if isinstance(module, (nn.Linear, Conv1D)) and module.bias is not None:
+                module.bias.data.zero_()
+        elif isinstance(module, nn.LayerNorm):
+            module.bias.data.zero_()
+            module.weight.data.fill_(1.0)
+
+
+CTRL_START_DOCSTRING = r"""    CTRL model was proposed in 
+    `CTRL: A Conditional Transformer Language Model for Controllable Generation`_
+    by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher.
+    It's a causal (unidirectional) transformer pre-trained using language modeling on a very large
+    corpus of ~140 GB of text data with the first token reserved as a control code (such as Links, Books, Wikipedia etc.).
+
+    This model is a PyTorch `torch.nn.Module`_ sub-class. Use it as a regular PyTorch Module and
+    refer to the PyTorch documentation for all matters related to general usage and behavior.
+
+    .. _`CTRL: A Conditional Transformer Language Model for Controllable Generation`:
+        https://www.github.com/salesforce/ctrl
+
+    .. _`torch.nn.Module`:
+        https://pytorch.org/docs/stable/nn.html#module
+
+    Parameters:
+        config (:class:`~transformers.CTRLConfig`): Model configuration class with all the parameters of the model.
+            Initializing with a config file does not load the weights associated with the model, only the configuration.
+            Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model weights.
+"""
+
+CTRL_INPUTS_DOCSTRING = r"""    Inputs:
+        **input_ids**: ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
+            Indices of input sequence tokens in the vocabulary.
+            CTRL is a model with absolute position embeddings so it's usually advised to pad the inputs on
+            the right rather than the left.
+            Indices can be obtained using :class:`transformers.CTRLTokenizer`.
+            See :func:`transformers.PreTrainedTokenizer.encode` and
+            :func:`transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details.
+        **past**:
+            list of ``torch.FloatTensor`` (one for each layer):
+            that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
+            (see `past` output below). Can be used to speed up sequential decoding.
+        **attention_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``:
+            Mask to avoid performing attention on padding token indices.
+            Mask values selected in ``[0, 1]``:
+            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+        **token_type_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
+            A parallel sequence of tokens (can be used to indicate various portions of the inputs).
+            The embeddings from these tokens will be summed with the respective token embeddings.
+            Indices are selected in the vocabulary (unlike BERT which has a specific vocabulary for segment indices).
+        **position_ids**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
+            Indices of the position of each input sequence token in the position embeddings.
+            Selected in the range ``[0, config.max_position_embeddings - 1]``.
+        **head_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:
+            Mask to nullify selected heads of the self-attention modules.
+            Mask values selected in ``[0, 1]``:
+            ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.
+"""
+
+@add_start_docstrings("The bare CTRL Model transformer outputting raw hidden-states without any specific head on top.",
+                                            CTRL_START_DOCSTRING, CTRL_INPUTS_DOCSTRING)
+class CTRLModel(CTRLPreTrainedModel):
+    r"""
+    Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
+        **last_hidden_state**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)``
+            Sequence of hidden-states at the last layer of the model.
+        **past**:
+            list of ``torch.FloatTensor`` (one for each layer) of shape ``(2, batch_size, num_heads, sequence_length, embed_size_per_head)``:
+            that contains pre-computed hidden-states (key and values in the attention blocks).
+            Can be used (see `past` input) to speed up sequential decoding.
+        **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
+            list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
+            of shape ``(batch_size, sequence_length, hidden_size)``:
+            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
+        **attentions**: (`optional`, returned when ``config.output_attentions=True``)
+            list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
+            Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
+
+    Examples::
+
+        tokenizer = CTRLTokenizer.from_pretrained('ctrl')
+        model = CTRLModel.from_pretrained('ctrl')
+        input_ids = torch.tensor(tokenizer.encode("Links Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids)
+        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
+
+    """
+    def __init__(self, config):
+        super(CTRLModel, self).__init__(config)
+        self.output_hidden_states = config.output_hidden_states
+        self.output_attentions = config.output_attentions
+        self.output_past = config.output_past
+
+        self.d_model_size = config.n_embd
+        self.num_layers = config.n_layer
+
+        self.pos_encoding = positional_encoding(config.n_positions, self.d_model_size, torch.float)
+
+        self.w = nn.Embedding(config.vocab_size, config.n_embd)
+
+        self.dropout = nn.Dropout(config.embd_pdrop)
+        self.h = nn.ModuleList([EncoderLayer(config.n_embd,
+                                             config.n_head,
+                                             config.dff,
+                                             config.resid_pdrop,
+                                             config.output_attentions) for _ in range(config.n_layer)])
+        self.layernorm = nn.LayerNorm(config.n_embd, eps=config.layer_norm_epsilon)
+
+        self.init_weights()
+
+    def _resize_token_embeddings(self, new_num_tokens):
+        self.w = self._get_resized_embeddings(self.w, new_num_tokens)
+        return self.w
+
+    def _prune_heads(self, heads_to_prune):
+        """ Prunes heads of the model.
+                heads_to_prune: dict of {layer_num: list of heads to prune in this layer}
+        """
+        for layer, heads in heads_to_prune.items():
+            self.h[layer].attn.prune_heads(heads)
+
+    def forward(self, input_ids, past=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None):
+        input_shape = input_ids.size()
+        input_ids = input_ids.view(-1, input_shape[-1])
+        if past is None:
+            past_length = 0
+            past = [None] * len(self.h)
+        else:
+            past_length = past[0][0].size(-2)
+        if position_ids is None:
+            position_ids = torch.arange(past_length, input_ids.size(-1) + past_length, dtype=torch.long, device=input_ids.device)
+            position_ids = position_ids.unsqueeze(0).expand_as(input_ids)
+
+        # Attention mask.
+        if attention_mask is not None:
+            attention_mask = attention_mask.view(-1, input_shape[-1])
+            # We create a 4D attention mask from a 2D tensor mask.
+            # Sizes are [batch_size, 1, 1, to_seq_length]
+            # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]
+            # this attention mask is simpler than the triangular masking of causal attention
+            # used in OpenAI GPT, we just need to prepare the broadcast dimension here.
+            attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
+
+            # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
+            # masked positions, this operation will create a tensor which is 0.0 for
+            # positions we want to attend and -10000.0 for masked positions.
+            # Since we are adding it to the raw scores before the softmax, this is
+            # effectively the same as removing these entirely.
+            attention_mask = attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
+            attention_mask = (1.0 - attention_mask) * -10000.0
+
+        # Prepare head mask if needed
+        # 1.0 in head_mask indicates we keep the head
+        # attention_probs has shape bsz x n_heads x N x N
+        # head_mask has shape n_layer x batch x n_heads x N x N
+        if head_mask is not None:
+            if head_mask.dim() == 1:
+                head_mask = head_mask.unsqueeze(0).unsqueeze(0).unsqueeze(-1).unsqueeze(-1)
+                head_mask = head_mask.expand(self.config.n_layer, -1, -1, -1, -1)
+            elif head_mask.dim() == 2:
+                head_mask = head_mask.unsqueeze(1).unsqueeze(-1).unsqueeze(-1)  # We can specify head_mask for each layer
+            head_mask = head_mask.to(dtype=next(self.parameters()).dtype) # switch to float if needed + fp16 compatibility
+        else:
+            head_mask = [None] * self.config.n_layer
+
+        if token_type_ids is not None:
+            token_type_ids = token_type_ids.view(-1, input_shape[-1])
+            token_type_embeds = self.w(token_type_ids)
+            token_type_embeds *= np.sqrt(self.d_model_size)
+        else:
+            token_type_embeds = 0
+        position_ids = position_ids.view(-1, input_shape[-1])
+
+        inputs_embeds = self.w(input_ids)
+        # inputs_embeds = embedded.unsqueeze(0) if len(input_ids.shape)<2 else embedded
+        seq_len = input_ids.shape[-1]
+        mask = torch.triu(torch.ones(seq_len, seq_len), 1).to(inputs_embeds.device)
+
+        inputs_embeds *= np.sqrt(self.d_model_size)
+
+        pos_embeds = self.pos_encoding[position_ids, :].to(inputs_embeds.device)
+
+        hidden_states = inputs_embeds + pos_embeds + token_type_embeds
+
+        hidden_states = self.dropout(hidden_states)
+
+        output_shape = input_shape + (inputs_embeds.size(-1),)
+        presents = ()
+        all_hidden_states = ()
+        all_attentions = []
+        for i, (h, layer_past) in enumerate(zip(self.h, past)):
+            if self.output_hidden_states:
+                all_hidden_states = all_hidden_states + (hidden_states.view(*output_shape),)
+            outputs = h(hidden_states,
+                        mask,
+                        layer_past=layer_past,
+                        attention_mask=attention_mask,
+                        head_mask=head_mask[i])
+            hidden_states, present = outputs[:2]
+            if self.output_past:
+                presents = presents + (present,)
+
+            if self.output_attentions:
+                all_attentions.append(outputs[2])
+
+        hidden_states = self.layernorm(hidden_states)
+        hidden_states = hidden_states.view(*output_shape)
+        if self.output_hidden_states:
+            all_hidden_states = all_hidden_states + (hidden_states,)
+
+        outputs = (hidden_states,)
+        if self.output_past:
+            outputs = outputs + (presents,)
+        if self.output_hidden_states:
+            outputs = outputs + (all_hidden_states,)
+        if self.output_attentions:
+            # let the number of heads free (-1) so we can extract attention even after head pruning
+            attention_output_shape = input_shape[:-1] + (-1,) + all_attentions[0].shape[-2:]
+            all_attentions = tuple(t.view(*attention_output_shape) for t in all_attentions)
+            outputs = outputs + (all_attentions,)
+        return outputs
+
+
+@add_start_docstrings("""The CTRL Model transformer with a language modeling head on top
+(linear layer with weights tied to the input embeddings). """, CTRL_START_DOCSTRING, CTRL_INPUTS_DOCSTRING)
+class CTRLLMHeadModel(CTRLPreTrainedModel):
+    r"""
+        **labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
+            Labels for language modeling.
+            Note that the labels **are shifted** inside the model, i.e. you can set ``lm_labels = input_ids``
+            Indices are selected in ``[-1, 0, ..., config.vocab_size]``
+            All labels set to ``-1`` are ignored (masked), the loss is only
+            computed for labels in ``[0, ..., config.vocab_size]``
+
+    Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
+        **loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
+            Language modeling loss.
+        **prediction_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, config.vocab_size)``
+            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
+        **past**:
+            list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
+            that contains pre-computed hidden-states (key and values in the attention blocks).
+            Can be used (see `past` input) to speed up sequential decoding.
+        **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
+            list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
+            of shape ``(batch_size, sequence_length, hidden_size)``:
+            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
+        **attentions**: (`optional`, returned when ``config.output_attentions=True``)
+            list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
+            Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
+
+    Examples::
+
+        import torch
+        from transformers import CTRLTokenizer, CTRLLMHeadModel
+
+        tokenizer = CTRLTokenizer.from_pretrained('ctrl')
+        model = CTRLLMHeadModel.from_pretrained('ctrl')
+
+        input_ids = torch.tensor(tokenizer.encode("Links Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids, labels=input_ids)
+        loss, logits = outputs[:2]
+
+    """
+    def __init__(self, config):
+        super(CTRLLMHeadModel, self).__init__(config)
+        self.transformer = CTRLModel(config)
+        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=True)
+
+        self.init_weights()
+        self.tie_weights()
+
+    def tie_weights(self):
+        """ Make sure we are sharing the input and output embeddings.
+                Export to TorchScript can't handle parameter sharing so we are cloning them instead.
+        """
+        self._tie_or_clone_weights(self.lm_head, self.transformer.w)
+
+    def forward(self, input_ids, past=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None,
+                labels=None):
+        transformer_outputs = self.transformer(input_ids,
+                                               past=past,
+                                               attention_mask=attention_mask,
+                                               token_type_ids=token_type_ids,
+                                               position_ids=position_ids,
+                                               head_mask=head_mask)
+
+        hidden_states = transformer_outputs[0]
+
+        lm_logits = self.lm_head(hidden_states)
+
+        outputs = (lm_logits,) + transformer_outputs[1:]
+
+        if labels is not None:
+            # Shift so that tokens < n predict n
+            shift_logits = lm_logits[..., :-1, :].contiguous()
+            shift_labels = labels[..., 1:].contiguous()
+            # Flatten the tokens
+            loss_fct = CrossEntropyLoss(ignore_index=-1)
+            loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)),
+                            shift_labels.view(-1))
+            outputs = (loss,) + outputs
+
+        return outputs  # (loss), lm_logits, presents, (all hidden_states), (attentions)
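
Note on the loss computed in CTRLLMHeadModel.forward above: the logits at position t are scored against the label at position t + 1, and labels equal to -1 are skipped. The sketch below re-implements that shift-and-ignore logic on plain Python lists so it can be checked without PyTorch. `shifted_lm_loss` is a hypothetical helper, not part of transformers, and it assumes a single unbatched sequence with mean reduction.

```python
import math


def shifted_lm_loss(logits, labels, ignore_index=-1):
    """Mean cross-entropy where position t predicts labels[t + 1].

    logits: list of per-position score lists (seq_len x vocab_size).
    labels: list of int token ids (seq_len); ignore_index entries are skipped.
    """
    shift_logits = logits[:-1]  # the last position has no next token to predict
    shift_labels = labels[1:]   # the first label is never a prediction target
    total, count = 0.0, 0
    for scores, target in zip(shift_logits, shift_labels):
        if target == ignore_index:
            continue
        # Log-sum-exp with max subtraction for numerical stability.
        top = max(scores)
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_z - scores[target]
        count += 1
    # With no valid targets the mean is undefined; report NaN in that case.
    return total / count if count else float("nan")


uniform = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
loss_all = shifted_lm_loss(uniform, [1, 2, 3])
loss_masked = shifted_lm_loss(uniform, [1, -1, 3])
```

With uniform scores over a vocabulary of four tokens, every counted position contributes log(4), so masking one label leaves the mean unchanged.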
--- a/transformers/modeling_distilbert.py
+++ b/transformers/modeling_distilbert.py
@@ -159,8 +159,6 @@ class MultiHeadSelfAttention(nn.Module):

        dim_per_head = self.dim // self.n_heads

-        assert 2 <= mask.dim() <= 3
-        causal = (mask.dim() == 3)
        mask_reshp = (bs, 1, 1, k_length)

        def shape(x):
@@ -649,7 +647,7 @@ class DistilBertForQuestionAnswering(DistilBertPreTrainedModel):
        start_positions = torch.tensor([1])
        end_positions = torch.tensor([3])
        outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)
-        loss, start_scores, end_scores = outputs[:2]
+        loss, start_scores, end_scores = outputs[:3]

    """
    def __init__(self, config):
--- a/transformers/modeling_gpt2.py
+++ b/transformers/modeling_gpt2.py
@@ -38,7 +38,8 @@ logger = logging.getLogger(__name__)

 GPT2_PRETRAINED_MODEL_ARCHIVE_MAP = {"gpt2": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-pytorch_model.bin",
                                     "gpt2-medium": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-pytorch_model.bin",
-                                     "gpt2-large": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-pytorch_model.bin"}
+                                     "gpt2-large": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-pytorch_model.bin",
+                                     "distilgpt2": "https://s3.amazonaws.com/models.huggingface.co/bert/distilgpt2-pytorch_model.bin",}

 def load_tf_weights_in_gpt2(model, config, gpt2_checkpoint_path):
    """ Load tf checkpoints in a pytorch model
@@ -346,6 +347,7 @@ class GPT2Model(GPT2PreTrainedModel):
        super(GPT2Model, self).__init__(config)
        self.output_hidden_states = config.output_hidden_states
        self.output_attentions = config.output_attentions
+        self.output_past = config.output_past

        self.wte = nn.Embedding(config.vocab_size, config.n_embd)
        self.wpe = nn.Embedding(config.n_positions, config.n_embd)
@@ -439,7 +441,8 @@ class GPT2Model(GPT2PreTrainedModel):
                            head_mask=head_mask[i])

            hidden_states, present = outputs[:2]
-            presents = presents + (present,)
+            if self.output_past:
+                presents = presents + (present,)

            if self.output_attentions:
                all_attentions.append(outputs[2])
@@ -451,7 +454,9 @@ class GPT2Model(GPT2PreTrainedModel):
        if self.output_hidden_states:
            all_hidden_states = all_hidden_states + (hidden_states,)

-        outputs = (hidden_states, presents)
+        outputs = (hidden_states,)
+        if self.output_past:
+            outputs = outputs + (presents,)
        if self.output_hidden_states:
            outputs = outputs + (all_hidden_states,)
        if self.output_attentions:
@@ -459,7 +464,7 @@ class GPT2Model(GPT2PreTrainedModel):
            attention_output_shape = input_shape[:-1] + (-1,) + all_attentions[0].shape[-2:]
            all_attentions = tuple(t.view(*attention_output_shape) for t in all_attentions)
            outputs = outputs + (all_attentions,)
-        return outputs  # last hidden state, presents, (all hidden_states), (attentions)
+        return outputs  # last hidden state, (presents), (all hidden_states), (attentions)


@add_start_docstrings("""The GPT2 Model transformer with a language modeling head on top
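
Note on the GPT2Model change above: because `presents` is now appended only when `config.output_past` is set, the positions of the later tuple entries depend on the configuration flags. The sketch below computes those positions from the flags, following the order in the return comment. `output_position` and the entry names are hypothetical labels for illustration, not library API.

```python
def output_position(name, output_past, output_hidden_states, output_attentions):
    """Return the index of `name` in the model's output tuple for the given flags."""
    layout = ["last_hidden_state"]  # always present, always first
    if output_past:
        layout.append("presents")
    if output_hidden_states:
        layout.append("all_hidden_states")
    if output_attentions:
        layout.append("attentions")
    if name not in layout:
        raise KeyError("{} is not returned with these flags".format(name))
    return layout.index(name)


with_past = output_position("attentions", True, False, True)
without_past = output_position("attentions", False, False, True)
```

Callers that index the outputs positionally (for example `outputs[1]` for the cached keys and values) need the same flags the model was configured with.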
--- a/transformers/modeling_openai.py
+++ b/transformers/modeling_openai.py
@@ -170,7 +170,7 @@ class Attention(nn.Module):
        # w = w * self.bias + -1e9 * (1 - self.bias)  # TF implem method: mask_attn_weights
        # XD: self.b may be larger than w, so we need to crop it
        b = self.bias[:, :, : w.size(-2), : w.size(-1)]
-        w = w * b + -1e9 * (1 - b)
+        w = w * b - 1e4 * (1 - b)

        if attention_mask is not None:
            # Apply the attention mask
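
Note on the change from -1e9 to -1e4 above: the diff does not state the reason. One plausible reading, consistent with the "fp16 compatibility" comments elsewhere in this change, is that -1e9 cannot be represented in half precision, where the largest finite magnitude is 65504. If the fill value becomes -inf, the kept positions compute `-inf * 0`, which is NaN. The stdlib-only sketch below illustrates both points; `fits_in_fp16` and `masked_score` are hypothetical helpers written for this note.

```python
import math
import struct


def fits_in_fp16(value):
    """True if `value` can be packed as a finite IEEE 754 half-precision float."""
    try:
        struct.pack("<e", value)
        return True
    except (OverflowError, struct.error):
        return False


def masked_score(score, keep, fill):
    """Scalar version of `w * b + fill * (1 - b)` for one attention score."""
    return score * keep + fill * (1 - keep)


small_fill_ok = fits_in_fp16(-1e4)
large_fill_ok = fits_in_fp16(-1e9)
kept_with_inf = masked_score(0.5, 1.0, float("-inf"))
kept_with_1e4 = masked_score(0.5, 1.0, -1e4)
dropped_with_1e4 = masked_score(0.5, 0.0, -1e4)
```

A fill of -1e4 stays finite in half precision and is still negative enough that the softmax weight of a masked position is effectively zero.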
--- a/transformers/modeling_roberta.py
+++ b/transformers/modeling_roberta.py
@@ -43,6 +43,9 @@ class RobertaEmbeddings(BertEmbeddings):
    def __init__(self, config):
        super(RobertaEmbeddings, self).__init__(config)
        self.padding_idx = 1
+        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=self.padding_idx)
+        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size,
+                                                padding_idx=self.padding_idx)

    def forward(self, input_ids, token_type_ids=None, position_ids=None):
        seq_length = input_ids.size(1)
@@ -169,7 +172,8 @@ class RobertaModel(BertModel):
        if input_ids[:, 0].sum().item() != 0:
            logger.warning("A sequence with no special tokens has been passed to the RoBERTa model. "
                           "This model requires special tokens in order to work. "
-                           "Please specify add_special_tokens=True in your encoding.")
+                           "Please specify add_special_tokens=True in your tokenizer.encode() "
+                           "or tokenizer.convert_tokens_to_ids().")
        return super(RobertaModel, self).forward(input_ids,
                                                 attention_mask=attention_mask,
                                                 token_type_ids=token_type_ids,
--- a/transformers/modeling_tf_bert.py
+++ b/transformers/modeling_tf_bert.py
@@ -30,7 +30,6 @@ import tensorflow as tf
 from .configuration_bert import BertConfig
 from .modeling_tf_utils import TFPreTrainedModel, get_initializer
 from .file_utils import add_start_docstrings
-from .modeling_tf_pytorch_utils import load_pytorch_checkpoint_in_tf2_model

 logger = logging.getLogger(__name__)

@@ -52,17 +51,9 @@ TF_BERT_PRETRAINED_MODEL_ARCHIVE_MAP = {
 }


-def load_bert_pt_weights_in_tf2(tf_model, pytorch_checkpoint_path):
-    # build the network
-    inputs_list = [[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]]
-    tf_inputs = tf.constant(inputs_list)
-    tfo = tf_model(tf_inputs, training=False)
-    return load_pytorch_checkpoint_in_tf2_model(tf_model, pytorch_checkpoint_path, tf_inputs=tf_inputs)
-
-
 def gelu(x):
    """ Gaussian Error Linear Unit.
-    Original Implementation of the gelu activation function in Google Bert repo when initialy created.
+    Original Implementation of the gelu activation function in Google Bert repo when initially created.
        For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):
        0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
        Also see https://arxiv.org/abs/1606.08415
@@ -545,7 +536,6 @@ class TFBertPreTrainedModel(TFPreTrainedModel):
    """
    config_class = BertConfig
    pretrained_model_archive_map = TF_BERT_PRETRAINED_MODEL_ARCHIVE_MAP
-    load_pt_weights = load_bert_pt_weights_in_tf2
    base_model_prefix = "bert"


--- a/transformers/modeling_tf_ctrl.py
+++ b/transformers/modeling_tf_ctrl.py
@@ -0,0 +1,487 @@
+# coding=utf-8
+# Copyright 2018 Salesforce and HuggingFace Inc. team.
+# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" TF 2.0 CTRL model."""
+
+from __future__ import absolute_import, division, print_function, unicode_literals
+
+import logging
+import os
+import sys
+from io import open
+import numpy as np
+import tensorflow as tf
+
+from .configuration_ctrl import CTRLConfig
+from .modeling_tf_utils import TFPreTrainedModel, get_initializer, shape_list, TFSharedEmbeddings
+from .file_utils import add_start_docstrings
+
+logger = logging.getLogger(__name__)
+
+TF_CTRL_PRETRAINED_MODEL_ARCHIVE_MAP = {"ctrl": "https://s3.amazonaws.com/models.huggingface.co/bert/ctrl-tf_model.h5"}
+
+def angle_defn(pos, i, d_model_size):
+    angle_rates = 1 / np.power(10000, (2 * (i//2)) / np.float32(d_model_size))
+    return pos * angle_rates
+
+def positional_encoding(position, d_model_size):
+    # create the sinusoidal pattern for the positional encoding
+    angle_rads = angle_defn(np.arange(position)[:, np.newaxis],
+                            np.arange(d_model_size)[np.newaxis, :],
+                            d_model_size)
+
+    sines = np.sin(angle_rads[:, 0::2])
+    cosines = np.cos(angle_rads[:, 1::2])
+
+    # pos_encoding = tf.cast(np.concatenate([sines, cosines], axis=-1)[np.newaxis, ...], dtype=tf.float32)
+    pos_encoding = tf.cast(np.concatenate([sines, cosines], axis=-1), dtype=tf.float32)
+    return pos_encoding
+
+def scaled_dot_product_attention(q, k, v, mask, attention_mask=None, head_mask=None):
+    # calculate attention
+    matmul_qk = tf.matmul(q, k, transpose_b=True)
+    
+    dk = tf.cast(shape_list(k)[-1], tf.float32)
+    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)
+
+    if mask is not None:
+        scaled_attention_logits += (mask * -1e4)
+
+    if attention_mask is not None:
+        # Apply the attention mask
+        scaled_attention_logits = scaled_attention_logits + attention_mask
+
+    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1) 
+
+    # Mask heads if we want to
+    if head_mask is not None:
+        attention_weights = attention_weights * head_mask
+
+    output = tf.matmul(attention_weights, v) 
+
+    return output, attention_weights
+
+
+class TFMultiHeadAttention(tf.keras.layers.Layer):
+    def __init__(self, d_model_size, num_heads, output_attentions=False, **kwargs):
+        super(TFMultiHeadAttention, self).__init__(**kwargs)
+        self.output_attentions = output_attentions
+        self.num_heads = num_heads
+        self.d_model_size = d_model_size
+
+        self.depth = int(d_model_size / self.num_heads)
+
+        self.Wq = tf.keras.layers.Dense(d_model_size, name='Wq')
+        self.Wk = tf.keras.layers.Dense(d_model_size, name='Wk')
+        self.Wv = tf.keras.layers.Dense(d_model_size, name='Wv')
+
+        self.dense = tf.keras.layers.Dense(d_model_size, name='dense')
+
+    def split_into_heads(self, x, batch_size):
+        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
+        return tf.transpose(x, perm=[0, 2, 1, 3])
+
+    def call(self, inputs, training=False):
+        v, k, q, mask, layer_past, attention_mask, head_mask = inputs
+        batch_size = q.shape[0]
+
+        q = self.Wq(q)
+        k = self.Wk(k)
+        v = self.Wv(v)
+
+        q = self.split_into_heads(q, batch_size)
+        k = self.split_into_heads(k, batch_size)
+        v = self.split_into_heads(v, batch_size)
+        if layer_past is not None:
+            past_key, past_value = tf.unstack(layer_past, axis=1)
+            k = tf.concat((past_key, k), axis=-2)
+            v = tf.concat((past_value, v), axis=-2)
+        present = tf.stack((k, v), axis=1)
+
+        output = scaled_dot_product_attention(q, k, v, mask, attention_mask, head_mask)
+        scaled_attention = tf.transpose(output[0], perm=[0, 2, 1, 3])
+        attn = output[1]
+        original_size_attention = tf.reshape(scaled_attention,  (batch_size, -1, self.d_model_size))
+        output = self.dense(original_size_attention)
+
+        outputs = (output, present)
+        if self.output_attentions:
+            outputs = outputs + (attn,)
+        return outputs
+
+
+
+def point_wise_feed_forward_network(d_model_size, dff, name=""):
+    return tf.keras.Sequential([
+            tf.keras.layers.Dense(dff, activation='relu', name="0"), 
+            tf.keras.layers.Dense(d_model_size, name="2")
+        ], name="ffn")
+
+
+class TFEncoderLayer(tf.keras.layers.Layer):
+    def __init__(self, d_model_size, num_heads, dff, rate=0.1, layer_norm_epsilon=1e-6, output_attentions=False, **kwargs):
+        super(TFEncoderLayer, self).__init__(**kwargs)
+
+        self.multi_head_attention = TFMultiHeadAttention(d_model_size,
+                                                         num_heads,
+                                                         output_attentions,
+                                                         name="multi_head_attention")
+        self.ffn = point_wise_feed_forward_network(d_model_size, dff, name="ffn")
+
+        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=layer_norm_epsilon, name="layernorm1")
+        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=layer_norm_epsilon, name="layernorm2")
+
+        self.dropout1 = tf.keras.layers.Dropout(rate)
+        self.dropout2 = tf.keras.layers.Dropout(rate)
+
+    def call(self, inputs, training=False):
+        x, mask, layer_past, attention_mask, head_mask = inputs
+        normed = self.layernorm1(x)
+        attn_outputs = self.multi_head_attention([normed, normed, normed, mask, layer_past,
+                                                  attention_mask, head_mask], training=training)
+        attn_output = attn_outputs[0]
+        attn_output = self.dropout1(attn_output, training=training)
+        out1 = x + attn_output
+
+        out2 = self.layernorm2(out1)
+        ffn_output = self.ffn(out2)
+        ffn_output = self.dropout2(ffn_output, training=training)
+        out2 = out1 + ffn_output
+
+        outputs = (out2,) + attn_outputs[1:]
+        return outputs
+
+
+class TFCTRLMainLayer(tf.keras.layers.Layer):
+    def __init__(self, config, **kwargs):
+        super(TFCTRLMainLayer, self).__init__(**kwargs)
+        self.output_hidden_states = config.output_hidden_states
+        self.output_attentions = config.output_attentions
+        self.output_past = config.output_past
+
+        self.d_model_size = config.n_embd
+        self.num_layers = config.n_layer
+
+        self.pos_encoding = positional_encoding(config.n_positions, self.d_model_size)
+
+
+        self.w = TFSharedEmbeddings(config.vocab_size,
+                                    config.n_embd,
+                                    initializer_range=config.initializer_range,
+                                    name="w")
+
+        self.dropout = tf.keras.layers.Dropout(config.embd_pdrop)
+        self.h = [TFEncoderLayer(config.n_embd,
+                                 config.n_head,
+                                 config.dff,
+                                 config.resid_pdrop,
+                                 config.layer_norm_epsilon,
+                                 config.output_attentions,
+                                 name='h_._{}'.format(i)) for i in range(config.n_layer)]
+        self.layernorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_epsilon, name="layernorm")
+
+    def _resize_token_embeddings(self, new_num_tokens):
+        raise NotImplementedError
+
+    def _prune_heads(self, heads_to_prune):
+        """ Prunes heads of the model.
+                heads_to_prune: dict of {layer_num: list of heads to prune in this layer}
+        """
+        raise NotImplementedError
+
+    def call(self, inputs, past=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, training=False):
+        if isinstance(inputs, (tuple, list)):
+            input_ids = inputs[0]
+            past = inputs[1] if len(inputs) > 1 else past
+            attention_mask = inputs[2] if len(inputs) > 2 else attention_mask
+            token_type_ids = inputs[3] if len(inputs) > 3 else token_type_ids
+            position_ids = inputs[4] if len(inputs) > 4 else position_ids
+            head_mask = inputs[5] if len(inputs) > 5 else head_mask
+            assert len(inputs) <= 6, "Too many inputs."
+        elif isinstance(inputs, dict):
+            input_ids = inputs.get('input_ids')
+            past = inputs.get('past', past)
+            attention_mask = inputs.get('attention_mask', attention_mask)
+            token_type_ids = inputs.get('token_type_ids', token_type_ids)
+            position_ids = inputs.get('position_ids', position_ids)
+            head_mask = inputs.get('head_mask', head_mask)
+            assert len(inputs) <= 6, "Too many inputs."
+        else:
+            input_ids = inputs
+
+        input_shape = shape_list(input_ids)
+        input_ids = tf.reshape(input_ids, [-1, input_shape[-1]])
+
+        if past is None:
+            past_length = 0
+            past = [None] * len(self.h)
+        else:
+            past_length = shape_list(past[0][0])[-2]
+        if position_ids is None:
+            position_ids = tf.range(past_length, shape_list(input_ids)[-1] + past_length, dtype=tf.int32)[tf.newaxis, :]
+            position_ids = tf.tile(position_ids, [shape_list(input_ids)[0], 1])
+
+        # Attention mask.
+        if attention_mask is not None:
+            # We create a 4D attention mask from a 2D tensor mask.
+            # Sizes are [batch_size, 1, 1, to_seq_length],
+            # so we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length].
+            # This attention mask is simpler than the triangular masking of causal attention
+            # used in OpenAI GPT; we just need to prepare the broadcast dimension here.
+            attention_mask = attention_mask[:, tf.newaxis, tf.newaxis, :]
+
+            # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
+            # masked positions, this operation will create a tensor which is 0.0 for
+            # positions we want to attend and -10000.0 for masked positions.
+            # Since we are adding it to the raw scores before the softmax, this is
+            # effectively the same as removing these entirely.
+
+            attention_mask = tf.cast(attention_mask, tf.float32)
+            attention_mask = (1.0 - attention_mask) * -10000.0
+        else:
+            attention_mask = None
+
+        # Prepare head mask if needed
+        # 1.0 in head_mask indicates we keep the head
+        # attention_probs has shape bsz x n_heads x N x N
+        # head_mask has shape n_layer x batch x n_heads x N x N
+        if head_mask is not None:
+            raise NotImplementedError
+        else:
+            head_mask = [None] * self.num_layers
+
+        if token_type_ids is not None:
+            token_type_ids = tf.reshape(token_type_ids, [-1, shape_list(token_type_ids)[-1]])
+            token_type_embeds = self.w(token_type_ids, mode='embedding')
+            token_type_embeds *= tf.math.sqrt(tf.cast(self.d_model_size, tf.float32))
+        else:
+            token_type_embeds = 0
+        position_ids = tf.reshape(position_ids, [-1, shape_list(position_ids)[-1]])
+
+        inputs_embeds = self.w(input_ids, mode='embedding')
+        # x = embedded.unsqueeze(0) if len(input_ids.shape)<2 else embedded
+        seq_len = input_shape[-1]
+        mask = 1 - tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)
+
+        inputs_embeds *= tf.math.sqrt(tf.cast(self.d_model_size, tf.float32))
+
+        pos_embeds = tf.gather(self.pos_encoding, position_ids)
+
+        hidden_states = inputs_embeds + pos_embeds + token_type_embeds
+
+        hidden_states = self.dropout(hidden_states, training=training)
+
+        output_shape = input_shape + [shape_list(hidden_states)[-1]]
+        presents = ()
+        all_hidden_states = ()
+        all_attentions = []
+        for i, (h, layer_past) in enumerate(zip(self.h, past)):
+            if self.output_hidden_states:
+                all_hidden_states = all_hidden_states + (tf.reshape(hidden_states, output_shape),)
+            outputs = h([hidden_states, mask, layer_past, attention_mask, head_mask[i]], training=training)
+            hidden_states, present = outputs[:2]
+
+            if self.output_past:
+                presents = presents + (present,)
+
+            if self.output_attentions:
+                all_attentions.append(outputs[2])
+
+        hidden_states = self.layernorm(hidden_states)
+        hidden_states = tf.reshape(hidden_states, output_shape)
+        if self.output_hidden_states:
+            all_hidden_states = all_hidden_states + (hidden_states,)
+
+        outputs = (hidden_states,)
+        if self.output_past:
+            outputs = outputs + (presents,)
+        if self.output_hidden_states:
+            outputs = outputs + (all_hidden_states,)
+        if self.output_attentions:
+            # let the number of heads free (-1) so we can extract attention even after head pruning
+            attention_output_shape = input_shape[:-1] + [-1] + shape_list(all_attentions[0])[-2:]
+            all_attentions = tuple(tf.reshape(t, attention_output_shape) for t in all_attentions)
+            outputs = outputs + (all_attentions,)
+        return outputs
+
+
+class TFCTRLPreTrainedModel(TFPreTrainedModel):
+    """ An abstract class to handle weights initialization and
+        a simple interface for downloading and loading pretrained models.
+    """
+    config_class = CTRLConfig
+    pretrained_model_archive_map = TF_CTRL_PRETRAINED_MODEL_ARCHIVE_MAP
+    base_model_prefix = "transformer"
+
+
+CTRL_START_DOCSTRING = r"""    CTRL model was proposed in 
+    `CTRL: A Conditional Transformer Language Model for Controllable Generation`_
+    by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher.
+    It's a causal (unidirectional) transformer pre-trained using language modeling on a very large
+    corpus of ~140 GB of text data with the first token reserved as a control code (such as Links, Books, Wikipedia etc.).
+
+    This model is a `tf.keras.Model`_ sub-class. Use it as a regular TF 2.0 Keras Model and
+    refer to the TF 2.0 documentation for all matters related to general usage and behavior.
+
+    .. _`CTRL: A Conditional Transformer Language Model for Controllable Generation`:
+        https://www.github.com/salesforce/ctrl
+
+    .. _`tf.keras.Model`:
+        https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/Model
+
+    Parameters:
+        config (:class:`~transformers.CTRLConfig`): Model configuration class with all the parameters of the model.
+            Initializing with a config file does not load the weights associated with the model, only the configuration.
+            Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model weights.
+"""
+
+CTRL_INPUTS_DOCSTRING = r"""    Inputs:
+        **input_ids**: ``Numpy array`` or ``tf.Tensor`` of shape ``(batch_size, sequence_length)``:
+            Indices of input sequence tokens in the vocabulary.
+            CTRL is a model with absolute position embeddings so it's usually advised to pad the inputs on
+            the right rather than the left.
+            Indices can be obtained using :class:`transformers.CTRLTokenizer`.
+            See :func:`transformers.PreTrainedTokenizer.encode` and
+            :func:`transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details.
+        **past**:
+            list of ``Numpy array`` or ``tf.Tensor`` (one for each layer):
+            that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
+            (see `past` output below). Can be used to speed up sequential decoding.
+        **attention_mask**: (`optional`) ``Numpy array`` or ``tf.Tensor`` of shape ``(batch_size, sequence_length)``:
+            Mask to avoid performing attention on padding token indices.
+            Mask values selected in ``[0, 1]``:
+            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+        **token_type_ids**: (`optional`) ``Numpy array`` or ``tf.Tensor`` of shape ``(batch_size, sequence_length)``:
+            A parallel sequence of tokens (can be used to indicate various portions of the inputs).
+            The embeddings from these tokens will be summed with the respective token embeddings.
+            Indices are selected in the vocabulary (unlike BERT which has a specific vocabulary for segment indices).
+        **position_ids**: (`optional`) ``Numpy array`` or ``tf.Tensor`` of shape ``(batch_size, sequence_length)``:
+            Indices of the position of each input sequence token in the position embeddings.
+            Selected in the range ``[0, config.max_position_embeddings - 1]``.
+        **head_mask**: (`optional`) ``Numpy array`` or ``tf.Tensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:
+            Mask to nullify selected heads of the self-attention modules.
+            Mask values selected in ``[0, 1]``:
+            ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.
+"""
+
+@add_start_docstrings("The bare CTRL Model transformer outputting raw hidden-states without any specific head on top.",
+                                            CTRL_START_DOCSTRING, CTRL_INPUTS_DOCSTRING)
+class TFCTRLModel(TFCTRLPreTrainedModel):
+    r"""
+    Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
+        **last_hidden_state**: ``tf.Tensor`` of shape ``(batch_size, sequence_length, hidden_size)``
+            Sequence of hidden-states at the last layer of the model.
+        **past**:
+            list of ``tf.Tensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
+            that contains pre-computed hidden-states (key and values in the attention blocks).
+            Can be used (see `past` input) to speed up sequential decoding.
+        **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
+            list of ``tf.Tensor`` (one for the output of each layer + the output of the embeddings)
+            of shape ``(batch_size, sequence_length, hidden_size)``:
+            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
+        **attentions**: (`optional`, returned when ``config.output_attentions=True``)
+            list of ``tf.Tensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
+            Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
+
+    Examples::
+
+        import tensorflow as tf
+        from transformers import CTRLTokenizer, TFCTRLModel
+
+        tokenizer = CTRLTokenizer.from_pretrained('ctrl')
+        model = TFCTRLModel.from_pretrained('ctrl')
+        input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :]  # Batch size 1
+        outputs = model(input_ids)
+        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
+
+    """
+    def __init__(self, config, *inputs, **kwargs):
+        super(TFCTRLModel, self).__init__(config, *inputs, **kwargs)
+        self.transformer = TFCTRLMainLayer(config, name='transformer')
+
+    def call(self, inputs, **kwargs):
+        outputs = self.transformer(inputs, **kwargs)
+        return outputs
+
+
+class TFCTRLLMHead(tf.keras.layers.Layer):
+    def __init__(self, config, input_embeddings, **kwargs):
+        super(TFCTRLLMHead, self).__init__(**kwargs)
+        self.vocab_size = config.vocab_size
+
+        # The output weights are the same as the input embeddings, but there is
+        # an output-only bias for each token.
+        self.input_embeddings = input_embeddings
+
+    def build(self, input_shape):
+        self.bias = self.add_weight(shape=(self.vocab_size,),
+                                    initializer='zeros',
+                                    trainable=True,
+                                    name='bias')
+        super(TFCTRLLMHead, self).build(input_shape)
+
+    def call(self, hidden_states):
+        hidden_states = self.input_embeddings(hidden_states, mode="linear")
+        hidden_states = hidden_states + self.bias
+        return hidden_states
+
+
+@add_start_docstrings("""The CTRL Model transformer with a language modeling head on top
+(linear layer with weights tied to the input embeddings). """, CTRL_START_DOCSTRING, CTRL_INPUTS_DOCSTRING)
+class TFCTRLLMHeadModel(TFCTRLPreTrainedModel):
+    r"""
+    Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
+        **prediction_scores**: ``tf.Tensor`` of shape ``(batch_size, sequence_length, config.vocab_size)``
+            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
+        **past**:
+            list of ``tf.Tensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
+            that contains pre-computed hidden-states (key and values in the attention blocks).
+            Can be used (see `past` input) to speed up sequential decoding.
+        **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
+            list of ``tf.Tensor`` (one for the output of each layer + the output of the embeddings)
+            of shape ``(batch_size, sequence_length, hidden_size)``:
+            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
+        **attentions**: (`optional`, returned when ``config.output_attentions=True``)
+            list of ``tf.Tensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
+            Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
+
+    Examples::
+
+        import tensorflow as tf
+        from transformers import CTRLTokenizer, TFCTRLLMHeadModel
+
+        tokenizer = CTRLTokenizer.from_pretrained('ctrl')
+        model = TFCTRLLMHeadModel.from_pretrained('ctrl')
+
+        input_ids = tf.constant(tokenizer.encode("Links Hello, my dog is cute"))[None, :]  # Batch size 1
+        outputs = model(input_ids)
+        logits = outputs[0]  # The prediction scores are the first element of the output tuple
+
+    """
+    def __init__(self, config, *inputs, **kwargs):
+        super(TFCTRLLMHeadModel, self).__init__(config, *inputs, **kwargs)
+        self.transformer = TFCTRLMainLayer(config, name='transformer')
+
+        self.lm_head = TFCTRLLMHead(config, self.transformer.w, name="lm_head")
+
+    def call(self, inputs, **kwargs):
+        transformer_outputs = self.transformer(inputs, **kwargs)
+        hidden_states = transformer_outputs[0]
+
+        lm_logits = self.lm_head(hidden_states)
+
+        outputs = (lm_logits,) + transformer_outputs[1:]
+
+        return outputs  # lm_logits, presents, (all hidden_states), (attentions)
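Review note on the tied LM head above: TFCTRLLMHead reuses the input embedding matrix as the output projection and adds an output-only bias per token. A minimal, framework-free sketch of that idea follows. The function name `tied_lm_head` and the toy values are hypothetical and only illustrate the arithmetic; this is not the library's implementation.

```python
# Hypothetical sketch of weight tying: logits = hidden_states @ embeddings^T + bias.
# Plain Python lists stand in for tensors so the example has no dependencies.

def tied_lm_head(hidden_states, embedding_matrix, bias):
    """Project hidden states to vocabulary logits by reusing the input embeddings.

    hidden_states:    list of vectors, each of length hidden_size
    embedding_matrix: vocab_size rows, each of length hidden_size
    bias:             output-only bias, one value per vocabulary token
    """
    logits = []
    for h in hidden_states:
        # One dot product per vocabulary entry, plus that entry's bias.
        row = [sum(h_i * e_i for h_i, e_i in zip(h, token_embedding)) + b
               for token_embedding, b in zip(embedding_matrix, bias)]
        logits.append(row)
    return logits


# Toy setup (assumed values): vocab_size=3, hidden_size=2, one sequence position.
embedding_matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.5]
hidden_states = [[2.0, 3.0]]

logits = tied_lm_head(hidden_states, embedding_matrix, bias)
print(logits)
```

Because the projection shares its weights with the embeddings, the only new trainable parameter in the head is the bias vector of length vocab_size.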
--- a/transformers/modeling_tf_distilbert.py
+++ b/transformers/modeling_tf_distilbert.py
@@ -31,7 +31,6 @@ import tensorflow as tf
 from .configuration_distilbert import DistilBertConfig
 from .modeling_tf_utils import TFPreTrainedModel, TFSharedEmbeddings, shape_list, get_initializer
 from .file_utils import add_start_docstrings
-from .modeling_tf_pytorch_utils import load_pytorch_checkpoint_in_tf2_model

 logger = logging.getLogger(__name__)

@@ -45,7 +44,7 @@ TF_DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP = {
 ### UTILS AND BUILDING BLOCKS OF THE ARCHITECTURE ###
 def gelu(x):
    """ Gaussian Error Linear Unit.
-    Original Implementation of the gelu activation function in Google Bert repo when initialy created.
+    Original Implementation of the gelu activation function in Google Bert repo when initially created.
        For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):
        0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
        Also see https://arxiv.org/abs/1606.08415
@@ -66,14 +65,6 @@ def gelu_new(x):
        (np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))
    return x * cdf

-def load_distilbert_pt_weights_in_tf2(tf_model, pytorch_checkpoint_path):
-    # build the network
-    inputs_list = tf.constant([[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]])
-    attns_list = tf.constant([[1, 1, 0, 0, 1], [1, 1, 1, 0, 0], [1, 0, 0, 1, 1]])
-    tf_inputs = [inputs_list, attns_list]
-    tfo = tf_model(tf_inputs, training=False)
-    return load_pytorch_checkpoint_in_tf2_model(tf_model, pytorch_checkpoint_path, tf_inputs=tf_inputs)
-
 class TFEmbeddings(tf.keras.layers.Layer):
    def __init__(self, config, **kwargs):
        super(TFEmbeddings, self).__init__(**kwargs)
@@ -226,8 +217,6 @@ class TFMultiHeadSelfAttention(tf.keras.layers.Layer):

        dim_per_head = self.dim // self.n_heads

-        assert 2 <= len(tf.shape(mask)) <= 3
-        causal = (len(tf.shape(mask)) == 3)
        mask_reshape = [bs, 1, 1, k_length]

        def shape(x):
@@ -456,7 +445,6 @@ class TFDistilBertPreTrainedModel(TFPreTrainedModel):
    """
    config_class = DistilBertConfig
    pretrained_model_archive_map = TF_DISTILBERT_PRETRAINED_MODEL_ARCHIVE_MAP
-    load_pt_weights = load_distilbert_pt_weights_in_tf2
    base_model_prefix = "distilbert"


@@ -603,7 +591,7 @@ class TFDistilBertForMaskedLM(TFDistilBertPreTrainedModel):
        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
        model = TFDistilBertForMaskedLM.from_pretrained('distilbert-base-uncased')
        input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :]  # Batch size 1
-        outputs = model(input_ids, masked_lm_labels=input_ids)
+        outputs = model(input_ids)
        prediction_scores = outputs[0]

    """
@@ -715,9 +703,7 @@ class TFDistilBertForQuestionAnswering(TFDistilBertPreTrainedModel):
        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
        model = TFDistilBertForQuestionAnswering.from_pretrained('distilbert-base-uncased')
        input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :]  # Batch size 1
-        start_positions = tf.constant([1])
-        end_positions = tf.constant([3])
-        outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)
+        outputs = model(input_ids)
        start_scores, end_scores = outputs[:2]

    """
--- a/transformers/modeling_tf_gpt2.py
+++ b/transformers/modeling_tf_gpt2.py
@@ -32,21 +32,13 @@ from .modeling_tf_utils import (TFPreTrainedModel, TFConv1D, TFSharedEmbeddings,
                                TFSequenceSummary, shape_list, get_initializer)
 from .configuration_gpt2 import GPT2Config
 from .file_utils import add_start_docstrings
-from .modeling_tf_pytorch_utils import load_pytorch_checkpoint_in_tf2_model

 logger = logging.getLogger(__name__)

 TF_GPT2_PRETRAINED_MODEL_ARCHIVE_MAP = {"gpt2": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-tf_model.h5",
                                     "gpt2-medium": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-tf_model.h5",
-                                     "gpt2-large": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-tf_model.h5"}
-
-
-def load_gpt2_pt_weights_in_tf2(tf_model, pytorch_checkpoint_path):
-    # build the network
-    inputs_list = [[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]]
-    tf_inputs = tf.constant(inputs_list)
-    tfo = tf_model(tf_inputs, training=False)
-    return load_pytorch_checkpoint_in_tf2_model(tf_model, pytorch_checkpoint_path, tf_inputs=tf_inputs)
+                                     "gpt2-large": "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-large-tf_model.h5",
+                                     "distilgpt2": "https://s3.amazonaws.com/models.huggingface.co/bert/distilgpt2-tf_model.h5",}


 def gelu(x):
@@ -349,7 +341,6 @@ class TFGPT2PreTrainedModel(TFPreTrainedModel):
    """
    config_class = GPT2Config
    pretrained_model_archive_map = TF_GPT2_PRETRAINED_MODEL_ARCHIVE_MAP
-    load_pt_weights = load_gpt2_pt_weights_in_tf2
    base_model_prefix = "transformer"


--- a/transformers/modeling_tf_openai.py
+++ b/transformers/modeling_tf_openai.py
@@ -32,21 +32,12 @@ from .modeling_tf_utils import (TFPreTrainedModel, TFConv1D, TFSharedEmbeddings,
                                TFSequenceSummary, shape_list, get_initializer)
 from .configuration_openai import OpenAIGPTConfig
 from .file_utils import add_start_docstrings
-from .modeling_tf_pytorch_utils import load_pytorch_checkpoint_in_tf2_model

 logger = logging.getLogger(__name__)

 TF_OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP = {"openai-gpt": "https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-tf_model.h5"}


-def load_openai_gpt_pt_weights_in_tf2(tf_model, pytorch_checkpoint_path):
-    # build the network
-    inputs_list = [[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]]
-    tf_inputs = tf.constant(inputs_list)
-    tfo = tf_model(tf_inputs, training=False)
-    return load_pytorch_checkpoint_in_tf2_model(tf_model, pytorch_checkpoint_path, tf_inputs=tf_inputs)
-
-
 def gelu(x):
    """Gaussian Error Linear Unit.
    This is a smoother version of the RELU.
@@ -335,7 +326,6 @@ class TFOpenAIGPTPreTrainedModel(TFPreTrainedModel):
    """
    config_class = OpenAIGPTConfig
    pretrained_model_archive_map = TF_OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_MAP
-    load_pt_weights = load_openai_gpt_pt_weights_in_tf2
    base_model_prefix = "transformer"


--- a/transformers/modeling_tf_pytorch_utils.py
+++ b/transformers/modeling_tf_pytorch_utils.py
@@ -25,8 +25,6 @@ import numpy

 logger = logging.getLogger(__name__)

-DUMMY_INPUTS = [[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]]
-
 def convert_tf_weight_name_to_pt_weight_name(tf_name, start_prefix_to_remove=''):
    """ Convert a TF 2.0 model variable name in a pytorch model weight name.

@@ -105,7 +103,7 @@ def load_pytorch_weights_in_tf2_model(tf_model, pt_state_dict, tf_inputs=None, a
        raise e

    if tf_inputs is None:
-        tf_inputs = tf.constant(DUMMY_INPUTS)
+        tf_inputs = tf_model.dummy_inputs

    if tf_inputs is not None:
        tfo = tf_model(tf_inputs, training=False)  # Make sure model is built
--- a/transformers/modeling_tf_roberta.py
+++ b/transformers/modeling_tf_roberta.py
@@ -26,7 +26,6 @@ import tensorflow as tf
 from .configuration_roberta import RobertaConfig
 from .modeling_tf_utils import TFPreTrainedModel, get_initializer
 from .file_utils import add_start_docstrings
-from .modeling_tf_pytorch_utils import load_pytorch_checkpoint_in_tf2_model

 from .modeling_tf_bert import TFBertEmbeddings, TFBertMainLayer, gelu, gelu_new

@@ -38,14 +37,6 @@ TF_ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP = {
    'roberta-large-mnli': "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-mnli-tf_model.h5",
 }

-def load_roberta_pt_weights_in_tf2(tf_model, pytorch_checkpoint_path):
-    # build the network
-    inputs_list = [[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]]
-    tf_inputs = tf.constant(inputs_list)
-    tfo = tf_model(tf_inputs, training=False)
-    return load_pytorch_checkpoint_in_tf2_model(tf_model, pytorch_checkpoint_path, tf_inputs=tf_inputs)
-
-
 class TFRobertaEmbeddings(TFBertEmbeddings):
    """
    Same as BertEmbeddings with a tiny tweak for positional embeddings indexing.
@@ -83,7 +74,7 @@ class TFRobertaMainLayer(TFBertMainLayer):
            input_ids = inputs

        if tf.not_equal(tf.reduce_sum(input_ids[:, 0]), 0):
-            logger.warning("A sequence with no special tokens has been passed to the RoBERTa model. "
+            tf.print("A sequence with no special tokens has been passed to the RoBERTa model. "
                           "This model requires special tokens in order to work. "
                           "Please specify add_special_tokens=True in your encoding.")

@@ -96,7 +87,6 @@ class TFRobertaPreTrainedModel(TFPreTrainedModel):
    """
    config_class = RobertaConfig
    pretrained_model_archive_map = TF_ROBERTA_PRETRAINED_MODEL_ARCHIVE_MAP
-    load_pt_weights = load_roberta_pt_weights_in_tf2
    base_model_prefix = "roberta"


--- a/transformers/modeling_tf_transfo_xl.py
+++ b/transformers/modeling_tf_transfo_xl.py
@@ -33,7 +33,6 @@ from .configuration_transfo_xl import TransfoXLConfig
 from .modeling_tf_utils import TFPreTrainedModel, TFConv1D, TFSequenceSummary, shape_list, get_initializer
 from .modeling_tf_transfo_xl_utilities import TFAdaptiveSoftmaxMask
 from .file_utils import add_start_docstrings
-from .modeling_tf_pytorch_utils import load_pytorch_checkpoint_in_tf2_model

 logger = logging.getLogger(__name__)

@@ -41,14 +40,6 @@ TF_TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_MAP = {
    'transfo-xl-wt103': "https://s3.amazonaws.com/models.huggingface.co/bert/transfo-xl-wt103-tf_model.h5",
 }

-def load_transfo_xl_pt_weights_in_tf2(tf_model, pytorch_checkpoint_path):
-    # build the network
-    inputs_list = [[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]]
-    tf_inputs = tf.constant(inputs_list)
-    tfo = tf_model(tf_inputs, training=False)
-    return load_pytorch_checkpoint_in_tf2_model(tf_model, pytorch_checkpoint_path, tf_inputs=tf_inputs)
-
-
 class TFPositionalEmbedding(tf.keras.layers.Layer):
    def __init__(self, demb, **kwargs):
        super(TFPositionalEmbedding, self).__init__(**kwargs)
@@ -577,7 +568,6 @@ class TFTransfoXLPreTrainedModel(TFPreTrainedModel):
    """
    config_class = TransfoXLConfig
    pretrained_model_archive_map = TF_TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_MAP
-    load_pt_weights = load_transfo_xl_pt_weights_in_tf2
    base_model_prefix = "transformer"


--- a/transformers/modeling_tf_utils.py
+++ b/transformers/modeling_tf_utils.py
@@ -25,9 +25,11 @@ import tensorflow as tf

 from .configuration_utils import PretrainedConfig
 from .file_utils import cached_path, WEIGHTS_NAME, TF_WEIGHTS_NAME, TF2_WEIGHTS_NAME
+from .modeling_tf_pytorch_utils import load_pytorch_checkpoint_in_tf2_model

 logger = logging.getLogger(__name__)

+DUMMY_INPUTS = [[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]]

 class TFPreTrainedModel(tf.keras.Model):
    r""" Base class for all TF models.
@@ -48,8 +50,8 @@ class TFPreTrainedModel(tf.keras.Model):
    """
    config_class = None
    pretrained_model_archive_map = {}
-    load_pt_weights = lambda model, config, path: None
    base_model_prefix = ""
+    dummy_inputs = tf.constant(DUMMY_INPUTS)  # dummy inputs to build the network

    def __init__(self, config, *inputs, **kwargs):
        super(TFPreTrainedModel, self).__init__(*inputs, **kwargs)
@@ -262,17 +264,16 @@ class TFPreTrainedModel(tf.keras.Model):

        if from_pt:
            # Load from a PyTorch checkpoint
-            return cls.load_pt_weights(model, resolved_archive_file)
+            return load_pytorch_checkpoint_in_tf2_model(model, resolved_archive_file)

-        inputs = tf.constant([[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]])
-        ret = model(inputs, training=False)  # build the network with dummy inputs
+        ret = model(model.dummy_inputs, training=False)  # build the network with dummy inputs

        assert os.path.isfile(resolved_archive_file), "Error retrieving file {}".format(resolved_archive_file)
        # 'by_name' allow us to do transfer learning by skipping/adding layers
        # see https://github.com/tensorflow/tensorflow/blob/00fad90125b18b80fe054de1055770cfb8fe4ba3/tensorflow/python/keras/engine/network.py#L1339-L1357
        model.load_weights(resolved_archive_file, by_name=True)

-        ret = model(inputs, training=False)  # Make sure restore ops are run
+        ret = model(model.dummy_inputs, training=False)  # Make sure restore ops are run

        return model

@@ -393,26 +394,26 @@ class TFSequenceSummary(tf.keras.layers.Layer):
            # We can probably just use the multi-head attention module of PyTorch >=1.1.0
            raise NotImplementedError

-        self.summary = None
-        if hasattr(config, 'summary_use_proj') and config.summary_use_proj:
+        self.has_summary = hasattr(config, 'summary_use_proj') and config.summary_use_proj
+        if self.has_summary:
            if hasattr(config, 'summary_proj_to_labels') and config.summary_proj_to_labels and config.num_labels > 0:
                num_classes = config.num_labels
            else:
                num_classes = config.hidden_size
            self.summary = tf.keras.layers.Dense(num_classes,
-                                                 kernel_initializer=get_initializer(initializer_range),
-                                                 name='summary')
+                                                    kernel_initializer=get_initializer(initializer_range),
+                                                    name='summary')

-        self.activation = None
-        if hasattr(config, 'summary_activation') and config.summary_activation == 'tanh':
+        self.has_activation = hasattr(config, 'summary_activation') and config.summary_activation == 'tanh'
+        if self.has_activation:
            self.activation = tf.keras.activations.tanh

-        self.first_dropout = None
-        if hasattr(config, 'summary_first_dropout') and config.summary_first_dropout > 0:
+        self.has_first_dropout = hasattr(config, 'summary_first_dropout') and config.summary_first_dropout > 0
+        if self.has_first_dropout:
            self.first_dropout = tf.keras.layers.Dropout(config.summary_first_dropout)

-        self.last_dropout = None
-        if hasattr(config, 'summary_last_dropout') and config.summary_last_dropout > 0:
+        self.has_last_dropout = hasattr(config, 'summary_last_dropout') and config.summary_last_dropout > 0
+        if self.has_last_dropout:
            self.last_dropout = tf.keras.layers.Dropout(config.summary_last_dropout)

    def call(self, inputs, training=False):
@@ -455,17 +456,17 @@ class TFSequenceSummary(tf.keras.layers.Layer):
        elif self.summary_type == 'attn':
            raise NotImplementedError

-        if training and self.first_dropout is not None:
-            output = self.first_dropout(output)
+        if self.has_first_dropout:
+            output = self.first_dropout(output, training=training)

-        if self.summary is not None:
+        if self.has_summary:
            output = self.summary(output)

-        if self.activation is not None:
+        if self.has_activation:
            output = self.activation(output)

-        if training and self.last_dropout is not None:
-            output = self.last_dropout(output)
+        if self.has_last_dropout:
+            output = self.last_dropout(output, training=training)

        return output

--- a/transformers/modeling_tf_xlm.py
+++ b/transformers/modeling_tf_xlm.py
@@ -25,9 +25,8 @@ import numpy as np
 import tensorflow as tf

 from .configuration_xlm import XLMConfig
-from .modeling_tf_utils import TFPreTrainedModel, TFSharedEmbeddings, TFSequenceSummary, shape_list, get_initializer
+from .modeling_tf_utils import TFPreTrainedModel, TFSharedEmbeddings, TFSequenceSummary, shape_list, get_initializer, DUMMY_INPUTS
 from .file_utils import add_start_docstrings
-from .modeling_tf_pytorch_utils import load_pytorch_checkpoint_in_tf2_model

 logger = logging.getLogger(__name__)

@@ -45,19 +44,6 @@ TF_XLM_PRETRAINED_MODEL_ARCHIVE_MAP = {
 }


-def load_xlm_pt_weights_in_tf2(tf_model, pytorch_checkpoint_path):
-    # build the network
-    inputs_list = tf.constant([[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]])
-    attns_list = tf.constant([[1, 1, 0, 0, 1], [1, 1, 1, 0, 0], [1, 0, 0, 1, 1]])
-    if tf_model.config.use_lang_emb and tf_model.config.n_langs > 1:
-        langs_list = tf.constant([[1, 1, 0, 0, 1], [1, 1, 1, 0, 0], [1, 0, 0, 1, 1]])
-    else:
-        langs_list = None
-    tf_inputs = [inputs_list, attns_list, langs_list]
-    tfo = tf_model(tf_inputs, training=False)
-    return load_pytorch_checkpoint_in_tf2_model(tf_model, pytorch_checkpoint_path, tf_inputs=tf_inputs)
-
-
 def create_sinusoidal_embeddings(n_pos, dim, out):
    position_enc = np.array([
        [pos / np.power(10000, 2 * (j // 2) / dim) for j in range(dim)]
@@ -69,7 +55,7 @@ def create_sinusoidal_embeddings(n_pos, dim, out):

 def gelu(x):
    """ Gaussian Error Linear Unit.
-    Original Implementation of the gelu activation function in Google Bert repo when initialy created.
+    Original Implementation of the gelu activation function in Google Bert repo when initially created.
        For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):
        0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
        Also see https://arxiv.org/abs/1606.08415
@@ -441,9 +427,19 @@ class TFXLMPreTrainedModel(TFPreTrainedModel):
    """
    config_class = XLMConfig
    pretrained_model_archive_map = TF_XLM_PRETRAINED_MODEL_ARCHIVE_MAP
-    load_pt_weights = load_xlm_pt_weights_in_tf2
    base_model_prefix = "transformer"

+    @property
+    def dummy_inputs(self):
+        # Sometimes XLM has language embeddings so don't forget to build them as well if needed
+        inputs_list = tf.constant([[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]])
+        attns_list = tf.constant([[1, 1, 0, 0, 1], [1, 1, 1, 0, 0], [1, 0, 0, 1, 1]])
+        if self.config.use_lang_emb and self.config.n_langs > 1:
+            langs_list = tf.constant([[1, 1, 0, 0, 1], [1, 1, 1, 0, 0], [1, 0, 0, 1, 1]])
+        else:
+            langs_list = None
+        return [inputs_list, attns_list, langs_list]
+

 XLM_START_DOCSTRING = r"""    The XLM model was proposed in
    `Cross-lingual Language Model Pretraining`_
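Review note on the `dummy_inputs` change above: the base class now carries a default set of dummy inputs, and XLM overrides it with a property so the language inputs are only built when the config asks for them. A minimal sketch of that pattern in plain Python follows. The class names `BaseModel` and `LangAwareModel` are hypothetical stand-ins, and lists replace `tf.constant` so the example runs without TensorFlow.

```python
from types import SimpleNamespace

DUMMY_INPUTS = [[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]]


class BaseModel:
    """Stand-in for a base model class with class-level default dummy inputs."""
    dummy_inputs = DUMMY_INPUTS

    def __init__(self, config):
        self.config = config


class LangAwareModel(BaseModel):
    """Stand-in for a model whose dummy inputs depend on its config."""

    @property
    def dummy_inputs(self):
        inputs_list = [[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]]
        attns_list = [[1, 1, 0, 0, 1], [1, 1, 1, 0, 0], [1, 0, 0, 1, 1]]
        # Only build language inputs when language embeddings are in use.
        if self.config.use_lang_emb and self.config.n_langs > 1:
            langs_list = [[1, 1, 0, 0, 1], [1, 1, 1, 0, 0], [1, 0, 0, 1, 1]]
        else:
            langs_list = None
        return [inputs_list, attns_list, langs_list]


# Generic loading code can read model.dummy_inputs without knowing the model type.
plain = BaseModel(config=None)
with_langs = LangAwareModel(SimpleNamespace(use_lang_emb=True, n_langs=2))
without_langs = LangAwareModel(SimpleNamespace(use_lang_emb=True, n_langs=1))

print(plain.dummy_inputs == DUMMY_INPUTS)
print(with_langs.dummy_inputs[2] is not None)
print(without_langs.dummy_inputs[2] is None)
```

This is what lets the per-model `load_*_pt_weights_in_tf2` helpers be deleted: the shared loader only needs `tf_model.dummy_inputs` to build the network.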
--- a/transformers/modeling_tf_xlnet.py
+++ b/transformers/modeling_tf_xlnet.py
@@ -30,7 +30,6 @@ import tensorflow as tf
 from .configuration_xlnet import XLNetConfig
 from .modeling_tf_utils import TFPreTrainedModel, TFSharedEmbeddings, TFSequenceSummary, shape_list, get_initializer
 from .file_utils import add_start_docstrings
-from .modeling_tf_pytorch_utils import load_pytorch_checkpoint_in_tf2_model


 logger = logging.getLogger(__name__)
@@ -41,13 +40,6 @@ TF_XLNET_PRETRAINED_MODEL_ARCHIVE_MAP = {
 }


-def load_xlnet_pt_weights_in_tf2(tf_model, pytorch_checkpoint_path):
-    inputs_list = [[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]]
-    tf_inputs = tf.constant(inputs_list)
-    tfo = tf_model(tf_inputs, training=False)  # build the network
-    return load_pytorch_checkpoint_in_tf2_model(tf_model, pytorch_checkpoint_path, tf_inputs=tf_inputs)
-
-
 def gelu(x):
    """ Implementation of the gelu activation function.
        XLNet is using OpenAI GPT's gelu
@@ -362,6 +354,7 @@ class TFXLNetMainLayer(tf.keras.layers.Layer):
        super(TFXLNetMainLayer, self).__init__(**kwargs)
        self.output_attentions = config.output_attentions
        self.output_hidden_states = config.output_hidden_states
+        self.output_past = config.output_past

        self.mem_len = config.mem_len
        self.reuse_len = config.reuse_len
@@ -421,16 +414,13 @@ class TFXLNetMainLayer(tf.keras.layers.Layer):

    def cache_mem(self, curr_out, prev_mem):
        """cache hidden states into memory."""
-        if self.mem_len is None or self.mem_len == 0:
-            return None
-        else:
-            if self.reuse_len is not None and self.reuse_len > 0:
-                curr_out = curr_out[:self.reuse_len]
+        if self.reuse_len is not None and self.reuse_len > 0:
+            curr_out = curr_out[:self.reuse_len]

-            if prev_mem is None:
-                new_mem = curr_out[-self.mem_len:]
-            else:
-                new_mem = tf.concat([prev_mem, curr_out], 0)[-self.mem_len:]
+        if prev_mem is None:
+            new_mem = curr_out[-self.mem_len:]
+        else:
+            new_mem = tf.concat([prev_mem, curr_out], 0)[-self.mem_len:]

        return tf.stop_gradient(new_mem)

@@ -546,8 +536,8 @@ class TFXLNetMainLayer(tf.keras.layers.Layer):
            raise ValueError('Unsupported attention type: {}'.format(self.attn_type))

        # data mask: input mask & perm mask
-        assert input_mask is None or attention_mask is None, "You can only use one of input_mask (uses 1 for padding) "
-        "or attention_mask (uses 0 for padding, added for compatbility with BERT). Please choose one."
+        assert input_mask is None or attention_mask is None, "You can only use one of input_mask (uses 1 for padding) " \
+            "or attention_mask (uses 0 for padding, added for compatibility with BERT). Please choose one."
        if input_mask is None and attention_mask is not None:
            input_mask = 1.0 - attention_mask
        if input_mask is not None and perm_mask is not None:
@@ -632,7 +622,8 @@ class TFXLNetMainLayer(tf.keras.layers.Layer):
        hidden_states = []
        for i, layer_module in enumerate(self.layer):
            # cache new mems
-            new_mems = new_mems + (self.cache_mem(output_h, mems[i]),)
+            if self.mem_len is not None and self.mem_len > 0 and self.output_past:
+                new_mems = new_mems + (self.cache_mem(output_h, mems[i]),)
            if self.output_hidden_states:
                hidden_states.append((output_h, output_g) if output_g is not None else output_h)

@@ -650,7 +641,11 @@ class TFXLNetMainLayer(tf.keras.layers.Layer):
        output = self.dropout(output_g if output_g is not None else output_h, training=training)

        # Prepare outputs, we transpose back here to shape [bsz, len, hidden_dim] (cf. beginning of forward() method)
-        outputs = (tf.transpose(output, perm=(1, 0, 2)), new_mems)
+        outputs = (tf.transpose(output, perm=(1, 0, 2)),)
+
+        if self.mem_len is not None and self.mem_len > 0 and self.output_past:
+            outputs = outputs + (new_mems,)
+
        if self.output_hidden_states:
            if output_g is not None:
                hidden_states = tuple(tf.transpose(h, perm=(1, 0, 2)) for hs in hidden_states for h in hs)
@@ -661,7 +656,7 @@ class TFXLNetMainLayer(tf.keras.layers.Layer):
            attentions = tuple(tf.transpose(t, perm=(2, 3, 0, 1)) for t in attentions)
            outputs = outputs + (attentions,)

-        return outputs  # outputs, new_mems, (hidden_states), (attentions)
+        return outputs  # outputs, (new_mems), (hidden_states), (attentions)


 class TFXLNetPreTrainedModel(TFPreTrainedModel):
@@ -670,7 +665,6 @@ class TFXLNetPreTrainedModel(TFPreTrainedModel):
    """
    config_class = XLNetConfig
    pretrained_model_archive_map = TF_XLNET_PRETRAINED_MODEL_ARCHIVE_MAP
-    load_pt_weights = load_xlnet_pt_weights_in_tf2
    base_model_prefix = "transformer"


@@ -777,7 +771,7 @@ class TFXLNetModel(TFXLNetPreTrainedModel):
    Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
        **last_hidden_state**: ``tf.Tensor`` of shape ``(batch_size, sequence_length, hidden_size)``
            Sequence of hidden-states at the last layer of the model.
-        **mems**:
+        **mems**: (`optional`, returned when ``config.mem_len > 0`` and ``config.output_past=True``)
            list of ``tf.Tensor`` (one for each layer):
            that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
            if config.mem_len > 0 else tuple of None. Can be used to speed up sequential decoding and attend to longer context.
@@ -819,7 +813,7 @@ class TFXLNetLMHeadModel(TFXLNetPreTrainedModel):
    Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
        **prediction_scores**: ``tf.Tensor`` of shape ``(batch_size, sequence_length, config.vocab_size)``
            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
-        **mems**:
+        **mems**: (`optional`, returned when ``config.mem_len > 0`` and ``config.output_past=True``)
            list of ``tf.Tensor`` (one for each layer):
            that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
            if config.mem_len > 0 else tuple of None. Can be used to speed up sequential decoding and attend to longer context.
@@ -863,7 +857,7 @@ class TFXLNetLMHeadModel(TFXLNetPreTrainedModel):

        outputs = (logits,) + transformer_outputs[1:]  # Keep mems, hidden states, attentions if there are in it

-        return outputs  # return logits, mems, (hidden states), (attentions)
+        return outputs  # return logits, (mems), (hidden states), (attentions)


@add_start_docstrings("""XLNet Model with a sequence classification/regression head on top (a linear layer on top of
@@ -874,7 +868,7 @@ class TFXLNetForSequenceClassification(TFXLNetPreTrainedModel):
    Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
        **logits**: ``tf.Tensor`` of shape ``(batch_size, config.num_labels)``
            Classification (or regression if config.num_labels==1) scores (before SoftMax).
-        **mems**:
+        **mems**: (`optional`, returned when ``config.mem_len > 0``)
            list of ``tf.Tensor`` (one for each layer):
            that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
            if config.mem_len > 0 else tuple of None. Can be used to speed up sequential decoding and attend to longer context.
@@ -918,7 +912,7 @@ class TFXLNetForSequenceClassification(TFXLNetPreTrainedModel):

        outputs = (logits,) + transformer_outputs[1:]  # Keep mems, hidden states, attentions if there are in it

-        return outputs  # return logits, mems, (hidden states), (attentions)
+        return outputs  # return logits, (mems), (hidden states), (attentions)


 # @add_start_docstrings("""XLNet Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of
@@ -932,6 +926,11 @@ class TFXLNetForQuestionAnsweringSimple(TFXLNetPreTrainedModel):
            Span-start scores (before SoftMax).
        **end_scores**: ``tf.Tensor`` of shape ``(batch_size, sequence_length,)``
            Span-end scores (before SoftMax).
+        **mems**: (`optional`, returned when ``config.mem_len > 0``)
+            list of ``tf.Tensor`` (one for each layer):
+            that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
+            if config.mem_len > 0 else tuple of None. Can be used to speed up sequential decoding and attend to longer context.
+            See details in the docstring of the `mems` input above.
        **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
            list of ``tf.Tensor`` (one for the output of each layer + the output of the embeddings)
            of shape ``(batch_size, sequence_length, hidden_size)``:
@@ -971,7 +970,7 @@ class TFXLNetForQuestionAnsweringSimple(TFXLNetPreTrainedModel):

        outputs = (start_logits, end_logits,) + transformer_outputs[1:]  # Keep mems, hidden states, attentions if there are in it

-        return outputs  # start_logits, end_logits, (hidden_states), (attentions)
+        return outputs  # start_logits, end_logits, (mems), (hidden_states), (attentions)

 # @add_start_docstrings("""XLNet Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of
 #     the hidden-states output to compute `span start logits` and `span end logits`). """,
--- a/transformers/modeling_utils.py
+++ b/transformers/modeling_utils.py
@@ -316,20 +316,20 @@ class PreTrainedModel(nn.Module):
            # redirect to the cache, if necessary
            try:
                resolved_archive_file = cached_path(archive_file, cache_dir=cache_dir, force_download=force_download, proxies=proxies)
-            except EnvironmentError as e:
+            except EnvironmentError:
                if pretrained_model_name_or_path in cls.pretrained_model_archive_map:
-                    logger.error(
-                        "Couldn't reach server at '{}' to download pretrained weights.".format(
-                            archive_file))
+                    msg = "Couldn't reach server at '{}' to download pretrained weights.".format(
+                            archive_file)
                else:
-                    logger.error(
-                        "Model name '{}' was not found in model name list ({}). "
-                        "We assumed '{}' was a path or url but couldn't find any file "
-                        "associated to this path or url.".format(
+                    msg = "Model name '{}' was not found in model name list ({}). " \
+                        "We assumed '{}' was a path or url to model weight files named one of {} but " \
+                        "couldn't find any such file at this path or url.".format(
                            pretrained_model_name_or_path,
                            ', '.join(cls.pretrained_model_archive_map.keys()),
-                            archive_file))
-                raise e
+                            archive_file,
+                            [WEIGHTS_NAME, TF2_WEIGHTS_NAME, TF_WEIGHTS_NAME])
+                raise EnvironmentError(msg)
+
            if resolved_archive_file == archive_file:
                logger.info("loading weights file {}".format(archive_file))
            else:
@@ -501,7 +501,10 @@ class PoolerEndLogits(nn.Module):
        x = self.dense_1(x).squeeze(-1)

        if p_mask is not None:
-            x = x * (1 - p_mask) - 1e30 * p_mask
+            if next(self.parameters()).dtype == torch.float16:
+                x = x * (1 - p_mask) - 65500 * p_mask
+            else:
+                x = x * (1 - p_mask) - 1e30 * p_mask

        return x

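Note on the `PoolerEndLogits` hunk above: the diff does not say why the float16 branch uses 65500 instead of 1e30. A plausible reading is that 1e30 overflows to infinity in half precision, and `inf * 0` is NaN, so unmasked positions (where `p_mask` is 0) would be corrupted. The sketch below imitates this with the standard library only. `to_half` and `masked_logit` are hypothetical helpers, not part of the library, and only the constant is rounded to half precision, not every intermediate result.

```python
import math
import struct


def to_half(value):
    """Round a Python float to the nearest IEEE 754 half-precision value.

    Assumption: out-of-range values map to +/-inf, which is meant to imitate
    casting a large float32 constant to float16.
    """
    try:
        return struct.unpack("<e", struct.pack("<e", float(value)))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, value)


def masked_logit(score, p_mask, big):
    """Compute score * (1 - p_mask) - big * p_mask with `big` held in half precision."""
    return score * (1 - p_mask) - to_half(big) * p_mask


# p_mask == 0 means "keep this position".
kept_with_1e30 = masked_logit(0.5, 0.0, 1e30)       # inf * 0 produces nan
kept_with_65500 = masked_logit(0.5, 0.0, 65500)     # finite * 0 leaves the score intact
masked_with_65500 = masked_logit(0.5, 1.0, 65500)   # masked position gets a large negative value
```

With 65500 the kept score survives unchanged, while a masked position still receives a value near the most negative finite float16.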
--- a/transformers/modeling_xlnet.py
+++ b/transformers/modeling_xlnet.py
@@ -239,45 +239,60 @@ class XLNetRelativeAttention(nn.Module):

        return x

+    @staticmethod
+    def rel_shift_bnij(x, klen=-1):
+        x_size = x.shape
+
+        x = x.reshape(x_size[0], x_size[1], x_size[3], x_size[2])
+        x = x[:, :, 1:, :]
+        x = x.reshape(x_size[0], x_size[1], x_size[2], x_size[3]-1)
+        # Note: the tensor-slice form was faster in my testing than torch.index_select
+        #       However, tracing doesn't like the nature of the slice, and if klen changes
+        #       during the run then it'll fail, whereas index_select will be fine.
+        x = torch.index_select(x, 3, torch.arange(klen, device=x.device, dtype=torch.long))
+        # x = x[:, :, :, :klen]
+
+        return x
+
    def rel_attn_core(self, q_head, k_head_h, v_head_h, k_head_r, seg_mat=None, attn_mask=None, head_mask=None):
        """Core relative positional attention operations."""

        # content based attention score
-        ac = torch.einsum('ibnd,jbnd->ijbn', q_head + self.r_w_bias, k_head_h)
+        ac = torch.einsum('ibnd,jbnd->bnij', q_head + self.r_w_bias, k_head_h)

        # position based attention score
-        bd = torch.einsum('ibnd,jbnd->ijbn', q_head + self.r_r_bias, k_head_r)
-        bd = self.rel_shift(bd, klen=ac.shape[1])
+        bd = torch.einsum('ibnd,jbnd->bnij', q_head + self.r_r_bias, k_head_r)
+        bd = self.rel_shift_bnij(bd, klen=ac.shape[3])

        # segment based attention score
        if seg_mat is None:
            ef = 0
        else:
            ef = torch.einsum('ibnd,snd->ibns', q_head + self.r_s_bias, self.seg_embed)
-            ef = torch.einsum('ijbs,ibns->ijbn', seg_mat, ef)
+            ef = torch.einsum('ijbs,ibns->bnij', seg_mat, ef)

        # merge attention scores and perform masking
        attn_score = (ac + bd + ef) * self.scale
        if attn_mask is not None:
            # attn_score = attn_score * (1 - attn_mask) - 1e30 * attn_mask
            if attn_mask.dtype == torch.float16:
-                attn_score = attn_score - 65500 * attn_mask
+                attn_score = attn_score - 65500 * torch.einsum('ijbn->bnij', attn_mask)
            else:
-                attn_score = attn_score - 1e30 * attn_mask
+                attn_score = attn_score - 1e30 * torch.einsum('ijbn->bnij', attn_mask)

        # attention probability
-        attn_prob = F.softmax(attn_score, dim=1)
+        attn_prob = F.softmax(attn_score, dim=3)
        attn_prob = self.dropout(attn_prob)

        # Mask heads if we want to
        if head_mask is not None:
-            attn_prob = attn_prob * head_mask
+            attn_prob = attn_prob * torch.einsum('ijbn->bnij', head_mask)

        # attention output
-        attn_vec = torch.einsum('ijbn,jbnd->ibnd', attn_prob, v_head_h)
+        attn_vec = torch.einsum('bnij,jbnd->ibnd', attn_prob, v_head_h)

        if self.output_attentions:
-            return attn_vec, attn_prob
+            return attn_vec, torch.einsum('bnij->ijbn', attn_prob)

        return attn_vec

@@ -555,7 +570,7 @@ class XLNetModel(XLNetPreTrainedModel):
    Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
        **last_hidden_state**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)``
            Sequence of hidden-states at the last layer of the model.
-        **mems**:
+        **mems**: (`optional`, returned when ``config.mem_len > 0``)
            list of ``torch.FloatTensor`` (one for each layer):
            that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
            if config.mem_len > 0 else tuple of None. Can be used to speed up sequential decoding and attend to longer context.
@@ -581,6 +596,7 @@ class XLNetModel(XLNetPreTrainedModel):
        super(XLNetModel, self).__init__(config)
        self.output_attentions = config.output_attentions
        self.output_hidden_states = config.output_hidden_states
+        self.output_past = config.output_past

        self.mem_len = config.mem_len
        self.reuse_len = config.reuse_len
@@ -637,16 +653,13 @@ class XLNetModel(XLNetPreTrainedModel):

    def cache_mem(self, curr_out, prev_mem):
        """cache hidden states into memory."""
-        if self.mem_len is None or self.mem_len == 0:
-            return None
-        else:
-            if self.reuse_len is not None and self.reuse_len > 0:
-                curr_out = curr_out[:self.reuse_len]
+        if self.reuse_len is not None and self.reuse_len > 0:
+            curr_out = curr_out[:self.reuse_len]

-            if prev_mem is None:
-                new_mem = curr_out[-self.mem_len:]
-            else:
-                new_mem = torch.cat([prev_mem, curr_out], dim=0)[-self.mem_len:]
+        if prev_mem is None:
+            new_mem = curr_out[-self.mem_len:]
+        else:
+            new_mem = torch.cat([prev_mem, curr_out], dim=0)[-self.mem_len:]

        return new_mem.detach()

@@ -817,8 +830,9 @@ class XLNetModel(XLNetPreTrainedModel):
        attentions = []
        hidden_states = []
        for i, layer_module in enumerate(self.layer):
-            # cache new mems
-            new_mems = new_mems + (self.cache_mem(output_h, mems[i]),)
+            if self.mem_len is not None and self.mem_len > 0 and self.output_past:
+                # cache new mems
+                new_mems = new_mems + (self.cache_mem(output_h, mems[i]),)
            if self.output_hidden_states:
                hidden_states.append((output_h, output_g) if output_g is not None else output_h)

@@ -836,7 +850,11 @@ class XLNetModel(XLNetPreTrainedModel):
        output = self.dropout(output_g if output_g is not None else output_h)

        # Prepare outputs, we transpose back here to shape [bsz, len, hidden_dim] (cf. beginning of forward() method)
-        outputs = (output.permute(1, 0, 2).contiguous(), new_mems)
+        outputs = (output.permute(1, 0, 2).contiguous(),)
+
+        if self.mem_len is not None and self.mem_len > 0 and self.output_past:
+            outputs = outputs + (new_mems,)
+
        if self.output_hidden_states:
            if output_g is not None:
                hidden_states = tuple(h.permute(1, 0, 2).contiguous() for hs in hidden_states for h in hs)
@@ -847,7 +865,7 @@ class XLNetModel(XLNetPreTrainedModel):
            attentions = tuple(t.permute(2, 3, 0, 1).contiguous() for t in attentions)
            outputs = outputs + (attentions,)

-        return outputs  # outputs, new_mems, (hidden_states), (attentions)
+        return outputs  # outputs, (new_mems), (hidden_states), (attentions)


@add_start_docstrings("""XLNet Model with a language modeling head on top
@@ -867,7 +885,7 @@ class XLNetLMHeadModel(XLNetPreTrainedModel):
            Language modeling loss.
        **prediction_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, config.vocab_size)``
            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
-        **mems**:
+        **mems**: (`optional`, returned when ``config.mem_len > 0``)
            list of ``torch.FloatTensor`` (one for each layer):
            that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
            if config.mem_len > 0 else tuple of None. Can be used to speed up sequential decoding and attend to longer context.
@@ -918,7 +936,7 @@ class XLNetLMHeadModel(XLNetPreTrainedModel):
                                               perm_mask=perm_mask,
                                               target_mapping=target_mapping,
                                               token_type_ids=token_type_ids,
-                                               input_mask=input_mask, 
+                                               input_mask=input_mask,
                                               head_mask=head_mask)

        logits = self.lm_loss(transformer_outputs[0])
@@ -932,7 +950,7 @@ class XLNetLMHeadModel(XLNetPreTrainedModel):
                            labels.view(-1))
            outputs = (loss,) + outputs

-        return outputs  # return (loss), logits, mems, (hidden states), (attentions)
+        return outputs  # return (loss), logits, (mems), (hidden states), (attentions)


@add_start_docstrings("""XLNet Model with a sequence classification/regression head on top (a linear layer on top of
@@ -951,7 +969,7 @@ class XLNetForSequenceClassification(XLNetPreTrainedModel):
            Classification (or regression if config.num_labels==1) loss.
        **logits**: ``torch.FloatTensor`` of shape ``(batch_size, config.num_labels)``
            Classification (or regression if config.num_labels==1) scores (before SoftMax).
-        **mems**:
+        **mems**: (`optional`, returned when ``config.mem_len > 0``)
            list of ``torch.FloatTensor`` (one for each layer):
            that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
            if config.mem_len > 0 else tuple of None. Can be used to speed up sequential decoding and attend to longer context.
@@ -992,7 +1010,7 @@ class XLNetForSequenceClassification(XLNetPreTrainedModel):
                                               perm_mask=perm_mask,
                                               target_mapping=target_mapping,
                                               token_type_ids=token_type_ids,
-                                               input_mask=input_mask, 
+                                               input_mask=input_mask,
                                               head_mask=head_mask)
        output = transformer_outputs[0]

@@ -1011,7 +1029,7 @@ class XLNetForSequenceClassification(XLNetPreTrainedModel):
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            outputs = (loss,) + outputs

-        return outputs  # return (loss), logits, mems, (hidden states), (attentions)
+        return outputs  # return (loss), logits, (mems), (hidden states), (attentions)

@add_start_docstrings("""XLNet Model with a multiple choice classification head on top (a linear layer on top of
    the pooled output and a softmax) e.g. for RACE/SWAG tasks. """,
@@ -1046,6 +1064,11 @@ class XLNetForMultipleChoice(XLNetPreTrainedModel):
        **classification_scores**: ``torch.FloatTensor`` of shape ``(batch_size, num_choices)`` where `num_choices` is the size of the second dimension
            of the input tensors. (see `input_ids` above).
            Classification scores (before SoftMax).
+        **mems**: (`optional`, returned when ``config.mem_len > 0``)
+            list of ``torch.FloatTensor`` (one for each layer):
+            that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
+            if config.mem_len > 0 else tuple of None. Can be used to speed up sequential decoding and attend to longer context.
+            See details in the docstring of the `mems` input above.
        **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
            list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
            of shape ``(batch_size, sequence_length, hidden_size)``:
@@ -1102,7 +1125,7 @@ class XLNetForMultipleChoice(XLNetPreTrainedModel):
            loss = loss_fct(reshaped_logits, labels.view(-1))
            outputs = (loss,) + outputs

-        return outputs  # return (loss), logits, mems, (hidden states), (attentions)
+        return outputs  # return (loss), logits, (mems), (hidden states), (attentions)


@add_start_docstrings("""XLNet Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of
@@ -1126,7 +1149,7 @@ class XLNetForQuestionAnsweringSimple(XLNetPreTrainedModel):
            Span-start scores (before SoftMax).
        **end_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length,)``
            Span-end scores (before SoftMax).
-        **mems**:
+        **mems**: (`optional`, returned when ``config.mem_len > 0``)
            list of ``torch.FloatTensor`` (one for each layer):
            that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
            if config.mem_len > 0 else tuple of None. Can be used to speed up sequential decoding and attend to longer context.
@@ -1169,7 +1192,7 @@ class XLNetForQuestionAnsweringSimple(XLNetPreTrainedModel):
                                    perm_mask=perm_mask,
                                    target_mapping=target_mapping,
                                    token_type_ids=token_type_ids,
-                                    input_mask=input_mask, 
+                                    input_mask=input_mask,
                                    head_mask=head_mask)

        sequence_output = outputs[0]
@@ -1197,7 +1220,7 @@ class XLNetForQuestionAnsweringSimple(XLNetPreTrainedModel):
            total_loss = (start_loss + end_loss) / 2
            outputs = (total_loss,) + outputs

-        return outputs  # (loss), start_logits, end_logits, (hidden_states), (attentions)
+        return outputs  # (loss), start_logits, end_logits, (mems), (hidden_states), (attentions)


@add_start_docstrings("""XLNet Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of
@@ -1239,7 +1262,7 @@ class XLNetForQuestionAnswering(XLNetPreTrainedModel):
        **cls_logits**: (`optional`, returned if ``start_positions`` or ``end_positions`` is not provided)
            ``torch.FloatTensor`` of shape ``(batch_size,)``
            Log probabilities for the ``is_impossible`` label of the answers.
-        **mems**:
+        **mems**: (`optional`, returned when ``config.mem_len > 0``)
            list of ``torch.FloatTensor`` (one for each layer):
            that contains pre-computed hidden-states (key and values in the attention blocks) as computed by the model
            if config.mem_len > 0 else tuple of None. Can be used to speed up sequential decoding and attend to longer context.
@@ -1284,7 +1307,7 @@ class XLNetForQuestionAnswering(XLNetPreTrainedModel):
                                               perm_mask=perm_mask,
                                               target_mapping=target_mapping,
                                               token_type_ids=token_type_ids,
-                                               input_mask=input_mask, 
+                                               input_mask=input_mask,
                                               head_mask=head_mask)
        hidden_states = transformer_outputs[0]
        start_logits = self.start_logits(hidden_states, p_mask=p_mask)
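Note on `rel_shift_bnij` in the `modeling_xlnet.py` changes above: the reshape, drop, reshape, select sequence can be hard to picture. The sketch below replays the same steps on a single (batch, head) slice using plain nested lists, so the shifting pattern is visible. `rel_shift_2d` is a hypothetical name used only for illustration. It assumes a rectangular `i_len x j_len` input and is not the library implementation.

```python
def rel_shift_2d(rows, klen):
    """Replay the rel_shift_bnij steps on one (i_len x j_len) score matrix."""
    i_len = len(rows)
    j_len = len(rows[0])

    # Flatten in row-major order; this is what a reshape operates on.
    flat = [value for row in rows for value in row]

    # View as (j_len, i_len) and drop the first row, i.e. the first i_len values.
    flat = flat[i_len:]

    # View the remainder as (i_len, j_len - 1).
    width = j_len - 1
    shifted = [flat[r * width:(r + 1) * width] for r in range(i_len)]

    # Keep the first klen columns (the index_select step in the diff).
    return [row[:klen] for row in shifted]


scores = [[0, 1, 2, 3],
          [4, 5, 6, 7]]
shifted = rel_shift_2d(scores, klen=2)
```

In this example row 0 starts at original column 2 and row 1 at original column 1, so each query row is offset by one position relative to the previous one.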
--- a/transformers/tests/modeling_common_test.py
+++ b/transformers/tests/modeling_common_test.py
@@ -17,8 +17,10 @@ from __future__ import division
 from __future__ import print_function

 import copy
+import sys
 import os
 import shutil
+import tempfile
 import json
 import random
 import uuid
@@ -31,6 +33,7 @@ from transformers import is_torch_available

 if is_torch_available():
    import torch
+    import numpy as np

    from transformers import (PretrainedConfig, PreTrainedModel,
                                    BertModel, BertConfig, BERT_PRETRAINED_MODEL_ARCHIVE_MAP,
@@ -38,6 +41,20 @@ if is_torch_available():
 else:
    pytestmark = pytest.mark.skip("Require Torch")

+if sys.version_info[0] == 2:
+    import cPickle as pickle
+
+    class TemporaryDirectory(object):
+        """Context manager for tempfile.mkdtemp() so it's usable with "with" statement."""
+        def __enter__(self):
+            self.name = tempfile.mkdtemp()
+            return self.name
+        def __exit__(self, exc_type, exc_value, traceback):
+            shutil.rmtree(self.name)
+else:
+    import pickle
+    TemporaryDirectory = tempfile.TemporaryDirectory
+    unicode = str

 def _config_zero_init(config):
    configs_no_init = copy.deepcopy(config)
@@ -57,6 +74,29 @@ class CommonTestCases:
        test_resize_embeddings = True
        test_head_masking = True

+        def test_save_load(self):
+            config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
+
+            for model_class in self.all_model_classes:
+                model = model_class(config)
+                model.eval()
+                with torch.no_grad():
+                    outputs = model(**inputs_dict)
+
+                with TemporaryDirectory() as tmpdirname:
+                    model.save_pretrained(tmpdirname)
+                    model = model_class.from_pretrained(tmpdirname)
+                    with torch.no_grad():
+                        after_outputs = model(**inputs_dict)
+
+                    # Make sure we don't have nans
+                    out_1 = after_outputs[0].numpy()
+                    out_2 = outputs[0].numpy()
+                    out_1 = out_1[~np.isnan(out_1)]
+                    out_2 = out_2[~np.isnan(out_2)]
+                    max_diff = np.amax(np.abs(out_1 - out_2))
+                    self.assertLessEqual(max_diff, 1e-5)
+
        def test_initialization(self):
            config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()

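Note on `test_save_load` above: the test removes NaNs from `out_1` and `out_2` independently before subtracting. If the NaN positions ever differed between the two outputs, the filtered arrays would have different lengths and the comparison would fail or misalign. One possible alternative, sketched here with the standard library on flat sequences, is to drop a position when either side is NaN. `max_abs_diff_ignoring_nan` is a hypothetical helper, not something the test suite defines.

```python
import math


def max_abs_diff_ignoring_nan(before, after):
    """Largest |a - b| over positions where neither value is NaN.

    Returns 0.0 when no comparable position remains.
    """
    pairs = [
        (a, b)
        for a, b in zip(before, after)
        if not (math.isnan(a) or math.isnan(b))
    ]
    if not pairs:
        return 0.0
    return max(abs(a - b) for a, b in pairs)


outputs_before = [1.0, float("nan"), 3.0]
outputs_after = [1.0, 2.0, 3.000001]
max_diff = max_abs_diff_ignoring_nan(outputs_before, outputs_after)
```

The result can then be checked against the same `1e-5` tolerance the test uses.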
--- a/transformers/tests/modeling_ctrl_test.py
+++ b/transformers/tests/modeling_ctrl_test.py
@@ -0,0 +1,215 @@
+# coding=utf-8
+# Copyright 2018 Salesforce and HuggingFace Inc. team.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import unittest
+import pytest
+import shutil
+import pdb
+
+from transformers import is_torch_available
+
+if is_torch_available():
+    from transformers import (CTRLConfig, CTRLModel, CTRL_PRETRAINED_MODEL_ARCHIVE_MAP,
+                                    CTRLLMHeadModel)
+else:
+    pytestmark = pytest.mark.skip("Require Torch")
+
+from .modeling_common_test import (CommonTestCases, ids_tensor)
+from .configuration_common_test import ConfigTester
+
+
+class CTRLModelTest(CommonTestCases.CommonModelTester):
+
+    all_model_classes = (CTRLModel, CTRLLMHeadModel) if is_torch_available() else ()
+    test_pruning = False
+    test_torchscript = False
+    test_resize_embeddings = False
+    test_head_masking = False
+
+    class CTRLModelTester(object):
+
+        def __init__(self,
+                     parent,
+                     batch_size=13,
+                     seq_length=7,
+                     is_training=True,
+                     use_token_type_ids=True,
+                     use_input_mask=True,
+                     use_labels=True,
+                     use_mc_token_ids=True,
+                     vocab_size=99,
+                     hidden_size=32,
+                     num_hidden_layers=5,
+                     num_attention_heads=4,
+                     intermediate_size=37,
+                     hidden_act="gelu",
+                     hidden_dropout_prob=0.1,
+                     attention_probs_dropout_prob=0.1,
+                     max_position_embeddings=512,
+                     type_vocab_size=16,
+                     type_sequence_label_size=2,
+                     initializer_range=0.02,
+                     num_labels=3,
+                     num_choices=4,
+                     scope=None,
+                     ):
+            self.parent = parent
+            self.batch_size = batch_size
+            self.seq_length = seq_length
+            self.is_training = is_training
+            self.use_token_type_ids = use_token_type_ids
+            self.use_input_mask = use_input_mask
+            self.use_labels = use_labels
+            self.use_mc_token_ids = use_mc_token_ids
+            self.vocab_size = vocab_size
+            self.hidden_size = hidden_size
+            self.num_hidden_layers = num_hidden_layers
+            self.num_attention_heads = num_attention_heads
+            self.intermediate_size = intermediate_size
+            self.hidden_act = hidden_act
+            self.hidden_dropout_prob = hidden_dropout_prob
+            self.attention_probs_dropout_prob = attention_probs_dropout_prob
+            self.max_position_embeddings = max_position_embeddings
+            self.type_vocab_size = type_vocab_size
+            self.type_sequence_label_size = type_sequence_label_size
+            self.initializer_range = initializer_range
+            self.num_labels = num_labels
+            self.num_choices = num_choices
+            self.scope = scope
+
+        def prepare_config_and_inputs(self):
+            input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
+
+            input_mask = None
+            if self.use_input_mask:
+                input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2)
+
+            token_type_ids = None
+            if self.use_token_type_ids:
+                token_type_ids = ids_tensor([self.batch_size, self.seq_length], self.type_vocab_size)
+
+            mc_token_ids = None
+            if self.use_mc_token_ids:
+                mc_token_ids = ids_tensor([self.batch_size, self.num_choices], self.seq_length)
+
+            sequence_labels = None
+            token_labels = None
+            choice_labels = None
+            if self.use_labels:
+                sequence_labels = ids_tensor([self.batch_size], self.type_sequence_label_size)
+                token_labels = ids_tensor([self.batch_size, self.seq_length], self.num_labels)
+                choice_labels = ids_tensor([self.batch_size], self.num_choices)
+
+            config = CTRLConfig(
+                vocab_size_or_config_json_file=self.vocab_size,
+                n_embd=self.hidden_size,
+                n_layer=self.num_hidden_layers,
+                n_head=self.num_attention_heads,
+                # intermediate_size=self.intermediate_size,
+                # hidden_act=self.hidden_act,
+                # hidden_dropout_prob=self.hidden_dropout_prob,
+                # attention_probs_dropout_prob=self.attention_probs_dropout_prob,
+                n_positions=self.max_position_embeddings,
+                n_ctx=self.max_position_embeddings
+                # type_vocab_size=self.type_vocab_size,
+                # initializer_range=self.initializer_range
+            )
+
+            head_mask = ids_tensor([self.num_hidden_layers, self.num_attention_heads], 2)
+
+            return config, input_ids, input_mask, head_mask, token_type_ids, mc_token_ids, sequence_labels, token_labels, choice_labels
+
+        def check_loss_output(self, result):
+            self.parent.assertListEqual(
+                list(result["loss"].size()),
+                [])
+
+        def create_and_check_ctrl_model(self, config, input_ids, input_mask, head_mask, token_type_ids, *args):
+            model = CTRLModel(config=config)
+            model.eval()
+
+            model(input_ids, token_type_ids=token_type_ids, head_mask=head_mask)
+            model(input_ids, token_type_ids=token_type_ids)
+            sequence_output, presents = model(input_ids)
+
+            result = {
+                "sequence_output": sequence_output,
+                "presents": presents,
+            }
+            self.parent.assertListEqual(
+                list(result["sequence_output"].size()),
+                [self.batch_size, self.seq_length, self.hidden_size])
+            self.parent.assertEqual(len(result["presents"]), config.n_layer)
+
+        def create_and_check_lm_head_model(self, config, input_ids, input_mask, head_mask, token_type_ids, *args):
+            model = CTRLLMHeadModel(config)
+            model.eval()
+
+            loss, lm_logits, _ = model(input_ids, token_type_ids=token_type_ids, labels=input_ids)
+
+            result = {
+                "loss": loss,
+                "lm_logits": lm_logits
+            }
+            self.parent.assertListEqual(
+                list(result["loss"].size()),
+                [])
+            self.parent.assertListEqual(
+                list(result["lm_logits"].size()),
+                [self.batch_size, self.seq_length, self.vocab_size])
+
+
+        def prepare_config_and_inputs_for_common(self):
+            config_and_inputs = self.prepare_config_and_inputs()
+
+            (config, input_ids, input_mask, head_mask, token_type_ids,
+             mc_token_ids, sequence_labels, token_labels, choice_labels) = config_and_inputs
+
+            inputs_dict = {
+                'input_ids': input_ids,
+                'token_type_ids': token_type_ids,
+                'head_mask': head_mask
+            }
+
+            return config, inputs_dict
+
+    def setUp(self):
+        self.model_tester = CTRLModelTest.CTRLModelTester(self)
+        self.config_tester = ConfigTester(self, config_class=CTRLConfig, n_embd=37)
+
+    def test_config(self):
+        self.config_tester.run_common_tests()
+
+    def test_ctrl_model(self):
+        config_and_inputs = self.model_tester.prepare_config_and_inputs()
+        self.model_tester.create_and_check_ctrl_model(*config_and_inputs)
+
+    def test_ctrl_lm_head_model(self):
+        config_and_inputs = self.model_tester.prepare_config_and_inputs()
+        self.model_tester.create_and_check_lm_head_model(*config_and_inputs)
+
+    @pytest.mark.slow
+    def test_model_from_pretrained(self):
+        cache_dir = "/tmp/transformers_test/"
+        for model_name in list(CTRL_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]:
+            model = CTRLModel.from_pretrained(model_name, cache_dir=cache_dir)
+            shutil.rmtree(cache_dir)
+            self.assertIsNotNone(model)
+
+
+if __name__ == "__main__":
+    unittest.main()
--- a/transformers/tests/modeling_tf_common_test.py
+++ b/transformers/tests/modeling_tf_common_test.py
@@ -22,6 +22,7 @@ import random
 import shutil
 import unittest
 import uuid
+import tempfile

 import pytest
 import sys
@@ -36,6 +37,20 @@ if is_tf_available():
 else:
    pytestmark = pytest.mark.skip("Require TensorFlow")

+if sys.version_info[0] == 2:
+    import cPickle as pickle
+
+    class TemporaryDirectory(object):
+        """Context manager for tempfile.mkdtemp() so it's usable with "with" statement."""
+        def __enter__(self):
+            self.name = tempfile.mkdtemp()
+            return self.name
+        def __exit__(self, exc_type, exc_value, traceback):
+            shutil.rmtree(self.name)
+else:
+    import pickle
+    TemporaryDirectory = tempfile.TemporaryDirectory
+    unicode = str

 def _config_zero_init(config):
    configs_no_init = copy.deepcopy(config)
@@ -66,11 +81,31 @@ class TFCommonTestCases:
            #             self.assertIn(param.data.mean().item(), [0.0, 1.0],
            #             msg="Parameter {} of model {} seems not properly initialized".format(name, model_class))

+        def test_save_load(self):
+            config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
+
+            for model_class in self.all_model_classes:
+                model = model_class(config)
+                outputs = model(inputs_dict)
+
+                with TemporaryDirectory() as tmpdirname:
+                    model.save_pretrained(tmpdirname)
+                    model = model_class.from_pretrained(tmpdirname)
+                    after_outputs = model(inputs_dict)
+
+                    # Make sure we don't have nans
+                    out_1 = after_outputs[0].numpy()
+                    out_2 = outputs[0].numpy()
+                    out_1 = out_1[~np.isnan(out_1)]
+                    out_2 = out_2[~np.isnan(out_2)]
+                    max_diff = np.amax(np.abs(out_1 - out_2))
+                    self.assertLessEqual(max_diff, 1e-5)

        def test_pt_tf_model_equivalence(self):
            if not is_torch_available():
                return

+            import torch
            import transformers

            config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
@@ -79,12 +114,51 @@ class TFCommonTestCases:
                 pt_model_class_name = model_class.__name__[2:]  # Skip the "TF" at the beginning
                pt_model_class = getattr(transformers, pt_model_class_name)

+                config.output_hidden_states = True
                tf_model = model_class(config)
                pt_model = pt_model_class(config)

+                # Check we can load pt model in tf and vice-versa (architecture similar)
                tf_model = transformers.load_pytorch_model_in_tf2_model(tf_model, pt_model, tf_inputs=inputs_dict)
                pt_model = transformers.load_tf2_model_in_pytorch_model(pt_model, tf_model)

+                # Check predictions on first output (logits/hidden-states) are close enough given low-level computational differences
+                pt_model.eval()
+                pt_inputs_dict = dict((name, torch.from_numpy(key.numpy()).to(torch.long))
+                                      for name, key in inputs_dict.items())
+                with torch.no_grad():
+                    pto = pt_model(**pt_inputs_dict)
+                tfo = tf_model(inputs_dict)
+                max_diff = np.amax(np.abs(tfo[0].numpy() - pto[0].numpy()))
+                self.assertLessEqual(max_diff, 2e-2)
+
+        def test_compile_tf_model(self):
+            config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
+
+            input_ids = tf.keras.Input(batch_shape=(2, 2000), name='input_ids', dtype='int32')
+            optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
+            loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
+            metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
+
+            for model_class in self.all_model_classes:
+                # Prepare our model
+                model = model_class(config)
+
+                # Let's load it from the disk to be sure we can use pretrained weights
+                with TemporaryDirectory() as tmpdirname:
+                    outputs = model(inputs_dict)  # build the model
+                    model.save_pretrained(tmpdirname)
+                    model = model_class.from_pretrained(tmpdirname)
+
+                outputs_dict = model(input_ids)
+                hidden_states = outputs_dict[0]
+
+                # Add a dense layer on top to test integration with other keras modules
+                outputs = tf.keras.layers.Dense(2, activation='softmax', name='outputs')(hidden_states)
+
+                # Compile extended model
+                extended_model = tf.keras.Model(inputs=[input_ids], outputs=[outputs])
+                extended_model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

        def test_keyword_and_dict_args(self):
            config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
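Note on the `TemporaryDirectory` shim added to this file: the sketch below shows, using only the standard library, how a `mkdtemp`-backed context manager hands out a scratch directory and removes it on exit. The `roundtrip` helper and the `weights.txt` file name are hypothetical and stand in for the `save_pretrained` / `from_pretrained` pair used by the tests; nothing here calls the transformers API.

```python
import os
import shutil
import tempfile


class TemporaryDirectory(object):
    """Minimal stand-in for tempfile.TemporaryDirectory (illustrative only)."""

    def __enter__(self):
        # Create the directory and hand its path to the `with` body.
        self.name = tempfile.mkdtemp()
        return self.name

    def __exit__(self, exc_type, exc_value, traceback):
        # Remove the directory and everything in it, even if the body raised.
        shutil.rmtree(self.name)


def roundtrip(payload):
    """Hypothetical helper: write `payload` to a scratch dir and read it back."""
    with TemporaryDirectory() as tmpdirname:
        path = os.path.join(tmpdirname, "weights.txt")
        with open(path, "w") as fp:
            fp.write(payload)
        with open(path) as fp:
            restored = fp.read()
    # After the `with` block the directory should no longer exist.
    return restored, os.path.exists(tmpdirname)


restored, still_there = roundtrip("abc")
```

Because cleanup happens in `__exit__`, each model class in the loop gets a fresh directory and nothing is left behind between iterations.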
--- a/transformers/tests/modeling_tf_ctrl_test.py
+++ b/transformers/tests/modeling_tf_ctrl_test.py
@@ -0,0 +1,201 @@
+# coding=utf-8
+# Copyright 2018 The Google AI Language Team Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import unittest
+import shutil
+import pytest
+import sys
+
+from .modeling_tf_common_test import (TFCommonTestCases, ids_tensor)
+from .configuration_common_test import ConfigTester
+
+from transformers import CTRLConfig, is_tf_available
+
+if is_tf_available():
+    import tensorflow as tf
+    from transformers.modeling_tf_ctrl import (TFCTRLModel, TFCTRLLMHeadModel,
+                                                TF_CTRL_PRETRAINED_MODEL_ARCHIVE_MAP)
+else:
+    pytestmark = pytest.mark.skip("Require TensorFlow")
+
+
+class TFCTRLModelTest(TFCommonTestCases.TFCommonModelTester):
+
+    all_model_classes = (TFCTRLModel, TFCTRLLMHeadModel) if is_tf_available() else ()
+
+    class TFCTRLModelTester(object):
+
+        def __init__(self,
+                     parent,
+                     batch_size=13,
+                     seq_length=7,
+                     is_training=True,
+                     use_token_type_ids=True,
+                     use_input_mask=True,
+                     use_labels=True,
+                     use_mc_token_ids=True,
+                     vocab_size=99,
+                     hidden_size=32,
+                     num_hidden_layers=5,
+                     num_attention_heads=4,
+                     intermediate_size=37,
+                     hidden_act="gelu",
+                     hidden_dropout_prob=0.1,
+                     attention_probs_dropout_prob=0.1,
+                     max_position_embeddings=512,
+                     type_vocab_size=16,
+                     type_sequence_label_size=2,
+                     initializer_range=0.02,
+                     num_labels=3,
+                     num_choices=4,
+                     scope=None,
+                     ):
+            self.parent = parent
+            self.batch_size = batch_size
+            self.seq_length = seq_length
+            self.is_training = is_training
+            self.use_token_type_ids = use_token_type_ids
+            self.use_input_mask = use_input_mask
+            self.use_labels = use_labels
+            self.use_mc_token_ids = use_mc_token_ids
+            self.vocab_size = vocab_size
+            self.hidden_size = hidden_size
+            self.num_hidden_layers = num_hidden_layers
+            self.num_attention_heads = num_attention_heads
+            self.intermediate_size = intermediate_size
+            self.hidden_act = hidden_act
+            self.hidden_dropout_prob = hidden_dropout_prob
+            self.attention_probs_dropout_prob = attention_probs_dropout_prob
+            self.max_position_embeddings = max_position_embeddings
+            self.type_vocab_size = type_vocab_size
+            self.type_sequence_label_size = type_sequence_label_size
+            self.initializer_range = initializer_range
+            self.num_labels = num_labels
+            self.num_choices = num_choices
+            self.scope = scope
+
+        def prepare_config_and_inputs(self):
+            input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
+
+            input_mask = None
+            if self.use_input_mask:
+                input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2)
+
+            token_type_ids = None
+            if self.use_token_type_ids:
+                token_type_ids = ids_tensor([self.batch_size, self.seq_length], self.type_vocab_size)
+
+            mc_token_ids = None
+            if self.use_mc_token_ids:
+                mc_token_ids = ids_tensor([self.batch_size, self.num_choices], self.seq_length)
+
+            sequence_labels = None
+            token_labels = None
+            choice_labels = None
+            if self.use_labels:
+                sequence_labels = ids_tensor([self.batch_size], self.type_sequence_label_size)
+                token_labels = ids_tensor([self.batch_size, self.seq_length], self.num_labels)
+                choice_labels = ids_tensor([self.batch_size], self.num_choices)
+
+            config = CTRLConfig(
+                vocab_size_or_config_json_file=self.vocab_size,
+                n_embd=self.hidden_size,
+                n_layer=self.num_hidden_layers,
+                n_head=self.num_attention_heads,
+                # intermediate_size=self.intermediate_size,
+                # hidden_act=self.hidden_act,
+                # hidden_dropout_prob=self.hidden_dropout_prob,
+                # attention_probs_dropout_prob=self.attention_probs_dropout_prob,
+                n_positions=self.max_position_embeddings,
+                n_ctx=self.max_position_embeddings
+                # type_vocab_size=self.type_vocab_size,
+                # initializer_range=self.initializer_range
+            )
+
+            head_mask = ids_tensor([self.num_hidden_layers, self.num_attention_heads], 2)
+
+            return config, input_ids, input_mask, head_mask, token_type_ids, mc_token_ids, sequence_labels, token_labels, choice_labels
+
+        def create_and_check_ctrl_model(self, config, input_ids, input_mask, head_mask, token_type_ids, *args):
+            model = TFCTRLModel(config=config)
+            inputs = {'input_ids': input_ids,
+                      'attention_mask': input_mask,
+                      'token_type_ids': token_type_ids}
+            sequence_output = model(inputs)[0]
+
+            inputs = [input_ids, None, input_mask]  # None is the input for 'past'
+            sequence_output = model(inputs)[0]
+
+            sequence_output = model(input_ids)[0]
+
+            result = {
+                "sequence_output": sequence_output.numpy(),
+            }
+            self.parent.assertListEqual(
+                list(result["sequence_output"].shape),
+                [self.batch_size, self.seq_length, self.hidden_size])
+
+
+        def create_and_check_ctrl_lm_head(self, config, input_ids, input_mask, head_mask, token_type_ids, *args):
+            model = TFCTRLLMHeadModel(config=config)
+            inputs = {'input_ids': input_ids,
+                      'attention_mask': input_mask,
+                      'token_type_ids': token_type_ids}
+            prediction_scores = model(inputs)[0]
+            result = {
+                "prediction_scores": prediction_scores.numpy(),
+            }
+            self.parent.assertListEqual(
+                list(result["prediction_scores"].shape),
+                [self.batch_size, self.seq_length, self.vocab_size])
+
+        def prepare_config_and_inputs_for_common(self):
+            config_and_inputs = self.prepare_config_and_inputs()
+
+            (config, input_ids, input_mask, head_mask, token_type_ids,
+             mc_token_ids, sequence_labels, token_labels, choice_labels) = config_and_inputs
+
+            inputs_dict = {'input_ids': input_ids, 'token_type_ids': token_type_ids, 'attention_mask': input_mask}
+            return config, inputs_dict
+
+    def setUp(self):
+        self.model_tester = TFCTRLModelTest.TFCTRLModelTester(self)
+        self.config_tester = ConfigTester(self, config_class=CTRLConfig, n_embd=37)
+
+    def test_config(self):
+        self.config_tester.run_common_tests()
+
+    def test_ctrl_model(self):
+        config_and_inputs = self.model_tester.prepare_config_and_inputs()
+        self.model_tester.create_and_check_ctrl_model(*config_and_inputs)
+
+    def test_ctrl_lm_head(self):
+        config_and_inputs = self.model_tester.prepare_config_and_inputs()
+        self.model_tester.create_and_check_ctrl_lm_head(*config_and_inputs)
+
+    @pytest.mark.slow
+    def test_model_from_pretrained(self):
+        cache_dir = "/tmp/transformers_test/"
+        for model_name in list(TF_CTRL_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]:
+            model = TFCTRLModel.from_pretrained(model_name, cache_dir=cache_dir)
+            shutil.rmtree(cache_dir)
+            self.assertIsNotNone(model)
+
+if __name__ == "__main__":
+    unittest.main()
+
--- a/transformers/tests/modeling_tf_gpt2_test.py
+++ b/transformers/tests/modeling_tf_gpt2_test.py
@@ -222,7 +222,7 @@ class TFGPT2ModelTest(TFCommonTestCases.TFCommonModelTester):
    @pytest.mark.slow
    def test_model_from_pretrained(self):
        cache_dir = "/tmp/transformers_test/"
-        for model_name in list(TF_gpt2_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]:
+        for model_name in list(TF_GPT2_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]:
            model = TFGPT2Model.from_pretrained(model_name, cache_dir=cache_dir)
            shutil.rmtree(cache_dir)
            self.assertIsNotNone(model)
--- a/transformers/tests/modeling_tf_xlnet_test.py
+++ b/transformers/tests/modeling_tf_xlnet_test.py
@@ -161,6 +161,11 @@ class TFXLNetModelTest(TFCommonTestCases.TFCommonModelTester):
                "outputs": outputs.numpy(),
            }

+            config.mem_len = 0
+            model = TFXLNetModel(config)
+            no_mems_outputs = model(inputs)
+            self.parent.assertEqual(len(no_mems_outputs), 1)
+
            self.parent.assertListEqual(
                list(result["outputs"].shape),
                [self.batch_size, self.seq_length, self.hidden_size])
--- a/transformers/tests/modeling_xlnet_test.py
+++ b/transformers/tests/modeling_xlnet_test.py
@@ -150,6 +150,12 @@ class XLNetModelTest(CommonTestCases.CommonModelTester):
                "outputs": outputs,
            }

+            config.mem_len = 0
+            model = XLNetModel(config)
+            model.eval()
+            no_mems_outputs = model(input_ids_1)
+            self.parent.assertEqual(len(no_mems_outputs), 1)
+
            self.parent.assertListEqual(
                list(result["outputs"].size()),
                [self.batch_size, self.seq_length, self.hidden_size])
--- a/transformers/tests/tokenization_bert_test.py
+++ b/transformers/tests/tokenization_bert_test.py
@@ -131,8 +131,8 @@ class BertTokenizationTest(CommonTestCases.CommonTokenizerTester):
        text = tokenizer.encode("sequence builders")
        text_2 = tokenizer.encode("multi-sequence build")

-        encoded_sentence = tokenizer.add_special_tokens_single_sequence(text)
-        encoded_pair = tokenizer.add_special_tokens_sequence_pair(text, text_2)
+        encoded_sentence = tokenizer.build_inputs_with_special_tokens(text)
+        encoded_pair = tokenizer.build_inputs_with_special_tokens(text, text_2)

        assert encoded_sentence == [101] + text + [102]
        assert encoded_pair == [101] + text + [102] + text_2 + [102]
--- a/transformers/tests/tokenization_ctrl_test.py
+++ b/transformers/tests/tokenization_ctrl_test.py
@@ -0,0 +1,69 @@
+# coding=utf-8
+# Copyright 2018 Salesforce and HuggingFace Inc. team.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import absolute_import, division, print_function, unicode_literals
+
+import os
+import unittest
+import json
+from io import open
+
+from transformers.tokenization_ctrl import CTRLTokenizer, VOCAB_FILES_NAMES
+
+from .tokenization_tests_commons import CommonTestCases
+
+class CTRLTokenizationTest(CommonTestCases.CommonTokenizerTester):
+
+    tokenizer_class = CTRLTokenizer
+
+    def setUp(self):
+        super(CTRLTokenizationTest, self).setUp()
+
+        # Adapted from Sennrich et al. 2015 and https://github.com/rsennrich/subword-nmt
+        vocab = ['adapt', 're@@', 'a@@', 'apt', 'c@@', 't', '<unk>']
+        vocab_tokens = dict(zip(vocab, range(len(vocab))))
+        merges = ["#version: 0.2", 'a p', 'ap t</w>', 'r e', 'a d', 'ad apt</w>', '']
+        self.special_tokens_map = {"unk_token": "<unk>"}
+
+        self.vocab_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES['vocab_file'])
+        self.merges_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES['merges_file'])
+        with open(self.vocab_file, "w", encoding="utf-8") as fp:
+            fp.write(json.dumps(vocab_tokens) + "\n")
+        with open(self.merges_file, "w", encoding="utf-8") as fp:
+            fp.write("\n".join(merges))
+
+    def get_tokenizer(self, **kwargs):
+        kwargs.update(self.special_tokens_map)
+        return CTRLTokenizer.from_pretrained(self.tmpdirname, **kwargs)
+
+    def get_input_output_texts(self):
+        input_text = u"adapt react readapt apt"
+        output_text = u"adapt react readapt apt"
+        return input_text, output_text
+
+    def test_full_tokenizer(self):
+        tokenizer = CTRLTokenizer(self.vocab_file, self.merges_file, **self.special_tokens_map)
+        text = "adapt react readapt apt"
+        bpe_tokens = 'adapt re@@ a@@ c@@ t re@@ adapt apt'.split()
+        tokens = tokenizer.tokenize(text)
+        self.assertListEqual(tokens, bpe_tokens)
+
+        input_tokens = tokens + [tokenizer.unk_token]
+
+        input_bpe_tokens = [0, 1, 2, 4, 5, 1, 0, 3, 6]
+        self.assertListEqual(
+            tokenizer.convert_tokens_to_ids(input_tokens), input_bpe_tokens)
+
+
+if __name__ == '__main__':
+    unittest.main()
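Note on `test_full_tokenizer` above: the expected ids follow directly from each token's position in the toy vocabulary. The sketch below reproduces that mapping with a plain dictionary; `lookup_ids` is a hypothetical helper written for illustration and is not the tokenizer's own `convert_tokens_to_ids`.

```python
# Toy vocabulary copied from the test's setUp; the id is the list position.
vocab = ['adapt', 're@@', 'a@@', 'apt', 'c@@', 't', '<unk>']
token_to_id = dict(zip(vocab, range(len(vocab))))


def lookup_ids(tokens, unk_token='<unk>'):
    """Hypothetical helper: map tokens to ids, falling back to the unknown id."""
    unk_id = token_to_id[unk_token]
    return [token_to_id.get(tok, unk_id) for tok in tokens]


# BPE pieces from the test, followed by the unknown token.
bpe_tokens = 'adapt re@@ a@@ c@@ t re@@ adapt apt'.split()
ids = lookup_ids(bpe_tokens + ['<unk>'])
print(ids)
```

The `@@` suffix marks a piece that continues into the next token, so `re@@ a@@ c@@ t` spells "react".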
--- a/transformers/tests/tokenization_distilbert_test.py
+++ b/transformers/tests/tokenization_distilbert_test.py
@@ -36,8 +36,8 @@ class DistilBertTokenizationTest(BertTokenizationTest):
        text = tokenizer.encode("sequence builders")
        text_2 = tokenizer.encode("multi-sequence build")

-        encoded_sentence = tokenizer.add_special_tokens_single_sequence(text)
-        encoded_pair = tokenizer.add_special_tokens_sequence_pair(text, text_2)
+        encoded_sentence = tokenizer.build_inputs_with_special_tokens(text)
+        encoded_pair = tokenizer.build_inputs_with_special_tokens(text, text_2)

        assert encoded_sentence == [tokenizer.cls_token_id] + text + [tokenizer.sep_token_id]
        assert encoded_pair == [tokenizer.cls_token_id] + text + [tokenizer.sep_token_id] + \
--- a/transformers/tests/tokenization_roberta_test.py
+++ b/transformers/tests/tokenization_roberta_test.py
@@ -87,8 +87,8 @@ class RobertaTokenizationTest(CommonTestCases.CommonTokenizerTester):
        encoded_text_from_decode = tokenizer.encode("sequence builders", add_special_tokens=True)
        encoded_pair_from_decode = tokenizer.encode("sequence builders", "multi-sequence build", add_special_tokens=True)

-        encoded_sentence = tokenizer.add_special_tokens_single_sequence(text)
-        encoded_pair = tokenizer.add_special_tokens_sequence_pair(text, text_2)
+        encoded_sentence = tokenizer.build_inputs_with_special_tokens(text)
+        encoded_pair = tokenizer.build_inputs_with_special_tokens(text, text_2)

        assert encoded_sentence == encoded_text_from_decode
        assert encoded_pair == encoded_pair_from_decode
--- a/transformers/tests/tokenization_tests_commons.py
+++ b/transformers/tests/tokenization_tests_commons.py
@@ -193,12 +193,12 @@ class CommonTestCases:

            tokenizer = self.get_tokenizer()

-            if tokenizer.add_special_tokens_sequence_pair.__qualname__.split('.')[0] != "PreTrainedTokenizer":
+            if tokenizer.build_inputs_with_special_tokens.__qualname__.split('.')[0] != "PreTrainedTokenizer":
                seq_0 = "Test this method."
                seq_1 = "With these inputs."
                information = tokenizer.encode_plus(seq_0, seq_1, add_special_tokens=True)
                sequences, mask = information["input_ids"], information["token_type_ids"]
-                assert len(sequences) == len(mask)
+                self.assertEqual(len(sequences), len(mask))

        def test_number_of_added_tokens(self):
            tokenizer = self.get_tokenizer()
@@ -211,7 +211,7 @@ class CommonTestCases:

            # Method is implemented (e.g. not GPT-2)
            if len(attached_sequences) != 2:
-                assert tokenizer.num_added_tokens(pair=True) == len(attached_sequences) - len(sequences)
+                self.assertEqual(tokenizer.num_added_tokens(pair=True), len(attached_sequences) - len(sequences))

        def test_maximum_encoding_length_single_input(self):
            tokenizer = self.get_tokenizer()
@@ -227,10 +227,10 @@ class CommonTestCases:
            truncated_sequence = information["input_ids"]
            overflowing_tokens = information["overflowing_tokens"]

-            assert len(overflowing_tokens) == 2 + stride
-            assert overflowing_tokens == sequence[-(2 + stride):]
-            assert len(truncated_sequence) == total_length - 2
-            assert truncated_sequence == tokenizer.add_special_tokens_single_sequence(sequence[:-2])
+            self.assertEqual(len(overflowing_tokens), 2 + stride)
+            self.assertEqual(overflowing_tokens, sequence[-(2 + stride):])
+            self.assertEqual(len(truncated_sequence), total_length - 2)
+            self.assertEqual(truncated_sequence, tokenizer.build_inputs_with_special_tokens(sequence[:-2]))

        def test_maximum_encoding_length_pair_input(self):
            tokenizer = self.get_tokenizer()
@@ -243,26 +243,26 @@ class CommonTestCases:
            sequence_1_no_special_tokens = tokenizer.encode(seq_1)

            sequence = tokenizer.encode(seq_0, seq_1, add_special_tokens=True)
-            truncated_second_sequence = tokenizer.add_special_tokens_sequence_pair(
+            truncated_second_sequence = tokenizer.build_inputs_with_special_tokens(
                tokenizer.encode(seq_0),
                tokenizer.encode(seq_1)[:-2]
            )

            information = tokenizer.encode_plus(seq_0, seq_1, max_length=len(sequence) - 2, add_special_tokens=True,
-                                                stride=stride, truncate_first_sequence=False)
+                                                stride=stride, truncation_strategy='only_second')
            information_first_truncated = tokenizer.encode_plus(seq_0, seq_1, max_length=len(sequence) - 2,
                                                                add_special_tokens=True, stride=stride,
-                                                                truncate_first_sequence=True)
+                                                                truncation_strategy='only_first')

            truncated_sequence = information["input_ids"]
            overflowing_tokens = information["overflowing_tokens"]
            overflowing_tokens_first_truncated = information_first_truncated["overflowing_tokens"]

-            assert len(overflowing_tokens) == 2 + stride
-            assert overflowing_tokens == sequence_1_no_special_tokens[-(2 + stride):]
-            assert overflowing_tokens_first_truncated == sequence_0_no_special_tokens[-(2 + stride):]
-            assert len(truncated_sequence) == len(sequence) - 2
-            assert truncated_sequence == truncated_second_sequence
+            self.assertEqual(len(overflowing_tokens), 2 + stride)
+            self.assertEqual(overflowing_tokens, sequence_1_no_special_tokens[-(2 + stride):])
+            self.assertEqual(overflowing_tokens_first_truncated, sequence_0_no_special_tokens[-(2 + stride):])
+            self.assertEqual(len(truncated_sequence), len(sequence) - 2)
+            self.assertEqual(truncated_sequence, truncated_second_sequence)

        def test_encode_input_type(self):
            tokenizer = self.get_tokenizer()
@@ -273,5 +273,43 @@ class CommonTestCases:
            input_ids = tokenizer.convert_tokens_to_ids(tokens)
            formatted_input = tokenizer.encode(sequence, add_special_tokens=True)

-            assert tokenizer.encode(tokens, add_special_tokens=True) == formatted_input
-            assert tokenizer.encode(input_ids, add_special_tokens=True) == formatted_input
+            self.assertEqual(tokenizer.encode(tokens, add_special_tokens=True), formatted_input)
+            self.assertEqual(tokenizer.encode(input_ids, add_special_tokens=True), formatted_input)
+
+        def test_special_tokens_mask(self):
+            tokenizer = self.get_tokenizer()
+
+            sequence_0 = "Encode this."
+            sequence_1 = "This one too please."
+
+            # Testing single inputs
+            encoded_sequence = tokenizer.encode(sequence_0)
+            encoded_sequence_dict = tokenizer.encode_plus(sequence_0, add_special_tokens=True)
+            encoded_sequence_w_special = encoded_sequence_dict["input_ids"]
+            special_tokens_mask = encoded_sequence_dict["special_tokens_mask"]
+            self.assertEqual(len(special_tokens_mask), len(encoded_sequence_w_special))
+
+            filtered_sequence = [(x if not special_tokens_mask[i] else None) for i, x in enumerate(encoded_sequence_w_special)]
+            filtered_sequence = [x for x in filtered_sequence if x is not None]
+            self.assertEqual(encoded_sequence, filtered_sequence)
+
+            # Testing inputs pairs
+            encoded_sequence = tokenizer.encode(sequence_0) + tokenizer.encode(sequence_1)
+            encoded_sequence_dict = tokenizer.encode_plus(sequence_0, sequence_1, add_special_tokens=True)
+            encoded_sequence_w_special = encoded_sequence_dict["input_ids"]
+            special_tokens_mask = encoded_sequence_dict["special_tokens_mask"]
+            self.assertEqual(len(special_tokens_mask), len(encoded_sequence_w_special))
+
+            filtered_sequence = [(x if not special_tokens_mask[i] else None) for i, x in enumerate(encoded_sequence_w_special)]
+            filtered_sequence = [x for x in filtered_sequence if x is not None]
+            self.assertEqual(encoded_sequence, filtered_sequence)
+
+            # Testing with already existing special tokens
+            if tokenizer.cls_token_id == tokenizer.unk_token_id and tokenizer.sep_token_id == tokenizer.unk_token_id:
+                tokenizer.add_special_tokens({'cls_token': '</s>', 'sep_token': '<s>'})
+            encoded_sequence_dict = tokenizer.encode_plus(sequence_0, add_special_tokens=True)
+            encoded_sequence_w_special = encoded_sequence_dict["input_ids"]
+            special_tokens_mask_orig = encoded_sequence_dict["special_tokens_mask"]
+            special_tokens_mask = tokenizer.get_special_tokens_mask(encoded_sequence_w_special, already_has_special_tokens=True)
+            self.assertEqual(len(special_tokens_mask), len(encoded_sequence_w_special))
+            self.assertEqual(special_tokens_mask_orig, special_tokens_mask)
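Note on `test_special_tokens_mask` above: the test checks that dropping every position flagged by the mask recovers the encoding without special tokens. A minimal sketch of that filtering step follows; the ids and mask are made-up values and `strip_special_tokens` is a hypothetical helper, not part of the tokenizer API.

```python
def strip_special_tokens(input_ids, special_tokens_mask):
    """Hypothetical helper: keep only the ids whose mask entry is 0."""
    return [tok for tok, is_special in zip(input_ids, special_tokens_mask)
            if not is_special]


# Made-up example: 101 and 102 play the role of the special tokens.
input_ids = [101, 7, 8, 9, 102]
special_tokens_mask = [1, 0, 0, 0, 1]

plain_ids = strip_special_tokens(input_ids, special_tokens_mask)
print(plain_ids)
```

Filtering in one pass with `zip` is equivalent to the two-step version in the test, which first substitutes `None` for special positions and then drops the `None` entries.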
--- a/transformers/tests/tokenization_xlm_test.py
+++ b/transformers/tests/tokenization_xlm_test.py
@@ -72,8 +72,8 @@ class XLMTokenizationTest(CommonTestCases.CommonTokenizerTester):
        text = tokenizer.encode("sequence builders")
        text_2 = tokenizer.encode("multi-sequence build")

-        encoded_sentence = tokenizer.add_special_tokens_single_sequence(text)
-        encoded_pair = tokenizer.add_special_tokens_sequence_pair(text, text_2)
+        encoded_sentence = tokenizer.build_inputs_with_special_tokens(text)
+        encoded_pair = tokenizer.build_inputs_with_special_tokens(text, text_2)

        assert encoded_sentence == [1] + text + [1]
        assert encoded_pair == [1] + text + [1] + text_2 + [1]
Author	SHA1	Message	Date
Lysandre	3ddce1d74c	Release: 2.1.1	2019-10-11 06:37:49 -04:00
Thomas Wolf	4428aefc63	Merge pull request #1488 from huggingface/pytorch-tpu GLUE on TPU	2019-10-11 16:33:00 +02:00
Thomas Wolf	3b43b01872	Merge pull request #1482 from huggingface/tf2_integration_tests Integration of TF 2.0 models with other Keras modules	2019-10-11 16:25:43 +02:00
thomwolf	4b8f3e8f32	adding citation	2019-10-11 16:18:16 +02:00
thomwolf	18a3cef7d5	no nans	2019-10-11 16:09:42 +02:00
thomwolf	1f5d9513d8	fix test	2019-10-11 15:55:01 +02:00
thomwolf	0f9fc4fbde	adding option to desactivate past/memory outputs	2019-10-11 15:47:08 +02:00
Thomas Wolf	700331b5ec	Merge pull request #1492 from stefan-it/bert-german-dbmdz-models Add new BERT models for German (cased and uncased)	2019-10-11 13:01:52 +02:00
Thomas Wolf	573dde9b44	Merge pull request #1405 from slayton58/xlnet_layer_reorder Re-order XLNet attention head outputs for better perf	2019-10-11 12:10:58 +02:00
Stefan Schweter	5f25a5f367	model: add support for new German BERT models (cased and uncased) from @dbmdz	2019-10-11 10:20:33 +02:00
Luran He	f382a8decd	convert int to str before adding to a str	2019-10-10 19:20:39 -04:00
Lysandre	639f4b7190	Don't save/load when on TPU	2019-10-10 19:17:25 +00:00
Lysandre	d4e7934ac3	GLUE on TPU	2019-10-10 19:03:06 +00:00
thomwolf	2a4fef837a	move Circle-CI from TF2-rc0 to official TF2	2019-10-10 15:57:35 +02:00
thomwolf	751e246087	using tf.print in roberta	2019-10-10 15:47:20 +02:00
thomwolf	c9e8c51946	fixing SequenceSummary head in TF 2.0	2019-10-10 15:16:05 +02:00
thomwolf	da26bae61b	adding more tests on TF and pytorch serialization - updating configuration for better serialization	2019-10-10 14:30:48 +02:00
thomwolf	bb04edb45b	Add tests that TF 2.0 model can be integrated with other Keras modules	2019-10-10 13:08:24 +02:00
Thomas Wolf	6596e3d566	Merge pull request #1454 from bkkaggle/pytorch-built-in-tensorboard Change tensorboard imports to use built-in tensorboard if available	2019-10-10 11:56:55 +02:00
Thomas Wolf	4bc4601192	Merge pull request #1480 from huggingface/fix_ctrl_tokenizer Fixing CTRL tokenizer - Update error messages - XLM-MLM in run_generation	2019-10-10 11:56:20 +02:00
thomwolf	177a721205	move back to simple space spliting	2019-10-10 11:45:47 +02:00
thomwolf	a5997dd81a	better error messages	2019-10-10 11:31:01 +02:00
thomwolf	43a237f15e	switching to moses tokenizer	2019-10-10 10:11:16 +02:00
LysandreJik	036483fae5	Temporary CTRL tokenizer fix	2019-10-09 16:33:15 -04:00
LysandreJik	9c2e0a4acf	Release: 2.1.0	2019-10-09 12:14:03 -04:00
LysandreJik	7fe98d8c18	Update CTRL documentation	2019-10-09 12:12:36 -04:00
LysandreJik	89f86f9661	CTRL added to the documentation	2019-10-09 12:04:06 -04:00
LysandreJik	e17ea08e24	Pycharm folder added to gitignore	2019-10-09 11:32:21 -04:00
Lysandre Debut	2431fea98a	Merge pull request #1383 from keskarnitish/master Adding CTRL	2019-10-09 11:31:05 -04:00
thomwolf	d9e60f4f0d	Merge branch 'master' into pr/1383	2019-10-09 17:25:08 +02:00
Lysandre Debut	e84470ef81	Merge pull request #1384 from huggingface/encoding-qol Quality of life enhancements in encoding + patch MLM masking	2019-10-09 11:18:24 -04:00
thomwolf	07d055f849	higher tolerance	2019-10-09 17:10:04 +02:00
thomwolf	48b438ff2a	doc and conversion	2019-10-09 17:06:30 +02:00
jinoobaek-qz	69629c4f0f	Improve naming and only do regex when necessary	2019-10-09 08:48:40 -04:00
jinoobaek-qz	bf34a252b8	Golden path	2019-10-09 08:48:40 -04:00
jinoobaek-qz	528d3f327b	Improve readability and make fewer assumptions about checkpoint format	2019-10-09 08:48:40 -04:00
jinoobaek-qz	56301bd9e8	Extract method	2019-10-09 08:48:40 -04:00
jinoobaek-qz	d6c5469712	Delete older checkpoint after saving new checkpoint	2019-10-09 08:48:40 -04:00
jinoobaek-qz	54a31f50fb	Add save_total_limit	2019-10-09 08:48:40 -04:00
thomwolf	c19b8e4ae0	fixing CTRL tests and OpenAI GPT tests	2019-10-09 13:51:05 +02:00
thomwolf	6dce6dda1b	fixing TF 2.0 model - adding more severe test on pt/tf equivalence	2019-10-09 11:57:55 +02:00
thomwolf	c56d921dda	adding TF 2.0 model	2019-10-09 11:07:43 +02:00
thomwolf	1c5079952f	simpler distilbert mask - fix tf tests	2019-10-09 04:26:20 +02:00
Thomas Wolf	58b302caf3	Merge pull request #1398 from dveselov/patch-1 Fixed typo in docs README	2019-10-09 03:52:42 +02:00
Thomas Wolf	439fac723a	Merge pull request #1409 from brian41005/master Evaluation result.txt path changing #1286	2019-10-09 03:14:34 +02:00
thomwolf	23b7138ab4	fix #1378 and #1453	2019-10-09 01:54:44 +02:00
Bilal Khan	5ce8d29abe	Change tensorboard imports to use built-in tensorboard if available	2019-10-08 16:29:43 -05:00
Julien Chaumond	d688af19e5	Update link to swift-coreml-transformers cc @lysandrejik	2019-10-08 16:37:52 -04:00
thomwolf	45dc04f33d	tf model [WIP]	2019-10-08 17:37:17 +02:00
thomwolf	248314772f	fix tokenization	2019-10-08 17:19:28 +02:00
thomwolf	03c2c762a6	update tokenizer	2019-10-08 17:12:03 +02:00
thomwolf	3edfa1d6aa	update model to use past	2019-10-08 17:11:58 +02:00
Rémi Louf	f4d41fe33e	Merge pull request #1448 from huggingface/contributing add contribution guidelines	2019-10-08 16:55:34 +02:00
Rémi Louf	45de313a9e	add bullet point on modifying an existing PR	2019-10-08 11:54:10 +02:00
Rémi Louf	ade05b6cef	add code contribution	2019-10-07 23:20:25 +02:00
Rémi Louf	e9c09052a4	add issues and requests guidelines	2019-10-07 22:30:55 +02:00
LysandreJik	8fcc6507ce	Multilingual	2019-10-07 15:02:42 -04:00
Rémi Louf	6e3e1c959e	Merge pull request #1447 from huggingface/dev-requirements Provide requirements.txt for development dependencies	2019-10-07 18:49:26 +02:00
VictorSanh	7ce83b4931	update weights for distilgpt2	2019-10-07 12:30:27 -04:00
VictorSanh	9f81f1cba8	fix convert pt_to_tf2 for custom weights	2019-10-07 12:30:19 -04:00
Rémi Louf	7afd00a661	freeze dev requirements	2019-10-07 17:58:13 +02:00
thomwolf	bd5363cc83	update CTRL configuration	2019-10-07 15:37:30 +02:00
thomwolf	dc89441167	update CTRL pytorch model	2019-10-07 15:37:25 +02:00
thomwolf	320b7a7e01	fix #1416	2019-10-07 14:26:59 +02:00
Thomas Wolf	1615360c71	Merge pull request #1438 from SeanBE/master fix pytorch-transformers migration description in README	2019-10-07 05:02:23 -04:00
seanBE	6dc6c716c5	fix pytorch-transformers migration description in README	2019-10-07 09:59:54 +01:00
Christopher Goh	904158ac4d	Rephrase forward method to reduce ambiguity	2019-10-06 23:40:52 -04:00
Christopher Goh	0f65d8cbbe	Fix some typos in README	2019-10-06 23:40:52 -04:00
LysandreJik	f3e0218fbb	Correct device assignment in run_generation	2019-10-05 21:05:16 -04:00
thomwolf	78ef1a9930	fixes	2019-10-04 17:59:44 -04:00
thomwolf	6c1d0bc066	update encode_plus - add truncation strategies	2019-10-04 17:38:38 -04:00
VictorSanh	0820bb0555	unnecessary carriage return	2019-10-04 17:23:15 -04:00
VictorSanh	f5891c3821	run_squad --> run_squad_w_distillation	2019-10-04 17:23:15 -04:00
VictorSanh	764a7923ec	add distillation+finetuning option in run_squad	2019-10-04 17:23:15 -04:00
Lysandre Debut	bb464289ce	New model addition issue template	2019-10-04 16:41:26 -04:00
thomwolf	92c0f2fb90	Merge remote-tracking branch 'origin/julien_multiple-choice' into encoding-qol	2019-10-04 15:48:06 -04:00
Julien Chaumond	9e136ff57c	Honor args.overwrite_cache (h/t @erenup)	2019-10-04 15:00:56 -04:00
LysandreJik	7bddb45a6f	Decode documentation	2019-10-04 14:27:38 -04:00
keskarnitish	dbed1c5d94	Adding CTRL (squashed commit): adding conversion script | adding first draft of modeling & tokenization | adding placeholder for test files | bunch of changes | registering the tokenizer/model/etc | tests | change link; something is very VERY wrong here | weird end-of-word thingy going on | i think the tokenization works now ; wrote the unit tests | overall structure works;load w next | the monster is alive! | works after some cleanup as well | adding emacs autosave to gitignore | currently only supporting the 48 layer one; seems to infer fine on my macbook | cleanup | fixing some documentation | fixing some documentation | tests passing? | now works on CUDA also | adding greedy? | adding greedy sampling | works well	2019-10-03 22:29:03 -07:00
Thomas Wolf	b3cfd97946	Merge pull request #1373 from TimYagan/fix-css Fixed critical css font-family issues	2019-10-03 19:04:02 -04:00
Lysandre Debut	81a1e12469	Merge pull request #1313 from enzoampil/master Add option to use a 'stop token'	2019-10-03 22:43:57 +00:00
Lysandre Debut	d3f24dfad7	Merge branch 'master' into master	2019-10-03 22:43:09 +00:00
LysandreJik	ecc4f1bdfa	XLM use_lang_embedding flag in run_generation	2019-10-03 17:42:16 -04:00
LysandreJik	c2c2ca0fdb	Added XLM to run_generation, with prompt language selection.	2019-10-03 17:18:48 -04:00
Thomas Wolf	1569610f2d	Merge pull request #1296 from danai-antoniou/add-duplicate-tokens-error Added ValueError for duplicates in list of added tokens	2019-10-03 17:06:17 -04:00
drc10723	e1b2949ae6	DistilBert Documentation Code Example fixes	2019-10-03 15:51:33 -04:00
Simon Layton	899883644f	Fix test fails and warnings: Attention output was in bnij ordering instead of ijbn which everything else will expect. This was an oversight on my part, and keeps the attention inputs/outputs identical to the original code. Also moved back from tensor slicing to index_select in rel_shift_bnij to make the tracer happy.	2019-10-03 12:05:15 -04:00
VictorSanh	e2ae9c0b73	fix links in doc index	2019-10-03 11:42:21 -04:00
LysandreJik	aebd83230f	Update naming + remove f string in run_lm_finetuning example	2019-10-03 11:31:36 -04:00
LysandreJik	651bfb7ad5	always_truncate by default	2019-10-03 11:31:36 -04:00
LysandreJik	5ed50a93fb	LM finetuning won't mask special tokens anymore	2019-10-03 11:31:36 -04:00
LysandreJik	cc412edd42	Supports already existing special tokens	2019-10-03 11:31:36 -04:00
LysandreJik	2f259b228e	Sequence IDS	2019-10-03 11:31:36 -04:00
LysandreJik	7c789c337d	Always truncate argument in the encode method	2019-10-03 11:31:36 -04:00
Brian Ma	7af0777910	Update run_glue.py: add DistilBert model shortcut into ALL_MODELS	2019-10-03 15:31:11 +00:00
VictorSanh	c1689ac301	fix name	2019-10-03 10:56:39 -04:00
VictorSanh	4a790c40b1	update doc for distil*	2019-10-03 10:54:02 -04:00
VictorSanh	6be46a6e64	update links to new weights	2019-10-03 10:27:11 -04:00
VictorSanh	5f07d8f11a	prepare release	2019-10-03 10:27:11 -04:00
VictorSanh	35071007cb	incoming release 🔥 update links to arxiv preprint	2019-10-03 10:27:11 -04:00
VictorSanh	f1f23ad171	fix bug in convert_pt_chkpt_to_tf2	2019-10-03 10:27:11 -04:00
VictorSanh	2a91f6071f	update README - TODO update link to paper	2019-10-03 10:27:11 -04:00
VictorSanh	c51e533a5f	update train.py	2019-10-03 10:27:11 -04:00
VictorSanh	a76c3f9cb0	update requirements	2019-10-03 10:27:11 -04:00
VictorSanh	bb9c5ead54	update distiller	2019-10-03 10:27:11 -04:00
VictorSanh	a12ab0a8db	update binarized_data	2019-10-03 10:27:11 -04:00
VictorSanh	4d6dfbd376	update extract	2019-10-03 10:27:11 -04:00
VictorSanh	23edebc079	update extract_distilbert	2019-10-03 10:27:11 -04:00
VictorSanh	cbfcfce205	update token_counts	2019-10-03 10:27:11 -04:00
VictorSanh	19e4ebbe3f	grouped_batch_sampler	2019-10-03 10:27:11 -04:00
VictorSanh	594202a934	lm_seqs_dataset	2019-10-03 10:27:11 -04:00
VictorSanh	38084507c4	add distillation_configs	2019-10-03 10:27:11 -04:00
Simon Layton	9ffda216ec	Fix missed head transpose	2019-10-03 09:23:16 -04:00
Brian Ma	2195c0d5f9	Evaluation result.txt path changing #1286	2019-10-03 12:49:12 +08:00
LysandreJik	ebb32261b1	fix #1401	2019-10-02 17:52:56 -04:00
Simon Layton	d51b589404	Re-order attention head outputs for better perf: significant performance boost over the original orderings. On an already somewhat optimised branch this gave me > 2x end-to-end throughput on a squad xlnet fine-tuning task (batch 8, seq-length 612, fp16)	2019-10-02 12:18:21 -04:00
Santiago Castro	63ed224b7c	initialy -> initially	2019-10-02 15:04:18 +00:00
danai-antoniou	a95158518d	Moved duplicate token check	2019-10-02 07:44:15 +01:00
danai-antoniou	d73957899a	Merge branch 'master' of https://github.com/danai-antoniou/pytorch-transformers into add-duplicate-tokens-error	2019-10-02 07:38:50 +01:00
Dima Veselov	cd69bc9c87	Fixed typo in docs README	2019-10-02 03:21:55 +03:00
thomwolf	391db836ab	fix #1260 - remove special logic for decoding pairs of sequence	2019-10-01 19:09:13 -04:00
Thomas Wolf	963529e29b	Merge pull request #1288 from echan00/master Typo with LM Fine tuning script	2019-10-01 18:46:07 -04:00
thomwolf	f7978f70ec	use format instead of f-strings	2019-10-01 18:45:38 -04:00
Thomas Wolf	1e4a191366	Merge pull request #1284 from slayton58/pooler_end_logits_fp16_fix Fix fp16 masking in PoolerEndLogits	2019-10-01 18:40:22 -04:00
thomwolf	c50783e388	Merge branch 'pooler_end_logits_fp16_fix' of https://github.com/slayton58/pytorch-transformers into pr/1284	2019-10-01 18:17:48 -04:00
DenysNahurnyi	6971556ab8	Fix syntax typo in README.md	2019-10-01 14:59:31 -04:00
Julien Chaumond	b350662955	overflowing_tokens do not really make sense here, let's just return a number. Co-Authored-By: Lysandre Debut <lysandre.debut@reseau.eseo.fr>	2019-09-30 16:37:09 -04:00
Julien Chaumond	f5bcde0b2f	[multiple-choice] Simplify and use tokenizer.encode_plus	2019-09-30 16:04:55 -04:00
Santosh Gupta	5c3b32d44d	Update README.md: lines 183-200, fixed indentation. Line 198, replaced `tokenizer_class` with `BertTokenizer`, since `tokenizer_class` is not defined in the loop it belongs to.	2019-09-30 18:48:01 +00:00
VictorSanh	2dc8cb8734	fix unknown imports (*ForMultipleChoice) in run_multiple_choice	2019-09-29 19:51:01 -04:00
Tim Yagan	0a4ed7192e	Fixed critical css font-family issues to ensure compatibility with multiple web browsers	2019-09-29 13:51:01 +02:00
Thomas Wolf	ae50ad91ea	Merge pull request #1362 from FeiWang96/doc fix link	2019-09-28 10:26:42 +02:00
wangfei	60f791631b	Fix link in readme	2019-09-28 16:20:17 +08:00
Ikuya Yamada	a6a6d9e638	fix padding_idx of RoBERTa model	2019-09-27 19:03:55 -04:00
Julien Chaumond	d8b641c839	6 -> 8 models	2019-09-27 17:22:01 -04:00
Julien Chaumond	c6acbdd50a	Close #1304	2019-09-27 17:02:53 -04:00
Thomas Wolf	df7cd9e4e4	Merge pull request #1353 from wendingp/patch-1 Fix some typos	2019-09-27 23:00:34 +02:00
Thomas Wolf	6a17b3c51b	Merge pull request #1355 from agrinh/master Fix tensorflow_dataset glue support	2019-09-27 22:59:54 +02:00
Thomas Wolf	04e9a6f512	Merge pull request #1359 from dennymarcels/patch-1 Update run_lm_finetuning.py	2019-09-27 22:58:19 +02:00
Denny	9478590630	Update run_lm_finetuning.py: the previous method, just as phrased, did not exist in the class.	2019-09-27 15:18:42 -03:00
Agrin Hilmkil	795b3e76ff	Add docstring for processor method	2019-09-27 17:32:28 +02:00
Agrin Hilmkil	e31a472801	Fix tensorflow_dataset glue support: `glue_convert_examples_to_features` assumed that tensorflow_dataset examples contain the features `'sentence1'` and `'sentence2'`. This commit encapsulates the choice of features in the glue processor and uses that to parse examples.	2019-09-27 17:16:02 +02:00
pj	4f2b6579bf	Fix some typos	2019-09-27 22:55:43 +08:00
Thomas Wolf	ca559826c4	Merge pull request #1349 from ogabrielluiz/master Just some typos	2019-09-27 13:08:00 +02:00
Gabriel Luiz Freitas Almeida	d2de5b9d8c	Just some typos	2019-09-27 07:08:36 -03:00
Thomas Wolf	d83d295763	Merge pull request #1337 from mgrankin/fastdataset faster dataset building	2019-09-27 10:35:12 +02:00
Thomas Wolf	f6de000305	Merge pull request #1346 from BramVanroy/documentation Add small note about the output of hidden states (closes #1332)	2019-09-27 10:30:07 +02:00
BramVanroy	15749bfc10	Add small note about the output of hidden states	2019-09-27 10:01:36 +02:00
thomwolf	da2e47ad15	clean up a little run_tf_glue	2019-09-27 09:41:15 +02:00
thomwolf	528c288fa9	clean up run_tf_glue	2019-09-27 09:40:29 +02:00
VictorSanh	702f589848	fix input in run_glue for distilbert	2019-09-27 00:20:14 -04:00
Julien Chaumond	22d2fded2c	[docs] Fix doc auto-deploy Co-Authored-By: Lysandre Debut <lysandre.debut@reseau.eseo.fr>	2019-09-26 18:22:45 -04:00
Julien Chaumond	fc9faa8a47	[docs] Doc tweaks Co-Authored-By: Lysandre Debut <lysandre.debut@reseau.eseo.fr>	2019-09-26 18:19:51 -04:00
LysandreJik	ecfddc6034	Update RoBERTa and GPT-2 Tokenizer documentation (fix #1343 )	2019-09-26 16:49:03 -04:00
LysandreJik	93f0c5fc72	Repository link in the documentation	2019-09-26 11:45:00 -04:00
thomwolf	6c3b131516	typo in readme/doc	2019-09-26 16:23:28 +02:00
thomwolf	f83b35b77d	Merge branch 'master' of https://github.com/huggingface/pytorch-transformers	2019-09-26 16:14:23 +02:00
thomwolf	4e63c90720	update installation instructions in readme	2019-09-26 16:14:21 +02:00
LysandreJik	7e957237e4	[Doc] XLM + Torch in documentation	2019-09-26 10:08:56 -04:00
LysandreJik	302a4813a5	Doc building requirements [TF2]	2019-09-26 09:57:30 -04:00
mgrankin	f71a4577b8	faster dataset building	2019-09-26 16:53:13 +03:00
LysandreJik	a3e0dbba95	Doc building requirements [TF]	2019-09-26 09:51:14 -04:00
Lysandre Debut	0f92f76ca3	CircleCI reference in README	2019-09-26 08:59:52 -04:00
LysandreJik	4094958df2	Doc building requirements	2019-09-26 08:50:55 -04:00
LysandreJik	7d8b395afa	Doc building requirements	2019-09-26 08:49:31 -04:00
LysandreJik	927904bc91	[doc] pytorch_transformers -> transformers	2019-09-26 08:47:15 -04:00
LysandreJik	294edfd83d	Release version in documentation	2019-09-26 08:16:12 -04:00
LysandreJik	de5e4864cb	Documentation	2019-09-26 08:04:54 -04:00
thomwolf	e4e35296fb	update setup.py metadata	2019-09-26 13:52:24 +02:00
Lorenzo Ampil	4b543c3007	Add option to use a 'stop token' which will be used to truncate the output text to everything till right before the 'stop token'	2019-09-22 21:38:38 +08:00
danai-antoniou	2e6797cc7d	Added ValueError for duplicate added tokens	2019-09-19 15:40:42 +01:00
Erik Chan	f0340eccf9	Typo Typo	2019-09-18 13:42:11 -07:00
Simon Layton	ec94f4e0f8	Fix fp16 masking in PoolerEndLogits: necessary to run xlnet (at least in squad) with `--fp16 --fp16_opt_level="O2"`, otherwise loss is immediately `NaN` and fine-tuning cannot proceed.	2019-09-18 09:30:58 -04:00